Visualizing State Level Data With R and Statebins

Problem

For a geographically distributed organization, understanding how interactions are distributed across that geography is important. The challenge is presenting that information in a simple, effective way.

Actions

We’ll walk through a simple example of how to visualize state-level data from Google Analytics (any other analytics solution would be similar).

Example

Within the Google Analytics interface, you can find a visualization that looks like this:

[Figure: Google Analytics geographic map overlay]

It’s handy for understanding traffic and other engagement patterns from a geographical perspective. We can simplify it a bit and add some depth using statebins.

The statebin concept comes from the Washington Post:

[Figure: Washington Post statebin graphic]

We’ll take the same data from this graph and make it a statebin!

Step one is preparing the data. All you need is a state column and a value column (in this case, the percentage of new visitors). The data should look like this:

Region       newsessions   Other Data
New York     .23           1
California   .77           2
Florida      .65           3


From here we’ll use R. The first thing we need to do is translate the state names into state abbreviations; for example, “New York” becomes “NY”, and so on. The openintro package has a handy function for this:

library(openintro)

# Read the exported data and add a column of two-letter state abbreviations
data <- read.csv("mydat.csv", header = TRUE)
data$states <- as.character(state2abbr(data$Region))

From here, we just need to adjust our options and make a plot:

library(statebins)

gg <- statebins_continuous(data, "states", "newsessions",
                           legend_position = "bottom", legend_title = "%",
                           font_size = 5, brewer_pal = "Blues",
                           text_color = "black", breaks = 4,
                           plot_title = "% New Visitors Feb 2014 - Feb 2015",
                           title_position = "top")

# Print the ggplot2 object to render the statebin
gg

[Figure: statebin of % new visitors by state]

References:

Statebins

Washington Post

Dataset


Real-Time Reporting Adobe Analytics API Tutorial

Information

No prior API or programming knowledge is necessary to complete this tutorial.

Problem

How do you create a standalone Adobe Analytics real-time dashboard using the API?

Adobe has a nice real-time dashboard built directly into Reports & Analytics (SiteCatalyst). It is easy to configure and get up and running quickly. But how do you create your own real-time dashboard outside of Adobe Analytics using the API?

Actions

Use the lessons from an Adobe Summit lab session to walk through how to create your own Adobe Analytics real-time dashboard. The dashboard uses the Adobe Analytics real-time API and D3.js for data visualization. When you are done with this tutorial, your dashboard will show your company’s data and look like the sample below.

[Screenshot: Adobe Analytics real-time dashboard]

Explanation (resolution)

Start by downloading the analytics-realtime-dashboard-example from GitHub. Find the download and unzip it. I unzipped the folder to my desktop for this example.

I skipped Lessons 1-5 to keep the tutorial short. When you have time, I highly recommend going through these lessons.

Lesson 6

Create a Snazzy Real-Time Adobe Analytics Dashboard Now

1) Update the config.js file in the analytics-realtime-dashboard-example-master folder that you downloaded above. The path to the file on my computer is: C:\Users\ryanpraski\Desktop\analytics-realtime-dashboard-example-master\js\config.js

Below is a sample config.js file. Make sure to update it with your credentials and save it on your computer.

var config = {
    username:     "User Name:Company Name",
    secret:       "abcdefghijklmnopgrstuvwxyz",
    reportSuite:  "rsid",
    endpoint:     "api.omniture.com"
};

2) Find the Lesson 6 folder on your computer. The path to the folder on my computer is: C:\Users\ryanpraski\Desktop\analytics-realtime-dashboard-example-master\lessons\lesson_6

3) Open the README file in the lesson_6 folder in your text editor or on GitHub.

4) Open the lesson_6.html file in your web browser and then open it a second time in a text editor. The path to the file on my computer is: C:\Users\ryanpraski\Desktop\analytics-realtime-dashboard-example-master\lessons\lesson_6\lesson_6.html

5) When you load the lesson_6.html file in your browser, you should see an animated count of total page views for the past 15 minutes in the top left, a continuously updating trend line of page views by minute for the past 15 minutes, and a table of the top 10 pages by page views for the last 15 minutes.

6) Follow the instructions in the lesson_6 README file to add the donut chart to your dashboard. If you have trouble getting the donut chart included in your lesson_6.html file, you can open a complete version of the dashboard with the Lesson 6 code updates already done for you. The path to this complete “final dashboard” file on my computer is: C:\Users\ryanpraski\Desktop\analytics-realtime-dashboard-example-master\index.html

7) Your real-time dashboard should look like the screen shot below and will dynamically update with real-time Adobe Analytics data from the reporting API.

[Screenshot: completed Adobe Analytics real-time dashboard]

8) If you’d like to create a different report that uses the same visualizations, change the query parameters in the lesson_6.html or index.html file within the var params block shown in the code snippet below:

    var method = 'Report.Run';
    // Edit the real-time query here
    var params = {
        "reportDescription": {
            "source": "realtime",
            "reportSuiteID": config.reportSuite,
            "metrics": [
                { "id": "pageviews" }
            ],
            "elements": [
                { "id": "page" }
            ],
            "dateFrom": "-15 minutes",
            "dateGranularity": "minute:1"
        }
    };
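To experiment with this query outside the browser, below is a minimal Python sketch that sends the same Report.Run request directly. This is not part of the lab: it assumes the requests library, the placeholder credentials from config.js, the 1.4 REST path (/admin/1.4/rest/), and the commonly documented WSSE header scheme for this API, so verify all of these against the lab’s JavaScript helpers before relying on it.

import base64
import hashlib
import os
from datetime import datetime, timezone

import requests  # third-party: pip install requests

# Same placeholder credentials as config.js
USERNAME = "User Name:Company Name"
SECRET = "abcdefghijklmnopgrstuvwxyz"
ENDPOINT = "https://api.omniture.com/admin/1.4/rest/"  # assumed 1.4 REST path

def wsse_header(username, secret):
    # WSSE UsernameToken: PasswordDigest = base64(sha1(nonce + created + secret))
    nonce = os.urandom(16).hex()
    created = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    digest = base64.b64encode(
        hashlib.sha1((nonce + created + secret).encode()).digest()
    ).decode()
    nonce_b64 = base64.b64encode(nonce.encode()).decode()
    return (
        f'UsernameToken Username="{username}", PasswordDigest="{digest}", '
        f'Nonce="{nonce_b64}", Created="{created}"'
    )

# The same reportDescription the dashboard sends
params = {
    "reportDescription": {
        "source": "realtime",
        "reportSuiteID": "rsid",
        "metrics": [{"id": "pageviews"}],
        "elements": [{"id": "page"}],
        "dateFrom": "-15 minutes",
        "dateGranularity": "minute:1",
    }
}

response = requests.post(
    ENDPOINT,
    params={"method": "Report.Run"},
    headers={"X-WSSE": wsse_header(USERNAME, SECRET)},
    json=params,
)
print(response.json())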

If you have any questions, please see the references below or hit me up on Twitter @ryanpraski.

References

Full Post on ryanpraski.com

Adobe Analytics Real-Time Dashboard Example Lab Lessons on GitHub


Identifying Duplicate Sales Data using Python

Problem

More often than not, analysts are given data that is unstructured and dirty. In this example, an analyst has been provided a 300MB Excel Workbook containing sales transactions; however, several of the transactions have been duplicated in the data. Before doing any analysis of the sales data, the duplicates need to be cleaned up.

Actions

Our sample dataset contains every order transaction for 2015. The data is structured so that each item purchased in an order is a unique row in the data. If an order contained three unique product SKUs, that one order would have three rows in the dataset.

We need to identify orders that contain duplicated order line items: those duplicates have a SKU that appears in more than one row of data for a given order_id, as shown below.

[Figure: example of duplicated order line items]

Notes about the data:

  • order_id: A unique id that serves as the key to group line items into a single order
  • order_item_cd: Product SKU
  • order_item_quantity: The number of units of each SKU purchased in an order
  • order_item_cost_price: The unit price of each SKU purchased
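
To make that structure concrete, here is a small hypothetical example built with pandas; the values are made up, and only the column names come from the notes above. Order 1001 contains a duplicated line item for the same SKU:

import pandas as pd

# Hypothetical sample: the first two rows share the same order_id and
# order_item_cd, so the second row is a duplicate line item
toy = pd.DataFrame({
    'order_id':              [1001, 1001, 1001, 1002],
    'order_item_cd':         ['A100', 'A100', 'B200', 'A100'],
    'order_item_quantity':   [2, 2, 1, 3],
    'order_item_cost_price': [9.99, 9.99, 24.50, 9.99],
})
print(toy)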

Explanation (resolution)

Let’s begin by importing the pandas library. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.

In [1]:
# We will use data structures and data analysis tools provided in the pandas library
import pandas as pd

Next, we will use pandas to import all of the sales data from our Excel Workbook into a pandas DataFrame. The pandas DataFrame, along with Series, is one of the most important data structures you will use as a data analyst. Creating a DataFrame is one of the first things we typically do after launching Python.

A DataFrame is tabular in nature and has a “spreadsheet-like” data structure containing an ordered collection of columns and rows.

In [2]:
# Import retail sales data from an Excel Workbook into a data frame
path = '/Documents/analysis/python/examples/2015sales.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx, 'Sheet1')

We have our data in a DataFrame, which resembles the structure shown in the example sales data above. Now we need a way to identify all of the duplicate rows in our dataset. To accomplish this, we are going to add a new column to our DataFrame, named is_duplicated, that will hold a boolean value identifying whether the row is a duplicate. To populate the boolean value for each row in the dataset, we are going to use a very powerful function called duplicated, to which we feed the columns we want to evaluate for duplication. In this case, we will mark a row as a duplicate if we find more than one row with the same order_id and order_item_cd. Note that by default, duplicated marks every occurrence after the first as a duplicate, so the first instance of each line item is kept.

In [3]:
# Let's add a new boolean column to our DataFrame that will identify a duplicated order line item (False=Not a duplicate; True=Duplicate)
df['is_duplicated'] = df.duplicated(['order_id', 'order_item_cd'])

Before we go any further, let's get a sense of how big an issue we have here. We are going to sum all of the rows that were marked as duplicates. This will give us the number of duplicate line items in the dataset.

In [4]:
# We can sum on a boolean column to get a count of duplicate order line items
df['is_duplicated'].sum()

Out[4]:
13372

OK, so this isn’t just a few records here and there: our dataset contains 13,372 line items that have been marked as duplicates. Before we do any cleanup, let’s further understand the impact of this dirty data. Let’s find out how many duplicate units are in our dataset; we would expect at least 13,372 units, since customers often purchase more than one unit of a given SKU.

Like we did in the previous step, let’s sum up the number of items purchased that were marked as being duplicates in our dataset.

In [5]:
# Now let's quantify the impact of duplicate line items in terms of over counted units sold
df[df['is_duplicated']].order_item_quantity.sum()

Out[5]:
63234.0

Let’s see what the impact on revenue is. To get the impacted revenue, we will add a new column to our DataFrame holding the line item total for each row in the dataset. To get this value, we multiply the quantity purchased by the item cost for each row.

Once we have the total for each line item, we can again sum all of the duplicated line items, this time using our revenue value.

In [6]:
# With Python, we can use 'vectorizing' to multiply two columns and place the result in a new column
# This saves us from having to loop over every row in our sales data
df['line_item_total'] = df.order_item_quantity * df.order_item_cost_price

# We can use our boolean sum shortcut again to get the sum of revenue from duplicated order line items
df[df['is_duplicated']].line_item_total.sum()

Out[6]:
4736155.8047346813

Let's clean the duplicate data up.

In [7]:
# We are only interested in the non-duplicated order line items, so let's get those now
df_nodup = df.loc[df['is_duplicated'] == False]

It’s always good practice to check our work as we go. Let’s do a quick sanity check to make sure we have gotten rid of the problematic duplicate data.

In [8]:
# Let's do a quick sanity check to make sure we now have a dataset with no identified duplicate order line items
df_nodup[df_nodup['is_duplicated']].line_item_total.sum()

Out[8]:
0

Now we can export this cleaned up dataset so that we can do additional analysis of our 2015 sales data without having to remove all the duplicate data again.

In [9]:
# Finally let's save our cleaned up data to a csv file
df_nodup.to_csv('2015sales_nodup.csv', encoding='utf-8')
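
As an aside, if you don’t need the is_duplicated column for auditing, the same cleanup can be done in a single step with pandas’ drop_duplicates, which, like duplicated, keeps the first occurrence of each (order_id, order_item_cd) pair by default. A minimal sketch:

# One-step alternative: drop repeated (order_id, order_item_cd) pairs directly
df_nodup_alt = df.drop_duplicates(subset=['order_id', 'order_item_cd'])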

References

GitHub Gist