Apple Health Data - Export, Analyze and Visualize with R

Problem

The Apple Health app dashboard is not useful beyond a daily snapshot of your steps, miles, and flights climbed.

How do you export, analyze, and visualize Apple Health steps data using R?

Actions

  • Export Apple HealthKit data from your iPhone.
  • Run R Script for analysis and visualizations.

Apple Health Steps by Day of Week R Visualization

Explanation (resolution)

How to Export Apple Health App Data

1) Launch the Apple Health App on your iPhone. The app icon is a heart.

2) Tap the Health Data icon in the bottom navigation. This opens a list of all your Apple Health data. In the list view, tap “All”, the first item in the list.

Apple HealthKit Export 1

3) Tap the send arrow icon in the top right. This will launch an alert that says “Exporting Health Data Preparing…” The export preparation took 4 minutes for my health data, so be patient.

Apple HealthKit Export 2

4) Once the data is ready to send you’ll see an overlay where you can select how to send the health data export. I chose to send the file via email. The file name is export.zip.

Apple HealthKit Export 3 Apple HealthKit Export 4

How to use R to Analyze and Visualize your Apple Health Data

1) Install the packages used in the R script if you don’t already have them:

> install.packages(c("dplyr","ggplot2","lubridate","XML"))

2) Run the R Script

3) Page through the 4 plots and see the R Console for data tables.
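
The script itself is not reproduced in this post. As a rough sketch of what its first stage might look like, the snippet below parses step records with the XML package and totals them per day. The inline XML is a made-up stand-in for the export.xml file inside export.zip, the record attributes are assumed to match a real export, and the variable names are illustrative rather than taken from the original script.

```r
library(XML)

# Made-up stand-in for export.xml; attribute names are assumed to match a real export.
sample_xml <- '<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          startDate="2015-01-01 08:00:00 -0500" endDate="2015-01-01 08:05:00 -0500" value="120"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          startDate="2015-01-01 12:00:00 -0500" endDate="2015-01-01 12:10:00 -0500" value="80"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          startDate="2015-01-02 09:00:00 -0500" endDate="2015-01-02 09:30:00 -0500" value="300"/>
  <Record type="HKQuantityTypeIdentifierFlightsClimbed" unit="count"
          startDate="2015-01-02 09:00:00 -0500" endDate="2015-01-02 09:30:00 -0500" value="2"/>
</HealthData>'

# For a real export, pass the file path instead: xmlParse("export.xml")
doc <- xmlParse(sample_xml, asText = TRUE)

# Keep only step-count records and turn their attributes into a data frame
nodes <- getNodeSet(doc, "//Record[@type='HKQuantityTypeIdentifierStepCount']")
steps <- as.data.frame(do.call(rbind, lapply(nodes, xmlAttrs)), stringsAsFactors = FALSE)

steps$value <- as.numeric(steps$value)
steps$date  <- as.Date(substr(steps$endDate, 1, 10))  # calendar day as written in the record

# Total steps per day
daily <- aggregate(value ~ date, data = steps, FUN = sum)
print(daily)
```

Taking the date from the first ten characters of endDate keeps the day as recorded on the phone and sidesteps time zone conversion; a fuller script would likely parse the timestamps with lubridate instead.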

A boxplot showing my steps data by month and a bar graph showing my steps data by month are shown below.

Apple Health Steps Boxplot Apple Health Steps Bar Graph

My summary monthly steps statistics for 2015 are shown below. This data is output to the R Console.

| month | mean     | sd      | median  | max   | min  | 25%      | 75%     |
| ----- | -------- | ------- | ------- | ----- | ---- | -------- | ------- |
| chr   | dbl      | dbl     | dbl     | dbl   | dbl  | dbl      | dbl     |
| 01    | 6928.45  | 3499.36 | 5924.0  | 15173 | 2286 | 4007.00  | 8581.0  |
| 02    | 12000.07 | 5727.69 | 11977.5 | 22675 | 4097 | 6853.25  | 15649.5 |
| 03    | 11271.26 | 2579.44 | 11662.0 | 15199 | 6269 | 9723.50  | 12667.0 |
| 04    | 14846.27 | 5825.21 | 14257.0 | 25357 | 4445 | 10925.75 | 19322.0 |
| 05    | 13119.45 | 5139.61 | 12405.0 | 25031 | 2971 | 9829.50  | 16222.0 |
| 06    | 11457.70 | 5083.92 | 9904.5  | 25643 | 4301 | 7424.25  | 14225.0 |
| 07    | 16419.06 | 7369.98 | 14750.0 | 35582 | 4546 | 11243.50 | 20911.5 |
| 08    | 13968.32 | 6855.27 | 12189.0 | 32019 | 2897 | 10469.00 | 14561.0 |
| 09    | 13096.07 | 5272.44 | 12753.0 | 29838 | 5737 | 10155.50 | 15987.0 |
| 10    | 12150.16 | 5163.45 | 11227.0 | 27174 | 3906 | 8952.00  | 15359.5 |
| 11    | 10442.80 | 4405.78 | 9476.5  | 22683 | 3814 | 8233.75  | 12669.5 |
| 12    | 8331.03  | 3933.16 | 8098.0  | 15450 | 1556 | 5192.00  | 11396.5 |
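
A summary table like the one above could be produced with dplyr along these lines. This is a sketch on made-up daily totals, not the original script; the `daily` data frame and its column names are assumptions.

```r
library(dplyr)

# Made-up daily step totals (one row per day)
daily <- data.frame(
  date  = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                    "2015-02-01", "2015-02-02")),
  steps = c(4000, 6000, 8000, 10000, 14000)
)

# One row per month with the same statistics as the table columns
monthly <- daily %>%
  mutate(month = format(date, "%m")) %>%
  group_by(month) %>%
  summarise(
    mean   = mean(steps),
    sd     = sd(steps),
    median = median(steps),
    max    = max(steps),
    min    = min(steps),
    `25%`  = quantile(steps, 0.25),
    `75%`  = quantile(steps, 0.75)
  )
print(monthly)
```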

References

Full Post on ryanpraski.com

If you have any questions, hit me up on Twitter @ryanpraski.


Creating Interactive Dashboards in R

Problem

Once painstakingly collected, data is most valuable when it leads to action. Not all methods of distributing and communicating results are created equal. Dashboards are a safe and effective way to communicate results from data.

Actions

We’ll walk through an example of how to create a dashboard in R using flexdashboard, an R framework for creating dashboards with R and Markdown.

Example

For this example we’ll grab data from Google Analytics and use it to make 3 charts on a dashboard.

  1. A sortable table
  2. A histogram
  3. A time series chart

To get this all to work, we’ll need to wrap our code in a flexdashboard template. You can find out how to do that here. We’ll use the row template to get our charts displaying well on the page and allow for vertical scrolling.

---
title: "My Google Analytics Dashboard"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: scroll
---

For chart 1 we’ll use DT, a data table library, to create a searchable and sortable table for our dashboard. I used data from the in-market segment in Google Analytics. Rendering the table with default styling is easy:

datatable(my_data_table)
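
For readers without Google Analytics data at hand, here is a small self-contained sketch of the same call. The data frame contents are made up, and the two arguments shown (`rownames` and the `pageLength` option) are just examples of how the default styling can be adjusted.

```r
library(DT)

# Made-up stand-in for my_data_table
my_data_table <- data.frame(
  Segment  = c("Travel", "Technophiles", "Sports Fans"),
  Sessions = c(1200, 950, 430)
)

table_widget <- datatable(
  my_data_table,
  rownames = FALSE,                 # hide the row-number column
  options  = list(pageLength = 10)  # rows shown per page
)
table_widget
```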

Chart 2 will be our histogram. The code for creating a histogram in Highcharter is very simple as well:

hchart(dataset$Sessions, color = "#B71C1C", name = "Sessions") %>% 
  hc_title(text = "Sessions May 2015 - May 2016")

This should plot a histogram of traffic to my website from May 2015 through May 2016. This chart is also zoomable, which is a nice feature to get for free.

Chart 3 is a time series chart, again built with Highcharter:

highchart() %>%
  hc_add_series_times_values(as.Date(dataset$Day.Index) , dataset$Sessions, name = "Sessions")

This gives us visits by time over the same range.
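
Depending on the installed Highcharter version, `hc_add_series_times_values` may no longer be available. A sketch of an alternative, assuming a recent release, is to pass the data frame to `hchart()` with an `hcaes()` mapping. The data below is made up to keep the example self-contained.

```r
library(highcharter)

# Made-up stand-in for the Google Analytics data set used above
dataset <- data.frame(
  Day.Index = as.Date("2015-05-01") + 0:29,
  Sessions  = seq(100, 390, by = 10)
)

# A Date column mapped to x gives a datetime axis
ts_chart <- hchart(dataset, "line", hcaes(x = Day.Index, y = Sessions), name = "Sessions") %>%
  hc_title(text = "Sessions per day (made-up data)")
ts_chart
```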

From here you can click “Knit” in RStudio to compile this R Markdown document into HTML. You will have a working, interactive and responsive dashboard. I’ve published a working version to RPubs for reference here.


Stop Scraping - Analyzing Sports Data Using Stattleship

Problem

You love sports. You love data. If you’ve ever gone on an epic journey in search of sports data, you’ve probably resorted to scraping data from sites like ESPN or Baseball Reference and have spent countless hours writing Python code to use the powerful web scraping library BeautifulSoup. Maybe you even have a nice Python client that scrapes stats.nba.com or you wrote an R package to scrape Baseball Reference.

However, we all know that websites get redesigned, formats change and sometimes access to specific stats is revoked (e.g. the NBA did this with player movement tracking but attributes it to ‘technical difficulties’). Suddenly, you are scrambling to update your scraping code in order to account for a few new divs or other elements on the web page that were renamed.

Actions

This is where the Stattleship API comes to the rescue. We’ve built an easy-to-access set of sports data, stats and accomplishments for multiple sports (and expanding). We’ve partnered with Gracenote to provide all of our NFL, NBA, NHL and MLB game data. Our service is designed for creative fans who want to use Stattleship data to build sports apps that scale.

We’ve cleaned, structured, and categorized all of the box score, player stats and game log data for you. We’ve even quantified specific performances and feats so that you can quickly identify whether a stat is a common-place occurrence or a record-breaking achievement.

In this tutorial, we will show you how to use our R wrapper to access live and up-to-date sports data via the Stattleship API. There is also a Ruby gem for all of you Rubyists: https://github.com/stattleship/stattleship-ruby

Example

There are a few main endpoints that we will focus on. They are as follows:

  • games - contains game information such as date, attendance, scoreline, etc.
  • players - contains player information such as name, position, draft round, school, salary, weight, birthday, etc.
  • teams - contains team information such as name, division, colors, hashtags, etc.
  • game_logs - contains player-level game log information such as Kevin Durant’s assists from a specific game, total minutes played, etc.
  • team_game_logs - contains team-level game log information such as total points, rebounds and turnovers by the Boston Celtics in a specific game

Here’s a basic Entity-Relationship Diagram outlining how these objects relate to one another:

In order to get started, sign up for a free API token at www.stattleship.com. Now load up RStudio and start following along! The first few lines of this R script will install the R package from GitHub to your machine:

(If you don’t have devtools installed you will have to install that first.)

install.packages("devtools")
devtools::install_github("stattleship/stattleship-r")

## Load the stattleshipR package
library(stattleshipR)

Next you need to set your API token in the R environment:

## Get your free token from www.stattleship.com
set_token("your-API-token-goes-here")

Now we need to specify the sport, league and endpoint we are interested in. In this case we will fetch all MLB regular season game logs for the Red Sox to date. This is as easy as setting three parameters: sport, league and ep for endpoint.

sport <- 'baseball'
league <- 'mlb'
ep <- 'game_logs'

This last parameter, called q_body, is where we can set more granular options such as requesting a specific team, stat, player or season.

q_body <- list(team_id='mlb-bos', status='ended', interval_type='regularseason')

Now that all of our parameters are set, we can use ss_get_result to send our request to the Stattleship API. Notice the walk=TRUE option. This ensures that the request will walk through all pages of results and return everything. Results are returned 40 rows per page.

gls <- ss_get_result(sport=sport, league=league, ep=ep, query=q_body, walk=TRUE)  

We now have a list in R that has 5 elements (there were 5 pages of results returned).

> length(gls)
[1] 5

We want to combine all of the pages of results into one data.frame:

game_logs <- do.call('rbind', lapply(gls, function(x) x$game_logs))
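
To see what this step does without an API token, here is a sketch on a made-up list with the same shape (one element per page, each holding a game_logs data frame). The column names and values are invented for illustration.

```r
# Made-up stand-in for the list returned with walk=TRUE: one element per page
gls <- list(
  list(game_logs = data.frame(player_id = c("a", "b"), runs = c(1, 0))),
  list(game_logs = data.frame(player_id = c("a", "c"), runs = c(2, 3)))
)

# Pull the game_logs data frame out of each page, then stack them row-wise
game_logs <- do.call(rbind, lapply(gls, function(page) page$game_logs))
nrow(game_logs)
```

Note that `rbind` stops with an error if the pages do not share the same columns; `dplyr::bind_rows` is a more forgiving alternative that fills missing columns with NA.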

Let’s check out all of the variables we now have access to:

colnames(game_logs)        

Wow, over 80 variables for each player including strikeouts, walks, doubles, runs, and more.

I want to include player information in this data set so I have more than just player_ids. Let’s retrieve all Boston Red Sox players by changing ep to players and passing team_id='mlb-bos' into the options list.

sport <- 'baseball'  
league <- 'mlb'  
ep <- 'players'  
q_body <- list(team_id='mlb-bos')

pls <- ss_get_result(sport=sport, league=league, ep=ep, query=q_body, walk=TRUE)
players <- do.call('rbind', lapply(pls, function(x) x$players))

In order to merge the two data.frames we can simply rename the id column to player_id so we can use merge in R like this:

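## Rename by column name rather than position, in case id is not the first column
colnames(players)[colnames(players) == 'id'] <- 'player_id'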
game_logs <- merge(players, game_logs, by='player_id')
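
One thing worth knowing about `merge`: by default it performs an inner join, so rows without a match on either side are dropped. The sketch below shows this on made-up data frames; the ids and names are invented.

```r
# Made-up players and game logs; p3 has game logs but no player record
players <- data.frame(id = c("p1", "p2"),
                      name = c("Player One", "Player Two"),
                      stringsAsFactors = FALSE)
game_logs <- data.frame(player_id = c("p1", "p1", "p2", "p3"),
                        runs = c(1, 2, 0, 4),
                        stringsAsFactors = FALSE)

# Rename the key so both data frames share a player_id column
colnames(players)[colnames(players) == "id"] <- "player_id"

# Inner join: the p3 row is dropped because it has no match in players
merged <- merge(players, game_logs, by = "player_id")
print(merged)
```

Passing `all.y = TRUE` would keep the unmatched game log, with NA in the player columns.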

Now let’s do some basic player calculations using dplyr.

install.packages('dplyr')
library(dplyr)

## Only include game_logs where a player actually played the game, calculate mean batting average,
## total runs, total bases and pull in salary info
stats <-
  game_logs %>%
  filter(game_played == TRUE) %>%
  group_by(name) %>%
  summarise(totalRuns = sum(runs), meanBA = mean(batting_average), totalBases=sum(total_bases), salary=max(salary))

Good! Now let’s plot it all using ggplot2.

library(ggplot2)
ggplot(stats, aes(x=totalRuns, y=meanBA, size=totalBases, label=name, color=salary)) + geom_text()

This plot lets us visualize four different variables at once. The x and y axes display total runs and mean batting average, the size of each label indicates how many total bases the player has had, and the color indicates salary. Xander Bogaerts and Travis Shaw are looking like quite a bargain at this point in the season already.

Let us know if you build anything cool with this awesome sports data API! We also welcome contributions to our R wrapper via pull requests on GitHub. We have a public Slack channel as well where you can join us to talk sports, data, get R or Stattleship API help, and provide feedback.

References / Other Examples