Stop Scraping - Analyzing Sports Data Using Stattleship

Problem

You love sports. You love data. If you’ve ever gone on an epic journey in search of sports data, you’ve probably resorted to scraping data from sites like ESPN or Baseball Reference and have spent countless hours writing Python code to use the powerful web scraping library BeautifulSoup. Maybe you even have a nice Python client that scrapes stats.nba.com or you wrote an R package to scrape Baseball Reference.

However, we all know that websites get redesigned, formats change and sometimes access to specific stats is revoked (e.g. the NBA did this with player movement tracking, attributing it to ‘technical difficulties’). Suddenly, you are scrambling to update your scraping code to account for a few new divs or other renamed elements on the web page.

Actions

This is where the Stattleship API comes to the rescue. We’ve built an easy-to-access set of sports data, stats and accomplishments for multiple sports (and expanding). We’ve partnered with Gracenote to provide all of our NFL, NBA, NHL and MLB game data. Our service is designed for creative fans who want to use Stattleship data to build sports apps that scale.

We’ve cleaned, structured, and categorized all of the box score, player stats and game log data for you. We’ve even quantified specific performances and feats so that you can quickly identify whether a stat is a commonplace occurrence or a record-breaking achievement.

In this tutorial, we will show you how to use our R wrapper to access live and up-to-date sports data via the Stattleship API. There is also a Ruby gem for all of you Rubyists: https://github.com/stattleship/stattleship-ruby

Example

There are a few main endpoints that we will focus on. They are as follows:

  • games - contains game information such as date, attendance, scoreline, etc.
  • players - contains player information such as name, position, draft round, school, salary, weight, birthday, etc.
  • teams - contains team information such as name, division, colors, hashtags, etc.
  • game_logs - contains player-level game log information such as Kevin Durant’s assists from a specific game, total minutes played, etc.
  • team_game_logs - contains team-level game log information such as total points, rebounds and turnovers by the Boston Celtics in a specific game

Here’s a basic Entity-Relationship Diagram outlining how these objects relate to one another:

To get started, sign up for a free API token at www.stattleship.com. Now load up RStudio and start following along! The first few lines of this R script will install the R package from GitHub to your machine:

(If you don’t have devtools installed you will have to install that first.)

install.packages("devtools")
devtools::install_github("stattleship/stattleship-r")

## Load the stattleshipR package
library(stattleshipR)

Next you need to set your API token in the R environment:

## Get your free token from www.stattleship.com
set_token("your-API-token-goes-here")
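If you plan to share your script, you may prefer not to hard-code the token. One option is to read it from an environment variable. This is only a sketch: the variable name STATTLESHIP_TOKEN and the helper read_token are made up for illustration and are not part of the stattleshipR package.

```r
## Hypothetical helper: read the API token from an environment variable
## (STATTLESHIP_TOKEN is an assumed name, not something the package defines)
read_token <- function(var = "STATTLESHIP_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set")
  }
  token
}

## Usage, once the variable is set (e.g. in ~/.Renviron):
## set_token(read_token())
```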

Now we need to specify the sport, league and endpoint we are interested in. In this case we will fetch all MLB regular season game logs for the Red Sox to date. This is as easy as setting three parameters: sport, league and ep for endpoint.

sport <- 'baseball'
league <- 'mlb'
ep <- 'game_logs'

The last parameter, q_body, is where we can set more granular options such as requesting a specific team, stat, player or season.

q_body <- list(team_id='mlb-bos', status='ended', interval_type='regularseason')

Now that all of our parameters are set, we can use ss_get_result to send our request to the Stattleship API. Notice the walk=TRUE option. This ensures that the request will walk through all pages of results and return everything. Results are returned 40 rows per page.

gls <- ss_get_result(sport=sport, league=league, ep=ep, query=q_body, walk=TRUE)  

We now have a list in R that has 5 elements (there were 5 pages of results returned).

> length(gls)
[1] 5
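To make the idea of walking through pages concrete, here is a toy sketch of the pattern. It does not call the Stattleship API and is not how the package implements walk=TRUE; fetch_page and walk_pages are hypothetical stand-ins that serve 90 fake rows, 40 per page.

```r
## Toy stand-in for a paged API: 90 rows served 40 per page
## (fetch_page is hypothetical and only mimics the idea of paging)
all_rows <- data.frame(id = 1:90)
per_page <- 40

fetch_page <- function(page) {
  start <- (page - 1) * per_page + 1
  if (start > nrow(all_rows)) return(NULL)   # past the last page
  end <- min(start + per_page - 1, nrow(all_rows))
  all_rows[start:end, , drop = FALSE]
}

## Keep requesting pages until one comes back empty
walk_pages <- function() {
  pages <- list()
  page <- 1
  repeat {
    result <- fetch_page(page)
    if (is.null(result)) break
    pages[[page]] <- result
    page <- page + 1
  }
  pages
}

pages <- walk_pages()
length(pages)          # 3 pages
sapply(pages, nrow)    # 40, 40 and 10 rows
```

The result is a list with one element per page, which is the same shape as gls above.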

We want to combine all of the pages of results into one data.frame:

game_logs <- do.call('rbind', lapply(gls, function(x) x$game_logs))
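If the do.call / lapply idiom is new to you, here is the same pattern on two toy pages. The data is invented; only the list shape (each page holding a game_logs data.frame) mirrors what we saw above.

```r
## Two toy "pages" shaped like the elements of gls
toy_gls <- list(
  list(game_logs = data.frame(player_id = c("a", "b"), runs = c(1, 0))),
  list(game_logs = data.frame(player_id = c("c"), runs = c(2)))
)

## Pull the game_logs data.frame out of each page, then stack them
toy_logs <- do.call('rbind', lapply(toy_gls, function(x) x$game_logs))

nrow(toy_logs)   # 3 rows: 2 from the first page, 1 from the second
```

Note that rbind expects every page to have the same columns. If pages ever differ, dplyr::bind_rows is a common alternative that fills missing columns with NA.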

Let’s check out all of the variables we now have access to for each game log:

colnames(game_logs)        

Wow, over 80 variables for each player including strikeouts, walks, doubles, runs, and more.

I want to add player information to this data set, though, so I have more than just player_ids. Let’s retrieve all Boston Red Sox players by changing ep to players and passing team_id='mlb-bos' in the options list.

sport <- 'baseball'  
league <- 'mlb'  
ep <- 'players'  
q_body <- list(team_id='mlb-bos')

pls <- ss_get_result(sport=sport, league=league, ep=ep, query=q_body, walk=TRUE)  
players <- do.call('rbind', lapply(pls, function(x) x$players))

In order to merge the two data.frames we can simply rename the id column to player_id so we can use merge in R like this:

## Rename by column name so this still works if id is not the first column
colnames(players)[colnames(players) == 'id'] <- 'player_id'
game_logs <- merge(players, game_logs, by='player_id')
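One thing to keep in mind is that merge performs an inner join by default, so players without any game logs drop out of the result. The toy data below (invented names and ids) shows the difference between the default and all.x=TRUE.

```r
toy_players <- data.frame(id = c("p1", "p2", "p3"),
                          name = c("Ann", "Bob", "Cy"),
                          stringsAsFactors = FALSE)
toy_logs <- data.frame(player_id = c("p1", "p1", "p2"),
                       runs = c(1, 2, 0),
                       stringsAsFactors = FALSE)

## Rename id by name rather than by position
colnames(toy_players)[colnames(toy_players) == 'id'] <- 'player_id'

## Inner join (the default): p3 has no game logs, so it drops out
inner <- merge(toy_players, toy_logs, by = 'player_id')
nrow(inner)    # 3

## all.x = TRUE keeps p3, with NA for runs
left <- merge(toy_players, toy_logs, by = 'player_id', all.x = TRUE)
nrow(left)     # 4
```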

Now let’s do some basic player calculations using dplyr.

install.packages('dplyr')
library(dplyr)

## Only include game_logs where a player actually played the game, calculate mean batting average,
## total runs, total bases and pull in salary info
stats <-
  game_logs %>%
  filter(game_played == TRUE) %>%
  group_by(name) %>%
  summarise(totalRuns = sum(runs), meanBA = mean(batting_average), totalBases=sum(total_bases), salary=max(salary))
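One caveat on meanBA: if batting_average is a per-game figure, averaging it weights every game equally, which is generally not the same as a season batting average (total hits divided by total at-bats). The toy numbers below illustrate the gap. The column names hits and at_bats are assumed for illustration; check colnames(game_logs) for the real ones.

```r
## Toy per-game lines for one player (hits and at_bats are assumed names)
toy <- data.frame(hits = c(1, 3), at_bats = c(1, 5))
toy$batting_average <- toy$hits / toy$at_bats   # 1.0 and 0.6

## Mean of the per-game averages: (1.0 + 0.6) / 2 = 0.8
mean_of_games <- mean(toy$batting_average)

## Season-style average: 4 hits / 6 at-bats, roughly 0.667
season_average <- sum(toy$hits) / sum(toy$at_bats)
```

For a quick exploratory plot the simple mean is fine, but it is worth knowing which of the two you are looking at.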

Good! Now let’s plot it all using ggplot2.

library(ggplot2)
ggplot(stats, aes(x=totalRuns, y=meanBA, size=totalBases, label=name, color=salary)) + geom_text()

This plot lets us visualize four different variables at once. The x and y axes display total runs and mean batting average, the size of each label indicates how many total bases the player has had, and the color indicates salary. Xander Bogaerts and Travis Shaw are already looking like quite a bargain at this point in the season.

Let us know if you build anything cool with this awesome sports data API! We also welcome contributions to our R wrapper via pull requests on GitHub. We also have a public Slack channel where you can join us to talk sports and data, get help with R or the Stattleship API, and provide feedback.

References / Other Examples