DailyR rvest web scraping
Dear Friends, extracting data from the web is an important skill to have in data science. R provides many packages to ‘scrape’ data. In this post, I use the rvest package to scrape data on the Premier League’s top scorers from a BBC page.
I’m a huge Liverpool fan and want to check out how teams and players are doing. First, browse to the BBC page and note its URL. Then use your browser’s inspect feature to locate the data on the page and find the appropriate XPath.
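To see how an XPath picks out part of a page without hitting the network, here is a minimal sketch. The HTML string below is a hypothetical, heavily simplified stand-in for the BBC markup (the real page is more complex and may change); only the XPath pattern itself comes from this post.

```r
library(rvest)

# Hypothetical, simplified markup; NOT the real BBC page
html <- '<html><body>
  <div id="top-scorers">
    <ol>
      <li>Player A Team X</li>
      <li>Player B Team Y</li>
    </ol>
  </div>
  <div id="other"><ol><li>Ignored</li></ol></div>
</body></html>'

page <- read_html(html) # read_html also accepts a literal HTML string

# Any element whose id is "top-scorers", then its direct <ol> child
list_node <- html_nodes(page, xpath = '//*[@id="top-scorers"]/ol')
n_matches <- length(list_node) # the second <ol> is not matched

# Extending the path by one level gives one node per list item
items <- html_nodes(page, xpath = '//*[@id="top-scorers"]/ol/li')
item_text <- html_text(items)
item_text
```

Testing an XPath on a small snippet like this is a quick way to confirm it matches what you expect before pointing it at the live site.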
Major Steps
- Use read_html and html_nodes to scrape the data
- Use strsplit to separate the features of each player’s stats
- Use data.table to organize the data
- Use plot_ly to visualize the results
Scrape the Data
knitr::opts_chunk$set(echo = TRUE)
library(rvest)
url = "http://www.bbc.com/sport/football/premier-league/top-scorers" # website to scrape
x_path = '//*[@id="top-scorers"]/ol' # xpath
website <- read_html(url)
top_scorers <- website %>%
  html_nodes(xpath = x_path) %>%
  html_text() # text scraped from website
substring(top_scorers, 1, 400) # inspect the first 400 characters
## [1] " Mohamed Salah Liverpool 148 mins per goal 3256 mins played 22 Goals scored 8 Assists Shots on targetTotal 62% 64 104 Pierre-Emerick Aubameyang Arsenal 124 mins per goal 2731 mins played 22 Goals scored 5 Assists Shots on targetTotal 56% 40 72 Sadio Mané Liverpool 140 mins per goal 3085 mins played 22 Goals scored 1 Assists "
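The scraped text comes back as one long string, so everything has to be split apart afterwards. As a rough alternative sketch, selecting each list item separately yields one string per player. The markup below is hypothetical (I am assuming one `li` per player, which may not match the real page), and `players_clean` is just an illustrative name.

```r
library(rvest)

# Hypothetical markup: one <li> per player, with untidy whitespace inside the text
html <- '<div id="top-scorers"><ol>
  <li>
    Player A   Team X
    150 mins per goal
  </li>
  <li>
    Player B   Team Y
    120 mins per goal
  </li>
</ol></div>'

players <- read_html(html) %>%
  html_nodes(xpath = '//*[@id="top-scorers"]/ol/li') %>%
  html_text() # one character string per <li>

# Collapse runs of whitespace to a single space and trim both ends
players_clean <- trimws(gsub("\\s+", " ", players))
players_clean
```

With one element per player, the later splitting steps no longer depend on how many spaces separate one player from the next.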
Place the Data in a Data.Table
The data.table package is a great tool to work with data. Check out my post here for further details. Let’s wrangle this data into something that makes sense and is easy to visualize.
library(data.table)
library(pander)
top_scorers <- strsplit(top_scorers, " ") # Use the space marker to split the data near player names
top_scorers <- data.table(name = top_scorers[[1]]) # place the results in a data.table
top_scorers$team <- sapply(top_scorers$name, function(x) unlist(strsplit(x, " "))[2]) # use the smaller space marker to split near team names
top_scorers$name <- sapply(top_scorers$name, function(x) unlist(strsplit(x, " "))[1]) # cleans up name column, remove everything after the space marker
digits <- lapply(top_scorers$team, function(x) as.numeric(unlist(strsplit(gsub("[^\\d ]+", " ", x, perl = TRUE), " ")))) # extract all the numerical data from the text; empty pieces become NA
digits <- unlist(digits) # flattens the per-player vectors into a single vector
digits <- digits[!is.na(digits)] # removes NAs
dim(digits) <- c(7,25) # reshapes the single vector into a 7 x 25 matrix (7 statistics for each of 25 players)
digits <- data.table(t(digits)) # transpose so each row is a player, then convert the matrix into a data.table
colnames(digits) <- c("minutes_per_goal", "minutes_played", "goals_scored", "assists", "shots_on_target_percentage", "shots_on_target", "shot_attempts") # column headers
top_scorers$team <- sapply(top_scorers$team, function(x) unlist(strsplit(x, " "))[1]) # clean up team column, remove everything after the space marker
top_scorers <- cbind(top_scorers, digits) # combine the data.tables
pander(top_scorers[1:5,]) # check out the first 5 players
name | team | minutes_per_goal | minutes_played | goals_scored | assists | shots_on_target_percentage | shots_on_target | shot_attempts |
---|---|---|---|---|---|---|---|---|
Mohamed Salah | Liverpool | 148 | 3256 | 22 | 8 | 62 | 64 | 104 |
Pierre-Emerick Aubameyang | Arsenal | 124 | 2731 | 22 | 5 | 56 | 40 | 72 |
Sadio Mané | Liverpool | 140 | 3085 | 22 | 1 | 55 | 42 | 76 |
Sergio Agüero | Man City | 118 | 2479 | 21 | 8 | 49 | 43 | 87 |
Jamie Vardy | Leicester | 152 | 2728 | 18 | 4 | 58 | 40 | 69 |
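The number-extraction step is the densest part of the code, so here is a small sketch of the same idea on two strings copied from the scraped output shown earlier. It swaps in base R's `regmatches` and `gregexpr` in place of the `gsub` and `strsplit` combination, so treat it as an illustration of the technique rather than the exact code used in this post; `stats`, `nums`, `stat_names` and `m` are illustrative names.

```r
# Two player strings taken from the scraped text shown earlier in the post
stats <- c(
  "148 mins per goal 3256 mins played 22 Goals scored 8 Assists Shots on targetTotal 62% 64 104",
  "124 mins per goal 2731 mins played 22 Goals scored 5 Assists Shots on targetTotal 56% 40 72"
)

# Pull every run of digits out of each string, then flatten to one numeric vector
nums <- as.numeric(unlist(regmatches(stats, gregexpr("[0-9]+", stats))))

# Seven statistics per player: fill a 7-row matrix column by column,
# then transpose so that each row is one player
stat_names <- c("minutes_per_goal", "minutes_played", "goals_scored", "assists",
                "shots_on_target_percentage", "shots_on_target", "shot_attempts")
m <- t(matrix(nums, nrow = length(stat_names)))
colnames(m) <- stat_names
m
```

Using `nrow = length(stat_names)` instead of hard-coding the dimensions means the reshape does not need to know the number of players in advance.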
Plot the Data
OK, now that we have wrangled the data into a data.table, let’s look at it briefly with a 3D scatter chart. Plotly is a great package that lets users interact with the chart. Let’s check it out.
library(plotly) # uber web-based interactive graphing tools
top_scorers$team <- as.factor(top_scorers$team) # make teams as.factor
p <- plot_ly(top_scorers, # the data.table
             x = ~ minutes_per_goal,
             y = ~ goals_scored,
             z = ~ assists,
             color = ~ team) %>% # color the markers by team
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Minutes per Goal'),
                      yaxis = list(title = 'Goals Scored'),
                      zaxis = list(title = 'Assists')))
p
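Since the chart colors players by team, a quick per-team summary can be a handy numeric companion to it. The sketch below rebuilds only the five rows printed in the table earlier in the post (not the full scraped data), and `top5` and `by_team` are illustrative names.

```r
library(data.table)

# The first five rows from the table printed earlier in the post
top5 <- data.table(
  name = c("Mohamed Salah", "Pierre-Emerick Aubameyang", "Sadio Mané",
           "Sergio Agüero", "Jamie Vardy"),
  team = c("Liverpool", "Arsenal", "Liverpool", "Man City", "Leicester"),
  goals_scored = c(22, 22, 22, 21, 18),
  assists = c(8, 5, 1, 8, 4)
)

# Count players and total their goals and assists within each team
by_team <- top5[, .(players = .N,
                    goals = sum(goals_scored),
                    assists = sum(assists)),
                by = team]

by_team[order(-goals)] # teams with the most goals first
```

The same `by = team` grouping should work on the full `top_scorers` table, assuming the column names match the ones assigned above.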