Using sentiment analysis to predict ratings of popular tv series

Unless you’ve been living under a rock for the last few years, you have probably heard of TV shows such as Breaking Bad, Mad Men, How I Met Your Mother or Game of Thrones. While I generally don’t spend a whole lot of time watching TV, I have also undergone some pretty intense binge-watching sessions in the past (they generally coincided with exam periods, which was actually not a coincidence…). As I was watching the epic final season of Breaking Bad, it got me thinking on how TV series compare to one another, and how their ratings evolve over time. I therefore decided to look a bit further into user rating trends of popular TV series (and by popular I mean the ones I know). For this, I simply had to define a quick scraping function in R that retrieves the average IMDB user ratings assigned to each episode of a given series.

'scrape_ratings' <- function(url)
{
  require(XML)
  # get HTML of url
  doc <- htmlParse(url)

  # find all tables in webpage
  tables <- readHTMLTable(doc)

  # find largest table and return as dataframe
  nrows <- unlist(lapply(tables, function(t) dim(t)[1]))
  df <- tables[[which.max(nrows)]]

  return(df)
}

# IMDB id of Breaking Bad is "tt0903747"
url <- 'http://www.imdb.com/title/tt0903747/epdate'
series.ratings <- scrape_ratings(url)

After some minor data cleaning, I was able to plot the evolution of IMDB user ratings for some of the most popular TV series. Breaking Bad looks like the highest rated series, followed closely by Game of Thrones. It is also interesting to note the big drop in ratings for shows such as Family Guy, South Park and How I Met Your Mother. The same goes for the Simpsons, who (I’ve been told) used to be excellent and are now much less fun to watch.
rating_all_series

Since I’ve recently taken an interest in NLP and some of the challenges associated with it, I also decided to perform a sentiment analysis of the TV series under study. In this case, we can use the AFINN list of positive and negative words in the English language, which provides 2477 words weighted in a range of [-5, 5] according to their “negativeness” or “positiveness”. For example, the phrase below would be scored as -3 (terrible) -2 (mistake) + 4 (wonderful) = -1

"There is a terrible mistake in this work, but it is still wonderful!"

I used a Python scraper (for any midly sophisticated scraping purposes, the BeautifulSoup Python library still has no equal in R) to retrieve the transcripts of all episodes in each TV series and computed their overall sentiment score, which produced the figure below. Here, the higher the sentiment score, the more “positive” was the episode, and vice-versa.

sentiment_all_series

Of the TV series featured here, we can see that Game of Thrones is by far the most negative of them all, which is not surprising given the plotting, killing and general all out warring that goes on in this show. On the flip side, Glee was the most positive TV series, which also makes a lot sense, given how painfully corny it can be. Of the shows that have already ended (Friends, West Wing and Grey’s anatomy), It is interesting to observe a progressive rise of positiveness as we get closer to the final episode, presumably because the writers try and end the series on a high note. I have included more detailed graphs of the rating and sentiments for each TV series at the bottom of this post.

Looking at the plot above, we can wonder whether user ratings are somehow dependent on the sentiments of a given episode. We can investigate this further by fitting a simple model in which the response is the IMDB user ratings, and predictor variables are sentiment, number of submitted votes, and TV series.

sentiment rating   VoteCount series
148       8.4      2352      BBT
61        8.4      1691      Breaking Bad
115       7.9      1418      BBT
109       8.2      1458      Game of Thrones
194       8.1      1356      Simpsons
131       8.5      1406      Simpsons

For the purpose of this study, I considered two types of model:  multiple regression and MARS (Multivariate Adaptive Regression Splines, implemented in the earth R package), and assessed their performance  using 10-fold cross-validation. Below is a plot of the root mean squared error scored by both method at each fold.

RMSEMARS appears to perform better, which is likely due to the fact that it is designed to capture non-linear and interaction effects. Overall, we see that MARS does a good job of predicting user ratings of episodes based off its overall sentiment, as the difference between true rating and predicted rating is normally distributed around zero and has relatively standard deviation.

MARS_prediction

In conclusion, while this is a relatively unrigorous study, it appears that we can predict with reasonable accuracy the average IMDB user ratings that will be assigned to an episode, so long as we know its overall sentiment score and the number of submitted votes. Of course, we could probably obtain far better accuracy if we could account for other elements such a humor, suspense and so on. Furthermore, we could extend this to predict individual user ratings rather than the average, which would ultimately make more sense since people tend to respond differently to TV series (although it would be interesting to actually confirm that). You can scroll down to look at more detailed plots of user ratings and sentiment analysis for different popular TV series. As usual, all the relevant code can be found on my GitHub account.

 


Big Bang Theory
BBT


Breaking Bad
Breaking Bad


Family Guy
Family Guy


Friends
Friends


Glee
Glee


Game of Thrones
GoT


Grey’s Anatomy
Greys Anatomy


How I Met Your Mother
HIMYM


Mad Men
Mad Men


Sex in the City
Sex in the City


Simpsons
Simpsons


South Park
South Park


West Wing
West Wing

Advertisements

On the carbon footprint of the NBA

It’s no secret that I enjoy basketball, but I’ve often wondered about the carbon footprint that can be caused by 30 teams each playing an 82-game season. Ultimately, that’s 2460 air flights across the whole of the USA, each carrying 30+ individuals.

For these reasons, I decided to investigate the average distance travelled by each NBA team during the 2013-2014 NBA season. In order to do so, I had to obtain the game schedule for the whole 2013-2014 season, but also the distances between arenas in which games are played. While obtaining the regular season schedule was straightforward (a shameless copy and paste), for the distance between arenas, I first had to extract the coordinates of each arena, which could be achieved using the geocode function in the ggmap package.

Example: finding the coordinates of NBA arenas:

# find geocode location of a given NBA arena
library(maps)
library(mapdata)
library(ggmap)
geo.tag1 <- geocode('Bankers Life Fieldhouse')
geo.tag2 <- geocode('Madison Square Garden')
print(geo.tag1)
geo.tag1
        lon     lat
1 -86.15578 39.7639

Once the coordinate of all NBA arenas were obtained, we can use this information to compute the pairwise distance matrix between each NBA arena. However we first had to define a function to compute the distance between two pairs of latitude-longitude.

Computing the distance between two coordinate points:

# Function to calculate distance in kilometers between two points
# reference: http://andrew.hedges.name/experiments/haversine/
earth.dist <- function (lon1, lat1, lon2, lat2, R)
{
  rad <- pi/180
  a1 <- lat1 * rad
  a2 <- lon1 * rad
  b1 <- lat2 * rad
  b2 <- lon2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  d <- R * c
  real.d <- min(abs((R*2) - d), d)
  return(real.d)
}

Using the function above and the coordinates of NBA arenas, the distance between any two given NBA arenas can be computed with the following lines of code.
Computing the distance matrix between all NBA arenas:

# compute distance between each NBA arena
dist <- c()
R <- 6378.145 # define radius of earth in km
lon1 <- geo.tag1$lon
lat1 <- geo.tag1$lat
lon2 <- geo.tag2$lon
lat2 <- geo.tag2$lat
dist <- earth.dist(lon1, lat1, lon2, lat2, R)

print(dist)
485.6051

By performing this operation on all pairs of NBA teams, we can compute a distance matrix, which can be used in conjunction with the 2013-2014 regular season schedule to compute the total distance travelled by each NBA teams. Finally, all that was left was to visualize the data in an attractive manner. I find the googleVis is a great resource for that, as it provides a convenient interface between R and the Google Chart Tools API. Because wordpress.com does not support javascript, you can view the interactive graph by clicking on the image below.

distance_barchart

Total distance (in km) travelled by all NBA teams during the 2013-2014 NBA regular season

Incredibly, we see that the aggregate number of kilometers travelled by NBA teams amounts to 2,108,806 kms! I hope the players have some kind of frequent flyer card…We can take this a step further by computing the amount of CO2 emitted by each NBA team during the 2013-2014 season. The NBA charters standard A319 Airbus planes, which according to the Airbus website emits an average of 9.92 kg of CO2 per km. Again, you can view the interactive graph of CO2 by clicking on the image below.

distance_CO2

Total amount of CO2 (in kg) consummed by all NBA teams during the 2013-2014 NBA regular season

Not surprisingly, Oregon and California-based teams travel and pollute the most, since the NBA is mid-east / east coast heavy in its distribution of teams. It is somewhat ironic that the hipster / recycle-crazy / eco-friendly citizens of Portland are also the host of the most polluting NBA team 🙂
What is also interesting is to plot the trail of flights (or pollution) achieved by the NBA throught the season.

team_travels

Great circle maps of all airplane flights completed by NBA teams during the 2013-2014 regular season.

I’ve been thinking about designing an algorithm that finds the NBA season schedule with minimal carbon footprint, which is essentially an optimization problem. The only issue is that there are a huge amount of restrictions to consider, such as christmas day games, first day of season games etc… More on that later.
As usual, all the relevant code for this analysis can be found on my github account.