The biggest liars in US politics

Anyone who follows US politics will be aware of the tremendous changes and volatility that have struck the US political landscape in the past year. In this post, I leverage third-party data to surface the most frequent liars, and show how to build a containerized Shiny app to visualize direct comparisons between individuals.

Comparing the contribution of NBA draft picks

When it comes to the NBA draft, experts tend to argue about a number of things: At which position will a player be selected? What is the best draft class ever? Luckily, the wealth of data that is now available makes it possible to address a number of these, and other, questions.

To begin, I wrote a quick Python script to scrape draft data for the period 1980-2014 (see the end of this post for the source code, or my GitHub). For the purpose of this analysis, I focussed on some key metrics that I deemed informative and useful enough to investigate further, which included:

  • Name of player (player)
  • College of drafted player (college)
  • Year of draft (draft_year)
  • Draft pick rank (rank)
  • Team that drafted the player (team)
  • Total games played (gp)
  • Total minutes played (mp)
  • Minutes per game (mpg)
  • Points per game (ppg)
  • Assists per game (apg)
  • Rebounds per game (rbg)
  • Win shares (ws)
  • Win shares per 48 minutes (ws_48)
  • Total years in league (yrs)

Once this was done, I began by measuring the respective contribution of each pick position during the period 1980-2014. Here, I simply computed and normalized the median statistics for each pick position. Not too surprisingly, higher draft picks tend to contribute more to their respective teams, although we do notice that some late second-round picks have high win shares per 48 minutes. It turns out that these correspond to the picks at which Manu Ginobili (57th) and Kurt Rambis (58th) were selected… but more on this later.
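The median-and-normalize step can be sketched in a few lines of Python. This is a minimal sketch rather than the script used for this post: the record layout, the toy numbers, and the choice of min-max scaling as the normalization are all assumptions.

```python
from collections import defaultdict
from statistics import median

def median_by_pick(players, metric):
    """Median of `metric` for each draft position (`rank`)."""
    values = defaultdict(list)
    for p in players:
        values[p["rank"]].append(p[metric])
    return {rank: median(v) for rank, v in values.items()}

def min_max_normalize(by_pick):
    """Rescale the per-pick medians to [0, 1] (assumed normalization)."""
    lo, hi = min(by_pick.values()), max(by_pick.values())
    if hi == lo:
        return {rank: 0.0 for rank in by_pick}
    return {rank: (v - lo) / (hi - lo) for rank, v in by_pick.items()}

# Toy records, not real draft data
players = [
    {"rank": 1, "ws": 100.0}, {"rank": 1, "ws": 60.0}, {"rank": 1, "ws": 20.0},
    {"rank": 2, "ws": 40.0}, {"rank": 2, "ws": 30.0},
    {"rank": 3, "ws": 10.0},
]
medians = median_by_pick(players, "ws")
normalized = min_max_normalize(medians)
```

Using the median rather than the mean keeps a single superstar from dominating the value of a pick position.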

Next, I decided to estimate the quality of each draft year by measuring how players performed in comparison to players picked at the same rank in other years. I was somewhat surprised to discover that the draft crop of 2008 was the one with the highest win shares, although looking back at the players that were part of that draft, it makes a lot of sense! On the other hand, the vaunted draft classes of 1984 (Olajuwon, Barkley, Jordan) and 2003 (James, Anthony, Wade, Bosh) did not fare as well, which may be attributable to the fact that these included more elite players but were far less deep in the lower picks of the draft.

Next, I looked at the longevity of each draft pick, in other words how long each draft pick is expected to remain in the NBA, which can be estimated using survival curves. Not too surprisingly, higher draft picks are much more likely to stay longer in the league. As a general observation, this also suggests that NBA teams are quite proficient at selecting the right players at the right position.
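For readers unfamiliar with survival curves, here is a minimal sketch of the Kaplan-Meier estimator, which is one common way to build them. It assumes that career length is measured in seasons and that players who are still active are treated as right-censored; the toy careers below are made up, and this is not necessarily the exact procedure used for the figures in this post.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: seasons played by each player
    observed:  True if the career has ended, False if the player is
               still active (right-censored)
    Returns a list of (time, survival probability) at each event time.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # players whose careers lasted at least t seasons
        at_risk = sum(1 for d in durations if d >= t)
        # careers that ended after exactly t seasons
        exits = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - exits / at_risk
        curve.append((t, survival))
    return curve

# Toy careers (in seasons), not real data
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

Computing one such curve per pick position (or per group of positions) gives the comparison described above.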


At this point, we can examine the relative performance of NBA teams with regards to their drafting skills. To do this, I compared the performance of each player to the average performance of other players drafted at the same position, computed the respective ratios, and summed these up for each NBA team. This analysis revealed that the top 5 drafting teams were the Detroit Pistons, Cleveland Cavaliers, Memphis Grizzlies, Phoenix Suns and San Antonio Spurs (I purposely ignored the Brooklyn Nets and the New Orleans Hornets because of the small number of years these two teams have been in the league).
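The ratio-and-sum computation can be sketched as follows. This is an illustrative sketch under a few assumptions: win shares (`ws`) is the metric, the field names are hypothetical, and for brevity the pick average includes the player being scored rather than only the other players at that position.

```python
from collections import defaultdict

def team_draft_scores(players, metric="ws"):
    """Sum, per team, each player's metric divided by the pick average."""
    totals, counts = defaultdict(float), defaultdict(int)
    for p in players:
        totals[p["rank"]] += p[metric]
        counts[p["rank"]] += 1
    pick_avg = {rank: totals[rank] / counts[rank] for rank in totals}

    scores = defaultdict(float)
    for p in players:
        if pick_avg[p["rank"]] > 0:      # skip picks with no production
            scores[p["team"]] += p[metric] / pick_avg[p["rank"]]
    return dict(scores)

# Toy records, not real draft data
players = [
    {"team": "A", "rank": 1, "ws": 90.0},
    {"team": "B", "rank": 1, "ws": 30.0},
    {"team": "A", "rank": 2, "ws": 10.0},
    {"team": "B", "rank": 2, "ws": 50.0},
]
scores = team_draft_scores(players)
```

A ratio above 1 means a player outperformed the typical player taken at the same pick, so teams with high totals have repeatedly beaten expectations.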


Finally, I decided to look for the best players picked at each position. Again, I compared each player’s career stats to the average numbers obtained by other players picked at the same position. For display purposes, I only show the top three players at each pick position, although you can easily reproduce the results by re-running my code here. (At this point, I should take the opportunity to advertise the great stargazer R package, which allows you to quickly output R objects as LaTeX or HTML tables.) The results I obtained made a lot of sense, and I was very interested to learn that even at pick position 13, Kobe Bryant was only the 2nd best pick, as he was outperformed by none other than the Mailman himself (i.e. Karl Malone). Of course, this analysis only considers numbers as opposed to achievements and trophies, but I think it is still amusing to find that Kobe Bryant isn’t even the most productive player at his draft position.

Pick  Best pick          2nd best pick            3rd best pick
1     LeBron James       John Wall                Allen Iverson
2     Isiah Thomas       Jason Kidd               Gary Payton
3     Michael Jordan     Deron Williams           Pau Gasol
4     Chris Paul         Russell Westbrook        Stephon Marbury
5     Charles Barkley    Dwyane Wade              Kevin Garnett
6     Damian Lillard     Brandon Roy              Antoine Walker
7     Kevin Johnson      Stephen Curry            Alvin Robertson
8     Andre Miller       Clark Kellogg            Detlef Schrempf
9     Dirk Nowitzki      Tracy McGrady            Andre Iguodala
10    Paul Pierce        Brandon Jennings         Paul George
11    Fat Lever          Michael Carter-Williams  Terrell Brandon
12    Mookie Blaylock    Muggsy Bogues            John Bagley
13    Karl Malone        Kobe Bryant              Sleepy Floyd
14    Tim Hardaway       Clyde Drexler            Peja Stojakovic
15    Steve Nash         Al Jefferson             Gary Grant
16    John Stockton      Nikola Vucevic           Metta World Peace
17    Jrue Holiday       Josh Smith               Shawn Kemp
18    Mark Jackson       Ty Lawson                Joe Dumars
19    Rod Strickland     Zach Randolph            Jeff Teague
20    Larry Nance        Jameer Nelson            Paul Pressey
21    Rajon Rondo        Darren Collison          Michael Finley
22    Scott Skiles       Reggie Lewis             Kenneth Faried
23    Tayshaun Prince    A.C. Green               Wilson Chandler
24    Sam Cassell        Kyle Lowry               Arvydas Sabonis
25    Jeff Ruland        Mark Price               Nicolas Batum
26    Vlade Divac        Kevin Martin             George Hill
27    Dennis Rodman      Jamaal Tinsley           Jordan Crawford
28    Tony Parker        Sherman Douglas          Gene Banks
29    Toni Kukoc         Josh Howard              P.J. Brown
30    Gilbert Arenas     Nate McMillan            David Lee
31    Doc Rivers         Danny Ainge              Nikola Pekovic
32    Rashard Lewis      Brent Price              Luke Walton
33    Grant Long         Dirk Minniefield         Steve Colter
34    Carlos Boozer      Mario Chalmers           C.J. Miles
35    Mike Iuzzolino     DeAndre Jordan           Derek Smith
36    Clifford Robinson  Ersan Ilyasova           Omer Asik
37    Nick Van Exel      Mehmet Okur              Jeff McInnis
38    Chandler Parsons   Chris Duhon              Steve Blake
39    Rafer Alston       Earl Watson              Khris Middleton
40    Monta Ellis        Dino Radja               Lance Stephenson
41    Cuttino Mobley     Popeye Jones             Otis Smith
42    Stephen Jackson    Patrick Beverley         Matt Geiger
43    Michael Redd       Eric Snow                Trevor Ariza
44    Chase Budinger     Malik Rose               Cedric Henderson
45    Goran Dragic       Hot Rod Williams         Antonio Davis
46    Jeff Hornacek      Jerome Kersey            Voshon Lenard
47    Paul Millsap       Mo Williams              Gerald Wilkins
48    Marc Gasol         Micheal Williams         Cedric Ceballos
49    Andray Blatche     Haywoode Workman         Kyle O’Quinn
50    Ryan Gomes         Paul Thompson            Lavoy Allen
51    Kyle Korver        Jim Petersen             Lawrence Funderburke
52    Fred Hoiberg       Anthony Goldwire         Lowes Moore
53    Anthony Mason      Tod Murphy               Greg Buckner
54    Sam Mitchell       Shandon Anderson         Mark Blount
55    Luis Scola         Kenny Gattison           Patrick Mills
56    Ramon Sessions     Amir Johnson             Joe Kopicki
57    Manu Ginobili      Marcin Gortat            Frank Brickowski
58    Kurt Rambis        Don Reid                 Robbie Hummel
60    Isaiah Thomas      Drazen Petrovic          Robert Sacre

As usual, all the code for this analysis can be found on my GitHub account.


With regards to the analysis shown above, it is important to highlight a few potential caveats:

  • I worked with career averages, which somewhat ignores the years of peak performance achieved by certain players. However, I feel that career averages are a reasonably good proxy for overall player competence.
  • I completely ignored the fact that some teams had more opportunities to select higher draft picks than others (cough…Cleveland…cough). As such, there may be a bias towards historically bad teams that would have been in the top 5 picks more often than others. However, I did compare each player to others that were picked at the same position, so hopefully this will bypass the issue (for example, if a team had plenty of No. 1 picks that were bad compared to other No. 1 picks, this insight will be revealed in the analysis).

Analyzing package dependencies and download logs from RStudio, and a start towards building an R recommendation engine

In this post, I will focus on the analysis of available R packages and the behavior of its users. In essence, this involves looking at the data in two different ways (1) relationships among available R packages in CRAN and (2) tracking the behavior of R users through download logs on CRAN mirrors. I will then leverage all this data to make a feeble attempt towards building an R recommendation engine…

Investigating dependency relationships among available R packages in CRAN

In this first section, I look at the dependency relationships among all available R packages in CRAN. To do so, I first wrote a Python script to scrape all the data from CRAN, and used a dictionary data structure in which each key was a given target package, and its associated value was the list of packages it depended on. For example, the entry for the package mixtools, which depends on the boot, MASS and segmented packages, is stored as:

package_dependency['mixtools'] = ["boot", "MASS", "segmented"]

Once this is done, it is straightforward to build an adjacency matrix and plot the network of R package dependencies (Couldn’t resist the temptation! I love me some network). For the visualization, I used the d3Network package made available by Christopher Gandrud. By clicking on the image below, you will be able to download an HTML file, which you can open in your browser to display an interactive force-directed network that shows dependencies among all R packages in CRAN. Each node has been color-coded according to the community it belongs to. For the sake of transparency, I should add that I removed all packages with no dependencies.


Graphical model summarizing the dependencies among all available R packages in CRAN

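As an added bonus, I have also included the subset of the script used to generate the graph above.

# load required packages
library(igraph)
library(d3Network)

# assume mat is an adjacency matrix

# create edge list
g <- graph.adjacency(mat, mode = 'directed')
# remove loops
g <- simplify(g)
df <- get.edgelist(g, names=TRUE)
df <- as.data.frame(df, stringsAsFactors = FALSE)
colnames(df) <- c('source', 'target')
df$value <- rep(1, nrow(df))

# get communities
# (assumption: fastgreedy.community on the undirected graph is used here;
# any igraph community detection method could be substituted)
fc <- fastgreedy.community(as.undirected(g))
#ebc <- edge.betweenness.community(g, directed=TRUE)
com <- membership(fc)
nodes <- data.frame(name=names(com), group=as.vector(com))
# d3Network expects zero-based node indices
links <- data.frame(source=match(df$source, nodes$name)-1,
                    target=match(df$target, nodes$name)-1,
                    value=df$value)

d3ForceNetwork(Links = links, Nodes = nodes,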
               Source = "source", Target = "target",
               Value = "value", NodeID = "name",
               linkDistance = 250,
               Group = "group", width = 1300, height = 1300,
               opacity = 1, zoom = TRUE, file='network.html')

We can next look for the hub packages in our network, which is simply done by summing each row of the adjacency matrix and sorting in descending order. The top entries correspond to the packages on which many other packages depend. Using this data, we can study the rate at which newly released packages depend on those hub packages, as shown in the figure below.
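Starting from the dependency dictionary described earlier, the hub ranking can be sketched without building the full matrix: counting how many packages list a given package as a dependency is equivalent to summing the corresponding row (or column, depending on orientation) of the adjacency matrix. Apart from mixtools, the entries below are hypothetical and only there to make the example runnable.

```python
from collections import Counter

def reverse_dependency_counts(package_dependency):
    """Count how many packages depend on each package."""
    counts = Counter()
    for package, deps in package_dependency.items():
        for dep in set(deps):          # ignore duplicate entries
            if dep != package:         # ignore self-loops
                counts[dep] += 1
    return counts

package_dependency = {
    "mixtools": ["boot", "MASS", "segmented"],
    # hypothetical entries for illustration
    "pkgA": ["MASS"],
    "pkgB": ["MASS", "boot"],
}
hubs = reverse_dependency_counts(package_dependency).most_common(2)
```

`most_common` returns the packages sorted by descending count, which is the sorted list of hubs.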


We can also look at the rate of updates that have occurred for each R package. You can notice a huge spike in October 2012. This could be attributable to the RStudio 0.97.168 release of October 14th, 2012, which came with the addition of the RStudio CRAN mirror (via Amazon CloudFront) for fast package downloads.


Number of R package updates each month between 2012 and 2014

Investigating properties of R package downloads in the RStudio CRAN server

Back in June 2013, the RStudio blog announced that it was providing anonymised log data gathered from its own CRAN mirror, an announcement that prompted a few people to analyze some of the available data: Felix Schonbrodt showed how to track R package downloads, Tal Galili looked for the most popular R packages, and James Cheshire created a map showing the activity of R users across the world. First, we must download all the log data available on the RStudio CRAN server, which we can do by using the example script provided here. Once we have collected all the data, we can begin to explore some of the most obvious questions, for example, the number of downloads that occur every day:

Number of daily R package downloads on the RStudio CRAN server

We can also look at the number of unique users (which to my understanding can be inferred from the daily unique ids assigned to each IP address) that download R packages from the RStudio CRAN server:

Number of daily unique users downloading R packages
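Both daily series boil down to a group-by on the date. The sketch below assumes the logs have been read as rows with at least a `date` and an `ip_id` column, and uses a made-up four-line sample rather than real log data. Since the ids are reassigned every day, users can only be counted within a day, not tracked across days.

```python
import csv
import io
from collections import defaultdict

def daily_summary(rows):
    """Return {date: (number of downloads, number of unique ip_id values)}."""
    downloads = defaultdict(int)
    users = defaultdict(set)
    for row in rows:
        downloads[row["date"]] += 1
        users[row["date"]].add(row["ip_id"])
    return {d: (downloads[d], len(users[d])) for d in downloads}

# Made-up sample in the assumed log layout
sample = io.StringIO(
    "date,package,r_os,ip_id\n"
    "2014-06-21,ggplot2,mingw32,1\n"
    "2014-06-21,plyr,mingw32,1\n"
    "2014-06-21,ggplot2,linux-gnu,2\n"
    "2014-06-22,stringr,darwin10.8.0,1\n"
)
summary = daily_summary(csv.DictReader(sample))
```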

We can notice a couple of things from the two plots above:

  1. The weekly cycles in both user numbers and R package downloads, with strong activity during the weekdays and considerable dips during the weekend
  2. The obvious dips in activity during the Christmas-New Year season, most notably on New Year’s Eve, which unequivocally confirms the widespread belief that R users are indeed wild party animals that would rather celebrate than download R packages.
  3. There is an obvious peak in activity in both the number of unique users and package downloads that occurred on the 22nd June 2014. While I cannot confirm the causality of this, it is intriguing to note that RStudio 0.98.932 was released the previous day. This version was particularly exciting because it introduced all-new R Markdown functionalities and also made it possible to produce interactive documents, which presumably could have driven a lot of people to update their version of RStudio. (But again, those are only my meandering thoughts…)

Finally, we can check the operating system on which R users were depending at the time of their R package downloads:

Operating systems (kernels) of R users

We see that R users depend on various flavors of Mac OS or Linux, but that a wide majority of R users are Windows users. At this time, I feel like I should make some snide comment but that would be misguided, since it would (clearly!) mean insulting a lot of people. Also, if you are reading this and are one of the users still depending on Mac OS Tiger 10.4: Stop. Let it go. It’s had its time. Upgrade.

Identifying pairs of packages that are often downloaded together

One thing that has always surprised me about R is the complete absence of a package recommendation feature. While “recommendation features” can often be annoying and get in the way of things, I believe it should be possible to seamlessly embed this feature within the R experience, for example by replicating something along the lines of the package dependencies argument in the install.packages() function. R does have a “suggests” section in the package description, but I find it to be lacking a little.

There have been previous attempts to build an R package recommendation engine, most notably the competition hosted by Kaggle a few years ago. Here, I just took the most straightforward approach, and simply looked at the number of co-occurring downloads among all pairs of R packages. Of course, I excluded all co-occurring downloads that could have been attributed to package dependencies. The top 20 co-occurring R package downloads are shown in the table below. The column headers are:

  • package1: the first package in the pair of co-occurring R packages
  • prob(package1): the frequency with which package 1 is downloaded at the same time as package 2. It is calculated as the ratio of (number of times package 1 is downloaded at the same time as package 2) / (number of total downloads for package 1)
  • package2: the second package in the pair of co-occurring R packages
  • prob(package2): the frequency with which package 2 is downloaded at the same time as package 1. It is calculated as the ratio of (number of times package 2 is downloaded at the same time as package 1) / (number of total downloads for package 2)
  • co-occurrence count: the total number of times package 1 and package 2 were downloaded at the same time

     package1  prob(package1)  package2   prob(package2)  co-occurrence count
1    stringr   0.7             digest     0.69            377437
2    munsell   0.85            labeling   0.86            318600
3    scales    0.77            reshape2   0.69            315957
4    scales    0.77            gtable     0.86            312918
5    munsell   0.83            dichromat  0.86            311984
6    munsell   0.83            ggplot2    0.44            311875
7    labeling  0.84            dichromat  0.85            310973
8    ggplot2   0.44            dichromat  0.85            310945
9    munsell   0.82            gtable     0.85            309486
10   labeling  0.84            ggplot2    0.44            308816
11   gtable    0.85            dichromat  0.85            307714
12   labeling  0.83            gtable     0.85            307147
13   stringr   0.56            plyr       0.51            304778
14   proto     0.8             gtable     0.84            303274
15   scales    0.74            proto      0.8             302578
16   proto     0.79            dichromat  0.83            301528
17   proto     0.79            munsell    0.8             300881
18   reshape2  0.66            gtable     0.83            300809
19   proto     0.79            labeling   0.81            299554
20   reshape2  0.65            munsell    0.79            298857

The table above shows that many co-occurring downloads of R packages involve color schemes. While package dependencies were removed from this analysis, we could actually further improve this by eliminating all pairs belonging to the same community of packages detected in Part I of this analysis.
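The counting behind such a table can be sketched as follows. This is a simplified illustration: it assumes that “downloaded at the same time” means “downloaded by the same daily user id on the same day”, so that each session is a set of package names, and the sessions and the dependency entry below are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_table(sessions, dependencies=None):
    """sessions: iterable of sets of packages downloaded together.
    dependencies: optional {package: set of packages it depends on}.
    Returns rows (package1, prob1, package2, prob2, count), most frequent first."""
    dependencies = dependencies or {}
    totals, pairs = Counter(), Counter()
    for packages in sessions:
        totals.update(packages)
        for a, b in combinations(sorted(packages), 2):
            if b in dependencies.get(a, ()) or a in dependencies.get(b, ()):
                continue                 # skip pairs explained by a dependency
            pairs[(a, b)] += 1
    return [
        (a, n / totals[a], b, n / totals[b], n)
        for (a, b), n in pairs.most_common()
    ]

sessions = [
    {"stringr", "digest"},
    {"stringr", "digest", "ggplot2"},
    {"stringr"},
    {"ggplot2", "scales"},
]
dependencies = {"ggplot2": {"scales"}}   # hypothetical, for the toy data only
table = cooccurrence_table(sessions, dependencies)
```

Only direct dependencies are filtered here; removing indirect ones would require walking the dependency graph first.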
Thanks for reading!

Competitive balance and home court advantage in the NBA

Two years ago, the NBA season went into lockout, mostly for financial reasons. However, one central point was also about keeping a competitive balance within the NBA, so that large and small-market teams alike would have a chance to compete for a championship. This brings us to the obvious question: “Is there competitive balance in the NBA?” If we define competitive balance by the variety of teams that win a championship, then the blunt answer is a definite “no”. Under true competitive balance, and assuming 30 teams per season, a fair league would give each team roughly a 1/30 chance of winning the championship in any given season. If we look at the actual distribution of championships across teams from 1980 to now, we can see that this is clearly not the case:


We can use the properties of the multinomial distribution to find the probability of actually observing this distribution under the scenario of a fair league (a 1/30 chance of winning for each team), which happens to be p = 3.812135e-27.

# NBA_finals_data.txt contains a list of all NBA champions from 1947 to present
dat <- read.table(file='NBA_finals_data.txt', sep='\t', header=TRUE)
head(dat)
  Year  Lg           Champion             Runner.Up
1 2014 NBA  San Antonio Spurs            Miami Heat
2 2013 NBA         Miami Heat     San Antonio Spurs
3 2012 NBA         Miami Heat Oklahoma City Thunder
4 2011 NBA   Dallas Mavericks            Miami Heat
5 2010 NBA Los Angeles Lakers        Boston Celtics
6 2009 NBA Los Angeles Lakers         Orlando Magic

# restrict analysis from 1980 to present
champions <- as.vector(dat[which(dat$Year >= 1980), 'Champion'])
champ.freq <- table(champions)

# create vector of number of championships won by each team
# we assume that there were ~30 active teams per year
obs <- c(champ.freq, rep(0, (30-length(champ.freq))))

# compute probability of observing the list of champions we've had from 1980 to present
nom <- factorial(sum(champ.freq))
denom <- prod(sapply(obs, factorial))
prob <- (nom / denom) * prod(sapply(obs, function(x) (1/30)^x))
prob
[1] 3.812135e-27
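For readers who want to check the arithmetic, the same multinomial probability, n! / (x1! x2! … xk!) × (1/k)^n for n seasons, k teams and xi titles for team i, can be written with exact fractions. The sketch below is a Python restatement of the calculation under the same fair-league assumption and is illustrated on a toy league rather than on the real championship counts.

```python
from fractions import Fraction
from math import factorial

def multinomial_probability(counts, n_teams):
    """Probability of `counts` titles per team if every season is an
    independent draw with probability 1/n_teams for each team.
    Teams missing from `counts` are treated as having zero titles."""
    counts = list(counts) + [0] * (n_teams - len(counts))
    seasons = sum(counts)
    arrangements = factorial(seasons)
    for c in counts:
        arrangements //= factorial(c)    # exact at every step
    return arrangements * Fraction(1, n_teams) ** seasons

# Toy league: 3 teams, 2 seasons, one team wins both titles
p = multinomial_probability([2], n_teams=3)
```

Note that, like the R code, this gives the probability of one specific assignment of title counts to teams.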

While we’ve established that the competitive balance in the NBA is skewed towards a subset of teams, we can also attempt to define competitive balance as the separation between individual teams in the league. One way would be to look at playoff appearances over the years, but I focussed instead on closeness of games (mostly because it was more appealing). I have written a Python script to scrape all game scores from 1946 to 2014, which also includes home team information. The data was scraped from the landofbasketball website and dumped in a SQL database. Both the scripts and the data file can be obtained from my GitHub account.

By defining competitive advantage as the point differential obtained between two teams that play against each other, a smaller overall point differential indicates that teams are of a similar level, and by extension, that the overall league is competitive. In order to do this, we can first look at the total number of points scored per game across the years 1980 to now.


While the 80’s had high scoring games, this gradually decreased from 1987 onwards, reaching its lowest point in the late 90’s. The turn of the century saw an increase in the number of points scored per game. Interestingly, the trend above does not match the point differential per game, which stays pretty constant throughout the period of 1980-2014.
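The two quantities being compared, total points per game and point differential per game, can be computed per season along these lines. This is a minimal sketch: the tuple layout is an assumption about how the scraped scores are stored, and the games are invented to show that scoring can change while the differential stays the same.

```python
from collections import defaultdict
from statistics import mean

def season_summary(games):
    """games: iterable of (season, home_points, away_points).
    Returns {season: (mean total points, mean absolute point differential)}."""
    totals, margins = defaultdict(list), defaultdict(list)
    for season, home, away in games:
        totals[season].append(home + away)
        margins[season].append(abs(home - away))
    return {s: (mean(totals[s]), mean(margins[s])) for s in totals}

# Invented scores, not real games
games = [
    (1985, 120, 110), (1985, 115, 105),
    (1999, 90, 80), (1999, 95, 85),
]
summary = season_summary(games)
```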


This indicates that despite changes in scoring trends over the past three decades, the overall game competitiveness of the NBA has remained fairly stable. Next, we can break down the data to the monthly level. In the plot below, we show the total number of points scored during each month of the NBA season from 1980 to 2014. Here, each line represents a month, and each block of lines represents a year (omitting July-September when no games are played). As you can see, the total number of points scored tends to be stable during the course of a season, although there is a noticeable drop in points scored in the last two months of the season (May-June), which corresponds to the playoff portion of the season.


Again, we can compare the pattern above to the point differential of each game, as assessed on a monthly basis. There is a little fluctuation over time, although we notice that playoff games (played in the last two months of each season) tend to be a lot closer. Again, this shows that the volume of points scored during different months and years does not impact the overall competitiveness of the league, presumably because teams adapt as a whole to the pace and style of play that occurs during any given period of time.


Competitive advantage can also be applied to games that are played at home or away. I often hear players, coaches and experts talk of the benefits of home-court advantage, and how the fans can really inspire the home team to a victory. Here, we can visualize the average point differential for teams when they play at home or away.

Average point differential for teams playing at home

Average point differential for teams playing away from home
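The home and away averages can be sketched as below, with the differential always taken from the point of view of the team in question (positive means that team outscored its opponent). The tuple layout and the three games are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def home_away_differential(games):
    """games: iterable of (home_team, home_points, away_team, away_points).
    Returns {team: (mean differential at home, mean differential away)}."""
    home, away = defaultdict(list), defaultdict(list)
    for home_team, home_pts, away_team, away_pts in games:
        home[home_team].append(home_pts - away_pts)
        away[away_team].append(away_pts - home_pts)
    teams = set(home) | set(away)
    return {
        t: (mean(home[t]) if home[t] else None,
            mean(away[t]) if away[t] else None)
        for t in teams
    }

# Invented scores, not real games
games = [
    ("A", 100, "B", 90),
    ("B", 95, "A", 97),
    ("A", 110, "B", 100),
]
differential = home_away_differential(games)
```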

Finally, we can digress a little and visualize the average number of points scored by each team during the period of 1980 to 2014.


The luckiest team in the NBA

While the NBA finals are in full swing and the two best teams are battling it out for the ultimate prize, another 28 are now on summer vacation. In order to achieve their goal of still playing this time next year, teams often look to improve their roster through trades, player development, and most importantly, the NBA draft. Through the draft, savvy teams can drastically improve their results by selecting potential franchise players, players that fill a given need, or that fit team chemistry.

Since 1994, the NBA has used a weighted lottery system in which teams with the worst regular-season record are conferred a higher probability of obtaining the first pick (more details here). More recently, the 2014 draft has attracted some attention because of its depth (some well-respected scouts and analysts have projected up to five potential franchise players in there!). The interest was further increased when the Cleveland Cavaliers won the first pick for the third time in four years, and with only a 1.7% probability of winning it. While we cannot deny their luck, it also got me thinking about which franchise has had the most luck in the draft from 1994 to now. For this, I used a Python script to scrape draft data from the Real GM website.

First, we can look at the number of times each team was part of the NBA draft lottery, and their average draft position between 1994 and 2014.


In the past twenty years, the LA Clippers, Golden State Warriors, Toronto Raptors, Washington Wizards and Minnesota Timberwolves have participated in the lottery the most often. It is interesting to notice that of these five teams, only the Minnesota Timberwolves were not part of the playoffs this season. Amazingly, the San Antonio Spurs have only been in the lottery once since 1994, which was in 1997 when they landed the first pick and Tim Duncan (who, 17 years later, is still active and playing a leading role in this year’s Finals!). For each of the teams shown here, we can also count the number of times each team received the first pick in the draft, or one in the top 3.


Here, the Chicago Bulls, LA Clippers and Philadelphia Sixers have accumulated the most top 3 picks in the past twenty years. However, the Cleveland Cavaliers have received by far the most first picks in the lottery. While informative, the figure above is not normalized for the number of times each team has been in the lottery, and also doesn’t show the number of positions gained or lost during each lottery. To do this, we can look at the luck of each team by comparing its expected pick against where it actually ended up after the lottery order was selected.


The heatmap above shows the change in position for each team in the lottery between 1994 and 2014. Blue cells indicate that a team gained positions during the lottery draw, red cells indicate that a team lost positions during the lottery, and white cells mean that there was no change in position or that the team was not part of the lottery. Overall, the Cleveland Cavaliers, Philadelphia Sixers and Chicago Bulls have been the three luckiest teams in the draft. The two biggest gains in lottery position occurred in 2014 (Cleveland Cavaliers) and 2008 (Chicago Bulls), when both teams jumped from position 9 to number 1 (with 0.017 probability). On the flip side, the Minnesota Timberwolves and Sacramento Kings have been the unluckiest franchises since the NBA draft started a weighted lottery system.
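The quantity behind each cell of the heatmap is simply the expected pick minus the actual pick, and summing it per team gives an overall luck score. The sketch below assumes one record per team and lottery year; the team names and numbers are placeholders, not scraped data.

```python
from collections import defaultdict

def lottery_luck(results):
    """results: iterable of (team, year, expected_pick, actual_pick).
    Positive totals mean the team moved up in the draw overall."""
    gains = defaultdict(int)
    for team, _year, expected, actual in results:
        gains[team] += expected - actual
    return dict(gains)

# Placeholder records for illustration
results = [
    ("Team A", 2014, 9, 1),   # jumped eight spots
    ("Team A", 2013, 3, 4),   # dropped one spot
    ("Team B", 2014, 1, 2),   # dropped one spot
]
luck = lottery_luck(results)
```

Dividing each total by the number of lottery appearances would give the normalized version mentioned earlier.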

Next, I intend to extend this analysis by exploring which NBA team has the best scouting track record. In other words, I will look at the total contribution that each player brought to the team that drafted them.

Using sentiment analysis to predict ratings of popular tv series

Unless you’ve been living under a rock for the last few years, you have probably heard of TV shows such as Breaking Bad, Mad Men, How I Met Your Mother or Game of Thrones. While I generally don’t spend a whole lot of time watching TV, I have also undergone some pretty intense binge-watching sessions in the past (they generally coincided with exam periods, which was actually not a coincidence…). As I was watching the epic final season of Breaking Bad, it got me thinking on how TV series compare to one another, and how their ratings evolve over time. I therefore decided to look a bit further into user rating trends of popular TV series (and by popular I mean the ones I know). For this, I simply had to define a quick scraping function in R that retrieves the average IMDB user ratings assigned to each episode of a given series.

library(XML)

scrape_ratings <- function(url)
{
  # get HTML of url
  doc <- htmlParse(url)

  # find all tables in webpage
  tables <- readHTMLTable(doc)

  # find largest table and return as dataframe
  nrows <- unlist(lapply(tables, function(t) dim(t)[1]))
  df <- tables[[which.max(nrows)]]

  return(df)
}

# IMDB id of Breaking Bad is "tt0903747"
url <- ''
series.ratings <- scrape_ratings(url)

After some minor data cleaning, I was able to plot the evolution of IMDB user ratings for some of the most popular TV series. Breaking Bad looks like the highest rated series, followed closely by Game of Thrones. It is also interesting to note the big drop in ratings for shows such as Family Guy, South Park and How I Met Your Mother. The same goes for the Simpsons, who (I’ve been told) used to be excellent and are now much less fun to watch.

Since I’ve recently taken an interest in NLP and some of the challenges associated with it, I also decided to perform a sentiment analysis of the TV series under study. In this case, we can use the AFINN list of positive and negative words in the English language, which provides 2477 words weighted in a range of [-5, 5] according to their “negativeness” or “positiveness”. For example, the phrase below would be scored as -3 (terrible) -2 (mistake) + 4 (wonderful) = -1

"There is a terrible mistake in this work, but it is still wonderful!"

I used a Python scraper (for any mildly sophisticated scraping purposes, the BeautifulSoup Python library still has no equal in R) to retrieve the transcripts of all episodes in each TV series and computed their overall sentiment score, which produced the figure below. Here, the higher the sentiment score, the more “positive” the episode was, and vice-versa.
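Scoring a piece of text against the AFINN list amounts to a dictionary lookup and a sum. The sketch below reproduces the worked example from earlier with a three-word excerpt of the lexicon; the tokenization rule is an assumption, and a full analysis would load the complete list rather than this subset.

```python
import re

# Tiny excerpt with the three weights used in the example phrase
AFINN_SUBSET = {"terrible": -3, "mistake": -2, "wonderful": 4}

def sentiment_score(text, lexicon):
    """Sum the weights of all lexicon words found in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

phrase = "There is a terrible mistake in this work, but it is still wonderful!"
score = sentiment_score(phrase, AFINN_SUBSET)   # -3 - 2 + 4 = -1
```

Applying the same function to a whole transcript gives the episode-level score.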


Of the TV series featured here, we can see that Game of Thrones is by far the most negative of them all, which is not surprising given the plotting, killing and general all-out warring that goes on in this show. On the flip side, Glee was the most positive TV series, which also makes a lot of sense, given how painfully corny it can be. Of the shows that have already ended (Friends, West Wing and Grey’s Anatomy), it is interesting to observe a progressive rise of positiveness as we get closer to the final episode, presumably because the writers try to end the series on a high note. I have included more detailed graphs of the rating and sentiments for each TV series at the bottom of this post.

Looking at the plot above, we can wonder whether user ratings are somehow dependent on the sentiments of a given episode. We can investigate this further by fitting a simple model in which the response is the IMDB user ratings, and predictor variables are sentiment, number of submitted votes, and TV series.

sentiment rating   VoteCount series
148       8.4      2352      BBT
61        8.4      1691      Breaking Bad
115       7.9      1418      BBT
109       8.2      1458      Game of Thrones
194       8.1      1356      Simpsons
131       8.5      1406      Simpsons

For the purpose of this study, I considered two types of model: multiple regression and MARS (Multivariate Adaptive Regression Splines, implemented in the earth R package), and assessed their performance using 10-fold cross-validation. Below is a plot of the root mean squared error scored by both methods at each fold.
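The evaluation loop itself is independent of the model being fitted. As a rough illustration, the sketch below runs 10-fold cross-validation with a one-predictor least-squares line standing in for the actual models; the data are synthetic and noise-free, and the fold assignment (row i goes to fold i mod k) is a simplification of the usual random split.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def cross_validated_rmse(xs, ys, k=10):
    """RMSE on each of k held-out folds (fold i holds rows i, i+k, ...)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
        scores.append(sqrt(sum(errors) / len(errors)))
    return scores

# Synthetic, noise-free data: rating = 7.0 + 0.01 * sentiment
sentiment = list(range(0, 200, 5))          # 40 made-up sentiment scores
rating = [7.0 + 0.01 * s for s in sentiment]
rmse_per_fold = cross_validated_rmse(sentiment, rating, k=10)
```

With real data the rows should be shuffled before being assigned to folds, and the per-fold values are what would be plotted for each model.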

MARS appears to perform better, which is likely due to the fact that it is designed to capture non-linear and interaction effects. Overall, we see that MARS does a good job of predicting the user rating of an episode based on its overall sentiment, as the difference between true rating and predicted rating is normally distributed around zero and has a relatively small standard deviation.


In conclusion, while this is a relatively unrigorous study, it appears that we can predict with reasonable accuracy the average IMDB user ratings that will be assigned to an episode, so long as we know its overall sentiment score and the number of submitted votes. Of course, we could probably obtain far better accuracy if we could account for other elements such as humor, suspense and so on. Furthermore, we could extend this to predict individual user ratings rather than the average, which would ultimately make more sense since people tend to respond differently to TV series (although it would be interesting to actually confirm that). You can scroll down to look at more detailed plots of user ratings and sentiment analysis for different popular TV series. As usual, all the relevant code can be found on my GitHub account.


Big Bang Theory

Breaking Bad

Family Guy

Game of Thrones

Grey’s Anatomy

How I Met Your Mother

Mad Men

Sex and the City

South Park

West Wing

On the carbon footprint of the NBA

It’s no secret that I enjoy basketball, but I’ve often wondered about the carbon footprint that can be caused by 30 teams each playing an 82-game season. Ultimately, that’s 2460 air flights across the whole of the USA, each carrying 30+ individuals.

For these reasons, I decided to investigate the average distance travelled by each NBA team during the 2013-2014 NBA season. In order to do so, I had to obtain the game schedule for the whole 2013-2014 season, but also the distances between arenas in which games are played. While obtaining the regular season schedule was straightforward (a shameless copy and paste), for the distance between arenas, I first had to extract the coordinates of each arena, which could be achieved using the geocode function in the ggmap package.

Example: finding the coordinates of NBA arenas:

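# find geocode location of a given NBA arena
# (recent versions of ggmap need a Google API key, set via register_google())
library(ggmap)
geo.tag1 <- geocode('Bankers Life Fieldhouse')
geo.tag2 <- geocode('Madison Square Garden')
geo.tag1
#         lon     lat
# 1 -86.15578 39.7639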

Once the coordinates of all NBA arenas were obtained, we can use this information to compute the pairwise distance matrix between all NBA arenas. However, we first have to define a function that computes the distance between two latitude-longitude points.

Computing the distance between two coordinate points:

# Function to calculate distance in kilometers between two points
# reference:
earth.dist <- function (lon1, lat1, lon2, lat2, R) {
  # convert degrees to radians (a = first point, b = second point)
  rad <- pi/180
  a1 <- lat1 * rad
  a2 <- lon1 * rad
  b1 <- lat2 * rad
  b2 <- lon2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  # haversine formula: central angle c, then arc length d on a sphere of radius R
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  d <- R * c
  real.d <- min(abs((R*2) - d), d)
  return(real.d)
}

Using the function above and the coordinates of NBA arenas, the distance between any two given NBA arenas can be computed with the following lines of code.

Computing the distance between two NBA arenas:

# compute distance between each NBA arena
dist <- c()
R <- 6378.145 # define radius of earth in km
lon1 <- geo.tag1$lon
lat1 <- geo.tag1$lat
lon2 <- geo.tag2$lon
lat2 <- geo.tag2$lat
dist <- earth.dist(lon1, lat1, lon2, lat2, R)
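
Extending the two-arena computation to every pair of arenas, and then to a schedule, could look like the sketch below. This is an illustration under stated assumptions rather than the original code: `haversine_km` is a plain vectorised haversine (it skips the `min()` step of `earth.dist`), only the first row of coordinates comes from the geocode output above (the other two are approximate, hand-typed values), and the three-game schedule is hypothetical.

```r
# Haversine great-circle distance in km, vectorised over its arguments.
# Hypothetical helper; R defaults to the earth radius used in the post.
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6378.145) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * atan2(sqrt(a), sqrt(1 - a))
}

# Illustrative coordinates: row 1 is the geocode output shown earlier,
# rows 2 and 3 are approximate values entered by hand.
arenas <- data.frame(
  team = c("IND", "NYK", "LAL"),
  lon  = c(-86.15578, -73.9934, -118.2673),
  lat  = c(39.7639, 40.7505, 34.0430),
  stringsAsFactors = FALSE
)

# Pairwise distance matrix, filled one row at a time
n <- nrow(arenas)
dist_km <- matrix(0, n, n, dimnames = list(arenas$team, arenas$team))
for (i in seq_len(n)) {
  dist_km[i, ] <- haversine_km(arenas$lon[i], arenas$lat[i], arenas$lon, arenas$lat)
}

# Hypothetical mini-schedule, one row per game. ASSUMPTION: the away team
# flies from its own arena to the host and straight back (ignores road trips).
schedule <- data.frame(
  away = c("IND", "NYK", "LAL"),
  home = c("NYK", "LAL", "IND"),
  stringsAsFactors = FALSE
)
leg_km   <- 2 * dist_km[cbind(schedule$away, schedule$home)]
total_km <- tapply(leg_km, schedule$away, sum)
print(round(dist_km))
print(round(total_km))
```

A real schedule would need to chain consecutive away games into road trips, so the round-trip assumption here overstates the distance for teams on long trips.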


By performing this operation on all pairs of NBA teams, we can compute a distance matrix, which can be used in conjunction with the 2013-2014 regular season schedule to compute the total distance travelled by each NBA team. Finally, all that was left was to visualize the data in an attractive manner. I find that googleVis is a great resource for that, as it provides a convenient interface between R and the Google Chart Tools API. Because this blog's host does not support JavaScript, you can view the interactive graph by clicking on the image below.


Total distance (in km) travelled by all NBA teams during the 2013-2014 NBA regular season

Incredibly, we see that the aggregate distance travelled by NBA teams amounts to 2,108,806 km! I hope the players have some kind of frequent flyer card… We can take this a step further by computing the amount of CO2 emitted by each NBA team during the 2013-2014 season. The NBA charters standard Airbus A319 planes, which according to the Airbus website emit an average of 9.92 kg of CO2 per km. Again, you can view the interactive graph of CO2 by clicking on the image below.


Total amount of CO2 (in kg) emitted by all NBA teams during the 2013-2014 NBA regular season
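
As a quick sanity check on the scale, the league-wide total follows directly from the two figures quoted above (2,108,806 km and 9.92 kg of CO2 per km), which works out to roughly 20.9 million kg. The per-team distances at the end of this sketch are hypothetical placeholders, only there to show that the same conversion applies to a vector.

```r
total_km   <- 2108806   # aggregate distance quoted above, in km
co2_per_km <- 9.92      # kg of CO2 per km quoted above for an A319

# League-wide emissions in kg and in metric tonnes
total_co2_kg     <- total_km * co2_per_km
total_co2_tonnes <- total_co2_kg / 1000
print(total_co2_kg)
print(total_co2_tonnes)

# The same conversion applied to per-team distances.
# HYPOTHETICAL values, not taken from the analysis.
team_km     <- c(team_a = 60000, team_b = 85000)
team_co2_kg <- team_km * co2_per_km
print(team_co2_kg)
```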

Not surprisingly, Oregon and California-based teams travel and pollute the most, since the NBA is mid-east / east coast heavy in its distribution of teams. It is somewhat ironic that the hipster / recycle-crazy / eco-friendly citizens of Portland also host the most polluting NBA team 🙂
What is also interesting is to plot the trail of flights (or pollution) left by the NBA throughout the season.


Great circle maps of all airplane flights completed by NBA teams during the 2013-2014 regular season.

I’ve been thinking about designing an algorithm that finds the NBA season schedule with minimal carbon footprint, which is essentially an optimization problem. The only issue is that there is a huge number of constraints to consider, such as Christmas Day games, first day of season games etc… More on that later.
As usual, all the relevant code for this analysis can be found on my GitHub account.

On the trade history and dynamics of NBA teams

While good draft picks and deft management can help you win championships, there is no doubt that NBA teams can massively gain, or lose, by trading players with one another. Here, I played around with some publicly available data given at, and had a look at the numbers behind all trades undertaken in the NBA from 1948 to present.

After some quick Python scraping and data cleaning, I first looked at the overall number of trades that were performed in the NBA during the period 1948-present.


Total number of trades completed by all NBA teams active between the 1948 and 2014 seasons.

Clearly, we see that the number of trades grows as we move along the years, which can probably be attributed to many factors such as the increasing ease of travelling/mobility and the growing number of teams in the NBA (of course, the 2014 season is still ongoing, so not all the numbers are in yet!). Next, I set out to look at whether any NBA teams showed preferential attachment with one another, i.e. do any NBA teams show a preference towards trading with one another rather than with other teams? This can easily be summarized by constructing an adjacency matrix M of dimension N x N (where N is the number of NBA teams), in which each cell M(i, j) gives the number of trades completed between teams i and j. For simplicity, I restricted the analysis to teams that are currently active.


While the plot above is pretty(ish), it is not very informative. The data is a lot more instructive if visualized as a mixed correlation plot (use the corrplot package!… we’ll also conveniently ignore the obvious caveat that I have not normalized the data for how long each team has been in the NBA….)


Adjacency matrix of number of trades completed between all pairs of currently active NBA teams. This includes all historical trade data from 1948 to present.
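
For readers who want to build such a matrix themselves, here is a minimal sketch. It assumes the scraped data has been reduced to one row per trade with the two teams involved; the column names `team_a` and `team_b` and the team labels are hypothetical, and multi-team trades are not handled.

```r
# HYPOTHETICAL trade list: one row per trade, two teams per trade.
trades <- data.frame(
  team_a = c("A", "B", "A", "C"),
  team_b = c("B", "C", "B", "A"),
  stringsAsFactors = FALSE
)

# Square matrix with one row and one column per team, initialised to zero
teams <- sort(unique(c(trades$team_a, trades$team_b)))
M <- matrix(0L, length(teams), length(teams), dimnames = list(teams, teams))

# Each trade increments both M[i, j] and M[j, i], so M stays symmetric
for (k in seq_len(nrow(trades))) {
  i <- trades$team_a[k]
  j <- trades$team_b[k]
  M[i, j] <- M[i, j] + 1L
  M[j, i] <- M[j, i] + 1L
}
print(M)
```

Dividing each row by the number of seasons the team has been in the league would be one way to address the normalization caveat mentioned above.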

We can take this a step further and ask ourselves which teams have had the most success with trades, and also what are the best individual trades ever completed in the history of the NBA? For this, I collected the win share data associated with each trade. The win share (WS) metric is an estimate of the number of wins produced by a player, and is a good way to determine how many victories an NBA player contributed to his team during his tenure there (more details can be found here). By computing the differential win share per trade (WS gained in trade - WS lost in trade), it is possible to gain some insight into the quality of each trade.


Distribution of win shares gained or lost by each team in the NBA. This includes all historical trade data from 1948 to present.
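
The differential win share is simple to compute once each trade has a "gained" and a "lost" total per team. The sketch below uses made-up numbers and hypothetical column names (`ws_gained`, `ws_lost`), so it only illustrates the bookkeeping, not the actual results.

```r
# HYPOTHETICAL data: one row per (team, trade) with win shares gained and lost.
trades <- data.frame(
  team      = c("A", "A", "B", "B", "C"),
  ws_gained = c(30.5, 2.0, 10.0, 0.5, 45.0),
  ws_lost   = c(12.0, 8.5, 10.0, 20.5, 5.0),
  stringsAsFactors = FALSE
)

# Differential win share per trade: positive means the team came out ahead
trades$ws_diff <- trades$ws_gained - trades$ws_lost

# Mean differential per team, and the single best trade in the table
mean_diff  <- tapply(trades$ws_diff, trades$team, mean)
best_trade <- trades[which.max(trades$ws_diff), ]
print(mean_diff)
print(best_trade)
```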

In the plot above, I marked inactive teams with a hyphen. We can see that the three currently active teams with the highest mean win share per trade are the LA Lakers, Dallas Mavericks and LA Clippers. In terms of win shares, the three greatest trades ever completed were:

  • June 9, 1980: the Boston Celtics traded a 1980 1st round draft pick (used to select Rickey Brown) and a 1980 1st round draft pick (used to select Joe Barry Carroll) to the Golden State Warriors for Robert Parish and a 1980 1st round draft pick (used to select Kevin McHale).
  • June 24, 1998: the Dallas Mavericks traded Robert ‘Tractor’ Traylor to the Milwaukee Bucks for Pat Garrity and Dirk Nowitzki (!)
  • July 11, 1996: the Charlotte Hornets traded Kobe Bryant to the Los Angeles Lakers for Vlade Divac.

Ultimately, I would like to create an interactive version of the plot above, so that the details of each trade appear when the mouse hovers over any given point. This is currently a work in progress that I will eventually publish here. All the relevant code for this analysis can be found on my GitHub account.

And the most loyal fans in the NBA are…

NBA basketball is one of the sports I enjoy watching the most. As I was ordering my (undisclosed amount)th beer while watching a game during after-work hours, it occurred to me how often I had seen sparsely populated arenas during games, with large areas of seats going unoccupied. This got me thinking about the average fan attendance for NBA teams, what factors could be influencing attendance, and ultimately, which NBA team had the most loyal fans?

After some online browsing, Python scraping and data cleansing, I was able to obtain a good amount of data from the awesome guys at Unfortunately, I could not find any records of fan attendance before 1981, so this analysis will be restricted to the period from 1981 to 2013 (with records for 2002-2006 also missing). First, I wanted to see if there were any trends in NBA fan attendance per season.


Fan attendance for all NBA teams during the seasons 1981 to 2013. Years marked with a red asterisk represent seasons shortened by a lockout. Data for the years 2002 to 2006 were not available.

The two most striking features of the plot above are the obvious increase in fan attendance from 1981 to 1995, and the stagnation thereafter. This makes sense, since this period is widely regarded as the golden era and renaissance of basketball, full of rivalries and Hall of Fame players in their prime. Unsurprisingly, the 1999 and 2012 seasons, which were both shortened by ~4 months due to a lockout, saw a drop in total attendance (purely as a result of fewer games being played; if I were more rigorous, I would normalize for this and also for the overall US population, but I wanted to visualize the raw numbers).

Next, I investigated whether team success (the net number of wins per season) during a season could be an indicator of fan attendance. Not surprisingly, teams that won more also attracted more fans (doh!). This was true regardless of the conference in which the team was (East or West).


Fan attendance as a function of number of wins for all NBA teams during the period of 1981-2013

I also looked at whether fans were more attracted by teams that scored a lot, or by teams that put an emphasis on defense. However, I had to consider historical trends in scoring, and adjust for the fact that defenses/offenses have gotten more sophisticated over time. Therefore, I decided to look at the fan attendance numbers of each NBA team during a given season, and plot them as a function of the team’s deviation from the median number of points scored by all teams during that season. The plot below shows the aggregate of all points after considering each individual season between 1981 and 2013. Interestingly, although teams that score more attract more fans, it seems that good defense is even more likely to attract crowds.


Fan attendance as a function of the number of points scored for and against the home team. To adjust for the variability in offensive/defensive points scored at each season, the attendance numbers are plotted against the home team’s deviation from the season average.
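
The per-season adjustment described above can be sketched in a couple of lines with `ave()`, which applies a function within groups and returns a vector aligned with the original rows. The data frame below is hypothetical (made-up teams, seasons and point averages), and the median is used as the season baseline.

```r
# HYPOTHETICAL data: points per game for and against, by team and season.
seasons <- data.frame(
  season      = c(1990, 1990, 1990, 2010, 2010, 2010),
  team        = c("A", "B", "C", "A", "B", "C"),
  pts_for     = c(110, 105, 100, 98, 101, 95),
  pts_against = c(108, 104, 102, 99, 96, 100),
  stringsAsFactors = FALSE
)

# Deviation of each team from the median of its own season:
# ave() computes the per-season median and repeats it for every row of that season
seasons$off_dev <- seasons$pts_for -
  ave(seasons$pts_for, seasons$season, FUN = median)
seasons$def_dev <- seasons$pts_against -
  ave(seasons$pts_against, seasons$season, FUN = median)
print(seasons)
```

Because the baseline is recomputed within each season, a high-scoring team from a slow-paced era and one from a fast-paced era end up on the same scale.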

Of course, the caveat of the above plot is that teams that score a lot and/or defend well are more likely to win, and thus attract more fans. Indeed, winning teams usually develop bandwagon fans and thus inflate their attendance numbers. Therefore, I sought to find out who were the most loyal fans in the NBA. In my mind, the mark of a truly loyal fanbase is one that shows up to support its team regardless of win/loss ratio. For these reasons, I plotted the fan attendance of each NBA team normalized per number of wins.
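
For concreteness, normalizing attendance by wins could look like the sketch below. The `records` data frame and its numbers are hypothetical, and averaging the per-season ratio over seasons is just one of several reasonable ways to summarize it.

```r
# HYPOTHETICAL data: season attendance and wins for two made-up teams.
records <- data.frame(
  team       = c("A", "A", "B", "B"),
  attendance = c(600000, 700000, 500000, 550000),
  wins       = c(30, 50, 20, 25),
  stringsAsFactors = FALSE
)

# Attendance per win for each team-season
# (a winless season would need special handling to avoid dividing by zero)
records$att_per_win <- records$attendance / records$wins

# Average over seasons and rank teams from most to least "loyal"
loyalty <- sort(tapply(records$att_per_win, records$team, mean), decreasing = TRUE)
print(loyalty)
```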


And so the most loyal fanbases are the good people of Memphis, Minnesota and Toronto!

I will add all the relevant code to my GitHub account soon (basically as soon as I’ve commented it!)