Recently, I came across this interesting blog post http://blog.revolutionanalytics.com/2013/12/k-means-clustering-86-single-malt-scotch-whiskies.html by Luba Gloukhov on the Revolutions blog. The post initially caught my attention because of the originality of the dataset: 86 Scottish whiskeys scored on a scale of 0-4 across 12 different taste profiles (source data is here). Now I know what I like, and I like my whiskey, so I liked what I saw.
For these reasons, I set out to analyze the data a little bit more. Since Luba had already addressed most of what could be done in terms of clustering analysis, I mostly restricted myself to visualizing the data. First, I plotted a straightforward infographic of the data (the full version with all 86 whiskeys is available here Whiskey_taste_profile_infographic):
I then geotagged each distillery on a map of Scotland and color-coded it according to the score given for each taste profile. The most distinctive pattern is the clear separation between the land-based whiskeys and those from the Isles of Argyll.
Geo-tagging of Scottish whiskeys in 12 different taste profiles
We can then look at how geographical position (i.e. longitude and latitude) correlates with each taste profile.
Correlation between geographical position (longitude and latitude) of whiskeys and their taste profile score.
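For readers who want to reproduce this kind of figure, here is a minimal sketch of the underlying computation. It assumes a plain Pearson correlation between each coordinate and each taste score, which may differ from what was used for the plot above. The rows in `toy_whiskeys` are made-up numbers, and the taste names and the function `geo_taste_correlations` are illustrative, not taken from the source data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Made-up rows for illustration only: (latitude, longitude, {taste: score 0-4}).
toy_whiskeys = [
    (55.6, -6.2, {"Smoky": 4, "Sweetness": 1}),
    (57.5, -3.2, {"Smoky": 1, "Sweetness": 3}),
    (57.3, -3.4, {"Smoky": 0, "Sweetness": 3}),
    (56.4, -5.5, {"Smoky": 2, "Sweetness": 2}),
]


def geo_taste_correlations(rows):
    """Correlate latitude and longitude with every taste score."""
    lats = [row[0] for row in rows]
    lons = [row[1] for row in rows]
    result = {}
    for taste in rows[0][2]:
        scores = [row[2][taste] for row in rows]
        result[taste] = {
            "latitude": pearson(lats, scores),
            "longitude": pearson(lons, scores),
        }
    return result


correlations = geo_taste_correlations(toy_whiskeys)
for taste, values in correlations.items():
    print(taste, values)
```

With the real dataset, each row would carry all 12 taste scores instead of the two used here.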
Better yet, we can look at similarities between different whiskeys and how these are affected by geographical location.
Correlation matrix of Scottish whiskeys constructed on the basis of their similarities in 12 different taste profiles
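A similarity matrix of this kind can be sketched as follows, assuming that the similarity between two whiskeys is the Pearson correlation of their taste-score vectors (the measure behind the figure above may differ). The three whiskeys and their scores are invented for illustration, and `similarity_matrix` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented taste-score vectors (shortened to 5 scores; the real data has 12).
toy_profiles = {
    "Whiskey A": [4, 1, 4, 3, 0],
    "Whiskey B": [3, 1, 4, 4, 0],
    "Whiskey C": [1, 3, 0, 0, 2],
}


def similarity_matrix(profiles):
    """Return the whiskey names and the matrix of pairwise correlations."""
    names = list(profiles)
    matrix = [
        [pearson(profiles[a], profiles[b]) for b in names]
        for a in names
    ]
    return names, matrix


names, matrix = similarity_matrix(toy_profiles)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

Sorting the rows and columns by geographical region before plotting is one way to see whether neighboring distilleries produce similar profiles.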
In this post, I will compare the performance of R and Python when reading data in JSON format. More specifically, I will conduct an extremely simple analysis of the famous Yelp Houston-based user ratings file (~216 MB), which consists of reading the data and plotting a histogram of the ratings given by users. I tried to keep the workload in both scripts as similar as possible, so that I can establish which language is quicker.
# import required packages
# (assumption: fromJSON comes from the rjson package; the original import was not shown)
library(rjson)
# define function read_json
read_json <- function() {
  # read json file, one JSON record per line
  json.file <- sprintf("%s/data/yelp_academic_dataset_review.json", getwd())
  raw.json <- scan(json.file, what="raw()", sep="\n")
  # parse each line of JSON text into an R list
  json.data <- lapply(raw.json, function(x) fromJSON(x))
  # extract user rating information
  user.rating <- unlist(lapply(json.data, function(x) x$stars))
  # not shown
}
# compute total time needed
elapsed <- system.time(read_json())
elapsed
   user  system elapsed
 32.295   0.509  38.172
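# import modules
import json
import os
import time
# start process time
# (time.clock() was removed in Python 3.8; time.process_time() is its closest replacement)
start = time.process_time()
# read in yelp data, one JSON record per line
yelp_files = "%s/data/yelp_academic_dataset_review.json" % os.getcwd()
yelp_data = []
with open(yelp_files) as f:
    for line in f:
        yelp_data.append(json.loads(line))
# extract user rating information
user_rating = []
for item in yelp_data:
    user_rating.append(item['stars'])
# compute total time needed
elapsed = (time.process_time() - start)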
As expected, Python was significantly faster than R (12.5s vs. 38.2s) when reading this JSON file. In fact, experience tells me that this will be the case for almost any file format… 🙂
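Both scripts keep every parsed review in memory before extracting the ratings. As a side note, here is a minimal sketch of a streaming variant that tallies ratings line by line. It assumes one JSON object per line with a `stars` field, as the scripts above do. The function name `count_ratings` and the sample records are made up for illustration, and I have not timed this version against the numbers reported above.

```python
import json
from collections import Counter
from io import StringIO


def count_ratings(lines):
    """Tally the 'stars' field of each JSON record, one record per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)["stars"]] += 1
    return counts


# Made-up sample standing in for the real review file.
sample = StringIO(
    '{"stars": 5, "text": "great"}\n'
    '{"stars": 3, "text": "ok"}\n'
    '{"stars": 5, "text": "great again"}\n'
)

rating_counts = count_ratings(sample)

# Crude text histogram of the ratings.
for stars in sorted(rating_counts):
    print(stars, "#" * rating_counts[stars])
```

To run it on the real file, pass an open file handle instead of `sample`, since iterating over a file yields one line at a time.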
I was recently made aware of this great article by Alexis C. Madrigal, senior editor at The Atlantic. The underlying analysis he performs is relatively straightforward; what impresses me more is that he actually thought of doing it…and went through with it! It is a relatively long read for an article, but worth it in my opinion.
I am currently doing something quite similar with MTA traffic data released by New York City, and hope to show the results here soon.
Great news for cancer research. Today, it was announced that a posthumous donation by Daniel K. Ludwig totaling $540 million would be distributed across six different research centers. The six institutions that will benefit from this donation include:
- Harvard Medical School
- Massachusetts Institute of Technology (MIT)
- Memorial Sloan-Kettering Cancer Center in New York City
- Johns Hopkins University Medical School
- Stanford University Medical School
- University of Chicago Medical School
This is a timely announcement, as funding in the biomedical field has rarely been so volatile. Furthermore, it continues the growing trend of billionaire philanthropists donating large sums to various institutions (Sloan Kettering and Weill Cornell Medical College in NYC come to mind).
I will be curious to see whether these centers will try to cooperate with one another in order to tackle the problem from different and complementary angles, or whether they will go into an all-out war on the topic of “who is the quickest to publish a wildly incomplete study in Nature”. No doubt though that Harvard Medical School and MIT will most likely be working hand-in-hand on this problem, which, in my completely unimportant opinion, is a great thing. All in all, as a researcher who has dabbled in the field of cancer research, I think this is great news.
More details on this announcement can be found here: http://www.bostonglobe.com/news/nation/2014/01/06/harvard-mit-cancer-research-centers-receive-grants/LMcFhSyx6Ao8m7cdMWZIjM/story.html
A huge area of interest in statistics and machine learning is that of graph recovery, both directed and undirected. The applications of network recovery tools are practically limitless, ranging from social media to genomics and sports analytics. I have read about lots of different methods to recover causal relationships among a group of variables, and the complexity of the underlying algorithms ranges from simple correlation measures to more sophisticated concepts such as random forests or variational Bayes.
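To make the simplest end of that range concrete, here is a minimal sketch of correlation-based recovery on synthetic data: an edge is drawn between two variables whenever the absolute Pearson correlation exceeds a threshold. The threshold of 0.5, the function `recover_edges`, and the four synthetic variables are all assumptions chosen for illustration. Note that this only recovers undirected associations; it says nothing about causal direction.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def recover_edges(columns, threshold=0.5):
    """Return undirected edges (i, j), i < j, whose |correlation| >= threshold."""
    edges = set()
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                edges.add((i, j))
    return edges


# Synthetic data with a known structure: x1 depends on x0, x3 depends on x2,
# and the two pairs are independent of each other.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 500
x0 = [rng.gauss(0, 1) for _ in range(n_samples)]
x1 = [value + rng.gauss(0, 0.5) for value in x0]
x2 = [rng.gauss(0, 1) for _ in range(n_samples)]
x3 = [value + rng.gauss(0, 0.5) for value in x2]

true_edges = {(0, 1), (2, 3)}
recovered = recover_edges([x0, x1, x2, x3])
print("recovered:", sorted(recovered))
print("matches ground truth:", recovered == true_edges)
```

On a chain such as x0 → x1 → x2, the same approach would also link x0 and x2, which is one reason the more sophisticated methods exist.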
In my opinion, one of the major gaps in the field of network recovery is the lack of methods capable of inferring causal networks from high-dimensional data. With the advent of the Big Data trend, I believe that this lack of available software will become more and more glaring.
The usual pipeline for researchers who aim to develop new methods for network recovery involves assessing them first against synthetic data and subsequently on real-life data. For this reason, I am posting here a few publicly available databases that provide a wide breadth of interesting datasets to analyze.
Stanford Network Analysis Platform (SNAP)
NYC open data