Predicting the Number of “Useful” Votes a Yelp Review Will Receive

A few months ago I wrote about creating a submission for the Digit Recognizer tutorial “competition” on Kaggle that could correctly recognize 95% of handwritten digits.

Enthused by this, I decided to participate in a real competition on Kaggle, and picked the Yelp Recruiting Competition which challenges data scientists to predict the number of “useful” votes a review on Yelp will receive. Good reviews on Yelp accumulate lots of Useful, Funny and Cool votes over time. “What if we didn’t have to wait for the community to vote on the best reviews to know which ones are high quality?”, was the question posed by Yelp as the motivation for the challenge.

I am pleased to report that I ended the competition in the top one-third of the leaderboard (110 out of 352). Although the final result was decent, there were many stumbling blocks along the way.

Data

The training data consisted of ~230,000 reviews along with data on users, businesses and checkins. The data was in JSON format, so the first step was to convert it to a tab-delimited format using a simple Python script so that it could be easily loaded into R.

Visualization

Next, I tried to understand the data by visualizing it. Here is the distribution of the number of useful votes:

[Histogram: distribution of useful votes per review]

Evaluation

Because Kaggle allows only two submissions per day, I created a function to evaluate predictions before submission by replicating the metric Kaggle uses to score entries, the Root Mean Squared Logarithmic Error (“RMSLE”):

ϵ = √( (1/n) · Σᵢ₌₁ⁿ ( log(pᵢ + 1) − log(aᵢ + 1) )² )

where:

  • ϵ is the RMSLE value (score)
  • n is the total number of reviews in the data set
  • pᵢ is the predicted number of useful votes for review i
  • aᵢ is the actual number of useful votes for review i
  • log(x) is the natural logarithm of x
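
A minimal sketch of such a scorer in R (the original function may have differed in detail):

```r
# RMSLE: root mean squared logarithmic error between predicted and actual votes
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}

# Example: rmsle(c(0, 2, 5), c(0, 3, 4))
```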

Refining the Model

Next, I split the data into training and validation sets in a 70:30 ratio and fit a linear regression using just two independent variables: the star rating and the length of the review. This model resulted in an error of ~0.67 on the test data, i.e. after submission.
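
A rough sketch of this baseline in R, using the rmsle helper above; the data frame reviews and its columns useful, stars and text are hypothetical names, not the actual code:

```r
set.seed(123)
reviews$review_length <- nchar(reviews$text)            # review length in characters

# 70:30 split into training and validation sets
train_idx <- sample(seq_len(nrow(reviews)), size = 0.7 * nrow(reviews))
train <- reviews[train_idx, ]
valid <- reviews[-train_idx, ]

# Baseline: linear regression on star rating and review length
baseline <- lm(useful ~ stars + review_length, data = train)
pred <- pmax(predict(baseline, newdata = valid), 0)      # vote counts can't be negative
rmsle(pred, valid$useful)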

Next, I hypothesized that good reviews are written by good reviewers, so for each review I calculated the average number of useful votes its author received across all the other reviews he/she wrote. Including this variable reduced the error dramatically, to ~0.55.
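
One way to compute that feature is a leave-one-out average per user; this is a hypothetical sketch (user_id is an assumed column name):

```r
# Leave-one-out average of useful votes per user, so each review's own
# vote count is excluded from its own feature
totals <- aggregate(useful ~ user_id, data = reviews, FUN = sum)
counts <- aggregate(useful ~ user_id, data = reviews, FUN = length)
names(totals)[2] <- "user_total"
names(counts)[2] <- "user_count"

reviews <- merge(merge(reviews, totals, by = "user_id"), counts, by = "user_id")
reviews$user_avg_useful <- with(reviews,
  ifelse(user_count > 1, (user_total - useful) / (user_count - 1), 0))
```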

Next, I incorporated more user data: the number of reviews written by the user, the number of funny/useful/cool votes given and the average star rating. None of these variables proved predictive of the number of useful votes with linear regression, so I tried random forests, but to no avail.
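
A sketch of the random forest attempt; the user_* columns are hypothetical names for the extra user-level features, and train/valid are re-created from the augmented reviews data frame as before:

```r
library(randomForest)

set.seed(123)
rf <- randomForest(useful ~ stars + review_length + user_avg_useful +
                     user_review_count + user_funny + user_useful + user_cool +
                     user_avg_stars,
                   data = train, ntree = 200)
rmsle(pmax(predict(rf, newdata = valid), 0), valid$useful)
```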

Next, I incorporated business data to see if the type of business, the star rating or number of reviews received would increase the predictive power of the model. But again, these failed to reduce the error.

Next, I incorporated checkin data to see if the number of checkins would improve the model. Again, this failed to reduce the error.

Having exhausted all the easy options, I turned to text mining to analyze the actual content of the reviews. I split the reviews into two categories: ham (good reviews with more than five useful votes) and spam (bad reviews with five useful votes or fewer). For each category, I created a “term-document matrix”, i.e. a matrix with terms as columns, documents (review text) as rows and cells containing the frequency of each term in each document. I then made a list of the most frequent terms in each category that were distinct, i.e. that appeared in only one category or the other. To the model I added a frequency variable for each of these words, plus the frequencies of the exclamation mark (!) and the comma (,); a sketch of this step appears after the list below. The final list of words for which I created frequency variables was:

  • , (comma)
  • !
  • nice
  • little
  • time
  • chicken
  • good
  • people
  • pretty
  • you
  • service
  • wait
  • cheese
  • day
  • hot
  • night
  • salad
  • sauce
  • table

The frequency variables improved the predictive power of the model significantly and resulted in an error of ~0.52.
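
A rough sketch of how these features could be built with the tm package in R; the frequency cutoff, helper names and preprocessing choices here are assumptions, not the original code:

```r
library(tm)

ham  <- reviews$text[reviews$useful > 5]    # "ham": more than five useful votes
spam <- reviews$text[reviews$useful <= 5]   # "spam": five or fewer

make_tdm <- function(texts) {
  corpus <- VCorpus(VectorSource(texts))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  TermDocumentMatrix(corpus)
}

# Most frequent terms in each category, keeping only those unique to one category
ham_terms  <- findFreqTerms(make_tdm(ham),  lowfreq = 2000)   # hypothetical cutoff
spam_terms <- findFreqTerms(make_tdm(spam), lowfreq = 2000)
distinct_terms <- union(setdiff(ham_terms, spam_terms),
                        setdiff(spam_terms, ham_terms))

# One frequency variable per distinct word (counted as substring matches),
# plus the exclamation mark and the comma
count_matches <- function(pattern, texts) {
  sapply(gregexpr(pattern, texts, fixed = TRUE), function(m) sum(m > 0))
}
for (w in distinct_terms) {
  reviews[[paste0("freq_", w)]] <- count_matches(w, tolower(reviews$text))
}
reviews$freq_exclaim <- count_matches("!", reviews$text)
reviews$freq_comma   <- count_matches(",", reviews$text)
```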

Visualization of Final Model

Here is a heatmap of predicted (x-axis) vs actual (y-axis) useful votes:

[Heatmap: predicted vs. actual useful votes]

For lower numbers of useful votes (up to ~8) there is a relatively straight diagonal line, indicating that by and large the predicted and actual values coincide. Beyond this, the model starts to falter and there is a fair amount of scatter.
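
The figure itself is a screenshot, but a similar heatmap could be drawn with ggplot2 along these lines (final_model stands in for whatever the final fitted model was):

```r
library(ggplot2)

valid$predicted <- pmax(predict(final_model, newdata = valid), 0)
ggplot(valid, aes(x = predicted, y = useful)) +
  geom_bin2d(binwidth = c(1, 1)) +
  labs(x = "Predicted useful votes", y = "Actual useful votes")
```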

Improvements

I couldn’t find the time to improve the model further, but I am fairly confident that additional text mining approaches, such as stemming and other natural language processing techniques, would do so.

Big Data(bases): Making Sense of OldSQL vs NoSQL vs NewSQL

A few months ago, I had the great pleasure of meeting and discussing big data with Michael Stonebraker, a legendary computer scientist at MIT who specializes in database systems and is considered to be the forefather of big data. Stonebraker developed INGRES, which helped pioneer the use of relational databases, and has formed nine companies related to database technologies.

Until recently, the choice of a database architecture was largely a non-issue. Relational databases were the de facto standard and the main choices were Oracle, SQL Server or an open source database like MySQL. But with the advent of big data, scalability and performance issues with relational databases became commonplace. For online processing, NoSQL databases have emerged as a solution to these problems. NoSQL is a catch-all for different kinds of database architectures — key-value stores, document databases, column family databases and graph databases. Each has its own relative advantages and disadvantages. However, in order to get scalability and performance, NoSQL databases give up “queryability” (i.e. the ability to use SQL) and ACID transactions.

More recently a new type of database has emerged that offers high performance and scalability without giving up SQL and ACID transactions. This class of database is called NewSQL, a term coined by Stonebraker. He provides an excellent overview of OldSQL vs NoSQL vs NewSQL in this video.

Some key points from the video:

  • SQL is good.
  • Traditional databases are slow not because SQL is slow, but because of their architecture and the fact that they run code that is 30 years old.
  • NewSQL provides performance and scalability while preserving SQL and ACID transactions by using a new architecture that drastically reduces overhead.

In the video, Stonebraker talks about VoltDB, an open source NewSQL database that comes from a company of the same name founded by him. Some of the performance figures of VoltDB are pretty amazing:

  • 3 million transactions per second on a “couple hundred cores”
  • 45x the performance of “a SQL vendor whose name has more than three letters and less than nine”
  • 5-6 times faster than Cassandra and same speed as Memcached on key-value operations

VoltDB sounds like an extremely compelling alternative to NoSQL databases, and certainly warrants a look if you want to move from a traditional “OldSQL” database to one that is highly scalable and performant without losing SQL and ACID.

Amateur Data Scientist?: How I Built a Handwritten Digit Recognizer with 95% Accuracy

Almost two years ago, I wrote a post entitled Stats are Sexy, in which I mentioned the emerging discipline of data science. Soon after, I discovered the amazing platform Kaggle, which lets companies host Netflix Prize-style competitions where data scientists from all over the world compete to come up with the best predictive models. Fascinated, I really wanted to learn machine learning by competing in the “easy” Digit Recognizer competition, which requires taking an image of a handwritten digit and determining what that digit is, but I struggled to gather the know-how in the limited free time that I had. Instead, I quenched my desire to do innovative work with data by building data visualization showcases as a Technical Evangelist for Infragistics, my employer at the time: Population Explosion and Flight Watcher.

Now, as an MBA student at the Massachusetts Institute of Technology, I am taking a class called The Analytics Edge, which I am convinced is one of the most important classes I will take during my time at business school. More on that later. Part of my excitement for taking the class was to learn R, the most widely-used tool (by a long margin) among data scientists competing on Kaggle.

After several lectures, I had some basic knowledge of how to identify and solve the three broad types of data mining problems – regression, classification and clustering. So, I decided to see if I could apply what I had learnt so far by revisiting the Digit Recognizer competition on Kaggle, and signed up as a competitor.

I recognized this as a classification problem – given input data (an image), the problem is to determine which class (a number from 0 to 9) it belongs to. I decided to try CART (Classification and Regression Trees). I used 70% of the data (which contains 42,000 handwritten digits along with labels that identify what numbers the digits actually are) to build a predictive model, and 30% to test its accuracy. The CART model was only about 62% accurate in recognizing digits, so I tried random forests, which to my surprise turned out to be ~90% accurate! I downloaded the “real” test set, which contains 28,000 handwritten digits, ran it through the model and created a file that predicted what each of the digits actually was.
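
Reconstructing from the description (the full code is linked further down), a sketch of the approach in R, assuming Kaggle’s train.csv layout of a label column plus 784 pixel columns:

```r
library(rpart)
library(randomForest)

digits <- read.csv("train.csv")
digits$label <- as.factor(digits$label)     # classification, not regression

set.seed(123)
train_idx <- sample(seq_len(nrow(digits)), size = 0.7 * nrow(digits))
train <- digits[train_idx, ]
test  <- digits[-train_idx, ]

# CART
cart <- rpart(label ~ ., data = train, method = "class")
mean(predict(cart, newdata = test, type = "class") == test$label)   # ~62% in the post

# Random forest; raising ntree later (e.g. to 500) is what pushed accuracy higher
rf <- randomForest(label ~ ., data = train, ntree = 100)
mean(predict(rf, newdata = test) == test$label)                     # ~90% in the post
```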

I uploaded my prediction file and to my surprise the accuracy turned out to be 93%. I increased the number of trees in the random forest to see if I could do better, and it indeed worked, bumping the accuracy up to 95% and moving me up 43 positions on the leaderboard:

[Screenshot: Digit Recognizer submission results]

Here is the code in its entirety: DigitRecognizer.r

I was amazed that I was able to build something like this in a couple of hours and in under 30 lines of code (including sanity checks). (Of course, I didn’t have to clean and normalize the data, which can be painful and time-consuming.) Next up: a somewhat ambitious project to recognize gestures made by moving smartphones around in the air. Updates to follow in a future post.

What’s really exciting from a business standpoint is that predictive analytics can be applied in a large number of business scenarios to gain actionable insights or to create economic value. Have a look at the competitions on Kaggle to get an idea.

It has been clear for some time that companies can obtain a significant competitive advantage through data analytics, and this is not limited to specific industries. A few excerpts from the MIT Sloan Management Review 2013 Data & Analytics Global Executive Study and Research Project, published only a few days ago, hint at the scale of “The Analytics Revolution”:

How organizations capture, create and use data is changing the way we work and live. This big idea, which is gaining currency among executives, academics and business analysts, reflects a growing belief that we are on the cusp of an analytics revolution that may well transform how organizations are managed, and also transform the economies and societies in which they operate.

Fully 67% of survey respondents report that their companies are gaining a competitive edge from their use of analytics. Among this group, we identified a set of companies that are relying on analytics both to gain a competitive advantage and to innovate. These Analytical Innovators constitute leaders of the analytics revolution. They exist across industries, vary in size and employ a variety of business models. 

If I was enthused about the power of analytics before, I am even more convinced now, which is why I consider classes like The Analytics Edge, which teach hard analytics skills, to be extremely valuable for managers in the current global business environment.

Will you be an Analytics Innovator?

Map-Based Visualization Revisited

In January, I created an animated visualization showing the movement of domestic flights of Indian carrier Jet Airways over the course of a day:

[Visualization: Jet Airways domestic flights over the course of a day]

The time of day is depicted using a slider that moves horizontally across the visualization over time:

[Visualization: time-of-day slider]

The motivation behind this visualization, besides trying to build something cool, was to showcase the capabilities of my former employer’s Geographic Map API and encourage developers to build map-based visualizations.

I am thrilled to see that a similar approach is taken by Google’s visualization of flights to and from London, which was released earlier this month:

[Google visualization: flights to and from London]

The time of day in this case is depicted using a simple clock:

[Google visualization: clock showing the time of day]

This visualization is part of the More Than a Map website showcasing the capabilities of the Google Maps API – do check it out.

I’ve always been a big fan of Google Maps. I remember looking at the newly released Street View feature in 2007 and shaking my head in awe. And MapsGL released earlier this year produced a similar reaction.

What’s great for developers is that they can leverage all of the goodness of Google Maps via the API for free! What will you build?

Chroma Chrono: Using Color to Visualize Time

The other day, it occurred to me that one of the oldest and simplest visualizations is the analog clock–it is a visualization of time using angles on a circular scale.

So, I decided to experiment with using color to visualize time, and the result is Chroma Chrono:

Chroma Chrono: Using Color to Visualize Time

It comes in two flavors. This is Pulse:

[Screenshot: Pulse]

A series of concentric circles, going outwards from the centre, represent the hour, minute, second and millisecond. Each circle is colored on a scale that ranges from orange to black.

Depending on whether it is AM or PM, the innermost (hour) circle gets brighter or darker. It is orange at noon and, as the day progresses, it gets darker until it is black at midnight. From midnight to noon, it gets progressively brighter until it is orange again at noon and the cycle restarts.

Similarly, the minute and second circles go from orange to black once every hour and once every minute respectively in the AM, and in the reverse direction in the PM. The millisecond circle pulses from orange to black, or vice versa, once every second.

This mechanism is inspired by nature–the overall brightness of the clock is an approximation of the overall brightness of the sky.
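
The post doesn’t include code, but as a rough illustration of the mapping described above, here is how the hour circle’s color could be computed in R; the function and its details are purely hypothetical, not the actual implementation:

```r
# Map the time of day to a color between black and orange:
# brightest (orange) at noon, darkest (black) at midnight
hour_color <- function(time = Sys.time()) {
  secs <- as.numeric(format(time, "%H")) * 3600 +
          as.numeric(format(time, "%M")) * 60 +
          as.numeric(format(time, "%S"))
  brightness <- 1 - abs(secs - 43200) / 43200     # 0 at midnight, 1 at noon
  col <- colorRamp(c("black", "orange"))(brightness)
  rgb(col[1], col[2], col[3], maxColorValue = 255)
}

hour_color()   # e.g. "#FFA500" at exactly noon
```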

The Kinetic works on the same principle, except it has rings instead of circles and doesn’t show milliseconds:

[Screenshot: Kinetic (annotated)]

The main difference, however, is a guide inside each ring that makes it easier to tell the time. In each ring, the point where the color of the ring matches the color of the guide represents the time in exactly the same way as on a traditional analog clock. These points are highlighted with green arrows. If you look at the second ring closely, you will notice that it appears to be moving. This is an optical illusion caused by the fact that the point at which the ring and guide colors match is moving around the circumference of the clock.