Almost two years ago, I wrote a post entitled Stats are Sexy, in which I mentioned the emerging discipline of data science. Soon after, I discovered the amazing platform Kaggle, which lets companies host Netflix Prize -style competitions where data scientists from all over the world compete to come up with the best predictive models. Fascinated, I really wanted to learn machine learning by competing in the “easy” Digit Recognizer competition which requires taking an image of a handwritten digit, and determining what that digit is, but struggled to gather the know-how in the limited free time that I had. Instead, I quenched my desire to do innovative work with data by building data visualization showcases as a Technical Evangelist for Infragistics, my employer at the time: Population Explosion and Flight Watcher.
Now, as an MBA student at the Massachusetts Institute of Technology, I am taking a class called The Analytics Edge, which I am convinced is one of the most important classes I will take during my time at business school. More on that later. Part of my excitement for taking the class was to learn R, the most widely-used tool (by a long margin) among data scientists competing on Kaggle.
After several lectures, I had some basic knowledge of how to identify and solve the three broad types of data mining problems – regression, classification and clustering. So, I decided to see if I could apply what I had learnt so far by revisiting the Digit Recognizer competition on Kaggle, and signed up as a competitor.
I recognized this as a classification problem – given input data (an image), the problem is to determine which class (a number from 0 to 9) it belongs to. I decided to try CART (Classification and Regression Trees). I used 70% of the data (which contains 42,000 handwritten digits along with labels that identify what numbers the digits actually are) to build a predictive model, and 30% to test it’s accuracy. The CART model was only about 62% accurate in recognizing digits, so I tried Random Forests, which to my surprise turned out to be ~90% accurate! I downloaded the “real” test set which contains 28,000 handwritten digits, ran it through the model and created a file that predicted what each of the digits actually was.
I uploaded my prediction file and to my surprise it turned out that the accuracy was 93%. I increased the number of tree in the random forests to see if I could do better, and it indeed worked, bumping up the accuracy to 95% and moving me up 43 positions in the leaderboard:
Here is the code in it’s entirety: DigitRecognizer.r
I was amazed that I was able to build something like this in a couple of hours in under 30 lines of code (including sanity checks). (Of course, I didn’t have to clean and normalize the data, which can be painful and time-consuming.) Next up: a somewhat ambitious project to recognize gestures made by moving smartphones around in the air. Updates to follow in a future post.
What’s really exciting from a business standpoint is that predictive analytics can be applied in a large number of business scenarios to gain actionable insights or to create economic value. Have a look at the competitions on Kaggle to get an idea.
It has been clear for some time that companies can obtain a significant competitive advantage through data analytics, but this is not limited to specific industries. A few experts from the MIT Sloan Management Review 2013 Data & Analytics Global Executive Study and Research Project published only a few days ago hint at the scale of “The Analytics Revolution”:
How organizations capture, create and use data is changing the way we work and live. This big idea, which is gaining currency among executives, academics and business analysts, reflects a growing belief that we are on the cusp of an analytics revolution that may well transform how organizations are managed, and also transform the economies and societies in which they operate.
Fully 67% of survey respondents report that their companies are gaining a competitive edge from their use of analytics. Among this group, we identified a set of companies that are relying on analytics both to gain a competitive advantage and to innovate. These Analytical Innovators constitute leaders of the analytics revolution. They exist across industries, vary in size and employ a variety of business models.
If I was enthused about the power of analytics before, I am even more convinced now. Which is why I consider classes like the Analytics Edge that teach hard analytics skills to be extremely valuable for managers in the current global business environment.
Will you be an Analytics Innovator?