Wednesday, March 18, 2015

Data Science Done Well Looks Easy


Data science has a ton of different definitions. For the purposes of this post I'm going to use the definition of data science we used when creating our Data Science program online:

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience [..].

A good data science project answers a real scientific or business analytics question. In almost all of these projects, the vast majority of the analyst's time is spent on getting and cleaning the data (steps 2-3) and on communication and reproducibility (steps 6-7). In most cases, if the data scientist has done her job right, the statistical models don't need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer. One option is to spend a huge amount of time tuning a statistical model to try to answer the question, but serious data scientists usually go back and get the right data instead.

The result of this process is that most well-executed and successful data science projects don't (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I've evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data are the right data and whether there are any lurking confounders or biases, and (c) relatively simple statistical models applied and interpreted skeptically.

It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren't quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn't know about the data collection and cleaning process or doesn't care.
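To make that imbalance concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `RAW_ROWS` records, the dose/response framing, and the `clean` and `fit_line` helpers are hypothetical and not taken from any real project. It is only meant to suggest how the cleaning tends to take more code (and far more time) than the simple model at the end.

```python
from statistics import mean

# Hypothetical raw records: (dose, response) pairs as they might arrive
# from a messy export. All values are invented for this example.
RAW_ROWS = [
    ("1.0", "2.1"),
    ("2.0", "3.9"),
    ("2.0", "3.9"),      # exact duplicate
    ("3.0", "N/A"),      # missing response
    (" 4.0 ", "8.2"),    # stray whitespace
    ("five", "10.1"),    # unparseable dose
    ("6.0", "11.8"),
]


def clean(rows):
    """Parse numeric fields, dropping unparseable rows and exact duplicates."""
    seen = set()
    cleaned = []
    for raw_x, raw_y in rows:
        try:
            point = (float(raw_x.strip()), float(raw_y.strip()))
        except ValueError:
            continue  # skip rows with missing or malformed values
        if point in seen:
            continue  # skip rows we have already kept
        seen.add(point)
        cleaned.append(point)
    return cleaned


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


points = clean(RAW_ROWS)
intercept, slope = fit_line(points)
print(f"kept {len(points)} of {len(RAW_ROWS)} rows")
print(f"response ~ {intercept:.2f} + {slope:.2f} * dose")
```

Someone who only sees the last line of output sees a straight-line fit. They don't see the rows that had to be thrown out, or the decisions about what counts as a bad row, which in a real project is where most of the time goes.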

This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think "oh that is easy" or "that is just a trivial statistical/machine learning model" and don't see all of the work that goes into solving the real problems in data science.