Tuesday, November 13, 2012

The Data Science Loop



Ask a good question.

Answer the question while economizing on resources.

Communicate your results.

(Sometimes) Make recommendations to engineers or managers.

Asking a good question is probably the hardest thing to get right. If you neglect this step, you'll spend days of your life working on something that will have little impact. It's a skill that people who focus on technical training tend to be bad at [..].

The real art of asking good questions is to consider your audience. Who is going to be interested in the results, and why are they going to care? I find that the best questions have punchy answers, are usually interesting to everyone, and usually affect a potential decision. On the last point, the key is to think about how someone within your organization might change their strategy due to your answer.

Effectively answering questions is where technical skills become important. It's easy to get caught up in fancy algorithms and methods, but those approaches are usually premature optimizations. The best answers are 1) cheap and 2) easy to explain. Give me a table of counts or event rates over regression coefficients or the first eigenvector of your matrix decomposition. Perhaps it's a bit modest, but I often describe data science as "advanced applied counting." [..]
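To make "advanced applied counting" concrete, here is a minimal sketch of a table of counts and event rates. The event log, the group names, and the event_rates helper are all made up for illustration; the point is only that the whole answer fits in a few lines of counting.

```python
from collections import Counter

# Hypothetical event log: (group, converted) pairs, invented for illustration.
events = [
    ("email", True),
    ("email", False),
    ("email", False),
    ("search", True),
    ("search", True),
    ("search", False),
    ("social", False),
    ("social", False),
]


def event_rates(rows):
    """Return {group: (total, hits, rate)} from (group, flag) pairs."""
    totals = Counter(group for group, _ in rows)
    hits = Counter(group for group, flag in rows if flag)
    # A Counter returns 0 for groups that never had a hit.
    return {g: (totals[g], hits[g], hits[g] / totals[g]) for g in sorted(totals)}


table = event_rates(events)

# Print a plain table: one row per group with its count, hits, and rate.
print(f"{'group':<8}{'n':>4}{'hits':>6}{'rate':>7}")
for group, (n, k, rate) in table.items():
    print(f"{group:<8}{n:>4}{k:>6}{rate:>7.2f}")
```

A table like this is cheap to produce and easy to explain, which is the whole argument for preferring it over a fitted model when it answers the question.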

Fancy, new, and complicated are usually bad qualities for a method. Take it from Jay Kreps, "read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that."
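One common example of a "stupid simple thing" is a majority-class baseline: always predict the most frequent label. The sketch below is my own illustration of that idea, not something from the quote, and the labels and helper names (majority_baseline, accuracy) are hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _example: most_common_label


def accuracy(predict, examples, labels):
    """Fraction of examples for which predict(example) equals the label."""
    correct = sum(predict(x) == y for x, y in zip(examples, labels))
    return correct / len(labels)


# Made-up labels for illustration; the features are ignored by this baseline.
train_labels = ["stay", "stay", "churn", "stay"]
test_examples = [None, None, None, None]
test_labels = ["stay", "churn", "stay", "stay"]

predict = majority_baseline(train_labels)
baseline_acc = accuracy(predict, test_examples, test_labels)
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

If a fancier method can't clearly beat a number like this, it probably isn't worth the extra cost or the harder explanation.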

The other pattern I notice here is the unreasonable effectiveness of Polya's advice for solving a math problem, particularly this aphorism: "If you can't solve a problem, then there is an easier problem you can solve: find it." Paraphrased for data scientists, if there is a question you can't answer, there is an easier question you can answer (usually counting something!).

I firmly believe that data scientists should not be engineers or managers. Engineers build things, managers make decisions, data scientists answer questions. This is not to trivialize the role of data scientists, who plausibly account for 2/3 of the steps in the build-measure-learn loop. The answers can (and should) inform decisions that managers make and help engineers build better products, but answers always lead to more (and better!) questions.

Don't let the data science technical jargon drive your impression of what is actually done in the field. In my experience, it's a research job where you have autonomy to ask and answer some really interesting questions. The fundamental challenge is being savvy enough to pick good questions and find concise answers using minimal resources. Then you must convince everyone to listen to you about what you found. In many ways it's similar to academic research, but the differences are that the cycle is tighter and your answers will often effect changes in the business.