Thursday, August 22, 2013

What Hackers Should Know About Machine Learning


Data analysis as an exploratory endeavor should be the first part of anything. You should never go into a project and say “The thing that I want to do is classification so I'm always going to run my favorite classification algorithm.” For the first half of the book we talk about “Here's a dataset, here's how to clean it up.” The chapters that John Myles White wrote on means, medians, modes, and distributions cover the things that you should always do in the beginning. We want to hammer home that it's not just input-output. Input, look around, see what's going on, find structure in the data, then choose your methods. And then maybe iterate over a couple of them. It's very cyclic. It's not linear [..]
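
As a rough illustration of the "look around first" step described above, here is a minimal sketch using only Python's standard library. The `summarize` helper and the sample values are hypothetical and are not taken from the book; the point is only that a mean sitting far from the median is a cheap early hint of skew or an outlier, worth checking before choosing a method.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical sample containing one obvious outlier (40).
sample = [2, 3, 3, 4, 5, 3, 40]
summary = summarize(sample)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the median and mode agree while the mean is pulled upward by the single large value, which is the kind of structure you would want to notice before running any algorithm.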

My thinking has evolved on presenting results. The way I think about presenting results now is always in the browser as an interactive thing. There's a tremendous amount of value in giving the audience the ability to ask second-order questions about what they are observing rather than only first-order ones. Imagine the thing you are looking at is just a simple scatterplot and you see one outlier. A first-order question would be: who is that outlier? If you have an interactive display where you can hover over the dot and it tells you who that is, then the audience can move on to the second-order question: why is that an outlier?

Monday, August 5, 2013

10 Best Practices in Operational Analytics

A great set of slides on ensembles, feature engineering, and data preparation.

Friday, August 2, 2013

KMeans on Categorical and Mixed Data Types

Below is a link to an article on running KMeans on the Australian Credit dataset. A mixture of cosine distance and Euclidean distance was used: KMeans combines the two with customizable weights to find clusters. The Australian set has 2 classes, marked + and -, and our clusters matched those labels 82% of the time. The article is not written in English, but the code may still be helpful. Data is here.
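
The weighted combination of distances described above could be sketched as follows. This is not the code from the linked article. It assumes, purely for illustration, that numeric features are compared with Euclidean distance, that categorical features are one-hot encoded and compared with cosine distance, and that centroids are updated by a plain per-part mean (a heuristic once the distance is no longer purely Euclidean). All function names, the weights `w_num` and `w_cat`, and the toy points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def mixed_distance(p, q, w_num=0.5, w_cat=0.5):
    """p and q are (numeric_part, categorical_part) pairs."""
    return w_num * euclidean(p[0], q[0]) + w_cat * cosine_distance(p[1], q[1])


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans_mixed(points, centroids, iterations=10, w_num=0.5, w_cat=0.5):
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid under the mixed distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: mixed_distance(p, centroids[k], w_num, w_cat))
            for p in points
        ]
        # Update step: average each part over the cluster's members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if not members:
                updated.append(old)  # keep an empty cluster's centroid
                continue
            updated.append((mean_vector([m[0] for m in members]),
                            mean_vector([m[1] for m in members])))
        centroids = updated
    return labels, centroids


# Toy data: (scaled numeric features, one-hot categorical features).
points = [
    ([0.1, 0.2], [1, 0]),
    ([0.2, 0.1], [1, 0]),
    ([0.9, 0.8], [0, 1]),
    ([0.8, 0.9], [0, 1]),
]
labels, centroids = kmeans_mixed(points, [points[0], points[2]])
print(labels)
```

Raising `w_cat` relative to `w_num` makes the categorical features dominate the assignment, so the weights are worth tuning against whatever quality measure you use. Numeric features should be scaled to a comparable range first, since Euclidean distance is unbounded while cosine distance on one-hot vectors stays between 0 and 1.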