## K-Nearest Neighbors: dangerously simple

Yeah, Mathbabe’s got it right: People who use kNN often don’t think about these things.

For those who aren’t familiar with this technique, here’s a description from Zhi-Hua Zhou in *Ensemble Methods: Foundations and Algorithms* (section 1.2.5):

“The $k$-nearest neighbor ($k$NN) algorithm relies on the principle that objects similar in the input space are also similar in the output space. It is a lazy learning approach since it does not have an explicit training process, but simply stores the training set instead. For a test instance, a $k$-nearest neighbor learner identifies the $k$ instances from the training set that are closest to the test instance. Then, for classification, the test instance will be classified to the majority class among the $k$ instances; while for regression, the test instance will be assigned the average value of the $k$ instances.”
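To make that description concrete, here is a minimal sketch in plain Python. It is not from Zhou's book: the function names (`k_nearest`, `knn_classify`, `knn_regress`), the choice of Euclidean distance, and the toy data are all my own assumptions for illustration. Note how there really is no training step; the "model" is just the stored data.

```python
import math
from collections import Counter

def k_nearest(train_X, test_x, k):
    """Return indices of the k training points closest to test_x.

    Assumption: "closest" means Euclidean distance. The quoted
    description does not fix a metric; this is one common choice.
    """
    dists = sorted((math.dist(x, test_x), i) for i, x in enumerate(train_X))
    return [i for _, i in dists[:k]]

def knn_classify(train_X, train_y, test_x, k=3):
    """Classification: majority class among the k nearest neighbors."""
    neighbors = k_nearest(train_X, test_x, k)
    votes = Counter(train_y[i] for i in neighbors)
    # Ties go to whichever tied class was counted first (an arbitrary rule).
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, test_x, k=3):
    """Regression: average target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, test_x, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)

# Hypothetical toy data: two well-separated clusters in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (0.9, 1.1)]
labels = ["a", "a", "b", "b", "a"]
values = [1.0, 2.0, 10.0, 11.0, 3.0]

print(knn_classify(points, labels, (1.1, 1.0), k=3))
print(knn_regress(points, values, (1.1, 1.0), k=3))
```

The query point sits inside the first cluster, so its three nearest neighbors all carry label `"a"` and the regression output is the mean of their three values. Every decision here (the metric, the value of `k`, the tie-breaking rule) is a modeling choice that the sketch makes silently.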

I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.

After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing: you’re not sure the data you’re collecting is all that great, or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.

I’m here to say, it’s not clear that’s…

