“Double Plus Big Data”

Posted on 25 September 2013 by ecoquant

Big Data.

All the rage.

Why?

Apart from distributed software folks strutting their stuff, something which is likely to be fleeting, especially when quantum computing comes around, what does it buy anyone? I can see four possibilities, which I consider in turn below.

First, Big Data allows the pursuit of Outliers As Interesting Cases. That is, outliers are rare, but in a Big Data context there are enough of them to make them worthy subjects of deep study. Okay. Extreme values, or not, since these are no longer extreme. But doesn’t knowing what an outlier is depend crucially upon knowing what an outlier is not? How does that lessen the need to characterize your main population really, really well?

Second, there is the claim that Big Data allows empirical cumulative distributions and empirical densities to stand in for any theoretical construct or statistical description of the same data you might like to throw at it. Okay, but how do you isolate non-stationary factors in your Big dataset? After all, things do change, with time especially, but possibly with place.
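To make the non-stationarity worry concrete, here is a minimal sketch in Python. It builds empirical CDFs for two time windows and measures the largest vertical gap between them (the two-sample Kolmogorov-Smirnov distance). The function names, the simulated windows, and the size of the drift are all my own assumptions for illustration; nothing here comes from a real dataset.

```python
import bisect
import random


def ecdf(sample):
    """Return the empirical CDF of `sample` as a function F(x) = P(X <= x)."""
    xs = sorted(sample)
    n = len(xs)

    def F(x):
        # bisect_right counts how many observations are <= x
        return bisect.bisect_right(xs, x) / n

    return F


def ks_distance(a, b):
    """Largest vertical gap between the empirical CDFs of a and b.

    Both ECDFs are step functions that only change at observed values,
    so it is enough to evaluate the gap at the pooled data points.
    """
    Fa, Fb = ecdf(a), ecdf(b)
    return max(abs(Fa(x) - Fb(x)) for x in list(a) + list(b))


# Hypothetical data: two windows from the same process, and a third
# window in which the mean has drifted by half a standard deviation.
rng = random.Random(42)
early = [rng.gauss(0.0, 1.0) for _ in range(5000)]
later_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
later_drifted = [rng.gauss(0.5, 1.0) for _ in range(5000)]

stable_gap = ks_distance(early, later_same)
drift_gap = ks_distance(early, later_drifted)
print(f"gap, no drift:   {stable_gap:.3f}")
print(f"gap, with drift: {drift_gap:.3f}")
```

Pooling all three windows into one empirical distribution would smear the drift into a single curve that describes none of them. Noticing the change at all requires deciding in advance to split by time, which is already a modeling choice.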

Third, Big Data allows the discovery of small features of the common: the individuals or transactions which make up the bulk of the distributional center, but which exhibit behaviors that would be missed if the sample size were not so enormous. I can think of this as an instance of the first possibility, that of “pursuit of outliers as interesting cases”, but I see the point if far more data are needed to establish and characterize the behavior. But, I wonder, how precisely is the separation done, distinguishing typical behavior from nuance? Sounds like it needs a good model, however much data there is available.

Fourth, Big Data allows the discovery of relationships which were not discernible at any smaller dataset size. Sorry, I think this is complete bupkis. Apart from the question of why there should be a dataset size threshold which permits this, if a model were inaccessible or undemonstrable at smaller sizes, what precisely differs that allows the same to be done at larger ones?

No, if faced with “big data”, my approach, after careful consideration, is to demand a set of specific questions for which answers are sought, and then sample from the “big data” population to do the calculations that answer them. Yes, a “big data” infrastructure might be needed to take one or more samples of this kind, but it is not needed for doing the analysis.
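As a sketch of what "sample, then analyze" could look like in practice, the Python below uses reservoir sampling (the classic Algorithm R) to draw a uniform random sample in a single pass over a stream too large to hold in memory, and then answers one specific question, the population mean, from the sample alone. The stream is simulated and every name and number in it is hypothetical; this is one convenient sampling scheme among many.

```python
import math
import random
import statistics
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Uniform random sample of up to k items from a stream of unknown length."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def simulated_stream(n: int, seed: int) -> Iterator[float]:
    """Stand-in for a pass over the 'big data' store (hypothetical values)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.gauss(100.0, 15.0)


# One pass over 200,000 records, keeping only 2,000 of them in memory.
sample = reservoir_sample(simulated_stream(200_000, seed=1), k=2_000, rng=random.Random(2))

# The specific question: what is the population mean, and how well is it known?
mean = statistics.fmean(sample)
stderr = statistics.stdev(sample) / math.sqrt(len(sample))
print(f"estimated mean: {mean:.2f} +/- {1.96 * stderr:.2f} (approx. 95% interval)")
```

Only the single pass over the stream touches the full dataset. Everything after it runs on a couple of thousand numbers, and the width of the interval says whether a larger sample is worth drawing.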

About ecoquant

See https://wordpress.com/view/667-per-cm.net/ Retired data scientist and statistician. Now working on projects in quantitative ecology and, specifically, phenology of Bryophyta and technical methods for their study, notably Macrophotography. Some photos of mine: https://www.flickr.com/photos/198372469@N03/

