Big Data.

All the rage.

Why?

Apart from distributed software folks *strutting their stuff*, something which is likely to be fleeting, especially *when* quantum computing comes around, what does it buy *anyone*? I can see *four* possibilities, which I consider in turn below.

First, Big Data allows the pursuit of Outliers As Interesting Cases. That is, outliers are rare, but in a Big Data context there are enough of them to make them worthy subjects of deep study. Okay. Call them extreme values, or not, since at that volume they are no longer extreme. But doesn’t *knowing* what an outlier is depend crucially upon knowing *what an outlier is not*? How does that lessen the need to characterize your main population really, really well?

Second, the claim is that Big Data allows empirical cumulative distributions and empirical densities to stand in for any theoretical construct or statistical description of the same data you might like to throw at it. Okay, but how do you isolate non-stationary factors in your Big dataset? After all, *things do change*, with time especially, but possibly also *with place*.

Third, Big Data allows the discovery of small features of the common: the individuals or transactions which make up the bulk of the distributional center, but which exhibit behaviors that would be missed if the sample size were not so enormous. I can think of this as an instance of the first possibility, that of “pursuit of outliers as interesting cases”, but I see the point if far more data are needed to establish and characterize the behavior. But, I wonder, how precisely is the separation done, distinguishing typical behavior from nuance? Sounds like it needs a good model, however much data there is available.

Fourth, Big Data allows the discovery of relationships which were not discernible at any smaller dataset size. Sorry, I think this is *complete bupkis*. Apart from the question of why there should be a dataset size threshold which permits this, if a model were inaccessible or undemonstrable at smaller sizes, what precisely differs that allows the same to be done at larger ones?

No, if faced with “big data”, my approach, after careful consideration, is to demand a set of specific questions for which answers are sought, and then *sample* from the “big data” population to try to do a calculation characterizing them. Yes, a “big data” infrastructure might be needed to take one or more samples of this kind, but it is not needed for doing analysis.
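As a rough illustration of that sampling step, here is a minimal sketch in Python of one standard way to draw a fixed-size uniform sample from a stream too large to hold in memory, namely reservoir sampling (Algorithm R). The function name `reservoir_sample` and its parameters are illustrative assumptions, not anything prescribed above, and the stream here is just a stand-in for whatever the “big data” infrastructure would actually deliver.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw a uniform random sample of up to k items from a stream in one pass.

    Hypothetical helper for illustration: it never needs the whole
    population in memory, only the k items currently held in the reservoir.
    """
    rng = random.Random(seed)  # seeded so that a sample can be reproduced
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: a stand-in "population" of one million records, sampled down to 1,000.
sample = reservoir_sample(range(1_000_000), k=1000, seed=42)
estimated_mean = sum(sample) / len(sample)
```

The analysis itself, here just a mean, then runs on the thousand sampled records with ordinary tools. Taking several independent samples with different seeds gives some sense of how much the answer varies from sample to sample.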

## About hypergeometric

See http://www.linkedin.com/in/deepdevelopment/ and http://667-per-cm.net