One of the things I find surprising, if not astonishing, is that in the rush to embrace Big Data, a lot of learning and statistical technique has apparently been discarded along the way. I’m hardly the first to point this out, and there are remedies available. Still, books on predictive analytics get published which, while they collect a set of interesting ad hoc techniques for rapid inference, leave out a lot of traditional techniques and wise concerns. Many of the major software platforms, like Weka, Spark MLlib, or the map-reduce framework with its strong algorithmic constraints, facilitate the organization of large datasets and their retrieval, but the standard practices necessarily omit things like subsampling and assessing what your real sample size is. To be crude, 100 tonnes of crap is still crap, and a billion replicas of exactly the same record don’t give you any more information than is in that one record.
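The point about replicated records can be made concrete with a small, purely hypothetical sketch: compute a standard error naively over all rows, and again over distinct records only. The replicas shrink the naive standard error, manufacturing apparent precision from no new information. The data and function names here are illustrative, not from any real pipeline.

```python
from collections import Counter
import math

def naive_and_dedup_se(values):
    """Standard error of the mean computed two ways: naively over all
    rows, and over distinct records only (a crude effective sample size)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    naive_se = math.sqrt(var / n)

    distinct = list(Counter(values).keys())
    m = len(distinct)
    dmean = sum(distinct) / m
    dvar = sum((v - dmean) ** 2 for v in distinct) / (m - 1)
    dedup_se = math.sqrt(dvar / m)
    return naive_se, dedup_se

# A small sample, then the same sample with one record replicated 1000 times.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
bloated = base + [3.0] * 1000

naive_se_base, _ = naive_and_dedup_se(base)
naive_se_bloat, dedup_se_bloat = naive_and_dedup_se(bloated)

# The naive SE collapses on the bloated data; the deduplicated SE does not.
print(naive_se_base, naive_se_bloat, dedup_se_bloat)
```

Deduplication is only the crudest correction, of course; the real lesson is that row count and effective sample size are not the same thing.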
Fundamentally there seems to be this idea that traditional statistical methods are too slow for the world of Big Data. I think that’s meant in two different ways. The first, which almost everyone addresses when the matter is discussed, is that the dataset sizes or the rates of streaming are so large that it’s not possible to apply batch-oriented or heavy computation to them. The second, which seldom gets mentioned in my experience, is that organizations emphasize rapidly producing apparent results, and, so, deep thinking or care in sampling is perceived as inconsistent with the business mission (*). I say “apparent results” because often there are only poor ways to tell whether results are adequate. Itemset methods, sometimes called association rules, which I have used in an application, offer a number of frequentist statistics for diagnosis. One, for instance, is called Confidence, and it is a very poor man’s estimate of a conditional probability. It’s recognized to have limitations, but to treat the finding of limitations as if it were a research result is, in my opinion, to feign ignorance. So, to me, the rapid production of apparent results is simply looking busy, without a quantitative way of knowing. And reassurances like “the customers seem to continue to be happy” or “sales keep going up”, while important of course, don’t necessarily have any causal connection to what’s being done.
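To make the Confidence point concrete: confidence(A → B) is estimated as support(A ∪ B) / support(A), a plug-in estimate of P(B | A), and it can look impressive purely because B is common in the data. A sketch with invented baskets (all names and numbers are hypothetical) shows how the companion statistic lift exposes this:

```python
def support(transactions, items):
    """Fraction of transactions containing all of the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Plug-in estimate of P(consequent | antecedent)."""
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence normalized by the base rate of the consequent."""
    return confidence(transactions, antecedent, consequent) / \
        support(transactions, consequent)

# Hypothetical baskets: "bread" appears in nearly every one.
baskets = [
    {"bread", "milk"},
    {"bread", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "jam"},
    {"milk"},
]

# The rule {milk} -> {bread} has a respectable-looking confidence...
c = confidence(baskets, {"milk"}, {"bread"})
# ...but its lift is below 1: milk buyers take bread slightly *less*
# often than shoppers overall, because bread is simply ubiquitous.
l = lift(baskets, {"milk"}, {"bread"})
print(c, l)
```

A confidence of about 0.67 here says nothing about the rule being informative; the lift below 1 says the opposite. That is the sense in which Confidence alone is a very poor man’s diagnostic.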
The fact is, of course, that there are lots of ways traditional statistics can inform efforts involving large datasets. Moreover, there are, indeed, techniques, known since the 1960s, for keeping up with the onslaught of a large data stream; these have been greatly improved, and they are very much in use by people who know how to use them.
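One example of that vintage, assuming the reader will forgive my picking a representative rather than a survey, is reservoir sampling (Knuth’s Algorithm R), which maintains a uniform random sample of fixed size k from a stream of unknown length in a single pass, with constant memory. A minimal sketch:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """One-pass uniform sample of size k from an iterable of unknown
    length (Algorithm R). After n items, each item is in the sample
    with probability k/n."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k/(i+1), evicting a random slot.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 100 records uniformly from a million-element stream, one pass.
sample = reservoir_sample(range(10**6), 100, random.Random(42))
print(len(sample))
```

This is exactly the kind of thing the standard Big Data recipes tend to omit: a principled, cheap way to get a real sample instead of keeping everything.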
And I suspect that’s another thing: most of the traditional techniques for doing prediction and inference on streams, namely dynamic linear models, state-space methods, and dynamic generalized linear models, use more mathematics, specifically numerical linear algebra and basic multivariable calculus, to do what they do. And the population of developers, managers, and even engineers eschews these methods whenever it can, because they are perceived to be hard. Instead, things like inference based upon ad hoc methods of locality-sensitive hashing are used because they are “standard practice”, along with generalizations of clustering methods, without even inquiring whether the topology of the problem admits their use. Sure, I can see these methods have their place, but it’s not as if they are the only way things can be done.
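Nor is the mathematics as forbidding in practice as its reputation suggests. A minimal sketch of the simplest dynamic linear model, the scalar local-level model filtered by the Kalman recursions, shows that each new observation is absorbed in constant time, so it keeps up with a stream by construction. The parameter values and data here are purely illustrative:

```python
def kalman_local_level(observations, q=1e-2, r=1.0, m0=0.0, c0=1e6):
    """Kalman filter for the local-level dynamic linear model:
        state:        x_t = x_{t-1} + w_t,  w_t ~ N(0, q)
        observation:  y_t = x_t + v_t,      v_t ~ N(0, r)
    Returns the filtered state means. Each update is O(1), so this
    runs over an unbounded stream with constant memory."""
    m, c = m0, c0   # prior mean and variance of the state
    means = []
    for y in observations:
        # Predict: the level carries over; uncertainty grows by q.
        c_pred = c + q
        # Update: blend prediction and observation via the Kalman gain.
        gain = c_pred / (c_pred + r)
        m = m + gain * (y - m)
        c = (1.0 - gain) * c_pred
        means.append(m)
    return means

# Noisy readings around a true level of 5.0 (made-up data).
ys = [5.2, 4.8, 5.1, 4.9, 5.3, 4.7, 5.0, 5.1, 4.9, 5.0]
filtered = kalman_local_level(ys)
print(filtered[-1])
```

The diffuse prior (large c0) lets the data speak at the start; after a few observations the filter settles near the underlying level. The multivariate case replaces these scalars with small matrix operations, which is where the numerical linear algebra comes in.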
Rather than some kind of leap of faith that more and more data can make up for things like poor statistical power, I think smart organizations realize that sampling matters, and traditional critiques of business processes are important. Accordingly, I’m not interested in Big Data, I’m interested in Smart Data, no matter what its size. I think anyone who cares about what their results mean should be interested in Smart Data, too.
In addition to simply representing good practice, Smart Data techniques do things more easily (in the big-picture sense) than do ad hoc collections of ad hoc techniques, no matter how many times those are cross-validated. For instance, predicting consumer behavior regarding hypothetical products, or products they have never experienced, or products no one has ever experienced, is not something which extrapolations of existing evidence compendia can ever dream of doing. There needs to be a well-wrought model in order to do that. And the model needs to be evidence-based, too. In other words, if there’s no training data, no truth data to score against, only observations, many of the present Big Data methods are hopeless.
(*) This is exemplified by the many competitions or hackathons where tough problems are expected to be solved in a short time by adversarial teams. Sure, speed is sometimes necessary, but does anyone seriously expect every business can be run that way and last? “Internet time” is and always was ridiculous. At the least, companies run that way will lose their people. They could go bankrupt after making a big mistake that was not noticed due to the rush. Flexibility, yes. Careful competence akin to a skunkworks, definitely. But death marches can’t work, nor can projects which feature any two or more of these characteristics: wishful thinking, escalation of commitment, optimism bias, and the planning fallacy.