“Hadoop is NOT ‘Big Data’ is NOT Analytics”

Posted on 8 April 2017 by ecoquant

Arun Krishnan, CEO & Founder at $\mathbf{n!}\,$ Analytical Sciences, comments on this serious problem with the field. A short excerpt:

… A person who is able to write code using Hadoop and the associated frameworks is not necessarily someone who can understand the underlying patterns in that data and come up with actionable insights. That is what a data scientist is supposed to do. Again, data scientists might not be able to write the code to convert “Big Data” into “actionable” data. That’s what a Hadoop practitioner does. These are very distinct job descriptions.

While the term analytics has become a catch-all phrase used across the entire value chain, I personally prefer to use it more for the job of actually working with the data to get analytical insights. That separates out upstream and downstream elements of the entire data mining workflow.

I have repeatedly observed practitioners, and especially managers, who treat (or would very much like to treat) tools and techniques from this area as if they were Magical Boxes, to which you can send arbitrary data and obtain wonderful results, like the elixir of the Alchemists. There is also a cynical aspect to the attitude of some managers, who seem indoctrinated by the old “Internet time” and “agile sprint” notions: if something does not show tangible and substantial progress over the short term (on the order of a week or two), there must be something fundamentally wrong with the process. Sure, progress needs to be shown and reportable, but some problems, especially those involving data which are not obviously meaningful (*), demand a deep familiarization with the data and a good deal of data cleansing (**). This is hard, especially when the data are large. And not all worthwhile problems can be solved in two weeks, even for a corporation. Consider the project and planning timelines that a company like Walt Disney follows for its parks, or that an energy company like DONG follows for its offshore wind projects.

This is unfortunate, and it is more than simply a matter of personal style. Projects which proceed with the magical thinking that the right tool or algorithm is going to solve all their issues typically fail, after expending large resources on computing assets, data licenses, and labor. When they do, they give analytics and “Big Data” a tarnished reputation, especially among upper management, who blame and distrust new things rather than incompetent engineers or, perhaps, engineers who lack the integrity to explain to their management that these tools have promise, but that project schedules for venturing into new sources of data are long, and that the first portion is best done with a very small team.

In fact, one severe failing I see in the current suite of “Big Data” tools is that, while they are strong on certain modeling algorithms and on representational devices like pandas-style and R-style data frames, they offer little in the way of advanced data cleaning tools, ones which can marshal clusters to completely rewrite data so that it becomes useful for analysis and machine learning.

(*) Data which are obviously meaningful consist of self-evident records like purchasing transactions, or, as is increasingly less common, have records and fields documented carefully in a data dictionary. These have fallen out of fashion because of the NoSQL movement, and I applaud the desire to push analysis and data sources beyond structured data offerings. However, just because an analytical tool can parse unstructured text does not mean it somehow automatically recovers meaning from that text. What you have now, instead of structured data, is a problem in natural language processing, for which there are, indeed, excellent tools available, like Python’s nltk. But few people who embrace NoSQL know or use this kind of thing.

It is even harder to know what to do with semi-structured textual data, such as the headers defined in IETF RFC 2616 (HTTP/1.1). In these cases, while there is official guidance, there is no effective enforcement mechanism, so instances of these headers can be, by the criteria of the RFC, malformed, even if there are dialects in Internet communities which are self-consistent and practiced in breach of the RFC. The trouble is that, here, there is no computable definition of malformed, so what is meaningful is something which needs to be learned from the corpora available. This is not an easy task, and it may depend not only upon the communities in question, but also upon geographic origins and takeup, as well as Internet protocol and netblocks.
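To make the gap concrete, here is a minimal Python sketch. It checks only the field-name part of a header line against the RFC 2616 "token" grammar, then tallies the coarse shapes of the lines that fail, as a crude first step toward seeing which deviations recur in a corpus. The sample lines and the helper names `split_header` and `shape` are invented for illustration, and a real study would need far more than a syntax check.

```python
import re
from collections import Counter

# RFC 2616 field names are "tokens": printable ASCII minus the separator characters.
TOKEN = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")

def split_header(line):
    """Return (name, value) if the field name is a valid token, else None."""
    name, sep, value = line.partition(":")
    if not sep or not TOKEN.match(name):
        return None
    return name, value.strip()

def shape(line):
    """Collapse letter runs to 'a' and digit runs to '9' so similar lines group together."""
    collapsed = re.sub(r"[A-Za-z]+", "a", line)
    return re.sub(r"[0-9]+", "9", collapsed)

# Hypothetical sample lines, made up for this example (not taken from any real corpus).
SAMPLE = [
    "Content-Type: text/html",
    "User Agent: ExampleBot/1.0",
    "X Forwarded For: 192.0.2.1",
    "Accept: */*",
    "Referer = http://example.org/",
]

conforming = [line for line in SAMPLE if split_header(line) is not None]
deviant = [line for line in SAMPLE if split_header(line) is None]

# Counting shapes of the rejected lines hints at which "dialects" are common.
deviant_shapes = Counter(shape(line) for line in deviant)

if __name__ == "__main__":
    print(len(conforming), "conforming;", len(deviant), "deviant")
    for pattern, count in deviant_shapes.most_common():
        print(count, pattern)
```

The strict check is the easy part. Deciding which of the rejected lines belong to a self-consistent community dialect, and which are simply noise, is the part that has to be learned from the data.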

(**) There are plenty of examples of such data cleansing tools in the single-threaded, single-core world. One open source example is OpenRefine.
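As an illustration of the kind of cleansing such tools do, the following Python sketch groups variant spellings by a normalized key. It loosely imitates the fingerprint idea behind OpenRefine's key-collision clustering, but it is my own approximation and not OpenRefine's code. The function names and the sample values are assumptions made for the example.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: ASCII-fold, lowercase, drop punctuation, sort unique tokens."""
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    cleaned = folded.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values sharing a fingerprint; return only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(members) > 1]

# Hypothetical values, invented for illustration.
names = ["Acme Corp.", "acme corp", "Corp Acme", "Café Noir", "Cafe  Noir", "Widgets Inc"]

if __name__ == "__main__":
    for members in cluster(names):
        print(members)
```

Each key is computed from a single record, so the keying step is the sort of work that could in principle be spread across a cluster. Deciding which groups really are the same entity still needs someone who knows the data.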

About ecoquant

See https://wordpress.com/view/667-per-cm.net/. Retired data scientist and statistician. Now working on projects in quantitative ecology and, specifically, phenology of Bryophyta and technical methods for their study.
