R and “big data”


On 2nd November 2015, Wes McKinney, the developer of the highly useful Python pandas module (and other things, including books), wrote an amusing blog post, “The problem with the data science language wars”. I by no means disagree with him. But I do think the story there is incomplete.

I’ll focus upon R, since it is what I know best, and since it has a community which is not terribly familiar to Python people, along with aspects, features, and packages which, from my limited exposure, address problems in which people in Python seem to have little interest. R remains my mainstay as statistician, data scientist, and numerics engineer, and I have specific reasons for that (*), but I wanted to underscore some points and set aside some preconceptions.

First, the architectural model that is Hadoop and map-reduce is not the only possible concurrent architecture, even if it is popular. Moreover, it does not lend itself to calculating all kinds of things. To insist that a good solution to a problem must be force-fit to a map-reduce architecture is to go around doing data science with one hand tied behind your back.
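As a small illustration of a calculation that does not decompose the way map-reduce wants, consider the mean versus the median. The sketch below is in plain Python with made-up numbers, and the names `map_chunk` and `combine` are hypothetical. The mean reduces cleanly from per-chunk (sum, count) pairs, while the same recipe applied to medians gives the wrong answer. An exact median needs either the pooled data or a more elaborate multi-pass scheme.

```python
import statistics
from functools import reduce

# Hypothetical data, already split into chunks as a cluster might hold it.
chunks = [[1, 2, 3], [4, 5, 100], [6, 7]]
all_values = [x for chunk in chunks for x in chunk]

def map_chunk(chunk):
    """Map step: summarize one chunk as (sum, count)."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Reduce step: merge two (sum, count) summaries."""
    return (a[0] + b[0], a[1] + b[1])

# The mean survives the map-reduce decomposition exactly.
total, count = reduce(combine, map(map_chunk, chunks))
mapreduce_mean = total / count

# The same recipe applied to medians does not reproduce the pooled median.
median_of_medians = statistics.median([statistics.median(c) for c in chunks])
true_median = statistics.median(all_values)

print("mean via map-reduce:", mapreduce_mean)
print("mean of pooled data:", statistics.mean(all_values))
print("median of chunk medians:", median_of_medians)
print("median of pooled data:", true_median)
```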

Second, R itself contains responses to big data, apart from going the way of Revolution Analytics. In particular, there is a potent series of R packages from Professor John (“Jay”) Emerson of Yale, colleagues, and (sometimes former) students, bigmemory, biglm, biganalytics, bigpca, bigalgebra, bigtabulate, foreach, and synchronicity, which offer standard statistical analyses scaled to really big datasets. Jay has coauthored a recent paper describing these and their use. And he has some applicable pithy quotes on his page:

“The plural of anecdote is not data.” – Roger Brinner

“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” – John Tukey

Third, R, like Python, has rich capabilities for using the HDF5 file format, via the rhdf5 package from Bioconductor. (Rhetorical question: How many know about Bioconductor?) This supports keyed and hierarchical indexing of truly huge datasets, including the ability to iterate over the structures in various ways, and map them. Indeed, it overcomes a problem which even pandas has, namely the need to hold all of the data in memory. With HDF5 and these interfaces, you can work on one piece of the elephant at a time. Moreover, the file format and its interfaces support single-writer/multiple-reader access.

Fourth, parallelism in R has come a long way, with the parallel package leading the charge into the world of statement-level parallelism, along with constructs like those provided in the foreach package and others. If numerical computation is your thing, you want statement-level parallelism! Even better, use of multiple cores and servers is now implicitly available in a large number of packages in R, whether bcp, boot, caret, various interfaces to GPU support, or the exquisite parallelMCMCcombine package of Miroshnikov and Conlon, implementing the Consensus Monte Carlo Algorithm of Scott, Blocker, Bonassi, and colleagues.
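For readers who have not met it, the combining step of Consensus Monte Carlo is simple enough to sketch. Each worker runs MCMC on its own shard of the data, and the draws are then averaged across workers with weights equal to the inverse of each worker's sample variance. What follows is a minimal scalar-parameter sketch in plain Python, not the parallelMCMCcombine implementation: the function name `consensus_combine` is hypothetical, and the "subposterior draws" are stand-ins generated from Gaussians rather than real MCMC output.

```python
import random
import statistics

def consensus_combine(shard_draws):
    """Precision-weighted average of per-shard draws of a scalar parameter.

    shard_draws[s][g] is draw g from worker s. All workers must supply
    the same number of draws.
    """
    n_draws = len(shard_draws[0])
    if any(len(draws) != n_draws for draws in shard_draws):
        raise ValueError("every shard must supply the same number of draws")
    # Weight each worker by the inverse of its sample variance.
    weights = [1.0 / statistics.variance(draws) for draws in shard_draws]
    total = sum(weights)
    return [
        sum(w * draws[g] for w, draws in zip(weights, shard_draws)) / total
        for g in range(n_draws)
    ]

# Stand-in subposterior draws (assumption: real ones would come from MCMC
# run separately on each shard of the data).
rng = random.Random(2015)
shards = [
    [rng.gauss(1.0, 0.5) for _ in range(2000)],
    [rng.gauss(1.2, 1.0) for _ in range(2000)],
    [rng.gauss(0.9, 2.0) for _ in range(2000)],
]
combined = consensus_combine(shards)
print("consensus mean:", statistics.mean(combined))
print("consensus sd:  ", statistics.stdev(combined))
```

The noisiest worker contributes least, so the consensus draws sit closest to the most precise shard. For vector parameters the weights become inverse covariance matrices, which this sketch does not attempt.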

This said, I still do use, for instance, Python. However, I typically use it to prepare data for an R program, increasingly to parse and reorganize data given to me, producing an HDF5 version of a typically large assortment of gzip’d files, using Python’s h5py module.

Yeah, language wars are silly. But so, too, is the idea that support of “big data” is the only thing that’s important, even if you want to do stuff with “big data”.


(*) Examples of reasons for my preference for R are:

  • The vastly larger and superior CRAN library of packages for doing many specific statistical things
  • A more careful construction of flonum calculations in libraries and elsewhere
  • The fact that the distributional engines (e.g., the quantile function, distribution function, density function, and random generator for a particular statistical distribution) all are happy to return and receive logs of probabilities rather than probabilities themselves, something Python’s numpy and scipy support only in part (scipy.stats has logpdf and logcdf methods, but its quantile functions do not accept log probabilities).
  • R supports parallel random number generation: No, you can’t properly generate random deviates across a cluster without doing cross-cluster synchronization of them.
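On the point about logs of probabilities, the reason it matters is floating-point underflow. In R one would write something like dnorm(x, log=TRUE). The sketch below makes the same point in plain Python with a hand-written normal log-density (the helper `normal_logpdf` and the data are made up for illustration): the product of two thousand ordinary-sized densities underflows to exactly zero, while the sum of their logs is perfectly well behaved.

```python
import math

def normal_logpdf(x, mu=0.0, sigma=1.0):
    """Log of the normal density, computed without ever forming the density."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

# Hypothetical observations: 2000 values spread evenly over [-1, 1).
observations = [i / 1000.0 for i in range(-1000, 1000)]

# Naive likelihood: multiply the densities together.
likelihood = 1.0
for x in observations:
    likelihood *= math.exp(normal_logpdf(x))

# Log-likelihood: add the log-densities instead.
log_likelihood = sum(normal_logpdf(x) for x in observations)

print("product of densities:", likelihood)
print("sum of log-densities:", log_likelihood)
```

Every individual density here is between about 0.24 and 0.40, so nothing looks dangerous term by term. It is only the product that falls below the smallest representable double.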

About hypergeometric

See http://www.linkedin.com/in/deepdevelopment/ and http://667-per-cm.net
