S. M. Stigler, "Data have a limited shelf life", Harvard Data Science Review, November 2019.
Data, unlike some wines, do not improve with age. The contrary view, that data are immortal, a view that may underlie the often-observed tendency to recycle old examples in texts and presentations, is illustrated with three classical examples and rebutted by further examination. Some general lessons for data science are noted, as well as some history of statistical worries about the effect of data selection on induction and related themes in recent histories of science.
Keywords: dead data, zombie data, post-selection inference, history
Of particular historical interest is whether modern scholars can ever properly interpret classic experiments, defects and all, such as the Millikan oil-drop experiment or Eddington’s measurement of light deflection, which was used to confirm general relativity.
Also of interest is whether enough metadata is kept about old datasets, whether from business (such as insurance or operations) or from scientific observation, to reconstruct their provenance properly.