Just because the data lies sometimes doesn’t mean it’s okay to censor it

Posted on 1 May 2017 by ecoquant

Or, there’s no such thing as an outlier …

Eli put up a post titled “The Data Lies. The Crisis in Observational Science and the Virtue of Strong Theory” at his lagomorph blog. Think of it: Data lying. Obviously this is worth a remark. After all, the Bayesian project is all about treating data as given and fixed, a nod of deep respect, and then, in a kind of generalization of maximum likelihood philosophy, finding those parameters offered by theory which are most consistent with it. But in experimental and, especially, observational science things aren’t so easy.

So I say … Maybe it is …

Well. Of course. Eddington: “It is also a good rule not to put overmuch confidence in the observational results that are put forward until they are confirmed by theory” (from his book). On the other hand …

It is also possible to score theory’s consistency with experiment with techniques better than t-tests and the like, notably the important information criteria that have been developed (Burnham and Anderson). These are bidirectional. For example, it is entirely possible an observational experiment, however well constructed, might be useless for testing a model. Observational experiments are not as powerful in this regard as are constructed experiments.
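As a small illustration of scoring candidate models with an information criterion, here is a minimal sketch in Python. It assumes Gaussian errors and least-squares fits, so that AIC reduces to n·ln(RSS/n) + 2k up to an additive constant. The synthetic data, the polynomial candidates, and the helper names `gaussian_aic` and `akaike_weights` are all hypothetical choices made for the example, not anything taken from Burnham and Anderson's text.

```python
import numpy as np

def gaussian_aic(y, fitted, k):
    """AIC for a least-squares fit with Gaussian errors (additive constant dropped)."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    return n * np.log(rss / n) + 2 * k

def akaike_weights(aic_values):
    """Relative support for each candidate model, normalized to sum to 1."""
    aic_values = np.asarray(aic_values, dtype=float)
    delta = aic_values - aic_values.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# Hypothetical data: a linear trend plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
y = 1.0 + 2.0 * t + rng.normal(0.0, 0.1, t.size)

# Candidate models: polynomials of degree 0, 1 and 5.
degrees = (0, 1, 5)
aic = {}
for degree in degrees:
    coef = np.polyfit(t, y, degree)
    # k counts the polynomial coefficients plus the error variance.
    aic[degree] = gaussian_aic(y, np.polyval(coef, t), k=degree + 2)

weights = akaike_weights([aic[d] for d in degrees])
for d, w in zip(degrees, weights):
    print(f"degree {d}: AIC = {aic[d]:.1f}, Akaike weight = {w:.3f}")
```

The weights are relative to the candidate set only: they say which of the models offered is best supported by the data, not whether any of them is adequate.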

But I think the put-down of the random walk as a model is a bit strong. After all, that is the basis of a Kalman filter-smoother, at least in the step-level change version. Sure, the state equation need not assume random variation and could have a deterministic core about which there is random variation. But it is possible to posit a “null model” if you will which involves no more than a random walk to initialize, and then takes advantage of Markov chains as universal models to lock onto and track whatever a phenomenon is.
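To make the random walk "null model" concrete, here is a minimal sketch of the Kalman filter for the local-level model, in which the hidden state is a pure random walk observed with noise. It shows only the forward filter, not the smoother, and the variances `q` and `r` are assumed known rather than estimated. The function name `local_level_filter` and the simulated data are hypothetical.

```python
import numpy as np

def local_level_filter(y, q, r, m0=0.0, p0=1e6):
    """Kalman filter for x[t] = x[t-1] + w[t], y[t] = x[t] + v[t].

    q is Var(w) (process variance), r is Var(v) (observation variance).
    m0 and p0 give a deliberately vague prior on the initial level.
    """
    m, p = m0, p0
    means = np.empty(len(y))
    variances = np.empty(len(y))
    for t, obs in enumerate(y):
        p = p + q                # predict: random walk keeps the mean, inflates the variance
        k = p / (p + r)          # Kalman gain
        m = m + k * (obs - m)    # update the level toward the new observation
        p = (1.0 - k) * p        # posterior variance after the update
        means[t] = m
        variances[t] = p
    return means, variances

# Hypothetical demo: a random-walk level observed with much larger noise.
rng = np.random.default_rng(1)
n = 500
level = np.cumsum(rng.normal(0.0, 0.1, n))   # process variance 0.01
obs = level + rng.normal(0.0, 1.0, n)        # observation variance 1.0
means, variances = local_level_filter(obs, q=0.01, r=1.0)

print("raw MSE:     ", np.mean((obs - level) ** 2))
print("filtered MSE:", np.mean((means - level) ** 2))
```

Nothing in the state equation knows anything about the phenomenon, yet the filter tracks the level; a deterministic core could be added to the predict step without changing the rest.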

Better, it’s possible to integrate over parameters, as was done in the bivariate response for temperature anomalies in the above, to estimate best fits for process variance. It’s possible to use priors on these parameters, but the outcomes can be sensitive to initializations. It’s also possible to use non-parametric smoothing splines fit using generalized cross-validation. These are a lot better than some of the multiple sets of linear fits I’ve seen done in Nature Climate Change and they tell the same story.

No doubt, there are serious questions about how pertinent these models are to paleoclimate calculations. However, if they are parameterized correctly, especially in the manner of hierarchical Bayesian models, these could well provide constraints in the way of priors for processes which could be applicable to paleoclimate.

While certainly theory can be used, and much of it is approachable and very accessible, I understand why people might want to do something else. Business and economic forecasts are often done using ARIMA models, even if these are not appropriate.

But there is an important area of quantitative research which offers so-called model-free techniques for understanding complex systems, and, in my opinion, these should not be casually dismissed. In particular, the best quantitative evidence of which I am aware teasing out the causal role CO₂ has for forcing at all periods comes from this work. In fact, I’m surprised more people aren’t aware of, and don’t use, the methods Ye, Deyle, Sugihara, and the rest of their team offer.
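For readers who have not met these methods, the core idea of convergent cross mapping can be sketched in a few lines. If X drives Y, then the time-delay embedding of Y carries information about X, so X can be estimated from nearest neighbors on Y's reconstructed manifold, and that estimate should improve as the library of points grows. What follows is a bare-bones Python illustration under stated assumptions, not the rEDM implementation: the function names, the embedding dimension `E=2`, and the toy coupled logistic maps are choices made for the example.

```python
import numpy as np

def embed(series, E, tau):
    """Time-delay embedding: each row is (s[t], s[t-tau], ..., s[t-(E-1)*tau])."""
    start = (E - 1) * tau
    cols = [series[start - j * tau: len(series) - j * tau] for j in range(E)]
    return np.column_stack(cols)

def cross_map_skill(manifold_series, estimated_series, E=2, tau=1, lib_size=None):
    """Estimate one series from the shadow manifold of another.

    Returns the Pearson correlation between the estimates and the true values.
    """
    M = embed(manifold_series, E, tau)
    target = estimated_series[(E - 1) * tau:]
    if lib_size is not None:
        M, target = M[:lib_size], target[:lib_size]
    # Pairwise distances between points on the manifold, excluding self-matches.
    d = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # E + 1 nearest neighbors, weighted by distance relative to the closest one.
    idx = np.argsort(d, axis=1)[:, :E + 1]
    nd = np.take_along_axis(d, idx, axis=1)
    w = np.exp(-nd / np.maximum(nd[:, [0]], 1e-12))
    w /= w.sum(axis=1, keepdims=True)
    estimate = (w * target[idx]).sum(axis=1)
    return np.corrcoef(estimate, target)[0, 1]

# Hypothetical toy system: x drives y, with no feedback from y to x.
n = 500
x = np.empty(n)
y = np.empty(n)
x[0], y[0] = 0.4, 0.2
for t in range(n - 1):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t])
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])

# Cross-map x from y's manifold at a short and a long library length.
skill_short = cross_map_skill(y, x, lib_size=30)
skill_long = cross_map_skill(y, x, lib_size=400)
print("skill with 30 points: ", skill_short)
print("skill with 400 points:", skill_long)
```

The signature of causation in this framework is convergence, that is, skill rising with library length, rather than a single high correlation. A real analysis would also choose `E` from the data and compare against surrogate series.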

I should mention, too, that there are R packages called:

Package nwfscNLTS: Non-linear time series
Package rEDM: an R package for Empirical Dynamic Modeling and Convergent Cross-Mapping
Package multispatialCCM: Multispatial Convergent Cross Mapping

[P.S. Sorry, I can’t help it if Judith Curry likes it, too. It’s good stuff.]

But, personally, I like Bayesian Dirichlet stick-breaking …

