The blog statcompute recently featured a proposal encouraging the use of ordinal models for difficult risk regressions involving count data. It is actually the second installment of a two-part post on the problem, the first dealing with flexibility in count regression.
I was drawn to comment because of a remark in the most recent post, specifically that
> This class of models require to simultaneously estimate both mean and variance functions with separate sets of parameters and often suffer from convergence difficulties in the model estimation. All four mentioned above are distributions with two parameters, on which mean and variance functions are jointly determined. Due to the complexity, these models are not even widely used in the industry.
Now, admittedly, count regression can have its issues. The traditional methods of linear regression don’t smoothly extend to the non-negative integers, even when counts are large and bounded away from zero. But the list of proposals offered contains a stark omission of two categories of approaches. There is also no mention of the drawbacks of ordinal models, and the author’s claim that the straw-man distributions offered are “not even widely used in the industry” may be true, but not for the reasons that paragraph implies.
I post this brief reaction here because that blog does not offer a facility for commenting.
First of all, as any excursion into the literature and textbooks will reveal, a standard approach is to use generalized linear models (GLMs) with link functions appropriate to counts. And, in fact, the author goes there, offering a GLM version of standard Poisson regression. But dividing responses into ordinal buckets is not a prerequisite for doing that.
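To make that concrete, here is a minimal sketch of a Poisson GLM fit in R on simulated data; the variable names (claims, age, exposure) and the generating coefficients are invented for illustration, not taken from the original post.

```r
# Minimal sketch of a standard Poisson GLM on simulated count data.
# All names and coefficients here are illustrative only.
set.seed(42)
n        <- 500
age      <- runif(n, 18, 80)
exposure <- runif(n, 0.5, 2)                 # unequal observation windows
lambda   <- exposure * exp(-2 + 0.03 * age)  # true mean on the log scale
claims   <- rpois(n, lambda)

# The log link is the canonical choice for counts; offset(log(exposure))
# adjusts for exposure without estimating a coefficient for it.
fit <- glm(claims ~ age + offset(log(exposure)),
           family = poisson(link = "log"))
summary(fit)
```

No ordinal bucketing of the response is needed anywhere in this fit.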
GLMs are useful to know about for many reasons, including their smooth extensions to logistic regression and probit models. Moreover, such an approach is thoroughly modern: it leaves behind the idea that there is a unique distribution for every problem, however complicated, and embraces the idea that few actual “real world” problems or datasets will be accurately seen as drawn from some theoretical distribution. That is an insight from industrial practice. Understanding logistic regression and GLMs has other important benefits beyond applicability to binary and ordinal responses, including understanding newer techniques like boosting and generalized additive models (GAMs).
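One concrete expression of that insight, sketched here on the simulated data from above: quasi-likelihood specifies only a mean-variance relationship rather than a complete distribution, and mgcv’s gam() extends the same machinery with smooth terms.

```r
# Quasi-Poisson: no exact count distribution is assumed, only that the
# variance is proportional to the mean; the dispersion is estimated freely.
fit_q <- glm(claims ~ age + offset(log(exposure)), family = quasipoisson())
summary(fit_q)$dispersion  # close to 1 here, since these data really are Poisson

# A generalized additive model (GAM) lets the data choose the shape of the
# age effect via a smooth term, again with no bucketing of the response.
library(mgcv)
fit_g <- gam(claims ~ s(age) + offset(log(exposure)), family = poisson())
summary(fit_g)
```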
Second, the presentation completely ignores modern Bayesian computational methods. These can use Poisson regression as the core model of counts, but placing hierarchical priors on the Poisson means, themselves drawn from hyperpriors, is an alternative mechanism for representing overdispersion (or underdispersion). Naturally, one needn’t restrict regression to the Poisson: the Negative Binomial or other core models can be used. There are many reasons for using Bayesian methods but, to push back on the argument of the blog post as represented by the quote above, allaying the fear of having too many parameters is one of the best and most pertinent. To a Bayesian, many parameters are welcome: each is seen as a random variable contributing to a posterior density and, in modern approaches, linked to the others through a network of hyperpriors. While specialized methods are available, the key technique is Markov Chain Monte Carlo (MCMC).
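As a sketch of what this looks like in practice, consider the MCMCglmm package (one of several mentioned below). Its "poisson" family includes an observation-level random effect, so each count receives its own log-normal perturbation of the Poisson mean: hierarchical overdispersion without a separately parameterized variance function. The data are the simulated example above; the sampler settings are illustrative, not a recommendation.

```r
# Sketch: Bayesian overdispersed Poisson regression fit by MCMC.
# MCMCglmm's "poisson" family adds a residual (observation-level) random
# effect, i.e., a log-normal mixing of the Poisson means.
library(MCMCglmm)
d <- data.frame(claims, age)
fit_b <- MCMCglmm(claims ~ age, family = "poisson", data = d,
                  nitt = 13000, burnin = 3000, thin = 10, verbose = FALSE)
summary(fit_b)  # posterior summaries: fixed effects plus variance components
```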
There are many methods available in R for using these techniques, including the arm, MCMCpack, and MCMCglmm packages. In addition, here are some references from the literature using Bayesian methods for count regression and risk modeling:
- İ. Özmen, H. Demirhan, “A Bayesian approach for zero-inflated count regression models by using the Reversible Jump Markov Chain Monte Carlo Method and an application”, Communications in Statistics: Theory and Methods, 2010, 39(12), 2109-2127.
- L. N. Kazembe, “A Bayesian two part model applied to analyze risk factors of adult mortality with application to data from Namibia”, PLoS ONE, 2013, 8(9): e73500.
- W. Wu, J. Stamey, D. Kahle, “A Bayesian approach to account for misclassification and overdispersion in count data”, Int J Environ Res Public Health, 2015, 12(9), 10648-10661.
- J. M. Pérez-Sánchez, E. Gómez-Déniz, “Simulating posterior distributions for zero-inflated automobile insurance data”, arXiv:1606.00361v1 [stat.AP], 2016.
- A. Johansson, “A Comparison of regression models for count data in third party automobile insurance”, Department of Mathematical Statistics, Royal Institute of Technology, Stockholm, Sweden, 2014.
(Figure: reproduction of Figure 2 from W. Wu, J. Stamey, D. Kahle, “A Bayesian approach to account for misclassification and overdispersion in count data”, cited above.)
Third, the author of that post uses the rms package, a useful and neat compendium of regression approaches and methods. In the companion textbook, Regression Modeling Strategies (2nd edition, 2015), Professor Frank Harrell, Jr., cautions that
> It is a common belief among practitioners … that the presence of non-linearity should be dealt with by chopping continuous variables into intervals. Nothing could be more disastrous.
See Section 2.4.1 (“Avoiding Categorization”) in the text. The well-justified diatribe against premature and unwarranted categorization spans three full pages.
Indeed, this caution appears throughout the literature:
- D. G. Altman, P. Royston, “The cost of dichotomising continuous variables”, The BMJ, 332(7549), 2006, 1080.
- J. Cohen, “The cost of dichotomization”, Applied Psychological Measurement, 7, June 1983, 249-253.
- P. Royston, D. G. Altman, W. Sauerbrei, “Dichotomizing continuous predictors in multiple regression: a bad idea”, Statistics in Medicine, 25(1), January 2006, 127-141.
Note that Professor Harrell considers count data to be continuous, dealing with it by transformation and, if necessary, by applying splines.
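As a closing sketch in that spirit, reusing the simulated data from above: keep the count response continuous (here via an illustrative square-root variance-stabilizing transform) and model the predictor with a restricted cubic spline rather than chopping it into intervals.

```r
# Sketch: splines instead of categorization, in the style of rms.
# The square-root transform of the count response is illustrative only.
library(rms)
d  <- data.frame(claims, age)
dd <- datadist(d); options(datadist = "dd")
fit_s <- ols(sqrt(claims) ~ rcs(age, 4), data = d)  # rcs: restricted cubic spline
fit_s
```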