A quick note on modeling operational risk from count data

The blog statcompute recently featured a proposal encouraging the use of ordinal models for difficult risk regressions involving count data. This is actually a second installment of a two-part post on this problem, the first dealing with flexibility in count regression.

I was drawn to comment because of a remark in the most recent post, specifically that

This class of models require to simultaneously estimate both mean and variance functions with separate sets of parameters and often suffer from convergence difficulties in the model estimation. All four mentioned above are distributions with two parameters, on which mean and variance functions are jointly determined. Due to the complexity, these models are not even widely used in the industry.

Now, admittedly, count regression can have its issues. The traditional methods of linear regression don’t smoothly extend to the non-negative integers, even when counts are large and bounded away from zero. But in the list of proposals offered, there was a stark omission of two categories of approaches. There was also no mention of drawbacks of ordinal models, and the author’s claim that the straw man distributions offered are “not even widely used in industry” may be true, but not for the reasons that paragraph implies.

I post a brief reaction here because that blog does not offer a facility for commenting.

First of all, as any excursion into literature and textbooks will reveal, a standard approach is to use generalized linear models (GLMs) with link functions appropriate to counts. And, in fact, the author goes there, offering a GLM version of standard Poisson regression. But dividing responses into ordinal buckets is not a prerequisite for doing that.
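To make the point concrete, here is a minimal sketch of Poisson regression as a GLM with a log link, fit by Newton-Raphson (which, for this canonical link, coincides with iteratively reweighted least squares). It is not the code from the post under discussion: the data are simulated, the coefficient values 0.5 and 0.8 are arbitrary, and the function names `sample_poisson` and `fit_poisson_glm` are hypothetical. It uses only the Python standard library, and no binning of the response is needed.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def fit_poisson_glm(x, y, iterations=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0 = math.log(max(sum(y) / len(y), 1e-8))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information, a 2x2 matrix with weights mu_i.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system H * d = g for the Newton step d.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Simulated data (an assumption for illustration): true b0 = 0.5, true b1 = 0.8.
rng = random.Random(1)
x = [rng.uniform(0.0, 2.0) for _ in range(500)]
y = [sample_poisson(rng, math.exp(0.5 + 0.8 * xi)) for xi in x]

b0, b1 = fit_poisson_glm(x, y)

# At the maximum likelihood estimate both score equations should vanish.
mu = [math.exp(b0 + b1 * xi) for xi in x]
score0 = sum(yi - mi for yi, mi in zip(y, mu))
score1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

In practice one would reach for an established GLM routine rather than hand-rolled Newton steps; the sketch only shows that the count response enters the model directly.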

GLMs are useful to know about for many reasons, including smooth extensions to logistic regression and probit models. Moreover, such an approach is thoroughly modern, because it leaves behind the idea that there is a unique distribution for every problem, however complicated it might be, and embraces the idea that few actual “real world” problems or datasets will be accurately seen as drawn from some theoretical distribution. That is an insight from industrial practice. Understanding logistic regression and GLMs has other important benefits beyond applicability to binary and ordinal responses, including understanding newer techniques like boosting and generalized additive models (GAMs).

Second, the presentation completely ignores modern Bayesian computational methods. These can use Poisson regression as the core model of counts, and placing hierarchical priors on the Poisson means, with the parameters of those priors drawn in turn from hyperpriors, is an alternative mechanism for representing overdispersion (or underdispersion). Naturally, one needn’t restrict regression to the Poisson, so Negative Binomial or other core models can be used. There are many reasons for using Bayesian methods but, to push back against the argument of the blog post as represented by the quote above, allaying fear of having too many parameters is one of the best and most pertinent. To a Bayesian, many parameters are welcome: each is seen as a random variable contributing to a posterior density and, in modern approaches, they are linked together by a network of hyperpriors. While specialized methods are available, the key technique is Markov Chain Monte Carlo.
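A minimal sketch of that idea, under stated assumptions, follows. Each count gets its own Poisson mean, the means share a Gamma prior, and the Gamma shape and rate get hyperpriors of their own. The data are simulated, and the hyperpriors (Exponential(1) on the shape, Gamma(1, 1) on the rate), the step size, and all names are choices made for illustration, not anything taken from the post or the papers below. The sampler is a plain Gibbs scheme with one Metropolis step, written with the Python standard library only.

```python
import math
import random

rng = random.Random(42)


def sample_poisson(mean):
    """Knuth's method; fine for the modest means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


# Simulated overdispersed counts: lambda_i ~ Gamma(shape=2, rate=0.5), y_i ~ Poisson(lambda_i).
N = 150
y = [sample_poisson(rng.gammavariate(2.0, 1.0 / 0.5)) for _ in range(N)]
mean_y = sum(y) / N
var_y = sum((v - mean_y) ** 2 for v in y) / (N - 1)  # exceeds mean_y under overdispersion


def log_cond_shape(a, b, lam_log_sum):
    """log p(a | lambda, b) up to a constant, with an Exponential(1) hyperprior on a."""
    return N * (a * math.log(b) - math.lgamma(a)) + (a - 1.0) * lam_log_sum - a


a, b = 1.0, 1.0  # Gamma shape and rate, arbitrary starting values
draws = []
for it in range(2500):
    # Gibbs step: lambda_i | y_i, a, b ~ Gamma(shape = a + y_i, rate = b + 1).
    # random.gammavariate takes (shape, scale), so the scale is 1 / rate.
    lam = [rng.gammavariate(a + yi, 1.0 / (b + 1.0)) for yi in y]
    # Gibbs step: b | a, lambda ~ Gamma(1 + N * a, rate = 1 + sum(lambda)).
    b = rng.gammavariate(1.0 + N * a, 1.0 / (1.0 + sum(lam)))
    # Metropolis step for a, using a random walk on log(a).
    lam_log_sum = sum(math.log(max(v, 1e-300)) for v in lam)
    prop = a * math.exp(0.2 * rng.gauss(0.0, 1.0))
    log_ratio = (log_cond_shape(prop, b, lam_log_sum) + math.log(prop)
                 - log_cond_shape(a, b, lam_log_sum) - math.log(a))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        a = prop
    if it >= 500:  # discard burn-in
        draws.append((a, b))

# Posterior mean of the population-level mean count, a / b.
post_mean = sum(ai / bi for ai, bi in draws) / len(draws)
print(f"sample mean {mean_y:.2f}, sample variance {var_y:.2f}, posterior mean of a/b {post_mean:.2f}")
```

The model carries 152 unknowns for 150 observations, which would alarm anyone estimating them freely; the hierarchy is what keeps the problem well posed. A real analysis would add covariates, run several chains, and check convergence, none of which is shown here.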

There are many methods available in R for using these techniques, including the arm, MCMCpack, and MCMCglmm packages. In addition, here are some references from the literature using Bayesian methods for count regression and risk modeling:

  1. İ. Özmen, H. Demirhan, “A Bayesian approach for zero-inflated count regression models by using the Reversible Jump Markov Chain Monte Carlo Method and an application”, Communications in Statistics: Theory and Methods, 2010, 39(12), 2109-2127.
  2. L. N. Kazembe, “A Bayesian two part model applied to analyze risk factors of adult mortality with application to data from Namibia”, PLoS ONE, 2013, 8(9): e73500.
  3. W. Wu, J. Stamey, D. Kahle, “A Bayesian approach to account for misclassification and overdispersion in count data”, Int J Environ Res Public Health, 2015, 12(9), 10648–10661.
  4. J.M. Pérez-Sánchez, E. Gómez-Déniz, “Simulating posterior distributions for zero-inflated automobile insurance data”, arXiv:1606.00361v1 [stat.AP], 16 Nov 2015 10:50:40 GMT.
  5. A. Johansson, “A Comparison of regression models for count data in third party automobile insurance”, Department of Mathematical Statistics, Royal Institute of Technology, Stockholm, Sweden, 2014.

(The above is a reproduction of Figure 2 from W. Wu, J. Stamey, D. Kahle, “A Bayesian approach to account for misclassification and overdispersion in count data”, cited above.)

Third, the author of that post uses the rms package, a useful and neat compendium of regression approaches and methods. Professor Frank Harrell, Jr., the author of its companion textbook, Regression Modeling Strategies (2nd edition, 2015), cautions there that

It is a common belief among practitioners … that the presence of non-linearity should be dealt with by chopping continuous variables into intervals. Nothing could be more disastrous.

See Section 2.4.1 (“Avoiding Categorization”) in the text. The well-justified diatribe against premature and unwarranted categorization spans three full pages.

Indeed, this caution appears in literature:

  1. D. G. Altman, P. Royston, “The cost of dichotomising continuous variables”, The BMJ, 332(7549), 2006, 1080.
  2. J. Cohen, “The cost of dichotomization”, Applied Psychological Measurement, 7, June 1983, 249-253.
  3. P. Royston, D. G. Altman, W. Sauerbrei, “Dichotomizing continuous predictors in multiple regression: a bad idea”, Statistics in Medicine, 25(1), January 2006, 127-141.
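The cost these papers describe is easy to see in a small simulation. The sketch below is my own illustration, not taken from any of the references: it draws a normally distributed predictor, builds a response that depends on it linearly, and compares the correlation obtained with the predictor as measured against the correlation obtained after splitting it at the median. For normal data a median split is known to shrink the correlation to roughly 0.8 of its original value. The sample size, seed, and variable names are arbitrary.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


rng = random.Random(7)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [xi + rng.gauss(0.0, 1.0) for xi in x]  # linear signal plus unit-variance noise

# Dichotomize the predictor at its sample median ("low" versus "high").
median = sorted(x)[n // 2]
x_binary = [1.0 if xi > median else 0.0 for xi in x]

r_continuous = pearson(x, y)
r_binary = pearson(x_binary, y)
ratio = r_binary / r_continuous
print(f"continuous r = {r_continuous:.3f}, dichotomized r = {r_binary:.3f}, ratio = {ratio:.3f}")
```

The same information loss occurs when a count response is chopped into ordinal buckets: the distinctions within each bucket are discarded before the model ever sees them.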

Note that Professor Harrell considers count data to be continuous, and deals with it by transformation, if necessary by applying splines.

About ecoquant

See https://wordpress.com/view/667-per-cm.net/ Retired data scientist and statistician. Now working projects in quantitative ecology and, specifically, phenology of Bryophyta and technical methods for their study.