## Polls, Political Forecasting, and the Plight of FiveThirtyEight

On 17th October 2016 at 7:30 p.m., Nate Silver of FiveThirtyEight.com wrote about how, as former Secretary of State Hillary Clinton’s polling numbers improved, it became harder for FiveThirtyEight’s models to justify increasing her probability of winning, although the better numbers did “stabilize” their predictions. Mr Silver is being a bit too harsh on their models: the problem is fundamental, not something that afflicts only their particular model. In Mr Silver’s defense, he did write:

> But there’s some truth to the notion that she’s encountering diminishing returns. And that’s for a simple reason: 88 percent and 85 percent are already fairly high probabilities. Our model is going to be stingy about assigning those last 10 or 15 percentage points of probability to Clinton as she moves from the steep, middle part of the probability distribution to the flatter part.

Well, maybe, except I’m not sure that is assignable to Secretary Clinton. It’s a mathematical phenomenon, one which Mr Silver may be aware of but apparently did not want to comment upon, saying “Before this turns into too much of a math lesson …”. I say: why not a math lesson?

In particular, as the probability of an event, any event, moves further above 50% (or, symmetrically, further below 50%), the amount of information needed to push it another step of the same size grows, and as the probability (or improbability) of the event approaches certainty, each additional piece of information improves the determination less and less. It’s possible to be quantitative about all this.

Let’s have a look at this in the hypothetical case of two presidential candidates, one called T and one called H. Suppose that, with time, T’s probability of winning, denoted here $[\mathbf{T}]$, decreases from 0.50. Since I’m only considering two candidates, $[\mathbf{H}] = 1 - [\mathbf{T}]$, so $[\mathbf{H}]$ increases away from 0.50, and the two sum to unity. This is a system with two components, and its entropy is equal to

$-[\mathbf{T}] \log_{2}{[\mathbf{T}]} - [\mathbf{H}] \log_{2}{[\mathbf{H}]}$.

Entropy for this system will hereafter be denoted $E(p)$, where $p = [\mathbf{H}]$. The amount of information needed to move, say, $[\mathbf{H}]$ up a unit of probability is the decrease in the entropy at the new state of affairs with respect to the old one. Adding information is a kind of work, although, in this case, the “work” is evidence collected from polls and other sources.
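As a minimal sketch of these two definitions, the snippet below computes the two-candidate entropy in bits and the information needed to move between two probabilities. The function names `binary_entropy` and `information_needed` are my own labels for illustration, not anything from FiveThirtyEight’s models.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a two-outcome system with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # the limit of -x * log2(x) as x -> 0 is 0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def information_needed(p_old: float, p_new: float) -> float:
    """Decrease in entropy (bits) when the probability moves from p_old to p_new."""
    return binary_entropy(p_old) - binary_entropy(p_new)

# A tied race carries the maximum entropy for two outcomes.
print(f"E(0.50) = {binary_entropy(0.50):.3f} bits")
# Information needed to move the leader from 0.50 to 0.60.
print(f"0.50 -> 0.60 needs {information_needed(0.50, 0.60):.4f} bits")
```

Moving from 0.50 to 0.60 costs only about 0.029 bits, which is the yardstick used in the comparison that follows.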

So, for example, the amount of entropy when both candidates are tied is exactly 1 bit. (Entropy and information are measured in bits or nats.) When $[\mathbf{H}]$ is about 0.89, the entropy is 0.5 bits. When $[\mathbf{H}]$ is 0.91, the entropy is about 0.436 bits. The rate of change of entropy with $[\mathbf{H}]$ is simply the derivative of $E(p)$ with respect to $p$, which is $-\log_{2}{(\frac{p}{1-p})}$: in magnitude, the log of the odds, sometimes called “log odds”. If someone were to try to assess these “diminishing returns”, they might compare the additional information needed to progress at a given probability to that needed to move from $[\mathbf{H}] = 0.50$ to $[\mathbf{H}] = 0.60$. So, let’s plot that:

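For readers who want to regenerate the curve themselves, here is a rough sketch that tabulates it as text. It assumes the plotted quantity is the log odds at each probability divided by the entropy drop from 0.50 to 0.60. That reading is my reconstruction, and the names `log_odds_bits`, `BASELINE` and `relative_information` are hypothetical.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome system with probabilities p and 1 - p."""
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def log_odds_bits(p: float) -> float:
    """Magnitude of dE/dp for p > 0.5, in bits per unit of probability."""
    return math.log2(p / (1.0 - p))

# Reference quantity: information (bits) needed to move from 0.50 to 0.60.
BASELINE = binary_entropy(0.50) - binary_entropy(0.60)

def relative_information(p: float) -> float:
    """Assumed plotted quantity: log odds at p divided by the baseline."""
    return log_odds_bits(p) / BASELINE

# Text table standing in for the plot.
print("   p    E(p)  log-odds  relative")
for p in (0.60, 0.70, 0.80, 0.89, 0.90, 0.91, 0.95, 0.99):
    print(f"{p:.2f}  {binary_entropy(p):.3f}  {log_odds_bits(p):8.3f}  "
          f"{relative_information(p):8.1f}")
```

The last column grows without bound as the probability approaches 1, which is the “diminishing returns” effect in numerical form.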

So, again, what this shows is how much additional information is needed per unit of probability of winning, relative to the information needed to improve the chances of winning from 0.50 to 0.60, plotted for various probabilities of winning. Highlighted on the figure is the 0.90 probability of winning, close to present estimates. There, the information needed per unit of probability (about $\log_{2}{9} \approx 3.17$ bits) is one hundred and nine times the roughly 0.029 bits needed to improve from a 50% chance of winning to a 60% chance.
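The arithmetic behind that multiple can be worked out by hand. This is a reconstruction, assuming $p = [\mathbf{H}]$ and that the figure plots the magnitude of the derivative of $E(p)$ divided by the entropy drop from 0.50 to 0.60. Differentiating the entropy, the two $\frac{1}{\ln 2}$ terms cancel:

$\frac{dE}{dp} = -\log_{2}{p} - \frac{1}{\ln 2} + \log_{2}{(1-p)} + \frac{1}{\ln 2} = -\log_{2}{(\frac{p}{1-p})}$

The reference amount of information is

$E(0.50) - E(0.60) \approx 1 - 0.97095 = 0.02905$ bits,

so at $p = 0.90$ the ratio is

$\frac{\log_{2}{(0.90/0.10)}}{0.02905} = \frac{\log_{2}{9}}{0.02905} \approx \frac{3.170}{0.02905} \approx 109$.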

So, what does this mean in the context of political forecasts or, for that matter, any forecasts?

First, as suggested by Mr Silver, once you are at $[\mathbf{H}] = 0.9$, the additional information or evidence needed to move it higher, to $[\mathbf{H}] = 0.91$ or $[\mathbf{H}] = 0.92$, is substantial. In fact, just going from 0.90 to 0.91 raises the per-unit requirement by almost six more multiples of the evidence needed to go from 0.50 to 0.60. This is true of any application. For example, to demonstrate that a given engineered system has a reliability of, say, 0.995 requires a lot of testing and a lot of work, and necessarily takes a long time, simply because that 0.995 criterion is way out there on the “evidence sill”.
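To put numbers on the step from 0.90 to 0.91, the sketch below compares it with the 0.50 to 0.60 baseline in two ways: the rise in the relative rate (log odds divided by the baseline, which is my assumed reading of the figure) and the actual entropy drop over that single point. The variable names are illustrative only.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome system with probabilities p and 1 - p."""
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# Information (bits) to move ten points, from 0.50 to 0.60.
baseline = binary_entropy(0.50) - binary_entropy(0.60)

# Information (bits) to move one point, from 0.90 to 0.91.
step = binary_entropy(0.90) - binary_entropy(0.91)
step_ratio = step / baseline

# Relative rate (log odds over baseline) at each end of that step.
rate_090 = math.log2(0.90 / 0.10) / baseline
rate_091 = math.log2(0.91 / 0.09) / baseline
rate_rise = rate_091 - rate_090

print(f"baseline 0.50 -> 0.60: {baseline:.4f} bits")
print(f"step     0.90 -> 0.91: {step:.4f} bits ({step_ratio:.2f} x baseline)")
print(f"relative rate: {rate_090:.1f} at 0.90, {rate_091:.1f} at 0.91, "
      f"rise of {rate_rise:.1f}")
```

Under these assumptions the relative rate rises by a little under six, and the single point from 0.90 to 0.91 costs slightly more information than the full ten points from 0.50 to 0.60.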

Second, this mathematical fact tends to downplay the significance of changes at high probabilities of winning. Going from 0.90 to 0.91 may not sound like a lot, but the information gathered to justify it is necessarily substantial.

Third, there are limits to political forecasting. These arise not because the models are poor, or the techniques are poor, but because there is only so much information available in political polls and other sources. These observations have their own variability or noise, and that limits their information content. At some point in the above figure, the information content of the polls or observations is exhausted, and whatever uncertainty remains is the best anyone can do. This isn’t to say polling could not be improved, or samples might not be larger, or more systematic surveys might not be taken to improve results, using stratified sampling and other techniques. (These are pretty standard anyway, although they can cost a lot of money.) It’s just that you cannot squeeze more out of a set of data than it has. It also means there are limits to what political forecasting can do, even for a group as talented as FiveThirtyEight.

Nevertheless, if a particular candidate, say H, has $[\mathbf{H}] = 0.91$, that’s pretty darn good, especially when you consider the amount of information needed to establish that, and what that means, for example, about evidence for their popularity among the public. And this is an insight which I don’t believe is made available by examining variance of Bernoulli variables or coefficients of variation, measures which seem inappropriate this far out on the Bernoulli tail.

If you’d like to learn more about this kind of thing, I recommend Professor John Baez’s series of posts on information geometry. It is a little mathematical, but the investment in time and mind assets is decidedly worth it. There are many analogies between information and entropy and physical processes. For example, borrowing from classical statistical mechanics in physics, information in this instance can be thought of as the additional cooling needed to bring a two-state system into a more rigid configuration, kind of like approaching Absolute Zero, at least with respect to entropy of perfect crystals.