Saturday, November 10, 2012

Does Nate Silver get his maths wrong?

Although I haven't had much time to write anything new for this blog recently, like most people I know, I have of course been paying close attention to the unfolding circus of the US Presidential election. Now that all the action is done and dusted, there is a lot of discussion on the internet about the predictions made by poll aggregators, the best-known of whom is the New York Times blogger and statistician Nate Silver. And although I'm not going to talk about politics, I do want to make a comment about probabilities and the statistics of poll aggregation.

The reason Silver has been in the news so much is that, even when a lot of innumerate and possibly partisan journalists were making a huge amount of noise about "Romneymentum" and predicting a landslide Republican victory, Silver calmly analysed the actual polls and stated that the odds continued to favour an Obama victory. For this he attracted an enormous amount of flak in the right-wing press and blogs, and a substantial section of the mainstream media refused to believe him. (See here if you'd like a frequently-updated list of everyone who attacked Silver and ended up looking stupid.) Even on election eve, when Silver gave Obama more than a 90% chance of victory, such impartial websites as the BBC continued to describe the result as being on a "knife-edge".

Of course, now that Obama has won, the pendulum has swung the other way, and Silver is being hailed as a mathematical genius, a wizard, a guru and much else besides. This adulation rather ignores the fact that several other bloggers and poll aggregators also predicted an Obama win, among them Sam Wang at the Princeton Election Consortium, Drew Linzer at Votamatic, Pollster at the Huffington Post and Simon Jackman. (To be fair, Silver does have a much more prominent position than the others, being at the New York Times. He also provided very insightful regular explanations and updates, and some nice interactive graphics to help explore the data, so one can forgive the extra attention given to him.)

But the interesting question is not whether Nate Silver does a better job of predicting elections than, say, Donald Trump. How does he compare with other, equally numerate and mathematically minded poll-aggregating bloggers? And there really is a difference between them: for a large part of October, Silver rated Obama's chances of re-election at only around 65-75%, whereas Sam Wang regularly claimed it was more like a 99% chance. Or have a look at Votamatic, where the predicted median of electoral college votes for Obama has been fairly constant at 332 (the correct number) since July, compared to Silver's prediction of closer to 290 at the same time. You'll also notice, if you play with the graphics at FiveThirtyEight, that the margin of error Silver quoted on his prediction was significantly larger than those the other forecasters gave.

So in summary, Nate Silver made more conservative predictions and was more pessimistic about Obama's re-election chances than his direct competitors. (Why so? As I understand it, Wang and Linzer only include information from published poll data, appropriately weighted for freshness and sample size. Silver also included what he calls "fundamentals": his estimate of other underlying factors, such as economic data, which might influence the final outcome. Silver's exact model is proprietary, but for an educated guess at how it works, explained in layman's terms, see here.) In the end though, Obama was re-elected, as all the forecasters said he would be. Since the only differences between them lay in exactly how confident they were in their prediction (and also whether they thought he would win Florida ... ), and the election can only be run once, how do you decide whose model is better?

A common argument you may have heard in favour of Silver before Obama's most recent victory is that he correctly predicted 49 out of 50 state results in 2008. Add to that his predictions this year and his tally is now 99 correct out of 100. This is obviously impressive, and clearly you should trust Silver and his maths more than basically every conservative hack who attacked him.

But the problem is that Silver may be too accurate. To put it bluntly, if you say the probability of your prediction being correct is around 70%, you should expect to be wrong 30% of the time, not 1% of the time. So if you predict the right result 99% of the time but quote probabilities of only 70%, you are probably overstating the error bars on your prediction. On the other hand, if someone else says he is 99% sure and is right 99% of the time, his model might be better.
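
To get a feel for how unlikely that would be, here is a minimal sketch in Python. It assumes, purely for illustration, that all 100 state calls were made with the same stated confidence and were independent of one another; neither assumption holds for any real forecast, and the function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Illustrative assumption: 100 independent calls, all at the same confidence.
for stated in (0.70, 0.99):
    chance = prob_at_least(99, 100, stated)
    print(f"stated confidence {stated:.0%}: "
          f"P(at least 99 of 100 right) = {chance:.3g}")
```

Under these simplifying assumptions, a forecaster who really is right only 70% of the time would almost never score 99 out of 100, whereas one who is right 99% of the time would manage it roughly three times in four.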

OK, some obvious caveats. It is misleading to count all 50 state predictions as successes for Silver's model. I doubt even Donald Trump would have had trouble predicting Utah would go Republican and New York would go Democrat. In reality, there were only around 10 or 11 states that were at all competitive in each election cycle, so adjusting for that would mean lowering Silver's actual success percentage. Also, his model didn't predict all results with equal confidence — for instance, maybe he gave only a 75% probability for Obama to win Ohio, but a 99% probability for him to win Minnesota.

How should one account for this when comparing models quantitatively? Unsurprisingly, Sam Wang himself suggests a method, which is to assign each prediction a Brier score. This takes the square of the difference between the predicted probability of an event occurring (e.g., 0.75 for a prediction of 75% probability) and the post facto probability (i.e. 1 if it did happen, 0 if it didn't). In the case of multiple predictions (for multiple states), you then average the Brier scores of the individual predictions. Thus the Brier score punishes you for getting a prediction wrong, but it also punishes you for hedging your bets too much, even if you got the end result right. Getting a Brier score of exactly 0 would mean you knew the results in advance; a score of 0.25 is what you would get by assigning a 50% probability to every outcome, which is no better than tossing a coin (if there are only two possible outcomes).
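
The calculation described above can be sketched in a few lines of Python. This is only an illustration of the averaging formula as I have described it, not Wang's actual code; the function name `brier_score` and the three-state example are made up.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and outcomes.

    probabilities: forecast probability that each event happens (0 to 1).
    outcomes: 1 if the event happened, 0 if it did not.
    """
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equally long, non-empty lists")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Made-up example: three states, all won by the candidate being forecast.
outcomes = [1, 1, 1]
confident = [0.99, 0.99, 0.99]  # hypothetical confident forecaster
hedged = [0.75, 0.75, 0.75]     # hypothetical cautious forecaster

print("confident:", brier_score(confident, outcomes))
print("hedged:   ", brier_score(hedged, outcomes))
print("coin toss:", brier_score([0.5, 0.5, 0.5], outcomes))
```

Both made-up forecasters called every state correctly, yet the hedged one gets the worse (higher) score, which is the sense in which the Brier score punishes excessive caution.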

Have a look at Wang's result tables: it looks as though he did slightly better than Silver in predicting the Presidential race, and much better in predicting the Senate results. As he rather pithily put it,

"additional factors used by FiveThirtyEight – 'fundamentals' – may have actively hurt the prediction. This suggests that fundamentals are helpful mainly when polls are not available."

Anyway, the point of this post, despite the deliberately provocative title, was not really to attack Nate Silver. Even if he did hedge his bets and overstate his error bars a little, given his very public platform and the consequent stakes, I think that's understandable. The Princeton Election Consortium and Votamatic really don't have the same exposure, so Wang and Linzer would have faced less general opprobrium if they had made a mistake. Also, perhaps more statistics are needed to really make a judgement (though let's not have another election for some time please!).

But I think this does illustrate a subtle point about probabilistic predictions that most of the media seem to have missed, so it is worth pointing out. And I feel a little personal guilt for being too pessimistic to believe Sam Wang before last Wednesday!


  1. One thing to keep in mind (and this "Brier score" doesn't) is that each one of Nate's state probabilities isn't independent of the rest. A lot of those states that were less than 90% certainties would probably have either *all* been Obama or *all* been Romney in most of the individual runs of Nate's Monte Carlo simulation.

    Also, Sam Wang had a 100% prior that the polls were correct (as you mention in your post). If Nate had done that he'd have had the same near-certain probabilities. There will be an election at some point in the future where the polls are wrong and thus Sam will get most of the swing states wrong and his Brier score will be very poor, whereas (despite also getting them wrong) Nate's won't be as bad because of his conservatism.

    I honestly don't think it is really all that possible to properly judge the accuracy of these models without many more elections (because of the non-independence of each state/senate race there just isn't enough data).

    Nice post though. I felt like you were reading my mind as I read it (then again, most people with a basic understanding of statistics might feel the same).

    1. Generally speaking, I agree that state polls could have been correlated and this should have been accounted for in the probabilities. When I was too pessimistic to believe Sam Wang, part of the reason for that was I had read other sensible people saying the same thing about his method.

      He does have a post on his website addressing this criticism. His answer is that adding a correlation to the model does not change his median prediction, and only changes the error bars slightly, so it constitutes unnecessary complication. The explanation is written in rather simplistic terms though, and as I have no detailed knowledge of his model I can't evaluate his claim.

      I agree you need more election data to really make a sensible judgement about relative merits. If you count presidential elections in each state, plus senate elections, house elections and so on, then it shouldn't take too long to get a reasonably large N.

    2. I disagree regarding how long it takes to get large N. The individual state results (even the senate ones, though less so) are correlated, so there isn't that much independent data per election.

      Nate gave a 10% probability that Obama would lose. This was entirely due to the possibility that the state polls were correlated and systematically wrong across the board. Without ~10 *national* elections I can't see how Nate vs Sam can be resolved. That's 40 years. Perhaps I could accept 20 years for some indication to develop. I don't know, maybe that actually isn't so long! (Note that any a posteriori application of these models to previous elections isn't acceptable because the models were built based on those elections.)

    3. I sort of agree and disagree. To sort out the question of correlation of polls might take that long (except of course remember that after each election you can compare the poll margins with the actual vote percentage and so learn more about the existence of any actual correlated bias). But the question of whether it is in general worthwhile including extra information via "fundamentals" can probably be disentangled much more quickly.

      Of course I'm assuming science-like cooperation and information-sharing about the models, which won't be happening :-)

  2. In events like elections, where there is only one chance of testing the model, election results cannot be the basis for assessing the validity of the model. All statistical models have structural elements that underpin the relationships between the variables. It would be possible to verify these relationships from real data with good statistical confidence. Verifying the fit of such data would be a better way to evaluate the validity of the models. Unfortunately, since Nate Silver's model is proprietary, we have no way of knowing it. Nate Silver would surely be sharpening his knife for the next iteration.