Thursday, November 22, 2012

The Structure of a Scientific Revolution

A few days ago, Peter Coles posted an interesting comment about skepticism on his blog. In it, he reproduced one of the first plots presenting evidence, from observations of distant supernovae, for the acceleration of the expansion of the Universe. His point was that though the plot isn't much to look at, the statistical evidence for cosmic acceleration from supernova data was quickly established and soon became very widely accepted.

Coles goes on to make some points about global warming skepticism, and most of the long discussion in the comments that follow is concerned with this. That's unfortunate, because none of that discussion provides an exception to the general rule that arguments about global warming are a complete waste of time: people have already decided what they want to believe before they start, and are subsequently impervious to any evidence, logic or persuasion. So although you may want to read the discussion there, let's not talk about the climate here.

Instead, I want to stick with the cosmology and pick up on a question that is more interesting. Let's take a quick look at the picture of the supernovae data from the High-Z Supernova Search Team and the Supernova Cosmology Project:
[Figure: Evidence for dark energy from the Hubble diagram of supernovae. The top panel shows the distance modulus of the supernovae as a function of redshift, with three different theoretical model curves; the bottom panel shows the residual distance modulus relative to the model shown by the dotted curve (a Milne universe).]
As Coles points out, most people who are not cosmologists (and some who are) look at these data and find them deeply unconvincing. The error bars are large, the points are widely scattered, and at least the top two curves both seem to give reasonably good fits. On top of that, there appear to be two distinct groups of supernovae: one group rather close to us, and the other much further away at higher redshifts (and therefore observed as they were a long time ago). Any inference we draw from the relative calibration of these two groups rests on the assumption that deep down they are the same kind of objects, and that the Universe they existed in remained the same kind of Universe.
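To get a feel for what the curves in such a plot represent, here is a minimal sketch of how the theoretical distance modulus is computed as a function of redshift, using the standard luminosity-distance formulae for a flat ΛCDM model and for an empty (Milne) universe. The parameter values are illustrative, not those used in the original analyses.

```python
# A minimal sketch of the theoretical curves in a supernova Hubble diagram:
# the distance modulus mu(z) = 5 log10(d_L / 10 pc) for a flat Lambda-CDM
# model and for an empty (Milne) universe. Parameter values are illustrative.
import numpy as np
from scipy.integrate import quad

C_KM_S = 299792.458   # speed of light in km/s
H0 = 70.0             # Hubble constant in km/s/Mpc (assumed value)

def dl_lcdm(z, omega_m=0.3):
    """Luminosity distance in Mpc for a flat Lambda-CDM universe."""
    integrand = lambda zp: 1.0 / np.sqrt(omega_m * (1 + zp)**3 + (1 - omega_m))
    comoving, _ = quad(integrand, 0.0, z)
    return (1 + z) * (C_KM_S / H0) * comoving

def dl_milne(z):
    """Luminosity distance in Mpc for an empty (Milne) universe."""
    return (C_KM_S / H0) * z * (1 + z / 2.0)

def distance_modulus(d_l_mpc):
    """Distance modulus mu = 5 log10(d_L / 10 pc), with d_L in Mpc."""
    return 5.0 * np.log10(d_l_mpc * 1.0e6 / 10.0)

for z in (0.1, 0.5, 1.0):
    mu_acc, mu_empty = distance_modulus(dl_lcdm(z)), distance_modulus(dl_milne(z))
    print(f"z = {z:.1f}: mu(LCDM) = {mu_acc:.2f}, mu(Milne) = {mu_empty:.2f}, "
          f"difference = {mu_acc - mu_empty:+.2f} mag")
```

The striking thing is how small the effect is: at a redshift of 0.5 the accelerating model differs from the empty universe by only a tenth or two of a magnitude, which is comparable to the error bar on a single supernova in the plot above.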

So why exactly did this cause such a revolution in the field? How did so many physicists become convinced, almost overnight, of the existence of this mysterious dark energy, which constitutes most of the energy content of the Universe, but which we didn't understand then and don't understand now?

Saturday, November 10, 2012

Does Nate Silver get his maths wrong?

Although I haven't had much time to write anything new for this blog recently, like most people I know, I have of course been paying close attention to the unfolding circus of the US Presidential election. Now that all the action is done and dusted, there is a lot of discussion on the internet about the predictions made by poll aggregators, the best-known of whom is the New York Times blogger and statistician Nate Silver. And although I'm not going to talk about politics, I do want to make a comment about probabilities and the statistics of poll aggregation.

The reason Silver has been in the news so much is because even when a lot of innumerate and possibly partisan journalists were making a huge amount of noise about "Romneymentum" and predicting a landslide Republican victory, Silver calmly analysed the actual polls and stated that the odds continued to favour an Obama victory. For this he attracted an enormous amount of flak in the right-wing press and blogs, and a substantial section of the mainstream media refused to believe him. (See here if you'd like a frequently-updated list of everyone who attacked Silver and ended up looking stupid.) Even on election eve, when Silver gave Obama more than a 90% chance of victory, such impartial websites as the BBC continued to describe the result as being on a "knife-edge".

Of course, now that Obama has won, the pendulum has swung the other way, and Silver is being hailed as a mathematical genius, a wizard, a guru and much else besides. This adulation rather ignores the fact that there were several other bloggers and poll aggregators who also predicted an Obama win: among them Sam Wang at the Princeton Election Consortium, Drew Linzer at Votamatic, Pollster at the Huffington Post, and Simon Jackman. (To be fair, Silver does have a much more prominent position than the others, being at the New York Times. He also provided very insightful regular explanations and updates, and some nice interactive graphics to help explore the data, so one can forgive the extra attention given to him.)

But the interesting question is not whether Nate Silver does a better job of predicting elections than, say, Donald Trump; it is how he compares with other, equally numerate and mathematically minded poll-aggregating bloggers. And there really is a difference between them: for a large part of October, Silver rated Obama's chances of re-election at only around 65-75%, whereas Sam Wang regularly put them at more like 99%. Or have a look at Votamatic, where the predicted median of electoral college votes for Obama had been fairly constant at 332 (the correct number) since July, compared to Silver's prediction of closer to 290 at the same time. You'll also notice, if you play with the graphics at FiveThirtyEight, that the margin of error Silver quoted on his prediction was significantly larger than those the other forecasters gave.

So in summary, Nate Silver made more conservative predictions and was more pessimistic about Obama's re-election chances than his direct competitors. (Why so? As I understand it, Wang and Linzer only include information from published poll data, appropriately weighted for freshness and sample size. Silver also included what he calls "fundamentals" — his estimate of other underlying factors, such as economic data, which might influence the final outcome. Silver's exact model is proprietary, but for an educated guess at how it works, explained in layman's terms, see here.) In the end, though, Obama was re-elected, as all the forecasters said he would be. Since the only differences between them lay in exactly how confident they were in their predictions (and also whether they thought he would win Florida ... ), and the election can only be run once, how do you decide whose model is better?
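As a concrete (and heavily simplified) illustration of what a poll-only aggregate might look like, here is a toy sketch in which polls are weighted by sample size and down-weighted as they age. The polls, the weighting rule and the half-life are all invented for the example and bear no relation to anyone's actual model.

```python
# A toy poll-only aggregator: combine polls with weights that favour larger
# samples and fresher polls. Everything here (the polls, the weighting rule,
# the half-life) is invented for illustration; real models are far more
# sophisticated.
import math

# Hypothetical polls: (Obama share of two-party vote in %, sample size, days old)
polls = [
    (50.5, 800, 2),
    (49.0, 1200, 5),
    (51.2, 600, 1),
    (50.1, 1000, 8),
]

HALF_LIFE_DAYS = 7.0  # assumption: a poll's weight halves every week

def poll_weight(sample_size, days_old):
    """Weight proportional to sample size, decayed with the poll's age."""
    return sample_size * 0.5 ** (days_old / HALF_LIFE_DAYS)

weights = [poll_weight(n, age) for _, n, age in polls]
total_weight = sum(weights)
estimate = sum(w * share for (share, _, _), w in zip(polls, weights)) / total_weight

# Rough standard error, treating each poll's sampling error as binomial
# and the polls as independent.
variances = [(s / 100) * (1 - s / 100) / n * 1e4 for s, n, _ in polls]  # in %^2
se = math.sqrt(sum((w / total_weight) ** 2 * v for w, v in zip(weights, variances)))

print(f"Aggregated estimate of Obama's vote share: {estimate:.1f}% +/- {se:.1f}%")
```

One would then have to turn a vote-share estimate like this into a win probability, state by state, and that is roughly where the interesting differences between the forecasters begin.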

A common argument you may have heard in favour of Silver before Obama's most recent victory is that he correctly predicted 49 out of 50 state results in 2008. Add to that his predictions this year and his tally is now 99 correct out of 100. This is obviously impressive, and clearly you should trust Silver and his maths more than basically every conservative hack who attacked him.

But the problem is that Silver may be too accurate. To put it bluntly, if you say the probability of your prediction being correct is around 70%, you should expect to be wrong 30% of the time, not 1% of the time. So if you predict the right result 99% of the time but quote probabilities of only 70%, you are probably overstating the error bars on your prediction. On the other hand, if someone else says he is 99% sure and is right 99% of the time, his model might be better.
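To put a rough number on that intuition, pretend for a moment that all 100 state calls were independent and each was made with 70% confidence (which, as discussed below, is not literally true). The chance of then getting at least 99 of them right is vanishingly small:

```python
# How plausible is it to get at least 99 of 100 predictions right if each
# was only 70% likely to be correct? This treats the 100 state calls as
# independent 70% predictions, which is a deliberate oversimplification.
from scipy.stats import binom

n_calls = 100
p_99_at_70 = binom.sf(98, n_calls, 0.70)  # P(X >= 99) with 70% per call
p_99_at_99 = binom.sf(98, n_calls, 0.99)  # same, for a well-calibrated 99% forecaster

print(f"P(>= 99 correct out of 100 at 70% each): {p_99_at_70:.1e}")
print(f"P(>= 99 correct out of 100 at 99% each): {p_99_at_99:.2f}")
```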

OK, some obvious caveats. It is misleading to count all 50 state predictions as successes for Silver's model. I doubt even Donald Trump would have had trouble predicting Utah would go Republican and New York would go Democrat. In reality, there were only around 10 or 11 states that were at all competitive in each election cycle, so adjusting for that would mean lowering Silver's actual success percentage. Also, his model didn't predict all results with equal confidence — for instance, maybe he gave only a 75% probability for Obama to win Ohio, but a 99% probability for him to win Minnesota.

How should one account for this in quantitatively comparing models? Unsurprisingly, Sam Wang himself suggests a method, which is to assign each prediction a Brier score. This takes the square of the difference between the predicted probability of an event occurring — e.g., 0.75 for a prediction of 75% probability — and the post facto probability (i.e. 1 if it did happen, 0 if it didn't). In the case of multiple predictions (for multiple states), you then average the Brier scores of the individual predictions. Thus the Brier score punishes you for getting a prediction wrong, but it also punishes you for hedging your bets too much, even if you got the end result right. Getting a Brier score of exactly 0 would mean you knew the results in advance, while getting a score of 0.25 is equivalent to just guessing randomly (if there are only two possible outcomes).
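To see the mechanics, here is a small worked example with invented win probabilities for five hypothetical competitive states; none of these numbers are Silver's or Wang's actual forecasts.

```python
# Brier scores for two invented sets of state-level win-probability forecasts.
# All probabilities and outcomes below are made up for illustration.

def brier_score(forecasts, outcomes):
    """Mean of (p - o)^2, where p is the forecast probability and o is 1
    if the event happened, 0 if it didn't."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes  = [1, 1, 1, 0, 1]                  # 1 = Obama won the state
hedging   = [0.75, 0.70, 0.80, 0.40, 0.65]   # cautious forecaster
confident = [0.95, 0.99, 0.97, 0.10, 0.90]   # confident forecaster
coin_flip = [0.50] * 5                       # pure guessing

print("Hedging forecaster:  ", brier_score(hedging, outcomes))    # 0.095
print("Confident forecaster:", brier_score(confident, outcomes))  # 0.0047
print("Random guessing:     ", brier_score(coin_flip, outcomes))  # 0.25
```

Both forecasters call all five states the right way, but the confident one gets the much better (lower) score; that is precisely the sense in which the Brier score penalizes hedging.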

Have a look at Wang's result tables: it looks as though he did slightly better than Silver in predicting the Presidential race, and much better in predicting the Senate results. As he rather pithily put it,
additional factors used by FiveThirtyEight – “fundamentals” – may have actively hurt the prediction. This suggests that fundamentals are helpful mainly when polls are not available.
Anyway, the point of this post, despite the deliberately provocative title, was not really to attack Nate Silver. Even if he did hedge his bets and overstate his error bars a little, given his very public platform and the consequent stakes, I think that's understandable. The Princeton Election Consortium and Votamatic really don't have the same exposure, so Wang and Linzer would have faced less general opprobrium if they had made a mistake. Also, perhaps more statistics are needed to really make a judgement — though let's not have another election for some time please!

But I think this does illustrate a subtle point about probabilistic predictions that most of the media seem to have missed, so it is worth pointing out. And I feel a little personal guilt for being too pessimistic to believe Sam Wang before last Wednesday!