This week I stumbled across a fascinating paper by William H. Press and Freeman J. Dyson titled "Iterated Prisoner's Dilemma contains strategies that dominate any evolutionary opponent," published in the Proceedings of the National Academy of Sciences. I think it is fair to say it is the most interesting thing I have read this year. Press and Dyson have somehow discovered something completely new and profound in a field that has already been studied extensively for decades, and have written a beautifully clear and concise paper that requires only some elementary linear algebra to understand. This is the kind of paper we would all love to write, and I've found it hard to stop thinking about it since I first read it a few days ago.
(Incidentally, there is also a physics connection here. Although Press is a computational biologist at the University of Texas, he happens to have written several papers in theoretical astrophysics and cosmology, including a famous one with Paul Schechter in 1974, which is one of the most highly cited theoretical papers in all of cosmology. He is also one of the authors of the Numerical Recipes books, which anyone who has ever done any programming has probably used at some point. Dyson is best known to physicists for his work on the theory of quantum electrodynamics, in particular for being the first person — other than Feynman — to recognise the importance of the Feynman diagram. According to Steven Weinberg he was "fleeced" of a Nobel. He's also a remarkable 88 years old. You can read a beautiful character portrait of this complicated man here.)
Returning to the paper itself, let me quote from the abstract:

"The two-player Iterated Prisoner’s Dilemma game is a model for both sentient and evolutionary behaviors, especially including the emergence of cooperation. It is generally assumed that there exists no simple ultimatum strategy whereby one player can enforce a unilateral claim to an unfair share of rewards. Here, we show that such strategies unexpectedly do exist. In particular, a player X who is witting of these strategies can (i) deterministically set her opponent Y’s score, independently of his strategy or response, or (ii) enforce an extortionate linear relation between her and his scores. Against such a player, an evolutionary player’s best response is to accede to the extortion. Only a player with a theory of mind about his opponent can do better, in which case Iterated Prisoner’s Dilemma is an Ultimatum Game."

I assume most people reading this have probably encountered the Prisoner's Dilemma before. If not, this is a nice short introduction to the scenario, or read this for much more depth. It is an example of a general set of two-player games in which the players have to decide whether to cooperate or defect, with different payoffs depending on what they and their opponent decide to do. The outcomes can be summarised in a diagram like this (unlike in the classic PD, here high scores are desirable):

[Figure: Prisoner's Dilemma: the canonical payoff matrix]
A single-shot PD game is not very interesting: whether your opponent chooses to defect or cooperate, it always pays you to defect. Of course your opponent works out the same thing, so you both defect, though you'd both have been better off if you'd cooperated (hence the dilemma).
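To make the dilemma concrete, here is a minimal sketch in Python (my own illustration, not anything from the paper), assuming the conventional payoff values (R, S, T, P) = (3, 0, 5, 1) that Press and Dyson also use. It simply checks that defection is the better reply to either of the opponent's choices:

```python
# Conventional Prisoner's Dilemma payoffs: R = 3 (mutual cooperation),
# S = 0 (sucker), T = 5 (temptation), P = 1 (mutual defection).
# payoff[(my_move, their_move)] is what *I* score in a single round.
payoff = {('C', 'C'): 3, ('C', 'D'): 0,
          ('D', 'C'): 5, ('D', 'D'): 1}

# Whatever the opponent plays, defecting pays strictly more ...
for their_move in ('C', 'D'):
    print(f"vs {their_move}: defect -> {payoff[('D', their_move)]}, "
          f"cooperate -> {payoff[('C', their_move)]}")

# ... and yet mutual defection (1 point each) is worse than mutual
# cooperation (3 points each): hence the dilemma.
```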
Once you allow the possibility of repeated games, however, you allow a much wider range of possible strategies, because a player's move in one round can influence his opponent's move in the next. Players who know they will play each other again several times (but do not know exactly how many more times) will achieve a far higher score if they agree to cooperate than if they always defect. Being nice to each other as a strategy is beneficial. But not too nice — an overly generous cooperator (or a forgetful player who doesn't remember his opponent's last move) can be taken advantage of by someone who sneakily defects every so often. A nice metaphor for many real-life situations perhaps, and indeed the iterated Prisoner's Dilemma has been used to model behaviour in all sorts of fields, from evolutionary biology to economics and psychology.
So what is the best strategy in the IPD game? In his book The Evolution of Cooperation, Robert Axelrod described a computer tournament he conducted: colleagues were invited to send in different strategies that they thought would do well at the game, and he simulated the games to see which strategy might confer an evolutionary advantage. Of all the strategies tested the best one was found to be "tit-for-tat": a player starts out cooperating and thereafter simply copies her opponent's last move, thereby punishing any defection. This is a strategy that is not greedy, not overly vengeful, does well with other cooperative players, but is not so nice as to allow itself to be taken advantage of. In other words, it was the IPD embodiment of the Golden Rule. Tit-for-tat has long been considered the most robust basic strategy.
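As code, tit-for-tat is almost a one-liner; here is a minimal sketch (the function is my own illustration):

```python
def tit_for_tat(opponent_last_move):
    """Cooperate on the first round (signalled here by None), then
    simply echo whatever the opponent did in the previous round."""
    return 'C' if opponent_last_move is None else opponent_last_move

# Against a one-off defector, TFT punishes exactly once and then
# returns to cooperation:
their_moves = [None, 'C', 'D', 'C', 'C']
print([tit_for_tat(m) for m in their_moves])  # ['C', 'C', 'D', 'C', 'C']
```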
Now Press and Dyson have shown that this isn't true. In fact, they've found a whole class of other strategies, called zero-determinant (ZD) strategies, which do better than TFT, or indeed any other strategy. Along the way, they've also discovered some unusual properties of these strategies. I'll simply describe some of them without going into any mathematical details, but if you know a little linear algebra the paper itself is very easy to follow.
First of all, they show that it doesn't help to remember anything more than the outcome of the most recent round — extra memory confers no extra advantage in these iterated 2×2 games. Then they show that it is possible to adopt a strategy for choosing whether to cooperate or defect which exactly determines your opponent's long-run score — irrespective of what strategy your opponent plays. Of course if your opponent were to play the same strategy he would determine your score, and you'd have no influence over that ...
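Here is a rough simulation of that claim (my own construction, not code from the paper), again assuming payoffs (R, S, T, P) = (3, 0, 5, 1). A memory-one strategy is just four probabilities of cooperating, one for each possible outcome of the previous round (CC, CD, DC, DD, own move listed first). The particular "equalizer" strategy below can be derived from Press and Dyson's framework and should pin the opponent's long-run average score to exactly 2, whatever the opponent does:

```python
import random

R, S, T, P = 3, 0, 5, 1  # conventional payoff values

# payoffs[(x, y)] = (X's score, Y's score) for one round.
payoffs = {('C', 'C'): (R, R), ('C', 'D'): (S, T),
           ('D', 'C'): (T, S), ('D', 'D'): (P, P)}

# X's "equalizer": the probability of cooperating after each possible
# previous outcome (X's own move first).  These values satisfy the
# Press-Dyson zero-determinant condition and fix Y's score at 2.
equalizer = {('C', 'C'): 2/3, ('C', 'D'): 0.0,
             ('D', 'C'): 2/3, ('D', 'D'): 1/3}

def simulate(p_X, q_Y, rounds=200_000, seed=1):
    """Average per-round scores when two memory-one strategies play an
    iterated game.  Each strategy is a dict mapping the previous round
    (own move, opponent's move) to a probability of cooperating."""
    rng = random.Random(seed)
    x, y = 'C', 'C'  # arbitrary opening round
    total_x = total_y = 0
    for _ in range(rounds):
        sx, sy = payoffs[(x, y)]
        total_x += sx
        total_y += sy
        x, y = ('C' if rng.random() < p_X[(x, y)] else 'D',
                'C' if rng.random() < q_Y[(y, x)] else 'D')
    return total_x / rounds, total_y / rounds

# Three very different opponents; Y's average always comes out at ~2.
always_coop   = {k: 1.0 for k in equalizer}
always_defect = {k: 0.0 for k in equalizer}
tft = {('C', 'C'): 1.0, ('C', 'D'): 0.0,
       ('D', 'C'): 1.0, ('D', 'D'): 0.0}
for name, q in [('ALLC', always_coop), ('ALLD', always_defect),
                ('TFT', tft)]:
    sx, sy = simulate(equalizer, q)
    print(f"{name:4s}  X averages {sx:.3f},  Y averages {sy:.3f}")
```

Note that the equalizer controls only the opponent's score: X's own average still varies from opponent to opponent (roughly 3.7 against an unconditional cooperator, 0.75 against an unconditional defector).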
And then they show that it is also possible to play a very successful 'extortionate' strategy: you determine the maximum score your opponent can achieve, and can set it quite low. But in order to achieve even this he must always cooperate, thereby giving you a much higher score. Your opponent may realise he's getting an unfair deal, but the only way he can punish you for it is by hurting himself. In fact the limit to the score you can achieve comes from your own greed: if you demand too much, eventually your opponent can do no worse by always defecting and so will stop cooperating altogether.
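Reusing the simulate sketch from above, here is one extortionate example (my own derivation from the paper's formalism, with extortion factor 3): X enforces the linear relation s_X - P = 3(s_Y - P), so any surplus the opponent earns above the mutual-defection payoff hands X three times as much:

```python
# An extortionate ZD strategy with extortion factor chi = 3: in the
# long run it enforces (s_X - P) = 3 * (s_Y - P).
extortioner = {('C', 'C'): 11/13, ('C', 'D'): 1/2,
               ('D', 'C'): 7/26,  ('D', 'D'): 0.0}

for name, q in [('ALLC', always_coop), ('ALLD', always_defect),
                ('TFT', tft)]:
    sx, sy = simulate(extortioner, q)
    print(f"{name:4s}  X: {sx:.3f}  Y: {sy:.3f}")

# Cooperating unconditionally is the best Y can do (~1.91 per round),
# but it pays X almost twice as much (~3.73); refusing to cooperate
# just drags both players down to the mutual-defection score of 1.
```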
So far so brilliant, but also so distressing. TFT seemed like a mathematical moral: you will do best if you play fair. But now it seems that this isn't the case: playing an unfair extortionate strategy lets you do even better. TFT is merely a special case of ZD strategies, one that is fair and therefore overly generous. (Of course it isn't sensible to read too much into the parallel between these simple games and human interactions. For one thing, 2×2 games lack an external rule of law, the threat of which prevents people from being too greedy.)
However, what happens if one ZD player plays another? If I set my strategy to make twice as much as you and you set yours to make twice as much as me, neither of us makes anything!1 ZD players will need to arrive at a 'treaty' to do well. I can determine your score and you can't change that, but equally you can set mine. We can therefore agree on an equitable distribution of the points, and neither of us has any motive to defect from the treaty, since doing so will not improve our own scores. Should one player nevertheless defect, his opponent can tell, and immediately punish him. Ultimately, fair play is not just rewarded but enforced. In a sense, TFT is regained (though unlike in TFT, each player's score can be deterministically set to the maximum possible).
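The footnote at the end spells out the arithmetic, but you can also check it with the simulation sketch above: below, both players run the same extortionate strategy (a factor-2 variant, each demanding two-thirds of the surplus; again my own derivation), and the only consistent outcome is that there is no surplus at all.

```python
# Two copies of a chi = 2 extortioner: each tries to enforce
# (own score - P) = 2 * (other's score - P), and both relations can
# only hold at once if both scores equal P = 1.
greedy = {('C', 'C'): 0.8, ('C', 'D'): 0.1,
          ('D', 'C'): 0.6, ('D', 'D'): 0.0}
sx, sy = simulate(greedy, greedy)
print(f"ZD vs ZD:  X: {sx:.3f}  Y: {sy:.3f}")  # both ~1.000, i.e. P
```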
Of course, as Press and Dyson note, arriving at such a treaty requires what they call "a theory of mind". If you play an extortionate ZD strategy against me, I have to hurt myself in order to punish you. This is only worth doing if I think you will notice that I am punishing you and therefore repent! Otherwise I may as well accept my fate and take what crumbs you offer. This reduces IPD to a game of chicken, so the really successful strategy is to set your ZD algorithm once and then "go for lunch".
What will be interesting to see is whether ZD strategies are actually evolutionarily viable, i.e. whether, in a population of players with different strategies, ZD can grow to become the dominant one (it's pretty obvious that in a population already dominated by ZD, no other strategy has a chance).
There are lots more nuggets to uncover in the paper, but soon I'll have written more than Press and Dyson themselves so I'd better stop here. It's a very interesting thing to play with though!
For some further discussion, including interviews with the authors, you may want to read this. There is also this follow-up paper.
1 This sentence is a bit loose and could be misleading. What I mean is that I set my strategy such that if you did what would be best for you and always cooperated, I would get two-thirds of the extra payoff we earn above what we would get if we both always defected. Except of course you are also playing the same strategy, so there is no extra payoff: our long-run scores are the same as if we always defected against each other.