Are Ratings Inflated? Some Evidence For A Negative Answer
Most of us - and that would include many elite players with a vested psychological interest in a negative answer - are inclined to say yes. In fact, it's not just "yes" but "yes, obviously; everyone knows this".
It turns out that "everyone" may be wrong. IM Ken Regan, GM Bartlomiej Macieja and Guy Haworth have co-authored an academic paper (Regan and Haworth are computer scientists, Macieja studied physics) arguing that ratings have remained stable since their inception. Their paper offers an objective method to examine the issue, and the data tested by that method support their surprising conclusion.
To grossly oversimplify, their test procedure is to examine players' games with a computer engine, comparing the moves actually played (post-opening) against the engine's choices. This generates what they call an Intrinsic Performance Rating (IPR), and in a large sample ratings and IPRs correlate over time; ergo, no inflation.
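To make that a bit more concrete, here is a minimal sketch, in Python with the python-chess library, of the general idea of scoring moves against an engine. It is emphatically not the paper's actual method (the IPR comes from fitting a statistical model to full engine evaluations); the engine path, the fixed search depth, and the crude opening cutoff are all assumptions of mine:

```python
import chess
import chess.engine
import chess.pgn

ENGINE_PATH = "stockfish"              # assumed to be on your PATH
LIMIT = chess.engine.Limit(depth=13)   # arbitrary fixed depth

def average_loss(pgn_path, skip_plies=24):
    """Mean centipawn loss over all moves after a crude opening cutoff."""
    losses = []
    with open(pgn_path) as f, chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as eng:
        while (game := chess.pgn.read_game(f)) is not None:
            board = game.board()
            for move in game.mainline_moves():
                if board.ply() >= skip_plies:
                    # eval before the move, from the mover's point of view
                    best = eng.analyse(board, LIMIT)["score"].relative.score(mate_score=10000)
                    board.push(move)
                    # eval after the move is from the opponent's point of view,
                    # so the mover's loss = best + after, clamped at zero
                    after = eng.analyse(board, LIMIT)["score"].relative.score(mate_score=10000)
                    losses.append(max(0, best + after))
                else:
                    board.push(move)
    return sum(losses) / len(losses) if losses else None
```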
Don't buy it? That's what the paper is for: check out all the gritty details for yourself!
Reader Comments (14)
Unless the authors have come up with a way to isolate known theory from all the games they analyzed, the IPR will be biased against players of previous generations. To put it another way, the IPR itself will inflate. Computer-aided theory has grown enormously (in some cases out to move 35 or 40), and hence current players will have an unfair advantage over older generations if their strength is evaluated by computer programs.
Hi Hari - you really should read the paper first ;-)
I have considered that argument, but the essential point is this: our measures are concerned only with the quality of the moves made on the board, regardless of their provenance. Players have only their grey matter at the board (well, I have a separate mission to help ensure this, which this work serves)---if it is well-stocked with ECO's and engine-prepared lines, so much the better.
Could Lasker have competed if he'd had these resources, and our better nutrition nowadays? Sure!---and I've used his mathematics (primary ideal decompositions) in my research. But all I can judge are the moves he actually made. Capablanca may need no such "excuse"---the older version of the paper has historical performance-rating estimates and marks Capa's 2936 for New York 1927 as "not a typo"!
Your point is also addressed by our chart showing basically the same non-inflationary error pattern for Moves 17--32 only. Yes, prep in some lines may go past Move 30, but Jeff Sonas' recent article on ChessBase.com here puts the Average Novelty Depth by 2010-11 at 22 ply (Black's Move 11/White's Move 12) for 2200+ players, and 26 ply (Black's Move 13/White's Move 14) for top-20 players. The standard deviations are such that Move 17 is novel for the great majority of games, and there have been quite a few annotations recently in New In Chess and online of players in the super-tournaments being on their own by the late teens. Thank you for the interest...
[DM: Ken, maybe the average novelty for the top 20 comes after 26 plies, but it's at least as important to know when the preparation behind it finished. That information may not be easy to obtain, but here are a few possibilities:
1. The obvious one: read (or view) annotations by one or more of the players - sometimes they'll provide that data.
2. Create little databases for each player and look for trends in where, if anywhere, their IPR levels vary post-novelty. Maybe their next 5-6 moves after the novelty (at least if they initiated the new move) are highly accurate by the computer's standards, but then start dropping off. If so, you might then decide to evaluate their IPRs only at TN + 12 ply or something like that (see the sketch after this list).
3. Compare how players' IPRs fare when they deliver the TN and when they receive it. Sometimes there won't be much difference because the new move was easy to predict, but in general I'd expect some statistically significant differences to emerge.
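Here's a hypothetical sketch of suggestion 2: given each game's per-move engine losses for one player, aligned at that player's first move after the novelty, average the loss in small windows of plies to see where accuracy starts dropping off. The data layout and window size are invented purely for illustration.

```python
from collections import defaultdict

def post_novelty_profile(aligned_losses, window=2):
    """aligned_losses: one list per game of the player's centipawn losses,
    starting from their first move after the novelty."""
    buckets = defaultdict(list)
    for losses in aligned_losses:
        for offset, loss in enumerate(losses):
            buckets[offset // window].append(loss)
    # mean loss per window of plies after the TN
    return {w: sum(v) / len(v) for w, v in sorted(buckets.items())}
```

A flat profile over the first dozen or so plies followed by a jump would support evaluating IPRs only from TN + 12 ply onward, as suggested above.]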
Hari: They bring up this objection in section 2, and say in response: "We are concerned only with the quality of moves made on the board, irrespective of whether and how they are prepared... Also we find that today’s elite make fewer clear mistakes than their forbears [sic]."
I can read legalese, but my numberian is a bit rusty. Looking at the paper in question, these things caught my eye:
1. They used computers for their fact-finding; my distrust of brute-force calculators kicks in here.
2. They pretty much ignore the game's circumstances (correct me if I'm wrong), the most important being that there's an opponent who makes moves that may not require a brilliancy-prize reply.
3. Conclusion: present-day 2800s are over 100 points better than Karpov when he was 2695; ergo: no inflation.
That's all well and good, in theory, but what are ratings supposed to do? They're supposed to show you how strong a player is relative to active chessplayers everywhere. Right now 2700 is the barrier for super-GM, but it won't take long for the bar to move to 2800 when there are 2900s around. That's inflation plain and simple, which is in fact acknowledged somewhere in the paper; they weren't out to disprove that.
To come back to (2), put Fischer or Lasker in the current age with the benefits of chess advancement and they'll perform at 2800 level. If you send Anand and Carlsen back to face Alekhine and Capablanca, they'll have an incredibly hard time. I cannot back any of it up with evidence, but the counterargument is equally impossible to support, at least :-)
Thanks "dfan"---I've re-uploaded a version with the bear fix. For Dennis M. and generally a query: is the level of prep that would go past Move 17 likely to be found en-masse enough in Cat. 11--14 events to throw up my Moves 17--32 stats?
In tests of individual performances in events, I have tried to identify the end of prep by looking for the first long think by a player after a novelty. Along with bad gamescores, it is one of my pet peeves that move times are not regularly recorded in PGN files of major events. Overall I aim to improve the data format for recorded computer analysis and build into my nearly 200 pages of computer code an automated facility for recognizing the TNs etc., which would help automate what Dennis suggests.
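[DM: A rough sketch of that "first long think" heuristic, assuming PGN files that carry [%emt h:mm:ss] elapsed-move-time annotations (many don't, as Ken laments); the threshold, tag format, and file name are all my assumptions, not Ken's actual code:

```python
import re
import chess.pgn

EMT = re.compile(r"\[%emt (\d+):(\d+):(\d+)\]")

def first_long_think(game, threshold_s=600):
    """Ply of the first move on which a player spent >= threshold_s seconds."""
    for node in game.mainline():
        m = EMT.search(node.comment or "")
        if m:
            h, mnt, s = map(int, m.groups())
            if h * 3600 + mnt * 60 + s >= threshold_s:
                return node.ply()
    return None

with open("event.pgn") as f:    # hypothetical file
    while (game := chess.pgn.read_game(f)) is not None:
        print(game.headers.get("White"), "-", game.headers.get("Black"),
              ": first long think at ply", first_long_think(game))
```
]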
Thanks all for the responses... I brought up my earlier concern because things were not clear to me from the paper. Ken's detailed response makes them clearer, but I am still forming my opinion on your points. As Dennis mentioned, preparation usually goes well beyond the novelty itself, so Jeff Sonas' average novelty depth has little meaning in this context. One thing I forgot is that players from earlier generations got to analyze adjourned positions, which gave them some advantage. In all it looks like a complex situation, but the paper is a good start for further investigation. Good luck to the authors in their future research!
Perseus asked the question "what are ratings supposed to do?", but I don't think the answer is as simple as he suggests. There's the question of whether ratings should measure absolute performance, or (as Perseus suggests) relative performance - "how strong a player is relative to active chess players everywhere". I'm not an expert by any means, but I have some views on this.
To me, the paper makes sense in the context of current ratings being representative of absolute strength, and indicating that the chess population as a whole is getting stronger. Whilst it is one thing to say that Lasker and Capablanca (and even, say, Karpov) would beat modern-day 2700s if they had access to modern opening theory and engines, the fact is that they didn't. Modern players do have access to these things, and as a consequence the moves they play are stronger. That the older generation played at a lower level is NOT an indictment of them - we should expect progress as competitors learn more in any discipline, and it happens in all sports (although they seem to be getting slower in road cycling these days!).
If ratings should be representative of relative strength, then to me the increase in top ratings is not as easily attributable to an increase in overall playing strength, and 'rating inflation' could be seen as an issue. One simple (and probably only partial) explanation would be that the chess population is increasing in size, and so if we focus on any rating bracket (at the top or bottom) we will see an increase in the number of players in that bracket. If being "a Super GM" means being in the top 40 in the world, then the rating threshold of that top 40 will naturally increase as the chess population increases. Another explanation would involve FIDE's use of minimum ratings: players entering the system near the floor will never be under-rated, but will often be over-rated, injecting an excess of points into the system and causing inflation. A third explanation would be that the top players are improving more quickly than the 'average' player, and so their ratings increase relative to the 'average'.
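[DM: The rating-floor mechanism lends itself to a toy demonstration. In a closed pool, standard Elo updates conserve points; clamping losers at a floor injects points and pushes the mean rating up. A minimal sketch with invented parameters, not a model of FIDE's actual lists:

```python
import random

def expected(ra, rb):
    """Elo expected score for the first player."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def mean_rating(n=200, games=20000, k=20, floor=None, seed=1):
    rng = random.Random(seed)
    skills = [rng.gauss(1400, 300) for _ in range(n)]   # fixed true strengths
    ratings = [1400.0] * n
    for _ in range(games):
        a, b = rng.sample(range(n), 2)
        # result drawn from true skills (draws ignored for simplicity)
        score_a = 1.0 if rng.random() < expected(skills[a], skills[b]) else 0.0
        delta = k * (score_a - expected(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
        if floor is not None:   # the clamp injects points instead of conserving them
            ratings[a] = max(ratings[a], floor)
            ratings[b] = max(ratings[b], floor)
    return sum(ratings) / n

print("mean rating, no floor:   %.1f" % mean_rating())
print("mean rating, 1200 floor: %.1f" % mean_rating(floor=1200))
```
]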
For answers on the question of absolute vs. relative ratings, I would love to see the rating progress (or otherwise) of a chosen version of a chess engine. If we run the same chess engine, with the same opening books/tablebases/hardware etc., does its rating increase or decrease over time? Unlike a human being (we can't bring a 2695-rated Karpov forward in time!), an engine can stay at the same playing level indefinitely, so if its rating stays the same, that would indicate that the rating system is doing a good job of measuring absolute strength. If its rating starts going up, that would indicate that there is some sort of rating inflation going on (or a decrease in the absolute strength of the playing pool - unlikely). If it goes down, that would be interesting - it could indicate some sort of rating deflation, or it could indicate that the rating system is measuring absolute strength and the chess population is getting stronger over time (to be expected as theory progresses). Does anyone have any thoughts about what would happen in this scenario?
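[DM: Steve's benchmark experiment can at least be simulated. In a zero-sum Elo pool where every human's true strength improves a little each season while the "engine" stays fixed, the engine's rating drifts steadily downward; a flat benchmark rating would instead indicate a pool of constant strength. A toy sketch under those assumptions (reusing the same Elo update as the previous sketch), with all parameters invented:

```python
import random

def expected(ra, rb):
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

rng = random.Random(7)
n = 101                                 # player 0 is the fixed-strength benchmark
skills = [2000.0] + [rng.gauss(2000, 150) for _ in range(n - 1)]
ratings = [2000.0] * n

for season in range(1, 31):
    for _ in range(2000):               # zero-sum Elo updates, results from true skill
        a, b = rng.sample(range(n), 2)
        score_a = 1.0 if rng.random() < expected(skills[a], skills[b]) else 0.0
        delta = 20 * (score_a - expected(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
    for i in range(1, n):
        skills[i] += 5.0                # everyone but the benchmark improves
    if season % 10 == 0:
        print("season %2d: benchmark rating %.0f" % (season, ratings[0]))
```
]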
Obviously, this discussion is about FIDE ratings.
What about USCF ratings?
Hari, the bottom half of Figure 3 in the paper is interesting. The 1970s error rate was higher than more recent error rates until move 40. From move 40 to about move 55, the error rates in the two eras were similar. After move 55, contemporary error rates are higher than they were 35 years ago. Better opening prep now; better adjournment analysis, plus more generous time controls late in the game, back then...
Perseus, ratings are indisputably higher now than they were 20 or 30 years ago. The question is whether that "inflation" is a result of stronger play or because of some flaw inherent in the rating system that allows the rating to drift upward while the level of play remains the same. I'd assume that the empirical inflation is due to a combination of more players, improving play, and genuine drift of the ratings curve. Ken et al. show some evidence that the strength of play has improved enough to account for the empirical inflation. Intriguing possibility, but the analysis is soft from a statistical standpoint (possibly because their data is too noisy for the sample size, or possibly because they don't have the necessary statistical expertise to make a convincing case).
Steve, great question about how much a good opening book helps a computer. My guess is that it would help some but not all that much. I bet someone has experimented with the issue...
We did devote a full page to address the kind of argument raised by Perseus---this was part of GM Macieja's contribution, along with the population section. From the perspective of a FIDE officer the question about Elo 'inflation' is, what should be done about it?---and our answer is a resounding: Nothing! (Whether to change from Elo to another system is a different question...)
Regarding Steve's query about evaluating engines, we have some data on that but haven't decided where to go with it. One problem with trying to put engines on a human scale is the lack of engine-human games with something really at stake.
JH, the final section is about Canadian ratings---I'd be happy to have helpers tackle USCF ratings. With the Canadian Open I had a significant portion of a whole federation's upper echelon in one event, and a fair number of foreign players besides. (Plus I played in it myself---see my report for Susan Polgar here.)
Eupseiphos: indeed the SAE data is noisy---we resorted to a 4-year moving average and there are still places where the lines cross. The IPR data uses a regression technique of my own devising, described in the AAAI 2011 paper. We admit not knowing how to compute confidence intervals for it, but the estimated error bars for the IPRs of individual (9-game) performances are yawning wide, which may explain the volatility that surprised me. Any suggestions for better statistical analysis will be most welcome, thanks!
Ken,
A confidence interval for your custom regression should not be difficult to construct via bootstrap. The procedure is straightforward:
1. take a random sample of size n (with replacement) from your data set, where n is the size of your data set;
2. calculate your regression parameter(s) and store them;
3. repeat steps 1 and 2 a thousand times.
The distribution of the resulting simulated values will closely approximate the sampling distribution of your statistic, so you should be able to estimate variances, CIs, p-values, etc.
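[DM: For concreteness, a minimal numpy version of the procedure; `fit` stands in for Ken's custom regression (which we don't have), and the sample data are invented:

```python
import numpy as np

def bootstrap_ci(data, fit, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the statistic returned by fit(sample).
    data should be a numpy array so resampling by index works."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([fit(data[rng.integers(0, n, size=n)])  # steps 1 and 2
                      for _ in range(n_boot)])               # step 3
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# e.g. a 95% CI for the mean of a noisy sample of performance ratings:
sample = np.random.default_rng(1).normal(loc=2650, scale=40, size=200)
print(bootstrap_ci(sample, np.mean))
```
]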
One analysis that would be nice to see would be a regression of IPR ~ Elo + time + Elo*time. If time and Elo*time are not significant, it may be that your sample is too small or that the inflation is too small to be of interest. To decide which explanation makes the most sense, you'd need to estimate the power of your test vis-à-vis some reasonable definition of what an "interesting" degree of inflation would be. If your test proves to have good power, then you could argue: "If there were inflation of 50 points (or whatever) in the past 30 years, our test was strong enough to detect it with 90% probability. Since we did not detect any inflation, we have pretty strong evidence that there has not been significant inflation."
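[DM: A sketch of that model with statsmodels, assuming a pandas DataFrame of per-performance rows with hypothetical columns `ipr`, `elo`, and `year`:

```python
import pandas as pd
import statsmodels.formula.api as smf

def inflation_test(df: pd.DataFrame):
    """Fit IPR ~ Elo + year + Elo:year; at fixed Elo, inflation would show
    up as a significant negative coefficient on the year terms."""
    model = smf.ols("ipr ~ elo + year + elo:year", data=df).fit()
    print(model.summary())
    return model
```

Power could then be estimated along the lines suggested above: simulate data with a built-in drift of, say, 50 points over 30 years and count how often the year terms come out significant at p < 0.05.]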
One way to find out if there is rating inflation is to have a virtual human player (a chess engine) with a fixed playing strength take part regularly in tournaments. If the rating of the engine remains relatively constant after a few decades, then we have no inflation :-)
[DM: A decent idea, but what about opening evolution? "Strength" is in part a function of knowing what to do in certain middlegames. Ratings are a relative measure, not an absolute one. Maybe A was stronger than B 10 years ago, is stronger again now, and consistently scores about 6 out of 10 in their matches. But if both have been studying and learning, their ratings would stay the same while their strength would not. Thus in this scenario, ratings as a measure of strength would be deflated.]
"Unlike a human being (we can't bring a 2695 rated Karpov forward in time!),..."
Indeed we can, although whether someone wants to put in the necessary work is another matter. Reverse-engineering Karpov v2695 is not impossible by any means, though in practice it would mean genetic algorithms that eventually always play what Karpov played at the board.
[DM: Genetic algorithms? Maybe that's a term of art in the AI community; you don't really mean that just cloning Karpov (or in some hypothetical future, computer-cloning the digital equivalent of his DNA) is going to produce someone or some program whose chess outputs are equivalent, do you? That seems implausible, to put it mildly, as it suggests that his intensive training with Semyon Furman and others made no difference, to say nothing of his competitive experience, self-study and so on. As for producing a program that would reproduce all the good moves and all the mistakes he made at a certain point in time, that seems (a) extremely unlikely and (b) susceptible to curve-fitting problems. It also assumes a sort of narrow determinism in the production of a player's moves, as if specific preparation, health, rest, confidence, reading the opponent and a host of other factors played no role.]