The Daily Update: Biel; US Women's, Junior Championships; Rybka vs. Stockfish (UPDATED)
The "Young Grandmasters Tournament" in Biel started today; here are the results:
Rodshtein - Negi 1-0
Andreikin - Son 1/2-1/2
Giri - Tomashevsky 1/2-1/2
Vachier-Lagrave - Caruana 1/2-1/2
Howell - So 0-1
The U.S. Women's Championship had been a two-player race between Anna Zatonskih and Irina Krush, with Zatonskih either in clear or shared first in every round...until the very end. She was unable to beat Sabina Foisor, but Krush beat Abby Marshall to take clear first with 8/9. Zatonskih was half a point behind - tied, surprisingly, with Tatev Abrahamyan. (They were three points ahead of their nearest pursuers.) It was an excellent performance by the top three.
There's also a top three in the U.S. Junior Championship, but they're tied and in need of a playoff. GM Ray Robson started half a point ahead of Parker Zhao and a point ahead of IM Sam Shankland. Shankland had lost his first two games but caught fire after that, going 5/6 before the last round, and then crushing Conrad Holt to finish with 6 points. Robson had a disaster, losing to Warren Harper, which cleared the way for Zhao. Zhao had a winning endgame against John Daniel Bryant, but let him escape a bishop and two pawns vs. knight ending with a draw. (The difficulties arose because of wrong-colored bishop and rook-pawn worries.)
I'll update this when I know more; meanwhile, here's the website for both of these US championship events.
**UPDATE** The playoffs will be Tuesday at 10 a.m. St. Louis time (= 11 a.m. ET, 5 p.m. CET), and you can find the procedure here.
Finally, it was a competitive match for a while, but now it's a cooking show with Rybka 4 playing the part of the rolling pin and Stockfish 1.8 as the pizza dough. The 48 game match continues, but it's 26-16. Stockfish has gone winless the last 15 games, with eight losses interspersed over that stretch.
Reader Comments (9)
Rybka's 26/42 is about 62%. That corresponds to a rating difference of 90-100 points. Which is about what's expected: Rybka 3 was barely fending off Stockfish 1.6.x at this speed of games, while Rybka 4 is said to be about a 70 Elo improvement on Rybka 3. Maybe 50-70, anyway conceded to be less than the "100" previously envisioned.
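For readers who want to check the conversion from score to rating gap, here is a minimal sketch. It assumes the standard logistic Elo expectation formula, which the comment does not name; the function name `elo_diff_from_score` is made up for illustration. The logistic curve lands in the mid-80s for this score, a little under the 90-100 quoted above, while lookup tables built on the normal curve come out slightly higher.

```python
import math


def elo_diff_from_score(score_fraction: float) -> float:
    """Rating gap implied by a score share, assuming the logistic Elo model."""
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    # Invert E = 1 / (1 + 10 ** (-d / 400)) to solve for d.
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


# Match score quoted in the post: Rybka 26 points out of 42 games.
rybka_points, games_played = 26.0, 42
share = rybka_points / games_played
print(f"score share: {share:.3f}")
print(f"implied Elo gap: {elo_diff_from_score(share):.0f}")
```

An even score maps to a gap of zero, and the gap grows quickly as the share moves away from 50%.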
However, a caveat on the statistical significance of so few games. Compare to election polls, where 1200 is a commonly recommended sample size to get the proverbial "3 point margin of error". (Namely, the square root of 1200 is about 2.9 percent of 1200.) A win is like polling 2 people and getting 2 votes for (Minneapolis mayor) R.T. Rybak, a draw is like 1 vote for Rybak and 1 for (Bloomington mayor) S. Stockton, a loss is like 2 votes for Stockton. Thus one needs at least 600 games to get the accuracy of a common presidential poll.
The 3% margin of error translates to about 25 rating points, so if the results kept up at that rate for 558 more games, you'd be justified in saying you were 90% confident that Rybka 4 is 75--125 Elo ahead of Stockfish. From only 42 games, you can't even be that confident that Rybka 4 is better! This point about the huge number of games needed to estimate ratings to within 10 Elo with 90% confidence has been made in papers by Professor Ernst A. Heinz of MIT---alas the URLs at his webpage seem to be working less well than when I found said papers 1-1/2 years ago.
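The polling arithmetic above can be sketched in a few lines. This is only an illustration of the rule of thumb in the comment (each game counted as two "votes", margin of roughly one over the square root of the number of votes), combined with the logistic Elo formula, which is an assumption here. The helper names `poll_margin`, `elo_gap` and `elo_interval` are hypothetical.

```python
import math


def poll_margin(games: int) -> float:
    """Rule-of-thumb margin on the score share, counting each game as two votes."""
    return 1.0 / math.sqrt(2 * games)


def elo_gap(share: float) -> float:
    """Rating gap for a score share, assuming the logistic Elo model."""
    return 400.0 * math.log10(share / (1.0 - share))


def elo_interval(share: float, games: int) -> tuple[float, float]:
    """Elo gaps at the lower and upper ends of the share's margin of error."""
    margin = poll_margin(games)
    low_share = max(share - margin, 1e-9)          # keep inside (0, 1)
    high_share = min(share + margin, 1.0 - 1e-9)
    return elo_gap(low_share), elo_gap(high_share)


share = 26 / 42  # Rybka's share at the time of the post
for games in (42, 600):
    low, high = elo_interval(share, games)
    print(f"{games:4d} games: margin {poll_margin(games):.1%}, "
          f"Elo gap between {low:.0f} and {high:.0f}")
```

Running it shows how wide the band is after 42 games compared with 600, which is the point of the caveat.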
The brutal beating that Rybka 4 is giving Stockfish 1.8 is shocking given how close the match was through the first 26 games. If I didn't know better, I would think Stockfish was reacting like a human and had lost heart around game 30. One of the things I felt turned the match around was that in the second half of the tournament Rybka seemed to begin to play the openings very well. For example, Rybka simply took Stockfish apart in Game 38 by transforming the game from a Reti into a Semi-Slav Botvinnik. The game featured Rybka giving up a queen for rook and knight and a clear positional advantage. Stockfish never could figure out it was in trouble until near the end.
Stockfish seemed to do best in the early going, when the openings were tactical. In the Blumenfeld and the King's Gambit, Stockfish went 3 1/2 out of 4.
I have used the Stockfish 1.8 engine, as well as a few others rated over 2900, to analyze the games on a computer very close in strength to the one in the match. It is puzzling that in the match Stockfish chose moves that were clearly weak even by its own evaluation. It would be interesting to know why.
@Kenneth Regan
"Compare to election polls..."
That's a pretty suspicious comparison, since a huge factor in election polls (sampling bias) is non-existent here. You need to poll a lot of people in elections, to be sure your sample is a fairly close match with society. That number increases to ensure some statistical significance - but this is not the only factor at work. [of course you could argue that the position set used introduces some bias, but it's hardly the same]
I don't know whether your numbers are correct (it seems highly unlikely to me), but there shouldn't be any need to appeal to analogy if the mathematics is right.
I'd also point out that determining which player is better (against the other) in a head-to-head match is a different problem from determining either's Elo. An Elo rating gives an estimate of likely performance against any one opponent, using calculations from games against a broad cross-section of opponents. This is actually quite similar to election polling, since you again need to make sure that the sample of opponents is representative of the population. If Professor Heinz's papers concern calculation of Elo, they might well take into account the number of games it'd usually take to play a significant cross-section of opponents. This isn't a factor in head-to-head matches.
One player winning a 48 game match doesn't mean that his Elo should necessarily be higher - but the same could be said for a 100000 game match. All it means is that he's highly likely to outperform that particular opponent in head-to-head matches.
From this match, I'd be very confident that Rybka 4 would win a 48 game rematch with Stockfish 1.8; I couldn't say much at all about their relative Elo strength.
What's your opinion on the Armageddon games that involve wagering on time (like will occur at the US Junior)? I think this is definitely influenced by Greg Shahade (with his poker obsession). I am not sure how I feel about it. Part of me is uneasy with the introduction of variability - every game should have the same starting conditions, with the only uniqueness being the actual play. Then again, the playoff game type where Black 'wins' with a draw is strange enough (and strictly speaking not actual chess), and perhaps adding the ability to wager time gives both sides a way to determine how much that non-chess benefit is worth to them.
@Prefer not: I don't mind it too much, because we're talking about a tiebreaker and because the conditions are up to the players themselves. But I'd prefer something like the FIDE k.o. model instead: a pair of G/30s (or maybe 20 + 10s), and then a pair of G/15s (or 10 + 5s), etc., with an Armageddon speed game coming only at the very end.
@JC: My numbers are correct, and my analogy is fine as far as I intended it. You are right that a match is sensitive to particular factors of the two players, and that Elo estimations are better served by running against large fields, as CEGT and CCRL do. But that is not what Martin Thoresen is doing, and his match is attracting a lot of attention---and gathering (over-)interpretations like you can read in this thread: "rolling pin", "brutal beating", "losing heart" etc. I agree that people shouldn't draw Elo conclusions from the match, but guess what those selling Rybka 4 care about?---Elo with a Euro sign for the E. I raised a caveat by comparing it to something people are familiar with.
My basic point can be made from the results of this match itself. From games 5 thru 25---half the match so far---Stockfish played Rybka even. If those had been the only games, we'd be seeing radically different leapt-to conclusions. In the other 21 games, Rybka killed to the tune of 10-1 with 10 draws. 21 games is about what used to be standard world-championship match length, so people wouldn't shrink from drawing conclusions from such a match length either.
Was game 42 re-started owing to a technical glitch? The score stands 25.5--16.5; Dennis and I both recorded seeing 26-16.
Yeah. Stockfish crashed, so the game was rerun.
While we look at the programs and how well or poorly they play, it is the programmers who have given them their abilities. What I believe we see in a match is not just the flaws and weaknesses the programmers have imbued them with but the unintended consequences of programming trade-offs. Just as with human players, some programs match up better than others. In this match, sometimes it was the opening selected, sometimes the middle game. However, even using the 3-4-5 Men Nalimov Tablebases, some of the endgame play in this match was painful to watch.
One advantage I saw as Rybka's was in assessing unbalanced but equal material exchanges, for example a Queen for Rook and minor piece. Stockfish always seemed to be on the wrong end of those.
@Kenneth
While the even play through 5-->25 is clearly significant, it's not as though that's a typical 21 game streak in this match: choose any other, and Rybka wins. Drawing a conclusion from this match that Rybka would be hugely likely to win a 12 or 16 game match might be silly (though I'd clearly expect above 50% odds). Doing so over a 48 game match is hardly comparable.
Of course people are too ready to jump to conclusions from relatively few games - I'd prefer world-championship matches to be much longer too. However, it remains perfectly reasonable to conclude that Rybka 4 is highly likely to win a 48 game rematch. I wouldn't call this victory brutal/crushing/..., since it's only around 2 Rybka wins to each 1 Stockfish win. In particular, computers don't lose through blunders / blindness / over-confidence / off-days /..., so Stockfish's 7 wins are a consequence of Rybka's being outplayed. If you outplay an opponent twice for each game you get outplayed, I wouldn't say that you've crushed them.
I agree that many of the descriptions are over the top, but no-one sane would bet on Stockfish 1.8 in an identical rematch. I remain impressed at Stockfish's performance, and I certainly think it does well enough to make future matches interesting to watch - I just wouldn't expect it to win over 48 games.
Perhaps 64 games would be a somewhat-practical target to hope for in these kinds of matches if you want greater accuracy. Whether that'd be enough depends entirely on the results - if it comes out 33:31, it wasn't; if it comes out 42:22, it was. 64 is a nice chessy number, and would allow a few more positions (preferably a few more moves in). The starting positions could also be a way to maintain interest through longer matches, if it's considered that observers would lose interest over so many games - throwing in some of the more interesting/dynamic/theoretically-relevant starting positions towards the end would be sure to keep people watching.
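The 33:31 versus 42:22 contrast can be made concrete with a rough calculation. The sketch below borrows the two-votes-per-game polling convention from earlier in the thread and measures how many standard errors the winner's share sits above 50%. Treating about two standard errors as the bar for "decisive" is a common convention and an assumption here, not something stated above; `z_score` is a hypothetical helper name.

```python
import math


def z_score(points_a: float, points_b: float) -> float:
    """Standard errors by which A's score share exceeds 50%.

    Assumes each game counts as two independent 'votes', so the standard
    error of the share under an even match-up is 0.5 / sqrt(votes).
    """
    games = points_a + points_b
    votes = 2 * games
    share = points_a / games
    std_error = 0.5 / math.sqrt(votes)
    return (share - 0.5) / std_error


for a, b in ((33, 31), (42, 22)):
    print(f"{a}:{b} -> {z_score(a, b):.2f} standard errors above an even score")
```

On this measure a 33:31 result is well inside the noise, while 42:22 is far outside it, matching the intuition in the comment.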
@Larry
On the unbalanced exchanges front, Stockfish did come out on top in game 22. Rybka was rating things around -0.2 while Stockfish thought it had around -1.5. Stockfish won the game, so it seems likely that its assessment was the more accurate. Game 6 was hardly a triumph for Rybka on this front either - at move 42, on trading his queen for two rooks, Rybka evaluates things at +0.2, while Stockfish puts things around -1.2; again, Stockfish won.
I think the sample size really is too small to draw conclusions in this area. I don't think we yet have even 10 games with unbalanced material exchanges from fairly even positions.