I got a reply from the Stockfish community, quoted below:
Dear NCM,
It’s wonderful to hear that you’re looking to improve. Please find my suggestions below.
As has been stated elsewhere, Elo gaps above around 400 cannot be directly measured, certainly not with 20k games.
Above 450 Elo such gaps cannot be measured even with a million games. Running tests pitting SF 14+ against SF 7
is simply a waste of your valuable hardware time.
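To put rough numbers on that claim, here is a minimal sketch, assuming independent games and the classical logistic
model (reality is worse still, since that model itself breaks down across gaps this large):

```python
import math

def elo_half_ci(score: float, games: int) -> float:
    """Approximate 95% half-width, in Elo, of a measured score fraction.

    Elo(s) = 400*log10(s/(1-s)), so d(Elo)/ds = 400/(ln(10)*s*(1-s)),
    which diverges as s -> 1: near-total domination means huge error bars.
    """
    se = math.sqrt(score * (1.0 - score) / games)   # binomial SE of s
    return 1.96 * se * 400.0 / (math.log(10) * score * (1.0 - score))

# A 400 Elo gap already implies a ~91% score; SF 14+ vs SF 7 scores far higher.
for s in (0.91, 0.99, 0.999):
    print(f"s={s}: 20k games -> +/-{elo_half_ci(s, 20_000):.0f} Elo, "
          f"1M games -> +/-{elo_half_ci(s, 1_000_000):.0f} Elo")
```

And once the weaker side stops scoring at all, the estimate is simply unbounded.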
So it is music to my ears that improvements are on the way. As for your questions, I shall compare your current
setup with the settings that fishtest uses in the official progression tests.
First, testing against the most recent release is a wonderful choice. Official progression tests are against the
most recent major release – 15.0, not 15.1 – but the difference is quite minimal, as far as progression accuracy
is concerned. Anything remotely like that is a million times better than SF 7 these days.
Fishtests are typically done on 1 thread, at TCs of 10+0.1 and 60+0.6. These TCs are for a baseline of about 1.0 Mnps.
The fishtest worker code automatically measures worker nps and adjusts the local worker TC to be the equivalent
of 1.0 Mnps at the nominal TC.
Your website reports that you do 30+0.3s on 2 threads at 1.6 Mnps per thread, which looks like about the equivalent
of 96+0.96 on 1 thread @ 1 Mnps. I think this is a perfectly fine time per game to use, although you can certainly
choose to make it longer or shorter as you please. The official progression tests, as well as the WDL
curve and centipawn calibration data, are all measured at the fishtest standard LTC of 60+0.6 on 1 thread at 1.0 Mnps.
I suggest not going shorter than fishtest LTC, although like I said you’re currently noticeably above it.
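For anyone checking the arithmetic, that equivalence is just a node-count conversion; a quick sketch (this treats
multithreaded search as perfectly efficient, which it isn’t, so the true single-thread equivalent is a bit lower):

```python
def equivalent_tc(base_time: float, threads: int, mnps_per_thread: float,
                  baseline_mnps: float = 1.0) -> float:
    """Single-thread TC at the baseline nps that searches the same total nodes."""
    return base_time * threads * mnps_per_thread / baseline_mnps

# NCM: 30+0.3 on 2 threads at 1.6 Mnps per thread
print(equivalent_tc(30.0, 2, 1.6))  # 96.0, i.e. roughly 96+0.96 @ 1 Mnps
```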
(Your hash setting is also very reasonable and normal. Fishtest hash is in the range 12-16 MB/(10s @ 1.0 Mnps), which
is about where your setting is as well.)
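Scaling that rule of thumb linearly with effective game time, your roughly 96s-equivalent TC works out as follows:

```python
equiv_time = 96.0  # 1-thread @ 1.0 Mnps equivalent from above
lo, hi = 12 * equiv_time / 10, 16 * equiv_time / 10
print(f"rule-of-thumb hash: {lo:.0f}-{hi:.0f} MB")  # ~115-154 MB
```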
As for the Elo differences, these days a typical patch is worth around 1-2 Elo, requiring more than 20k games
to get a small enough error bar. On the other hand, you run progression tests for every commit, and that’s
a lot of computing power by any standard. Official progression tests are 40k or 60k games, but doing that much for
every commit is a tall order indeed – official progression is only about once a month.
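To see where those game counts come from, here’s a back-of-the-envelope estimate for two nearly equal engines.
This is a simple trinomial model with independent games (fishtest’s real pentanomial, game-pair statistics differ
in the details), and the 70% draw rate is just an illustrative value:

```python
import math

def elo_half_ci_equal(games: int, draw_rate: float) -> float:
    """95% half-width in Elo for two nearly equal engines.

    At a 50% score with draw rate d, per-game score variance is (1-d)/4,
    and near 50% one score point is about 695 Elo (400/(ln(10)*0.25)).
    """
    se = math.sqrt((1.0 - draw_rate) / 4.0 / games)
    return 1.96 * se * 400.0 / (math.log(10) * 0.25)

for n in (20_000, 40_000, 60_000):
    print(f"{n:>6} games: +/-{elo_half_ci_equal(n, 0.70):.1f} Elo")
# ~ +/-2.6, +/-1.9, +/-1.5 -- resolving a 1-2 Elo patch takes a lot of games.
```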
Depending on exactly how much hardware you’re willing to commit to every commit (lol), I recommend trading some time
control for more games per test, although I would go no shorter than the fishtest LTC of 60s @ 1.0 Mnps. It’s worth
noting that Stockfish occasionally does some tuning at TCs longer than LTC, such that VLTC performance improves
more than LTC; however, this is rare, only about once a year on average. So there is some utility in your
setup’s very long TC, though as you say it must be balanced against games per test and total hardware usage.
In a world with infinite resources, I would say continue with your current setup of roughly 100s+1.0s @ 1.0 Mnps TC
while also doubling the games per test to 40k. If this is too much, then perhaps 80s @ 1.0 Mnps with 30k games
per test, or reduce it further to 60s @ 1.0 Mnps with 30k games per test. Any of these would be a great choice.
Oh and I haven’t mentioned books yet. I’m not sure what books you’re using, but regular fishtests long ago switched
from the 8moves family of books to the UHO family, namely UHO_XXL_+0.90_+1.19.epd. This family is much higher bias:
its openings favor White enough that, between equal engines, each game is much closer to a 50/50 split between a
win and a draw, which decreases the drawrate and improves the sensitivity to Elo of each game pair. The official
progression tests are also in the process of switching to the UHO family. There’s
essentially no reason for any progression tests, official or otherwise, or indeed engine tourneys and rating lists,
to continue using the older, higher drawrate 8moves family. I strongly recommend UHO for your tests.
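As a toy illustration of the game-pair point: in the pentanomial (per-pair) view, a double-draw pair says nothing
about which engine is stronger, so a book that moves pairs out of the 0.5 bucket carries more information per pair.
The counts below are made up purely for illustration; only the book families are real:

```python
import math

PAIR_SCORES = (0.0, 0.25, 0.5, 0.75, 1.0)  # LL, LD, DD or WL, WD, WW

def pair_stats(counts):
    """Mean pair score and per-pair standard deviation."""
    n = sum(counts)
    mu = sum(s * c for s, c in zip(PAIR_SCORES, counts)) / n
    var = sum((s - mu) ** 2 * c for s, c in zip(PAIR_SCORES, counts)) / n
    return mu, math.sqrt(var)

# Hypothetical counts for 10k pairs between equal engines:
books = {
    "8moves-like": (100, 1100, 7600, 1100, 100),  # most pairs end 0.5-0.5
    "UHO-like":    (400, 2600, 4000, 2600, 400),  # far fewer dead pairs
}
for name, counts in books.items():
    mu, sd = pair_stats(counts)
    print(f"{name}: mean pair score {mu:.3f}, per-pair sd {sd:.3f}")
```

The sensitivity I mean is, roughly, how far a given Elo edge moves the mean pair score relative to that per-pair
spread; the finding behind the UHO switch is that biased openings grow the signal faster than the noise.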
Finally, we come to my toughest recommendation. It is with a heavy heart that I recommend deleting the existing tests
since about SF14 from your website, and ideally re-running them under the new conditions. As I said near the beginning,
measuring Elo gaps above +400 is simply not feasible with only 20k games per test, and the “results” since SF14 are
about 99% noise. And unfortunately, a lot of viewers won’t realize just how noisy those datapoints are. In my humble
opinion, it’s best for Stockfish if that old noisy data were removed. Obviously redoing all of those tests
with new settings requires an extraordinary amount of hardware resources, but even if those tests are not redone,
it’s still better to remove the old noise. As for the chart, it’s still reasonable to have it all on one chart
even with different testing conditions, perhaps using different colors. If you had infinite resources, I would suggest
redoing all the tests since SF13 with the new setup, coloring all data from the new setup as red, then between SF13 and
SF14 we could see both the red and blue data points, and then from SF14 onwards only red data points. Since you don’t
have infinite resources, of course, what you do is up to you.
In any case, I greatly look forward to seeing a new test setup from NCM, it should be very fruitful to keep track of
how crazy good Stockfish is, even as it grows stronger. The new setup should clearly reveal the upward trend since
SF14. I believe a dozen other SF devs and contributors also look forward to seeing a modernized setup, since
the NCM presentation of its data is one of the best, if not the best, in the entire engine chess world.
Many thanks for reading.
Respectfully,
Dubslow