Testing against sf7 and timing

The NCM dev build tests are illogical and show big mistakes: they are played against SF7, and at 30-second bullet time controls. Either play 1,000 games at 1 second per move, or make the opponent SF 15.1.
An SF author said the following on Discord:

  1. Dubslow Today at 4:06 PM

ncm is such garbage, i just opened it now to check if there’s anything useful, and no, it’s even worse than i expected, and i expected it to be terrible

  1. [4:07 PM]

(and it will continue to be garbage until they retire SF7 in favor of SF12 13 or 14 or 15 or similar)

  1. elo gaps above 400 cannot be directly measured with 20k games, and elo gaps above 500 cannot be measured even with a million games (edited)

Ha… fair enough :slight_smile:

I’ve kept SF7 as the baseline for no other reason than to have one giant graph. But yes, the error bars are getting huge.

I’ve really been wanting to address this somehow. I’ve been busy over the past couple years migrating everything on the back-end from Ruby to Elixir, and as of a few days ago, that’s 100% complete and in great shape. So now is a perfect time to do this.

What I’ve been thinking is to, instead of testing against SF7, test each dev build against the prior official release. E.g., current dev builds would be tested against SF 15.1.

I have lots of questions. For instance, time controls. Do those need to be adjusted? Right now they are set dynamically by periodically checking the “stockfish bench” score. I did that a long time ago when there were multiple types of servers running the dev build tests, but now that all of the hardware is identical, I think I can get rid of that and just pick a constant time control.

What’s the right balance of time control / games per dev build now that the Elo differences will be narrower?

Do we need to make any changes to the opening book, adjust any parameters to cutechess, etc?

I’m tied up for the rest of today, but if you can help me get the discussion going with the Stockfish community, that would be much appreciated!

I forwarded your message; their reply is below.

  1. oh thank god

  2. [11:04 PM]

we’ve been trying to tell ncm to fix itself for a while

  1. [11:04 PM]

let me compose a reply

I got a reply from the Stockfish community, below.

Dear NCM,

It’s wonderful to hear that you’re looking to improve. Please find my suggestions below.

As has been stated elsewhere, Elo gaps above around 400 Elo cannot be directly measured, at least not with 20k games.
Above 450 Elo such gaps cannot be measured even with a million games. Running tests against SF 7 with SF 14+
is simply a waste of your valuable hardware time.

So it is music to my ears that improvements are on the way. As for your questions, I shall compare your current
setup with the settings that fishtest uses in the official progression tests.

First, testing against the most recent release is a wonderful choice. Official progression tests are against the
most recent major release – 15.0, not 15.1 – but the difference is quite minimal, as far as progression accuracy
is concerned. Anything remotely like that is a million times better than SF 7 these days.

Fishtests are typically done on 1 thread, at TCs of 10+0.1 and 60+0.6. These TCs are for a baseline of about 1.0 Mnps.
The fishtest worker code automatically measures worker nps and adjusts the local worker TC to be the equivalent
of 1.0 Mnps at the nominal TC.

Your website reports that you do 30+0.3s on 2 threads at 1.6 Mnps per thread, which looks like about the equivalent
of 96+0.96 on 1 thread @ 1 Mnps. I think this is a perfectly fine time per game to use, although you can
certainly choose to make it longer or shorter as you please. The official progression tests, as well as the WDL
curve and centipawn calibration data, are all measured at the fishtest standard LTC of 60+0.6 on 1 thread at 1.0 Mnps.
I suggest not going shorter than fishtest LTC, although like I said you’re currently noticeably above it.
(Your hash setting is also very reasonable and normal. Fishtest hash is in the range 12-16 MB/(10s @ 1.0 Mnps), which
is about where your setting is as well.)
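
For anyone who wants to redo the time-control arithmetic above, here is a minimal sketch. It assumes, as the estimate in the letter appears to, that time scales linearly with both thread count and nodes per second; real multi-threaded search does not scale perfectly, so treat the result as a rough equivalence. The function name `equivalent_tc` is made up for illustration.

```python
def equivalent_tc(base_s, inc_s, threads, mnps_per_thread, ref_mnps=1.0):
    """Convert a time control to its rough 1-thread equivalent at ref_mnps.

    Assumption: search effort scales linearly with threads * nps.
    """
    scale = threads * mnps_per_thread / ref_mnps
    return base_s * scale, inc_s * scale


# NCM's reported setup: 30+0.3 on 2 threads at 1.6 Mnps per thread.
base, inc = equivalent_tc(30, 0.3, threads=2, mnps_per_thread=1.6)
print(f"{base:.0f}+{inc:.2f}")  # roughly 96+0.96
```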

As for the Elo differences per patch, these days it’s typically around 1-2 Elo per patch, requiring more than 20k games
to be able to get a small enough error bar. On the other hand, you run progression tests for every commit, and that’s
a lot of computing power by any standard. Official progression tests are 40k or 60k games, but doing that much for
every commit is a tall burden indeed – official progression is only about once a month.
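
To get a feel for why 20k games struggle to resolve a 1-2 Elo difference, here is a rough sketch of the error bar for two near-equal engines. It uses a simple per-game (win/draw/loss) model with the standard logistic Elo curve and ignores the game-pair correlation that fishtest's pentanomial statistics account for. The draw ratio of 0.6 is an assumed illustrative value, not a measured one.

```python
import math


def elo_error_95(games: int, draw_ratio: float) -> float:
    """Approximate 95% half-width (in Elo) for a match near a 50% score."""
    # Per-game score variance when wins and losses are equally likely.
    variance = (1.0 - draw_ratio) / 4.0
    score_se = math.sqrt(variance / games)
    # Slope of the logistic Elo curve at a 50% score: 400 / (ln 10 * 0.25).
    elo_per_score = 1600.0 / math.log(10)
    return 1.96 * score_se * elo_per_score


for n in (20_000, 40_000, 60_000):
    # draw_ratio=0.6 is an assumption made purely for illustration.
    print(n, round(elo_error_95(n, draw_ratio=0.6), 2))
```

Under these assumptions the 95% interval at 20k games is about ±3 Elo, which is wider than the typical per-patch gain mentioned above.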

Depending on exactly how much hardware you’re willing to commit to every commit (lol), I recommend trading some time
control for more games per test, although I would go no shorter than the fishtest LTC of 60s @ 1.0 Mnps. It’s worth
noting that, rarely, Stockfish does some tuning at TCs longer than LTC, such that, occasionally, VLTC performance
improves more than LTC, however this is indeed rare, only once a year on average. So there is some utility to your
setup using Very Long TC, however as you say this must be balanced against games per test and total hardware usage.

In a world with infinite resources, I would say continue with your current setup of roughly 100s+1.0s @ 1.0 Mnps TC
while also doubling the games per test to 40k. Presuming this is too much, then perhaps 80s @ 1.0 Mnps with 30k games
per test, or reduce it further to 60s @ 1.0 Mnps with 30k games per test. Any of these would be a great choice.

Oh and I haven’t mentioned books yet. I’m not sure what books you’re using, but regular fishtests long ago switched
from the 8moves family of books to the UHO family, namely UHO_XXL_+0.90_+1.19.epd. This family is much higher bias,
with openings much closer to 50/50 win/draw, which decreases the drawrate and improves the sensitivity to Elo of
each game pair. The official progression tests are also in the process of switching to the UHO family. There’s
essentially no reason for any progression tests, official or otherwise, or indeed engine tourneys and rating lists,
to continue using the older, higher drawrate 8moves family. I strongly recommend UHO for your tests.

Finally, we come to my toughest recommendation. It is with a heavy heart that I recommend deleting the existing tests
since about SF14 from your website, and ideally re-running them under the new conditions. As I said near the beginning,
measuring Elo gaps above +400 is simply not feasible with only 20k games per test, and the “results” since SF14 are
about 99% noise. And unfortunately, a lot of viewers won’t realize just how noisy those datapoints are. In my humble
opinion, I think it’s best for Stockfish if that old noisy data was removed. Obviously redoing all of those tests
with new settings requires an extraordinary amount of hardware resources, but even if those tests are not redone,
it’s still better to remove the old noise. As far as the chart, it’s still reasonable to have it all on one chart
even with different testing conditions, perhaps using different colors. If you had infinite resources, I would suggest
redoing all the tests since SF13 with the new setup, coloring all data from the new setup as red, then between SF13 and
SF14 we could see both the red and blue data points, and then from SF14 onwards only red data points. Since you don’t
have infinite resources of course, what you do is up to you.

In any case, I greatly look forward to seeing a new test setup from NCM, it should be very fruitful to keep track of
how crazy good Stockfish is, even as it grows stronger. The new setup should clearly reveal the upward trend since
SF14. I believe a dozen other SF devs and contributors also look forward to seeing a modernized setup as well, since
the NCM presentation of its data is one of the best, if not the best, in the entire engine chess world.

Many thanks for reading.


The opinion of SF author Joost VandeVondele (vondele) is below:

vondele Today at 12:13 AM

I actually was originally suggesting to have the graph against SF7, many years ago, and really appreciate the long-lived nature of the data, which makes it unique. However, time has come to change. I still think it would be interesting not to test against our latest release, but against something else, e.g. SF 14, otherwise it is too similar to our progression test. It tells us something about contempt. So upping the version (e.g. SF 14 instead of SF 7), and switching books. I would propose he moves to the UHO_4060_v2.epd.zip book. For the time control and number of games, I think it is a reasonable choice.

Posting an update here based off some discussion earlier today on the Stockfish discord, partially so I don’t forget, but mostly to solicit corrections if I have something wrong.

Big thanks to @puttutathy, Dubslow, vondele, deadwing for their recent guidance, and to all who have tried in the past to enlighten me on the issue :slight_smile:

I’m hoping to have something to show this upcoming week.

General Changes

  1. Change the baseline to Stockfish 14
  2. Change the opening book to UHO_4060_v2.epd.zip (from the official-stockfish/books repository on GitHub)
  3. Keep Threads at 2
  4. Keep games per build at 20,000
  5. Increase hash to 64

Changes to Time Control

Because all of the hardware is now identical, we’ll use a constant time control of 30+0.3 with no scaling.

I’ll still measure and record the “stockfish bench” score for Stockfish 14 periodically for the purpose of catching performance anomalies and identifying/discarding bad results (due to thermal issues, etc), but the measurements won’t be used for TC scaling.
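
As a rough illustration of how periodic bench measurements could be used to spot bad results, here is a minimal sketch. The 5% tolerance, the sample values, and the function name `flag_anomalies` are all assumptions for the example; this is not NCM's actual implementation.

```python
from statistics import median


def flag_anomalies(nps_samples, tolerance=0.05):
    """Return indices of bench samples that deviate from the median nps.

    Assumption: a sample is suspicious if it differs from the median by
    more than `tolerance` (a fraction, 0.05 = 5%).
    """
    baseline = median(nps_samples)
    return [
        i
        for i, nps in enumerate(nps_samples)
        if abs(nps - baseline) / baseline > tolerance
    ]


# Hypothetical bench readings; the fourth one looks thermally throttled.
samples = [1_600_000, 1_610_000, 1_590_000, 1_200_000, 1_605_000]
print(flag_anomalies(samples))  # [3]
```

Games played around a flagged measurement could then be discarded or re-run.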

Changes to Adjudication

  1. Adjust -resign to movecount=3 score=600. (“Adjudicate the game as a loss if an engine’s score is at least 600 centipawns below zero for at least 3 consecutive moves.”)
  2. Adjust -draw to movenumber=34 movecount=8 score=5. (“Adjudicate the game as a draw if the score of both engines is within 5 centipawns from zero for at least 8 consecutive moves, and at least 34 full moves have been played.”)
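
The two adjudication rules can be sketched roughly as follows. This is a simplified reading of the rule descriptions, not cutechess-cli's actual code, and details such as exactly how consecutive moves are counted may differ. Scores are assumed to be in centipawns from each engine's own point of view, one entry per move that engine has played.

```python
def should_resign(own_scores_cp, movecount=3, score=600):
    """True if the engine's last `movecount` scores are all <= -score."""
    recent = own_scores_cp[-movecount:]
    return len(recent) == movecount and all(s <= -score for s in recent)


def should_draw(white_scores_cp, black_scores_cp, full_moves_played,
                movenumber=34, movecount=8, score=5):
    """True if both engines stayed within `score` cp of zero for the last
    `movecount` moves and at least `movenumber` full moves were played."""
    if full_moves_played < movenumber:
        return False
    for scores in (white_scores_cp, black_scores_cp):
        recent = scores[-movecount:]
        if len(recent) < movecount or any(abs(s) > score for s in recent):
            return False
    return True


print(should_resign([-100, -650, -700, -800]))   # True
print(should_draw([0] * 40, [3] * 40, 40))       # True
print(should_draw([0] * 40, [3] * 40, 20))       # False: too early in the game
```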

Changes Related to the "Pentanomial Results"

This is a new concept to me, and one of my favorite additions. We’ll ensure that each engine always plays both sides of the positions selected at random from the book. (Currently this usually happens, but not always: specifically it may not happen when the servers are racing to finish the 20,000th game.)

Then we track the frequency of the scores of the two-game rounds: that is, how many times did a round result in the dev build scoring 0.0, 0.5, 1.0, 1.5, 2.0.

To better record this in the PGN generated by cutechess-cli, we’ll pass -rounds 2 and -games 500.
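
Here is a small sketch of how the two-game rounds could be tallied into the five pentanomial buckets. The input format (a list of the dev build's scores in game 1 and game 2 of each round) and the function name `pentanomial` are assumptions for illustration; the real data would come from the PGN files.

```python
from collections import Counter

# Possible dev-build totals for one two-game round.
BUCKETS = (0.0, 0.5, 1.0, 1.5, 2.0)


def pentanomial(pairs):
    """Count how many rounds ended with each total score for the dev build."""
    counts = Counter(first + second for first, second in pairs)
    return [counts.get(bucket, 0) for bucket in BUCKETS]


# Hypothetical rounds: (score in game 1, score in game 2).
rounds = [(1.0, 0.5), (0.5, 0.5), (1.0, 0.0), (0.5, 0.0), (1.0, 1.0)]
print(pentanomial(rounds))  # [0, 1, 2, 1, 1]
```

Note that a win plus a loss and two draws both land in the 1.0 bucket, which is exactly the information a plain win/draw/loss count cannot recover from the pair level.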

All the best to NCM. No doubt the new dev build tests will make NCM number 1 in the world: CCRL testing and rating have become totally untrustworthy, CEG is no longer conducting tests, and the Ipman test is honest but not useful to anyone. The time when Chendry can be proud of it is not far off. With SF14 as the base and the UHO book, I think NCM might start at around +120 Elo.
We (paid members) would like at least 10 seconds of thinking time to be added.
