porridgeraisin · a year ago
I don't think all this is needed to prove that LLMs aren't there yet.

Here is a simple trivial one:

"make ssh-keygen output decrypted version of a private key to another file"

I'm pretty sure everyone on the LLM hypetrain will agree that just that prompt should be enough for GPT-4o to give a correct command. After all, it's SSH.

However, here is the output command:

  ssh-keygen -p -f original_key -P "current_passphrase" -N "" -m PEM -q -C "decrypted key output" > decrypted_key
  chmod 600 decrypted_key
Even the basic fact that ssh-keygen is an in-place tool and does not write data to stdout is not captured strongly enough in the representation for it to be activated with this prompt. Thus, it also overwrites the existing key, and your decrypted_key file will contain "your identification has been saved with the new passphrase", lol.

Maybe we should set up a cron job - sorry, chatgpt task - to auto-tweet this in reply to all of the openai employees' hype tweets.

Edit:

chat link: https://chatgpt.com/share/67962739-f04c-800a-a56e-0c2fc8c2dd...

Edit 2: Tried it on deepseek

The prompt pasted as is, it gave the same wrong answer: https://imgur.com/jpVcFVP

However, with reasoning enabled, it caught the fact that the original file is overwritten in its chain of thought, and then gave the correct answer. Here is the relevant part of the chain of thought in a pastebin: https://pastebin.com/gG3c64zD

And the correct answer:

  cp encrypted_key temp_key && \
  ssh-keygen -p -f temp_key -m pem -N '' && \
  mv temp_key decrypted_key
I find it quite interesting that this seemingly 2020-era LLM problem is only correctly solved on the latest reasoning model, but cool that it works.
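For reference, the copy-first pattern generalizes. Here is a minimal, self-contained sketch of the same idea (it generates a throwaway demo key so it can run anywhere; `encrypted_key`, `decrypted_key`, and `old_passphrase` are placeholders). The key point is that `ssh-keygen -p` rewrites whatever file `-f` names, in place, so you only ever point it at a copy:

```shell
# ssh-keygen -p edits the file named by -f in place, so decrypt a copy,
# never the original. Placeholder names; a throwaway demo key is created
# here so the snippet is self-contained.
workdir=$(mktemp -d) && cd "$workdir"
ssh-keygen -t ed25519 -f encrypted_key -N 'old_passphrase' -q   # demo key only

cp encrypted_key decrypted_key
chmod 600 decrypted_key                                      # ssh-keygen insists on tight permissions
ssh-keygen -p -f decrypted_key -P 'old_passphrase' -N '' -q  # strip passphrase on the copy only
```

After this, `encrypted_key` is untouched and still passphrase-protected, while `decrypted_key` opens with no passphrase; deepseek's cp/mv version above does the same thing with one extra rename.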

Kim_Bruning · a year ago
Ah, I see. You phrased it in a misleading way. And once misled, non-reasoning models can't/won't back up once they're down the wrong path.

Slight improvement:

"make ssh-keygen output decrypted version of a private key to another file . Use chain reasoning, think carefully to make sure the answer is correct. Summarize the correct commands at the end."

This improved the odds for me of getting the right answer in the format you were looking for in GPT-4o and Claude.

These things aren't magic oracles, they're tools.

porridgeraisin · a year ago
What was misleading? It's a very reasonable prompt that contains all the information required to generate the rest of the answer.

I didn't ask or expect any format. The accurate answer in whatever format is all that is expected.

Game_Ender · a year ago
It looks like o1 also gets the right answer after thinking about it for 14 seconds: https://chatgpt.com/share/67962ead-a5f8-800a-bd91-9a145b993e...
ANewFormation · a year ago
The thing that makes the puzzle neat is that it's one that a reasonably clever person who literally just learned the rules of chess should be able to solve.

There's no nuance to it whatsoever beyond needing to demonstrate knowledge of the rules of the game.

plorkyeran · a year ago
I think you have completely forgotten what it is like to be a beginner at chess if you think that someone who has just learned the rules of the game would be able to identify that the best move is to underpromote a pawn to force a draw.
danparsonson · a year ago
Is it reasonable to imagine that LLMs should be able to play chess? I feel like we're expending a whole lot of effort trying to distort a screwdriver until it looks like a spanner, and then wondering why it won't grip bolts very well.

Why should a language model be good at chess or similar numerical/analytical tasks?

In what way does language resemble chess?

ahoka · a year ago
I think because LLMs are convincingly good at natural language tasks, humans tend to anthropomorphize them. Due to this, it often is assumed that they are good at everything humans are capable of.
Filligree · a year ago
Okay, but I'm not good at playing chess without seeing the chessboard. In fact I'm pretty awful at that.
thepoet · a year ago
It might be a reasonable ask for an LLM to 'remember' the endgame tablebase of solved games - which is less than a GB for all games with five or fewer pieces on the board. This puzzle specifically relies on that knowledge plus the knowledge of how the chess pieces move.
GuB-42 · a year ago
LLM: given a sequence of words, what is the most likely next word

Chess engine: given a sequence of moves in a winning game, what is the most likely next move

I don't think LLMs will ever beat purpose-built engines, but it is not inconceivable for them to play better chess than most humans.

porridgeraisin · a year ago
Yeah, I don't think they are a useful measuring stick for LLMs.

My amateur opinion is that an "AI system" resembling AGI or ASI or whatever the acronym of the day is will be modular, with different parts addressing different kinds of learning, rather than entirely end-to-end. One of the main milestones towards achieving this would be the ability to dynamically learn what is left to be learnt (finding gaps), and then potentially have it train itself on that, automatically. One of the half-milestones, I suppose, would be for humans to find the gaps in its ability first of all.

I attended a talk recently where they presented research that tried to distinguish effectively the following two types of LLM failures:

1) inability to generalize/give the output at the "representation layer" itself

2) has the information represented, but is not able to retrieve it for the given reasonable prompt, and requires "context scaling"

Which is a step towards this goal I suppose.

SkiFire13 · a year ago
Continuing to ask every LLM the same puzzle seems flawed to me, since LLMs may eventually "learn" the answer to this particular puzzle just because it has been seen somewhere, without actually becoming able to answer other chess-related puzzles in a way that at least follows the chess rules.
Filligree · a year ago
It's fine until the LLMs succeed. Then you need to try another one.
echoangle · a year ago
What’s the solution? I’m a human and I can’t come up with a specific move I would call „the correct move“.
binarymax · a year ago
The board is backwards and black to move. It’s annoying in that chess puzzles should always have black on top and white on bottom, and a caption of whose move it is. It’s clear in the FEN, but the image reverses it with no explanation.
filleduchaos · a year ago
I mean, I disagree that a chess board should only ever be represented from the perspective of white. Or rather, I cannot square being even remotely decent at chess with being unable to figure out whose perspective it is from the labelling of the ranks and files.
Veedrac · a year ago
Note that it's black to move and black's pawn moves upwards. All moves but one have instant counterplay from white.

Here's the board; you can enable the engine to get the answer: https://lichess.org/analysis/standard/8/6B1/8/8/B7/8/K1pk4/8...

Imustaskforhelp · a year ago
Same!
plank · a year ago
Winning is not possible: only a queen is strong enough to win against two bishops, and that fails to the check and loss of the queen from the dark-squared bishop.

So a draw is the most one can get. Underpromoting to a knight (with check, thus avoiding the check by the bishop) is the only way to promote and keep the piece for another move.

I guess in this situation the knight against two bishops holds the draw.

emmelaich · a year ago
My go-to is asking an LLM to explain this poem (this is but one variation) ...

   Spring has sprung, the grass iz riz,
   I wonder where da boidies iz?
   Da boid iz on da wing! Ain’t that absoid?
   I always hoid da wing...wuz on da boid!
ChatGPT models before o1 failed; o1 did very well. DeepSeek-R1 7B did OK too.

markisus · a year ago
I’m having a hard time with this one as a human.

Spring has sprung, grass has risen. I wonder where the birdies are? The bird is on the wing! Isn’t that absurd? I always heard the wing was on the bird!

Not sure what it means that the bird is on the wing.

janwillemb · a year ago
it means "in flight"
SideburnsOfDoom · a year ago
> Not sure what it means that the bird is on the wing

Is this hard to find out? I mean, it's easy to find this: https://www.dictionary.com/browse/on-the-wing and this: https://www.allgreatquotes.com/hamlet-quotes-114/

So it's somewhat old-timey English English.

And to explain the joke, the humour is that the rest of it is phonetic 20th century New York English. E.g. pronouncing "Absurd" more like "absoid" to rhyme with "boid".

New York guy finds Shakespearean English absoid.

The other commenter has this right: now that the joke has been explained on the internet, it will be harvested and LLMs will regurgitate variations on the explanations, then people will believe that the LLMs have become "more intelligent" in general. They have not; they just have more data for this specific test.

Kim_Bruning · a year ago
Claude 3.5 Sonnet did rather well.

I get the impression it often does as well or better than o1 on many tasks, despite not being a reasoning model.

randomtoast · a year ago
If you input your own amateur-level chess games in PGN format and consult o1 about them, it provides surprisingly decent high-level game analysis. It identifies the opening played, determines the pawn structures, comments on possible middlegame plans, highlights mistakes, and can quite accurately assess whether an endgame is won or lost. Of course, as always, you should review o1's output and verify the statements, but there can be surprising (chess-coach-like) hints, insights, and improvement ideas that Stockfish-only analysis cannot offer.
vidar · a year ago
Maybe run through Stockfish first and then the LLM?
andix · a year ago
This test is worthless in a few weeks, it's now going into the training data. Even repeatedly posting it into LLM services (with analytics enabled) could lead to inclusion in the training data.
thepoet · a year ago
Interestingly, this test has been in the public domain for the last seven years, since it is part of the set of all possible chess games with 7 or fewer pieces, which has been solved and published. It is a huge file, but the five-piece dataset with the FEN is less than a GB. I wonder if it even got included in the training data earlier, or if it will be.
andix · a year ago
I don't think such datasets are going into AI training. But if this exact question keeps showing up in analytics data, and forum posts, it might end up in training sets.
mettamage · a year ago
Just from a personal layman's perspective: I find being able to reason about chess moves a fair way to measure a specific type of reasoning. The fact that LLMs aren't good at this shows me that they're doing something else, which to me is equally disappointing as it is _interesting_.
Imustaskforhelp · a year ago
I am exactly 1600 at chess.com rating, and though I don't do puzzles much, what I do know is that if you push the white king to b2 then that pawn is losing; take that pawn and you have a 2-bishop endgame, which is really, really hard.

I once had a bishop and knight endgame; I think it became a draw by repetition.

Asking AI to do this is definitely flawed. This isn't reasoning. From what I know of the 2-bishop endgame, it's more of "hey, let's trap the king in a box until you can snipe the king with your bishop" (like his king could be on h1, yours on h3, one bishop targeting g1 and the other anywhere on the main diagonal, with no other pieces).

But this is very much stalematey, since I am currently pondering how to get to this position without a stalemate; if you move the bishop late, it's stalemate. Like seriously: https://www.chess.com/forum/view/endgames/two-bishop-checkma...

Just search "2 bishop checkmate is hard" - a lot of guides exist just for this purpose. Though in my 1000+ games I only got the 2-bishop endgame once or twice; usually it's bishop or knight, which is just as tricky, or, the worst as I recall, knight and knight.

plank · a year ago
Replying to your other questions: Its been a while since I played chess regularly (in a chess club), but:

Two bishops (of different colour) is actually not that difficult. There are some simple heuristics to help you there (an LLM might actually tell you these, haven't asked ;-))

Bishop+Knight is, in my opinion, slightly more complicated; there are some 'tricks' necessary to keep the king from running from one corner to the next.

Knight+knight is - in most situations - a draw (you need three knights to mate).

plank · a year ago
But there is a reasoning (see my reply above): winning is not possible (only the queen is strong enough against two bishops), so draw should be the goal. And underpromoting to knight is only way to keep the piece for another move while still promoting.
tzs · a year ago
Humans can also be disappointing and interesting. I like to do Lichess puzzles not logged in, which mostly gives puzzles with a Lichess puzzle ratings in the 1400-1600 range with some going down to around 1200 or up to the 1700s. Presumably that is the range the average Lichess player is in.

For those who have not used Lichess, the puzzles it gives (unless you ask for a specific type) do not tell you what the goal is (mate, win material, get a winning endgame, save a bad position, etc) or how many moves it will take.

Here are some puzzles it has recently given me and their current ratings. These all have something in common.

  1492 https://lichess.org/training/KsrR0
  1506 https://lichess.org/training/RwLfy
  1545 https://lichess.org/training/TzZdx
  1557 https://lichess.org/training/IJfT7
  1564 https://lichess.org/training/oOMz4
  1604 https://lichess.org/training/uRRck
  1661 https://lichess.org/training/jBrLX
  1719 https://lichess.org/training/cpKAM
What they have in common is that they are all mate in one. I have seen composed mate in ones that puzzled even high rated players, but they involved something unusual like the mating move was an en passant capture.

None of the above puzzles are tricks like that.

So how are enough people failing them for their ratings to be that high?

benmmurphy · a year ago
I assume Lichess has time-based puzzles, so people can be failing them because they are trying to optimize how many puzzles they solve rather than making no mistakes. Also, I suspect Lichess puzzle ratings do not match Lichess chess ratings (sure, they probably correlate, but there could easily be a +200 average difference or something like that between them). I can solve puzzles consistently at least 1000 rating points higher on chess.com than my chess.com rapid rating. Also, if you know these puzzles are mate in 1, they are much easier. I'm guessing you didn't know when you initially solved them, but I think it's much harder for other people to judge the difficulty of these puzzles when they know the solution is mate in 1.