lolinder · a year ago
A small weakness in this test is that one of the keys to strategic Codenames play is understanding your partner. You're not just trying to connect the words, you're trying to connect them in a way that will be obvious to your partner. As a computing analogy: you're trying to serialize a few cards in a way that will be deserializable by the other player.

This test pairs o1 with itself, which means the serializer is the deserializer. So while it's impressive that it can link 4 words, most humans could also easily link 4 with as much stretching! We just don't tend to because we can't guarantee that the other human will make the same connections we did.

ModernMech · a year ago
lol I played this game with my family and they said my wife and I were cheating because I kept using inside jokes that made no sense to them but that she would get immediately.
dgritsko · a year ago
That's a big part of what makes this game enjoyable - a clue that is very obvious to one person might not even cross the mind of someone else. To anyone reading this who hasn't played, it's definitely worth giving it a try.
lupire · a year ago
Same for Taboo for me. It's why we married.
cyode · a year ago
Stretching? Never! I see your 4-clue, o1, and raise you “QUEUE” for 5:

  - Line (Standing in the queue…)
  - London (they’re all queued up, innit?)
  - Log (*backend distsys handwaving*)
  - Mail (what do you think an inbox is, anyway?!)
  - Round (homophone “Q” is a typographically round letter)

paulddraper · a year ago
I think Round may be invalid but in any case I would not have gotten it.
suveen_ellawela · a year ago
Thanks for the comment. I actually tried explicitly mentioning in the prompt that 'Your guesser follows the same reasoning process', but this did not yield any clear improvement. Maybe I should've done more prompt engineering.
lolinder · a year ago
Nah, prompt engineering wouldn't have solved the fundamental issue, which is that the associations between ideas as stored in the weights will be the same between the two AI players, which makes it an easier game for them than for a human equivalent. It'd be like two copies of you playing on a team, having shared all the same experiences right up until the moment the game starts.

And don't get me wrong, it's still a fun experiment! It's just that a 4-word clue like that would never have worked if a human played against another human—there are simply too many other words that would be equally strongly associated:

* Gum: Gum is often wrapped in paper, so 'GUM' is strongly associated with the word 'PAPER'.

* King: King is a type of face card, which are printed on paper, so 'KING' is strongly associated with the word 'PAPER'. (Repeat for JACK.)

* Light: Paper is a lightweight material.

That's 4 others right there that are at least as closely connected in my head as LAWYER or LOG. The only reason why o1 pulled up the same four when guessing as it did when clueing is that it's the same model.

Again, I didn't mean this as a knock, just a warning about drawing too many conclusions from the test!

jncfhnb · a year ago
Ehhh I don’t think that’s accurate. The problem is not linking 4 words. It’s linking 4 words without accidentally triggering other, semantically adjacent words.

This task could probably be solved nearly as well with old-school word2vec embeddings.

lolinder · a year ago
Right, that's what I meant to be getting at: when you connect 4 words with as much stretching as o1 did there, you're running a real risk that the other party connects a different set. Unless that other party is also you and has the same learned connections at top of mind.

furyofantares · a year ago
> This task could probably be solved nearly as well with old-school word2vec embeddings

I've tried. This approach is well beyond awful.

zeroonetwothree · a year ago
I don’t find this “super good”. It’s mostly giving 2 clues, which is the most basic level of competence. The paper 4 clue is reasonable but a bit lucky (e.g., Jack is also a good guess). I also don’t see it actually using tactics properly, which I would consider part of being “super good”. The game isn’t just about picking a good clue each round!

Now obviously it’s still pretty decent at finding the clues. Probably better than a random human who hasn’t played much. Just I find the post’s level of hype overstated. It feels like the author isn’t very experienced with Codenames.

It would be interesting to compare AI:human vs human:human games to see which does better. It seems like AI:AI will overstate its success.

dang · a year ago
Ok, we've taken supergoodness out of the title now. Presumably the post is still interesting!

(Submitted title was "I got OpenAI o1 to play the boardgame Codenames and it's super good".)

groggo · a year ago
Can you elaborate on some of the more advanced tactics?

When I play, it's mostly about getting a good 2 clue each time. Then if you can opportunistically get a 3 or 4, that's awesome.

Some tactics come in for choosing the right pairs of 2's so you don't end up mismatched, or leaving clues that might be ambiguous with your opponent's... But that's mostly it.

It'll be fun for multiplayer! Just like how in other online games you can add in an AI to play as one of the players.

jncfhnb · a year ago
If you really want to get good, your goal is not so much to get as many tiles as possible, but rather to get the tiles that are semantically distinct from your opponent’s. A single mistake that triggers your opponent’s tile is generally enough to lose the game. And even if they don’t do it, having them uncover the tiles from their side that are semantically similar to your own team is also useful.

If you want to get nasty, you learn to abuse the fact that the tile layouts follow rules and that you can rule out certain tiles without considering the words.

mtmickush · a year ago
Other advanced tactics involve giving a broad clue that matches 3-4 of your own words and just one other (either your opponent's or a civilian). Your team can pick up all the matches across several turns, and the one miss doesn't hurt as much as the plus four helps.
harrall · a year ago
I find the game more about reading the people on your team (and the other team) to understand how they think.

You have to give entirely different clues depending on the people you play with.

Sometimes you can also play adversarial and introduce doubt into the opposing team by giving topic-adjacent clues that cause them to avoid one of their own cards. It works better if someone on the other team tends to be a big doubter. It also can work when the other team constantly goes back and tries to pick n+1 cards that they think they missed from the last round, which gives you a lot of room to psychologically mess with them.

Sometimes you have a clue that only really matches 2, but because only 1 of the wrong matches is a neutral card and you could match 2 more by a massive stretch, you say “4.” Worst case, they get 2 right and then pick the neutral card; best case, you stand to gain 4 from a clue that should only match 2.

I like Codenames because there are many meta ways to play the game. What makes Codenames unique is that, unlike a lot of other games (Catan, Secret Hitler, CAH, etc.), it’s an adversarial team game where the team dynamics and discussions are not secret, so you can use them to your advantage.

blix · a year ago
Experienced players who know their teammates well can reliably get 3-4s. If you only go for safe 2s against these opponents you will lose every time.
lupire · a year ago
They should at least play with two different AI models.
lsy · a year ago
Some of these clues wouldn't be very good for a human playing. "007", for example, isn't a very good clue for "laser", not only because something happening to appear in one of several films about a character doesn't rise to the typical level of salience, but also because other words on the board like "shark" and "astronaut" meet the criterion of featuring prominently in James Bond movies even more strongly, and "astronaut" appears to be a game-ending choice.
croes · a year ago
Is that really surprising?

It’s basically the same brain playing with itself. Seems quite natural to link the code names to the same words.

Let different LLMs play.

deredede · a year ago
This is the take I thought I'd have, but in the last example, the guesser model reaches the correct conclusion using a different reasoning than the clue giver model.

The clue giver justifies the link of Paper and Log as "written records", and between Paper and Line as "lines of text". But the guesser model connects Paper and Log because "paper is made from logs" (reaching the conclusion through a different meaning of Log), and connects Paper and Line because "'lined paper' is a common type of paper".

Similarly, in the first example, the clue giver connects Monster and Lion because lions are "often depicted as a mythical beast or monster in legends" (a tenuous connection if you ask me), whereas the guesser model thought about King because of King Kong (which I also prefer to Lion).

wizzwizz4 · a year ago
> But the guesser model connects Paper and Log because "paper is made from logs" (reaching the conclusion through a different meaning of Log)

No, it doesn't. It reaches the conclusion because of vector similarity (simplified explanation): these explanations are post-hoc.

unlikelymordant · a year ago
Generally there is a "temperature" parameter that can be used to add some randomness or variety to the LLM's outputs by changing the likelihood of the next word being selected. This means you could keep regenerating the same response and get different answers each time; each time it will give different plausible responses, all from the same model. This doesn't mean it believes any of them; it just keeps hallucinating likely text, some of which will fit better than others. It is still very much the same brain (or set of trained parameters) playing with itself.
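To make the temperature mechanic concrete, here is a minimal sketch (my own toy numbers, not o1's actual logits) of how temperature rescales a model's raw scores before sampling:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into a probability distribution.

    Lower temperature sharpens the distribution (the top token
    dominates); higher temperature flattens it (more variety)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.5)  # near-deterministic
warm = softmax_with_temperature(logits, 1.5)  # more spread out
```

Sampling from `warm` gives the varied "plausible responses" described above, but the ranking of candidates (and so the underlying associations) never changes, which is the point: it's the same set of weights either way.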
elicksaur · a year ago
Or, have it play a human and compare human-human and llm-human pairs.
ushiroda80 · a year ago
Yeah, not sure what’s impressive about this. Having the model be both the guesser and clue giver will of course give good results, as it’s simply a reflection of o1’s weighting of tokens.

Interestingly this could be a way to potentially reverse engineer o1’s weightings

kennyloginz · a year ago
Could this just be a case of Reddit being included in the training data?

“I read through codenames official rules to see if using "007" as a clue was allowed, and it turns out it is! To my surprise, I even came across a Reddit post where people were discussing and justifying why this clue fits perfectly within the rules.”

JohnMakin · a year ago
Yea, initially I thought this post was satire because of this.
suveen_ellawela · a year ago
That is a really interesting point. If it is true, this shows direct usage of a single training data point (since there are no other resources talking about this fact).
jprete · a year ago
Codenames is absolutely dead-center of what I expect Large Language Models to be good at. The fundamental skills of the game are: having an excellent embedding for word semantics and connotations; modeling other people's embeddings; a little bit of game strategy related to its competitive nature.
xnickb · a year ago
Somehow I expected AI to give clues that combine 4-5-6 words at a time. It's not at all impressive to me. And I'm not a serious player at all
vitus · a year ago
I am similarly less-than-impressed. If you click through to the website, you can watch the replay of one of the games mentioned in the article (the one with the clue "invader").

In that instance, the clues all matched 2-3 words, and the winning team got lucky twice (they guessed an unclued word using an unintended correlation, and their opponent guessed a different one of their unclued words.)

You also see a number of instances where the agents continue guessing words for a clue even though they've already gotten enough matches. For instance, in round 2, for the clue "Japan (2)", the blue team guesses sumo and cherry, then goes for a rather tenuous follow-up guess for round 1's 007 with "ring" (despite having gotten the two clued matches in the first round). A sillier example is in the final round, where the Red Team guesses 3 words (thereby identifying all nine of their target words), then goes ahead and guesses another word.

(For what it's worth, I think "shark" would have been a better guess for another 007 tie-in seeing as there are multiple Bond movies with sharks, but it's also not a match, and again, I wouldn't have gone for a third guess here when there were only two clued words.)

garretraziel · a year ago
This is allowed by the rules though. You can guess +1 to the number specified.
pama · a year ago
I was wondering about the same. It is possible that the instructions didn’t try to make the gameplay as aggressive as possible. A good model could optimize the separator to make it easy to guess the most words possible. By having access to its own state, it should be possible to reach 5–6 words in most cases. There is an argument for keeping words around that would increase the difficulty of the opponents guessing large/clean separations, so it is possible that optimal play includes simple pairs on occasion. Very interesting application nonetheless.
vitus · a year ago
> It is possible that the instructions didn’t try to make the gameplay as aggressive as possible.

In case you're wondering, the prompts are available here: https://github.com/SuveenE/codenames-ai/blob/main/utils/prom...

wwtl12 · a year ago
The Mechanical Turk is super impressive if you don't know how it works.
captn3m0 · a year ago
I've been trying to do this with just word2vec, instead of throwing an LLM at it, since you just need to find a word with the appropriate distances optimized. https://github.com/captn3m0/ideas?tab=readme-ov-file#codenam...
qqqult · a year ago
I did that last summer: I compared the performance of different English word-embedding models, and as far as I remember the best ones were GloVe and a few knowledge-graph word embeddings.

None of them were better than a human at giving hints for 3+ words though

zeroonetwothree · a year ago
I tried this many years ago (before LLMs) with hundreds of real human games and it was never that good.
dartos · a year ago
I love this.

Imagine the energy savings if more people didn’t just automatically reach for LLMs for their pet projects.