LLMs, Theory of Mind, and Cheryl's Birthday

> At least with respect to this problem, they had no theory of mind.

This is very interesting and insightful, but I take issue with the above conclusion. Your average software engineer would probably fail to code up a python solution to this problem. But most people would agree that the average software engineer, and the average person, possesses some theory of mind.

This seems to be a pattern I'm noticing with AI. The goalposts keep moving. When I was a kid, the turing test was the holy grail for "artificial intelligence." Now, your run-of-the-mill LLM can breeze through the turing test. But no one seems to care. "They are just imitating us, that doesn't count." Every couple years, AI/ML systems make revolutionary advances, but everyone pretends it's not a big deal because of some new excuse. The latest one being "LLMs can't write a python program to solve an entire class of very challenging logic problems. Therefore LLMs possess no theory of mind."

Let me stick my neck out and say something controversial. Are the latest LLMs as smart as Peter Norvig? No. Are they smarter than your average human? Yes. Can they outperform your average human at a randomly chosen cognitive task that has real-world applications? Yes. This is pretty darn revolutionary. We have crossed the rubicon. We are watching history unfold in real-time.

hyperG · a year ago

It is because the goalposts were wrong.

We once thought that a computer could not beat a grandmaster in chess or pass the Turing test without some undefined special human property. We were wrong about the computer needing this undefined special human property.

A spreadsheet has been much better at math than the average person for a long time too. A spreadsheet is a very useful human tool. LLMs are a revolutionary useful tool. For some people that doesn't seem to be enough though and they have to try to find or insist the LLM has the undefined special human property.

titanomachy · a year ago

I consider myself a pretty average human programmer, and I was able to solve the logic puzzle and write a python program for it in ~10 mins. [0]

I agree though, the people who are unable to solve this probably still have a theory of mind. It seems like we're setting a rather high bar.

[0] https://pastebin.com/q33K0HJ1

kenjackson · a year ago

With all due respect, if you wrote a python program for this in 10 minutes you are not an average programmer.

stavros · a year ago

Does that count as a program that solves the problem? Your program finds the unique days/months, but you're hardcoding the part where the program discerns who knows what.

Maybe that counts, I don't know, I'm genuinely asking.

nuancebydefault · a year ago

Let me say this. I am convinced i cannot write a program that solves the puzzle in 10 minutes.

I am convinced though that i can write such program, including some test cases, with the help of an llm like bing copilot in 10 minutes. The global reasoning/steps would be mine, the llm would fill in the details.

I'm also convinced that it will be a matter of time (less than 5 years) before these kind of problems are solved trivially by llms, without prior example in the training set being necessary.

In other words, 'theory of mind' (of type defined by the author of the article) has already emerged from machines.

People are a bit reluctant to believe that, me not so much.

GolDDranks · a year ago

> Now, your run-of-the-mill LLM can breeze through the turing test.

Can they? You can ask arbitrary questions in the Turing test. I doubt many models would be able successfully imitate humans in such adversarial conditions. Note that the Turing test doesn't require to judge to be unsophisticated or unknowledgeable about AI's capabilities or weaknesses. I believe that AI's are closer than ever passing the Turing test, but I'm sceptical until I see it.

stavros · a year ago

What kind of questions would you ask to distinguish?

Jerrrrrrry · a year ago

The goalposts will continue to move until GDP improves.

DoctorOetker · a year ago

until who's GDP moves?

Suppose nation X or power bloc Y's GDP improves due to ML, will nation Z without increasing GDP continue to move the goalposts?

burner_in_ma · a year ago

> Your average software engineer would probably fail to code up a python solution to this problem

[citation needed]. I say that, if you can't write a program that solves this problem, you don't have any business calling yourself a "software engineer".

from collections import defaultdict def find_cheryls_birthday(possible_dates): # Parse the dates into month and day dates = [date.split() for date in possible_dates] months = [month for month, day in dates] days = [day for month, day in dates] # Step 1: Albert knows the month and says he doesn't know the birthday # and that Bernard doesn't know either. This implies the month has no unique days. month_counts = defaultdict(int) day_counts = defaultdict(int) for month, day in dates: month_counts[month] += 1 day_counts[day] += 1 # Months with all days appearing more than once possible_months = [month for month in month_counts if all(day_counts[day] > 1 for m, day in dates if m == month)] filtered_dates = [date for date in dates if date[0] in possible_months] # Step 2: Bernard knows the day and now knows the birthday # This means the day is unique in the filtered dates filtered_days = defaultdict(int) for month, day in filtered_dates: filtered_days[day] += 1 possible_days = [day for day in filtered_days if filtered_days[day] == 1] filtered_dates = [date for date in filtered_dates if date[1] in possible_days] # Step 3: Albert now knows the birthday, so the month must be unique in remaining dates possible_months = defaultdict(int) for month, day in filtered_dates: possible_months[month] += 1 final_dates = [date for date in filtered_dates if possible_months[date[0]] == 1] # Convert back to original format return ' '.join(final_dates[0]) if final_dates else "No unique solution found." # Example usage: possible_dates = [ "May 15", "May 16", "May 19", "June 17", "June 18", "July 14", "July 16", "August 14", "August 15", "August 17" ] birthday = find_cheryls_birthday(possible_dates) print(f"Cheryl's Birthday is on {birthday}.")

Deducing things from the inability of an LLM to answer a specific question seemed doomed by the "it will be able to on the next itteration" principle.

It seems like the only way you could systematic chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate, so a small increase in problem complexity requires a significant increase in LLM power.

aithrowawaycomm · a year ago

> It seems like the only way you could systematic chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate

That would be any problem more complicated than O(n) complexity, even with chain-of-thought prompting[1].

Note that the O(n) thing can bite you in all sorts of unintuitive ways: if the LLM+CoT can perform an O(n) Task A and O(m) Task B, then it can't do the O(nm) task "for every step of A, perform B on the result" unless you come up with a task-specific prompt outlining the solution. The alternative is to play RLHF Whack-A-Mole, separately training the LLM on the combined task. (I think this weakness might be why LLMs are hitting a wall in enterprise deployment, and also explains why LLM agents don't actually work.) The only way this will get fixed is with a fundamentally more sophisticated architecture.

[1] https://www.quantamagazine.org/how-chain-of-thought-reasonin...

godelski · a year ago

  > Deducing things from the inability of an LLM to answer a specific question seemed doomed by the "it will be able to on the next itteration" principle.

That's orthogonal.

If we are pointing in the right direction(s) then yes, next iteration could resolve all problems.

If we are not pointing in the right direction(s) then no, next iteration will not resolve these problems.

Given LLMs rapid improvement in regurgitating knowledge from their training data but simultaneously slow improvement in their ability to generalize (such as logic "puzzles"), I think it is naive to assume we're pointed in the right direction. Maybe we're even pointing in mostly the right direction. But why assume we are?

We can continue in the direction we are going while simultaneously considering it might not be well aligned. If we are well aligned, that gives us more confidence and makes gathering funding easier. If we aren't, well it is easier to course correct sooner than later. In either case, you benefit from the analysis.

Understanding why things fail is more important than understanding why things succeed.

Uehreka · a year ago

GP is referring to the fact that if it becomes well known that LLM version X can’t solve problem Q, then the model’s trainers will make sure to include problem Q prominently in the training set, running it through over and over to ensure that version X+1 is able to solve the problem whether the model’s “reasoning” abilities have improved or not.

Thus observers of the LLM space like us need to keep finding novel “Bellweather problems” that we think will evaluate a model’s ability to reason, knowing that once we start talking about it openly the problem will no longer be a useful Bellweather.

By their nature as “weird-shaped” problems, these aren’t the kind of thing we’re guaranteed to have an infinite supply of. As the generations move on it will become more and more difficult to discern “actual improvements in reasoning” from “the model essentially has the solution to your particular riddle hard-coded”.

falcor84 · a year ago

... and such that the same increase in problem complexity requires a smaller increase in human effort to solve.

This was the idea with the Winograd schema challenge [0] and now the ARC benchmark [1], but human-level performance on the former was achieved in 2019, and very strong progress is being made over the last few months on the latter. But at the current point in time, it seems that we're pretty much reaching the limit of such challenges that are relatively easy for humans to solve in a single sitting, and we'll have to start switching to benchmarks which rely on extensive work over time, such as SWE-Bench [1], and even there it seems that state of the art AI agents are already doing better than the "average" human developer.

[0] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[1] https://arcprize.org/

[2] https://www.swebench.com/

whack · a year ago

jawns · a year ago

A long time ago, I created a version of this challenge called "Cheryl's Murder."

My notebook not only solves logical induction problems like "Cheryl's Birthday," but it also generates them.

https://github.com/shaungallagher/cheryls-murder/blob/master...

xelxebar · a year ago

> Twice a year, all of the software engineers at our company are given several days to work on a project of their choosing.

Monetate sounds like it has (had?) some interesting leadership!

airstrike · a year ago

This is awesome, thanks for sharing

oli5679 · a year ago

Gp1-o1 preview solves this puzzle correctly in 13 seconds and has a thorough logical deduction in the comments and explanation.

I think it’s a bit unfair on llm to ask it to retrieve the puzzle definition from its training data. I posted the info on the puzzle from his notebook.

https://chatgpt.com/share/670103ae-1c18-8011-8068-dd21793727...

lagmg05 · a year ago

The question is if it solved the puzzle correctly before Norvig's article appeared. It could have been trained (I am told that existing models can be modified and augmented in any Llama discussion) on the article or on HN comments.

There could even be an added routine that special cases trick questions and high profile criticisms.

Fripplebubby · a year ago

While this is technically possible, it is not remotely practical and the downside risk of pushing out a borked model is much higher than the upside.

Training the model is expensive (obviously), but even if you are only training it slightly, running evaluations to determine whether the particular training checkpoint is at or above the quality bar is expensive, too.

Terretta · a year ago

> The question is if it solved the puzzle correctly before Norvig's article appeared. It could have been trained...

This caught me by surprise — is there a suggestion or evidence that despite the "knowledge cutoff" OpenAI is continuously retraining GPT-4o's chat-backing model(s) on day over day updates to the web?

Sure,

I guess the best way to test this is to compose a new question, of a similar format.

The question is to get it to write generic code

Disappointing that Norvig didn’t use the model that OpenAI states is their best model for programming.

Also using himself as the programmer seemed like a convenient choice. I’d much rather see him grab a random professional programmer for the task.

drhouse_md · a year ago

gpt-o1 was released Sept. 12th and Norvig ran his tests Sept 25th... I don't understand how Norvig didn't think to test gpt-o1, it actually irritates me lol

erwald · a year ago

o1 mini seems to get it on the first try (I didn't vet the code, but I tested it and it works on both examples provided in the notebook, `dates` and `gabe_dates`):

mewpmewp2 · a year ago

In addition to that after they create the 1st program with mistakes the author should have showed them the invalid output and let them have a chance to fix it. For humans solving this on the first try without running the code also tends to frequently not work.

fragmede · a year ago

"seems to" isn't good enough, especially since it's entirely possible to generate code that doesn't give the right answer. 4o is able to write some bad code, run it, recognize that it's bad, and then fix it, if you tell it to.

https://chatgpt.com/share/670086ed-67bc-8009-b96c-39e539791f...

Chinjut · a year ago

Did you actually run the "fixed" code here? Its output is an empty list, just like the pre-"fixed" code.

isaacfrond · a year ago

despite the name ‘mini’. it is actually more optimized for code. so that makes sense.

ynniv · a year ago

The problem with evaluating LLMs is that there's a random component, and the specific wording of prompts is so important. I asked Claude to explain the problem, then write python to solve it. When it ran there was an exception, so I pasted that back in and got the correct answer. I'm not sure what this says about theory of mind (the first script it wrote was organized into steps based on who knew what when, so it seems to grok that), but the real lesson is that if LLMs are an emulation of "human" intelligence, they should probably be given a python interpreter to check their work.

skybrian · a year ago

Yes, that helps. But if you iterate on this a few times (as I did last year with Code Interpreter), it reveals how much LLM's "like" to imitate patterns. Sure, often it will pattern-match on a useful fix and that's pretty neat. But after I told it "that fix didn't work" a couple times (with details about the error), it started assuming the fix wouldn't work and immediately trying again without my input. It learned the pattern! So, I learned to instead edit the question and resubmit.

LLM's are pattern-imitating machines with a random number generator added to try to keep them from repeating the same pattern, which is what they really "want" to do. It's a brilliant hack because repeating the same pattern when it's not appropriate is a dead giveaway of machine-like behavior. (And adding a random number generator also makes it that much harder to evaluate LLM's since you need to repeat your queries and do statistics.)

Although zero-shot question-answering often works, a more reliable way to get useful results out of an LLM is to "lean into it" by giving it a pattern and asking it to repeat it. (Or if you don't want it to follow a pattern, make sure you don't give it one that will confuse it.)

brain5ide · a year ago

If I understood correctly, that anectode in first paragraph looks like an interaction with a child who is trying something but lacks confidence.

benreesman · a year ago

Sonnet-3.5 seems a lot better at backing correct fixes out of TypeScript compiler errors than Python runtime errors. Which fair enough, I'm better at that too.

Of the two or three languages these things have enough training data on to hit "above average StackOverflow answer on demand", I'm being forced to re-evaluate my sometimes strident forecasts that LLM coding was mostly hype. I'm not quite ready to eat crow yet, but I've made sure there's clean silverware in case I need to (and I will admit it if I was conclusively full of shit).

It's still wildly over-stated and it's still a delicate game to come out ahead on the correct code after the hallucination rabbit holes have been deducted, but in certain verticals LLMs have become my first stop.

In the "strictly better than the sort of people who do this" regime is clickbait tech blog posts. I now almost always have them write me some fairly generic rant with a catchy title when I'm in the mood to read the sort of shit that gets frontpage because title. I don't post them because I'm not a spammer, but for my own private amusement? Beats the hell out of basically any low-detail technology essay. In a macabre way that's to me the more interesting commentary on theory of mind.

Don't take my word for it, but this crow is delicious.

al_potato · a year ago

This test plainly shows that even with the real solution in the training data, the wrong answer is written as though it's the correct answer. A human would say, "I'm not sure, I want to test it." The current AI summer is heaving with breathless claims of intelligence, comprehension, reasoning, etc.

I think these claims need to be balanced with a cold shower of reality. Personally, I find LLMs very impressive at what they do well; generating and summarizing and translating. People apologizing for LLMs' performance at straight-forward reasoning and programming tasks, suggesting various crutches and head-starts, gives me the creeps. It's not the Messiah. It's a very naughty computer program.

jfcoa · a year ago

This seems like a terrible test case since python examples are readily available in the training data: https://rosettacode.org/wiki/Cheryl%27s_birthday

It's interesting that so many of the model's fail to retrieve this, but any thta do solve it should clearly be able to do so with no reasoning/theory of mind.

kohlerm · a year ago

I agree this is not a great test. What's good about it is that it is a constraint satisfaction problem, and I would expect LLMs to be pretty bad at unknown problems of this kind. Simple reason, an LLM only has a a finite number of layers and it cannot do arbitrary long searches.

johnisgood · a year ago

I almost made ChatGPT write a Python program that creates a monthly work schedule (for imaginary workers) based on specific constraints (e.g. there are 10 workers, 2 shifts (morning and night), must work 40 hours per week, must have at least one weekend in a month off, 2 minimum workers per shift, no more than 3 consecutive working days, and so forth).

I am not sure if I could make it give me a working solution, however, and I have not tried Claude, for example, and I have not tried to do it with other programming languages. Maybe.

The issue was that it messed up the constraints and there were no feasible solutions, that said, it did give me a working program for this that had fewer constraints.

I don't understand what you're saying - the idea is that we're asking the LLM to generate code to perform the search, rather than run an arbitrarily long search on its own, right? So why should the number of layers it has matter?

rghall102 · a year ago

It is fascinating that the R solution just below the Python solution is much shorter and more readable. The same applies to Ruby and various Lisps.

It even applies to the VisualBasic solution!

pfisherman · a year ago

LLMs and NLP are to verbal reasoning what the calculator is to quantitative reasoning.

Language and by extension verbal reasoning is full of ambiguity and semantic slipperiness. For example, what degree of semantic similarity distinguishes synonymous from synonym-ish concepts? When do we partition concepts into homonyms?

I think part of the problem with how people evaluate LLMs is that the expectations that people have. Natural language != ontology. The expectation should be more Chomsky and less Boole. Asking it to solve math problems written in paragraph form is a waste of time. Use a calculator for that! Solving riddles? Code it up in prolog!

Instead you should be thinking of what operations you can do on concepts, meaning, and abstract ideas! That is what these things do.

missingrib · a year ago

Is this really verbal reasoning? It's just a logic problem.

How can one / should one combine the concepts of a dinosaur and monetary policy of the Ottoman Empire? What differentiates verbal reasoning from logic?

I don’t know that either of those can be solved well with formal languages or logic.

joe_the_user · a year ago