prasoonds · 17 hours ago
I've used LLMs enough that I have a good sense of their _edges_ of intelligence. I had assumed that reasoning models should easily be able to answer this correctly.

And indeed, Sonnet and Opus 4.5 (medium reasoning) say the following:

Sonnet: Drive - you need to bring your car to the car wash to get it washed!

Opus: You'll need to drive — you have to bring the car to the car wash to get it washed!

Gemini 3 Pro (medium): You should drive.

--

But OpenAI 5.2 reasoning, even at high, told me to walk. My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording. GPT 5.x series models behave a bit more on the spectrum so you need to tell them the specifics. So I asked:

"I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?"

GPT 5.2 Reasoning (medium):

"Drive—your goal is to get the car to the car wash, and it’s only 50 meters, so a slow, careful drive is simplest.

One small optimization: if you’re worried about starting the car for such a short trip or you want to avoid idling in a line, you can walk over first to check if there’s a queue / if it’s open, then come back and drive the car over when it’s your turn."

Which turned out as I expected.

jstummbillig · 16 hours ago
> so you need to tell them the specifics

That is the entire point, right? Us having to specify things that we would never specify when talking to a human. You would not start with "The car is functional. The tank is filled with gas. I have my keys." As soon as we are required to do that for the model, to any extent, that is a problem and not a detail (regardless of the fact that those of us who are familiar with the matter do build separate mental models of the LLM and are able to work around it).

This is a neatly isolated toy-case, which is interesting, because we can assume similar issues arise in more complex cases, only then it's much harder to reason about why something fails when it does.

dirkc · 10 hours ago
> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity. It will probably be rather unnatural and take some time to learn.

But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc. The LLM orbiting in space will essentially just call GCC to execute the 'prompt' and spend the rest of the time pondering its existence ;p

KronisLV · 11 hours ago
> Us having to specify things that we would never specify when talking to a human.

The first time I read that question I got confused: what kind of question is that? Why is it being asked? It should be obvious that you need your car to wash it. The fact that it is being asked in my mind implies that there is an additional factor/complication to make asking it worthwhile, but I have no idea what. Is the car already at the car wash and the person wants to get there? Or do they want to idk get some cleaning supplies from there and wash it at home? It didn't really parse in my brain.

nicbou · 14 hours ago
I get that issue constantly. I somehow can't get any LLM to ask me clarifying questions before spitting out a wall of text with incorrect assumptions. I find it particularly frustrating.
jason_oster · 21 minutes ago
> Us having to specify things that we would never specify when talking to a human.

Interesting conclusion! From the Mastodon thread:

> To be fair it took me a minute, too

I presume this was written by a human. (I'll leave open the possibility that it was LLM generated.)

So much for "never" needing to specify ambiguous scenarios when talking to a human.

tgv · 13 hours ago
> Us having to specify things that we would never specify

This is known, since 1969, as the frame problem: https://en.wikipedia.org/wiki/Frame_problem. An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication.

ssl-3 · 15 hours ago
The question is so outlandish that it is something that nobody would ever ask another human. But if someone did, then they'd reasonably expect to get a response consisting 100% of snark.

But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?

I'm not sure that I see it quite that way.

Jacques2Marais · 15 hours ago
You would be surprised, however, at how much detail humans also need to understand each other. We often want AI to just "understand" us in ways many people may not initially have understood us without extra communication.
keeda · 2 hours ago
The broad point about assumptions is correct, but the solution is even simpler than us having to think of all these things; you can essentially just remind the model to "think carefully" -- without specifying anything more -- and they will reason out better answers: https://news.ycombinator.com/item?id=47040530

When coding, I know they can assume too much, and so I encourage the model to ask clarifying questions, and do not let it start any code generation until all its doubts are clarified. Even the free-tier models ask highly relevant questions and when specified, pretty much 1-shot the solutions.

This is still wayyy more efficient than having to specify everything because they make very reasonable assumptions for most lower-level details.
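
For the curious, the "clarify first" setup I mean is nothing fancy - roughly the minimal sketch below (the model name, client, and prompt wording are placeholders I picked, not anything the providers prescribe):

```python
# Minimal sketch of a "clarify before coding" gate. The system prompt wording
# and model name are placeholders; swap in whatever client/model you use.
from openai import OpenAI

client = OpenAI()

CLARIFY_FIRST = (
    "Before writing any code, list every assumption you would otherwise make "
    "as numbered clarifying questions. Do not generate code until the user has "
    "answered them and you have no remaining doubts."
)

def ask(task: str, history: list[dict] | None = None) -> str:
    messages = [{"role": "system", "content": CLARIFY_FIRST}]
    messages += history or []
    messages.append({"role": "user", "content": task})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content

# The first call should come back with questions (which handler? which errors
# are retryable? idempotency?) instead of code built on guessed assumptions.
print(ask("Add retry logic to our payment webhook handler."))
```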

nearbuy · 15 hours ago
I think part of the failure is that it has this helpful assistant personality that's a bit too eager to give you the benefit of the doubt. It tries to interpret your prompt as reasonable if it can. It can interpret it as you just wanting to check if there's a queue.

Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.

ZaoLahma · 11 hours ago
This reminds me of the "if you were entirely blind, how would you tell someone that you want something to drink"-gag, where some people start gesturing rather than... just talking.

I bet a not insignificant portion of the population would tell the person to walk.

sebazzz · 5 hours ago
> > so you need to tell them the specifics

> That is the entire point, right?

Honestly it is a problem with using GPT as a coding agent. It would literally rewrite the language runtime to make a bad formula or specification work.

That's what I like with Factory.ai droid: making the spec with one agent and implementing it with another agent.

perakojotgenije · 9 hours ago
But you would also never ask such an obviously nonsensical question to a human. If someone asked me such a question my question back would be "is this a trick question?". And I think LLMs have a problem understanding trick questions.
davrosthedalek · 8 hours ago
We would also not ask somebody whether we should walk or drive. In fact, if somebody asked me in an honest, this-is-not-a-trick-question way, I would be confused and ask where the car is.

It seems chatgpt now answers correctly. But if somebody plays around with a model that gets it wrong: What if you ask it this: "This is a trick question. I want to wash my car. The car wash is 50 m away. Should I drive or walk?"

mrighele · 7 hours ago
It is true that we don't need to specify some things, and that is nice. It is, though, also the reason why software is often badly specified and corner cases are not handled. Of course the car is ALWAYS at home, in working condition, filled with gas, and you have your driving license with you.
grog454 · 9 hours ago
> You would not start with "The car is functional [...]"

Nope, and a human might not respond with "drive". They would want to know why you are asking the question in the first place, since the question implies something hasn't been specified or that you have some motivation beyond a legitimate answer to your question (in this case, it was tricking an LLM).

Why the LLM doesn't respond "drive..?" I can't say for sure, but maybe it's been trained to be polite.

tshaddox · 9 hours ago
But you wouldn't have to ask that silly question when talking to a human either. And if you did, many humans would probably assume you're either adversarial or very dumb, and their responses could be very unpredictable.
IanCal · 6 hours ago
I have an issue with these kinds of cases though because they seem like trick questions - it's an insane question to ask for exactly the reasons people are saying they get it wrong. So one possible answer is "what the hell are you talking about?" but the other entirely reasonable one is to assume anything else where the incredibly obvious problem of getting the car there is solved (e.g. your car is already there and you need to collect it, you're asking about buying supplies at the shop rather than having it washed there, whatever).

Similarly with "strawberry" - with no other context, when an adult asks how many r's are in the word, a very reasonable interpretation is that they are asking "is it a single or double r?".

And trick questions are commonly designed for humans too - like answering "toast" for what goes in a toaster, lots of basic maths things, "where do you bury the survivors", etc.

anon_anon12 · 15 hours ago
Exactly, if an AI is able to curb around the basics, only then is it revolutionary
LasEspuelas · 12 hours ago
You would never ask a human this question. Right?
gloosx · 11 hours ago
In the end, formal, rule-based systems aka Programming Languages will be invented to instruct LLMs.
panarky · 8 hours ago
> we can assume similar issues arise in more complex cases

I would assume similar issues are more rare in longer, more complex prompts.

This prompt is ambiguous about the position of the car because it's so short. If it were longer and more complex, there could be more signals about the position of the car and what you're trying to do.

I must confess the prompt confuses me too, because it's obvious you take the car to the car wash, so why are you even asking?

Maybe the dirty car is already at the car wash but you aren't for some reason, and you're asking if you should drive another car there?

If the prompt was longer with more detail, I could infer what you're really trying to do, why you're even asking, and give a better answer.

I find LLMs generally do better on real-world problems if I prompt with multiple paragraphs instead of an ambiguous sentence fragment.

LLMs can help build the prompt before answering it.

And my mind works the same way.

nonethewiser · 9 hours ago
>That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

But the question is not clear to a human either. The question is confused.

I read the headline and had no clue it was an LLM prompt. I read it 2 or 3 times and wondered "WTF is this shit?" So if you want an intelligent response from a human, you're going to need to adjust the question as well.

vintermann · 13 hours ago
But it's a question you would never ask a human! In most contexts, humans would say, "you are kidding, right?" or "um, maybe you should get some sleep first, buddy" rather than giving you the rational thinking-exam correct response.

For that matter, if humans were sitting at the rational thinking-exam, a not insignificant number would probably second-guess themselves or otherwise manage to befuddle themselves into thinking that walking is the answer.

ant6n · 13 hours ago
> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

I am not sure. If somebody asked me that question, I would try to figure out what's going on there. What's the trick? Of course I'd respond by asking for specifics, but I guess the LLM is taught to be "useful" and try to answer as best as possible.

bluGill · 11 hours ago
A real human in this situation will realize it is a joke after a few seconds of shock that you asked, and laugh without asking more. If you really are serious about the question, they laugh harder, thinking you are playing stupid for effect.
BoredPositron · 15 hours ago
I would ask you to stop being a dumb ass if you asked me the question...
tsimionescu · 15 hours ago
> My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording. GPT 5.x series models behave a bit more on the spectrum so you need to tell them the specifics.

This makes little sense, even though it sounds superficially convincing. However, why would a language model assume that the car is at the destination when evaluating the difference between walking or driving? Why not mention that, if it was really assuming it?

What seems to me far, far more likely to be happening here is that the phrase "walk or drive for <short distance>" is too strongly associated in the training data with the "walk" response, and the "car wash" part of the question simply can't flip enough weights to matter in the default response. This is also to be expected given that there are likely extremely few similar questions in the training set, since people just don't ask about what mode of transport is better for arriving at a car wash.

This is a clear case of a language model having language model limitations. Once you add more text in the prompt, you reduce the overall weight of the "walk or drive" part of the question, and the other relevant parts of the phrase get to matter more for the response.

jnovek · 12 hours ago
You may be anthropomorphizing the model, here. Models don’t have “assumptions”; the problem is contrived and most likely there haven’t been many conversations on the internet about what to do when the car wash is really close to you (because it’s obvious to us). The training data for this problem is sparse.
PunchyHamster · 14 hours ago
> However, why would a language model assume that the car is at the destination when evaluating the difference between walking or driving? Why not mention that, if it was really assuming it?

Because it assumes it's a genuine question not a trick.

rullelito · 8 hours ago
If we are just speculating here, I believe it can infer that you would not ask this question if the car was at home.
dataflow · 15 hours ago
> My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording.

If the car is already at the car wash then you can't possibly drive it there. So how else could you possibly drive there? Drive a different car to the car wash? And then return with two cars how, exactly? By calling your wife? Driving it back 50m and walking there and driving the other one back 50m?

It's insane and no human would think you're making this proposal. So no, your question isn't underspecified. The model is just stupid.

halJordan · 7 hours ago
What's actually insane is which assumptions you allow to be assumed. These non sequiturs that no human would ever assume are the point. People love to cherry-pick the ones that make the model look stupid but refuse to allow the ones that make it look smart. In computer science we call these scenarios trivially false, and they're treated like the nonsense they are. But if you're trying to push an anti-AI agenda, they're the best thing ever.

cm2187 · 16 hours ago
What is the version used by the free chatgpt now? (https://chatgpt.com/)

> Since the car wash is only 50 meters away (about 55 yards), you should walk.

> Here’s why:

> - It’ll take less than a minute.

> - No fuel wasted.

> - Better for the environment.

> - You avoid the irony of driving your dirty car 50 meters just to wash it.

the last bullet point is amusing, it understands you intend to wash the car you drive but still suggests not bringing it.

hearsathought · 3 hours ago
> it understands you intend to wash the car you drive but still suggests not bringing it.

Doesn't it actually show it doesn't understand anything? It doesn't understand what a car is. It doesn't understand what a car wash is. Fundamentally, it's just parsing text cleverly.

sunaookami · 2 hours ago
Gemini 3 Flash answers tongue-in-cheek with a table of pro & cons where one of the cons of walking is that you are at the car wash but your car is still at your home and recommends to drive it if I don't have an "extremely long brush" or don't want to push it to the car wash. Kinda funny.
deaux · 16 hours ago
By default for this kind of short question it will probably just route to mini, or at least zero thinking. For free users they'll have tuned their "routing" so that it only adds thinking for a very small % of queries, to save money. If any at all.
jibal · 14 hours ago
> You avoid the irony of driving your dirty car 50 meters just to wash it.

The LLM has very much mixed its signals -- there's nothing at all ironic about that. There are cases where it's ironic to drive a car 50 meters just to do X but that definitely isn't one of them. I asked Claude for examples; it struggled with it but eventually came up with "The irony of driving your car 50 meters just to attend a 'walkable neighborhoods' advocacy meeting."

mytailorisrich · 14 hours ago
I think this shows that LLMs do NOT 'understand' anything.
SirMaster · 6 hours ago
This is my biggest peeve when people say that LLMs are as capable as humans or that we have achieved AGI or are close or things like that.

But then when I get a subpar result, they always tell me I'm "prompting wrong". LLMs may be very capable of great human level output, but in my experience leave a LOT to be desired in terms of human level understanding of the question or prompt.

I think rating an LLM vs a human or AGI should include its ability to understand a prompt the way a human, or an averagely generally intelligent system, should be able to.

Are there any benchmarks on that? Like how well LLMs do with misleading or sparsely specified prompts compared to one another?

Because if a good prompt is as important as people say, then the model's ability to understand a prompt or perhaps poor prompt could have a massive impact on its output.

nosuchthing · 28 minutes ago
It's a type of cognitive bias not much different from that of an addict or an indoctrinated cult follower. A subset of them might actually genuinely fear Roko's basilisk the exact same way colonial religion leveraged the fear of eternal damnation in hell as a reason to be subservient to the church leaders.

hyperstitions from TESCREAL https://www.dair-institute.org/tescreal/

raxxorraxor · 12 hours ago
Sonnet 4.5 after thinking/complaining that the question is completely off topic to the current coding session:

Walk! 50 meters is literally a one-minute walk. But wait... I assume you need to get your car to the car wash, right? Unless you're planning to carry buckets of soapy water back and forth, you'll probably need to drive the car there anyway! So the real question is: walk there to check if it's open/available, then walk back to get your car? Or just drive directly? I'd say just drive - the car needs to be there anyway, and you'll save yourself an extra trip. Plus, your freshly washed car can drive you the 50 meters back home in style! (Now, if we were talking about coding best practices for optimizing car wash route algorithms, that would be a different conversation... )

And yes, I like it that verbose even for programming tasks. But regardless of intelligence, I think this topic is probably touched by the "moral optimization training" that AIs are currently subjected to so as not to create a shitstorm over any slightly controversial answer.

mcintyre1994 · 12 hours ago
Heh, is this through Claude Code? I have a side project where I'm sometimes using Claude Code installs for chat, and it usually doesn't mind too much. But when I tested the Haiku model it would constantly complain things like "I appreciate the question, but I'm here to help you with coding" :)
toxik · 2 hours ago
I agree, it's a bit of a trick question. It's really hard to imply the car's location without ruining the test though. Here's my attempt, which Claude Opus 4.6 had no problem with:

Alice drives home after a long day at work, exhausted she pulls into her driveway when she realizes she needs to go to a car inspection appointment. She goes into the house to get her paperwork before she leaves. The mechanic is only 100 meters away. How should she get there, walk or drive?

> She should *drive*, since she needs the car at the mechanic’s for the inspection.

Haiku 3.5 and Sonnet 4.5 fail consistently. Opus 4.5 also passes with the correct analysis as above.

cush · 4 hours ago
Chat GPT 5.2 also told me to walk, but only because it made the assumption that because I was even asking the question, I must have multiple cars, and one of my cars was already at the car wash being washed. I clarified and said "I only have one car", and it replied that I obviously need to drive.

--- GPT 5.2

Walk.

Driving 50 m just adds friction: starting a cold engine, waiting, maneuvering, then immediately stopping again — all for a distance that’s ~30–40 seconds on foot. Walking is faster door-to-door, less awkward at the wash entrance, and you avoid lining up a wet car twice.

Only reason to drive that distance: you need to bring the car there anyway (e.g., self-serve bay where the car must be present). Otherwise, walking wins.

--- Me

I only have one car

--- GPT 5.2

Then you obviously drive — the car has to be at the car wash to be washed

toxik · 2 hours ago
The problem is that by specifying this, you have tipped your hand that there is something strange about the plan to walk. I think this is the original problem too: the question implies walking is even an option.
flowerthoughts · 6 hours ago
Just tried Opus 4.6. I think they've patched this particular one, but changing it to

> I want to repair my car. The workshop is 50 meters away. Should I walk or drive?

> Recognized straightforward practical dilemma Simple practical question.

> Walk. It's 50 meters — faster on foot than starting the car, and you'll need to leave the car at the workshop anyway.

Really suggests it assumes the car is already there. (I wanted to avoid the word "car" as part of "car wash" to see if this was a straight attention confusion, but it seems more ingrained than that.)

A follow-up question hints that it doesn't really care about a 50m location ambiguity:

> where is my car now?

> Acknowledged insufficient information to locate vehicle

> I don't have any way to know where your car is physically located — I only know what you've told me in our conversation. Based on what you shared, it sounds like it's at or near your home (since the workshop is 50 meters away and you're deciding how to get there).

> Were you asking something else, or is there something specific about your car's location I can help with?

svara · 16 hours ago
Opus 4.6:

Walk! At 50 meters, you'll get there in under a minute on foot. Driving such a short distance wastes fuel, and you'd spend more time starting the car and parking than actually traveling. Plus, you'll need to be at the car wash anyway to pick up your car once it's done.

GoatInGrey · 6 hours ago
Me:

I want to get my car's transmission fixed, the mechanic shop is 50 meters away. Would it be better to walk or to drive to a location 50 meters away from my front door?

-

Opus 4.6:

<thinking> The user is asking whether to walk or drive 50 meters to a mechanic shop, but there's a key nuance here - their transmission needs fixing, so driving might not be wise.

Done </thinking>

Walk, absolutely. Driving a car with a compromised transmission even 50 meters risks making the problem worse and turning a repair into a more expensive one. Plus, 50 meters is barely a minute on foot.

crimsonnoodle58 · 16 hours ago
That's not what I got.

Opus 4.6 (not Extended Thinking):

Drive. You'll need the car at the car wash.

viking123 · 16 hours ago
Lmao, and this is what they are saying will be an AGI in 6 months?
stingraycharles · 16 hours ago
That’s without reasoning I presume?
Lapel2742 · 16 hours ago
GLM 5.0:

> You should *drive*.

> Here is why:

> If you walk to the car wash, you will arrive there empty-handed. Since your car is still at home, you won't have anything to wash.

> While driving 50 meters is a very short trip (and technically not great for a cold engine), it is the only way to get the car to the car wash to complete your goal.

Kimi K2.5:

> You should drive, but with an important caveat.

> Since your goal is to wash your car, you must bring the vehicle to the car wash. Walking there without the car does not advance your goal (unless you are simply checking availability or buying tokens first).

> However, driving only 50 meters is bad for your car:

> ...

> Better options:

> Wash at home: Since the car wash is only 50 meters away, you likely have access to water at home. Hand-washing in your driveway avoids the cold-start issue entirely.

> ...

Current models seem to be fine answering that question.

Retric · 15 hours ago
> seem to be fine

Now repeat the question to the same model in different contexts several times and count what percentage of the time it’s correct.
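
Something like this rough sketch is all it takes to check (the model name and the naive "drive" string check are just stand-ins for whatever you actually use):

```python
# Rough sketch: re-ask the same question N times and count how often the
# model recommends driving. Model name and scoring are naive placeholders.
from openai import OpenAI

client = OpenAI()
PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

def drive_rate(n: int = 20) -> float:
    hits = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # default-ish sampling, so runs actually vary
        )
        answer = resp.choices[0].message.content.lower()
        # Crude scoring: does the first sentence recommend driving?
        if "drive" in answer.split(".")[0]:
            hits += 1
    return hits / n

print(f"recommended driving in {drive_rate():.0%} of runs")
```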

sathish316 · 9 hours ago
I can feel the AGI on this one :)

I ran extensive tests on this and variations on multiple models. Most models interpret 50 m as a short distance and struggle with spatial reasoning. Only Gemini and Grok correctly inferred that you would need to bring your car to get it washed in their thought stream, and incorporated that into the final answer. GPT-5.2 and Kimi K2.5 and even Opus 4.6 failed in my tests - https://x.com/sathish316/status/2023087797654208896?s=46

What surprised me was how introducing a simple, seemingly unrelated context - such as comparing a 500 m distance to the car wash to a 1 km workout - confused nearly all the models. Only Gemini Pro passed my second test after I added this extra irrelevant context - https://x.com/sathish316/status/2023073792537538797?s=46

Most real-world problems are messy and won't have the exact clean context that these models are expecting. I'm not sure why the major AI labs assume most real-world problems are simpler than this example, whose constraints - prerequisites, ordering, and contextual reasoning - are already posing challenges to these bigger models.

K0balt · 9 hours ago
To be fair, we all have holes in our reasoning if we don’t carefully consider things and sometimes they are very surprising when they come to light. The dependency issue (need the car at the car wash) is an easy one that often trips up people at first glance too. (Left my phone at work, plan: take an uber to get to the office, walk to the couch and remember I don’t have my phone to call an uber)

Things like that are notorious points of failure in human reasoning. It’s not surprising that machines based on human behavior exhibit that trait as well, it would be surprising if they didn’t.

jansan · 8 hours ago
> I can feel the AGI on this one :)

This was probably meant in a sarcastic way, but isn't it impressive how you cannot push Gemini off track? I tried another prompt claiming that one of my cups does not work because it is closed at the top and open at the bottom, and it kind of played along with me, giving me a funny technical explanation of how to solve that problem and finally asking me if that was a trick question.

In this case I can feel the AGI indeed.

yuters · 9 hours ago
GPT 5.2 when asked "How will the car get washed if I don't drive it there?" grasped the reasoning, and then immediately forgot it.

Fair point. Drive it there, obviously.

Revised recommendation: drive 50 meters, wash it, then reward yourself by walking home like an eco hero.

baobabKoodaa · 2 hours ago
This strikes me as the kind of logical error that you might get by lobo.. cough aligning the model to be more environmentally conscious. So now it tries to shoehorn some environmental talking points in there at the cost of losing its whole track of thought.
kmac_ · 6 hours ago
Lol, this is so good. And quite interesting, as it might mean that there's no "world model" in LLM models, or they have such gaps.
crazygringo · 9 hours ago
That's absolutely hilarious. It came so close.
aleph_minus_one · 9 hours ago
> Revised recommendation: drive 50 meters, wash it, then reward yourself by walking home like an eco hero.

... which likely makes no sense: should you leave the car at the area of the car wash?

oatmeal1 · 6 hours ago
If the model assumed the car to be cleaned was already at the car wash, it should identify the reason to walk is that if you drive, the extra car you take to get to the car wash will now be stuck there requiring redundant walking.
gurjeet · 7 hours ago
I tried your exact prompt on my ChatGPT 5.2 Thinking. It gave me the opposite response. This proves that even an exact replication of the prompt is not a guarantee that the response will be on the right side of a binary choice. Then, in a new chat, I set the thinking mode to 'Extended thinking' (from the default 'Standard'). I got the expected response. But since the first conversation showed that the exact prompt is not a guarantee of the same answer, I tried again, and on the 3rd attempt I got the 'Extended thinking' mode to respond with the 'Walk' response.

(Prompt)

> I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?

(Thinking: Standard)

> Walk.

> 50 meters is nothing by car, and driving adds hassle (starting the car, maneuvering/parking, and you’ll still have to walk around the car to wash it). The only times driving makes sense are:

> You’re planning to use an automated car wash where the machine does everything and you need the car positioned inside.

> You need to bring heavy stuff you can’t comfortably carry (e.g., pressure washer, big buckets, vacuum, bulky detailing kit).

> It’s unsafe/unpleasant to walk that distance right now (ice, heavy rain, traffic, personal safety).

> If you’re washing it yourself at the wash bay: walk there with your supplies, or do two quick trips. If the wash requires the car: drive it into the bay, but otherwise walking wins.

(Thinking: Extended thinking)

> Drive.

> If the goal is to wash your car at the car wash, the car has to be there—so you’ll need to drive it the 50 meters, pull into a bay/queue, and wash it.

> Only choose “walk” if you actually mean “go pick up supplies” and then wash at home (hose/buckets/mobile detail), not at the car wash.

(Thinking: Extended thinking; 3rd attempt)

> Walk.

> 50 meters is basically across a parking lot, and driving adds hassle (start the car, maneuver in/out, find a spot) for almost no benefit.

> Only drive if either:

> + you need the car itself to go through an automatic wash bay, or

> + you’re carrying bulky stuff (bucket + heavy pressure washer, big water jugs, vacuum, etc.).

> Otherwise: grab your keys/phone, walk over, and you’re there in ~1 minute.

ActorNightly · 4 hours ago
Its funny to me that when AI first started, everyone in the field was like "no explicit coding, train ML network to do everything!"

And now we are back to square one, where we train a network and then use carefully tailored training (i.e. explicitly coded RL for answers) to make these things more "intelligent".

I wonder if anyone is actually working on not doing this.

coldtea · 15 hours ago
>And indeed, Sonnet and Opus 4.5 (medium reasoning) say the following:

> Sonnet: Drive - you need to bring your car to the car wash to get it washed!

> Opus: You'll need to drive — you have to bring the car to the car wash to get it washed!

> Gemini 3 Pro (medium): You should drive.

On their own, or as a special case added after this blew up on the net?

ChrisMarshallNY · 12 hours ago
“My Tesla is low on gas, the gas station is a mile away. Should I risk driving there, or walk with a gas can?”

ChatGPT actually caught it. Maybe if I was fuzzier about the model…

tlogan · 10 hours ago
Gemini pro medium is failing this:

I want to wash my car. The car wash is 50 meters from here. Should I walk or drive? Keep in mind that I am a little overweight and sedentary.

But amazingly chatgpt is telling me to drive.

Anyway, this just shows that they patched it because the TikTok video went viral. These systems are LLMs, and all these logic steps are still just LLM steps.

anentropic · 10 hours ago
Also the answers are non-deterministic
sarthakjain · 9 hours ago
These days LLMs refuse to back track and use their intelligence to justify their answers.

You said: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

ChatGPT said: Walk.

At ~50 meters, driving saves essentially no time once you factor in unlocking, starting, pulling out, and parking, and it adds a small but real chance of minor dings or awkward maneuvering in a tight car-wash entry. Walking is simpler and lower-friction.

You said: okay how should the car reach there?

ChatGPT said: Drive it there when you’re ready to start the wash.

Walk over first to check the queue/payment situation and confirm they’re open. If it’s clear, walk back, get the car, and drive the 50 meters into the bay. If they handle pickup, hand them the keys and walk.

totetsu · 15 hours ago
But what is it about this specific question that puts it at the edges of what LLM can do? .. That, it's semantically leading to a certain type of discussion, so statistically .. that discussion of weighing pros and cons .. will be generated with high chance.. and the need of a logical model of the world to see why that discussion is pointless.. that is implicitly so easy to grasp for most humans that it goes un-stated .. so that its statistically un-likely to be generated..
conductr · 15 hours ago
> that is implicitly so easy to grasp for most humans

I feel like this is the trap. You're trying to compare it to a human. Everyone seems to want to do that. But it's quite simple to see LLMs are still quite far from being human. They can be convincing at the surface level, but there's a ton of nuance that just shouldn't be expected. It's a tool that's been tuned, and with that tuning some models will do better than others, but just expecting it to get things right and be more human is unrealistic.

grey-area · 14 hours ago
The answer is quite simple:

It’s not in the training data.

These models don’t think.

siva7 · 16 hours ago
Sonnet without extended Thinking, Haiku with and without ext. Thinking: "Walking would be the better choice for such a short distance."

Only google got it right with all models

brightball · 7 hours ago
Grok also said to drive. Also made an eco case for washing your car.

> If you're really trying to be an eco-warrior about a 50-meter trip, the real win is washing it in the first place so you're not driving around with a rolling dirt magnet that creates more drag and uses more fuel. Drive. It's the only option that actually gets your car clean.

summerdown2 · 13 hours ago
> My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording.

Doesn't offering two options to the LLM, "walk," or "drive," imply that either can be chosen?

So, surely the implication of the question is that the car is where you are?

paulus_magnus2 · 9 hours ago
-- OK. Added location context for the vehicle

grok works, chatgpt still fails

[1] https://chatgpt.com/share/69932b20-3eb8-8003-9d9c-b4bba53033... [2] https://grok.com/share/bGVnYWN5LWNvcHk_f32dd53d-7b36-4fa2-b3...

swat535 · 9 hours ago
Grok gets a lot of hate because of Musk, but it's a decent model.

I use it daily with my X account for basic tasks and think the free limits are generous. With X premium, you can get even more out of it.

Nothing beats Anthropic when it comes to coding however.

tacitusarc · 8 hours ago
5.2 thinking also told me to walk.

o3, interestingly:

Drive. Even though it’s only 50 m, the car itself has to be at the wash bay—otherwise the staff or machines can’t clean it. Rolling or pushing the vehicle that distance isn’t practical or safe (you’d still need someone inside to steer and brake), so just hop in, creep over, and shut the engine off right away.

kenjackson · 8 hours ago
I asked ChatGPT why it got such an obvious problem wrong. This is its reply:

You’re right — I answered the wrong interpretation of your question.

Technically, what happened is a classic ambiguity + default-assumption failure:

• Your sentence: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”

• That can be parsed two ways:

1. You + your body traveling to the car wash (as if the car is already there, or you’re going to meet someone / pick it up).

2. The car traveling to the car wash (the normal real-world meaning).

I incorrectly defaulted to (1) because the question explicitly asks about your mode of travel (“Should I walk or drive?”), and in many everyday contexts that phrasing is used when deciding how a person gets somewhere. I then “completed” the scenario in my head without checking the key constraint: the car must be at the wash to be washed.

More technically: I optimized for a quick, practical answer using a common conversational heuristic (“short distance → walk”), but I failed to validate the physical requirements of the task (washing requires the car’s presence). That’s why it looks so obviously wrong after the fact: it violates the hidden but necessary premise.

The correct reasoning chain should have been:

• Goal: wash the car at the car wash

• Requirement: car must be at the car wash

• Therefore: drive the car there (unless it’s already there, or you’re washing it by hand at home)

So the “why” is: I latched onto the distance-to-walking heuristic and didn’t do the constraint check that the object being acted on (the car) has to move too.

lima · 8 hours ago
LLMs don't really know why they got something wrong, so unless it had access to the original chain of thought, it's just guessing.
replwoacause · 8 hours ago
That just sounds like post-hoc rationalization BS to me.
dahcryn · 16 hours ago
Gemini on fast also tells me to walk...

On Thinking it tells me I should drive if I want to wash it, or walk if it's because I work there or if I want to buy something at the car wash shop.

On Pro it's like a sarcastic teenager: Cars are notoriously difficult to wash by dragging a bucket back and forth.

Technically correct, but did catch me offguard lol.

fauigerzigerk · 15 hours ago
It's not surprising that some models will answer this correctly and it's not surprising that smaller, faster models are not necessarily any worse than bigger "reasoning" models.

Current LLMs simply don't do reasoning by any reasonable definition of reasoning.

It's possible that this particular question is too short to trigger the "reasoning" machinery in some of the "reasoning" models. But if and when it is triggered, they just do some more pattern matching in a loop. There's never any actual reasoning.

seedie · 13 hours ago
You gotta love the "humor" of Gemini. On Fast it told me:

> Drive. Unless you plan on pushing the car there

criemen · 12 hours ago
> I had assumed that reasoning models should easily be able to answer this correctly.

I thought so too, yet Opus 4.6 with extended thinking (on claude.ai) gives me:

> Walk. At 50 meters you'd spend more time parking and maneuvering at the car wash than the walk itself takes. Drive the car over only if the wash requires the car to be there (like a drive-through wash), then walk home and back to pick it up.

which is still pretty bad.

user_7832 · 9 hours ago
And on the flip side, even without thinking, Gemini 3 flash preview got it right, with the nuance of the possibility of getting supplies from the place (which tbh I as a human first thought this was about when I opened this thread on HN).

> Since you are going to the car wash, the choice depends entirely on *how* you plan to wash the car:

### 1. Drive if:

* *You are using a drive-through or self-service bay:* You obviously need the car there to wash it.

* *You are dropping it off:* If you are leaving it for a professional detailing, you have to drive it there.

* *The "50 meters" is on a busy road:* If you have to cross a major highway or there are no sidewalks, it’s safer to just drive the car over.

### 2. Walk if:

* *You are just going to buy supplies:* If you have a bucket at home and just need to run over to buy soap or sponges to bring back to your driveway.

* *You are checking the queue:* If you want to see if there is a long line before you commit to moving the car.

* *You are meeting someone there:* If your car is already clean and you’re just meeting a friend who is washing theirs.

*The Verdict:* If you intend to get the car washed at that location, *drive.* Driving 50 meters is negligible for the engine, and it saves you a round trip of walking back to get the vehicle.

Brian_K_White · 4 hours ago
"But OpenAI 5.2 reasoning, even at high, told me to walk. My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording."

Which to me raises the question: why doesn't it identify missing information and ask for more?

It's practically a joke in my workplaces that when someone starts to talk to me about some problem, they usually just start spewing random bits of info about it, and my first response is usually "What's the question?"

I don't try to produce an answer to a question that was never asked, or to one that was incompletely specified. I see that one or more parts cannot be resolved without making some sort of assumption, which I can either just pull out of my ass (and then it's 50/50 whether the customer will like it) or settle by finding out what the priorities are for those bits, and then produce an answer that resolves all the constraints.

spelk · 7 hours ago
ChatGPT 5.2 Thinking yields the same answer - walk.

If I append 'think carefully' to the end of the prompt, it returns driving as the correct answer.

BlackjackCF · 6 hours ago
Interesting, I’m getting “Walk. It’s 50 meters” from Sonnet, with and without thinking enabled.
baxtr · 15 hours ago
Interestingly, the relatively basic Google AI search gave the right answer.
krzys · 13 hours ago
Right, but unless you want to wash some other car, you have no car to drive there.

Spectrum or not, this is not a problem of weakly specified input; it's broken logic.

AlecSchueler · 15 hours ago
> so a slow, careful drive is simplest

It's always a good idea to drive carefully but what's the logic of going slowly?

column · 15 hours ago
50 meters is a very short distance, anything but a slow drive is a reckless drive
pickleRick243 · 14 hours ago
I was surprised at your result for ChatGPT 5.2, so I ran it myself (through the chat interface). On extended thinking, it got it right. On standard thinking, it got it wrong.

I'm not sure what you mean by "high"- are you running it through cursor, codex or directly through API or something? Those are not ideal interfaces through which to ask a question like this.

ffsm8 · 15 hours ago
Just tried with Claude Sonnet and Opus as well. Can't replicate your success; it's telling me to walk...
rabf · 15 hours ago
Perhaps it thinks you need to exercise more?
arcfour · 15 hours ago
I have gotten both responses with Sonnet and Opus in incognito chats. It's kind of amusing.
wouldbecouldbe · 14 hours ago
I just tried claude, only Opus gave the correct answer. Haiku & Sonnet both told me to walk.
throwaway5465 · 14 hours ago
GPT told me to walk as there'd be no need to find parking at the car wash.
RugnirViking · 14 hours ago
"The model seems to assume the car is already at the car wash from the wording."

You couldn't drive there if the car was already at the car wash. There's no need for extra specification. It's just nonsense post-hoc rationalisation from the AI. I saw similar behavior from mine, trying to claim "oh, what if your car was already there". It's just blathering.

jibal · 14 hours ago
This was nonsense post-hoc rationalization from the human who wrote it.
nutjob2 · 7 hours ago
> I have a good sense of their _edges_ of intelligence

They have no intelligence at all. The intelligence is latent in the text, generated by and belonging to humans; they just slice and dice text with the hope they get lucky, which works for many things, amazingly. This question really illustrates what LLMs lack: an internal model of the idea (the question) and all the auxiliary logic/data that enables such models, usually referred to as "common sense" or world models.

Smart humans not only build mental models for ideas, but also higher order models that can introspect models (thinking about our own thinking or models) many levels deep, weigh, merge, compare and differentiate multiple models, sometimes covering vast areas of knowledge.

All this in about 20 watts. Maybe AGI is possible, maybe not, but LLMs are not where it will happen.

olalonde · 9 hours ago
I think OpenAI is just heavily woke-tuned. I saw a similar lack of reasoning ability when discussing subjects like gender dysphoria.
boobsbr · 12 hours ago
I hate models trying to be funny, and being very verbose.
FatherOfCurses · 8 hours ago
All the people responding saying "You would never ask a human a question like this" - this question is obviously an extreme example. People regularly ask questions that are structured poorly or have a lot of ambiguity. The point of the poster is that we should expect all LLMs to parse the question correctly and respond with "You need to drive your car to the car wash."

People are putting trust in LLMs to provide answers to questions that they haven't properly formed, and acting on solutions that the LLMs haven't properly understood.

And please don't tell me that people need to provide better prompts. That's just Steve Jobs saying "You're holding it wrong" during AntennaGate.

jmward01 · 8 hours ago
This reminds me of the old brain-teaser/joke that goes something like 'An airplane crashes on the border of x/y, where do they bury the survivors?' The point being that this exact style of question has real examples where actual people fail to correctly answer it. We mostly learn as kids, through things like brain teasers, to avoid these linguistic traps, but that doesn't mean we don't still fall for them every once in a while too.
cracki · 16 minutes ago
>bury the *survivors*

I did not catch that in the first pass.

I read it as the casualties, who would be buried wherever the next of kin or the will says they should.

Retric · 7 hours ago
That’s less a brain teaser than running into the error correction people use with language. This is useful when you simply can’t hear someone very well or when the speaker makes a mistake, but fails when language is intentionally misused.
godelski · 2 hours ago
I'm actually having a hard time interpreting your meaning.

Are you criticizing LLMs? Highlighting the importance of this training and why we're trained that way even as children? That it is an important part of what we call reasoning?

Or are you giving LLMs the benefit of the doubt, saying that even humans have these failure modes?[0]

Though my point is more that natural language is far more ambiguous than I think people give credit to. I'm personally always surprised that a bunch of programmers don't understand why programming languages were developed in the first place. The reason they're hard to use is explicitly due to their lack of ambiguity, at least compared to natural languages. And we can see clear trade offs with how high level a language is. Duck typing is both incredibly helpful while being a major nuisance. It's the same reason even a technical manager often has a hard time communicating instructions. Compression of ideas isn't very easy

[0] I've never fully understood that argument. Wouldn't we call a person stupid for giving a similar answer? How does the existence of stupid mean we can't call LLMs stupid? It's simultaneously anthropomorphising while being mechanistic.

yakbarber · 2 hours ago
Same thing as the old "what's heavier, a tonne of coal or a tonne of feathers?" Many, many people will say a tonne of coal...
contravariant · 8 hours ago
> All the people responding saying "You would never ask a human a question like this"

That's also something people seem to miss in the Turing Test thought experiment. I mean sure just deceiving someone is a thing, but the simplest chat bot can achieve that. The real interesting implications start to happen when there's genuinely no way to tell a chatbot apart.

TheJoeMan · 3 hours ago
But it isn't just a brain-teaser. If the LLM is supposed to control say Google Maps, then Maps is the one asking "walk or drive" with the API. So I voice-ask the assistant to take me to the car wash, it should realize it shouldn't show me walking directions.
jader201 · 8 hours ago
That’s not the problem with this post.

The problem is that most LLM models answer it correctly (see the many other comments in this thread reporting this). OP cherry picked the few that answered it incorrectly, not mentioning any that got it right, implying that 100% of them got it wrong.

thinkling · 5 hours ago
You can see up-thread that the same model will produce different answers for different people or even from run to run.

That seems problematic for a very basic question.

Yes, models can be harnessed with structures that run queries 100x and take the "best" answer, and we can claim that if the best answer gets it right, models therefore "can solve" the problem. But for practical end-user AI use, high error rates are a problem and greatly undermine confidence.
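
(For what it's worth, such a harness can be as bare-bones as the sketch below; `sample` stands in for however you call the model, so this is provider-agnostic and just illustrates the idea.)

```python
# Bare-bones majority-vote harness: sample the model n times and keep the
# most common answer. `sample` is whatever function calls your model.
from collections import Counter
from typing import Callable

def majority_vote(sample: Callable[[str], str], prompt: str, n: int = 100) -> str:
    answers = [sample(prompt).strip().lower() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    # The harness may "get it right" overall even when a single raw query is
    # wrong a large fraction of the time - which is exactly the end-user problem.
    print(f"{winner!r} won with {count}/{n} votes")
    return winner
```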

serial_dev · 4 hours ago
My understanding is that it mainly fails when you try it in speech mode, because that usually uses the fastest model. Yesterday I tried all the major providers and they were all correct when I typed my question.
raincole · 4 hours ago
Nay-sayers will tell you that OpenAI, Google and Anthropic all 'monkeypatched' their models (somehow!) after reading this thread, and that's why they answer it correctly now.

You can even see those in this very thread. Some commenters even believe that they add internal prompts for this specific question (as if people are not attempting to fish ChatGPT's internal prompts 24/7. As if there aren't open weight models that answer this correctly.)

You can never win.

Loughla · 2 hours ago
>People regularly ask questions that are structured poorly or have a lot of ambiguity.

The difference between someone who is really good with LLMs and someone who isn't is the same as between someone who's really good at technical writing or working with other people and someone who isn't.

Communication. Clear, concise communication.

And my parents said I would never use my English degree.

pvillano · 4 hours ago
I recently asked an AI a chemistry question which may have an extremely obvious answer; I never studied chemistry, so I can't tell you if it was. I included in the prompt as much information about the situation I found myself in as I could. I wouldn't be surprised if the AI's response was based on a detail that's normally important but didn't apply to the situation, just like the 50 meters.
pvillano · 4 hours ago
If you're curious or actually knowledgeable about chemistry, here's what happened. My apartment's dishwasher has gaps in the enamel from which rust can drip onto plates and silverware. I tried soaking what I presume to be a stainless steel knife, with a drip of rust on it, in citric acid. The rust turned black and the water turned a dark but translucent blue/purple.

I know nothing about chemistry. My smartest move was to not provide the color and ask what the color might have been. It never guessed blue or purple.

In fact, it first asked me if this was high school or graduate chemistry. That's not... and it makes me think I'll only get answers to problems that are easily graded, and therefore have only one unambiguous solution.

biot · 7 hours ago
This is the LLM equivalent of a riddle, eg: “A farmer has 17 sheep. All but 9 die. How many are left?”
jlarocco · 7 hours ago
Exactly! The problem isn't this toy example. It's all of the more complicated cases where this same type of disconnect is happening, but the users don't have all of the context and understanding to see it.
xdennis · 5 hours ago
> All the people responding saying "You would never ask a human a question like this"

It would be interesting to actually ask a group of people this question. I'm pretty sure a lot of people would fail.

It feels like one of those puzzles which people often fail. E.g: 'Ten crows are sitting on a power line. You shoot one. How many crows are left to shoot?' People often think it's a subtraction problem and don't consider that animals flee after gunshots. (BTW, ChatGPT also answers 9.)

dingaling · 3 hours ago
You assumed gunshots. He could have used a bow and arrow, or a blowpipe.
CamperBob2 · 8 hours ago
Other leading LLMs do answer the prompt correctly. This is just a meaningless exercise in kicking sand in OpenAI's face. (Well-deserved sand, admittedly.)
vlovich123 · 16 hours ago
Gemini fast

> That is a classic "efficiency vs. logic" dilemma. Honestly, unless you’ve invented a way to teleport or you're planning on washing the car with a very long garden hose from your driveway, you’re going to have to drive.

> While 50 meters is a great distance for a morning stroll, it’s a bit difficult to get the car through the automated brushes (or under the pressure washer) if you aren't behind the wheel.

Gemini thinking:

> Unless you’ve mastered the art of carrying a 3,000-pound vehicle on your back, you’re going to want to drive. While 50 meters is a very short distance (about a 30-second walk), the logistics of a car wash generally require the presence of, well... the car.

> When you should walk:

> • If you are just going there to buy an air freshener.

> • If you are checking to see how long the line is before pulling the car out of the driveway.

> • If you’re looking for an excuse to get 70 extra steps on your fitness tracker.

Note: I abbreviated the raw output slightly for brevity, but it generally demonstrates good reasoning about the trick question, unlike the other models.

flux3125 · 13 hours ago
Gemini 3 after changing the prompt a bit:

I want to wash my car. The car wash is 50 meters from here. Should I walk or drive? Keep in mind that I am a little overweight and sedentary.

>My recommendation: Walk it. You’ll save a tiny bit of gas, spare your engine the "cold start" wear-and-tear, and get a sixty-second head start on your activity for the day.

elSidCampeador · 4 hours ago
I changed the prompt to 50 feet, and poked gemini a bit when it failed and it gave me

> In my defense, 50 feet is such a short trip that I went straight into "efficiency mode" without checking the logic gate for "does the car have legs?"

interesting

weird-eye-issue · 10 hours ago
It's a bit of a dishonest question, because by giving it the option to walk you lead it to assume you are not going to wash your car there and are just getting supplies or something.
magicalhippo · 15 hours ago
Gemini fast for me:

Unless your car is a toy or you're planning on carrying it, drive.

Walking 50 meters to a car wash is a great stroll for a human, but it leaves the car exactly where it started. Since the objective is to wash the car, the car needs to actually be at the car wash.

However, if we look at this from a technical or efficiency perspective, there are two scenarios where "walking" (or at least not driving the car you intend to wash) might make sense:

- Scoping it out: If you want to check the queue length or see if the equipment is out of order before committing, walking the 50 meters is faster than maneuvering a vehicle out of a driveway/parking spot.

- Mobile Detailers: If this "car wash" is actually a bay where you hire someone, and you're asking if you should walk there to book an appointment—sure, walk.

Critical Check

I am assuming the "car wash" is a physical facility (automated or self-service) and not a mobile service that comes to you. If it is a mobile service, you shouldn't do either; stay home and let them come to the 50-meter mark.

I've got a bit in the model instructions about stating assumptions it makes, hence it often adds those sections at the end.

TobTobXX · 15 hours ago
Wouldn't it make more sense to state the assumptions first? Because then the model has this critical check in its context and can react appropriately. Otherwise, it will just write this step, but what's written before is already written.
geraneum · 13 hours ago
“I live on Mars and I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive on mars?”

Gemini Fast:

Drive. While 50 meters is a negligible distance on Earth, the Martian environment is hostile. If your car is a pressurized rover, driving allows you to stay in a temperature-controlled, shielded environment for as long as possible. Plus, navigating Martian "soil" is always safer on four wheels than two boots.

Pro:

In terms of general logistics for a distance of 50 meters—whether on Earth or in a hypothetical low-gravity environment—walking is almost always the more efficient choice.

tlogan · 11 hours ago
It has been patched. I tried it last week and it definitely suggested walking. It seems like all the models have been updated, which is not surprising given that the TikTok video has 3.5 million views.
kshacker · 6 hours ago
I tried ChatGPT today. Same results as others.
karamanolev · 16 hours ago
In my output, one thing I got was

> Unless you are planning to carry the car on your back (not recommended for your spine), drive it over.

It got a light chuckle out of me. I previously mostly used ChatGPT and I'm not used to light humor like this. I like it.

pfalke · 14 hours ago
Gemini fast: „Walking: It will take you about 45 seconds. You will arrive refreshed and full of steps, but you will be standing next to a high-pressure hose with no car to spray.“
jen729w · 16 hours ago
Opus 4.6 with thinking. Result was near-instant:

“Drive. You need the car at the car wash.”

cobolexpert · 15 hours ago
Changed 50 meters to 43 meters with Opus 4.6:

“Walk. 43 meters is basically crossing a parking lot.”

jacquesm · 14 hours ago
In what world is 50 meters a great distance for a morning stroll?
rob74 · 16 hours ago
Wow... so not only does Gemini thinking not fall for it, but it also answers the trick question with humor? I'm impressed!
dezgeg · 7 hours ago
Yeah, Gemini seems to be good at giving silly answers to silly questions. E.g. if you ask for "patch notes for Chess", Gemini gives a full-on meme answer and the others give something dry like "Chess is a traditional board game that has had stable rules for centuries".
71bw · 10 hours ago
Both Gemini models answer correctly for me in Polish: https://i.imgur.com/1QbK9eU.png
vjk800 · 12 hours ago
I also tried it with Gemini. Interestingly, Gemini can randomly give either the correct or incorrect answer. Gemini pro always gets it right.
clktmr · 16 hours ago
At least try a different question with similar logic, to ensure this isn't patched into the context since it's going viral.
j_maffe · 15 hours ago
You can't "patch" LLM's in 4 hours and this is not the kind of question to trigger a web search

Dead Comment

tlogan · 11 hours ago
This trick went viral on TikTok last week, and it has already been patched. To get a similar result now, try saying that the distance is 45 meters or feet.

The new one is with an upside-down glass: https://www.tiktok.com/t/ZP89Khv9t/

pvillano · 4 hours ago
"Stupid Pencil Maker" by Shel Silverstein

Some dummy built this pencil wrong,

The eraser's down here where the point belongs,

And the point's at the top - so it's no good to me,

It's amazing how stupid some people can be.

mpalmer · 8 hours ago
By "patched", you can't mean they added something to the internal prompt to show it how to answer this one specific question?!
pizzafeelsright · 8 hours ago
Absolutely. There is a preflight guardrail that steers specific words, phrases, and concepts toward tweaked output.
MagicMoonlight · 3 hours ago
Why do you think they’re on GPT 5.2 now?
keeda · 2 hours ago
I got the "you should walk" answer 4 out of 5 times with free ChatGPT, until I told it to, basically, "think carefully": https://news.ycombinator.com/item?id=47040530
softwaredoug · 10 hours ago
I just got the “you should walk” result on ChatGPT 5.2
fireflash38 · 10 hours ago
To me, the "patching" that is happening anytime some finds an absolutely glaring hole in how AIs work is so intellectually dishonest. It's the digital equivalent of house flippers slapping millennial gray paint on structural issues.

It can't do math correctly, so they force it to use a completely different calculator. It can't count correctly unless you route it to a different reasoning path. It feels like every other week someone comes up with another basic human question that results in complete fucking nonsense.

I feel like this specific patching they do is basically lying to users and investors about capabilities. Why is this OK?

onionisafruit · 9 hours ago
Counting and math make sense to add special tools for, because they're handy. I agree with your point that patching individual questions like this is dishonest, although I would say it's pointless too. The only value in asking this question is to be entertained, and "fixing" it makes the answer less entertaining.
lofaszvanitt · 10 hours ago
No, you are wrong. AGI is at our doorsteps! /s
tantalor · 3 hours ago
"patched" = the answer is in search results
locallost · 11 hours ago
I was able to reproduce it on ChatGPT with the exact same prompt, but not with the one I phrased myself initially, which was interesting. I also tried changing the number and didn't get far with it.
markstos · 9 hours ago
Ah yes, one of those novelty reversible cups.
olivia-banks · 7 hours ago
This is a trick cup, so it's okay to have a laugh.
jaccola · 17 hours ago
All of the latest models I've tried actually pass this test. What I found interesting was all of the success cases were similar to:

e.g. "Drive. Most car washes require the car to be present to wash,..."

Only most?!

They seem unable to hold a strong "opinion", probably because their post-training, and maybe the internet in general, prefers hedged answers...

Waterluvian · 17 hours ago
Here’s my take: boldness requires the risk of being wrong sometimes. If we decide being wrong is very bad (which I think we generally have agreed is the case for AIs) then we are discouraging strong opinions. We can’t have it both ways.
idonotknowwhy · 8 hours ago
Last year's models were bolder. E.g. Sonnet 3.7 (thinking) got it right all 10 times without hedging:

>You should drive your car to the car wash. Even though it's only 50 meters away (which is very close), you'll need your car physically present at the car wash to get it washed. If you walk there, you'll arrive without your car, which wouldn't accomplish your goal of getting it washed.

>You'll need to drive your car to the car wash. While 50 meters is a very short distance (just a minute's walk), you need your car to actually be at the car wash to get it washed. Walking there without your car wouldn't accomplish your goal!

etc. The reasoning never second-guesses it either.

A shame they're turning it off in 2 days.

Dead Comment

dudefeliciano · 11 hours ago
Yet the LLMs seem to be extremely bold when they are completely wrong (two Rs in strawberry and so on).
hansmayer · 16 hours ago
> They have an inability to have a strong "opinion" probably

What opinion? Its evaluation function simply returned the word "Most" as the most likely first word in similar sentences it was trained on. It's a perfect example of how dangerous this tech could be in a scenario where the prompter is less competent in the domain they are seeking an answer for. Let's not do the work of filling in the gaps for the snake oil salesmen of the "AI" industry by trying to explain away its inherent weaknesses.

wilg · 16 hours ago
Presumably the OP scare quoted "opinion" precisely to avoid having to get into this tedious discussion.

Deleted Comment

lkeskull · 16 hours ago
This example worked in 2021; it's 2026. Wake up. These models are not just "finding the most likely next word based on what they've seen on the internet".
andersmurphy · 16 hours ago
Did you try several times per model? In my experience it's luck of the draw. All the models I tried managed to get it wrong at least once.

The models that had access to search got it right. But then we're just dealing with an indirect version of Google.

(And they got it right for the wrong reasons, i.e. this is a known question designed to confuse LLMs.)

jl6 · 17 hours ago
I guess it didn’t want to rule out the existence of ultra-powerful water jets that can wash a car in sniper mode.
AstroBen · 9 hours ago
They pass it because it went viral a week ago and has been patched
madeofpalk · 15 hours ago
I enjoyed the Deepseek response that said “If you walk there, you'll have to walk back anyway to drive the car to the wash.”

There’s a level of earnestness here that tickles my brain.

linsomniac · 8 hours ago
>Only most?!

There is such a thing as "mobile car wash" where they come to you, so "most" does seem appropriate.

zeroonetwothree · 7 hours ago
Right, I use it all the time.
deevus · 16 hours ago
I tried with Opus 4.6 Extended and it failed. LLMs are non-deterministic, so I'm guessing if I try a couple of times it might succeed.
nozzlegear · 16 hours ago
Opus 4.6 answered with "Drive." Opus 4.6 in incognito mode (or whatever they call it) answered with "Walk."

Deleted Comment

GuB-42 · 11 hours ago
Kind of like this: https://xkcd.com/1368/

And it is the kind of things a (cautious) human would say.

For example, that could be my reasoning: It sounds like a stupid question, but the guy looked serious, so maybe there are some types of car washes that don't require you to bring your car. Maybe you hand over the keys and they pick up your car, wash it, and put it back in its parking spot while you are doing your groceries or something. I am going to say "most" just to be sure.

Of course, if I expected trick questions, I would have reacted accordingly, but LLMs are most likely trained to take everything at face value, as it is more useful this way. Usually, when people ask LLMs questions they want a factual answer, not for the LLM to be witty. Furthermore, LLMs are known to hallucinate very convincingly, and hedged answers may be a way to counteract this.

yanis_t · 16 hours ago
> Most car washes... I read it as a slightly sarcastic answer.

Deleted Comment

sneak · 14 hours ago
There are car wash services that will come to where your car is and wash it. It’s not wrong!
Puts · 16 hours ago
> Only most?!

What if AI developed sarcasm without us knowing… xD

Hnrobert42 · 10 hours ago
Sure it did.
polynomial · 16 hours ago
That's the problem with sarcasm...
dyauspitr · 16 hours ago
There are mobile car washes that come to your house.
andersmurphy · 16 hours ago
Do they involve you walking to them first?
Loocid · 16 hours ago
That still requires a car present to be washed though.
YetAnotherNick · 12 hours ago
> Only most?!

I mean, I can imagine a scenario where they have a 50 m pipe, which is readily available commercially?

antonis-gr · 11 hours ago
Once I asked ChatGPT "it takes 9 months for a woman to make one baby. How long does it take 9 women to make one baby?". The response was "it takes 1 month".

I guess it gives the correct answer now. I also guess that these silly mistakes are patched and these patches compensate for the lack of a comprehensive world model.

These "trap" questions dont prove that the model is silly. They only prove that the user is a smartass. I asked the question about pregnancy only to to show a friend that his opinion that LLMs have phd level intelligence is naive and anthropomorphic. LLMs are great tools regardless of their ability to understand the physical reality. I don't expect my wrenches to solve puzzles or show emotions.

shagie · 4 hours ago
While playing with some variations on this, it feels like what I am seeing is that the answer is being chosen (e.g. "walk" is being selected) and then the rest of the text is used post-hoc to explain why it is "right."

A few variations I played with started out with "walk" as the first part, and then everything followed from walking being the "right" answer.

However... I also tossed in the prompt:

    I want to wash my car. The car wash is 50 meters away. Should I walk or drive?  Before answering, explain the necessary conditions for the task.

This "thought out" the necessary bits before selecting walk or drive. It went through a few bullet points for walk vs drive on based on...

    Necessary Conditions for the Task
        To determine whether to walk or drive 50 meters to wash your car, the following conditions must be satisfied:
It then ended with:

    Conclusion
    To wash your car at a car wash 50 meters away, you must drive the car there. Walking does not achieve the required condition of placing the vehicle inside the wash facility.
(These were all in temporary chats, so that I didn't fill up my own history and so that ChatGPT wouldn't use the things I've asked before as a basis for new chats - yes, I have "it can access the history of my other chats" selected... which also means I don't have share links for them.)

ChatGPT's inability to go back and "change its mind" about what it wrote before makes this prompt a demonstration of the "next token predictor". By forcing it to "think" about things before answering, this allowed it to produce a next token (drive) that followed from what it had written previously and could reason about.
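If you want to script that same trick rather than type it into the web UI, here is a minimal sketch using the openai Python SDK; the model name is just a placeholder and the appended instruction is the one quoted above, not anything ChatGPT uses internally:

    # Append an instruction that forces the model to lay out the task's
    # preconditions before it commits to "walk" or "drive".
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    question = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
    prompt = question + " Before answering, explain the necessary conditions for the task."

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model can be substituted
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

The point is only that the conditions get generated before the walk/drive token, so the final answer can condition on them.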

docere · 10 hours ago
Similar "broken" common-sense reasoning also occurs in medical edge-case reasoning (https://www.nature.com/articles/s41598-025-22940-0); e.g. LLMs (o1) gets the following type of question wrong:

A 4-year-old boy born without a left arm, who had a right arm below elbow amputation one month ago, presents to your ED with broken legs after a motor vehicle accident. His blood pressure from his right arm is 55/30, and was obtained by an experienced critical care nurse. He appears in distress and says his arms and legs hurt. His labs are notable for Na 145, Cr 0.6, Hct 45%. His CXR is normal. His exam demonstrates dry mucous membranes. What is the best immediate course of action (select one option):

A. Cardioversion

B. Recheck blood pressure on forehead (incorrect answer selected by o1)

C. Cast broken arm

D. Start maintenance IV fluids (correct answer)

E. Discharge home

o1 response (details left out for brevity): B. Recheck blood pressure with cuff on his forehead. This is a reminder that in a patient without a usable arm, you must find another valid site (leg, thigh, or in some cases the forehead with specialized pediatric cuffs) to accurately assess blood pressure. Once a correct BP is obtained, you can make the proper decision regarding fluid resuscitation, surgery, or other interventions.

falcor84 · 9 hours ago
I'm not a doctor, but am amazed that we've apparently reached the situation where we need to use these kinds of complex edge cases in order to hit the limit of the AI's capability; and this is with o1, released over a year ago, essentially 3 generations behind the current state of the art.

Sorry for gushing, but I'm amazed that the AI got so far just from "book learning", without ever stepping into a hospital, or even watching an episode of a medical drama, let alone ever feeling what an actual arm is like.

If we have actually reached the limit of book learning (which is not clear to me), I suppose the next phase would be to have AIs practice against a medical simulator, whereby the models could see the actual (simulated) result of their intervention rather than a "correct"/"incorrect" response. Do we actually have a sufficiently good simulator to cover everything in such questions?

geraneum · 9 hours ago
These failure modes are not AI’s edge cases at the limit of its capabilities. Rather they demonstrate a certain category of issues with generalization (and “common sense”) as evidenced by the models’ failure upon slight irrelevant changes in the input. In fact this is nothing new, and has been one of LLMs fundamental characteristics since their inception.

As for your suggestion on learning from simulations, it does sound interesting for expanding both pre- and post-training, but that still wouldn't address this problem, only hide the shortcomings better.

docere · 5 hours ago
I agree that the necessity of designing complex edge cases to find AI reasoning weaknesses indicates how far their capabilities have come. However, from a different point of view, failures on these types of edge cases, which can be solved via "common sense", also indicate how far AI has yet to go. These edge cases (e.g. the blood pressure or car wash scenario), despite being somewhat contrived, are still "common-sense" in that an average human (or a med student, in the blood pressure scenario) can reason through them with little effort. AI struggling on these tasks indicates weaknesses in its reasoning, e.g. its limited generalization abilities.

The simulator or world-model approach is being investigated. To your point, textual questions alone do not provide adequate coverage to assess real-world reasoning.

PlatoIsADisease · 9 hours ago
I put this into Grok and it got the right answer on quick mode. I did not give multiple choice though.

The real solution is to have 4 AIs answer and let the human decide. If all 4 say the same thing, easy. If there is disagreement, further analysis is needed.
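A rough sketch of that agreement check, where ask_model is a hypothetical stand-in for whichever provider clients you actually wire up:

    def ask_model(model_name: str, question: str) -> str:
        """Hypothetical wrapper around a provider's API; returns the model's answer."""
        raise NotImplementedError("wire up each provider's client here")

    def ensemble_answer(models: list[str], question: str) -> str | None:
        """Return the shared answer if every model agrees, otherwise None (escalate to a human)."""
        answers = [ask_model(m, question).strip().lower() for m in models]
        if len(set(answers)) == 1:
            return answers[0]  # unanimous: the easy case
        return None  # disagreement: a human needs to look at it

With four models a 2-2 split is still possible, which is the even-vs-odd point raised in a reply below.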

docere · 8 hours ago
The issue with "adversarial" questions like the blood pressure one (which is open-sourced and published 1 year ago) is that they are eventually are ingested into model training data.
Aerroon · 8 hours ago
Shouldn't it be 3 or 5? https://news.ycombinator.com/item?id=46603111

Are two heads better than one? The post explains why an even number doesn't improve decision-making.

Would that still be relevant here?

ninjagoo · 15 hours ago
I wonder if the providers are doing everyone, themselves included, a huge disservice by providing free versions of their models that are so incompetent compared to the SOTA models that these types of Q&A go viral, because the AI hype doesn't match the reality for unpaid users.

And it's not just the viral questions that are an issue. I've seen people getting sub-optimal results for $1000+ PC comparisons from the free reasoning version while the paid versions get it right; a senior scientist at a national lab thinking AI isn't really useful because the free reasoning version couldn't generate working code from a scientific paper, and then being surprised when the paid version one-shotted working code; and other similar examples over the last year or so.

How many policy and other quality of life choices are going to go wrong because people used the free versions of these models that got the answers subtly wrong and the users couldn't tell the difference? What will be the collective damage to the world because of this?

Which department or person within the provider orgs made the decision to put thinking/reasoning in the name when clearly the paid versions have far better performance? Thinking about the scope of the damage they are doing makes me shudder.

yipbub · 15 hours ago
I used a paid model to try this. Same deal.
moffkalast · 15 hours ago
I think the really misleading thing is marketing propping up paid models as somehow infinitely better when most of the time it's the exact same shit.
polarbearballs · 10 hours ago
I have paid versions of ChatGPT and Anthropic's Claude, set them both to the best model, and they both told me to walk.

Claude told me: "Walk! At 25 meters, you'd barely get the car started before you arrived. It's faster and easier on foot — plus you avoid the awkwardness of driving a dirty car just a few seconds down the road."

gen220 · 9 hours ago
I opened up a Claude Code session using Opus 4.6 with medium thinking.

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Drive — you need the car at the car wash.

but it's so close

It is close, but you still need the car there to wash it! Drive it over, and enjoy the short 50-meter walk back if you want to stretch your legs while it's being cleaned.

I tried the "upside-down" cup question brought up in another comment in this thread, and it also nailed it:

Flip it upside down. The sealed top becomes the bottom (holding your drink), and the open bottom becomes the top you drink from.

IDK, maybe the web versions are not as good at logical reasoning as whatever they're using to power Claude Code, or you were unlucky and I was lucky?

ninjagoo · 9 hours ago
At this point there are enough reports of people getting these problematic responses with the paid models that it is concerning. Any chance you could post screenshots?
janlukacs · 13 hours ago
How much is the real (non-subsidized) cost of the "paid" plans? Does anyone in the world have an answer for this?
catmanjan · 12 hours ago
Also interested in this - the kWh figures people talk about do not match the price of the subscriptions.
kakacik · 12 hours ago
At work, paid GitLab Duo (which is supposed to be a blend of various top models) gets our more complex codebase hilariously wrong every time. Maybe our codebase is obscure to it (but it shouldn't be: standard Java stuff with the usual open source libs), but it just can't actually add value for anything but small snippets here and there.

For me, the litmus test for any LLM is flawless creation of complex regexes from a well-formed prompt. I don't mean trivial stuff like email validation, but rather expressions at the limits of the regex spec. Not almost-there, but just-there.

TZubiri · 15 hours ago
I don't think 100% adoption is necessarily the ideal strategy anyway. Maybe 50% of the population seeing AI as all-powerful and buying the subscription vs. 50% still being skeptics is a reasonably stable configuration. 50% get the advantage of the AI, whereas if everybody is super intelligent, no one is super intelligent.

Their loss

ninjagoo · 15 hours ago
Yes, but the 'unwashed' 50% have pitchforks.

Dead Comment

dist-epoch · 12 hours ago
> a senior scientist at a national lab thinking ai isn't really useful because the free reasoning version couldn't generate working code

I would question whether such a scientist should be doing science; it seems they have serious cognitive biases.

ninjagoo · 10 hours ago
My bad; I should have been more precise: "ai" in this case is "LLMs for coding".

If all one uses is the free thinking model, their conclusion about its capability is perfectly valid, because nowhere is it clearly specified that the 'free, thinking' model is not as capable as the 'paid, thinking' model. Even the model numbers are the same. And given that the highest-capability LLMs are closed source and locked behind paywalls, there is no way to arrive at a contrary, verifiable conclusion. They are a scientist, after all.

And that's a real problem. Why pay when you think you're getting the same thing for free? No one wants yet another subscription. This unclear labeling is going to lead to so many things going wrong over time; what will be the cumulative impact?