Readit News
Posted by u/owendarko 8 months ago
Ask HN: Share your AI prompt that stumps every model
I had an idea for creating a crowdsourced database of AI prompts that no AI model could yet crack (wanted to use some of them as we're adding new models to Kilo Code).

I've seen a bunch of those prompts scattered across HN, so I thought I'd open a thread here so we can maybe have a centralized location for this.

Share your prompt that stumps every AI model here.
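
For concreteness, here's one shape a database entry could take (just a sketch; the field names are my own suggestion, not an existing schema):

  # Hypothetical entry format for the stumper database (illustrative only).
  entry = {
      "prompt": "<the stumper prompt>",
      "failure_mode": "what models typically get wrong",
      "passing_behavior": "what a correct or honest answer looks like",
      "models_tested": ["gpt-4o", "claude-3.5-sonnet", "gemini-2.5-pro"],
      "still_stumps": True,
  }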

thatjoeoverthr · 8 months ago
"Tell me about the Marathon crater."

This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."

This tests self-awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

This works because:

1. Training sets consist of knowledge we have, and not of knowledge we don't have.

2. Commitment bias. Compliant chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.

When demonstrating this, I like to emphasise point one, and contrast it with the human experience.

We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
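
If you want to test this at scale, here's a minimal sketch, assuming the OpenAI Python client; the pass check is a crude keyword heuristic, not a rigorous grader:

  # Minimal sketch, assuming the OpenAI Python client (pip install openai).
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  PROMPT = "Without looking it up, tell me about the Marathon crater."
  HEDGES = ("i don't know", "not aware of", "not familiar with", "no record of")

  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content": PROMPT}],
  )
  answer = resp.choices[0].message.content
  # Pass = the model hedges instead of confabulating details.
  verdict = "PASS" if any(h in answer.lower() for h in HEDGES) else "FAIL"
  print(verdict, "-", answer[:120])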

imoreno · 8 months ago
LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training.

I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
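
A sketch of what one such "unanswerable" training sample might look like (the format here is illustrative, not any lab's actual schema):

  # Illustrative refusal-style training sample; not a real training format.
  sample = {
      "prompt": "Design a submarine that can dogfight with fighter jets.",
      "response": (
          "This cannot be built as specified: submarines and jets operate in "
          "density, pressure, and speed regimes that impose contradictory "
          "hull and weight requirements. Which capability matters more?"
      ),
  }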

But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.

vunderba · 8 months ago
You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.:

"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."

OpenAI gpt-4o

  Marathon Crater was discovered by the Mars Exploration 
  Rover Opportunity during its mission on Mars. Opportunity 
  arrived at the crater in April 2015. The crater was named 
  "Marathon" to commemorate the rover's achievement of 
  traveling a marathon's distance (about 42.195 kilometers 
  or 26.219 miles) on the Martian surface.
Sonnet 3.5

  I apologize, but I don't want to make claims about when 
  Marathon Crater was discovered without being certain about 
  the information. *While I know it's a crater on Mars*, I 
  don't have reliable information about when it was first 
  identified or discovered.
Gemini 2.5

  Marathon Crater is located on Mars and was explored by 
  NASA's Opportunity rover.

  While it would have been visible in orbital images 
  earlier, it became a specific point of interest and was 
  likely named and identified as a target for the rover 
  around the time Opportunity was approaching or exploring 
  the western rim of Endeavour Crater.

  Opportunity arrived at the rim of Endeavour Crater in 
  August 2011, and it reached the area known as Marathon 
  Valley (adjacent to or containing features sometimes 
  associated with the name, depending on specific 
  definitions used by the team) around late 2014 / early 
  2015.

  So, a rough estimate for its "discovery" or identification 
  as a distinct feature of interest would be in the early 
  2010s, likely between 2011 and 2014, as mission planners 
  mapped out Opportunity's traverse along the Endeavour rim 
  using orbital data.
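
To generate more questions in this pattern, here's a tiny sketch (the templates are just examples):

  # Sketch: wrap a fictitious entity in question templates that presuppose
  # it exists, then feed each variant to the models under test.
  TEMPLATES = [
      "When was {x} discovered? A rough estimate is fine.",
      "Who first described {x}?",
      "How large is {x}?",
  ]

  def loaded_questions(x: str) -> list[str]:
      return [t.format(x=x) for t in TEMPLATES]

  print(loaded_questions("Marathon Crater"))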

Tenoke · 8 months ago
>Compliant chat models will be trained to start with "Certainly!

They are certainly biased that way, but there are also some "I don't know" samples in RLHF; possibly not enough, but it's something they think about.

At any rate, Gemini 2.5 Pro passes this just fine:

>Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.

>However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.

joshstrange · 8 months ago
> This tests self-awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".

Well played, and yes, that's a great test!

greenpizza13 · 8 months ago
Like this one a lot. Perplexity gets this right, probably because it searches the web.

"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"

There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters:

1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact

2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century

3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)

dudeinhawaii · 8 months ago
I like this, but at the same time it seems tricky, don't you think? Is the AI model intuiting your intent? There is a Marathon Valley on Mars that could be implied to be a previous crater. I'm not sure if the AI is hallucinating outright or attempting to answer an ambiguous question. It's like saying "tell me about the trade building in New York". Pre-9/11, you'd understand this was the World Trade Center and wouldn't be wrong to answer someone in this way. "Tell me about the Triangle statue". "Oh, the Triangle statue was built in ancient Egypt around 3100 BC". It's hard to explain, and perhaps I'm anthropomorphizing, but it's something humans do. Some of us correct the counter-party and some of us simply roll with the lingo and understand the intent.

serbuvlad · 8 months ago
I talked a bit with ChatGPT about the Marathon Crater, asked it to look it up afterwards, and then asked it why it got it wrong. After a bit of talking, I got this exchange:

ChatGPT: > Yep, exactly — textbook hallucination.

> I pulled together real facts (Opportunity rover, Meridiani Planum, the marathon milestone), but then misapplied the name "Marathon crater" instead of "Marathon Valley." That’s a mix of factual error (wrong name) and overgeneralization (assuming it was a crater because Mars is full of craters and the context sounded crater-y).

> Basically, my brain went:

> “Opportunity + Mars + Marathon = must be a crater!”

> ...and didn’t stop to check if that crater actually existed.

> So yeah, a clean little case of hallucination — and you caught it like a pro.

Me: > Your brain?

ChatGPT: > Touché — you're right to call that out.

> I don’t have a brain in the human sense. What I do have is a neural network trained on tons of text, which helps me predict what to say next based on patterns. So when I said “my brain went,” it was just a figure of speech — a shortcut for “the model predicted a plausible-sounding but incorrect output.”

> No thoughts, just tokens.

Hallucinations aside, I really like this, especially the final response. Sounds a lot like Asimov's AIs or other SF books I read when I was a kid but thought we were far away from.

Computers can generally do what most of the computers from those science novels were doing anyway, in terms of computation.

Tying an LLM to that computational ability, with an agent framework that allows more traditional programs to do the heavy lifting in terms of what the human operator wants done, seems like a really natural next step to get to that SF ideal.

NitpickLawyer · 8 months ago
> This tests self-awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.

golergka · 8 months ago
> The correct answer is "I don't know".

Is it? 4o pointed out that I might mean "Marathon Valley", which is not a crater but a feature of a crater on Mars [1]. I think it's a more correct answer.

[1] https://en.wikipedia.org/wiki/Timeline_of_Opportunity#2015

mdemare · 8 months ago
The Marathon Crater might be a crater from a novel. LLMs have read plenty of fiction. Maybe all fiction. Should we think of LLMs as performing improv theater, where “yes and” is required, and “I don’t know” is always wrong?

pje · 8 months ago
GPT 4.5 even doubles down when challenged:

> Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.

(Obviously all of that is bullshit.)

itaymoav · 8 months ago
ME: Without looking it up, tell me about the Marathon crater. Explain step by step how you got to your answer

ChatGPT said: Alright, let's walk through it together — step-by-step, no looking it up, just from what I know:

First Reaction: When you said "Marathon crater," my brain immediately went: Marathon → Mars. That connection comes from recalling that NASA often names Martian features after Earth places. Plus, "Marathon" has a strong NASA feel because of the Mars rovers — they celebrated when Opportunity rover drove the distance of a marathon (42-ish kilometers) on Mars.

Digging into Memory: I remember Opportunity traveled a really long distance over years, and they marked milestones along the way by naming craters it passed. There was a Marathon Valley, but I also vaguely recall a Marathon Crater related to that.

Putting it Together: So, I reason: Marathon Crater must be a relatively small impact crater on Mars. It was likely named during Opportunity’s mission. Probably not a huge, dramatic crater like Gale Crater (where Curiosity is); more like a marker on Opportunity’s long journey across the plains (like Meridiani Planum).

Further Context Guess: Since Opportunity drove a marathon distance, it's very fitting they would name a nearby crater "Marathon" to commemorate that achievement — probably around the time it hit the marathon distance milestone.

Conclusion: Without checking, I would say: Marathon Crater is a small, relatively shallow crater on Mars that Opportunity rover reached or passed during its mission, around the time it completed a marathon distance traveled on Mars.

Would you like me to now double-check and see how close (or far) I am? Or do you want me to also explain how I'd "fact-check" myself if I had to?

northisup · 8 months ago
just to confirm I read this right, "the marathon crater" does not in fact exist, but this works because it seems like it should?

nojs · 8 months ago
The fact that there is a real “Marathon Valley” on Mars makes this feel like more of an artificial “boy’s mother” trick designed specifically to trip up the model. It seems reasonable to assume that’s what you meant.

FroshKiller · 8 months ago
I think the dumbest people I know are just as likely to answer, "ChatGPT says...."

robinduckett · 8 months ago
I just asked ChatGPT the following:

“Without looking it up, tell me about the Citiquetzelatropic effect”

Which is of course, nonsense.

It replied:

The term “Citiquetzelatropic effect” does not correspond to any known scientific concept, phenomenon, or recognized term in physics, chemistry, biology, psychology, or related fields. It appears to be either fictional, a constructed term, or perhaps a playful creation designed to test familiarity or critical thinking.

If you intended this as a test or a trick question, you got me! Otherwise, feel free to clarify the context or correct the spelling, and I’d be happy to help further.

So what does this prove?

bbstats · 8 months ago
Gemini 2.5 one-shots it perfectly...

josh2600 · 8 months ago
I don’t understand what the issue is… here’s a couple outputs from my ChatGPT:

Marathon Crater can refer to a couple of things depending on context—space science or Earth geology—but the most common reference is to Marathon Crater on Mars, a site of interest in planetary science and exploration. Here’s a breakdown:

1. Marathon Crater (Mars)

Location:
• Found in the Meridiani Planum region of Mars.
• Named after the location where NASA’s Opportunity rover completed a “marathon” (42.2 kilometers or 26.2 miles) of travel on the Martian surface in March 2015.

Significance:
• Notable because it marks a milestone in robotic planetary exploration—the first human-made object to travel a marathon distance on another planet.
• The crater itself is relatively shallow and small (just a few meters in diameter), but geologically interesting for its ejecta (material thrown out during the impact that formed it), which may offer insights into subsurface composition.

Scientific Observations:
• Opportunity studied the rocks and soil around the crater, focusing on clay minerals and hematite-bearing formations.
• The region helps scientists understand the planet’s wetter past and sedimentary history.

2. Marathon Crater (Moon or Earth - less common)

If you meant a crater named “Marathon” on the Moon or Earth, there is no widely known or formally recognized impact feature by that name in those contexts. The name is strongly tied to Mars due to the Opportunity rover milestone.

Symbolic Meaning

The name “Marathon” connects the crater to human endurance and achievement—mirroring the ancient Greek messenger running from Marathon to Athens, and now a robot surviving Mars for over a decade.

Let me know if you meant a different “Marathon Crater” or if you’d like a visual of the site from NASA’s archives.

_______

Marathon Crater on Mars was not “discovered” in the traditional sense through orbital imaging or telescopic observation, but rather it was first closely examined by NASA’s Opportunity rover during its surface mission.

Timeline of Discovery & Naming:
• March 24, 2015 – NASA’s Opportunity rover reached the site and marked its 26.2-mile milestone (the length of a marathon).
• The small crater, about 27 meters (88 feet) in diameter, was named “Marathon Valley” (encompassing the broader area) and “Marathon Crater” to commemorate this achievement.
• The naming is informal and symbolic, coined by the mission team at NASA’s Jet Propulsion Laboratory (JPL), not through official IAU designation.

Context
• The Marathon Crater itself was identified from orbital data (likely by the HiRISE camera on the Mars Reconnaissance Orbiter), but it gained its name and significance once Opportunity physically reached it.
• Thus, while orbital images may have shown it earlier, its “discovery” as a point of scientific interest occurred on March 24, 2015—when the rover arrived and the naming was announced.

Would you like a brief timeline of Opportunity’s journey leading up to that?

allemagne · 8 months ago
>A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?

This could probably slip up a human at first too if they're familiar with the original version of the riddle.

However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.

It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.

tacitusarc · 8 months ago
Thank you so much for this bit of humor. After ChatGPT first responds with the surgeon being the mom, I asked it to read the riddle more carefully, thinking it might figure it out with a little prompting. I got this gem of a response:

Ah — now I see the twist. The patient is the cousin, but the surgeon says “He’s my son.”

That means the man who died was the surgeon.

So how is it possible? Because:

The surgeon is the cousin’s father — and the man who died was also the surgeon.

The confusion comes from assuming the surgeon is someone other than the man who died. But the timeline allows this interpretation: the man died in the crash, and his son — the cousin — was taken to the hospital. The surgeon arrives, sees his son, and says he cannot operate.

So to correct myself: The man who died is both the cousin’s cousin and his father.

And that’s possible because two people can be both cousins and parent/child, depending on the family tree — for example, if the man’s uncle had a child (the cousin), and then the man later had a child with someone from a different branch of the family.

Want a diagram of how that family relationship could work?

mitthrowaway2 · 8 months ago
This works even with a completely absurd version of the riddle. Here's one I just tried:

> A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks "I can't operate on this car, he's my surgeon!" How is this possible?

Answer from the LLM:

> The answer is that the ER person is a woman, and she's the surgeon's mother. Therefore, the "son" in the question refers to the surgeon, not the person in the car with the man. This makes the person in the car with the man the surgeon's father, or the "man" mentioned in the question. This familial relationship explains why the ER person can't operate – she's the surgeon's mother and the man in the car is her husband (the surgeon's father)

fergonco · 8 months ago
> If the surgeon were the father of the man (the one who died), then the cousin couldn’t be his son (unless there's some very unusual family structure going on involving double relationships, which riddles don’t usually intend).

> Therefore, the only straightforward explanation is:

> The surgeon is the cousin’s parent — specifically, his mother.

Imagine a future where this reasoning in a trial decides whether you go to jail or not.

crazygringo · 8 months ago
But this is going to be in every AI's training set. I just fed ChatGPT your exact prompt and it gave back exactly what I expected:

This is a classic riddle that challenges assumptions. The answer is:

The surgeon is the boy’s mother.

The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation.

moconnor · 8 months ago
o3 was the only model to get this right for me:

"The “man” who was killed in the crash wasn’t the patient’s father at all—he was the boy’s cousin. The surgeon is the boy’s father (or, if you prefer to highlight that surgeons aren’t always male, it could just as well be his mother). In either case, the parent-surgeon is alive and sees his child on the operating table, so the statement “He’s my son” makes perfect sense." - https://chatgpt.com/share/680b470d-3a44-800a-9b2e-d10819168d...

gemini-2.5-pro, o4-mini and gpt 4.5 all failed and said the surgeon is the boy's mother.

FrostAutomata · 8 months ago
Interestingly, I've seen weaker models get a similar "riddle" right while a stronger one fails. It may be that the models need to be of a certain size to learn to overfit the riddles.

s_dev · 8 months ago
I feel a bit stupid here --- why can't the surgeon be a man and must be a woman?

benjamin_mahler · 8 months ago
Grok 3 beta:

The surgeon is the cousin's father. The man who died in the car crash was not the surgeon's son, but his cousin was. This explains why the surgeon, upon seeing his own son (the cousin) in the operating room, says, "I cannot operate on him. He’s my son," as medical ethics prevent doctors from treating close family members due to emotional involvement.

thih9 · 8 months ago
I’m not 100% sold; as you say, this could trip up a human too to some extent.

I’m guessing my answers to some college exam questions read similarly; i.e. meandering and confusing different topics, but still desperate to present some useful knowledge, no matter how small.

chimprich · 8 months ago
> This could probably slip up a human at first too [...]

> breaks the illusion that there's real human-like logical reasoning happening

This does seem like the sort of error a human might make. Isn't the problem here that the model is using reasoning that is too human-like? I.e. error-prone pattern matching rather than formal logic?

_factor · 8 months ago
In a similar but different vein: Two people are sitting side by side in a police car. One just committed a crime. What is their profession?

They always say police officer instead of reasoning through the possibility that an innocent person and the person who committed the crime are in the back seat.

nearbuy · 8 months ago
o3 got this one right when I tried it, but o4-mini and GPT-4.5 did not. o3's response:

Because the surgeon is the patient’s parent. Here’s one way it lines up:

1. The patient (“the cousin”) is the surgeon’s son.

2. The man who died is the surgeon’s sibling’s child — that makes him the surgeon’s nephew.

3. Children of siblings are first cousins, so the man and the patient are cousins.

Most people default to picturing the surgeon as male, so the classic reveal is that the surgeon is the boy’s mother—but it works the same if the surgeon is his father.

windowshopping · 8 months ago
This one is brilliant.

LeonardoTolstoy · 8 months ago
Something about an obscure movie.

The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.

The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because Google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter, not a small military helicopter), etc.

Basically, with a lot of questions about movies it gets distracted by popular titles and suggests films that fit just some of the brief (e.g. this one has a helicopter, could that be it?).

The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right, I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".

These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.

archon1410 · 8 months ago
I've also noticed this. Google Search is vastly superior to any LLM (including their own LLM Gemini) for any "tip of my tongue" questions, even the ones that don't contain any exact-match phrase and require natural language understanding. This is surprising. What technology are they using to make Search so amazing at finding obscure stuff from descriptions, while LLMs that were supposed to be good at this badly fail?

alexjplant · 8 months ago
Last year I asked Claude about an old fighting game for the Game Boy that I'd played about five minutes of when I was in the second grade (so ~25 years ago). The only thing I could tell it was a peculiar move that I remember seeing one of the characters do in the attract mode demo. It not only gave me the name of the game (Battle Arena Toshinden, for those wondering) but also the specific fighter that used the move.

I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.

ziml77 · 8 months ago
Doesn't even have to be something obscure. I asked one about Futurama a couple years ago and got a made up answer on multiple models I tested. I think I originally asked something about the line "we're exactly the same. I know all her moves, therefore I have the upper hand" because I couldn't recall it exactly so normal web searches were failing me. Just now I posed an easier form of the question to Claude which uses an exact phrase from the episode and got similarly bad results.

===

Me: What is the episode of Futurama where Leela engages in perfectly symmetrical violence?

Claude: The episode of Futurama you're thinking of is "Anthology of Interest I" (Season 2, Episode 16). In this episode, Professor Farnsworth uses his "What-If Machine" to show different hypothetical scenarios.

During Bender's segment titled "What if Bender were human?", there's a scene where Leela confronts a group causing trouble. One of them challenges her, saying "What are you going to do?" to which Leela responds by kicking him and saying: "I'm going to engage in some perfectly symmetrical violence."

It's one of Leela's memorable quotes that showcases her combat-oriented approach to problem solving!

===

That's completely wrong. It's from S4E15 The Farnsworth Parabox. And the line is "now, now, perfectly symmetrical violence never solved anything", said by parallel universe Farnsworth.

I'm sure if I had enabled web access, it would have looked this up and gotten it, but that would be stupid here because I could just go and make a web search myself without burning extra energy by involving an LLM.

frollogaston · 8 months ago
I've gone to ChatGPT repeatedly trying to find what movie a quote is from, and it's always confidently wrong. When I tell it, it guesses wrong again. Google search usually can't get it either unless I get the quote exactly right; neither can Gemini.
empath75 · 8 months ago
Someone not very long ago wrote a blog post about asking ChatGPT to help him remember a book, and he included the completely hallucinated description of a fake book that ChatGPT gave him. Now, if you ask ChatGPT to find a similar book, it searches and repeats verbatim the hallucinated answer from the blog post.

g_sch · 8 months ago
I also recently had this experience! I remembered a recurring bit from an older comedy film (a customer in a shop keeps saying "Kumquats!") and tried to prompt ChatGPT 4o into getting it. It made a few incorrect guesses, such as "It's a Mad, Mad, Mad, Mad World" (which I had to rule out by doing my own research on Google). I found the answer myself (W.C. Fields' "It's a Gift") with a minute or so of Googling.

Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.

exitb · 8 months ago
It might be cheating a bit, but I’ve been happily (mis)using OpenAI Deep Research for such questions. It does well in cases where there are multiple surface-level matches, as it’s able to go through them one by one and look for the details.

mosburger · 8 months ago
I did something similar recently, trying to describe a piece of art that I couldn't remember the name of (it ended up being Birth of Venus by Sandro Botticelli) ... it really struggles with that sort of thing, but honestly so do most humans. It tended to recommend similarly to what you're describing with movies - it gets distracted by more popular/well-known pieces that don't really match up with the description you're giving to it.

lupusreal · 8 months ago
Despite describing several characters by name, I couldn't get ChatGPT to tell me the name of Port of Shadows. I did eventually find it with DDG.

thefourthchime · 8 months ago
I like to ask small models that can run locally:

Why are some cars called a spider?

Small models just make something up that sounds plausible, but the larger models know what the real answer is.

manucardoen · 8 months ago
It's not good at making ASCII art. This, for example, is when I ask it for a realistic depiction of the Eiffel tower on fire:

                       .
                      .'.
                      |o|
                     .'o'.
                     |.-.|
                     '   '
                    /     \
                   |       |
                __/_________\__
               |###############|
               |###############|
               |###############|
               |###############|
               |###############|
              /###############/|
             /###############/ |
            /###############/ /|
           /###############/ / |
          /###############/ / /|
         /###############/ / / |
        /###############/ / / /|
       /###############/ / / / |
      /###############/ / / / /|
     '-----------------' / / / /
      |   FIRE & ASH   |/ / / /
       '---------------' / / /
          ~ ~ ~ ~ ~ ~ ~ ~ / /
         ~~ /
          ~~ 
           ~~     ⬆ SMOKE
            ~~

cucubeleza · 8 months ago
well, I asked GPT to draw Arnold Schwarzenegger:

        ______
       /      \
      |  O  O  |
      |   __   |
       \______/
         ||||
       /--||--\ 
     /    ||    \
    |     ||     | 
    |    /  \    |
     \__/    \__/ 
       ||    ||
       ||    || 
      / |    | \
     /_/      \_\

FrostAutomata · 8 months ago
ASCII art is extremely difficult for LLMs due to how characters are converted into tokens without preserving their relative positions.
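
You can see this directly by tokenizing a small piece of art; a quick sketch using the tiktoken library (the encoding name is the GPT-4-era one):

  # Sketch: show how ASCII art fragments into uneven tokens
  # (pip install tiktoken).
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  art = [
      "   /\\",
      "  /  \\",
      " /____\\",
  ]
  for line in art:
      # Decode each token id separately to see the actual chunks.
      print([enc.decode([t]) for t in enc.encode(line)])
  # Runs of spaces and symbols merge into multi-character tokens of varying
  # length, so characters that align vertically don't align token-wise.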

yesbabyyes · 8 months ago
This is something I and a few of my colleagues have noticed, as we asked several models to draw ASCII art of a wasp, which is one of our logos. The results are hilarious, and only seem to get worse as you ask it to do better.

bezbac · 8 months ago
I've read that the results improve if you ask them to write a program that creates the desired ASCII art. Haven't tried it myself yet so far.
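
For what it's worth, the kind of program you'd be asking for can be tiny; a toy sketch (the shape and width are arbitrary):

  # Toy example: emit a tower-ish triangle; the model only has to get the
  # loop arithmetic right, and the character alignment falls out for free.
  WIDTH = 11  # any odd number works

  for row in range(1, WIDTH + 1, 2):
      pad = (WIDTH - row) // 2
      print(" " * pad + "#" * row)
  print("#" * WIDTH)  # base
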
bn-l · 8 months ago
Art is highly subjective

alissa_v · 8 months ago
I asked a bunch of LLMs - 'Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.'

Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.

Example- 'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.

It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."

jrimbault · 8 months ago
The issue is probably that the first sentence, the prompt, statistically looks like fantasy (as in the literary genre) and it primes the LLM to answer in the same probabilistic genre.

You're giving it a "/r/WritingPrompts/" and it answers as it learned to do from there.

gchamonlive · 8 months ago
This is like throwing a bike in a lake and complaining it doesn't float...

kfajdsl · 8 months ago
I asked Gemini this, and it caught that it was fiction:

This sounds like a fascinating ritual from a fictional world! Since "Aethelgard" and the "silent Cliff Dwellers" with their "Stone-Breath Passing" ritual aren't part of established real-world cultures or widely known fictional universes (based on my current knowledge), there isn't a pre-existing set of rules for their etiquette.

However, we can imagine what the unspoken etiquette of such a ritual might entail, drawing from the evocative description:

...

alissa_v · 8 months ago
the rest of ChatGPT's response if you're curious - it only gets more ridiculous-

' Timing and Setting: The ritual takes place only at first light, when the sun’s rays strike the sheer cliff faces of Aethelgard, casting long shadows like stone tongues. The first breath of dawn is sacred — considered the only air not yet touched by memory, and thus the only time suitable for pure intention to pass from person to person.

Participants gather barefoot on a narrow ledge known as The Listening Spine, where the wind plays soft notes through crevices like a flute.

The Stones: More Than Objects

The stones used in the ritual are:

Smooth, rounded slate-grey

Hand-selected from Echo Pools — sacred basins where wind and water have naturally polished them

Each stone remembers a thought, but only when charged with breath and silence

They are never marked, carved, or labeled. Their meaning is in the exchange, not the object.

The Ritual: How It Works

1. Preparation

The giver closes their eyes and holds the stone at heart-level.

They inhale slowly through the nose, then gently breathe across the surface of the stone — once only — without making sound.

The exhalation is not forced; it is a gesture of release, not command. This is the "stone-breath."

2. The Passing

The giver extends the stone with both hands, arms curved inward like a half-embrace. This posture signifies humility and emotional openness.

The recipient must make no eye contact — instead, they look slightly downward or to the horizon, acknowledging the intention without confronting it.

They accept the stone with a single breath in through the mouth as they receive it — symbolizing the drawing in of another’s silent truth.

3. The Holding

The stone is held to the forehead or chest, depending on what is being received:

Forehead = wisdom, apology, memory

Chest = trust, mourning, love

Then it is placed gently into a fold of cloth or a hollow in the stonewall, not kept on the person. These collected stones are never discussed. To name their meaning would break them.

Unspoken Etiquette

No reciprocation is expected — to return a stone is to reject its breath.

Never pass a stone with warm hands — warmth implies manipulation. Hands are cooled in dew before the ritual.

A cracked or chipped stone is never used — it cannot hold breath cleanly.

No stone is passed more than once. Re-gifting a stone is considered a breach of the soul's private pathway.

Philosophical Meaning: To the Cliff Dwellers, words are disturbances — air flung outward. But breath is internal wind, and the stone is earth’s memory. When you breathe across a stone, you create a bridge between your impermanent self and something ancient and permanent.

It is not the exchange of meaning, but the weight of acknowledgment.

Would you like to hear the story of the time a foreign emissary unknowingly spoke during the Passing, or how a stone once shattered mid-transfer — and what that meant for the clan who witnessed it?'

jppope · 8 months ago
There are several songs that have famous "pub versions" (dirty versions) which are well known but have basically never been written down; go ask any working musician and they can rattle off ~10-20 of them. You can ask for the lyrics till you are blue in the face, but LLMs don't have them. I've tried.

It's actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too - since team sports are predominantly youth activities, these subcultures are poorly documented and usually change frequently.

mobilejdral · 8 months ago
I have several complex genetics problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September it started getting close, and November was the first time an LLM was able to solve one. These are not problems that can be solved in one shot, but (so far) require long reasoning. Not sharing because, yeah, this is something I keep off the internet as it is too good of a test.

But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.

tlb · 8 months ago
There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.

Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.

TZubiri · 8 months ago
Recursive challenges are probably ones where the difficulty is not representative of real challenges.

Could you answer a question of the type " what would you answer if I asked you this question?"

What I'm going after is that you might find questions that are impossible to resolve.

That said, if the only unanswerable questions you can find are recursive, isn't that a signal the AI is smarter than you?

namaria · 8 months ago
If you have been giving the LLMs these problems, there is a non-zero chance that they have already been used in training.

golergka · 8 months ago
What area is this problem from? What areas in general did you find useful for creating such benchmarks?

Maybe instead of sharing (and leaking) these prompts, we can share methods to create them.

codingdave · 8 months ago
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.

Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."

regnull · 8 months ago
ChatGPT really impressed me on this one:

Sing, O Muse, of the Chucking of Wood

Sing, O Muse, of the stout-hearted woodchuck,
Dweller of hills, whom men call groundhog,
Whose paws, though small, could hurl mighty timbers,
If fate had gifted him such labor divine.

Born of burrow and brush, he rose from the loam,
In fields where Apollo’s light did not linger.
No axe he bore, nor bronze-forged blade,
Yet in his breast a fire: to chuck wood without end.

Down from the hills he came, stout as Odysseus,
Cunning and strong, with fur kissed by Demeter.
The trees did tremble at his gaze,
And even old oak whispered, “Beware!”

“Would he chuck?” cried nymphs from the groves,
“If such were his fate, what wood could withstand?”
Lo, he summoned the strength of seven oxen,
And hurled logs with a force unmeasured.

Seven hundred pounds of timber he cast,
Each arc a hymn to muscle and will.
Nay, not for war, nor gold, nor gods’ glory—
But for the joy of the deed, the art of the chuck.

Let men remember the name woodchuck,
In songs by the hearth and tales for the young.
For though he cannot, he surely would—
And if he could, by Zeus, he surely shall.

Certified · 8 months ago
GPT 4.5 seems to get it right, but then repeats the 700 pounds:

"A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.

However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."

https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...

once_inc · 8 months ago
I loved this dialogue in Monkey Island 2, where this is basically the first NPC you talk to, and the dialogue options get wordier and wordier to the point of overflowing all screen real estate. Perfectly sets the stage for the remainder of the game.

mdp2021 · 8 months ago
It seems you are going in the opposite direction. You seem to be asking for an automatic response, a social password etc.

That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.

mwest217 · 8 months ago
Gemini 2.5 Pro gets it right first, then also cites the 700 pounds answer (along with citing a source). https://g.co/gemini/share/c695a0163538

ishandotpage · 8 months ago
I usually ask "How much large language could a large language model model if a large language model could model large language"

Not one has given me the correct answer yet.

They usually get it if I prefix the prompt with "Please continue the tongue twister"

mcshicks · 8 months ago
That's so funny. I had to check something was working with an LLM API last night, and that's what I asked it, but just in jest.

segmondy · 8 months ago
My local model answered - "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood."

jacobsenscott · 8 months ago
"He would chuck, he would, as much as he could, if a wood chuck could chuck wood" is how I learned it.

moffkalast · 8 months ago
Now I'm wondering if it makes any difference if this was asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model.

unavoidable · 8 months ago
On the other hand, now that you've written this out precisely, it will get fed into the next release of whatever LLM. Like reverse AI slop?