og_kalu · 5 months ago
One of the biggest problems with hands off LLM writing (for long horizon stuff like novels) is that you can't really give them any details of your story because they get absolutely neurotic with it.

Imagine for instance you give the LLM the profile of the love interest for your epic fantasy, it will almost always have the main character meeting them within 3 pages (usually page 1) which is of course absolutely nonsensical pacing. No attempt to tell it otherwise changes anything.

This is the first model that, after 19 pages generated so far, resembles anything like normal pacing even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.

Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

with pastebin - https://pastebin.com/aiWuYcrF

comboy · 5 months ago
I like how the critique of LLMs has evolved on this site over the last few years.

We are currently at nonsensical pacing while writing novels.

skyechurch · 5 months ago
The most straightforward way to measure the pace of AI progress is by attaching a speedometer to the goalposts.
solardev · 5 months ago
It's not really passing the Turing Test until it outsells Harry Potter.
krzat · 5 months ago
This either ends at "better than 50% of human novels" garbage or at unimaginably compelling works of art that completely obsolete fiction writing.

Not sure which is better for humanity in the long term.

ruraljuror · 5 months ago
We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?
rafaelmn · 5 months ago
People are trying to use gen AI in more and more use cases. It used to fall flat on its face at trivial stuff; now it's past the trivial stuff but still scratching at the boundary of being useful. That's not an attempt to make gen AI look bad (what it can do really is amazing), but it is far from delivering on the hype, and that's why people offer critical evaluations.

Let's not forget the OpenAI benchmarks saying GPT-4 could do better on college exams and such than most students. Yet real-world performance on real tasks was laughable.

stickfu · 5 months ago
I don’t know why I keep submitting myself to Hacker News, but every few months I get the itch, and it only takes a few minutes to be turned off by the cynicism. I get that it’s from potentially wizened tech heads who have been in the trenches and are being realistic. It’s great for that, but any new bright-eyed and bushy-tailed dev/techy should stay far away until much later in their journey.
ksec · 5 months ago
Do we have any simple benchmarks (and I know benchmarks are not everything) that test all the LLMs?

The pace is moving so fast I simply can't keep up. Or an ELI5 page that gives a 5-minute explanation of LLMs from 2020 to this moment?

basch · 5 months ago
It’s more a bellwether or symptom of a flaw where the context becomes poisoned and continually regurgitates the same thought over and over.
leokennis · 5 months ago
Not really new, is it? The first cars just had to approach horse-and-cart levels of speed. Comfort, ease of use, etc. were non-factors, as this was "cool new technology".

In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.

ripped_britches · 5 months ago
lol wouldn’t that be great to read this comment in 2022
deng · 5 months ago
I have actually read it and agree it is impressive. I will not comment much on the style of the writing, since that is very subjective, but I would rate it as the "typical" modern fantasy style, which aims at filling as many pages as possible: very "flowery" language, lots of adjectives/adverbs, lots of details, lots of high-school prose ("Panic was a luxury they couldn't afford"). Not a big fan of that, since I really miss the time when authors could write single, self-contained books instead of sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it. If someone gave me this, I would advise them to get a good copy editor.

There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which is obviously impossible when you look up through a trapdoor, not to mention that this entry was previously said to be behind the bar counter, surely blocking the sight. Kael lights an oily rag which somehow becomes a torch. There are more generic issues, like the Eldertides being mythical things no one has ever seen, yet seemingly pretty common occurrences. The dimensions of the cellar are completely unclear: at first it seems very small, yet they move around it quite a bit. There are other problems, like characters using the same words as the narrator ("the ooze"), as if they can hear him, and the innkeeper suddenly calling Kael by his name as if they already know each other.

Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.

hjnilsson · 5 months ago
Thank you for taking the time to do a thorough read. I just skimmed it, and the prose is certainly not for me. To me it lacks focus, but as you say, this may be the style readers enjoy.

And it also, as you say, really reuses words. Just reading, I notice "phosphorescence" 4 times in this chapter, for example, and "ooze" 17 times (!).

It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.

blinding-streak · 5 months ago
Regarding your last sentence, I agree. My stance is this: If you didn't bother to write it, why should I bother to read it?
og_kalu · 5 months ago
Yeah I just focused on how well it was paced and didn't give any instructions on style or try a second pass to spot any inconsistencies.

That would be the next step but I'd previously never thought going any further might be worth it.

KittenInABox · 5 months ago
> Not a big fan of that since I really miss the time where authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it.

When was this time you speak of?

nout · 5 months ago
Using the AI in multiple phases is the approach that can handle this. Similarly to "Deep Research" approach - you can tell it to first generate a storyline with multiple twists and turns. Then ask the model to take this storyline and generate prompts for individual chapters. Then ask it to generate the individual chapters based on the prompts, etc.
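That phased flow can be sketched in a few lines of Python. Everything here is hypothetical: `complete` is a placeholder standing in for whatever LLM API you use, and the prompts are only illustrative:

```python
def complete(prompt: str) -> str:
    """Placeholder for a real LLM call; swap in your API client of choice."""
    return f"[model output for: {prompt.splitlines()[0][:50]}]"

def write_book(premise: str, n_chapters: int = 3) -> list[str]:
    # Phase 1: overall storyline with twists and turns.
    storyline = complete(f"Outline a storyline with twists and turns for: {premise}")
    # Phase 2: turn the storyline into one prompt per chapter.
    chapter_prompts = [
        complete(f"Storyline:\n{storyline}\nWrite a detailed prompt for chapter {i}.")
        for i in range(1, n_chapters + 1)
    ]
    # Phase 3: generate each chapter from its prompt.
    return [complete(f"Write the full chapter from this prompt:\n{p}")
            for p in chapter_prompts]
```

Each phase only ever sees a compact artifact from the previous one, which is what keeps pacing decisions out of the final generation step.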
bbor · 5 months ago
Yup -- asking a chatbot to create a novel in one shot is very similar to asking a human to improvise a novel in one shot.
fshr · 5 months ago
I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.
og_kalu · 5 months ago
I don't disagree. Like with almost anything else involving LLMs, getting hands-on produces better results. But in this instance I much prefer to be the reader rather than the author or editor, so it's really important to me that an LLM is capable of pacing long-form writing properly on its own.
tluyben2 · 5 months ago
That's what I tried on the train [0] a few weeks ago. I used Groq to get something very fast, to see if it would work at least somewhat. It gives you a PDF in the end. Plugging in a better model gave much better results (still not really readable if you actually try; at a glance it's convincing though), but it was so slow that testing was all but impossible. You can't really run things in parallel either, because each chapter needs to know what came before it, or at least a summary of it.

[0] https://github.com/tluyben/bad-writer

sagarpatil · 5 months ago
My prompt is nowhere near yours.

Just for fun: Asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer, the rewrite is hilarious lol.

https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...

didip · 5 months ago
Give it time, this will be solved.

I envision that one day a framework will be created that can persist the LLM's current state to disk, so that "fragments of memories" can be paged in and out of context.

When that happens, LLMs will be able to remember everything.

smcleod · 5 months ago
I have never used an LLM for fictional writing, but I have been writing large amounts of code with them for years. What I'd recommend is when you're defining your plan up front as to the sections of the content, simply state in which phase / chapter of the content they should meet.

Planning generated content is often more important to invest in than the writing of it.

Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large expected output, I'd expect a considerable prompt of rules and context setting (maybe a page or two).
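For instance, something along these lines (an illustrative structure only, not a tested prompt; all the tag names and contents are made up):

```xml
<role>You are drafting an epic fantasy novel.</role>
<style>Third-person limited; restrained prose; sparing adverbs.</style>
<pacing>The love interest must not appear before chapter 8.</pacing>
<characters>...profiles go here...</characters>
<task>Write chapter 1 only, roughly 3000 words.</task>
```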

pantsforbirds · 5 months ago
I had Grok summarize + evaluate the first chapter with thinking mode enabled. The output was actually pretty solid: https://pastebin.com/pLjHJF8E.

I wouldn't be surprised if someone figured out a solid mixture of models working as a writer (team of writers?) + editor(s) and managed to generate a full book from it.

Maybe some mixture of general outlining + maintaining a wiki with a basic writing and editing flow would be enough. I think you could probably find a way to maintain plot consistency, but I'm not so sure about maintaining writing style.

stephen_cagle · 5 months ago
Opening with "like a struck flint carried on a wind that wasn’t blowing." <chuckles>

I don't know why, but that is just such a literal thing to say that it seems almost random.

bwb · 5 months ago
why would you ever want to write a novel with AI, that is human stuff right? :)
qingcharles · 5 months ago
I'm terrible at writing, but I love reading. I've got ideas for novels, but I struggle to put them down.

What I have found that works is to give the LLM the "world" outline at the beginning and then just feed it one line summary of each chapter and get it to write a chapter at a time.

The problem is that the quality of results drastically decreases as the context length increases. After about 10 chapters the dialogue will start to get real snippy. I've tried getting it to summarize all the previous chapters and feed that back in, but it never includes enough detail.
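One partial workaround is to fold each new chapter into a bounded running summary instead of re-summarizing everything from scratch. Just a sketch of mine, assuming `summarize` stands in for an LLM call and the prompt wording is hypothetical:

```python
def rolling_context(chapters, summarize, budget=4000):
    """Maintain a bounded 'story so far' by folding each new chapter
    into the running summary, rather than summarizing all chapters anew.
    `summarize` stands in for an LLM call."""
    story_so_far = ""
    for chapter in chapters:
        story_so_far = summarize(
            f"Previous summary:\n{story_so_far}\n\nNew chapter:\n{chapter}\n\n"
            "Update the summary, keeping named characters, promises made, "
            "and unresolved plot threads."
        )[:budget]  # hard cap so the context fed back never grows unbounded
    return story_so_far
```

The explicit "keep characters, promises, unresolved threads" instruction is an attempt at the detail problem described above; whether it retains enough detail in practice is exactly the open question.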

og_kalu · 5 months ago
I don't want to write a novel with AI. I want to read them (when they're good enough) because I love reading. Sometimes I want to read something with a certain dynamic, and it gets difficult finding human-written recommendations.
harrison_clarke · 5 months ago
this seems like something that planning would fix. i wonder if that's how it's doing it

like, if it decides to <think> a table of contents, or chapter summaries, rather than just diving in at page 1

Workaccount2 · 5 months ago
Can you share it on a text sharing site? It seems you hit your share quota
dyauspitr · 5 months ago
That is mind blowing. To this fantasy reader that’s pure magic.
solardev · 5 months ago
19 pages?! Am I the only one who prefers an AI that jumps straight to the point?

- Buildup and happy background world-building

- Subtle foreshadowing

- Orcs attack

- Hero is saved by unlikely warrior of astounding beauty

- Evil is defeated until sales justify unnecessary sequel

That's the kind of story fit for the modern attention span...

malisper · 5 months ago
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.

Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.

For those curious the riddle is: There's three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
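For what it's worth, the dialogue can also be brute-forced mechanically. A sketch (my own formulation, not necessarily how any model reasons about it), assuming the speakers are asked in order 1, 2, 3, 1 and that a speaker "knows" exactly when one candidate number is consistent with all earlier "don't know" answers:

```python
def knows(nums, turn):
    """Can the speaker at `turn` (people are asked 0, 1, 2, 0, ...)
    deduce their own number, given every earlier speaker passed?"""
    i = turn % 3
    a, b = (nums[j] for j in range(3) if j != i)
    # The hidden number is the sum of the other two, or (if nonzero)
    # their positive difference.
    candidates = {a + b} if a == b else {a + b, abs(a - b)}
    viable = [c for c in candidates
              if all(not knows(tuple(c if j == i else nums[j] for j in range(3)), t)
                     for t in range(turn))]
    return len(viable) == 1

def solutions(first=65, limit=200):
    """Triples matching the dialogue: three passes, then person 1 knows."""
    found = set()
    for b in range(1, limit):
        for c in (first - b, b - first, b + first):  # one number is the sum
            nums = (first, b, c)
            if c > 0 and not any(knows(nums, t) for t in range(3)) and knows(nums, 3):
                found.add(nums)
    return found
```

Under these assumptions the search turns up (65, 26, 39), i.e. product 65910.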

hmottestad · 5 months ago
This looks like it was posted on Reddit 10 years ago:

https://www.reddit.com/r/math/comments/32m611/logic_question...

So it’s likely that it’s part of the training data by now.

canucker2016 · 5 months ago
You'd think so, but both Google's AI Overview and Bing's Copilot output wrong answers.

Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."

Whoa. Math is not AI's strong suit...

Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."

Hats???

Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.

malisper · 5 months ago
Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem, giving it a number like 85 instead of 65, and Gemini is still able to properly reason through it.
toonalfrink · 5 months ago
This whole answer hinges on knowing that 0 is not a positive integer, that's why I couldn't figure it out...
f1shy · 5 months ago
Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that LLMs can reason. It just regurgitates the input.
drexlspivey · 5 months ago
And if it wasn’t, it is now

_cs2017_ · 5 months ago
This is solvable in roughly half an hour with pen and paper by a random person I picked with no special math skills (beyond a university education). It is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population at the Albanian language, since less than 0.1% bother to learn Albanian.

Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.

This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's a far far harder than this math puzzle).

windowshopping · 5 months ago
The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.
wrasee · 5 months ago
So in the three hours between you reading the puzzle in the parent comment, you stopped what you were doing, managed to get some other "random" person to stop what they were doing and spend half an hour of their time on a maths puzzle that at that point prior experience suggested could take a day? All within three hours?

That's not to say that you didn't, or that you're recalling from a previous time you happened to see this exact puzzle (despite there being scant prior references to it, which is precisely the reason for using it). But you can see how some might find that not entirely credible.

Best guess: this random person is someone that really likes puzzles, is presumably good at them and is very, very far from being representative to the extent you would require to be in support of your argument.

Read: just a heavy flex about puzzle solving.

re-thc · 5 months ago
> This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university).

I randomly answered this post and can't solve it in half an hour. Is the point LeetCode but for AI? I'd rather it solve real problems than "elite problems".

Side note: I couldn't even find pen and paper within half an hour.

sebzim4500 · 5 months ago
This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.
Workaccount2 · 5 months ago
What makes this interesting is that while the question is online (on Reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on Reddit.

Granted, this is all way over my head, but the solution Gemini comes to matches the one given on Reddit (and now here, in future training runs):

65×26×39=65910

varispeed · 5 months ago
Seems like we might need a section of internet that is off limits to robots.
beefnugs · 5 months ago
Why is this a great riddle? It sounds like incomplete nonsense to me:

It doesn't say anything about the skill levels of the participants, whether their answers are just guesses, or why they aren't simply guessing the sum of the other two numbers each time they're asked for more information.

It doesn't say the guy saying 65 is even correct.

How could three statements of "no new information" give information to the first guy, who didn't know the first time he was asked?

yifanl · 5 months ago
You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving a number in the training set is very easy.

i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value, how did they know?

And of course, if it responds with a verbatim answer in the line of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.

semiinfinitely · 5 months ago
I love how the entire comment section is getting one-shotted by your math riddle instead of the original post topic.
refulgentis · 5 months ago
In general I find commentary here too negative on AI, but I'm a bit squeamish about maximalist claims re: AI mathematical reasoning vs. human population based off this, even setting aside lottery-ticket-hypothesis-like concerns.

It's a common logic puzzle, Google can't turn up an exact match to the wording you have, but ex. here: https://www.futilitycloset.com/2018/03/03/three-hat-problem/

utopcell · 5 months ago
Same here: my problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what day is three days before the day after tomorrow?", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.

[1] https://en.wikipedia.org/wiki/100_prisoners_problem

r0fl · 5 months ago
Wow

Tried this in DeepSeek and Grok and it kept thinking in loops for a while, so I just turned it off.

I haven't seen a question loop this long, ever.

Very impressed

z2 · 5 months ago
Deepseek R1 got the right answer after a whopping ~10 minutes of thinking. I'm impressed and feel kind of dirty, I suspect my electricity use from this could have been put to better use baking a frozen pizza.
deepboy2 · 5 months ago
Just tried it on Deepseek (not R1, maybe V3-0324) and got the correct answer after 7-8 pages of reasoning. Incredible!
SwayStar123 · 5 months ago
You can also put the AI in the first person's shoes. Prompt: You are standing in a circle, there are 2 other people in the circle with you, everyone in the circle, has a positive integer above their head, no one knows what the number above their own head is but can see the numbers above the heads of the other people. You see that the person infront of you on the left has 26 above their head. The person on the right has 39 above their head. You are told that the sum of two of the numbers is the third number. You are asked what the number above your head is, the option is the sum, 65, or 13, as 26 + 13 = 39. You don't know which one it is, and you say so. The second person is asked the number above their head. They also say they dont know, the third person also says they dont know. What is your number?

Gemini 2.5 and claude 3.7 thinking get it right, o3 mini and 4o get it wrong

drewbeck · 5 months ago
I just asked it this twice and it gave me 65×65×130=549250. Both times. The first time I made it about ducks instead of people and mentioned that there was a thunderstorm. The second time I c/p your exact text and it gave me the same answer.

Again we find that the failure state of LLMs is a problem – yeah, when you know the answer already and it gets it right, that's impressive! When it fails, it still acts the same exact way and someone who doesn't already know the answer is now a lil stupider.

eru · 5 months ago
I use an algorithmic question that I'd been working on for years and that I'm finally writing up the answer to.

It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?

(The answer is surprisingly: Yes.)
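Not the linear-time answer, obviously, but for anyone who wants to play with the question, the naive O(n log n) baseline is just running the operations directly (a sketch using Python's heapq; the operation encoding is my own):

```python
import heapq

def leftovers(ops):
    """Run a sequence of ('insert', x) / ('delmin',) heap operations
    and return the elements remaining at the end, in sorted order.
    This is the O(n log n) baseline the question asks to beat."""
    heap = []
    for op in ops:
        if op[0] == 'insert':
            heapq.heappush(heap, op[1])
        else:
            heapq.heappop(heap)
    return sorted(heap)
```

The puzzle is whether the final contents (not the full pop order) can be predicted with only O(n) comparisons.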

integralof5y · 5 months ago
A Prolog program (SWI-Prolog) takes less than a second to solve your puzzle.

N is the number of turns of "don't know" answers; the bad predicate succeeds when the person could know their number by turn N.

  bad(_,_,_,-1) :- !,false.
  bad(_,A,A,0) :- !.
  bad(A,_,A,0) :- !.
  bad(A,A,_,0) :- !.
  bad(B,C,A,N) :- D is abs(B-A),D<C,N1 is N-1, bad(B,D,A,N1),!.
  bad(C,A,B,N) :- D is abs(B-A),D<C,N1 is N-1, bad(D,A,B,N1),!.
  bad(A,B,C,N) :- D is abs(B-A),D<C,N1 is N-1, bad(A,B,D,N1),!.
  
  solve(X,Y,Z) :- Y1 is X-1, between(1,Y1,Y),
                  between(0,2,N), Z is X-Y,bad(X,Y,Z,N).

  ?- solve(65,X,Y).
  X = 26,
  Y = 39 ;
  X = 39,  
  Y = 26 .

adpirz · 5 months ago
Interactive playground for the puzzle: https://claude.site/artifacts/832e77d7-5f46-477c-a411-bdad10...

(All state is stored in localStorage so you can come back to it :) ).

TrackerFF · 5 months ago
The riddle certainly nerd-sniped GPT 4.5

After a couple of minutes it decided on the answer being 65000. (S = {65, 40, 25})

dkjaudyeqooe · 5 months ago
> I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.

It's not an exaggeration, it's a non sequitur: you first have to show that LLMs reason in the same way humans do.

bbstats · 5 months ago
Could you explain "The sum of two of the numbers is equal to the third"??
rappatic · 5 months ago
I think:

Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.

When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn't know. Same goes for p2 and p3. Clearly p1 somehow gains information from p2 and p3 passing. Either he realizes that he must have a or b, and thus his number is the difference between p2's and p3's numbers, or he realizes that he must have c, and so his number is the sum of p2's and p3's numbers.

That’s all I have so far. Anyone have other ideas?

malisper · 5 months ago
if the three numbers are a, b, and c, then either a+b=c, a+c=b, or b+c=a
refulgentis · 5 months ago
Given #s x,y, and z, either x + y = z, x + z = y, or y + z = x.
mitko · 5 months ago
Loved that puzzle, thanks for sharing it. I’ve solved a lot of math problems in the past but this one had a unique flavor of interleaving logical reasoning, partial information and a little bit of arithmetic.
highfrequency · 5 months ago
Fun puzzle! I’m curious how you managed to structure the problem such that a computer could solve it but it took 3 days of computation?
hmottestad · 5 months ago
Is the answer somehow {65, 20, 45} with the product 58,500?

That’s one-shot for o1 pro.

sebzim4500 · 5 months ago
That's wrong. From player 1's perspective, {25, 20, 45} is entirely consistent with the calls made in the first three rounds.
g105b · 5 months ago
Please can you enlighten me? I'm a mathematical plebeian.
doener · 5 months ago
ChatGPT 4.5:

Let’s carefully analyze this puzzle step-by-step:

Step 1: Define the Problem Clearly

We have three people in a circle. Let’s call them:
• Person A (first person)
• Person B (second person)
• Person C (third person)

Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy:

A + B = C, or B + C = A, or C + A = B

Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.

Step 2: Logic of the Puzzle (Step-by-Step Reasoning)

Initial conditions: The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.

Let’s denote the conditions clearly again. Exactly one of the following three equations is true:
• A + B = C, or
• B + C = A, or
• C + A = B.

Person A sees B and C, but not A. Similarly for Persons B and C.

Step 3: Logical deductions from the repeated “I don’t know”

First round deductions:
• Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers.
• Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions.
• Person C also doesn’t know, same logic as above.

Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.

Second round deductions:

After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.

Crucially, in the second round, Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.

Step 4: Deriving the Numbers (Critical deduction)

Let’s denote clearly again: Person A’s number is now known to be 65.

We have three possible equations. Let’s test each carefully given A = 65:

1. Case 1: A + B = C. Then 65 + B = C, so Person A sees B and 65 + B. Seeing these two numbers, they might initially consider their own number to be either the sum or the difference of them. However, if Person A sees two numbers where one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see X and 2X, your number is clearly X). So the scenario of B and 65 + B works only if the two numbers don’t exhibit this immediately trivial case. Let’s hold this scenario and examine the others first.

2. Case 2: B + C = A (so B + C = 65). Then Person A sees numbers B and C adding exactly to 65. Could Person A know their number immediately at first glance? No, since their number could be either the sum (65) or the difference of the two numbers. The initial ambiguity makes sense. Persons B and C also see ambiguous scenarios and don’t know their numbers in the first round either. This scenario is plausible.

3. Case 3: C + A = B (so C + 65 = B). Then Person A sees B = C + 65 and C, i.e. C and C + 65. Similar reasoning as Case 1.

Step 5: Identifying the Correct Scenario

The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.

Here’s why, clearly stated:
• Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could be either 65 (the sum) or the difference between the two numbers.
• Person B also initially sees two numbers and cannot immediately distinguish their own number, nor can Person C.
• After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers that A sees had allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to them. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).

Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.

But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:

A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple:
• Numbers: 65, 39, 26
• Check quickly: indeed, 39 + 26 = 65.

Checking logic quickly:
• Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13).
• Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous.
• Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.

After each person says they don’t know:
• Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13 + 26 = 39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.

Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!

Step 6: Final Answer (Product of the Numbers)

The three numbers identified are:
• A = 65
• B = 39
• C = 26

Then, the product is: 65 × 39 × 26 = 65 × 1,014 = 65,910

Final Answer:

65,910

WithinReason · 5 months ago
This was not logic, this was recall from the training set disguised as logic:

Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26

simonw · 5 months ago
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.

Plus it drew me a very decent pelican riding a bicycle.

Notes here: https://simonwillison.net/2025/Mar/25/gemini/

jillesvangurp · 5 months ago
Have you considered that they must be training on images of pelicans riding bicycles at this point ;-). Given how often that comes up in your reviews, a smart LLM engineer might put their finger on the scale a bit and optimize for the things that come up a lot in reviews of their work.
redox99 · 5 months ago
Claude's pelican is way better than Gemini's
jonomacd · 5 months ago
I'm not so sure. I've run it a bunch of times. It makes a great pelican.

Personally I'm convinced this model is the best out there right now.

https://www.reddit.com/r/Bard/comments/1jjobaz/pelican_on_a_...

ggeorgovassilis · 5 months ago
I've been following your blog for a while now, great stuff!
kridsdale3 · 5 months ago
I just tried your trademark benchmark on the new 4o Image Output, though it's not the same test:

https://imgur.com/a/xuPn8Yq

jonomacd · 5 months ago
And the same thing with gemini 2.0 flash native image output.

https://imgur.com/a/V4YAkX5

It's sort of irrelevant though as the test is about SVGs.

Unroasted6154 · 5 months ago
Was that an actual SVG?
freediver · 5 months ago
Tops our benchmark in an unprecedented way.

https://help.kagi.com/kagi/ai/llm-benchmark.html

High quality, to the point. Bit on the slow side. Indeed a very strong model.

Google is back in the game big time.

aoeusnth1 · 5 months ago
It should be in the "reasoning" category, right? (still topping the charts there)
causal · 5 months ago
Remarkable how few tokens it needed to get a much better score than other reasoning models. Any chance of contamination?
85392_school · 5 months ago
It makes me wonder how the token counting was implemented and if it missed the (not sent in API) reasoning.
utopcell · 5 months ago
That is some wide gap!
anotherpaulg · 5 months ago
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.

This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.

[0] https://aider.chat/docs/leaderboards/
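For context, aider's diff-like edit format asks the model to emit search/replace blocks instead of rewriting whole files, which is exactly where weaker models tend to slip. Roughly like this (an illustrative sketch of the shape, not an exact spec; the file and code are made up):

```
path/to/file.py
<<<<<<< SEARCH
def total(xs):
    return sum(xs)
=======
def total(xs):
    # guard against None entries
    return sum(x for x in xs if x is not None)
>>>>>>> REPLACE
```

The SEARCH text must match the file verbatim, so a model that paraphrases or drops whitespace fails the edit even when its code change is right.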

aoeusnth1 · 5 months ago
Am I correct in assuming that accuracy < using correct edit format? i.e. it made mistakes in 27% of the problems, 11% of which were due to (at least) messing up the diff format?

In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.

anotherpaulg · 5 months ago
It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.

Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.

sagarpatil · 5 months ago
The only benchmark I care about. Thanks!
Oras · 5 months ago
These announcements have started to look like a template.

- Our state-of-the-art model.

- Benchmarks comparing to X,Y,Z.

- "Better" reasoning.

It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.

devsda · 5 months ago
Reminds me of how nobody is too excited about flagship mobile launches anymore. For some time now, most flagships have been incremental updates over the previous generation and only marginally better. Couple that with Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from the established players aren't noteworthy anymore.

It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.

breppp · 5 months ago
I think the greatest issue with buying a new phone today is, ironically, the seamless migration.

Once you get all your apps, wallpaper, shortcut order and the same OS, you very quickly get the feeling you spent $1,000 on the exact same thing.

flakiness · 5 months ago
Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones a year even in the product's heyday.

These LLMs were supposedly limited by the training run, but the point-version models are mostly post-training driven, which seems to take less time.

If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would slow down and we'd get a slower summer, which I'm secretly wishing for.

tibbar · 5 months ago
For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.

So, I think these results are very interesting, if you know what features specifically you are using.

zwaps · 5 months ago
But they score it on their own benchmark, on which, coincidentally, Gemini models were always the only good ones. On NoLiMa or BABILong we see that Gemini models still can't do long context.

Excited to see if it works this time.

bhouston · 5 months ago
> It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.

This is the commodification of models. There is nothing special about the new models except that they perform better on the benchmarks.

They are all interchangeable. This is great for users as it adds to price pressure.

flir · 5 months ago
Man, I hope those benchmarks actually measure something.
Workaccount2 · 5 months ago
Sooner or later someone is going to find "secret sauce" that provides a step-up in capability, and it will be closely guarded by whoever finds it.

As big players look to start monetizing, they are going to desperately be searching for moats.

cadamsdotcom · 5 months ago
Why not snooze the news for a year and see what’s been invented when you get back. That’ll blow your mind properly. Because each of these incremental announcements contributes to a mind blowing rate of improvement.

The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.

The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.

diego_sandoval · 5 months ago
I take this as a good thing, because they're beating each other every few weeks and using benchmarks as evidence.

If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.

gtirloni · 5 months ago
The improvements have been marginal at best. I wouldn't call that beating.
ototot · 5 months ago
Maybe they just asked Gemini 2.5 to write the announcement.
cpeterso · 5 months ago
And it was trained on the previous announcements.
schainks · 5 months ago
I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.

Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.

throwaway2037 · 5 months ago

    > I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
This is useful feedback. I'm not here to shill for OpenAI, nor Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using gnuplot. To paraphrase, they said "it is like having a good intern".

cratermoon · 5 months ago
Glaringly missing from the announcements: concrete use cases and products.

The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.

throwaway2037 · 5 months ago
I would say Adobe is doing an excellent job of commercialising image manipulation and generation using LLMs. When I see adverts for their new features, they seem genuinely useful for normie users who are trying to edit some family/holiday photos.
kiratp · 5 months ago
https://www.osmos.io/fabric

Practical, real-world application.

sebzim4500 · 5 months ago
ChatGPT has like 500M weekly active users, what are you on about?
taf2 · 5 months ago
We have incrementally improved: 1% better than we were yesterday. Our competition is 1 day behind us now.
throwaway2037 · 5 months ago
Like! No trolling: This could be a sarcastic comment written by an LLM!
andai · 5 months ago
Well hey, OpenAI did the exact opposite, and nobody liked that either.
exitb · 5 months ago
I think people were fine with OpenAI demos. They were less fine with not actually ever releasing the demoed tech.
andai · 5 months ago
To clarify, by "doing the opposite" I mean OpenAI releasing GPT-4.5, a non-reasoning model that does worse on benchmarks (but supposed to be qualitatively better). People shit on OpenAI hard for doing that.
emptysongglass · 5 months ago
I liked their announcements and demos and continue to like them.
Oras · 5 months ago
How did you measure “nobody” liked OpenAI announcements?
kh_hk · 5 months ago
Was going to comment the same thing; this has been bugging me lately about all the announcements that start with "our" followed by empty superlatives. Happy to not be alone on this!
sagarpatil · 5 months ago
AI labs, it seems, use a template for system cards as well. OpenAI stands out because they showcase their employees using their tools for various use cases, which is refreshing.
Mistletoe · 5 months ago
I’m sure the AI helps write the announcements.
belter · 5 months ago
Cancelled my account a long time ago. Gemini models are like a McDonald's croissant: you always give them an extra chance, but they always fall apart in your hands...
greatgib · 5 months ago
If you plan to use Gemini, be warned, here are the usual Big Tech dragons:

   Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage:

   How human reviewers improve Google AI

   To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.

cavisne · 5 months ago
Google is the best of these. You either pay per token and there is no training on your inputs, or it’s free/a small monthly fee and there is training.
greatgib · 5 months ago
And even worse:

   Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
Emphasis on "retained for up to three years" even if you delete it!!

kccqzy · 5 months ago
Well they can't delete a user's Gemini conversations because they don't know which user a particular conversation comes from.

This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.

mastodon_acc · 5 months ago
How does it compare to OpenAI and anthropic’s user data retention policy?
KoolKat23 · 5 months ago
I don't think this is the same as the AI studio and API terms. This looks like your consumer facing Gemini T&C's.
summerlight · 5 months ago
You can use a paid tier to avoid such issues. Not sure what you're expecting from "experimental" models, which are in development and need user feedback.
sauwan · 5 months ago
I'm assuming this is true of all experimental models? That's not true with their models if you're on a paid tier though, correct?
suyash · 5 months ago
More of a reason for new privacy guidelines specially for big tech and AI
mastodon_acc · 5 months ago
I mean this is pretty standard for online llms. What is Gemini doing here that openai or Anthropic aren’t already doing?
mindwok · 5 months ago
Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.

I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.

weatherlite · 5 months ago
360k tokens = how many lines of code, approximately? Also, if it's an open-source lib, are you sure there are no mentions of this bug anywhere on the web?
mindwok · 5 months ago
Not a huge library, around 32K LoC and no mention of the bug on the web - I was the first to encounter it (it’s since been fixed) unless the training data is super recent.
kungfufrog · 5 months ago
How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is... I didn't see any easy way to pass it a whole code base!
mindwok · 5 months ago
Yep! AI Studio is, I think, the only way you can actually use it right now, and AFAIK it's free.
markdog12 · 5 months ago
Interesting, I've been asking it to generate some Dart code, and it makes tons of mistakes, including lots of invalid code (static errors). When pointing out the mistakes, it thanks me and tells me it won't make it again, then makes it again on the very next prompt.
blinding-streak · 5 months ago
Open the pod bay doors Hal.

I'm sorry Dave, I'm afraid I can't do that.

ripped_britches · 5 months ago
Wow holy smokes that is exciting
nmfisher · 5 months ago
How long did it take to sift through those?