isx726552 · 7 months ago
> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.

> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.

Yeah this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.

simonw · 7 months ago
Honestly, if my stupid pelican riding a bicycle benchmark becomes influential enough that AI labs waste their time optimizing for it and produce really beautiful pelican illustrations I will consider that a huge personal win.
benmathes · 6 months ago
"personal" doing a lot of work there :-)

(And I'd be envious of your impact, of course)

Choco31415 · 7 months ago
Just tried that canard on GPT-4o and it failed:

"The word "strawberry" contains 2 letter r’s."

belter · 7 months ago
I tried

strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three

strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four

stawberrry -> DeepSeek, GeminiPro all correctly said three

ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. It even provided this useful breakdown to let me know :-)

Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's

It then asked if I meant "strawberry" instead, claiming that one has 2 r's...
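For reference, the ground truth is trivial to check in ordinary code, which is part of what makes the canard funny:

```
for word in ("strawberry", "strawberrry", "stawberrry"):
    print(word, word.count("r"))
# strawberry 3, strawberrry 4, stawberrry 3
```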

MattRix · 7 months ago
This is why things like the ARC Prize are better ways of approaching this: https://arcprize.org
whiplash451 · 7 months ago
Well, ARC-1 did not end well for the competitors of tech giants and it’s very unclear that ARC-2 won’t follow the same trajectory.
lofaszvanitt · 7 months ago
You push SHA-512 hashes of things to a GitHub repo along with a short sentence:

x8 version: still shit . . x15 version: we are closing in, but overall a shit experience :D

This way they won't know what to improve upon. Of course they can buy access. ;P

When they finally solve your problem you can reveal what the benchmark was.
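A minimal sketch of that commit-then-reveal idea, assuming a hypothetical private prompt and log file (nothing here beyond the general scheme comes from the comment):

```
import hashlib

# Hypothetical private benchmark prompt - the text itself never leaves your machine.
secret_prompt = "Generate an SVG of a pelican riding a bicycle while juggling fish"

digest = hashlib.sha512(secret_prompt.encode("utf-8")).hexdigest()

# Commit only the digest plus a vague progress note to the public repo.
with open("benchmark.log", "a") as f:
    f.write(f"{digest}  x15 version: closing in, still rough\n")

# Later, publishing secret_prompt lets anyone verify it matches the committed hash.
assert hashlib.sha512(secret_prompt.encode("utf-8")).hexdigest() == digest
```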

adrian17 · 7 months ago
> This was one of the most successful product launches of all time. They signed up 100 million new user accounts in a week! They had a single hour where they signed up a million new accounts, as this thing kept on going viral again and again and again.

Awkwardly, I had never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline Stable Diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to miss or dismiss some big release.

haiku2077 · 7 months ago
Congratulations, you are almost fully unplugged from social media. This product launch was a huge mainstream event; for a few days GPT generated images completely dominated mainstream social media.
sigmoid10 · 7 months ago
If you primarily consume text-based social media (HN, reddit with legacy UI) then it's kind of easy to not notice all the new kinds of image infographics and comics that now completely flood places like instagram or linkedin.
derwiki · 7 months ago
Not sure if this is sarcasm or sincere, but I will take it as sincere haha. I came back to work from parental leave and everyone had that same Studio Ghiblized image as their Slack photo, and I had no idea why. It turns out you really can unplug from social media and not miss anything of value: if it’s a big enough deal you will find out from another channel.
Semaphor · 7 months ago
Facebook, discord, reddit, HN. Hadn’t heard of it either. But for FB, Reddit, and Discord I strictly curate what I see.
azinman2 · 7 months ago
Except this went very mainstream. Lots of "turn myself into a muppet", "what is the human equivalent of my dog", etc. TikTok is all over this.

It really is incredible.

thierrydamiba · 7 months ago
The big trend was around the ghiblification of images. Those images were everywhere for a period of time.

MattRix · 7 months ago
To be clear: they already had image generation in ChatGPT, but this was a MUCH better one than what they had previously. Even for you with your stable diffusion app, it would be a significant upgrade. Not just because of image quality, but because it can actually generate coherent images and follow instructions.
MIC132 · 7 months ago
As impressive as it is, for some uses it still is worse than a local SD model. It will refuse to generate named anime characters (because of copyright, or because it just doesn't know them, even not particularly obscure ones) for example. Or obviously anything even remotely spicy. As someone who mostly uses image generation to amuse myself (and not to post it, where copyright might matter) it's honestly somewhat disappointing. But I don't expect any of the major AI companies to release anything without excessive guardrails.
bufferoverflow · 7 months ago
Have you missed how everyone was Ghiblifying everything?
adrian17 · 7 months ago
I saw that, I just didn't connect it with newly added multimodal image generation. I knew variations of style transfer (or LoRA for SD) were possible for years, so I assumed it exploded in popularity purely as a meme, not due to OpenAI making it much more accessible.

Again, I was aware that they added image generation, just not how much of a deal it turned out to be. Think of it like me occasionally noticing merchandise and TV trailers for a new movie without realizing it became the new worldwide box office #1.

andrepd · 7 months ago
Oh you mean the trend of the day on the social media monoculture? I don't take that as an indicator of any significance.
nathan_phoenix · 7 months ago
My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.

You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...

Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
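A minimal sketch of that approach, with the model call and the judge abstracted as caller-supplied functions (the names here are illustrative, not from the talk):

```
from statistics import mean, stdev
from typing import Callable

def compare_models(
    models: list[str],
    prompt: str,
    generate: Callable[[str, str], str],  # (model, prompt) -> SVG text
    score: Callable[[str], float],        # SVG text -> judge rating
    n: int = 10,
) -> dict[str, tuple[float, float]]:
    """Score each model on n independent samples rather than a single draw."""
    results = {}
    for model in models:
        scores = [score(generate(model, prompt)) for _ in range(n)]
        results[model] = (mean(scores), stdev(scores))  # report spread, not just the mean
    return results
```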

simonw · 7 months ago
It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.

I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.

(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)

I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
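A rough sketch of that two-stage protocol, with the generation, best-of-ten picker, and judge panel all abstracted behind callables (these names are illustrative, not part of Simon's actual tooling):

```
from collections import Counter
from itertools import combinations
from typing import Callable

def run_tournament(
    models: list[str],
    prompt: str,
    generate: Callable[[str, str], str],      # (model, prompt) -> SVG
    pick_best: Callable[[list[str]], int],    # vision model returns index of the best SVG
    judges: list[Callable[[str, str], int]],  # each judge returns 0 or 1 for the preferred SVG
    samples: int = 10,
) -> Counter:
    # Stage 1: each model generates several candidates; a vision model picks its champion.
    champions = {}
    for model in models:
        candidates = [generate(model, prompt) for _ in range(samples)]
        champions[model] = candidates[pick_best(candidates)]

    # Stage 2: round-robin head-to-heads, each decided by a majority vote of the judge panel.
    wins = Counter()
    for a, b in combinations(models, 2):
        votes = [judge(champions[a], champions[b]) for judge in judges]
        wins[a if votes.count(0) > votes.count(1) else b] += 1
    return wins
```

With three judges there are no ties, and logging the individual votes per matchup would give the judge-disagreement data for free.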

demosthanos · 7 months ago
I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.

Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...

fzzzy · 7 months ago
Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what. That's the experimental protocol. Running things a bunch of times and cherry picking the best ones adds human bias, and complicates the steps.
Breza · 6 months ago
I'd be really interested in evaluating the evaluations of different models. At work, I maintain our internal LLM benchmarks for content generation. We've always used human raters from MTurk, and the Elo rankings generally match what you'd expect. I'm looking at our options for having LLMs do the evaluating.

In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
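The Elo update itself is small enough to sketch inline (standard Elo with K=32, not tied to any particular rating library); running the same matchups through each evaluator and recomputing ratings would show how much the rankings depend on who is judging:

```
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Update two Elo ratings after a single head-to-head matchup."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b
```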

dilap · 7 months ago
Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!
ontouchstart · 7 months ago
Very nice talk, accessible to the general public and to AI agents as well.

Any concerns that open source "AI celebrity talks" like yours could be used in contexts that would allow LLM models to optimize their market share in ways we can't imagine yet?

Your talk might influence the funding of AI startups.

#butterflyEffect

planb · 7 months ago
And with a sample that has become increasingly well known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM's ability to estimate what's considered a good "pelican on a bike".
criddell · 7 months ago
And that’s why he says he’s going to have to find a new benchmark.
viraptor · 7 months ago
Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.

I actually don't think I've seen a single correct svg drawing for that prompt.

cyanydeez · 7 months ago
So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.

Call it wikipediaslop.org

puttycat · 7 months ago
You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work discretely like humans.

In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.

ben_w · 7 months ago
> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/

cyanydeez · 7 months ago
Humans absolutely do not work discretely.
bufferoverflow · 7 months ago
> work discretely like humans

What kind of humans are you surrounded by?

Ask any human to write 3 sentences about a specific topic. Then ask them the same exact question next day. They will not write the same 3 sentences.

mooreds · 7 months ago
My biggest gripe is that he outsourced evaluation of the pelicans to another LLM.

I get it was way easier to do and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.

Other ways:

* wisdom of the crowds (have people vote on it)

* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)

* wisdom of the LLMs (use more than one LLM)

Would have been neat to see what the human consensus was and if it differed from the LLM consensus

Anyway, great talk!

zahlman · 7 months ago
It would have been interesting to see if the LLM that Claude judged worst would have attempted to justify itself....
timewizard · 7 months ago
My biggest gripe is he didn't include a picture of an actual pelican.

https://www.google.com/search?q=pelican&udm=2

The "closest pelican" is not even close.

qeternity · 7 months ago
I think you mean non-deterministic, instead of probabilistic.

And there is no reason that these models need to be non-deterministic.

skybrian · 7 months ago
A deterministic algorithm can still be unpredictable in a sense. In the extreme case, a procedural generator (like in Minecraft) is deterministic given a seed, but you will still have trouble predicting what you get if you change the seed, because internally it uses a (pseudo-)random number generator.

So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
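A toy illustration of that point (plain Python, nothing to do with Minecraft's actual generator): the function below is fully deterministic for a given seed, yet the relationship between the seed and the output is practically impossible to predict.

```
import random

def terrain(seed: int, width: int = 8) -> list[int]:
    """Deterministic toy 'world generator': the same seed always yields the same heights."""
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(width)]

print(terrain(42))  # identical on every run
print(terrain(43))  # changing the seed by 1 gives an unrelated-looking result
```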

rvz · 7 months ago
> I think you mean non-deterministic, instead of probabilistic.

My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".

zurichisstained · 7 months ago
Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:

```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
]
```

But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
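(For anyone who wants to hear such a structure without wiring up Web Audio, here's a rough stdlib-Python equivalent that renders it to a WAV file; the ~120 bpm tempo mapping is an assumption, not from the original post.)

```
import math
import struct
import wave

SAMPLE_RATE = 44100
DURATIONS = {"quarter": 0.5, "half": 1.0, "triplet": 0.5 / 3}  # seconds at an assumed ~120 bpm

# Same shape as the JS structure above; freq 0 means a rest.
melody = [
    {"freq": 261.63, "duration": "quarter"},  # C4
    {"freq": 0, "duration": "triplet"},       # rest
    {"freq": 293.66, "duration": "triplet"},  # D4
    {"freq": 0, "duration": "triplet"},       # rest
    {"freq": 329.63, "duration": "half"},     # E4
]

samples = []
for note in melody:
    for i in range(int(SAMPLE_RATE * DURATIONS[note["duration"]])):
        value = math.sin(2 * math.pi * note["freq"] * i / SAMPLE_RATE)
        samples.append(int(value * 32767 * 0.3))  # scale down to avoid clipping

with wave.open("melody.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(SAMPLE_RATE)
    out.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```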

It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.

I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).

https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo

https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7

https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro

Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.

(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m

ojosilva · 7 months ago
Drawbacks of using a pelican-on-a-bicycle SVG: it's a very open-ended prompt with no specific criteria to judge, and lately the SVGs all start to look similar, or at least like they accomplished the same non-goals (there's a pelican, there's a bicycle, and I'm not sure whether its feet should be on the saddle or on the pedals), so it's hard to agree on which is better. And, certainly, with an LLM as the judge, the entire game becomes double-hinged and who knows what to think.

Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.

Side note: I'd really like to see the Language Benchmarks Game become a prompt-based languages × models benchmark game, so we could say model X excels at Python fasta, etc. Although then the risk is that, again, it becomes training data and the whole thing rigs itself.

dr_kretyn · 7 months ago
I'm slightly confused by your example. What's the actual prompt? Is your expectation that a text model is going to know how to perform the exact song in audio?
zurichisstained · 7 months ago
Ohhh absolutely not, that would be pretty wild - I just wanted to see if it could understand musical notation enough to come up with the correct melody.

I know there are far better ways to do gen AI with music, this was just a joke prompt that worked far better than I expected.

My naive guess is all of the guitar tabs and signal processing info it's trained on gives it the ability to do stuff like this (albeit not very well).

bredren · 7 months ago
Great writeup.

This measure of LLM capability could be extended by taking it into the 3D domain.

That is, having the model write Python code for Blender, then running blender in headless mode behind an API.

The talk hints at this but one shot prompting likely won’t be a broad enough measurement of capability by this time next year. (Or perhaps now, even)

So the test could also include an agentic portion that includes consultation of the latest blender documentation or even use of a search engine for blog entries detailing syntax and technique.

For multimodal input processing, it could take into account a particular photo of a pelican as the test subject.

For usability, the objects can be converted to iOS’s native 3d format that can be viewed in mobile safari.

I built this workflow, including a service for Blender, as an initial test of what was possible in October of 2022. It took post-processing for common syntax errors back then, but I'd imagine the newer LLMs make those mistakes less often now.
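The headless part is straightforward to sketch; `--background` and `--python` are standard Blender CLI flags, while everything else here (function name, timeout) is just illustrative, not the commenter's actual service:

```
import subprocess
import tempfile

def run_blender_headless(generated_code: str, timeout: int = 300) -> int:
    """Write LLM-generated Blender Python to a temp file and execute it without a GUI."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        script_path = f.name
    # --background: no UI; --python: run the given script inside Blender.
    result = subprocess.run(
        ["blender", "--background", "--python", script_path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode  # the generated script decides what it renders or exports
```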

joshstrange · 7 months ago
I really enjoy Simon’s work in this space. I’ve read almost every blog post they’ve posted on this and I love seeing them poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely all without trying to do too much by themselves.

And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.

Thank you Simon!

alanmoraes · 6 months ago
I also like what he writes and the way he does it.
blackhaj7 · 7 months ago
Same sentiment!
dotemacs · 7 months ago
The same here.

Because of him, I installed an RSS reader so that I don't miss any of his posts. And I know that he shares the same ones across Twitter, Mastodon & Bsky...

franze · 7 months ago
ramesh31 · 7 months ago
Single shot?
franze · 7 months ago
2-shot: the first one just generated the SVG, not the shareable HTML page around it. In the second go it also reworked the SVG, since I did not forbid it from doing so.
anon373839 · 7 months ago
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE which is unlike anything that’s come before (in terms of capability and speed on consumer hardware).
simonw · 7 months ago
Omitting Qwen 3 is my great regret about this talk. Honestly I only realized I had missed it after I had delivered the talk!

It's one of my favorite local models right now, I'm not sure how I missed it when I was reviewing my highlights of the last six months.

Maxious · 7 months ago
Cut for time - qwen3 was pelican tested too https://simonwillison.net/2025/Apr/29/qwen-3/