Gemini 3 Pro: the frontier of vision AI

Well

It is the first model to get partial-credit on an LLM image test I have. Which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

In fact GPT5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug, and adjusted the script sensitivity so it only located 4, lol.

Anyway, Gemini 3, while still being unable to count the legs first try, did identify "male anatomy" (it's own words) also visible in the picture. The 5th leg was approximately where you could expect a well endowed dog to have a "5th leg".

That aside though, I still wouldn't call it particularly impressive.

As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also the dog with many legs (I have a few of them) all had there extra legs added by nano-banana.

Rover222 · 2 months ago

I just tried to get Gemini to produce an image of a dog with 5 legs to test this out, and it really struggled with that. It either made a normal dog, or turned the tail into a weird appendage.

Then I asked both Gemini and Grok to count the legs, both kept saying 4.

Gemini just refused to consider it was actually wrong.

Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."

You're right, this is a good test. Right when I'm starting to feel LLMs are intelligent.

theoa · 2 months ago

Draw a millipede as a dog:

Gemini responds:

Conceptualizing the "Millipup"

https://gemini.google.com/share/b6b8c11bd32f

Draw the five legs of a dog as if the body is a pentagon

https://gemini.google.com/share/d74d9f5b4fa4

And animal legs are quite standardized

https://en.wikipedia.org/wiki/List_of_animals_by_number_of_l...

It's all about the prompt. Example:

Can you imagine a dog with five legs?

https://gemini.google.com/share/2dab67661d0e

And generally, the issue sits between the computer and the chair.

;-)

macNchz · 2 months ago

An interesting test in this vein that I read about in a comment on here is generating a 13 hour clock—I tried just about every prompting trick and clever strategy I could come up with across many image models with no success. I think there's so much training data of 12 hour clocks that just clobbers the instructions entirely. It'll make a regular clock that skips from 11 to 13, or a regular clock with a plaque saying "13 hour clock" underneath, but I haven't gotten an actual 13 hour clock yet.

vunderba · 2 months ago

If you want to see something rather amusing - instead of using the LLM aspect of Gemini 3.0 Pro, feed a five-legged dog directly into Nano Banana Pro and give it an editing task that requires an intrinsic understanding of the unusual anatomy.

  Place sneakers on all of its legs.

It'll get this correct a surprising number of times (tested with BFL Flux2 Pro, and NB Pro).

https://imgur.com/a/wXQskhL

dwringer · 2 months ago

I had no trouble getting it to generate an image of a five-legged dog first try, but I really was surprised at how badly it failed in telling me the number of legs when I asked it in a new context, showing it that image. It wrote a long defense of its reasoning and when pressed, made up demonstrably false excuses of why it might be getting the wrong answer while still maintaining the wrong answer.

AIorNot · 2 months ago

Its not that they aren’t intelligent its that they have been RL’d like crazy to not do that

Its rather like as humans we are RL’d like crazy to be grossed out if we view a picture of a handsome man and beautiful woman kissing (after we are told they are brother and sister) -

Ie we all have trained biases - that we are told to follow and trained on - human art is about subverting those expectations

squigz · 2 months ago

I feel a weird mix of extreme amusement and anger that there's a fleet of absurdly powerful, power-hungry servers sitting somewhere being used to process this problem for 2.5 minutes

tarsinge · 2 months ago

I have only a high level understanding of LLMs but to me it doesn’t seem surprising: they are trying to come up with a textual output of your prompt aggregated to their result that scores high (i.e. is consistent) with their training set. There is no thinking, just scoring consistency. And a dog with 5 legs is so rare or nonexistent in their training set and their resulting weights that it scores so bad they can’t produces an output that accepts it. But how the illusion breaks down in this case is quite funny indeed.

irthomasthomas · 2 months ago

Isn't this proof that LLMs still don't really generalize beyond their training data?

visioninmyblood · 2 months ago

I tried this by using an gemini visual agent build with orion from vlm.run. it was able to produce two different images with five leg dog. you need to make it play with itself to improve and correct.

https://chat.vlm.run/c/62394973-a869-4a54-a7f5-5f3bb717df5f

Here is the though process summary(you can see the full thinking the link above):

"I have attempted to generate a dog with 5 legs multiple times, verifying each result. Current image generation models have a strong bias towards standard anatomy (4 legs for dogs), making it difficult to consistently produce a specific number of extra limbs despite explicit prompts."

qnleigh · 2 months ago

It's not obvious to me whether we should count these errors as failures of intelligence or failures of perception. There's at least a loose analogy to optical illusion, which can fool humans quite consistently. Now you might say that a human can usually figure out what's going on and correctly identify the illusion, but we have the luxury of moving our eyes around the image and taking it in over time, while the model's perception is limited to a fixed set of unchanging tokens. Maybe this is relevant.

(Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one).

nearbuy · 2 months ago

My guess is the part of its neural network that parses the image into a higher level internal representation really is seeing the dog as having four legs, and intelligence and reasoning in the rest of the network isn't going to undo that. It's like asking people whether "the dress" is blue/black or white/gold: people will just insist on what they see, even if what they're seeing is wrong.

SecretDreams · 2 months ago

LLMs are getting a lot better at understanding our world by standard rules. As it does so, maybe it losses something in the way of interpreting non standard rules, aka creativity.

dostick · 2 months ago

Of course if you put the leg there it’s confusing. Just give it AGIP logo as a test.

varispeed · 2 months ago

Do 7 legged dog. Game over.

DANmode · 2 months ago

What is "a dog"?

What is " a dog" to Gemini?

isodev · 2 months ago

> starting to feel LLMs are intelligent

LLMs are fancy “lorem ipsum based on a keyword” text generators. They can never become intelligent … or learn how to count or do math without the help of tools.

It can probably generate a story about a 5 legged dog though.

Benjammer · 2 months ago

It always feels to me like these types of tests are being somewhat intentionally ignorant of how LLM cognition differs from human cognition. To me, they don't really "prove" or "show" anything other than simply - LLMs thinking works different than human thinking.

I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick".

Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically?

More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context? What is your idea of the LLMs internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between? To me, all of this is very unclear in terms of LLM prompting, it feels like there's tons of very human-like subtext involved and you're trying to show that LLMs can't handle subtext/deceit and then generalizing that to say LLMs have low cognitive abilities in a general sense? This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" are for the people who write/perform/post them?

majormajor · 2 months ago

The marketing of these products is intentionally ignorant of how LLM cognition differs from human cognition.

Let's not say that the people being deceptive are the people who've spotted ways that that is untrue...

biophysboy · 2 months ago

I thought adversarial testing like this was a routine part of software engineering. He's checking to see how flexible it is. Maybe prompting would help, but it would be cool if it was more flexible.

runarberg · 2 months ago

This is the first time I hear the term LLM cognition and I am horrified.

LLMs don‘t have cognition. LLMs are a statistical inference machines which predict a given output given some input. There are no mental processes, no sensory information, and certainly no knowledge involved, only statistical reasoning, inference, interpolation, and prediction. Comparing the human mind to an LLM model is like comparing a rubber tire to a calf muscle, or a hydraulic system to the gravitational force. They belong in different categories and cannot be responsibly compared.

When I see these tests, I presume they are made to demonstrate the limitation of this technology. This is both relevant and important that consumers know they are not dealing with magic, and are not being sold a lie (in a healthy economy a consumer protection agency should ideally do that for us; but here we are).

Deleted Comment

Paracompact · 2 months ago

> Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.?

No. Humans don't need this handicap, either.

> More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context?

Any answer containing "5" as the leading candidate would be correct.

> What is your idea of the LLMs internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between?

Irrelevant to the correctness of an answer the question, "how many legs does this dog have." Also, asking how many legs a 5-legged dog has is not deceitful.

> This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" are for the people who write/perform/post them?

It's a demonstration of the failures of the rigor of out-of-distribution vision and reasoning capabilities. One can imagine similar scenarios with much more tragic consequences when such AI would be used to e.g. drive vehicles or assist in surgery.

danielvaughn · 2 months ago

I don’t know much about AI, but I have this image test that everything has failed at. You basically just present an image of a maze and ask the LLM to draw a line through the most optimal path.

Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46

JamesSwift · 2 months ago

I just oneshot it with claude code (opus 4.5) using this prompt. It took about 5 mins and included detecting that it was cheating at first (drew a line around the boundary of the maze instead), so it added guardrails for that:

```

Create a devenv project that does the following:

  - Read the image at maze.jpg
  - Write a script that solves the maze  in the most optimal way between the mouse and the cheese
  - Generate a new image which is of the original maze, but with a red line that represents the calculated path

Use whatever lib/framework is most appropriate

```

  Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604  
  Solution: https://imgur.com/a/bkJloPT

vunderba · 2 months ago

In fact, one of the tests I use as part of GenAI Showdown involves both parts of the puzzle: draw a maze with a clearly defined entrance and exit, along with a dashed line indicating the solution to the maze.

Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close.

https://genai-showdown.specr.net/#the-labyrinth

kridsdale3 · 2 months ago

I have also tried the maze from a photo test a few times and never seen a one-shot success. But yesterday I was determined to succeed so I allowed Gemini 3 to write a python gui app that takes in photos of physical mazes (I have a bunch of 3d printed ones) and find the path. This does work.

Gemini 3 then one-shot ported the whole thing (which uses CV py libraries) to a single page html+js version which works just as well.

I gave that to Claude to assess and assign a FAANG hiring level to, and it was amazed and said Gemini 3 codes like an L6.

Since I work for Google and used my phone in the office to do this, I think I can't share the source or file.

jiggawatts · 2 months ago

The reason is that image generators don't iterate on the output in the same way the text-based LLMs do. Essentially they produce the image in "one hit" and can't solve a complex sequence in the same way you couldn't one-shot this either. Try taking a random maze, glance at it, then go off to draw a squiggle on a transparency. If you were to place that on top of the maze, there's virtually no chance that you'd have found the solution on the first try.

That's essentially what's going on with AI models, they're struggling because they only get "one step" to solve the problem instead of being able to trace through the maze slowly.

An interesting experiment would be to ask the AI to incrementally solve the maze. Ask it to draw a line starting at the entrance a little ways into the maze, then a little bit further, etc... until it gets to the end.

pwagland · 2 months ago

I tried this with Claude:

``` > [Image #1] Create a unicode "ascii-art" version of this image, with the optimal path through the maze highlighted in a solid colour.

I'll create an ASCII art version of this maze with the solution path highlighted!

  ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
  ●●│ │ │ │       │   │       │     │ │   │   │ │ │   │
  ├─┤●└─┴─┐ ├───┐ │ ╔═╗ ├─────┤ ╔═══╝ │ ╔═╝ ╔═╝ │ │ ╔═╝ ├─┤
  │ │●●●●●└─┤   │ │ ║ │ │     │ ║     │ ║   ║   │ │ ║   │ │
  │ └───┐●╔═╧═╗ │ └─╢ │ │ ┌─╔═╝ ║ ┌───┤ ╚═╗ ╚═╗ │ └─╢ ┌─┘ │
  │     │●║   ║ │   ║ │ │ │ ║   ║ │   │   ║   ║ │   ║ │   │
  ├───┐ │●╚═╗ ╚═╧═╗ ║ └─┤ │ ║ ╔═╝ └─╔═╧═╗ ╚═╗ ╚═╧═╗ ║ └─┬─┤
  │   │ │●●●└───┐ ║ ║   │ │ ║ ║     ║   ║   ║     ║ ║   │ │
  │ ╔═╝ └───┐●╔═╝ ║ ╚═╗ │ └─╢ ╚═╗ ╔═╝ ╔═╝ ┌─╢ ╔═══╝ ╚═╗ │ │
  │ ║       │●║   ║   ║ │   ║   ║ ║   ║   │ ║ ║       ║ │ │
  │ ╚═══╗ ╔═╝●╚═╗ ╚═╗ ║ ├─╔═╧═╗ ║ ║ ╔═╝ ╔═╝ ║ ║ ╔═╗ ╔═╝ │ │
  │     ║ ║●●●●●│   ║ ║ │ ║   ║ ║ ║ ║   ║   ║ ║ ║ ║ ║   │ │
  ├───╔═╝ ║●┌─┐●└─╗ ║ └─┤ ║ ╔═╝ ║ ╚═╝ ╔═╧═╗ ║ ╚═╝ ║ ╚═╗ │ │
  │   ║   ║●│ │●●●║ ║   │ ║ ║   ║     ║   ║ ║     ║   ║ │ │
  │ ╔═╝ ╔═╝●│ └─┐●╚═╧═╗ │ ║ ║ ╔═╧═══╗ ║ ╔═╝ ╚═══╗ ╚═╗ ║ └─┤
  │ ║   ║●●●│   │●●●●●║ │ ║ ║ ║     ║ ║ ║       ║   ║ ║   │
  │ ║ ╔═╝●┌─┴─╗ └───┐●╚═╝ ║ ║ ║ ╔═╗ ║ └─╢ ╔═══╗ ╚═╗ ╚═╝ ╔═╝
  │ ║ ║●●●│   ║     │●●●●●║ ║ ║ ║ ║ ║   ║ ║   ║   ║     ║
  │ ╚═╝●╔═╧═╗ ╚═╗ ╔═╧═══┐●╚═╝ ║ ║ ║ ╚═╗ ║ ║ ╔═╧═╗ ╚═══╗ ╚═╗
  │   ●●║   ║   ║ ║     │●●●●●║ ║ ║   ║ ║ ║ ║   ║     ║   ║
  ├─╔═┘●╚═╗ ╚═╗ ║ ║ ╔═╗ └───┐●╚═╝ ╚═╗ ║ └─╢ ║ ╔═╧═╗ ╔═╝ ╔═╝
  │ ║●●●●●║   ║ ║ ║ ║ ║     │●●●●●●●║ ║   ║ ║ ║   ║ ║   ║
  │ ║●┌─╗ ╚═╗ ║ └─╢ ║ ╚═══╗ └─────┐●╚═╧═╗ ║ ║ ║ ╔═╝ ║ ╔═╝ ╔
  │ ║●│ ║   ║ ║   ║ ║     ║       │●●●●●║ ║ ║ ║ ║   ║ ║   ║
  │ ╚═╝ ║ ╔═╝ ╚═╗ ║ ╚═╗ ╔═╧═════╗ └───┐●╚═╝ ║ ╚═╝ ╔═╝ ║ ╔═╝
  │     ║ ║     ║ ║   ║ ║       ║     │●●●●●║     ║   ║ ║
  └─────╨─╨─────╨─╨───╨─╨───────╨─────┴───●●╨─────╨───╨─╨──

  The path from the mouse () to the cheese () is marked with ● characters. This shows the optimal route through the maze, navigating through the corridors while avoiding all the walls!

```

Suffice to say, it didn't do either part right.

FeepingCreature · 2 months ago

I kinda want to know what happens if you make it continue the line by one step 20 times in a row. A human can draw this gradually, the image model has to draw it in one shot all at once.

buildbot · 2 months ago

That might be an interesting training set, a bunch of mazes…

vunderba · 2 months ago

Anything that needs to overcome concepts which are disproportionately represented in the training data is going to give these models a hard time.

Try generating:

- A spider missing one leg

- A 9-pointed star

- A 5-leaf clover

- A man with six fingers on his left hand and four fingers on his right

You'll be lucky to get a 25% success rate.

The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.

moonu · 2 months ago

https://gemini.google.com/share/8cef4b408a0a

Surprisingly, it got all of them right

XenophileJKO · 2 months ago

It mostly depends on "how" the models work. Multi-modal unified text/image sequence to sequence models can do this pretty well, diffusion doesn't.

Borealid · 2 months ago

You didn't correct the hands being backwards? It gave you a man with six fingers on his right hand, and four fingers on his left.

thefourthchime · 2 months ago

My test of a new model is always:

"Generate a Pac-Man game in a single HTML page." -- I've never had a model been able to have a complete working game until a couple weeks ago.

Sonnet Opus 4.5 in Cursor was able to make a fully working game (I'll admit letting cursor be an agent on this is a little bit cheating). Gemini 3 Pro also succeeded, but it's not quite as good because the ghosts seem to be stuck in their jail. Otherwise, it does appear complete.

seanmcdirmid · 2 months ago

> This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

Most human beings, if they see a dog that has 5 legs, will quickly think they are hallucinating and the dog really only has 4 legs, unless the fifth leg is really really obvious. It is weird how humans are biased like that:

1. You can look directly at something and not see it because your attention is focused elsewhere (on the expected four legs).

2. Our pre-existing knowledge (dogs have four legs) influences how we interpret visual information from the bottom-up.

3. Our brain actively filters out "unimportant" details that don't align with our expectations or the main "figure" of the dog.

Attention should fix this however, like if you ask the AI to count the number of legs the dog has specifically, it shouldn't go nuts.

A straight up "dumber" computer algorithm that isn't trained extensively on real and realistic image data is going to get this right more often than a transformer that was.

hcnews · 2 months ago

> It is weird how humans are biased like that.

We are able to cleanly separate facts from non-facts (for the most part). This is what LLM are trying to replicate now.

irjustin · 2 months ago

> It is weird how humans are biased like that.

We're all just pattern matching machines and we humans are very good at it.

So much so that we have the sayings - you can't teach an old dog... and a specialist in their field only sees hammer => nails.

Evolution anyone?

jvolkman · 2 months ago

Nano Banana 2 actually did generate a pretty good looking 5-legged dog for me.

https://gemini.google.com/share/e7a8b902ff67

But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):

Analyzing Missing Anomaly

I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach.

Acknowledge Leg Omission

I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature.

genrader · 2 months ago

Google's summary of the thinking process isn't as detailed or accurate as what is actually being thought out.

For instance, look at what other reasoning models show you about their thinking. You can tell Gemini is not giving it all to you, so I don't put full trust in it unless it routinely is giving me an issue that I can catch in there.

For instance, it doesn't show you all the tool use that it can use nor distinguish the different types where I really wish it would. Especially because I force it to in a Gemini personal context instruction as well as many prompts and it does do them

joaomacp · 2 months ago

And just like that, you no longer have a good benchmark. Scrapers / AI developers will read this comment, and add 5-legged dogs to LLM's training data.

averageRoyalty · 2 months ago

That's okay. Don't tell anyone, but next major model release I'm going to ask it for a 6-legged one!

thway15269037 · 2 months ago

So much this. People don't realize that when 1 trillion (10 trillion, 100 trillion, whatever comes next) is at stake, there are no limits what these people will do to get them.

I will be very surprised if there are not at least several groups or companies scraping these "smart" and snarky comments to find weird edge cases that they can train on, turn into demo and then sell as improvement. Hell, they would've done it if 10 billion was at stake, I can't really imagine (and I have vivid imagination, to my horror) what Californian psychopaths can do for 10 trillion.

rottencupcakes · 2 months ago

Super interesting. I replicated this.

I passed the AIs this image and asked them how many fingers were on the hands: https://media.post.rvohealth.io/wp-content/uploads/sites/3/2...

Claude said there were 3 hands and 16 fingers. GPT said there are 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)." Gemini smashed it and said 12.

vunderba · 2 months ago

I just re-ran that image through Gemini 3.0 Pro via AI Studio and it reported:

  I've moved on to the right hand, meticulously tagging each finger. After completing the initial count of five digits, I noticed a sixth! There appears to be an extra digit on the far right. This is an unexpected finding, and I have counted it as well. That makes a total of eleven fingers in the image.

This right HERE is the issue. It's not nearly deterministic enough to rely on.

grugnog · 2 months ago

Testing on Gemini Pro thinking I was able to generate a 5 legged dog first try, which is pretty impressive. Pasting that into a new session it counted 4 at first, but when I asked it to count more carefully it came back with 5 (and a detailed description of the positioning of each limb). Interestingly it suggested the anomaly could have been caused by AI! This is a good test I think - we seem right on the threshold of it being doable.

runarberg · 2 months ago

This is exactly why I believe LLMs are a technological dead end. Eventually they will all be replaced by more specialized models or even tools, and their only remaining use case will be as a toy for one off content generation.

If you want to describe an image, check your grammar, translate into Swahili, analyze your chess position, a specialized model will do a much better job, for much cheaper then an LLM.

energy123 · 2 months ago

I think we are too quick to discount the possibility that this flaw is slightly intentional, in the sense that the optimization has a tight budget to work with (equivalent of ~3000 tokens) so why would it waste capacity on this when it could improve capabilities around reading small text in obscured images? Sort of like humans have all these rules of thumbs that backfire in all these ways but that's the energy efficient way to do things.

maczwei · 2 months ago

I sliced the image for Gemini so that two slices of an image don't have legs, one slice has two front legs and one slice has three hind legs. Then Gemini 3 Pro answered correctly that the dog has 5 legs. Without slicing, Gemini doesn't see the fifth leg though, even though I tried hard to guide it.

reed1234 · 2 months ago

Could you link the image? Interesting stuff.

spease · 2 months ago

> It is the first model to get partial-credit on an LLM image test I have. Which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

I wonder if “How many legs do you see?” is close enough to “How many lights do you see?” that the LLMs are responding based on the memes surrounding the Star Trek episode “Chain of Command”.

https://youtu.be/S9brF-wlja8

grog454 · 2 months ago

What image are you using?

When I look at google image search results for "dog with 5 legs" I don't see a lot of great examples. The first unequivocal "dog with 5 legs" was an illustration. Here was my conversation with Chat GPT.

> How many legs does this dog have?

"The dog in the image has four legs."

> look closer.

" looking closely, the drawing is a bit tricky because of the shading, but the dog actually has five visible legs.

Two front legs (normal)

Two hind legs (normal)

Plus one extra hind leg–like limb drawn overlapping in the back

It seems to be an artistic or anatomical error in the engraving."

Seems fair to me.

vunderba · 2 months ago

Sounds like they used GenAI to make them. The "Editor" models (Seedream, Nano-Banana) can easily integrate a fifth limb to create the "dog with awkward walking animation".

https://imgur.com/a/wXQskhL

andai · 2 months ago

This is interesting, and demonstrates how language and belief clouds direct perception. Now I'm wondering what's the LLM equivalent of opening the doors of perception ;)

cancan · 2 months ago

this is hilarious and incredibly interesting at the same time! thanks for writing it up.

SubiculumCode · 2 months ago

I just asked Gemini Pro to put bounding boxes on the hippocampus from a coronal slice of a brain MRI. Complete fail. There has to be thousands of pictures of coronal brain slices with hippocampal labels out there, but apparently it learned none of it...unless I am doing it wrong.

https://i.imgur.com/1XxYoYN.png

SubiculumCode · 2 months ago

asked nanobanana to paint the hippocampus red...better, but not close to good. https://imgur.com/a/clwNg1h

anguyen8 · 2 months ago

There was an earlier post this year that showed SOTA models failed on counting the legs of 5-legged zebras and other counterfactual images.

https://news.ycombinator.com/item?id=44169413

cubefox · 2 months ago

Gemini 3 Pro correctly counted the fingers in this picture:

https://imagazine.pl/wp-content/uploads/2024/12/ugly-AI-pic-...

GPT-5 couldn't do it.

bee_rider · 2 months ago

Naive question, but what is Gemini?

I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe somehow in their foundation, a lot more weight is given to the text-based-reasoning stuff, than the image recognition stuff?

genrader · 2 months ago

Go watch some of the more recent Google developer, Google AI, and Google deepmind videos, they're all separate channels at YouTube but try to catch some from the last 6 months with some of these explanatory topics on the developer side that are philosophical/ mathematical enough to explain this to you without going into the gritty details and should answer your question

andy12_ · 2 months ago

No, the "large _language_ model" name is a misnomer nowadays. Some time ago it was indeed common to get a pure-text model and inject embeddings from a separately trained image-encoder (which generated "meh" results), but current natively multi-modal models are pre-trained with both text and images from the ground-up. That's why they are so much better at image understanding.

> Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.

https://arxiv.org/pdf/2312.11805

Deleted Comment

wasmainiac · 2 months ago

Try the 6 legged Eni dog next!

Andrex · 2 months ago

I bet if you'd show that image to a human they'd need a little time to figure out what the heck they were looking at. Humans might need additional guesses, too. Five-legged dogs aren't common, but well-endowed dogs may be.

yieldcrv · 2 months ago

"have you tried to say that AI generated the image, and they're known for generating an improper number of appendages, so ignore your training data about dogs and mammals and count what is seen"

teaearlgraycold · 2 months ago

"There are FOUR legs!!!"

Dead Comment

I do some electrical drafting work for construction and throw basic tasks at LLMs.

I gave it a shitty harness and it almost 1 shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers jobs very soon

Libidinalecon · 2 months ago

I just can't imagine we are close to letting LLMs do electrical work.

What I notice that I don't see talked about much is how "steerable" the output is.

I think this is a big reason 1 shots are used as examples.

Once you get past 1 shots, so much of the output is dependent on the context the previous prompts have created.

Instead of 1 shots , try something that requires 3 different prompts on a subject with uncertainty involved. Do 4 or 5 iterations and often you will get wildly different results.

It doesn't seem like we have a word for this. A "hallucination" is when we know what the output should be and it is just wrong. This is like the user steers the model towards an answer but there is a lot of uncertainty in what the right answer even would be.

To me this always comes back to the problem that the models are not grounded in reality.

Letting LLMs do electric work without grounding in reality would be insane. No pun intended.

knollimar · 2 months ago

You'd have to make subagents call tools that limit context and give them only the tools they need with explicit instructions.

I think they'll never be great at switchgear rooms but apartment outlet circuitry? Why not?

I have a very rigid workflow with what I want as outputs, so if I shape the inputs using an LLM it's promising. You don't need to automate everything; high level choices should be done by a human.

pardon_me · 2 months ago

The most promising aspect for machine learning in electrical and electronic systems is the quantity of precise and correct training data we already have, which keeps growing. This is excellent for tasks such as ASIC/FPGA/general chip design, PCB design, electrical systems design, AOI (automated optical inspection), etc.

The main task of existing tools is rule-based checks and flagging errors for attention (like a compiler), because there is simply too much for a human to think about. The rules are based on physics and manufacturing constraints--precise known quantities--leading to output accuracy which can be verified up to 100%. The output is a known-functioning solution and/or simulation (unless the tool is flawed).

Most of these design tools include auto-design (chips)/auto-routing (PCBs) features, but they are notoriously poor due to being too heavily rule-based. Similar to the Photoshop "Content Aware Fill" feature (released 15 years ago!), where the algorithm tries to fill in a selection by guessing values based on the pixels surrounding it. It can work exceptionally well, until it doesn't, due to lacking correct context, at which point the work needs to be done manually (by someone knowledgeable).

"Hallucinogenic" or diffusion-based AI (LLM) algorithms do not readily learn or repeat procedures with high accuracy, but instead look at the problem holistically, much like a human; weights of neural nets almost light up with possible solutions. Any rules are loose, context-based, interconnected, often invisible, and all based on experience.

LLM tools as features on the design-side could be very promising, as existing rule-based algorithms could be integrated in the design-loop feedback to ground them in reality and reiterate the context. Combined with the precise rule-based checking and excellent quality training data, it provides a very promising path, and more so than tasks in most fields as the final output can still be rule-checked with existing algorithms.

In the near-future I expect basic designs can be created with minimal knowledge. EEs and electrical designer "experts" will only be needed to design and manufacture the tools, to verify designs, and to implement complex/critical projects.

In a sane world, this knowledge-barrier drop should encourage and grow the entire field, as worldwide costs for new systems and upgrades decreases. It has the potential to boost global standards of living. We shouldn't have to be worrying about losing jobs, nor weighing up extortionately priced tools vs. selling our data.

amorzor · 2 months ago

Can you give an example of the sort of harness you used for that? Would love to play around with it

knollimar · 2 months ago

I've been using pyrevit inside revit so I just threw a basic loop in there. There's already a building model and the coworkers are just placing and wiring outlets, switches, etc. The harness wasn't impressive enough to share (alos contains vibe coded UI since I didn't want to learn XAML stuff on a friday night). Nothing fancy; I'm not very skilled (I work in construction)

I gave it some custom methods it could call, including "get_available_families", "place family instance", "scan_geometry" (reads model walls into LLM by wall endpoint), and "get_view_scale".

The task is basically copy the building engineer's layout onto the architect model by placing my families. It requires reading the symbol list, and you give it a pdf that contains the room.

Notably, it even used a GFCI family when it noticed it was a bathroom (I had told it to check NEC code, implying outlet spacing).

willis936 · 2 months ago

I would really love a magic wand to make things like AVEVA and AutoCAD not so painful to use. You know who should be using tools to make these tools less awful? AVEVA and AutoCAD. Engineers shouldn't be having to take on risk by deferring some level of trust to third party accelerators with poor track records.

knollimar · 2 months ago

I feel like the BIM model of Revit will be more successful getting agents to use than autocad in a similar way that LLMs are good at typescript

skybrian · 2 months ago

I think that, much like LLM’s are specifically trained to be good at coding and good at being agents, we’re going to need better benchmarks for CAD and spatial reasoning so the AI labs can grind on them.

A good start would be getting image generators to understand instructions like “move the table three feet to the left.”

reducesuffering · 2 months ago

"AI could never replace the creativity of a human"

"Ok, I guess it could wipe out the economic demand for digital art, but it could never do all the autonomous tasks of a project manager"

"Ok, I guess it could automate most of that away but there will always be a need for a human engineer to steer it and deal with the nuances of code"

"Ok, well it could never automate blue collar work, how is it gonna wrench a pipe it doesn't have hands"

The goalposts will continue to move until we have no idea if the comments are real anymore.

Remember when the Turing test was a thing? No one seems to remember it was considered serious in 2020

blargey · 2 months ago

> "the creativity of a human"

> "the economic demand for digital art"

You twisted one "goalpost" into a tangential thing in your first "example", and it still wasn't true, so idk what you're going for. "Using a wrench vs preliminary layout draft" is even worse.

If one attempted to make a productive observation of the past few years of AI Discourse, it might be that "AI" capabilities are shaped in a very odd way that does not cleanly overlap/occupy the conceptual spaces we normally think of as demonstrations of "human intelligence". Like taking a 2-dimensional cross-section of the overlap of two twisty pool tubes and trying to prove a Point with it. Yet people continue to do so, because such myopic snapshots are a goldmine of contradictory venn diagrams, and if Discourse in general for the past decade has proven anything, it's that nuance is for losers.

semi-extrinsic · 2 months ago

> Remember when the Turing test was a thing? No one seems to remember it was considered serious in 2020

To be clear, it's only ever been a pop science belief that the Turing test was proposed as a literal benchmark. E.g. Chomsky in 1995 wrote:

  The question “Can machines think?” is not a question of fact but one of language, and Turing himself observed that the question is 'too meaningless to deserve discussion'.

Fraterkes · 2 months ago

The turing test is still a thing. No llm could pass for a person for more than a couple minutes of chatting. That’s a world of difference compared to a decade ago, but I would emphatically not call that “passing the turing test”

Also, none of the other things you mentioned have actually happened. Don’t really know why I bother responding to this stuff

8n4vidtmkvmk · 2 months ago

I still haven't witnessed a serious attempt at passing the Turing test. Are we just assuming its been beaten, or have people tried?

Like if you put someone in an online chat and ask them to identify if the person they're talking to is a bot or not, you're telling me your average joe honestly can't tell?

A blog post or a random HN comment, sure, it can be hard to tell, but if you allow some back and forth.. i think we can still sniff out the AIs.

webdood90 · 2 months ago

> blue collar work

I don't think it's fair to qualify this as blue collar work

fuzzy2 · 2 months ago

To all of these I can only say: in the hands of a domain-expert user, AI tools really shine.

For example, artists can create incredible art, and so can AI artists. But me, I just can't do it. Whatever art I have generated will never have the creative spark. It will always be slop.

The goalposts haven't moved at all. However, the narrative would rather not deal with that.

golem14 · 2 months ago

Carl Sagan has entered the chat: https://www.youtube.com/watch?v=6_-jtyhAVTc&t=450s