Not OP, but in my experience, Jest and Playwright are so much faster that it's not worth doing much with the MCP. It's a neat toy, but it's just too slow for an LLM to try to control a browser using MCP calls.
I've used it to read authenticated pages with Chromium.
It can be run as a headless browser and convert the HTML to markdown, but I generally open Chromium, authenticate to the system, then allow the CLI agent to interact with the page.
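For illustration, a minimal Playwright sketch of that setup, assuming Chromium was launched with --remote-debugging-port=9222 (the port and URL here are placeholders):

    import { chromium } from "playwright";

    async function main() {
      // Attach to the already-running, already-authenticated Chromium.
      // Assumes it was started with: chromium --remote-debugging-port=9222
      const browser = await chromium.connectOverCDP("http://localhost:9222");
      const context = browser.contexts()[0]; // reuse the logged-in session
      const page = context.pages()[0] ?? (await context.newPage());

      await page.goto("https://example.com/dashboard"); // placeholder URL
      console.log(await page.title());

      await browser.close(); // disconnects; Chromium keeps running
    }

    main();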
This has absolutely nothing in common with a model for computer use... This uses pre-defined tools provided in Google's MCP server, nothing to do with a general model that's supposed to work with any software.
The general model is what runs in an agentic loop, deciding which of the MCP commands to use at each point to control the browser. From my experimentation, you can mix and match between the model and the tools available, even when the model was tuned to use a specific set of tools.
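Roughly, the loop looks like this. A sketch only: askModel and runMcpTool are hypothetical stand-ins for your model client and MCP client, and the tool name in the comment is made up.

    // Hypothetical shapes; the real MCP tool list and model API differ.
    type ToolCall = { name: string; args: Record<string, unknown> };
    type ModelReply = { done: boolean; call?: ToolCall; summary?: string };

    // Stubs standing in for the model client and the MCP client.
    declare function askModel(history: string[]): Promise<ModelReply>;
    declare function runMcpTool(call: ToolCall): Promise<string>;

    async function agentLoop(task: string): Promise<string> {
      const history = [`Task: ${task}`];
      for (let step = 0; step < 50; step++) { // hard cap on steps
        const reply = await askModel(history); // model picks the next tool
        if (reply.done) return reply.summary ?? "done";
        if (!reply.call) continue;
        // e.g. { name: "browser_click", args: { selector: "#submit" } }
        const observation = await runMcpTool(reply.call);
        history.push(`${reply.call.name} -> ${observation}`);
      }
      return "step budget exhausted";
    }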
I was concerned there might be sensitive info leaked in the Browserbase video at 0:58, as it shows a string of characters in the browser history:
nricy.jd t.fxrape oruy,ap. majro
Three groups of 8 characters, space-separated, followed by 5 more, for a total of 32 characters. Seemed like text from a password generator, or maybe an API key? Maybe accidentally pasted into the URL bar at one point and preserved in browser history?
I asked ChatGPT about it and it revealed:
Not a password or key — it’s a garbled search query typed with the wrong keyboard layout.
If you map the text from Dvorak → QWERTY,
nricy.jd t.fxrape oruy,ap. majro → “logitech keyboard software macos”.
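You can check the remapping yourself; a quick sketch covering the three letter rows (ignoring shift):

    // Physical key positions, QWERTY vs Dvorak (main three letter rows).
    const QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,./";
    const DVORAK = "',.pyfgcrl/=aoeuidhtns-;qjkxbmwvz";

    // Each garbled character is what Dvorak produced; find the physical
    // key that makes it, then read off what QWERTY would have produced.
    function dvorakToQwerty(text: string): string {
      return [...text]
        .map((ch) => {
          const i = DVORAK.indexOf(ch);
          return i === -1 ? ch : QWERTY[i]; // spaces etc. pass through
        })
        .join("");
    }

    console.log(dvorakToQwerty("nricy.jd t.fxrape oruy,ap. majro"));
    // -> "logitech keyboard software macos"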
Interesting that they're allowing Gemini to solve CAPTCHAs, because OpenAI's agent detects CAPTCHAs and forces user input for them despite being fully able to solve them.
Any idea how Browserbase solves CAPTCHA? Wouldn't be surprised if it sends requests to some "click farm" in a low cost location where humans solve captchas all day :\
Impressively, it also quickly passed levels 1 (checkbox) and 2 (stop sign) on http://neal.fun/not-a-robot, and got most of the way through level 3 (wiggly text).
> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.
It can definitely see color. I asked it to go to Bing and search for the two most prominent colors in the Bing background image, and it did so just fine. It seems extremely lazy, though; it prematurely reported most of the tasks I gave it as "completed" after the first or second step (navigating to the relevant website, usually).
ChatGPT regularly "forgets" it can run code, visit urls, and generate images. Once it decides it can't do something there seems to be no way to convince it otherwise, even if it did the things earlier in the same chat.
It told me that "image generation is disabled right now". So I tested in another chat and it was fine. I mentioned that in the broken conversation and it said that "it's only disabled in this chat". I went back to the message before it claimed it was disabled and resent it. It worked. I somewhat miss the days when it would just believe anything you told it, even if I was completely wrong.
Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.
Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.
I wonder how it would behave in a scenario where it has to download a file from a shady website covered in those advertisements with fake "download" buttons.
I believe it will need very capable but small VLMs that understand common user interfaces very well -- small enough to run locally -- paired with higher-level models in the cloud, to achieve human-speed interactions and beyond with reliability.
Really feels like computer use models may be vertical agent killers once they get good enough. Many knowledge work domains boil down to: use a web app, send an email. (e.g. recruiting, sales outreach)
Why do you need an agent to use a web app through the UI? Can't the agent be integrated into the web app natively? IMO, for the verticals you mentioned, the missing piece is for an agent to be able to make phone calls.
Many years ago I was sitting at a red light on a secondary road, where the primary cross road was idle. It seemed like you could solve this using a computer vision camera system that watched the primary road and when it was idle, would expedite the secondary road's green light.
This was long before computer vision was mature enough to do anything like that and I found out that instead, there are magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.
Similarly, when I look at computers, I typically want the ML/AI system to operate on structured data that is codified for computer use. But I guess the world is complicated enough and computers got fast enough that having an AI look at a computer screen and move/click a mouse makes sense.
Ironically now that computer vision is commonplace, the cameras you talk about have become increasingly popular over the years because the magnetic systems do not do a very good job of detecting cyclists and the cameras double as a congestion monitoring tool for city staff.
Sadly, most signal controllers are still using firmware that is not trajectory aware, so rather than reporting the speed and distance of an oncoming vehicle, these vision systems just emulate a magnetic loop by flipping a 0 to a 1 to indicate mere presence rather than passing along the richer data that they have.
It was my first engineering job, calibrating those inductive loops and circuit boards on I-93, just north of Boston's downtown area. Here is the photo from 2006. https://postimg.cc/zbz5JQC0
PEEK controller, 56K modem, Verizon telco lines, rodents - all included in one cabinet
I cycle a lot. Outdoors I listen to podcasts and the fact that I can say "Hey Google, go back 30sec" to relisten to something (or forward to skip ads) is very valuable to me.
Indoors I tend to cast some show or YouTube video. Often enough I want to change the YouTube video or show using voice commands. I can do this for YouTube, but the results are horrible unless I know exactly which video I want to watch. For other services it's largely not possible at all.
In a perfect world, Google would provide superb APIs for these integrations, and all app providers would integrate them and keep them up to date. But if we can bypass that and get good results across the board, I would find it very valuable.
I understand this is a very specific scenario, but it's one I would be excited about nonetheless.
Do you have a lot of dedicated cycle ways? I'm not sure I'd want to have headphones impeding my hearing anywhere I'd have to interact with cars or pedestrians while on my bike.
My town solved this at night by putting simple light sensors on the traffic lights, so as you approach you can flash your brights at it and it triggers a cycle.
Otherwise the higher-traffic road got a permanent green light at night until it saw high beams or magnetic flux from a car reaching the intersection.
The camera systems are also superior from an infrastructure maintenance perspective. You can update them with new capabilities or do re-striping without tearing up the pavement.
If I read the web page correctly, they don't actually use that as a solution for shortening a red light; IMHO that has a very high safety bar compared to the more common uses. But I'd be happy to hear this is something that Just Works in the Real World with reasonable false positive and false negative rates.
Computer use is the most important AI benchmark to watch if you're trying to forecast labor-market impact. You're right, there are much more effective ways for ML/AI systems to accomplish tasks on the computer. But they all have to be hand-crafted for each task. Solving the general case is more scalable.
Not the current benchmarks, no. The demos in this post are so slow. Between writing the prompt, waiting a long time, and checking the work, I'd just rather do it myself.
It detects whether you are approaching the intersection and at what speed, and if there is no traffic blocking you, it automatically cycles the red light so you don't have to stop at all.
I recently spent some time in a country house far enough from civilization that electric lines don't reach it. The owners could have installed some solar panels, but they opted to keep it electricity-free to disconnect from technology, or at least from electronics. They have multiple decades-old ingenious utensils that work without electricity, like a fridge that runs on propane, oil lamps, a non-electric coffee percolator, etc., and that made me wonder how many analogous devices stopped being invented because, to our current way of thinking, an electric device is the most obvious way of solving things.
> But I guess the world is complicated enough and computers got fast enough that having an AI look at a computer screen and move/click a mouse makes sense.
It's not that the world is particularly complicated here - it's just that computing is a dynamic and adversarial environment. End-user automation consuming structured data is a rare occurrence not because it's hard, but because it defeats pretty much every way people make money on the Internet. AI is succeeding now because it is able to navigate the purposefully unstructured and obtuse interfaces like a person would.
There is a lot of pretraining data available around screen recordings and mouse movements (Loom, YouTube, etc.). There is much less pretraining data available around navigating accessibility trees or DOM structures. Many use cases may also need to be image-aware (parsing document scans, looking at images), and keyboard/video/mouse-based models generalize to more applications.
I don't know the implementation details, but this is common in the county I live in (US). It's been in use for the last 3-5 years. The traffic lights adapt to current traffic patterns in most intersections and speed up the green light for roads that have cars.
It's funny, I'll sometimes scoot forward or rock my car, but I'm not sure if it working is just coincidence. Also, a lot of stop lights now have that tall white camera on top.
There are several mechanisms. The most common is (or at least was) a loop detector under the road that triggers when a vehicle is over it. Sometimes, if you're not quite over it or it's somewhat faulty, moving the car around will trigger it.
https://github.com/grantcarthew/scripts/blob/main/get-webpag...
You should check out our most recent announcement about Web Bot Auth
https://www.browserbase.com/blog/cloudflare-browserbase-pion...
It gets stuck with:
> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.
It's like it sometimes just decides it can't do that. Like a toddler.
https://g.co/gemini/share/234fb68bc9a4
Cameras are being used to detect traffic and change lights? I don't think that's happening in the USA.
which country are you referring to here?
Motorcyclists would conclude that your approach would actually work.