xnx · 2 months ago
I've had good success with the Chrome devtools MCP (https://github.com/ChromeDevTools/chrome-devtools-mcp) for browser automation with Gemini CLI, so I'm guessing this model will work even better.
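For anyone wanting to try the same setup, registering the server in Gemini CLI usually comes down to an `mcpServers` entry; this is a sketch from memory, so check the repo README for the current schema:

```json
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}
```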
arkmm · 2 months ago
What sorts of automations were you able to get working with the Chrome dev tools MCP?
odie5533 · 2 months ago
Not OP, but in my experience, Jest and Playwright are so much faster that it's not worth doing much with the MCP. It's a neat toy, but it's just too slow for an LLM to try to control a browser using MCP calls.
grantcarthew · 2 months ago
I've used it to read authenticated pages with Chromium. It can be run as a headless browser and convert the HTML to markdown, but I generally open Chromium, authenticate to the system, then allow the CLI agent to interact with the page.

https://github.com/grantcarthew/scripts/blob/main/get-webpag...
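The HTML-to-markdown step can be done without any heavy dependencies. This is not the linked script, just a minimal stdlib sketch of that conversion (handling only headings, paragraphs, and links):

```python
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    """Tiny HTML -> Markdown converter: headings, paragraphs, links only."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html: str) -> str:
    conv = MarkdownConverter()
    conv.feed(html)
    return "".join(conv.out).strip()

demo = html_to_markdown('<h1>Docs</h1><p>See <a href="https://example.com">here</a>.</p>')
# demo == '# Docs\nSee [here](https://example.com).'
```

A real fetch of an authenticated page would need the browser session for cookies, which is what the CLI agent driving Chromium provides.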

informal007 · 2 months ago
Computer-use models come from the demand for automated interaction with computers, and the Chrome DevTools MCP might be one of the core drivers.
iLoveOncall · 2 months ago
This has absolutely nothing in common with a model for computer use... This uses pre-defined tools provided in the MCP server by Google; it has nothing to do with a general model meant to work with any software.
falcor84 · 2 months ago
The general model is what runs in an agentic loop, deciding which of the MCP commands to use at each point to control the browser. From my experimentation, you can mix and match between the model and the tools available, even when the model was tuned to use a specific set of tools.
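That agentic loop is simple in outline. A minimal Python sketch with stub tools and a scripted stand-in for the model (tool names like `navigate`/`click` are illustrative, not the actual Chrome DevTools MCP tool names):

```python
def run_agent(model, tools, task):
    """Loop: the model picks the next tool call until it declares 'done'."""
    history = [f"task: {task}"]
    while True:
        action = model(history)          # model decides the next tool call
        if action["tool"] == "done":
            return history
        result = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']} -> {result}")

# Stub tools and a scripted "model" for demonstration
tools = {
    "navigate": lambda url: f"loaded {url}",
    "click": lambda selector: f"clicked {selector}",
}
script = iter([
    {"tool": "navigate", "args": {"url": "https://example.com"}},
    {"tool": "click", "args": {"selector": "#login"}},
    {"tool": "done", "args": {}},
])
model = lambda history: next(script)

history = run_agent(model, tools, "log in")
```

Swapping the stub model for a real one (and the stub tools for MCP calls) is exactly the mix-and-match described above: the loop doesn't care which model chooses the actions.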
phamilton · 2 months ago
It successfully got through the captcha at https://www.google.com/recaptcha/api2/demo
simonw · 2 months ago
Post edited: I was wrong about this. Gemini tried to solve the Google CAPTCHA but it was actually Browserbase that did the solve, notes here: https://simonwillison.net/2025/Oct/7/gemini-25-computer-use-...
dhon_ · 2 months ago
I was concerned there might be sensitive info leaked in the browserbase video at 0:58 as it shows a string of characters in the browser history:

    nricy.jd t.fxrape oruy,ap. majro
Three groups of 8 characters, space-separated, followed by 5, for a total of 32 characters. It seemed like text from a password generator, or maybe an API key? Maybe accidentally pasted into the URL bar at some point and preserved in browser history?

I asked ChatGPT about it and it revealed:

    Not a password or key — it’s a garbled search query typed with the wrong keyboard layout.
    
    If you map the text from Dvorak → QWERTY,
    nricy.jd t.fxrape oruy,ap. majro → “logitech keyboard software macos”.
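ChatGPT's claim is easy to verify mechanically: map each character from its position in the Dvorak layout back to the QWERTY key in the same physical spot. A stdlib sketch (the layout strings cover the three letter rows plus adjacent punctuation):

```python
# Same physical key positions, QWERTY labels vs. Dvorak labels
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,./"
DVORAK = "',.pyfgcrl/=aoeuidhtns-;qjkxbmwvz"

# Text came out in Dvorak, so translate Dvorak chars back to QWERTY keys
decode = str.maketrans(DVORAK, QWERTY)

decoded = "nricy.jd t.fxrape oruy,ap. majro".translate(decode)
# decoded == "logitech keyboard software macos"
```

So it really is just a search query typed while the OS keyboard layout was set to Dvorak.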

pants2 · 2 months ago
Interesting that they're allowing Gemini to solve CAPTCHAs, because OpenAI's agent detects CAPTCHAs and forces user input for them despite being fully able to solve them.
SilverSlash · 2 months ago
Any idea how Browserbase solves CAPTCHA? Wouldn't be surprised if it sends requests to some "click farm" in a low cost location where humans solve captchas all day :\
jampa · 2 months ago
The automation is powered through Browserbase, which has a captcha solver. (Whether it is automated or human, I don't know.)
peytoncasper · 2 months ago
We do not use click farms!

You should check out our most recent announcement about Web Bot Auth

https://www.browserbase.com/blog/cloudflare-browserbase-pion...

jrmann100 · 2 months ago
Impressively, it also quickly passed levels 1 (checkbox) and 2 (stop sign) on http://neal.fun/not-a-robot, and got most of the way through level 3 (wiggly text).
subarctic · 2 months ago
Now we just need something to solve captchas for us when we're browsing normally
siva7 · 2 months ago
probably because its IP is coming from Google's own subnet
asadm · 2 months ago
isn't it coming from a Browserbase container?
mohsen1 · 2 months ago
> Solve today's Wordle

It gets stuck with:

> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.

jcims · 2 months ago
It solved it in four twice for me.

It's like it sometimes just decides it can't do that. Like a toddler.

strangescript · 2 months ago
This has been the fundamental issue with the 2.5 line of models. It seems to forget parts of its system prompt and not understand where it's "located".
Havoc · 2 months ago
So I guess it’s browsing in grey scale?
daemonologist · 2 months ago
It can definitely see color - I asked it to go to bing and search for the two most prominent colors in the bing background image and it did so just fine. It seems extremely lazy though; it prematurely reported as "completed" most of the tasks I gave it after the first or second step (navigating to the relevant website, usually).
egeozcan · 2 months ago
I tested it, and Gemini seems unable to solve Wordle in grayscale.

https://g.co/gemini/share/234fb68bc9a4

apskim · 2 months ago
It actually succeeds and solves it perfectly fine despite writing all these unconfident disclaimers about itself! My screenshot: https://x.com/Skiminok/status/1975688789164237012
samth · 2 months ago
I tried this also, and it was totally garbage for me too (with a similar refusal as well as other failures).
hugh-avherald · 2 months ago
I found ChatGPT also struggled with colour detection when solving Wordle, despite my advice to use any tools available. I had to tell it the colours.
davidmurdoch · 2 months ago
ChatGPT regularly "forgets" it can run code, visit urls, and generate images. Once it decides it can't do something there seems to be no way to convince it otherwise, even if it did the things earlier in the same chat.

It told me that "image generation is disabled right now". So I tested in another chat and it was fine. I mentioned that in the broken conversation and it said that "it's only disabled in this chat". I went back to the message before it claimed it was disabled and resent it. It worked. I somewhat miss the days where it would just believe anything you told it, even if I was completely wrong.

qingcharles · 2 months ago
I tried to get GPT to play Wordle when Agent launched, but it was banned from the NYT and it had to play a knock-off for me instead.
jcims · 2 months ago
(Just using the browserbase demo)

Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.

Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.

krawcu · 2 months ago
I wonder how it would behave in a scenario where it has to download a file from a shady website covered in ads with fake "Download" buttons.
beepdyboop · 2 months ago
haha that's a great test actually
albert_e · 2 months ago
I believe it will need very capable but small VLMs that understand common user interfaces very well -- small enough to run locally -- paired with higher-level models in the cloud, to achieve human-speed interactions and beyond with reliability.
derekcheng08 · 2 months ago
Really feels like computer use models may be vertical agent killers once they get good enough. Many knowledge work domains boil down to: use a web app, send an email. (e.g. recruiting, sales outreach)
loandbehold · 2 months ago
Why do you need an agent to use a web app through the UI? Can't the agent be integrated into the web app natively? IMO, for the verticals you mentioned, the missing piece is an agent being able to make phone calls.
tgsovlerkhgsel · 2 months ago
Native integration, APIs etc. require the web app author to do something. A computer use agent using the UI doesn't.
dekhn · 2 months ago
Many years ago I was sitting at a red light on a secondary road, where the primary cross road was idle. It seemed like you could solve this using a computer vision camera system that watched the primary road and when it was idle, would expedite the secondary road's green light.

This was long before computer vision was mature enough to do anything like that and I found out that instead, there are magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.

Similarly, when I look at computers, I typically want the ML/AI system to operate on a structured data that is codified for computer use. But I guess the world is complicated enough and computers got fast enough that having an AI look at a computer screen and move/click a mouse makes sense.

chrisfosterelli · 2 months ago
Ironically now that computer vision is commonplace, the cameras you talk about have become increasingly popular over the years because the magnetic systems do not do a very good job of detecting cyclists and the cameras double as a congestion monitoring tool for city staff.
__MatrixMan__ · 2 months ago
Sadly, most signal controllers are still using firmware that is not trajectory aware, so rather than reporting the speed and distance of an oncoming vehicle, these vision systems just emulate a magnetic loop by flipping a 0 to a 1 to indicate mere presence rather than passing along the richer data that they have.
apwell23 · 2 months ago
> the cameras you talk about have become increasingly popular over the years

Cameras are being used to detect traffic and change lights? I don't think that's happening in the USA.

which country are you referring to here?

y0eswddl · 2 months ago
and soon/now triple as surveillance.
pavelstoev · 2 months ago
It was my first engineering job, calibrating those inductive loops and circuit boards on I-93, just north of Boston's downtown area. Here is the photo from 2006. https://postimg.cc/zbz5JQC0

PEEK controller, 56K modem, Verizon telco lines, rodents - all included in one cabinet

dktp · 2 months ago
I cycle a lot. Outdoors I listen to podcasts and the fact that I can say "Hey Google, go back 30sec" to relisten to something (or forward to skip ads) is very valuable to me.

Indoors I tend to cast some show or youtube video. Often enough I want to change the Youtube video or show using voice commands - I can do this for Youtube, but results are horrible unless I know exactly which video I want to watch. For other services it's largely not possible at all

In a perfect world Google would provide superb APIs for these integrations and all app providers would integrate it and keep it up to date. But if we can bypass that and get good results across the board - I would find it very valuable

I understand this is a very specific scenario. But one I would be excited about nonetheless

Macha · 2 months ago
Do you have a lot of dedicated cycle ways? I'm not sure I'd want to have headphones impeding my hearing anywhere I'd have to interact with cars or pedestrians while on my bike.
nerdsniper · 2 months ago
My town solved this at night by putting simple light sensors on the traffic lights, so as you approach you can flash your brights at it and it triggers a cycle.

Otherwise the higher traffic road got a permanent green light at nighttime until it saw high beams or magnetic flux from a car reaching the intersection.

trenchpilgrim · 2 months ago
FWIW those type of traffic cameras are in common use. https://www.milesight.com/company/blog/types-of-traffic-came...
jlhawn · 2 months ago
The camera systems are also superior from an infrastructure maintenance perspective. You can update them with new capabilities or do re-striping without tearing up the pavement.
dekhn · 2 months ago
From reading the web page, they don't actually use these as a solution for shortening a red - IMHO that has a very high safety bar compared to the more common uses. But I'd be happy to hear it's something that Just Works in the real world with reasonable false-positive and false-negative rates.
alach11 · 2 months ago
Computer use is the most important AI benchmark to watch if you're trying to forecast labor-market impact. You're right, there are much more effective ways for ML/AI systems to accomplish tasks on the computer. But they all have to be hand-crafted for each task. Solving the general case is more scalable.
poopiokaka · 2 months ago
Not the current benchmarks, no. The demos in this post are so slow. Between writing the prompt, waiting a long time and checking the work I’d just rather do it myself.
seer · 2 months ago
In some European countries all of this is commonplace - check out the not just bikes video on the subject - https://youtu.be/knbVWXzL4-4?si=NLTMgHiVcgyPv6dc

It detects whether you are approaching the intersection and at what speed, and if there is no traffic blocking you, it automatically cycles the red lights so you don't have to stop at all.

dgs_sgd · 2 months ago
It’s funny that you used traffic signals as an example of overcomplicating a problem with AI because there turns out to be a YC funded startup making AI powered traffic lights: https://www.ycombinator.com/companies/roundabout-technologie...
MrToadMan · 2 months ago
And even funnier in that context: it’s called ‘roundabout technologies’.
stronglikedan · 2 months ago
> I concluded that my approach was just far too complicated and expensive.

Motorcyclists would conclude that your approach would actually work.

elboru · 2 months ago
I recently spent some time in a country house far enough from civilization that electric lines don't reach. The owners could have installed solar panels, but they opted to keep it electricity-free to disconnect from technology, or at least from electronics. They have ingenious, decades-old utensils that work without electricity: a fridge that runs on propane, oil lamps, a non-electric coffee percolator, etc. That made me wonder how many analogous devices stopped being invented because an electric device is now the most obvious way of solving things.
TeMPOraL · 2 months ago
> But I guess the world is complicated enough and computers got fast enough that having an AI look at a computer screen and move/click a mouse makes sense.

It's not that the world is particularly complicated here - it's just that computing is a dynamic and adversarial environment. End-user automation consuming structured data is a rare occurrence not because it's hard, but because it defeats pretty much every way people make money on the Internet. AI is succeeding now because it is able to navigate the purposefully unstructured and obtuse interfaces like a person would.

avereveard · 2 months ago
And the race is not over yet; adversaries of automation will find ways to block this latest approach too, in the name of monetization.
yunyu · 2 months ago
There is a lot of pretraining data available around screen recordings and mouse movements (Loom, YouTube, etc.). There is much less pretraining data available around navigating accessibility trees or DOM structures. Many use cases also need to be image-aware (document scan parsing, looking at images), and keyboard/video/mouse-based models generalize to more applications.
rirze · 2 months ago
I don't know the implementation details, but this is common in the county I live in (US). It's been in use for the last 3-5 years. The traffic lights adapt to current traffic patterns in most intersections and speed up the green light for roads that have cars.
ge96 · 2 months ago
It's funny I'll sometimes scoot forward/rock my car but I'm not sure if it's just coincidence. Also a lot of stop lights now have that tall white camera on top.
netghost · 2 months ago
There are several mechanisms. The most common is (or at least was) a loop detector under the road that triggers when a vehicle is over it. Sometimes, if you're not quite over it or it's somewhat faulty, it won't trigger.
bozhark · 2 months ago
Like flashing lights for the first responders sensor
Spooky23 · 2 months ago
Sometimes the rocking helps with a ground loop that isn’t working well.
sagarm · 2 months ago
Robotic process automation isn't new.
VirgilShelton · 2 months ago
The best thing about being nerds like we are is we can just ignore this product since it's not for us.