Not OP, but in my experience, Jest and Playwright are so much faster that it's not worth doing much with the MCP. It's a neat toy, but it's just too slow for an LLM to try to control a browser using MCP calls.
I've used it to read authenticated pages with Chromium.
It can be run as a headless browser and convert the HTML to markdown, but I generally open Chromium, authenticate to the system, then allow the CLI agent to interact with the page.
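For illustration, a minimal Playwright sketch of that setup, assuming Chromium was launched with --remote-debugging-port=9222 (the port and URL here are placeholders):

    import { chromium } from "playwright";

    async function main() {
      // Attach to the already-running, already-authenticated Chromium.
      // Assumes it was started with: chromium --remote-debugging-port=9222
      const browser = await chromium.connectOverCDP("http://localhost:9222");
      const context = browser.contexts()[0]; // reuse the logged-in session
      const page = context.pages()[0] ?? (await context.newPage());

      await page.goto("https://example.com/dashboard"); // placeholder URL
      console.log(await page.title());

      await browser.close(); // disconnects; Chromium keeps running
    }

    main();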
This has absolutely nothing in common with a model for computer use... This uses pre-defined tools provided in Google's MCP server, nothing to do with a general model that's supposed to work with any software.
The general model is what runs in an agentic loop, deciding which of the MCP commands to use at each point to control the browser. From my experimentation, you can mix and match between the model and the tools available, even when the model was tuned to use a specific set of tools.
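Roughly, the loop looks like this. A sketch only: askModel and runMcpTool are hypothetical stand-ins for your model client and MCP client, and the tool name in the comment is made up.

    // Hypothetical shapes; the real MCP tool list and model API differ.
    type ToolCall = { name: string; args: Record<string, unknown> };
    type ModelReply = { done: boolean; call?: ToolCall; summary?: string };

    // Stubs standing in for the model client and the MCP client.
    declare function askModel(history: string[]): Promise<ModelReply>;
    declare function runMcpTool(call: ToolCall): Promise<string>;

    async function agentLoop(task: string): Promise<string> {
      const history = [`Task: ${task}`];
      for (let step = 0; step < 50; step++) { // hard cap on steps
        const reply = await askModel(history); // model picks the next tool
        if (reply.done) return reply.summary ?? "done";
        if (!reply.call) continue;
        // e.g. { name: "browser_click", args: { selector: "#submit" } }
        const observation = await runMcpTool(reply.call);
        history.push(`${reply.call.name} -> ${observation}`);
      }
      return "step budget exhausted";
    }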
I was concerned there might be sensitive info leaked in the Browserbase video at 0:58, as it shows a string of characters in the browser history:
nricy.jd t.fxrape oruy,ap. majro
Three groups of 8 characters, space-separated, followed by 5 more, for a total of 32 characters. Seemed like text from a password generator, or maybe an API key? Maybe accidentally pasted into the URL bar at one point and preserved in browser history?
I asked ChatGPT about it and it revealed:
Not a password or key — it’s a garbled search query typed with the wrong keyboard layout.
If you map the text from Dvorak → QWERTY,
nricy.jd t.fxrape oruy,ap. majro → “logitech keyboard software macos”.
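You can check the remapping yourself; a quick sketch covering the three letter rows (ignoring shift):

    // Physical key positions, QWERTY vs Dvorak (main three letter rows).
    const QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,./";
    const DVORAK = "',.pyfgcrl/=aoeuidhtns-;qjkxbmwvz";

    // Each garbled character is what Dvorak produced; find the physical
    // key that makes it, then read off what QWERTY would have produced.
    function dvorakToQwerty(text: string): string {
      return [...text]
        .map((ch) => {
          const i = DVORAK.indexOf(ch);
          return i === -1 ? ch : QWERTY[i]; // spaces etc. pass through
        })
        .join("");
    }

    console.log(dvorakToQwerty("nricy.jd t.fxrape oruy,ap. majro"));
    // -> "logitech keyboard software macos"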
Interesting that they're allowing Gemini to solve CAPTCHAs, because OpenAI's agent detects CAPTCHAs and forces user input for them despite being fully able to solve them.
Any idea how Browserbase solves CAPTCHA? Wouldn't be surprised if it sends requests to some "click farm" in a low cost location where humans solve captchas all day :\
Impressively, it also quickly passed levels 1 (checkbox) and 2 (stop sign) on http://neal.fun/not-a-robot, and got most of the way through level 3 (wiggly text).
> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.
It can definitely see color. I asked it to go to Bing and search for the two most prominent colors in the Bing background image, and it did so just fine. It seems extremely lazy, though; it prematurely reported most of the tasks I gave it as "completed" after the first or second step (navigating to the relevant website, usually).
ChatGPT regularly "forgets" it can run code, visit urls, and generate images. Once it decides it can't do something there seems to be no way to convince it otherwise, even if it did the things earlier in the same chat.
It told me that "image generation is disabled right now". So I tested in another chat and it was fine. I mentioned that in the broken conversation and it said that "it's only disabled in this chat". I went back to the message before it claimed it was disabled and resent it. It worked. I somewhat miss the days when it would just believe anything you told it, even if I was completely wrong.
Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.
Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.
I wonder how it would behave in a scenario where it has to download a file from a shady website covered in those advertisements with fake "download" buttons.
I believe it will need very capable but small VLMs that understand common user interfaces very well -- small enough to run locally -- paired with higher-level models in the cloud, to achieve human-speed interactions and beyond with reliability.
Really feels like computer use models may be vertical agent killers once they get good enough. Many knowledge work domains boil down to: use a web app, send an email. (e.g. recruiting, sales outreach)
Why do you need an agent to use a web app through the UI? Can't the agent be integrated into the web app natively? IMO, for the verticals you mentioned, the missing piece is for an agent to be able to make phone calls.
Many years ago I was sitting at a red light on a secondary road, where the primary cross road was idle. It seemed like you could solve this using a computer vision camera system that watched the primary road and when it was idle, would expedite the secondary road's green light.
This was long before computer vision was mature enough to do anything like that and I found out that instead, there are magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.
Similarly, when I look at computers, I typically want the ML/AI system to operate on structured data that is codified for computer use. But I guess the world is complicated enough and computers got fast enough that having an AI look at a computer screen and move/click a mouse makes sense.
Ironically now that computer vision is commonplace, the cameras you talk about have become increasingly popular over the years because the magnetic systems do not do a very good job of detecting cyclists and the cameras double as a congestion monitoring tool for city staff.
Sadly, most signal controllers are still using firmware that is not trajectory aware, so rather than reporting the speed and distance of an oncoming vehicle, these vision systems just emulate a magnetic loop by flipping a 0 to a 1 to indicate mere presence rather than passing along the richer data that they have.
It was my first engineering job, calibrating those inductive loops and circuit boards on I-93, just north of Boston's downtown area. Here is the photo from 2006. https://postimg.cc/zbz5JQC0
PEEK controller, 56K modem, Verizon telco lines, rodents - all included in one cabinet
I cycle a lot. Outdoors I listen to podcasts and the fact that I can say "Hey Google, go back 30sec" to relisten to something (or forward to skip ads) is very valuable to me.
Indoors I tend to cast some show or YouTube video. Often enough I want to change the YouTube video or show using voice commands. I can do this for YouTube, but the results are horrible unless I know exactly which video I want to watch. For other services it's largely not possible at all.
In a perfect world, Google would provide superb APIs for these integrations, and all app providers would integrate them and keep them up to date. But if we can bypass that and get good results across the board, I would find it very valuable.
I understand this is a very specific scenario, but it's one I would be excited about nonetheless.
Do you have a lot of dedicated cycle ways? I'm not sure I'd want to have headphones impeding my hearing anywhere I'd have to interact with cars or pedestrians while on my bike.
My town solved this at night by putting simple light sensors on the traffic lights, so as you approach you can flash your brights at it and it triggers a cycle.
Otherwise the higher-traffic road got a permanent green light at night until it saw high beams or magnetic flux from a car reaching the intersection.
The camera systems are also superior from an infrastructure maintenance perspective. You can update them with new capabilities or do re-striping without tearing up the pavement.
If I read the web page correctly, they don't actually use that as a solution for shortening a red light; IMHO that has a very high safety bar compared to the more common uses. But I'd be happy to hear this is something that Just Works in the Real World with reasonable false positive and false negative rates.
Computer use is the most important AI benchmark to watch if you're trying to forecast labor-market impact. You're right, there are much more effective ways for ML/AI systems to accomplish tasks on the computer. But they all have to be hand-crafted for each task. Solving the general case is more scalable.
Not the current benchmarks, no. The demos in this post are so slow. Between writing the prompt, waiting a long time, and checking the work, I'd just rather do it myself.
It detects whether you are approaching the intersection and at what speed, and if there is no traffic blocking you, it automatically cycles the red light so you don't have to stop at all.
I recently spent some time in a country house far enough from civilization that electric lines don't reach it. The owners could have installed some solar panels, but they opted to keep it electricity-free to disconnect from technology, or at least from electronics. They have multiple decades-old ingenious utensils that work without electricity, like a fridge that runs on propane, oil lamps, a non-electric coffee percolator, etc., and that made me wonder how many analogous devices stopped being invented because, to our current way of thinking, an electric device is the most obvious way of solving things.
> But I guess the world is complicated enough and computers got fast enough that having an AI look at a computer screen and move/click a mouse makes sense.
It's not that the world is particularly complicated here - it's just that computing is a dynamic and adversarial environment. End-user automation consuming structured data is a rare occurrence not because it's hard, but because it defeats pretty much every way people make money on the Internet. AI is succeeding now because it is able to navigate the purposefully unstructured and obtuse interfaces like a person would.
There is a lot of pretraining data available around screen recordings and mouse movements (Loom, YouTube, etc.). There is much less pretraining data available around navigating accessibility trees or DOM structures. Many use cases may also need to be image-aware (parsing document scans, looking at images), and keyboard/video/mouse-based models generalize to more applications.
I don't know the implementation details, but this is common in the county I live in (US). It's been in use for the last 3-5 years. The traffic lights adapt to current traffic patterns in most intersections and speed up the green light for roads that have cars.
It's funny, I'll sometimes scoot forward or rock my car, but I'm not sure if it working is just coincidence. Also, a lot of stop lights now have that tall white camera on top.
There are several mechanisms. The most common is (or at least was) a loop detector under the road that triggers when a vehicle is over it. Sometimes, if you're not quite over it or it's somewhat faulty, moving the car around will trigger it.
https://github.com/grantcarthew/scripts/blob/main/get-webpag...
You should check out our most recent announcement about Web Bot Auth
https://www.browserbase.com/blog/cloudflare-browserbase-pion...
It gets stuck with:
> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.
It's like it sometimes just decides it can't do that. Like a toddler.
https://g.co/gemini/share/234fb68bc9a4
Cameras are being used to detect traffic and change lights? I don't think that's happening in the USA.
which country are you referring to here?
Motorcyclists would conclude that your approach would actually work.