Over the course of experimenting with pruned HTML, accessibility trees, and other perception systems for web agents, we've iterated on Tarsier's components to maximize downstream agent/codegen performance.
Here's the Tarsier pipeline in a nutshell:
1. tag interactable elements with IDs for the LLM to act upon & grab a full-sized webpage screenshot
2. for text-only LLMs, run OCR on the screenshot & convert it to whitespace-structured text (this is the coolest part imo)
3. map LLM intents back to actions on elements in the browser via an ID-to-XPath dict
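Steps 2 and 3 can be sketched roughly like this (the function names and glyph-size constants are my own for illustration, not Tarsier's actual API):

```python
import re

# Assumed average glyph size in pixels; a real implementation would measure this.
CHAR_W, CHAR_H = 8, 16

def ocr_to_text(words):
    """Place OCR words on a character grid so the page's 2D layout
    survives into plain text. `words` is a list of (text, x_px, y_px)
    tuples from any OCR engine; empty rows are collapsed for brevity."""
    rows = {}
    for text, x, y in words:
        rows.setdefault(y // CHAR_H, []).append((x // CHAR_W, text))
    lines = []
    for row in sorted(rows):
        line = ""
        for col, text in sorted(rows[row]):
            line = line.ljust(col) + text  # pad with spaces out to the word's column
        lines.append(line)
    return "\n".join(lines)

def resolve_action(llm_output, tag_to_xpath):
    """Map an LLM intent like 'CLICK [23]' back to the tagged element's XPath."""
    m = re.search(r"\[\s*(\d+)\s*\]", llm_output)
    return tag_to_xpath[int(m.group(1))] if m else None
```

Since the tag IDs are stamped onto the page before the screenshot, they show up in both the OCR'd text the LLM reads and the ID-to-XPath dict used to execute its action.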
Humans interact with the web through visually-rendered pages, and agents should too. We run Tarsier in production for thousands of web data extraction agents a day at Reworkd (https://reworkd.ai).
By the way, we're hiring backend/infra engineers with experience in compute-intensive distributed systems!
[1] https://arxiv.org/abs/2305.16328
In https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt... we use FastSAM to first segment the UI elements, then have the LLM describe each segment individually. This seems to work quite well; see https://twitter.com/OpenAdaptAI/status/1789430587314336212 for a demo.
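The segment-then-describe loop can be sketched like this (the boxes and the `describe` callback are stand-ins — in the real pipeline the masks come from FastSAM and the descriptions from an LLM):

```python
import numpy as np

def describe_segments(image, boxes, describe):
    """Crop each segmented UI element and describe it individually.
    `image` is an HxWxC array, `boxes` are (x0, y0, x1, y1) pixel
    coordinates (here hand-written; normally produced by a segmenter
    like FastSAM), and `describe` stands in for a vision-LLM call."""
    out = []
    for x0, y0, x1, y1 in boxes:
        crop = image[y0:y1, x0:x1]  # numpy slicing is row (y) first
        out.append(((x0, y0, x1, y1), describe(crop)))
    return out
```

Describing small crops one at a time keeps each LLM call focused on a single widget instead of the whole cluttered screen.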
More coming soon!
It's sort of like when you (as a human) write a web scraper and visually click on individual elements to look at the surrounding HTML structure / their selectors, but then end up writing code with more general selectors rather than copy-pasting the selectors of the elements you clicked.
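A concrete (hypothetical) example of that difference, using Python's stdlib's limited XPath support:

```python
import xml.etree.ElementTree as ET

# Made-up page markup: suppose the element you clicked is the third <li>'s link.
html = """
<html><body>
  <ul>
    <li><a class="product">Alpha</a></li>
    <li><a class="product">Beta</a></li>
    <li><a class="product">Gamma</a></li>
  </ul>
</body></html>
"""
root = ET.fromstring(html)

# The selector you'd copy from devtools: positional and brittle.
clicked = root.find("./body/ul/li[3]/a")

# The selector you'd actually write: structural, and it generalizes to siblings.
products = root.findall(".//a[@class='product']")
```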
Good luck!
fwiw, so far we've seen that Azure has the best OCR for screenshot-type data across the proprietary and open source models, though we are far more focused on grabbing data from desktop-based applications than web pages, so ymmv
Will have to look into supporting Azure OCR in Tarsier then—thanks for the tip!
[1] https://github.com/reworkd/bananalyzer
[1] https://arxiv.org/abs/2405.07987
For instance, I might want to tag where elements are, as you did, but quite often I still need an association with a label to determine the actual context of the textbox or select.
E.g.: given a simple login form, I may not know whether the label is above or below the username textbox, and a password box would be below it. I have a hard time understanding the relevance of tagging without context.
Tagging is basically irrelevant to any automated task if we do not know the context. I am not trying to diminish your great work, don't get me wrong, but if you don't have context I don't see much relevance. You're doing something that is easily scripted with XPath templates, which I've done for over a decade.
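That label-to-field association can at least be approximated geometrically. A naive sketch (all names mine, purely illustrative — a robust version would also use DOM cues like `<label for=...>` or aria attributes):

```python
import math

def nearest_label(input_center, labels):
    """Associate a tagged input with its likeliest label by proximity.
    `input_center` is the input box's (x, y) center in pixels; `labels`
    is a list of (text, (x, y)) pairs. Plain Euclidean distance here;
    a real heuristic would weight labels above/left of the field."""
    ix, iy = input_center
    def dist(item):
        _, (lx, ly) = item
        return math.hypot(ix - lx, iy - ly)
    return min(labels, key=dist)[0]
```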
The Python API on this is really nice though.
https://flickr.com/photos/wyclif/3271137617/in/album-7215761...