Posted by u/KhoomeiK 2 years ago
Show HN: Tarsier – Vision utilities for web interaction agents (github.com/reworkd/tarsie...)
Hey HN! I built a tool that gives LLMs the ability to understand the visual structure of a webpage even if they don't accept image input. We've found that unimodal GPT-4 + Tarsier's textual webpage representation consistently beats multimodal GPT-4V/4o + webpage screenshot by 10-20%, probably because multimodal LLMs still aren't as performant as they're hyped to be.

Over the course of experimenting with pruned HTML, accessibility trees, and other perception systems for web agents, we've iterated on Tarsier's components to maximize downstream agent/codegen performance.

Here's the Tarsier pipeline in a nutshell (rough code sketch below):

1. tag interactable elements with IDs for the LLM to act upon & grab a full-sized webpage screenshot

2. for text-only LLMs, run OCR on the screenshot & convert it to whitespace-structured text (this is the coolest part imo)

3. map LLM intents back to actions on elements in the browser via an ID-to-XPath dict
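
Roughly, here's how the pieces fit together with Playwright (a simplified sketch; the exact class names, OCR setup, and return types may differ slightly from the repo, so treat it as illustrative):

```python
import asyncio
import json

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService  # names per the README; may drift

async def main():
    # OCR backend (Google Cloud Vision service-account JSON; placeholder path)
    with open("service_account.json") as f:
        credentials = json.load(f)
    tarsier = Tarsier(GoogleVisionOCRService(credentials))

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        # Steps 1 + 2: tag interactable elements, screenshot the page, and OCR it
        # into whitespace-structured text with [ID] tags the LLM can reference.
        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        # Step 3: the LLM returns an intent like "click [23]"; resolve the ID
        # back to a live element via the ID-to-XPath mapping.
        chosen_id = 23  # pretend this came from the LLM
        await page.click(f"xpath={tag_to_xpath[chosen_id]}")

asyncio.run(main())
```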

Humans interact with the web through visually-rendered pages, and agents should too. We run Tarsier in production for thousands of web data extraction agents a day at Reworkd (https://reworkd.ai).

By the way, we're hiring backend/infra engineers with experience in compute-intensive distributed systems!

https://reworkd.ai/careers

bckmn · 2 years ago
Reminds me of [Language as Intermediate Representation](https://chrisvoncsefalvay.com/posts/lair/) - LLMs are optimized for language, so translate an image into language and they'll do better at modeling it.
KhoomeiK · 2 years ago
Cool connection, hadn't seen this before but feels intuitively correct! I also formulate similar (but a bit more out-there) philosophical thoughts on word-meaning as being described by the topological structure of its corresponding images in embedding space, in Section 5.3 of my undergrad thesis [1].

[1] https://arxiv.org/abs/2305.16328

abrichr · 2 years ago
Congratulations on shipping!

In https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt... we use FastSAM to first segment the UI elements, then have the LLM describe each segment individually. This seems to work quite well; see https://twitter.com/OpenAdaptAI/status/1789430587314336212 for a demo.
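
For anyone curious, the segment-then-describe idea looks roughly like this (a hedged sketch using the ultralytics FastSAM wrapper and a generic vision-LLM call; OpenAdapt's actual implementation is in the linked file):

```python
import base64
import io

from PIL import Image
from openai import OpenAI
from ultralytics import FastSAM  # FastSAM weights via the ultralytics package

client = OpenAI()
model = FastSAM("FastSAM-s.pt")

screenshot = Image.open("screenshot.png")
results = model("screenshot.png", retina_masks=True, imgsz=1024, conf=0.4, iou=0.9)

descriptions = []
for box in results[0].boxes.xyxy:  # one bounding box per segmented UI element
    x1, y1, x2, y2 = map(int, box.tolist())
    crop = screenshot.crop((x1, y1, x2, y2))

    buf = io.BytesIO()
    crop.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    # Describe each segment individually with a vision model
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this UI element in one short line."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    descriptions.append(resp.choices[0].message.content)
```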

More coming soon!

jackienotchan · 2 years ago
Looking at OpenAdapt, I'm wondering why they didn't integrate Tarsier into AgentGPT, which is their flagship GitHub repo but doesn't seem to be under active development anymore.
KhoomeiK · 2 years ago
We have a lot more powerful use-cases for Tarsier in web data extraction at the moment. Stay tuned for a broader launch soon!
davedx · 2 years ago
How do you make sure the tagging of elements is robust? With regular browser automation it's quite hard to write selectors that keep working after webpages get updated; often, when writing E2E tests, teams end up putting data attributes on elements to aid with selection. Using a numerical identifier seems quite fragile.
KhoomeiK · 2 years ago
Totally agreed—this is a design choice that basically comes from our agent architecture, and the codegen-based architecture that we think will likely proliferate for web agent tasks in the future. We provide Tarsier's text/screenshot to an LLM and have it write code with generically written selectors rather than the naive selectors that Tarsier assigns to each element.

It's sort of like when you (as a human) write a web scraper and visually click on individual elements to look at the surrounding HTML structure / their selectors, but then end up writing code with more general selectors—not copypasting the selectors of the elements you clicked.
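
To make that concrete, here's a toy version of the distinction (the generated scraper below is hypothetical LLM output, not our actual codegen):

```python
# What the LLM is shown (simplified): Tarsier's tag IDs mapped to naive,
# position-based XPaths.
tag_to_xpath = {
    12: "/html/body/div[3]/div[2]/form/input[1]",
    13: "/html/body/div[3]/div[2]/form/button[1]",
}

# What we want it to emit: code with generic, intent-based selectors that
# survive layout changes (hypothetical generated Playwright code).
async def generated_scraper(page):
    await page.fill("input[name='q']", "tarsier")   # semantic attribute, not position
    await page.click("form button[type='submit']")
    return await page.inner_text("#results")
```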

davedx · 2 years ago
Ooh that's a very neat approach, great idea! Chains of thought across abstraction layers. Definitely worth a blog post I reckon.

Good luck!

ghxst · 2 years ago
Great question. Also, situations where you have multiple CTAs with similar names/contexts on a page are still something I see LLM-based automation struggle with.
KhoomeiK · 2 years ago
Hm, not sure I follow why those situations would be especially difficult? Regarding website changes, the nice thing about using LLMs is that we can simply provide the previous scraper as context and have it regenerate the scraper to "self-heal" when significant website changes are detected.
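
A minimal sketch of that self-heal step (illustrative only; the helper below is not our production code):

```python
from openai import OpenAI

client = OpenAI()

def regenerate_scraper(old_scraper: str, tarsier_page_text: str) -> str:
    """Ask the LLM to rewrite a broken scraper, passing the old code and
    Tarsier's fresh view of the page as context."""
    prompt = (
        "This web scraper broke after the site changed:\n\n"
        f"{old_scraper}\n\n"
        "Here is the page as Tarsier currently renders it:\n\n"
        f"{tarsier_page_text}\n\n"
        "Rewrite the scraper so it extracts the same fields again."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```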
dbish · 2 years ago
Very cool. We do something similar by combining OCR with accessibility data and other signals (speech recognition, etc.) for desktop-based screensharing understanding, but evaluation against multi-modal LLMs has not been easy. How are you evaluating to come up with the "consistently beats multimodal GPT-4V/4o + webpage screenshot by 10-20%" number?

fwiw so far we've seen that Azure has the best OCR for screenshot-type data across the proprietary and open-source models, though we are far more focused on grabbing data from desktop applications than web pages, so ymmv

KhoomeiK · 2 years ago
Yup, evals can definitely be tough. We basically have a suite of several hundred web data extraction evals in a tool we built called Bananalyzer [1]. It's made it pretty straightforward for us to benchmark how accurately our agent generates code when it uses Tarsier-text (+ GPT-4) for perception vs. Tarsier-screenshot (+ GPT-4V/o).

Will have to look into supporting Azure OCR in Tarsier then—thanks for the tip!

[1] https://github.com/reworkd/bananalyzer

dbish · 2 years ago
Awesome, will take a look at this. Thank you!
timabdulla · 2 years ago
Neat. Do you have the Bananalyzer eval results for Tarsier published somewhere?
SomaticPirate · 2 years ago
Surprised to hear Azure beats AWS Textract. I found it to be the best OCR offering but that was when I was doing documents.
navanchauhan · 2 years ago
In my experience Azure is probably the best OCR offering right now. They are also the only ones to be able to recognize my terrible handwriting.
dbish · 2 years ago
Yes, Textract does not work as well for desktop screenshots from our testing
pk19238 · 2 years ago
This is such a creative solution. Reminds me of how a team rendered Wolfenstein into ASCII characters and fine-tuned Mistral to successfully play it.
KhoomeiK · 2 years ago
Thanks! Yeah, it seems like a lot can be done with just text while we wait for multimodal models to catch up. The recent Platonic Representation Hypothesis [1] also suggests that different models, regardless of modality, build the same internal representations of the world.

[1] https://arxiv.org/abs/2405.07987

shodai80 · 2 years ago
How do you know, for a specific web element like a textbox or select, which label it is associated with?

For instance, I might want to tag where elements are, as you did, but quite often I still need the association with a label to determine the actual context of the textbox or select.

awtkns · 2 years ago
Tarsier provides a mapping of element number (e.g. [23]) to XPath. So for any tagged item, we're able to map it back to the actual element in the DOM, allowing for easy interaction with the elements on the page.
shodai80 · 2 years ago
I understand that; I assume you are tagging the node and making a basic XPath to the node/attribute with your tag ID. Understood. But how relevant is tagging a node when I have no idea what the node is actually for?

EX: Given a simple login form, I may not know if the label is above or below the username textbox. A password box would be below it. I have a hard time understanding the relevance of tagging without context.

Tagging is basically irrelevant to any automated task if we do not know the context. I am not trying to diminish your great work, don't get me wrong, but if you don't have context I don't see much relevance. You're doing something that is easily scripted with XPath templates, which I've done for over a decade.

reidbarber · 2 years ago
Neat! Been building something similar to the tagging feature in TypeScript: https://github.com/reidbarber/webmarker

The Python API on this is really nice though.

wyclif · 2 years ago
Hey! I'm actually in the Philippines now, and I've spent a lot of time on the island of Bohol, which has the world's greatest concentration of tarsiers. In fact, I visited the Tarsier Wildlife Sanctuary there with my wife; it's the world's main tarsier sanctuary. So I was instantly intrigued by the name of the app.

https://flickr.com/photos/wyclif/3271137617/in/album-7215761...

KhoomeiK · 2 years ago
Awesome pics! We love tarsiers too