shodai80 (u/shodai80)

shodai80 commented on Show HN: Tarsier – Vision utilities for web interaction agents github.com/reworkd/tarsie... · Posted by u/KhoomeiK

KhoomeiK · 2 years ago

They do show textboxes with labels. From our readme:

"Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:

[#ID]: text-insertable fields (e.g. textarea, input with textual type)

[@ID]: hyperlinks (<a> tags)

[$ID]: other interactable elements (e.g. button, select)

[ID]: plain text (if you pass tag_text_elements=True)"

Do you see the search boxes labeled [#4] and [#5] at the top? And before you say that the tag is on a different line from the placeholder text—yes, and our agent is smart enough to handle that minor idiosyncrasy. Are you shocked? :)
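
The tag scheme quoted from the README can be picked apart mechanically. A minimal sketch, assuming the page text uses exactly the four bracket forms listed above; `TAG_KINDS`, `find_tags`, and the sample string are hypothetical names for illustration, not part of Tarsier:

```python
import re

# Prefix -> element kind, following the quoted README excerpt.
# The kind names themselves are made up for this example.
TAG_KINDS = {
    "#": "text_input",    # [#ID]: text-insertable fields
    "@": "link",          # [@ID]: hyperlinks
    "$": "interactable",  # [$ID]: other interactable elements
    "": "plain_text",     # [ID]: plain text
}

# An optional prefix character followed by a numeric ID, inside brackets.
TAG_PATTERN = re.compile(r"\[([#@$]?)(\d+)\]")


def find_tags(page_text: str) -> list[tuple[int, str]]:
    """Return (tag_id, kind) pairs in order of appearance."""
    return [
        (int(number), TAG_KINDS[prefix])
        for prefix, number in TAG_PATTERN.findall(page_text)
    ]


# Hypothetical page text, not real Tarsier output.
sample = "[#4] Search docs   [@7] Pricing   [$9] Submit   [12] Welcome"
tags = find_tags(sample)
```

This only recovers which kind of element each tag refers to. It says nothing about which nearby text describes that element.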

shodai80 · 2 years ago

#4 and #5 are using placeholder attributes, and the text itself is contained within the node. Show me a simple form with labels external to the input node, then rearrange the labels so that some are above and some below, and I will be shocked! No placeholders. Each label must be its own 'text' node.

Edit: I do not intend to come off as negative or disparaging - I already discussed this with some OS projects I work on as well as internally at work. You guys did something great, and I am just trying to point out gaps that could take it from great to unbelievable.

shodai80 commented on Show HN: Tarsier – Vision utilities for web interaction agents github.com/reworkd/tarsie... · Posted by u/KhoomeiK

miki123211 · 2 years ago

This problem isn't that hard; screen readers have had to handle this exact issue for years. Inaccessible websites where the labels aren't properly associated with their respective form fields do exist, but aren't that common.

shodai80 · 2 years ago

Yes, if they are associated with accessibility attributes (ARIA). Many, many sites, including massive B2B ones, do not do this (a shame). So no, you are seriously minimizing the problem. This approach would also be architecturally poorly thought out: the solution needs to not depend upon ARIA, nor on any other non-global approach (which this solution does so far).

Everything shown to me so far has been a problem solvable by scripts and XPath template/creation logic. I've handled all of this for over 10 years with one script. When I see it finding everything and associating it with the correct external labels, then they have something. Otherwise I am concluding that it is non-functional, and that this is a long-since-solved problem where ML is over-engineering.

shodai80 commented on Show HN: Tarsier – Vision utilities for web interaction agents github.com/reworkd/tarsie... · Posted by u/KhoomeiK

KhoomeiK · 2 years ago

We run OCR on the screenshot & convert it to whitespace-structured text, which is passed to the LLM. The images below might make it clearer for you:

[1] https://github.com/reworkd/tarsier/blob/main/.github/assets/...

[2] https://github.com/reworkd/tarsier/blob/main/.github/assets/...
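
The idea of whitespace-structured text can be sketched as follows. This is an illustration under stated assumptions, not Tarsier's implementation: it assumes an OCR step has already produced words with pixel positions, and the `Word` class, `layout_text` function, and the `char_width`/`line_height` values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def layout_text(words: list[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place each word on a character grid that mirrors its screen position."""
    if not words:
        return ""

    # Group words by the text row their top edge falls into.
    rows: dict[int, list[Word]] = {}
    for word in words:
        rows.setdefault(word.y // line_height, []).append(word)

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for word in sorted(rows.get(row, []), key=lambda w: w.x):
            column = word.x // char_width
            if line and len(line) >= column:
                line += " "  # words would collide: keep a single space between them
            else:
                line = line.ljust(column)  # pad with spaces up to the target column
            line += word.text
        lines.append(line)  # rows with no words become blank lines
    return "\n".join(lines)


# Hypothetical OCR output for a two-field form.
words = [
    Word("Username", 0, 0), Word("[#1]", 120, 0),
    Word("Password", 0, 40), Word("[#2]", 120, 40),
]
page_text = layout_text(words)
```

Horizontal and vertical gaps on screen become runs of spaces and blank lines, so a text-only model receives a rough picture of the layout.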

shodai80 · 2 years ago

Provided screenshots below do not show textboxes, selects, or other input nodes with labels. Show me text output with associated labels for inputs being correct and I will be shocked.

shodai80 commented on Show HN: Tarsier – Vision utilities for web interaction agents github.com/reworkd/tarsie... · Posted by u/KhoomeiK

awtkns · 2 years ago

This is where an LLM comes in. A typical pipeline would tag a page, transform it into a textual representation, and then pass it to an LLM, which would be able to reason about which field(s) you're looking for, much like a human.

shodai80 · 2 years ago

My point still stands. How do you augment data for an LLM when you know the context of a page? Do you go through every element and set up the data for an associated label? Do you use div scoping via offset parent through a script to generate an associated div (a good approach, but bad in real-life conditions)? Do you convert the DOM to JSON or some other data structure? That means little because you still don't have context; you'd have to do it by hand every time the layout changes, and you would have to be very specific, which is a separate modeling problem as layouts are modified. What if the UI can be configured with different layout types, such as label above, label to the side, or label below, where this can be set dynamically?

What I am pointing out here is that even data modeling is mostly irrelevant unless you want to go through every page/permutation of a page, all the while hoping the layout isn't modified, or it's back to training all over again. That is downtime, and at some point you'll realize it's just better to store user-created XPaths, as it's quicker to update those than to retrain.

How do you reason with an LLM without going through any of the above? Automation cannot consistently have downtime for retraining; that is the antithesis of its purpose.

Let's not even get into shadow dom issues.

I am keying on your third bullet point on Github:

"How can you inform a text-only LLM about the page's visual structure?"

My questions suggest a gap in your awesome accomplishment.

shodai80 commented on Show HN: Tarsier – Vision utilities for web interaction agents github.com/reworkd/tarsie... · Posted by u/KhoomeiK

awtkns · 2 years ago

Tarsier provides a mapping of element number (eg: [23]) to xpath. So for any tagged item we're able to map it back to the actual element in the DOM, allowing for easy interaction with the elements on the page.
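
The mapping described here can be pictured as a plain dictionary lookup. A minimal sketch, assuming the mapping is a dict from integer tag ID to XPath string; `tag_to_xpath`, `resolve_tag`, and the XPaths themselves are hypothetical values for illustration:

```python
# Hypothetical mapping from tag ID to the XPath of the tagged element.
tag_to_xpath = {
    23: "//form[@id='login']//input[@name='username']",
    24: "//form[@id='login']//input[@name='password']",
}


def resolve_tag(reference: str, mapping: dict[int, str]) -> str:
    """Turn a tag such as '[#23]' or '[23]' into the XPath it points at."""
    # Strip the brackets and any type prefix, leaving only the numeric ID.
    tag_id = int(reference.strip("[]#@$"))
    return mapping[tag_id]


xpath = resolve_tag("[#23]", tag_to_xpath)
```

The resulting XPath could then be handed to a browser automation library's XPath locator to click or type into the element.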

shodai80 · 2 years ago

I understand that, I assume you are tagging the node and making a basic xpath to the node/attribute with your tag id. Understood. But how relevant is tagging a node when I have no idea what the node is actually for?

Example: given a simple login form, I may not know if the label is above or below the username textbox. A password box would be below it. I have a hard time understanding the relevance of tagging without context.

Tagging is basically irrelevant to any automated task if we do not know the context. I am not trying to diminish your great work, don't get me wrong, but if you don't have context I don't see much relevance. You're doing something that is easily scripted with XPath templates, which I've done for over a decade.

shodai80 commented on Show HN: Tarsier – Vision utilities for web interaction agents github.com/reworkd/tarsie... · Posted by u/KhoomeiK

shodai80 · 2 years ago

How do you know, for a specific webelement, what label it is associated with for a textbox or select?

For instance, I might want to tag as you did where elements are, but I still need an association with a label, quite often, to determine what the actual context of the textbox or select is.
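
One way to frame the question is as geometric matching between label boxes and input boxes. The sketch below is a hypothetical heuristic, not something Tarsier is stated to do: it assumes bounding boxes are already known for both labels and inputs, and `Box`, `gap`, and `nearest_labels` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def gap(a: Box, b: Box) -> float:
    """Distance between the closest edges of two boxes (0 if they overlap)."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0)
    return (dx * dx + dy * dy) ** 0.5


def nearest_labels(inputs: list[Box], labels: list[Box]) -> dict[str, str]:
    """Pair each input with the label whose edges are closest to it."""
    return {
        field.name: min(labels, key=lambda label: gap(field, label)).name
        for field in inputs
    }


# Hypothetical layout: each label sits 4px above its input.
labels = [Box("Username", 0, 0, 80, 16), Box("Password", 0, 60, 80, 16)]
inputs = [Box("input_a", 0, 20, 200, 24), Box("input_b", 0, 80, 200, 24)]
pairs = nearest_labels(inputs, labels)
```

A pure nearest-neighbor rule like this works when spacing is consistent, but it becomes ambiguous when some labels sit above their inputs and others below, because a label can then be equally close to two fields.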