I work in healthcare RCM. I have no trouble believing the staff here that nothing in their system works.
I know the authors of Skyvern are around here sometimes -- how do you think about code generation with vision-based approaches to agentic browser use like OpenAI's Operator, Claude Computer Use, and Magnitude?
From my POV, the vision-based approaches are superior, but they're less amenable to codegen.
CV and direct mouse/keyboard interactions are the “base” interface, so if you solve this problem, you unlock just about every automation use case.
(I agree that if you can get good, unambiguous, actionable context from accessibility/automation trees, that’s going to be superior)
All of the solutions are already available on the internet that the various models were trained on, albeit in different proportions.
Any variance is likely due to that mix of training data.
If you care about understanding relative performance between models at solving known problems and producing output in the correct format, it's pretty useful.
- Even for well-known problems, we see a large spread in quality between models (5% to 75% correctness)
- We also see a large spread in models' ability to produce responses in the formats they were instructed to use
At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a way to understand model performance beyond vibe checking.
I had multiple full-body DEXA scans during the programme.
I didn’t change my exercise routine at all. I wasn’t hitting the gym or doing weights, just my usual basic cardio.
And I gained muscle and lost ~10 kilos in weight.
It wasn’t much muscle, but it was more than before.
MRI is the gold standard; everything else is pretty loosey-goosey.
Sorry, no references, but this comes up pretty often in the science-based lifting communities on Reddit and YouTube if you want to learn more.
I guess you mean its "Computer use" API that can (if I understand correctly) send mouse clicks at specific coordinates?
I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:
> Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.
This is 3.5 Sonnet (their most current model).
And they explicitly call out spatial reasoning as a limitation:
> Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.
--https://docs.anthropic.com/en/docs/build-with-claude/vision#...
Since 2022 I've occasionally dipped in and tested this use case with the latest models, but I haven't seen much progress on spatial reasoning. The multi-modality has been a neat addition, though.
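If anyone wants to repeat this kind of check, here's a minimal sketch using the Anthropic Messages API with an image block; the file name and prompt are just placeholders for whatever screenshot you're testing against:

```python
# Minimal sketch: ask the model for pixel coordinates of a UI element in a
# screenshot. "keyboard.png" and the prompt are placeholders.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("keyboard.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "What are the pixel coordinates of the SPACE key? Answer as (x, y).",
            },
        ],
    }],
)

print(message.content[0].text)
```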
> When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical.
Existing approaches tend to involve drawing marked bounding boxes around interactive elements and then asking the LLM to provide a tool call like `click('A12')` where A12 remaps to the underlying HTML element and we perform some sort of Selenium/JS action. Using heuristics to draw those bounding boxes is tricky. Even performing the correct action can be tricky as it might be that click handlers are attached to a different DOM element.
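A rough sketch of that label-to-element remapping, assuming Selenium with Chrome (the selector list and the A-prefixed label scheme are just illustrative):

```python
# Rough sketch of the label -> element remapping described above, assuming
# Selenium with Chrome; the selector list and label scheme are illustrative.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

# Collect visible interactive elements and assign the short labels that get
# drawn as bounding boxes on the screenshot shown to the LLM.
candidates = driver.find_elements(
    By.CSS_SELECTOR, "a, button, input, select, textarea, [role='button'], [onclick]"
)
label_to_element = {
    f"A{i}": el for i, el in enumerate(candidates) if el.is_displayed()
}

def click(label: str) -> None:
    """Execute the LLM's tool call, e.g. click('A12')."""
    el = label_to_element[label]
    # The click handler may live on a different DOM element, so a JS click
    # (which bubbles) is often more reliable than a raw WebDriver click.
    driver.execute_script("arguments[0].click();", el)
```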
Avoiding this remapping from a visual element to an HTML element and instead working with high-level operations like `click(x, y)` or `type("foo")` directly on the screen will probably be more effective at automating use cases.
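For contrast, here's roughly what the coordinate-based flow looks like with Anthropic's computer-use beta; the tool type and beta flag are from memory of their docs and may have changed, and pyautogui is just my choice for executing the returned actions locally:

```python
# Sketch of the screenshot -> click(x, y) style of interaction; tool type and
# beta flag are from memory of Anthropic's docs and may have changed.
import anthropic
import pyautogui

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[{
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the settings menu."}],
)

# The model replies with tool_use blocks like
# {"action": "left_click", "coordinate": [x, y]}; execute them on the screen.
for block in response.content:
    if block.type == "tool_use" and block.input.get("action") == "left_click":
        x, y = block.input["coordinate"]
        pyautogui.click(x, y)
```

A real agent loop would also take fresh screenshots and return them as tool results so the model can chain actions, but the shape of the interaction is the same: no DOM, just pixels and coordinates.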
That being said, providing HTML to the LLM as context does tend to improve performance on top of just visual inference right now.
So I dunno... I'm more optimistic about Claude's approach and am very excited about it... especially if visual inference continues to improve.
Using their multi-tool, they removed the fender liners (wheel well liners) from all 4 wheels and the trunk side trim (luggage compartment side trim) from both sides, all of which are held in by just plastic push-pin scrivets (retainer clips). They broke 5 of them.
They folded down my back seats (after moving all my personal items out to the shoulder in the rain), then unbolted and removed the back seat.
I do a LOT of interstate driving, and it is not at all uncommon to see this happen.
This is not the only time I have been in situations where authority has been exceeded. My attitude is to generally be cooperative (without giving consent) as my experience has taught me that is the most painless way to go.
Sorry about all the broken plastic on the trim -- That's also very familiar...