As someone building AI SaaS products, I used to hold the position that directly integrating with APIs was going to get us most of the way there in terms of complete AI automation.
I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.
My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.
My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.
I started to realize that most of the real world runs on software that interfaces directly with people, without clearly defined public APIs you can integrate with. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs, etc.
Boy was I wrong.
I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super hairy, and hence very worthwhile to automate.
This has existed for a long time; it's called "RPA", or Robotic Process Automation. The biggest incumbent in this space is UiPath, but there are a host of startups and large companies alike tackling it.
Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups trying to build agentic versions of RPA; I'm glad to see Anthropic is investing in it now too.
It's almost always a framework around existing tools like Selenium that you constantly have to fight against to get good results. I was always left with the feeling that I could build something better myself by just hand-rolling the scripts rather than using their frameworks.
Getting Claude integrated into the space is going to be a game changer.
I was going to comment about this. Worked at a place that had a “Robotics Department”, wow I thought. Only to find out it was automating arcane software.
UI is now much more accessible as an API. I hope we don’t start seeing captcha-like behaviour in desktop or web software.
UiPath hasn't figured out how to make a profitable business since 2005, and we are nearing the end of this hype cycle. I am not so sure this will lead anywhere. I am a former investor in UiPath.
My impression is that actually solving this classic RPA problem with AI is exactly the raison d'être of AI21 Labs with their task-specific models[1]. They don't have the biggest or best general-purpose LLM, but they have an excellent model that's been pre-trained on specific types of business data and also made available to developers through simple APIs & "RPA-style" interfaces.
[1] https://www.ai21.com/use-cases
> Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.
In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
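To make that concrete, here's a minimal sketch of that message-based approach in Python via ctypes - the Notepad window title is just an example target, and real code would also declare argtypes/restype:

    import ctypes
    from ctypes import wintypes

    user32 = ctypes.windll.user32
    WM_GETTEXT = 0x000D
    WNDENUMPROC = ctypes.WINFUNCTYPE(wintypes.BOOL, wintypes.HWND, wintypes.LPARAM)

    def control_text(hwnd):
        # Ask the control for its text with a targeted message - no OCR, no mouse.
        buf = ctypes.create_unicode_buffer(512)
        user32.SendMessageW(hwnd, WM_GETTEXT, 512, buf)
        return buf.value

    def dump_controls(top_hwnd):
        # EnumChildWindows walks every descendant control of the window.
        def visit(hwnd, _lparam):
            cls = ctypes.create_unicode_buffer(256)
            user32.GetClassNameW(hwnd, cls, 256)
            print(hex(hwnd), cls.value, repr(control_text(hwnd)))
            return True  # keep enumerating
        user32.EnumChildWindows(top_hwnd, WNDENUMPROC(visit), 0)

    hwnd = user32.FindWindowW(None, "Untitled - Notepad")  # example target window
    if hwnd:
        dump_controls(hwnd)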
Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.
--
[0] - This is changing a bit in that frameworks seem to be getting complex enough that SaaS vendors often have no clue as to what kind of access they're leaving open to people who know how to press F12 in their browsers and how to call cURL. I'm not talking bespoke APIs backend team wrote, but standard ones built into middleware, that fell beyond dev team's "abstraction horizon". GraphQL is a notable example.
Basically, if it means companies can introduce automation without changing anything about the tooling/workflow/programs they already use, it's going to be MASSIVE. Just an install and a prompt, and you've already automated a lengthy manual process - awesome.
Companies are going to install an AI inside their own proprietary systems full of proprietary and confidential data and PII about their customers and prospects and whatnot, and let it run around and click on random buttons and submit random forms?
Really??!? What could possibly go wrong.
I'm currently trying to do a large OCR project using the Google Vision API, and then Gemini 1.5 Pro 002 to parse and reconstruct the results (taking advantage, one hopes, of its big context window). As I'm not familiar with the Google Vision API, I asked Gemini to guide me in setting it up.
Gemini is the latest Google model; Vision, as the name implies, is also from Google. Yet Gemini makes several egregious mistakes about Vision, gets names of fields or options wrong, etc.
Gemini 1.5 "Pro" also suggests that concatenating two JSON strings produces a valid JSON string; when told that's unlikely, it's very sorry and makes lots of apologies, but it still made the mistake in the first place.
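For anyone who hasn't hit this: two JSON documents glued together aren't one document, and the stdlib will tell you so immediately (values made up):

    import json

    a = json.dumps({"page": 1, "text": "foo"})
    b = json.dumps({"page": 2, "text": "bar"})

    try:
        json.loads(a + b)  # '{"page": 1, ...}{"page": 2, ...}' - two docs, not one
    except json.JSONDecodeError as e:
        print("invalid JSON:", e)  # fails with "Extra data" at the boundary

    # The correct merge is to parse both and combine, e.g. into a list:
    merged = json.dumps([json.loads(a), json.loads(b)])
    print(json.loads(merged))  # round-trips cleanly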
LLMs can be useful when used with caution; letting one loose in an enterprise environment doesn't feel safe, or sane.
I've been peddling my vision of "AI automation" for the last several months to acquaintances of mine in various professional fields. In some cases, even building up prototypes and real-user testing. Invariably, none have really stuck.
This is not a technical problem that requires a technical solution. The problem is that it requires human behavior change.
In the context of AI automation, the promise is huge gains, but when you try to convince users/buyers, there is nothing wrong with their current solutions. I.e., there is no problem to solve. So essentially: "why are you bothering me with this AI nonsense?"
Honestly, human behavior change might be the only real blocker to a world where AI automates most of the boring busy work currently done by people.
This approach essentially sidesteps the need to effect a behavior change, at least in the short term, while AI can prove and solidify its value in the real world.
Talking about ancient Windows software... Windows used to have an API for automation in the 2000s (I don't know if it still does). In one of my first jobs, I wrote an MS Access script that ran and moved the cursor to exactly the pixel coordinates where buttons and fields were positioned in a GUI we wanted to extract data from. My boss used to do this manually. After a week he had millions of records ready to query in Access. You can imagine how excited he was. It was a fun little project, and pretty hilarious to see the cursor moving fast AF around the screen like it was possessed. PS: you could screw up the script run pretty easily by bumping into the mouse of that PC.
Still present. VB and VBScript would do this by sending mouse moves to window handles discovered using Spy++. You can do it with C# or AutoIt these days.
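The modern Python equivalent is a few lines of pyautogui - coordinates here are made up for some hypothetical GUI:

    import pyautogui  # pip install pyautogui

    pyautogui.FAILSAFE = True  # slam the mouse into a screen corner to abort

    pyautogui.click(x=412, y=318)  # hypothetical "Search" button position
    pyautogui.write("record-12345", interval=0.02)
    pyautogui.press("enter")

    # And just like the Access script, bumping the physical mouse mid-run
    # can still derail it.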
Really good software has automation built in, in the form of macros/scripting.
One of the reasons my bash terminal is more effective than point-and-click is the ease of automating routine tasks: from history and aliases to scripting.
Software interop is trickier: it doesn't so much depend on being able to send messages from one piece of software to another as on having an agreed format for those messages (whether they be expressed in the payload of an API, or a UI-specific stream of points and clicks).
I tried once to integrate with software used by pharmacists in Australia (which costs a fortune and looks straight out of the 90's). Even though they have a SQL database with all information about everything, the DB is encrypted and they provide no way to access it. I tried contacting the company to learn how I could integrate with it but they offered no help. It's not in their interest to help upcoming competitors and they claim they cannot provide the encryption key as that's for protecting the customer's privacy, forgetting that the pharmacists already have access to all data through their software. But without a way to automatically extract information from the DB, there's no way to migrate to new software as no pharmacist would want to use new software if it cannot "import" the data they need.
It's clear that the reason there's no competition in the market is that the companies (I think it's literally one company) selling this software want it that way, and legislation is yet to catch up with the way they ensure their monopoly.
I'm a bit skeptical about this working well enough to handle exceptions as soon as something out of the ordinary occurs. But it seems this could work great for automated testing.
Has anyone tried asking "use computer" to do "Please write a selenium/capybara/whatever test for filling out this form and sending it?"
That would take away some serious drudge work. And it's not a big problem if it fails, contrary to when it makes a mistake in filling out a form in an actual business process.
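For what it's worth, the output you'd want back is only ~10 lines of Selenium anyway, so it's a well-scoped ask. Something like this, for a hypothetical form with name and email fields (URL and selectors made up):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("https://example.com/signup")  # hypothetical form URL
        driver.find_element(By.NAME, "name").send_keys("Jane Doe")
        driver.find_element(By.NAME, "email").send_keys("jane@example.com")
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        assert "Thank you" in driver.page_source  # expected confirmation text
    finally:
        driver.quit()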
LLMs are enabling a reimagination of UI.
Where's the biggest opportunity in UI that hasn't kept up to date? Legacy and regulated software in the long tail. That's the disruption opportunity for LLMs.
Imagine a banking website. It has a lot of predefined flows of what can be achieved. These flows have steps arranged in a dependency graph. From the server side, an LLM can ask users for inputs to satisfy the dependencies of the task the user wants to accomplish. We will have intuitive UI interfaces in all the languages of the world.
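A toy sketch of what those server-side flows could look like - every name here is hypothetical, but the idea is that the LLM only has to phrase questions for whatever inputs are still missing:

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        name: str
        requires: list[str] = field(default_factory=list)  # user-supplied inputs

    # Hypothetical "wire transfer" flow.
    FLOW = [
        Step("authenticate", requires=["customer_id", "otp"]),
        Step("pick_account", requires=["account_id"]),
        Step("enter_transfer", requires=["recipient_iban", "amount"]),
        Step("confirm"),
    ]

    def missing_inputs(step: Step, collected: dict) -> list[str]:
        return [r for r in step.requires if r not in collected]

    collected = {"customer_id": "c-123", "otp": "000000"}
    for step in FLOW:
        for need in missing_inputs(step, collected):
            # In the real thing, an LLM would render this question in the
            # user's own language instead of printing a field name.
            print(f"[{step.name}] please provide: {need}")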
Not to mention software like PeopleSoft, SAP, and ServiceNow. Absolute shit interfaces that employees have to deal with day in and day out.
Now, an in-house tool built on top of this Anthropic API can save hours of drudgery. I can already see sales teams smiling at the new 'submit your expense report' button.
With one big exception: your general AI might dominate the business case, but my specialized one can craft raw packets. I’m the one who names the price, not you.
This is pretty similar to the argument for why humanoid robots will be a big deal. The physical world is also built mostly for humans, so having an autonomous agent that can interact with it is huge.
You don’t know for a fact that those two specific packages don’t have supported APIs. Just because the user doesn’t know of any API doesn’t mean none exists. The average accountant or doctor is never going to even ask the vendor “is there an API” because they wouldn’t know what to do with one if there was.
In their "Developing a computer use model" post they mention:
> On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.
Here, which model does "next-best AI model in the same category" refer to?
This is cover for the people whose screens are recorded. Run this on the monitored laptop to make you look busy, then do the actual work on laptop 2, some of which might actually require thinking, so no mouse movements.
> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?
It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.
It can be insistent when it needs to be, though:
> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:
> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:
> Let me try clicking Firefox one more time to ensure it starts:
> Let me try one more time to take a screenshot to see if Firefox has opened:
> Let me try starting Firefox using the bash tool as an alternative method:
> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?
I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't run loose.
And here he goes through my website, through my Hacker News account, and then finds this very comment:
> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:
"I like its lazy approach"
This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
SYSTEM_PROMPT = f"""<SYSTEM_CAPABILITY>
* You are utilising an Ubuntu virtual machine using {platform.machine()} architecture with internet access.
* You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget.
* To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system.
* Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did.
* When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output.
* When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available.
* When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request.
* The current date is {datetime.today().strftime('%A, %B %-d, %Y')}.
</SYSTEM_CAPABILITY>
<IMPORTANT>
* When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there.
* If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool.
</IMPORTANT>"""
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but in other places it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.
And finally, in the table in the blog post, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default to using it; maybe the ROI is lower on Opus or something?
When I manually tested it, I felt like Opus gave slightly better replies than Sonnet, but I'm not 100% sure it's not just placebo.
Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.
I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.
> Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere
Why isn't Anthropic clearer about Sonnet being better then? Why isn't it included in the benchmark if new Sonnet beats Opus? Why are they so ambiguous with their language?
For example, https://www.anthropic.com/api says:
> Sonnet - Our best combination of performance and speed for efficient, high-throughput tasks.
> Opus - Our highest-performing model, which can handle complex analysis, longer tasks with many steps, and higher-order math and coding tasks.
And Opus is above/after Sonnet. That to me implies that Opus is indeed better than Sonnet.
But then you go to https://docs.anthropic.com/en/docs/about-claude/models and it says:
> Claude 3.5 Sonnet - Most intelligent model
> Claude 3 Opus - Powerful model for highly complex tasks
Does that mean Sonnet 3.5 is better than Opus even for highly complex tasks, since it's the "most intelligent model"? Or just for everything except "highly complex tasks"?
I don't understand why this seems purposefully ambiguous.
I think the practical economics of the LLM business are becoming clearer in recent times. Huge models are expensive to train and expensive to run. As long as it meets the average user's everyday needs, it's probably much more profitable to just continue with multimodal and fine-tuning development on smaller models.
I think the main reason is that they tried training a heavyweight model that was supposed to be Opus 3.5, but it didn't yield large enough improvements over 3.5 Sonnet to justify releasing it. (They had it on their page for a while that Opus was coming soon, and now they've scrapped that.)
This theory is consistent with the other two top players, OpenAI and Google: both were expected to release a heavy model, but instead have just released multiple medium- and small-tier models. It's been a long time since Google released Gemini Ultra 1.0 (the naming clearly implying that they were planning on upgrading it to 1.5, like they did with Pro).
Not seeing anyone release a heavyweight model, while at the same time releasing many small and medium-sized models, makes me think that improving models will be much more complicated than scaling with more compute, and that there are likely diminishing returns in that regard.
That begs the question: why am I still paying for access to Opus 3?
Honestly I don’t know. I’ve not been using Sonnet 3.5 up to now and I’m a fairly light user so I doubt I’ll run into the free tier limits. I’ll probably cancel my subscription until Opus 3.5 comes out (if it ever does).
Opus is a larger and more expensive model. Presumably 3.5 Opus will be the best but it hasn't been released. 3.5 Sonnet is better than 3.0 Opus kind of like how a newer i5 midrange processor is faster and cheaper than an old high-end i7.
Makes me wonder if perhaps they do have 3.5 Opus trained, but that they're not releasing it because 3.5 Sonnet is already enough to beat the competition, and some combination of "don't want to contribute to an arms race" and "it has some scary capabilities they weren't sure were ready to publish yet".
Anthropic use the names Haiku/Sonnet/Opus for the small/medium/large versions of each generation of their models, so within a generation that is also their performance (& cost) order. Evidently Sonnet 3.5 outperforms Opus 3.0 on at least some tasks, but that is not a same-generation comparison.
I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.
By reputation -- I can't vouch for this personally, and I don't know if it'll still be true with this update -- Opus is still often better for things like creative writing and conversations about emotional or political topics.
I don't think that's quite it. They had it on their website before this that Opus 3.5 was coming soon; now they've removed that from the page.
Also, Gemini Ultra 1.0 was released like 8 months ago; 1.5 Pro was released soon after, with this wording: "The first Gemini 1.5 model we’re releasing for early testing is Gemini 1.5 Pro".
Still no Ultra 1.5, despite many mid- and small-sized models being released in that time frame. This isn't just an issue of "the training takes longer" or a "skew" in release dates. There's a better theory to explain why all the SoTA LLM companies have not released a heavy model in many months.
The models "3.5 Sonnet" and "3 Opus" are, in my experience, nearly at the same level. Only once in my last 250 prompts did I run into a problem that 3 Opus was able to solve but 3.5 Sonnet could not. (I forget the details, but it was a combination of logic and trivia knowledge. It is highly likely 3.5 Sonnet would have done a better job with better prompting and richer context, but this was a problem where I also lacked the context and understanding to prompt well.)
Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet, so I don't know what the number for the reverse is. How many problems does 3.5 Sonnet get that 3 Opus does not? ¯\_(ツ)_/¯
My best guess would be that it's something in the same kind of range.
The "computer use" ability is extremely impressive!
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.
Take a look at their demo of using this for coding: https://www.youtube.com/watch?v=vH2f7cjXjKI
This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood), as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real-world goal is very impressive. It'd be mind-boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from the POV of Anthropic's focus on AI safety. On their website they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc., but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking?
From what I'm seeing on GH, this could technically have already been built, right? Is it not just taking screenshots of the computer screen and deciding what to do from there, looping until it gets to the solution?
Well, obviously it's controlling your computer too - controlling mouse and keyboard input, and has been trained to know how to interact with apps (how to recognize and use UI components). It's not clear exactly what all the moving parts are and how they interact.
I wouldn't be so dismissive - you could describe GPT-o1 the same way: "it just loops until it gets to the solution". It's the details and implementation that matter.
I've been saying this is coming for a long time, but my really smart SWE friend, who is nevertheless not in the AI/ML space, dismissed it as a stupid, roundabout way of doing things - software should just talk via APIs. No matter how much I argued regarding legacy software/websites and how much functionality is really only available through a GUI, it seems some people are really put off by this type of approach. To me, being more embedded in the AI, computer vision, and robotics world, the fuzziness of day-to-day life is more apparent.
Just as expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.
I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
>it seems some people are really put off by this type of approach
As someone who has had to interact with legacy enterprise systems via RPA (screen scraping and keystroke recording) it is absolutely awful, incredibly brittle, and unmaintainable once you get past a certain level of complexity. Even when it works, performance at scale is terrible.
Adding a neural network in the middle suddenly makes these things less brittle. We are approaching the point where this kind of "hacky glue" is almost scalable.
It's Postel's Law, on steroids. Be liberal in what you accept (with LLMs, that means 'literally anything'), but strict in what you return (which in an LLM is still 'literally anything' but you can constrain that).
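Concretely, "strict in what you return" with an LLM in the middle tends to look like validate-and-retry against some output contract. A sketch only - call_llm stands in for whatever client you use, and the two required keys are made up:

    import json

    REQUIRED_KEYS = {"action", "target"}  # hypothetical output contract

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # stand-in for your actual model client

    def constrained_reply(prompt: str, retries: int = 3) -> dict:
        prompt += '\nReply with JSON only: {"action": ..., "target": ...}'
        for _ in range(retries):
            raw = call_llm(prompt)
            try:
                reply = json.loads(raw)
            except json.JSONDecodeError:
                continue  # liberal in what we accept, strict in what we pass on
            if isinstance(reply, dict) and REQUIRED_KEYS <= reply.keys():
                return reply
        raise ValueError("model never matched the output contract")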
Beyond that, I can't help but think of the old thin vs. thick client debate, and I would argue that "software should just talk via APIs" is why, in the web space, everybody is blowing time and energy on building client/server architectures and SPAs instead of basic-ass full-stacks.
It's basically the digital equivalent of humanoid robots - people object because having computers interact with a browser, like building a robot in the form of a human, is incredibly inefficient in theory or if you're designing a system from scratch.
The problem is that we're not starting from scratch - we have a web engineered for browser use and a world engineered for humanoid use. That means an agent that can use a browser, while less efficient than an agent using APIs at any particular task, is vastly more useful because it can complete a much greater breadth of tasks. Same thing with humanoid robots - not as efficient at cleaning the floor as my purpose-built Roomba, but vastly more useful because the breadth of tasks it can accomplish means it can be doing productive things most of the time, as opposed to my Roomba, which is not in use 99% of the time.
I do think that once AI agents become common, the web will increasingly be designed for their use and will move away from the browser, but that will probably take a comparable amount of time as it did for the mobile web to emerge after the iPhone came out. (Actually that's probably not true - it'll take less time, because AI will be doing the work instead of humans.)
Yes, but my friend would say, all these websites/software should just publish an API and if they don't that's just incompetence/laziness/stupidity. But a "should" doesn't matter. Changing human nature is so immensely difficult, but it feels easy to say "everyone should just [...]". Seems to be a gap in thinking that's hard to bridge.
Totally agree. A general-purpose solution that ties together different messy interfaces will win in the long run - e.g. the IP protocol, copy-paste, browsers. In each case, they provide a single way for different aspects of computing to collaborate. As mentioned before, semantic web initiatives did not succeed, and I think there's an important lesson there.
I recall 90's Macs had a 3rd party app that offered to observe your mouse/keyboard then automatically recommend routine tasks for you. As a young person I found that fascinating. It's interesting to see history renew itself.
If you want an API, have Claude procedurally test actions and then write a pyautogui/pywinauto/autohotkey etc script to perform it instead. Have it auto-test to verify and classify the general applicability of each action. Repeat for all of life...
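A rough sketch of that loop - ask_claude is a stand-in for a real Anthropic API call, and the pass/fail signal here is just the script's exit code, which is the crude part:

    import subprocess
    import tempfile

    def ask_claude(prompt: str) -> str:
        raise NotImplementedError  # stand-in for a real API call

    def synthesize_action_script(action: str, max_attempts: int = 3) -> str:
        for _ in range(max_attempts):
            script = ask_claude(f"Write a pyautogui script that: {action}")
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(script)
                path = f.name
            result = subprocess.run(["python", path], capture_output=True, timeout=120)
            if result.returncode == 0:
                return path  # cache this as the reusable "API" for the action
            # Feed the failure back so the next attempt can self-correct.
            action += f"\nPrevious attempt failed with: {result.stderr.decode()[:500]}"
        raise RuntimeError("could not synthesize a working script")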
The code behind the GUI can be arbitrarily obscure. The only reliable way to understand its meaning in the general case is to run it and look at the rendered image. Trying to build a model that implicitly develops an alternative implementation of a browser inside of it sounds worse than just using an existing browser directly.
More often than not you don't have access to the underlying code, or the vendor has interest against you being able to automate it since the complexity is a part of their business model.
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and relatively better than ChatGPT and GitHub copilot for daily use. I've been pro for maybe 6 months. I'm not a power user leveraging their API or anything. Just the chat interface, though with ever more use of Projects, lately. I use it every day, whether for mundane answers or curiosities, to "write me this code", to general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.
I do still occasionally pop over to ChatGPT to test the waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
It's possible that they've determined that Opus no longer makes sense if they're able to focus on continuously optimising Sonnet. That said, Anthropic have been relatively good at setting and managing expectations, so today would have been a good time to make that clear.
If you're in some niche doing AI development work, you are going to be outcompeted by more generalized AI at some point.
In this case I doubt they're networked apps so they probably don't have a server API.
Computer use API documentation: https://docs.anthropic.com/en/docs/build-with-claude/compute...
Computer Use Demo: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
So for example, Perplexity is wrong here implying that Opus is better than Sonnet?
https://i.imgur.com/N58I4PC.png
That's why they release them with that skew.
Not most advanced
Isn't the GUI driven by code? Can anything at all in the GUI work that can't be done programmatically?
Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming "Later this year" https://web.archive.org/web/20241008222204/https://docs.anth...
The fact that it's no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.
> i don't write the docs, no clue
> afaik opus plan same as its ever been