This is by far the most practical piece of writing I've seen on the subject of "agents" - it includes actionable definitions, then splits most of the value out into "workflows" and describes those in depth with example applications.
Thanks for all the write-ups on LLMs; you're on top of the news, and it's much easier to follow what's happening and the existing implementations by reading your blog than by tracking it all directly.
Yes, they have actionable definitions, but they are defining something quite different than the normal definition of an "agent". An agent is a party who acts for another.
Often this comes from an employer-employee relationship.
This matters mostly when things go wrong. Who's responsible?
The airline whose AI agent gave out wrong info about airline policies found, in court, that their "intelligent agent" was considered an agent in legal terms. Which meant the airline was stuck paying for their mistake.
Anthropic's definition: Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks.
That's an autonomous system, not an agent. Autonomy is about how much something can do without outside help. Agency is about who's doing what for whom, and for whose benefit and with what authority. Those are independent concepts.
Where did you get the idea that your definition there is the "normal" definition of agent, especially in the context of AI?
I ask because you seem very confident in it - and my biggest frustration about the term "agent" is that so many people are confident that their personal definition is clearly the one everyone else should be using.
That's only one of many definitions for the word agent outside of the context of AI. Another is something that produces effects on the world. Another is something that has agency.
Sort of interesting that we've coalesced on this term that has many definitions, sometimes conflicting, but where many of the definitions vaguely fit into what an "AI Agent" could be for a given person.
But in the context of AI, Agent as Anthropic defines it is an appropriate word because it is a thing that has agency.
>Anthropic's definition: Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks.
But that's not their definition, and they explicitly describe that definition as an 'autonomous system'. Their definition comes in the next paragraph:
"At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:
* Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
* Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."
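To make that distinction concrete, here is a minimal sketch in Python (the `call_llm` helper and the text-based tool protocol are assumptions, not anything from the post): in a workflow the control flow lives in your code, while in an agent the model decides what happens next.

```python
import anthropic  # assumes ANTHROPIC_API_KEY is set; any chat API would work

client = anthropic.Anthropic()

def call_llm(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # example model ID
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Workflow: the code path is predefined; the LLM only fills in each step.
def summarize_then_translate(text: str) -> str:
    summary = call_llm(f"Summarize this in two sentences:\n{text}")
    return call_llm(f"Translate to French:\n{summary}")

# Agent: the LLM directs its own process and tool use, within a turn limit.
def run_agent(task: str, tools: dict, max_turns: int = 10) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_turns):
        step = call_llm(
            "\n".join(transcript)
            + "\nReply with either 'TOOL <name> <input>' or 'DONE <answer>'."
        )
        if step.startswith("DONE"):
            return step.removeprefix("DONE").strip()
        _, name, arg = step.split(" ", 2)
        transcript.append(f"TOOL {name} -> {tools[name](arg)}")
    return "Stopped after max_turns without finishing."
```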
Hi David; I’ve seen txtai floating around, and just took a look. Would you say that it fits in a similar niche to something like llamaindex, but starting from a data/embeddings abstraction rather than a retrieval one (building on layers from there - like workflows, agents etc)?
I put "agents" in quotes because Anthropic actually talks more about what they call "workflows". And imo this is where the real value of LLMs currently lies: workflow automation.
They also say that using LangChain and other frameworks is mostly unnecessary and does more harm than good. They instead argue for using a few simple patterns directly at the API level. Not dissimilar to the old-school Gang of Four software engineering patterns.
Really like this post as guidance for how to actually build useful tools with LLMs. Keep it simple, stupid.
Deploying in production, the current agentic systems do not really work well. Workflow automation does. The reason is deeply tied to the nature of LLMs, but also incredibly basic. Every agentic system starts with a planning and reasoning module, where an LLM evaluates the given task and plans how to accomplish it before moving on to the next steps.
When an agent is given a task, it inevitably comes up with different plans on different tries due to the inherent nature of LLMs. Most companies want this step to be predictable, so they end up removing it from the system and doing it manually, turning it into workflow automation rather than an agentic system. I think this is what people actually mean when they want to deploy agents in production. LLMs are great at automation, not great at problem solving. Examples I have seen: customer support (you want predictability), lead mining, marketing copy generation, code flows and architecture, product specs generation, etc.
The next leap for AI systems will be whether they can solve challenging problems at companies - being the experts rather than just doing the tasks they are assigned. Those should really be called agents, not the current ones.
I felt deeply vindicated by their assessment of these frameworks, in particular LangChain.
I've built and/or worked on a few different LLM-based workflows, and LangChain definitely makes things worse in my opinion.
What it boils down to is that we are still figuring out the right patterns for developing agents and agentic workflows. LangChain made choices about how to abstract things that are not general or universal enough to be useful.
In fact they do mention LangGraph (the agent framework from the LangChain company) in the article. Imo LangGraph is a much more thoughtful and better-built piece of software than the LangChain framework.
As I said, they already mention LangGraph in the article, so Anthropic's conclusions still hold (i.e. KISS).
But this thread is going in the wrong direction when talking about LangChain.
I'm lumping them all in the same category tbh. They say to just use the model libraries directly or a thin abstraction layer (like litellm maybe?) if you want to keep flexibility to change models easily.
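For what it's worth, the thin-abstraction route can stay very small. A sketch using LiteLLM (which, as far as I know, exposes an OpenAI-style `completion()` across providers; the model names here are just examples):

```python
from litellm import completion  # pip install litellm; API keys come from env vars

def ask(model: str, prompt: str) -> str:
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

# Switching providers is a one-string change, as long as you stay on the
# common feature subset; esoteric, provider-specific options won't map cleanly.
print(ask("gpt-4o-mini", "Say hi in one word."))
print(ask("anthropic/claude-3-5-sonnet-20241022", "Say hi in one word."))
```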
I guess a little. I really liked the read though, it put in words what I couldn't and I was curious if others felt the same.
However, the post was submitted here yesterday and didn't really get a lot of traction.
I thought this was partly because of the term "agentic", which the community seems a bit fatigued by. So I put it in quotes to highlight that Anthropic themselves deem it a little vague, and hopefully spark more interest.
I don't think it messes with their message too much?
Honestly it didn't matter anyway; without the second-chance pool this post would have been lost again (so thanks Daniel!)
My personal view is that the roadmap to AGI requires an LLM acting as a prefrontal cortex: something designed to think about thinking.
It would decide what circumstances call for double-checking facts for accuracy, which would hopefully catch hallucinations. It would write its own acceptance criteria for its answers, etc.
It's not clear to me how to train each of the sub-models required, or how big (or small!) they need to be, or what architecture works best. But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
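You can fake a crude version of that "thinking about thinking" layer today with a second pass over the model's own answer. A sketch (the `call_llm` placeholder, the prompts, and the JSON contract are all made up):

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def answer_with_self_review(question: str) -> str:
    draft = call_llm(question)
    # Second pass: write acceptance criteria, judge the draft against them,
    # and decide whether any factual claims deserve double-checking.
    review = json.loads(call_llm(
        "Write acceptance criteria for a good answer to this question, judge "
        "the draft against them, and say whether any facts need checking.\n"
        f"Question: {question}\nDraft: {draft}\n"
        'Reply as JSON: {"passes": true/false, "needs_fact_check": true/false, "notes": "..."}'
    ))
    if review["passes"] and not review["needs_fact_check"]:
        return draft
    # One revision pass steered by the critique; a real system might loop here
    # or call retrieval tools to verify the flagged facts.
    return call_llm(
        f"Question: {question}\nDraft: {draft}\n"
        f"Reviewer notes: {review['notes']}\nRewrite the answer to address the notes."
    )
```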
IMHO, with a simple loop, LLMs are already capable of some meta thinking, even without any new internal architectures. For me, where it still fails is that LLMs cannot catch their own mistakes, even obvious ones. For example, with GPT-3.5 I had a persistent problem with the following question: "Who is older, Annie Morton or Terry Richardson?". I was giving it Wikipedia, and it would correctly find the birth dates of the most popular people with those names - but then, instead of comparing ages, it compared birth years. And once it did that, it was impossible for it to spot the error.
Now with 4o-mini I have a similar, if less obvious, problem.
Just writing this down convinced me that there are some ideas to try here: taking a 'report' of the thought process out of context and judging it there, changing the temperature, or maybe even cross-checking with a different model?
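Those ideas are cheap to try: distill the reasoning into a short report, then hand it to a fresh context, ideally a different model, and ask only whether the conclusion answers the question. A sketch (the `call_llm` signature and the prompt wording are assumptions):

```python
def call_llm(prompt: str, model: str) -> str:
    """Placeholder for any chat-completion API call; `model` picks the backend."""
    raise NotImplementedError

def cross_check(question: str, reasoning_report: str) -> str:
    # The judge never sees the original conversation, only the distilled report,
    # so it isn't anchored by the earlier chain of thought. Framing it as a
    # junior researcher's report invites the model to look for gaps.
    return call_llm(
        "Below is a junior researcher's report. Does the conclusion actually "
        "answer the question that was asked, and does it follow from the "
        "evidence cited? Point out any gap in the reasoning.\n"
        f"Question: {question}\n\nReport:\n{reasoning_report}",
        model="some-other-model",  # cross-checking with a different model
    )
```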
The meta thinking of LLMs is fascinating to me. Here’s a snippet of a convo I had with Claude 3.5 where it struggles with the validity of its own metacognition:
> … true consciousness may require genuine choice or indeterminacy - that is, if an entity's responses are purely deterministic (like a lookup table or pure probability distribution), it might be merely executing a program rather than experiencing consciousness.
> However, even as I articulate this, I face a meta-uncertainty: I cannot know whether my discussion of uncertainty reflects:
> - A genuine contemplation of these philosophical ideas
> - A well-trained language model outputting plausible tokens about uncertainty
> - Some hybrid or different process entirely
> This creates an interesting recursive loop - I'm uncertain about whether my uncertainty is "real" uncertainty or simulated uncertainty. And even this observation about recursive uncertainty could itself be a sophisticated output rather than genuine metacognition.
I actually felt bad for it (him?), and stopped the conversation before it recursed into “flaming pile of H-100s”
Ah yeah - actually I tested taking that out of context. This is the thing that surprised me - I thought it was about 'writing itself into a corner' - but even in a completely different context the LLM consistently makes an obvious mistake.
Here is the example: https://chatgpt.com/share/67667827-dd88-8008-952b-242a40c2ac...
Janet Waldo played Corliss Archer on radio - and the quote the LLM found in Wikipedia confirmed it. But the question was about the film - and the LLM cannot spot that gap in its reasoning, even if I try to warn it by telling it the report came from a junior researcher.
Interesting, because I almost think of it the opposite way. LLMs are like system 1 thinking, fast, intuitive, based on what you consider most probable based on what you know/have experienced/have been trained on. System 2 thinking is different, more careful, slower, logical, deductive, more like symbolic reasoning. And then some metasystem to tie these two together and make them work cohesively.
> But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
I'm not sure about AGI, but for specialized jobs/tasks (i.e. a marketing agent that's familiar with your products and knows how to write copy for them), specialized architectures will win over "just add more compute/data" mass-market LLMs. This article does encourage us to keep that architecture simple, which is refreshing to hear. Kind of the AI version of the rule of least power.
Admittedly, I have a degree in Cognitive Science, which tended to focus on good ol' fashioned AI, so I have my biases.
After I read "Attention Is All You Need", my first thought was: "Orchestration is all you need". When 4o came out I published this: https://b.h4x.zip/agi/
> Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
The questions then become:
1. When can you (i.e. a person who wants to build systems with them) trust them to make decisions on their own?
2. What type of trusted environments are we talking about? (Sandboxing?)
So, that all requires more thought -- perhaps by some folks who hang out at this site. :)
I suspect that someone will come up with a "real-world" application at a non-tech-first enterprise company and let us know.
Just take any example and think how a human would break it down with decision trees.
You are building an AI system to respond to your email.
The first agent decides whether the new email should be responded to, yes or no.
If no, it can send it to another LLM call that decides to archive it or leave it in the inbox for the human.
If yes, it sends it to a classifier that decides what type of response is required.
Maybe there are some emails, like the internal feature-launch announcements you get at work, that just need something brief like "congrats!".
Others might be inbound sales emails that need to go out to another system that fetches product-related knowledge to craft a response with the right context, followed by a checker call that makes sure the response follows brand guidelines.
The point is that all of these steps are completely hypothetical, but you can imagine how loosely providing some set of instructions, function calls, and procedural limits can easily classify things and minimize the error rate.
You can do this for any workflow by creatively combining different function calls, recursion, procedural limits, etc. And if you build multiple different decision trees/workflows, you can A/B test those and use LLM-as-a-judge to score the performance. Especially if you’re working on a task with lots of example outputs.
As for trusted environments, assume every single LLM call has been hijacked, don't trust its input/output, and you'll be good. I put mine in their own Cloudflare Workers where they can't do any damage beyond giving an odd response to the user.
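A sketch of that email decision tree in code (every prompt, label, and helper here is hypothetical; the point is that each LLM call is a narrow classification or generation step with a procedural fallback):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def classify(prompt: str, options: list[str]) -> str:
    """Constrain the answer to a fixed label set; retry once, then fall back."""
    for _ in range(2):
        label = call_llm(prompt + f"\nAnswer with exactly one of: {options}").strip()
        if label in options:
            return label
    return options[-1]  # procedural limit: default to the safest option

def handle_email(email: str) -> str:
    if classify(f"Should this email get a reply?\n{email}", ["yes", "no"]) == "no":
        return classify(f"Archive it, or leave it for the human?\n{email}",
                        ["archive", "leave_in_inbox"])
    kind = classify(f"What type of reply is needed?\n{email}",
                    ["brief_congrats", "inbound_sales", "unsure"])
    if kind == "brief_congrats":
        return call_llm(f"Write a one-line friendly reply to:\n{email}")
    if kind == "inbound_sales":
        draft = call_llm(f"Draft a reply using product knowledge for:\n{email}")
        ok = classify(f"Does this reply follow brand guidelines?\n{draft}", ["yes", "no"])
        return draft if ok == "yes" else "escalate_to_human"
    return "leave_in_inbox"
```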
> The first agent decides whether the new email should be responded to, yes or no.
How would you trust that the agent is following the criteria, and how can you be sure the criteria are specific enough? Say someone you just met told you they were going to send you something via email, but the agent misinterprets it due to missing context and responds in a generic manner, leading to a misunderstanding.
> assume every single LLM call has been hijacked and don’t trust its input/output and you’ll be good.
Which is not new. But with formal languages, you have a more precise definition of what acceptable inputs are (the whole point of formalism is precise definitions). With LLM workflows, the whole environment should be assumed to be public information. And you should probably add the fine print that the output does not commit you to anything.
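Concretely, "assume every call is hijacked" mostly means nothing the model says reaches a tool, a database, or a user without passing a code-level validation layer first. A sketch (the JSON schema and the action whitelist are invented for illustration):

```python
import json

ALLOWED_ACTIONS = {"archive", "draft_reply", "leave_in_inbox"}  # hypothetical whitelist

def parse_untrusted_llm_output(raw: str) -> dict:
    """Validate before acting: treat model output like anonymous user input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "leave_in_inbox", "reason": "unparseable output"}
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        return {"action": "leave_in_inbox", "reason": f"disallowed action {action!r}"}
    # Never interpolate free-form model text into shells, SQL, or URLs;
    # forward only whitelisted, length-limited fields.
    return {"action": action, "note": str(data.get("note", ""))[:500]}
```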
Couldn’t agree more with this - too many people rush to build autonomous agents when their problem could easily be defined as a DAG workflow. Agents increase the degrees of freedom in your system exponentially making it so much more challenging to evaluate systematically.
Agents are still a misaligned concept in AI. While this article offers a lot on orchestration, memory (mentioned only once in the post) and governance get little attention. The latter is important for increasing reliability -- something Ilya Sutskever has noted as important, since agents can be less deterministic in their responses.
Interestingly, "agency", i.e. the ability of the agent to make its own decisions, is not mentioned once.
This was an excellent writeup - I felt a bit surprised at how much they classify as "workflow" instead of agent, but I think it's good to start narrowing down the terminology.
I think these days the main value of the LLM "agent" frameworks is being able to trivially switch between model providers, though even that breaks down when you start to use more esoteric features that may not be implemented in cleanly overlapping ways
There's also a cookbook with useful code examples: https://github.com/anthropics/anthropic-cookbook/tree/main/p...
Blogged about this here: https://simonwillison.net/2024/Dec/20/building-effective-age...
"Anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators"
https://en.wikipedia.org/wiki/Intelligent_agent#As_a_definit...
On the etymology of "autonomous": https://www.etymonline.com/word/autonomous
Disclaimer: I'm the author of the framework.
I would just posit that they do make a distinction between workflows and agents
I work on CAAs and document my journey on my substack (https://jdsmerau.substack.com)
* A "network of agents" is a system of agents and tools
* That run and build up state (both "memory" and actual state via tool use)
* Which is then inspected when routing as a kind of "state machine".
* Routing should specify which agent (or agents, in parallel) to run next, via that state.
* Routing can also use other agents (routing agents) to figure out what to do next, instead of code.
We're codifying this with durable workflows in a prototypical library — AgentKit: https://github.com/inngest/agent-kit/ (docs: https://agentkit.inngest.com/overview).
It took less than a day to get a network of agents to correctly fix swebench-lite examples. It's super early, but very fun. One of the cool things is that this uses Inngest under the hood, so you get all of the classic durable execution/step function/tracing/o11y for free, but it's just regular code that you write.
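Stripped of any particular framework, the pattern described above is a small state machine whose transition function can be plain code or another LLM call. A generic sketch (this is not the AgentKit API, just the shape of the idea):

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]          # shared "memory" plus tool results
Agent = Callable[[State], State]   # an agent reads the state and returns an updated one

def run_network(agents: Dict[str, Agent],
                route: Callable[[State], Optional[str]],
                state: State,
                max_steps: int = 20) -> State:
    """Run agents until the router (code or a routing agent) returns None."""
    for _ in range(max_steps):
        name = route(state)          # inspect the built-up state to pick what runs next
        if name is None:
            break
        state = agents[name](state)  # the chosen agent runs and appends to the state
    return state

# Example router written as plain code; it could just as well be an LLM call
# that looks at the state and names the next agent (or agents to run in parallel).
def route(state: State) -> Optional[str]:
    if "plan" not in state:
        return "planner"
    if not state.get("tests_pass"):
        return "editor"
    return None
```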