I spoke with an Amazon AI production engineer who's talking with prospective clients about implementing AI in our business. When a colleague asked about using generative AI in customer-facing chats, the engineer said he knows of zero companies that don't have a human in the loop. All the automatic replies are non-generative "old" tech. Gen AI is just not reliable enough for anyone to stake their reputation on it.
Years ago I was interested in agents that used "old AI" symbolic techniques backed by classical machine learning. I kept getting hired, though, by people who were working on pre-transformer neural nets for text.
Something I knew all along was that you build the system that lets you do it with a human in the loop, collect evaluation and training data [1], and then build a system that can do some of the work and possibly improve the quality of the rest of it.
[1] In that order, because for any 'subjective' task you will need to evaluate the symbolic system even if you don't need to train it -- if you need to train the system, on the other hand, you'll still need to evaluate it.
Air Canada did this a while ago, and their AI gave a customer a fake process for submitting claims for some sort of bereavement discount on airfare (the flight was for a funeral). The customer sued, and Air Canada's defense was that he shouldn't have trusted the Air Canada AI chatbot. Air Canada lost.
Of course it can, but I think the issue is that people may try to jailbreak it or do something funny to get a weird response, then post it on x.com against the company. There must be techniques to turn LLMs into a FAQ-forwarding bot, but then what's the point of having an LLM?
Gen AI can only support people. In our case it scans incoming emails for patterns such as order or article numbers, whether the customer is already known, etc. (a rough sketch of that kind of scan follows this comment).
That isn't reliable either, but it supports the person who ends up with the mail on their desk.
We sometimes get handwritten service protocols, and the model we are using is very proficient at reading handwritten notes that you would have difficulty parsing yourself.
It works most of the time, but not often enough that the AI could send autogenerated answers on its own.
For service quality reasons we don't want to impose any chatbot or AI on a customer.
Also, data protection issues arise with most AI services today, so parsing customer contact info is a problem as well. We also have to rely on service partners to tell the truth about not using any data...
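A minimal sketch of the kind of pattern scan described in the comment above, assuming a plain-Python setup: the regexes, the `known_customers` lookup, and the field names are illustrative assumptions, not the commenter's actual system, and the output is only shown to the human handling the mail.

```python
import re

# Illustrative patterns; real order/article number formats would differ.
ORDER_RE = re.compile(r"\border[- ]?(?:no\.?|number)?[:\s#]*([A-Z0-9-]{6,})\b", re.I)
ARTICLE_RE = re.compile(r"\bart(?:icle)?[- ]?(?:no\.?|number)?[:\s#]*([A-Z0-9-]{4,})\b", re.I)

def scan_incoming_mail(body: str, sender: str, known_customers: set[str]) -> dict:
    """Extract hints for the person who handles the mail; never auto-reply."""
    return {
        "order_numbers": ORDER_RE.findall(body),
        "article_numbers": ARTICLE_RE.findall(body),
        "customer_known": sender.lower() in known_customers,
    }

hints = scan_incoming_mail(
    body="Hello, my order number: AB-123456 arrived damaged.",
    sender="jane@example.com",
    known_customers={"jane@example.com"},
)
print(hints)  # displayed to the support person, not sent to the customer
```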
One thing I'll add that isn't touched on here is about context windows. While not "infinite", humans have a very large context window for problems they're specialized in solving. Models can often overcome their context window limitations by having larger and more diverse training sets, but that still isn't really a solution to context windows.
Yes, I get that the context window increases over time and that for many purposes it's already sufficient, but the current paradigm forces you to compress your personal context into a prompt to produce a meaningful result. In a language as malleable as English, this doesn't feel like engineering so much as incantations and guessing. We're losing so, so much by skipping determinism.
Humans don't have this fixed split into "context" and "weights", at least not over non-trivial time spans.
For better or worse, everything we see and do ends up modifying our "weights", which is something current LLMs just architecturally can't do since the weights are read-only.
This is why I actually argue that LLMs don't use natural language. Natural language isn't just what's spoken by speakers right now. It's a living thing. Every day in conversation with fellow humans your very own natural language model changes. You'll hear some things for the first time, you'll hear others less, you'll say things that get your point across effectively first time, and you'll say some things that require a second or even third try. All of this is feedback to your model.
All I hear from LLM people is "you're just not using it right" or "it's all in the prompt" etc. That's not natural language. That's no different from programming any computer system.
I've found LLMs to be quite useful for language stuff like "rename this service across my whole Kubernetes cluster". But when it comes to specific things like "sort this API endpoint alphabetically", I find the amount of time it takes to learn to construct an appropriate prompt is the same as if I'd just learnt to program, which I already have done. And then there's the energy used by the LLM to do its thing, which is enormously wasteful.
I agree, I'm mostly trying to illustrate how difficult it is to fit our working model of the world into the LLM paradigm. A lot of comments here keep comparing the accuracy of LLMs with humans and I feel that glosses over so much of how different the two are.
Honestly we have no idea what the human split is between "context" and "weights", aside from a superficial understanding that there are long-term and short-term memories. Long-term memory/experience seems a lot closer to context than it is to dynamic weights. We don't suddenly forget how to do a math problem when we pick up an instrument (i.e. our "weights" don't seem to update as easily and quickly as context does for an LLM).
> humans have a very large context window for problems they're specialized in solving
Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.
Can you provide some examples of problems where humans have such large context windows?
> Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.
Human context windows are not linear. They have "holes" in them which are quickly filled with extrapolation that is frequently correct.
It's why you can give a human an entire novel, say "Christine" by Stephen King, then ask them questions about some other novel until their "context window" is filled, then switch to questions about "Christine" and they'll "remember" that they read the book (even if they get some of the details wrong).
> Can you provide some examples of problems where humans have such large context windows?
See above.
The reason is that humans don't just have a "context window", they have a working memory that is also their primary source of information.
IOW, if we change LLMs so that each query modifies the weights (i.e. each query is also another training data-point), then you wouldn't need a context window.
With humans, each new problem effectively retrains the weights to incorporate the new information. With current LLMs the architecture does not allow this.
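A toy sketch of the distinction being drawn here, using a tiny PyTorch model as a stand-in: "context" is just extra input passed alongside the query while the weights stay frozen, whereas "each query is also a training data point" means taking a gradient step on every interaction. This is only an illustration of the architectural difference the commenters are pointing at, not how production LLMs work.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # stand-in for a model's weights
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def answer_with_context(query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    # "Context window" approach: weights stay frozen, the context is just
    # more input that has to fit in alongside the query every time.
    with torch.no_grad():
        return model((query + context) / 2)

def answer_and_learn(query: torch.Tensor, feedback: torch.Tensor) -> torch.Tensor:
    # "Each query is also a training data point": the weights themselves
    # move a little after every interaction, so nothing needs to be
    # re-sent next time.
    pred = model(query)
    loss = nn.functional.mse_loss(pred, feedback)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return pred.detach()
```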
It's a very large context window, but it is compressed down a lot. I don't know every line of [insert your PL of choice]'s standard library, but I do know a lot of it, with many different excerpts from the documentation, relevant experiences where I used this over that, or edge cases/bugs that one might fall into. Add to it all the domain knowledge for the given project, with explicit knowledge of how the clients will use the product, etc., but even stuff like how your colleague might react to this approach vs another.
And all this can be combined in novel ways and reasoned over to come up with new stuff to put into the "context window", and it can be dynamically extended at any point (e.g. you recall something similar during a train of thought and "bring it into context").
And all this was only the current task-specific window, which lives inside the sum total of your human experience window.
If you're 50 years old, your personality is a product of 50-ish years. Another way to say this is that humans have a very large context window (that can span multiple decades) for solving the problem of presenting a "face" to the world (socializing, which is something humans in general are specifically good at).
Human multi-step workflows tend to have checkpoints where the work is validated before proceeding further, as humans generally aren't 99%+ accurate either.
I'd imagine future agents will include training to design these checks into any output, validating against the checks before proceeding further. They may even include some minor risk assessment beforehand, such as "this aspect is crucial and needs to be 99% correct before proceeding further".
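A rough sketch of what such built-in checkpoints might look like; the `Step` structure, the `risk` score, and `escalate_to_human` are hypothetical placeholders, not any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[], str]           # produces some output
    validate: Callable[[str], bool]  # cheap check of that output
    risk: float                      # 0.0 (trivial) .. 1.0 (crucial)

def run_with_checkpoints(steps: list[Step], risk_threshold: float = 0.8) -> None:
    for step in steps:
        output = step.run()
        if not step.validate(output):
            raise RuntimeError(f"checkpoint failed at {step.name!r}; stopping")
        if step.risk >= risk_threshold:
            # Crucial aspect: hand off to a human (or a stronger validator)
            # before the workflow is allowed to continue.
            escalate_to_human(step.name, output)

def escalate_to_human(step_name: str, output: str) -> None:
    print(f"[review needed] {step_name}: {output[:80]}")
```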
That's what Claude Code does - it constantly stops and asks you whether you want to proceed, including showing you the suggested changes before they're implemented. Helps with avoiding token waste and 'bad' work.
Lots of applications will have to be redesigned around that. My guess is that microservices architecture will see a renaissance, since it plays well with LLMs.
Somebody will still need to have the entire context, i.e. the full end-to-end use case and corresponding cross-service call stack. That's the biggest disadvantage of microservices, in my experience, especially if service boundaries are aligned with team boundaries.
On the other hand, if LLMs are doing the actual service development, that's something software engineers could be doing :)
My AI tool use has been a net positive experience at work. It can take over small tasks when I need a break, clean up or start momentum, and generally provide a good helping hand. But even if it could do my job, the costs pile up really quickly. Claude Code can easily burn $25 per 1-2 hours on a large codebase, and that's creeping along at a net positive rate, assuming I can keep it on task and provide corrections. If you automate the corrections, we're up to $50/hr or some tradeoff of speed, accuracy, and cost.
Same as it's always been.
For agents, that triangle is not very well quantified at the moment, which makes all these investigations interesting but still risky.
My somewhat cynical 2 cents: these thinking LLMs that constantly re-prompt themselves in a loop to fix their own mistakes, combined with the 'you don't need RAG, just dump all the code into the 1M-token context window' approach, align well with the 'we charge per token' business model.
One of the ideas I'm playing with is producing several AI-generated rough drafts of a commit at the outset, and then filtering these both manually and with some automation before manual refinement.
_Knowing how way leads to way_, the larger the task, the more chance there is for an early deviation to doom the viability of the solution as a whole. Thus, even for the SOTA right now, agents that can work in parallel to generate several different solutions can reduce the time you spend manually refactoring the generation. I wrote a little about that process here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
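A sketch of the "several rough drafts, then filter" idea: generate N candidates in parallel and keep only the ones that pass an automated check before any manual review. `generate_draft` and `passes_checks` are stand-ins for the actual agent runs and automations, not anything from the linked write-up.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_draft(task: str, seed: int) -> str:
    # Placeholder for one agent run producing a candidate diff/commit.
    return f"candidate patch for {task!r} (variant {seed})"

def passes_checks(draft: str) -> bool:
    # Placeholder for automated filtering: does it build, do tests pass, etc.
    return "variant" in draft

def candidate_drafts(task: str, n: int = 4) -> list[str]:
    with ThreadPoolExecutor(max_workers=n) as pool:
        drafts = list(pool.map(lambda s: generate_draft(task, s), range(n)))
    # Only the surviving candidates go on to manual refinement.
    return [d for d in drafts if passes_checks(d)]

print(candidate_drafts("fix flaky retry logic"))
```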
"The real challenge isn't AI capabilities, it's designing tools and feedback systems that agents can actually use effectively." - this part I agree with - I'd been sitting the AI stuff out because I was unclear where I thought the dust would settle or what the market would accept, but recently joined a very small startup focused on building an agent.
I've gone from skeptical to willing to humor to "yeah this is probably right" in about 5 months. Basically I believe: if you scope the subject matter very, very well, and then focus on the tooling that the model will require to do its task, you get a high completion rate. There is a reluctance to lean into the non-deterministic nature of the models, but actually, if you provide really excellent tooling and scope super narrowly, it's generally acceptably good.
This blog post really makes the tooling part seem hard, and, well... it is, but not that hard - we'll see where this all goes, but I remain optimistic.
He didn't say he made 12 independent saleable products; he says he built 12 tools that fill a need at his job and are used in production. They are probably quite simple and do a very specific task, as the whole article is telling us that we have to keep it simple to have something usable.
His full time job is building AI systems for others (and the article is a well written promo piece).
If most of these are one-shot deterministic workflows (as opposed to the input-LLM-tool loop usually meant by the current use of the term "AI agent"), it's not hard to assume you can build, test, and deploy one in a month on average.
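For contrast, a sketch of the two shapes being distinguished here: a one-shot workflow calls the model once at a fixed step in an otherwise ordinary pipeline, while an "agent" keeps looping over model output and tool calls. `call_llm` and `run_tool` are stand-ins for whatever client and tools are actually used.

```python
def call_llm(prompt: str) -> str:
    return "..."          # stand-in for a model call

def run_tool(action: str) -> str:
    return "tool output"  # stand-in for executing a tool

def one_shot_workflow(document: str) -> str:
    # Deterministic pipeline: one model call at a fixed step, then normal code.
    summary = call_llm(f"Summarise this document:\n{document}")
    return summary.strip()

def agent_loop(goal: str, max_turns: int = 10) -> str:
    # Input -> LLM -> tool -> LLM ... until the model says it's done.
    transcript = goal
    for _ in range(max_turns):
        action = call_llm(transcript)
        if action.startswith("DONE"):
            return action
        transcript += "\n" + run_tool(action)
    return transcript
```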
I also build agents/AI automation for a living. Coding agents or anything open-ended is just a stupid idea. It's best to have human-validated checkpoints, small search spaces, and very specific questions/prompts ("does this email contain an invoice? YES/NO"; a sketch of that kind of constrained check follows below).
Just because we'd love to have fully intelligent, automatic agents doesn't mean the tech is here. I don't work on anything that generates content (text, images, code). It's just slop and will bite you in the ass in the long run anyhow.
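A sketch of the narrow, small-search-space prompt style described above; the `complete` function stands in for whichever LLM client is actually in use, and the exact wording is an assumption.

```python
def complete(prompt: str) -> str:
    # Stand-in for a call to whatever LLM API/client the team uses.
    raise NotImplementedError

def contains_invoice(email_text: str) -> bool:
    prompt = (
        "Answer with exactly YES or NO and nothing else.\n"
        "Does the following email contain an invoice?\n\n"
        f"{email_text}"
    )
    answer = complete(prompt).strip().upper()
    if answer not in {"YES", "NO"}:
        # The tiny search space makes failures obvious; route them to a human.
        raise ValueError(f"unexpected model output: {answer!r}")
    return answer == "YES"
```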
I am also building an agent framework and have also used chat coding (not vibe coding) to generate work - I was easily able to save 50% of my time just by asking GPT.
But it makes mistakes maybe 1 in 10 times, and I don't see that getting fixed unless we drastically change the LLM architecture. In the future I am sure we will have much more robust systems, if the current hype cycle doesn't ruin its trust with devs.
But the hit is real. I mean, I would hire a lot less if I were hiring now, as I can clearly see the dev productivity boost. The learning curve for most topics is also drastically reduced, as the loss in Google search result quality is now supplemented by LLMs.
But one thing I can vouch for is automation and more streamlined workflows: normal human tasks being augmented by an LLM inside a workflow orchestration framework. The LLM can return a confidence % along with the task results, and for anything less than an ideal confidence % the workflow framework can fall back on a human (sketched after this comment). If done correctly, with proper testing, guardrails and all, I can see LLMs replacing human agents in several non-critical tasks within such workflows.
The point is not replacing humans but automating most of the work so the team size would shrink. For example, large e-commerce firms have hundreds of employees manually verifying product descriptions, images, etc., scanning for anything from typos to image mismatches, to name a few. I can see LLMs doing their job in the future.
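A sketch of the confidence-threshold fallback described above, using the commenter's idea of a self-reported confidence score; the verification function, the threshold, and `send_to_human_queue` are illustrative assumptions.

```python
CONFIDENCE_THRESHOLD = 0.9  # below this, a human takes over

def llm_verify_product(description: str, image_alt: str) -> tuple[bool, float]:
    # Stand-in for an LLM call returning (looks_ok, self-reported confidence).
    return True, 0.72

def send_to_human_queue(item_id: str, reason: str) -> None:
    print(f"queued {item_id} for human review: {reason}")

def verify_listing(item_id: str, description: str, image_alt: str) -> None:
    looks_ok, confidence = llm_verify_product(description, image_alt)
    if confidence < CONFIDENCE_THRESHOLD:
        send_to_human_queue(item_id, f"low confidence ({confidence:.2f})")
    elif not looks_ok:
        send_to_human_queue(item_id, "model flagged a mismatch")
    # else: passes automatically, no human needed
```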
I just left my CTO job for personal reasons. We tried coding agents, agentic coding, LLM-driven coding, whatever. The code any of these generate is subpar (a junior would get the PR rejected for what it produces), and you just waste so much time prompting and not thinking. People don't know the code anymore, don't check the code, and it's all just gotten very sloppy. So not hiring coders because of AI is a dangerous thing to do, and I'd advise heavily against it. Your risk just got way higher because of hype. Maintainability is out the window, people don't know the code, and there are so many hidden deviations from the spec that it's just not worth it.
the truth is that we stop thinking when we code like that.
In general I would agree; however, the resulting systems of such an approach tend to be "just" expensive workflow systems, which could be built with old tech as well... Where is the real need for anything LLM here?
Extracting structured data from unstructured text comes to mind. We've built workflows that we couldn't before by bridging a non-deterministic gap. It's a business SaaS, but the folks using our software seem to be really happy with the result.
You are 100% right. LLMs are perfect for anything that requires heuristics: "is that project a good fit for client A given the following specifications ... rate it from 1-10", stuff like that. I use it as a solution for an undefined search space/problem, essentially.
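A sketch of that kind of heuristic scoring, asking the model for structured JSON so the rest of the workflow stays deterministic; `complete` again stands in for the actual model client, and the schema is made up for illustration.

```python
import json

def complete(prompt: str) -> str:
    # Stand-in for the actual LLM client call.
    raise NotImplementedError

def rate_project_fit(client_profile: str, project_spec: str) -> dict:
    prompt = (
        "Is this project a good fit for the client below? "
        'Reply with JSON only, e.g. {"fit": 7, "reason": "..."}, '
        "where fit is an integer from 1 to 10.\n\n"
        f"Client:\n{client_profile}\n\nProject:\n{project_spec}"
    )
    result = json.loads(complete(prompt))
    if not 1 <= int(result["fit"]) <= 10:
        raise ValueError("fit score out of range")
    return result
```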
It would take months with old tech to create a bot that can check multiple websites for specific data or information? So an LLM reduces the time a lot? Am I wrong?
Human validation is certainly the most reliable way of introducing checkpoints, but there are others: running unit tests, doing ad-hoc validations of the entire system, etc.
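One non-human checkpoint of the kind mentioned here, sketched as "run the test suite and refuse to continue on failure"; the pytest invocation is one plausible choice, not a prescription, and the surrounding agent logic is elided.

```python
import subprocess

def tests_pass(workdir: str) -> bool:
    # Run the project's test suite as an automated checkpoint.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

def checkpointed_apply(change_description: str, workdir: str) -> None:
    # ... agent applies its proposed change to workdir here ...
    if not tests_pass(workdir):
        raise RuntimeError(f"tests failed after {change_description!r}; rolling back")
```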
If the LLM can’t answer a query it usually forwards the chat to a human support agent.
This is not part of a defined workflow that requires structured output.
On a personal note, I'm happy to hear that. I've been apprehensive and haven't tried it, purely due to my fear of the cost.
This cost scaling will be an issue for this whole AI employee thing, especially because I imagine these providers are heavily discounting.
It's hard to make *one* good product (see startup failure rates). You couldn't make 12 (seemingly as a solo dev?) and you're surprised?
We've been working on Definite[0] for 2 years with a small team, and it only started getting really good in the past 6 months.
0 - data stack + AI agent: https://www.definite.app/
> agents that technically make successful API calls but can't actually accomplish complex workflows because they don't understand what happened.
It takes a long time to get these things right.
Something seems off about that...
I wrote a little about one such task, getting agents to supplement my markdown dev-log here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
The fundamental difference is that we need HITL (human-in-the-loop) to reduce errors, instead of HOTL (human-on-the-loop), which leads to the errors you mentioned.