I spoke with an Amazon AI production engineer who's talking with prospective clients about implementing AI in our business. When a colleague asked about using generative AI in customer-facing chats, the engineer said he knows of zero companies that don't have a human in the loop. All the automatic replies are non-generative "old" tech. Gen AI is just not reliable enough for anyone to stake their reputation on it.
Years ago I was interested in agents that used "old AI" symbolic techniques backed by classical machine learning. I kept getting hired, though, by people who were working on pre-transformer neural nets for text.
Something I knew all along was that you build the system that lets you do it with a human in the loop, collect evaluation and training data [1], and then build a system that can do some of the work and possibly improve the quality of the rest of it.
[1] In that order, because for any 'subjective' task you will need to evaluate the symbolic system even if you don't need to train it -- if you need to train the system, on the other hand, you'll still need to evaluate it.
Air Canada did this a while ago, and their AI gave a customer a fake process for submitting claims for some sort of bereavement discount on airfare (the flight was for a funeral). The customer sued, and Air Canada's defense was that he shouldn't have trusted the Air Canada AI chatbot. Air Canada lost.
Of course it can, but I think the issue is that people may try to jailbreak it or do something funny to get a weird response, then post it on x.com against the company. There must be techniques to turn LLMs into a FAQ-forwarding bot, but then what's the point of having an LLM?
Gen AI can only support people. In our case it scans incoming emails for patterns such as order or article numbers, whether the customer is already known, etc. (a rough sketch of that kind of scan follows this comment).
That isn't reliable either, but it supports the person who ends up with the mail on their desk.
We sometimes get handwritten service protocols, and the model we are using is very proficient at reading handwritten notes that you would have difficulty parsing yourself.
It works most of the time, but not often enough that the AI could send autogenerated answers on its own.
For service quality reasons we don't want to impose any chatbot or AI on a customer.
Also, data protection issues arise with most AI services today, so parsing customer contact info is a problem as well. We also have to rely on service partners to tell the truth about not using any data...
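A minimal sketch of the kind of pattern scan described in the comment above, assuming a plain-Python setup: the regexes, the `known_customers` lookup, and the field names are illustrative assumptions, not the commenter's actual system, and the output is only shown to the human handling the mail.

```python
import re

# Illustrative patterns; real order/article number formats would differ.
ORDER_RE = re.compile(r"\border[- ]?(?:no\.?|number)?[:\s#]*([A-Z0-9-]{6,})\b", re.I)
ARTICLE_RE = re.compile(r"\bart(?:icle)?[- ]?(?:no\.?|number)?[:\s#]*([A-Z0-9-]{4,})\b", re.I)

def scan_incoming_mail(body: str, sender: str, known_customers: set[str]) -> dict:
    """Extract hints for the person who handles the mail; never auto-reply."""
    return {
        "order_numbers": ORDER_RE.findall(body),
        "article_numbers": ARTICLE_RE.findall(body),
        "customer_known": sender.lower() in known_customers,
    }

hints = scan_incoming_mail(
    body="Hello, my order number: AB-123456 arrived damaged.",
    sender="jane@example.com",
    known_customers={"jane@example.com"},
)
print(hints)  # displayed to the support person, not sent to the customer
```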
One thing I'll add that isn't touched on here is about context windows. While not "infinite", humans have a very large context window for problems they're specialized in solving. Models can often overcome their context window limitations by having larger and more diverse training sets, but that still isn't really a solution to context windows.
Yes, I get that the context window increases over time and that for many purposes it's already sufficient, but the current paradigm forces you to compress your personal context into a prompt to produce a meaningful result. In a language as malleable as English, this doesn't feel like engineering so much as incantations and guessing. We're losing so, so much by skipping determinism.
Humans don't have this fixed split into "context" and "weights", at least not over non-trivial time spans.
For better or worse, everything we see and do ends up modifying our "weights", which is something current LLMs just architecturally can't do since the weights are read-only.
This is why I actually argue that LLMs don't use natural language. Natural language isn't just what's spoken by speakers right now. It's a living thing. Every day in conversation with fellow humans your very own natural language model changes. You'll hear some things for the first time, you'll hear others less, you'll say things that get your point across effectively first time, and you'll say some things that require a second or even third try. All of this is feedback to your model.
All I hear from LLM people is "you're just not using it right" or "it's all in the prompt" etc. That's not natural language. That's no different from programming any computer system.
I've found LLMs to be quite useful for language stuff like "rename this service across my whole Kubernetes cluster". But when it comes to specific things like "sort this API endpoint alphabetically", I find the amount of time it takes to learn to construct an appropriate prompt is the same as if I'd just learnt to program, which I already have done. And then there's the energy used by the LLM to do its thing, which is enormously wasteful.
I agree, I'm mostly trying to illustrate how difficult it is to fit our working model of the world into the LLM paradigm. A lot of comments here keep comparing the accuracy of LLMs with humans and I feel that glosses over so much of how different the two are.
Honestly we have no idea what the human split is between "context" and "weights", aside from a superficial understanding that there are long-term and short-term memories. Long-term memory/experience seems a lot closer to context than it is to dynamic weights. We don't suddenly forget how to do a math problem when we pick up an instrument (i.e. our "weights" don't seem to update as easily and quickly as context does for an LLM).
> humans have a very large context window for problems they're specialized in solving
Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.
Can you provide some examples of problems where humans have such large context windows?
> Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.
Human context windows are not linear. They have "holes" in them which are quickly filled with extrapolation that is frequently correct.
It's why you can give a human an entire novel, say "Christine" by Stephen King, then ask them questions about some other novel until their "context window" is filled, then switch to questions about "Christine" and they'll "remember" that they read the book (even if they get some of the details wrong).
> Can you provide some examples of problems where humans have such large context windows?
See above.
The reason is that humans don't just have a "context window", they have a working memory that is also their primary source of information.
IOW, if we change LLMs so that each query modifies the weights (i.e. each query is also another training data-point), then you wouldn't need a context window.
With humans, each new problem effectively retrains the weights to incorporate the new information. With current LLMs the architecture does not allow this.
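A toy sketch of the distinction being drawn here, using a tiny PyTorch model as a stand-in: "context" is just extra input passed alongside the query while the weights stay frozen, whereas "each query is also a training data point" means taking a gradient step on every interaction. This is only an illustration of the architectural difference the commenters are pointing at, not how production LLMs work.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # stand-in for a model's weights
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def answer_with_context(query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    # "Context window" approach: weights stay frozen, the context is just
    # more input that has to fit in alongside the query every time.
    with torch.no_grad():
        return model((query + context) / 2)

def answer_and_learn(query: torch.Tensor, feedback: torch.Tensor) -> torch.Tensor:
    # "Each query is also a training data point": the weights themselves
    # move a little after every interaction, so nothing needs to be
    # re-sent next time.
    pred = model(query)
    loss = nn.functional.mse_loss(pred, feedback)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return pred.detach()
```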
It's a very large context window, but it is compressed down a lot. I don't know every line of [insert your PL of choice]'s standard library, but I do know a lot of it, with many different excerpts from the documentation, relevant experiences where I used this over that, or edge cases/bugs that one might fall into. Add to it all the domain knowledge for the given project, with explicit knowledge of how the clients will use the product, etc., but even stuff like how your colleague might react to this approach vs another.
And all this can be combined in novel ways and reasoned over to come up with new stuff to put into the "context window", and it can be dynamically extended at any point (e.g. you recall something similar during a train of thought and "bring it into context").
And all this was only the current task-specific window, which lives inside the sum total of your human experience window.
If you're 50 years old, your personality is a product of 50-ish years. Another way to say this is that humans have a very large context window (that can span multiple decades) for solving the problem of presenting a "face" to the world (socializing, which is something humans in general are specifically good at).
Human multi-step workflows tend to have checkpoints where the work is validated before proceeding further, as humans generally aren't 99%+ accurate either.
I'd imagine future agents will include training to design these checks into any output, validating against the checks before proceeding further. They may even include some minor risk assessment beforehand, such as "this aspect is crucial and needs to be 99% correct before proceeding further".
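A rough sketch of what such built-in checkpoints might look like; the `Step` structure, the `risk` score, and `escalate_to_human` are hypothetical placeholders, not any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[], str]           # produces some output
    validate: Callable[[str], bool]  # cheap check of that output
    risk: float                      # 0.0 (trivial) .. 1.0 (crucial)

def run_with_checkpoints(steps: list[Step], risk_threshold: float = 0.8) -> None:
    for step in steps:
        output = step.run()
        if not step.validate(output):
            raise RuntimeError(f"checkpoint failed at {step.name!r}; stopping")
        if step.risk >= risk_threshold:
            # Crucial aspect: hand off to a human (or a stronger validator)
            # before the workflow is allowed to continue.
            escalate_to_human(step.name, output)

def escalate_to_human(step_name: str, output: str) -> None:
    print(f"[review needed] {step_name}: {output[:80]}")
```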
That's what Claude Code does - it constantly stops and asks you whether you want to proceed, including showing you the suggested changes before they're implemented. Helps with avoiding token waste and 'bad' work.
Lots of applications will have to be redesigned around that. My guess is that microservices architecture will see a renaissance, since it plays well with LLMs.
Somebody will still need to have the entire context, i.e. the full end-to-end use case and corresponding cross-service call stack. That's the biggest disadvantage of microservices, in my experience, especially if service boundaries are aligned with team boundaries.
On the other hand, if LLMs are doing the actual service development, that's something software engineers could be doing :)
My AI tool use has been a net positive experience at work. It can take over small tasks when I need a break, clean up or start momentum, and generally provide a good helping hand. But even if it could do my job, the costs pile up really quickly. Claude Code can easily burn $25 per 1-2 hours on a large codebase, and that's creeping along at a net positive rate, assuming I can keep it on task and provide corrections. If you automate the corrections, we're up to $50/hr or some tradeoff of speed, accuracy, and cost.
Same as it's always been.
For agents, that triangle is not very well quantified at the moment, which makes all these investigations interesting but still risky.
My somewhat cynical 2 cents: these thinking LLMs that constantly re-prompt themselves in a loop to fix their own mistakes, combined with the 'you don't need RAG, just dump all the code into the 1M-token context window' approach, align well with the 'we charge per token' business model.
One of the ideas I'm playing with is producing several AI-generated rough drafts of a commit at the outset, and then filtering these both manually and with some automation before manual refinement.
_Knowing how way leads to way_, the larger the task, the more chance there is for an early deviation to doom the viability of the solution as a whole. Thus, even for the SOTA right now, agents that can work in parallel to generate several different solutions can reduce the time you spend manually refactoring the generation. I wrote a little about that process here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
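A sketch of the "several rough drafts, then filter" idea: generate N candidates in parallel and keep only the ones that pass an automated check before any manual review. `generate_draft` and `passes_checks` are stand-ins for the actual agent runs and automations, not anything from the linked write-up.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_draft(task: str, seed: int) -> str:
    # Placeholder for one agent run producing a candidate diff/commit.
    return f"candidate patch for {task!r} (variant {seed})"

def passes_checks(draft: str) -> bool:
    # Placeholder for automated filtering: does it build, do tests pass, etc.
    return "variant" in draft

def candidate_drafts(task: str, n: int = 4) -> list[str]:
    with ThreadPoolExecutor(max_workers=n) as pool:
        drafts = list(pool.map(lambda s: generate_draft(task, s), range(n)))
    # Only the surviving candidates go on to manual refinement.
    return [d for d in drafts if passes_checks(d)]

print(candidate_drafts("fix flaky retry logic"))
```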
"The real challenge isn't AI capabilities, it's designing tools and feedback systems that agents can actually use effectively." - this part I agree with - I'd been sitting the AI stuff out because I was unclear where I thought the dust would settle or what the market would accept, but recently joined a very small startup focused on building an agent.
I've gone from skeptical to willing to humor to "yeah this is probably right" in about 5 months. Basically I believe: if you scope the subject matter very, very well, and then focus on the tooling that the model will require to do its task, you get a high completion rate. There is a reluctance to lean into the non-deterministic nature of the models, but actually, if you provide really excellent tooling and scope super narrowly, it's generally acceptably good.
This blog post really makes the tooling part seem hard, and, well... it is, but not that hard - we'll see where this all goes, but I remain optimistic.
He didn't say he made 12 independent saleable products; he says he built 12 tools that fill a need at his job and are used in production. They are probably quite simple and do a very specific task, as the whole article is telling us that we have to keep it simple to have something usable.
His full time job is building AI systems for others (and the article is a well written promo piece).
If most of these are one-shot deterministic workflows (as opposed to the input-LLM-tool loop usually meant by the current use of the term "AI agent"), it's not hard to assume you can build, test, and deploy one in a month on average.
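For contrast, a sketch of the two shapes being distinguished here: a one-shot workflow calls the model once at a fixed step in an otherwise ordinary pipeline, while an "agent" keeps looping over model output and tool calls. `call_llm` and `run_tool` are stand-ins for whatever client and tools are actually used.

```python
def call_llm(prompt: str) -> str:
    return "..."          # stand-in for a model call

def run_tool(action: str) -> str:
    return "tool output"  # stand-in for executing a tool

def one_shot_workflow(document: str) -> str:
    # Deterministic pipeline: one model call at a fixed step, then normal code.
    summary = call_llm(f"Summarise this document:\n{document}")
    return summary.strip()

def agent_loop(goal: str, max_turns: int = 10) -> str:
    # Input -> LLM -> tool -> LLM ... until the model says it's done.
    transcript = goal
    for _ in range(max_turns):
        action = call_llm(transcript)
        if action.startswith("DONE"):
            return action
        transcript += "\n" + run_tool(action)
    return transcript
```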
I also build agents/AI automation for a living. Coding agents or anything open-ended is just a stupid idea. It's best to have human-validated checkpoints, small search spaces, and very specific questions/prompts ("does this email contain an invoice? YES/NO"; a sketch of that kind of constrained check follows below).
Just because we'd love to have fully intelligent, automatic agents doesn't mean the tech is here. I don't work on anything that generates content (text, images, code). It's just slop and will bite you in the ass in the long run anyhow.
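A sketch of the narrow, small-search-space prompt style described above; the `complete` function stands in for whichever LLM client is actually in use, and the exact wording is an assumption.

```python
def complete(prompt: str) -> str:
    # Stand-in for a call to whatever LLM API/client the team uses.
    raise NotImplementedError

def contains_invoice(email_text: str) -> bool:
    prompt = (
        "Answer with exactly YES or NO and nothing else.\n"
        "Does the following email contain an invoice?\n\n"
        f"{email_text}"
    )
    answer = complete(prompt).strip().upper()
    if answer not in {"YES", "NO"}:
        # The tiny search space makes failures obvious; route them to a human.
        raise ValueError(f"unexpected model output: {answer!r}")
    return answer == "YES"
```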
I am also building an agent framework and have also used chat coding (not vibe coding) to generate work - I was easily able to save 50% of my time just by asking GPT.
But it makes mistakes maybe 1 in 10 times, and I don't see that getting fixed unless we drastically change the LLM architecture. In the future I am sure we will have much more robust systems, if the current hype cycle doesn't ruin its trust with devs.
But the hit is real. I mean, I would hire a lot less if I were hiring now, as I can clearly see the dev productivity boost. The learning curve for most topics is also drastically reduced, as the loss in Google search result quality is now supplemented by LLMs.
But one thing I can vouch for is automation and more streamlined workflows: normal human tasks being augmented by an LLM inside a workflow orchestration framework. The LLM can return a confidence % along with the task results, and for anything less than an ideal confidence % the workflow framework can fall back on a human (sketched after this comment). If done correctly, with proper testing, guardrails and all, I can see LLMs replacing human agents in several non-critical tasks within such workflows.
The point is not replacing humans but automating most of the work so the team size would shrink. For example, large e-commerce firms have hundreds of employees manually verifying product descriptions, images, etc., scanning for anything from typos to image mismatches, to name a few. I can see LLMs doing their job in the future.
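A sketch of the confidence-threshold fallback described above, using the commenter's idea of a self-reported confidence score; the verification function, the threshold, and `send_to_human_queue` are illustrative assumptions.

```python
CONFIDENCE_THRESHOLD = 0.9  # below this, a human takes over

def llm_verify_product(description: str, image_alt: str) -> tuple[bool, float]:
    # Stand-in for an LLM call returning (looks_ok, self-reported confidence).
    return True, 0.72

def send_to_human_queue(item_id: str, reason: str) -> None:
    print(f"queued {item_id} for human review: {reason}")

def verify_listing(item_id: str, description: str, image_alt: str) -> None:
    looks_ok, confidence = llm_verify_product(description, image_alt)
    if confidence < CONFIDENCE_THRESHOLD:
        send_to_human_queue(item_id, f"low confidence ({confidence:.2f})")
    elif not looks_ok:
        send_to_human_queue(item_id, "model flagged a mismatch")
    # else: passes automatically, no human needed
```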
I just left my CTO job for personal reasons. We tried coding agents, agentic coding, LLM-driven coding, whatever. The code any of these generate is subpar (a junior would get the PR rejected for what it produces), and you just waste so much time prompting and not thinking. People don't know the code anymore, don't check the code, and it's all just gotten very sloppy. So not hiring coders because of AI is a dangerous thing to do, and I'd advise heavily against it. Your risk just got way higher because of hype. Maintainability is out the window, people don't know the code, and there are so many hidden deviations from the spec that it's just not worth it.
the truth is that we stop thinking when we code like that.
In general I would agree; however, the resulting systems of such an approach tend to be "just" expensive workflow systems, which could be built with old tech as well... Where is the real need for anything LLM here?
Extracting structured data from unstructured text comes to mind. We've built workflows that we couldn't before by bridging a non-deterministic gap. It's a business SaaS, but the folks using our software seem to be really happy with the result.
You are 100% right. LLMs are perfect for anything that requires heuristics: "is that project a good fit for client A given the following specifications ... rate it from 1-10", stuff like that. I use it as a solution for an undefined search space/problem, essentially.
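A sketch of that kind of heuristic scoring, asking the model for structured JSON so the rest of the workflow stays deterministic; `complete` again stands in for the actual model client, and the schema is made up for illustration.

```python
import json

def complete(prompt: str) -> str:
    # Stand-in for the actual LLM client call.
    raise NotImplementedError

def rate_project_fit(client_profile: str, project_spec: str) -> dict:
    prompt = (
        "Is this project a good fit for the client below? "
        'Reply with JSON only, e.g. {"fit": 7, "reason": "..."}, '
        "where fit is an integer from 1 to 10.\n\n"
        f"Client:\n{client_profile}\n\nProject:\n{project_spec}"
    )
    result = json.loads(complete(prompt))
    if not 1 <= int(result["fit"]) <= 10:
        raise ValueError("fit score out of range")
    return result
```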
It would take months with old tech to create a bot that can check multiple websites for specific data or information? So an LLM reduces the time a lot? Am I wrong?
Human validation is certainly the most reliable way of introducing checkpoints, but there are others: running unit tests, doing ad-hoc validations of the entire system, etc.
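One non-human checkpoint of the kind mentioned here, sketched as "run the test suite and refuse to continue on failure"; the pytest invocation is one plausible choice, not a prescription, and the surrounding agent logic is elided.

```python
import subprocess

def tests_pass(workdir: str) -> bool:
    # Run the project's test suite as an automated checkpoint.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

def checkpointed_apply(change_description: str, workdir: str) -> None:
    # ... agent applies its proposed change to workdir here ...
    if not tests_pass(workdir):
        raise RuntimeError(f"tests failed after {change_description!r}; rolling back")
```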
If the LLM can’t answer a query it usually forwards the chat to a human support agent.
This is not part of a defined workflow that requires structured output.
On a personal note, I'm happy to hear that. I've been apprehensive and haven't tried it, purely due to my fear of the cost.
This cost scaling will be an issue for this whole AI employee thing, especially because I imagine these providers are heavily discounting.
It's hard to make *one* good product (see startup failure rates). You couldn't make 12 (seemingly as a solo dev?) and you're surprised?
We've been working on Definite[0] for 2 years with a small team, and it only started getting really good in the past 6 months.
0 - data stack + AI agent: https://www.definite.app/
> agents that technically make successful API calls but can't actually accomplish complex workflows because they don't understand what happened.
It takes a long time to get these things right.
Something seems off about that...
I wrote a little about one such task, getting agents to supplement my markdown dev-log here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
The fundamental difference is that we need HITL (human-in-the-loop) to reduce errors, instead of HOTL (human-on-the-loop), which leads to the errors you mentioned.