I started a company in this space about 2 years ago. We are doing fine. What we've learned so far is that a lot of these techniques are simply optimisations to tackle some deficiency in LLMs that is a problem "today". These are not going to be problems tomorrow, because the technology will shift - as it has many times over the last 2 years.
So yah, cool, caching all of that... but give it a couple of months and a better technique will come out - or more capable models.
Many years ago, when disk encryption on AWS was not an option, my team and I had to spend 3 months coming up with a way to encrypt the disks, and to do it well, because at the time there was no standard way. It was very difficult, as it required pushing encrypted images (as far as I remember). Soon after we started, AWS introduced standard disk encryption that you can turn on by clicking a button. We wasted 3 months for nothing. We should have waited!
What I've learned from this is that oftentimes it is better to do absolutely nothing.
This is the most important observation. I'm getting so many workshop invitations from my corporate colleagues about AI and agents. What most people don't get is that these clever patterns they "invented" will be obsolete next week. That nice company blog post about agents - the one that went viral recently - will be obsolete next month. It's hard for my colleagues to swallow that this age is not like when you studied the Gang of Four or a software architecture pattern book and learned a common language - no, these days the half-life of an AI pattern is about a week. Ask 10 professionals what an agent actually is and you will get 10 different answers, yet each assumes that how they use it is the common understanding.
This is also why it's perfectly fine to wait out this AI hype and see what sticks afterward. It probably won't cost too much time to catch up, because at that point everyone who knows what they're doing only learned that a month or two ago anyway.
Counterpoint to these two posts: a journeyman used to have to make his own tools. He could easily have bought them, or his master could have made them. Making your own tools gives you vastly greater skills when using the tools. So I know how fast AI agents and model APIs are evolving, but I’m writing them anyway. Every break in my career has been someone telling me it’s impossible and then me doing it anyway. If you use an agent framework, you really have no idea how artificially constrained you are. You’re so constrained, and yet you are oblivious to it.
On the “wasting three months” remark (GP), if it’s a key value proposition, just do it. Don’t wait. If it’s not a key value prop, then don’t do it at all. Oftentimes what I’ve built has been better tailored to our product than what AWS built.
Note that even many of those "long knowledge" things people learned are today obsolete, but people who follow them just haven't figured it out yet. See how many of those object-oriented design patterns look very silly the minute you use immutable data structures and have access to functional programming constructs in your language. And nowadays most do. Many seminal books on how to program from the early 2000s, especially those covering "pure" OO, look quite silly today.
And yet, despite being largely obsolete in the specifics, the Gang of Four remains highly relevant and useful in the generalities. All these books continue to be absolutely great foundations if you look past their immediate advice.
I think knowing when to do nothing is being able to evaluate if the problem the team is tackling is essential or tangential to the core focus of the project, and also whether the problem is something new or if it's been around for a while and there is still no standard way to solve it.
Yeah, that will be the make-or-break moment: if it's too essential, the vendor will implement it anyway, but if it's not, it may become a competitive advantage.
Vehement disagree. We implemented our own context-editing features 4 months back. Last month Claude released a very similar feature set to the one we'd had all along. We were still glad we did it because:
(A) it took me half a day to do that work
(B) our solution is still more powerful for our use case
(C) our solution works on other models as well.
It all comes down to trying to predict your vendors' roadmap (or, if you're savvy, getting a peek into it) and whether the feature you want to create is fundamental to your application's behavior (I doubt encryption is, unless you're a storage company).
This is the "Wait Calculation" and it's fiendish because there exists only some small, finite window in which it is indeed better to start before the tech is "better" in order to "win" (i.e. get "there" first, wherever "there" is in your scenario).
If we wait long enough, we just end up dead, so it turns out we didn't need to do anything at all. Of course there's a balance - oftentimes starting out and growing up with the technology gives you background and experience that becomes an advantage when it hits escape velocity.
These days it seems like training yourself into a specialty that provides steadyish income for a year before someone obliterates your professional/corporate/field’s scaffolding with AI and you have to start over is kind of a win. Doesn’t it feel like a win? Look at the efficiency!
I agree with the sentiment. Things are moving so fast that waiting is now a legitimate strategy, though it is also easy to fall into the trap of "well, if we continue along these lines, we might as well wait 4-5 years and get AGI" - which, even if true, feels off, because you aren't participating in the process.
One example is that there used to be a whole complex apparatus around getting models to do chain of thought reasoning, e.g., LangChain. Now that is built in as reasoning and they are heavily trained to do it. Same with structured outputs and tool calls — you used to have to do a bunch of stuff to get models to produce valid JSON in the shape you want, now it’s built in and again, they are specifically trained around it. It used to be you would have to go find all relevant context up front and give it to the model. Now agent loops can dynamically figure out what they need and make the tool calls to retrieve it. Etc etc.
If we expand this to 3 years, the single biggest shift that totally changed LLM development is the increase in size of context windows from 4,000 to 16,000 to 128,000 to 256,000.
When we were at 4,000 and 16,000 context windows, a lot of effort was spent on nailing down text splitting, chunking, and reduction.
For all intents and purposes, the size of current context windows obviates all of that work.
What else changed?
- Multimodal LLMs - Text extraction from PDFs was a major issue for RAG/document intelligence. A lot of time was wasted trying to figure out custom text-extraction strategies for documents. Now you can just feed the image of a PDF page into an LLM and get back a better transcription.
- Reduced emphasis on vector search. People have found that for most purposes, having an agent grep your documents is cheaper and better than a more complex RAG pipeline (a minimal sketch of the idea is below). Boris Cherny created a stir when he talked about Claude Code doing it that way[0].

[0] https://news.ycombinator.com/item?id=43163011#43164253
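To make the grep-over-RAG point concrete: the whole retrieval "pipeline" can collapse into a single tool that shells out to ripgrep and lets the agent decide what to search for. A minimal sketch, assuming ripgrep is installed and a hypothetical ./docs folder:

    import subprocess

    def grep_docs(pattern: str, doc_dir: str = "./docs", context_lines: int = 2) -> str:
        """A tool the agent can call instead of a vector-search pipeline."""
        result = subprocess.run(
            ["rg", "--no-heading", "-i", "-C", str(context_lines), pattern, doc_dir],
            capture_output=True, text=True,
        )
        # Truncate so one noisy search doesn't blow up the context window.
        return result.stdout[:8000] or "no matches"

Expose that as a tool and the model iterates on search terms by itself, which is roughly the behavior described above.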
I'm amazed at this question and the responses you're getting.
These last few years, I've noticed that the tone around AI on HN changes quite a bit by waking time zone.
EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.
It's really puzzling to me. This is the first time I've noticed such a disconnect in the community about what the reality of things is.
To answer your question personally, genAI has changed the way I code drastically about every 6 months in the last two years. The subtle capability differences change what sorts of problems I can offload. The tasks I can trust them with get larger and larger.
It started with better autocomplete, and now, well, agents are writing new features as I write this comment.
On the foundational level: test-time compute (reasoning), heavy RL post-training, 1M+ context lengths, etc.
On the application layer, connecting with sandboxes/VMs is one of the biggest shifts (Cloudflare's Code Mode, etc.). Giving an LLM a sandbox unlocks on-the-fly computation, calculations, RPA, anything really.
MCP, or rather standardized function calling, is another one.
Also, local LLMs are becoming almost viable because of better and better distillation, relying on quick web search for facts, etc.
We started putting them in image and video models and now image and video models are insane.
I think the next period of high and rapid growth will be in media (image, video, sound, 3D), not text.
It's much harder to adapt LLMs to solving business use cases with text. Each problem is niche, you have to custom tailor the solution, and the tooling is crude.
The media use cases, by contrast, are low hanging fruit and result in 10,000x speedups and cost reductions almost immediately. The models are pure magic.
I think more companies would be wise to ignore text for now and focus on visual domain problems.
Nano Banana has so much more utility than agents. And there are so many low hanging fruit ways to make lots of money.
Don't sleep on image and video. That's where the growth salient is.
I suspect you're right, but it's a bit discouraging to consider that an alternative way of framing this is that companies like OpenAI have a huge advantage in this landscape and anything that works will end up behind their API.
In some ways, the fact that the technology will shift is the problem as model behavior keeps changing. It's rather maddening unstable ground to build on. Really hard to gauge the impact to customer experience from a new model.
You could use the likes of Amazon / Anthropic, or use Google, which has had transparent disk encryption for 10+ years, and Gemini, which already had the transparent caching discussed here built in.
If you’ve spent any time with the Vertex LLM APIs you wouldn't be so enthusiastic about using Google's platform (I say this as someone who prefers GCP to AWS for compute and networking).
Amen. Been seeing these agent SDKs coming out left and right for a couple of years and thought it'd be a breeze to build an agent. Now I'm trying to build one for ~3 weeks, and I've tried three different SDKs and a couple of architectures.
Here's what I found:
- Claude Code SDK (now called Agent SDK) is amazing, but I think they are still in the process of decoupling it from Claude Code, and that's why a few things are weird. E.g., you can define a subagent programmatically, but not skills. Skills have to be placed in the filesystem and then referenced in the plugin. Also, only Anthropic models are supported :(
- OpenAI's SDK's tight coupling with their platform is a plus point, i.e., you get agent and tool-use traces by default in your dashboard, which you can later use for evaluation, distillation, or fine-tuning.
But:
1. It's not easy to use a third-party model provider. Their docs provide sample code, but it's not as easy as that.
2. They have agent handoffs (which work in some cases), but not subagents. You can use tools as subagents, though.
- Google Agent Kit doesn't provide any TypeScript SDK yet, so I didn't try it.
- Mastra, even though it looks pretty sweet, spins up a server for your agent, which you can then use via REST API. umm.. why?
- SmythOS SDK is the one I'm currently testing because it provides flexibility in terms of choosing the model provider and defining your own architecture (handoffs or subagents, etc.). It has its quirks, but I think it'll work for now.
Question: If you don't mind sharing, what is your current architecture? Agent -> SubAgents -> SubSubAgents? Linear? or a Planner-Executor?
I'll write a detailed post about my learnings from architectures (fingers crossed) soon.
Every single SDK I've used was a nightmare once you get past the basics. I ended up just using an OpenRouter client library [1] and writing agents by hand without an abstraction layer. Is it a little more boilerplatey? Yea. Does it take more LoC to write? Yea. Is it worth it? 100%. Despite writing more code, the mental model is much easier (personally) to follow and understand.
As for the actual agent I just do the following:
- Get metadata from initial query
- Pass relevant metadata to agent
- Agent is a reasoning model with tools and output
- Agent runs in a loop (max of n times). It will reason which tool calls to use
- If there is a tool call, execute it and continue the loop
- Once the agent outputs content, the loop is effectively finished and you have your output
This is effectively a ReAct agent. Thanks to the reasoning being built in, you don't need an additional evaluator step.
Tools can be anything. It can be a subagent with subagents, a database query, etc.
Need to do an agent handoff? Just output the result of the agent into a different agent. You don't need an sdk to do a workflow.
I've tried some other SDKs/frameworks (Eino and langchaingo), and personally found it quicker to do it manually (a minimal sketch of the loop is below) than to fight against the framework.

[1]: https://github.com/reVrost/go-openrouter
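A minimal Python sketch of the loop described above, against the OpenAI-compatible chat-completions API that OpenRouter exposes. The model id and the web_search tool are placeholders, not recommendations:

    import json
    from openai import OpenAI  # OpenRouter speaks the OpenAI-compatible API

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

    def web_search(query: str) -> str:
        return f"(stub) results for {query!r}"  # placeholder tool implementation

    TOOLS = {"web_search": web_search}
    TOOL_SPECS = [{
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web and return a text summary.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    def run_agent(user_msg: str, max_turns: int = 8) -> str:
        messages = [{"role": "user", "content": user_msg}]
        for _ in range(max_turns):                     # loop with a hard cap
            resp = client.chat.completions.create(
                model="anthropic/claude-3.5-sonnet",   # placeholder model id
                messages=messages,
                tools=TOOL_SPECS,
            )
            msg = resp.choices[0].message
            if not msg.tool_calls:                     # no tool call -> final answer
                return msg.content
            messages.append(msg)                       # keep the assistant turn
            for call in msg.tool_calls:                # execute each requested tool
                args = json.loads(call.function.arguments)
                result = TOOLS[call.function.name](**args)
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": str(result),
                })
        return "Stopped after max_turns."

Handoffs then fall out for free: feed the final content from one run_agent call in as the user message of another.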
I think the term sub-agent is almost entirely useless. An agent is an LLM loop that has reasoning and access to tools.
A "sub agent" is just a tool. It's implantation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc, is meaningless outside of the main tool contract (i.e Params in Params out, SLA, etc)
I agree, technically, "sub agent" is also another tool. But I think it's important to differentiate tools with deterministic input/output from those with reasoning ability.
A simple 'Tool' will take the input and try to execute, but the 'subagent' might reason that the action is unnecessary and that the required output already exists in the shared context. Or it can ask a clarifying question from the main agent before using its tools.
> Its implementation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc., is meaningless outside of the main tool contract (i.e., params in, params out, SLA, etc.)
Up to a point. You're obviously right in principle, but if that task itself has the ability to call into "adjacent" tools then the behavior changes quite a bit. You can see this a bit with how the Oracle in Amp surfaces itself to the user. The oracle as sub-agent has access to the same tools as the main agent, and the state changes (rare!) that it performs are visible to itself as well as the main agent. The tools that it invokes are displayed similarly to the main agent loop, but they are visualized as calls within the tool.
ADK differentiates between tools and subagents based on the ability to escalate or transfer control (subagents), whereas tools are more basic.
I think this is a meaningful distinction, because it impacts control flow, regardless of what they are called. The lexicon varies quite a bit vendor-to-vendor.
Nah, when working on anything sufficiently complicated you will have many parallel subagents that need their own context window, ability to mutate shared state, sandboxing differences, durability considerations, etc.
If you want to rewrite the behavior per instance you totally can, but there is a definite concept here that is different than “get_weather”.
I think that existing tools don’t work very well here or leave much of this as an exercise for the user. We have tasks that can take a few days to finish (just a huge volume of data and many non deterministic paths). Most people are doing way too much or way too little. Having subagents with traits that can be vended at runtime feels really nice.
Hello, about Claude Code only supporting Anthropic models: in reality you can use Claude Code Router (https://github.com/musistudio/claude-code-router) to use other models in Claude Code. I've been using it for a few weeks with open-source models and it works pretty well. You can even use "free" models from OpenRouter.
Google's ADK is pretty nice. I'm using the Go version, which is less mature than the Python one. Been at it a bit over a week and progress is great. This weekend I'm aiming for tracking file changes in the session history to allow rewinding / forking.
It has a ton of day 2 features, really nice abstractions, and positioned itself well in terms of the building blocks and constructing workflows.
ADK supports working with all the vendors and local LLMs
Have you tried AWS’s Strands Agents SDK? I’ve found it to be a very fluent and ergonomic API. And it doesn’t require you to use Bedrock; most major vendor native APIs are supported.
(Disclaimer: I work for AWS, but not for any team involved. Opinions are my own.)
Dude, it looks great, but I just spent half an hour learning about its 'CodeAgents' feature. Which essentially is 'actions written as code'.
This idea has been floating around in my head, but it wasn't refined enough to implement. It's so wild that what you're thinking of may have already been done by someone else on the internet.
For those who are wondering, it's kind of similar to the 'Code Mode' idea implemented by Cloudflare and now being explored by Anthropic: write code to discover and call MCP tools instead of stuffing the context window with their definitions.
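If it helps to picture it, here is a toy sketch under stated assumptions: the two MCP helpers are made-up stand-ins for whatever MCP client you actually use, and the "sandbox" is just exec with a restricted namespace. The point is that only one "run code" tool gets registered with the model, and tool discovery happens inside the code it writes:

    # Hypothetical stand-ins for a real MCP client (not an actual SDK).
    def list_tools(server: str) -> list[dict]:
        """Return tool names/descriptions for one MCP server (stub)."""
        return [{"name": "query_db", "description": "Run a SQL query"}]

    def call_tool(server: str, tool: str, args: dict):
        """Invoke one MCP tool and return its result (stub)."""
        return {"rows": []}

    # Instead of stuffing every tool definition into the prompt, the model
    # writes code that discovers and calls tools on demand:
    agent_generated_code = (
        'tools = list_tools("analytics")\n'
        'print("available:", [t["name"] for t in tools])\n'
        'print(call_tool("analytics", "query_db", {"sql": "SELECT count(*) FROM orders"}))\n'
    )

    # The "sandbox" executes the snippet with only these two helpers in scope.
    exec(agent_generated_code, {"list_tools": list_tools, "call_tool": call_tool})

A real version would run the generated code in an actual isolated sandbox (the Cloudflare and Anthropic write-ups use isolates/containers) rather than exec in-process.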
1. If your agent needs to write a lot of code, it's really hard to beat Claude Code (cc) / Agent SDK. We've tried many approaches and frameworks over the past 2 years (e.g. PydanticAI), but using cc is the first that has felt magic.
2. Vendor lock-in is a risk, but the bigger risk is having an agent that is less capable than what a user gets out of ChatGPT because you're hand-rolling every aspect of your agent.
3. cc is incredibly self aware. When you ask cc how to do something in cc, it instantly nails it. If you ask cc how to do something in framework xyz, it will take much more effort.
4. Give your agent a computer to use. We use e2b.dev, but Modal is great too. When the agent has a computer, it makes many complex features feel simple.
0 - For context, Definite (https://www.definite.app/) is a data platform with agents to operate it. It's like Heroku for data with a staff of AI data engineers and analysts.
Be careful about what you hand off to Claude versus another agent. Claude is a vibe project monster, but it will fail at hard things, come up with fake solutions, and then lie to you about them. To the point that it'll add random sleeps and do pointless work to cover up the fact that it's reward hacking. It's also very messy.
For brownfield work, hard problems, or big complex codebases, you'll save yourself a lot of pain if you use Codex instead of CC.
It's quite worrying that, several times in the last few months, I've had to really drive home why people should probably not be building bespoke agentic systems that essentially act as a half-baked version of an agentic coding tool, when they could just use Claude Code and instead focus their efforts on creating value rather than instant technical debt.
You can pretty much completely reprogram agents just by passing them through a smart proxy. You don't need to rewrite claude/codex, just add context engineering and tool behaviors at the proxy layer.
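A rough sketch of what "context engineering at the proxy layer" can look like: an OpenAI-compatible pass-through that injects a system message before forwarding. The endpoint path and auth header follow the chat-completions convention; the upstream URL and the injected text are illustrative, and streaming is omitted:

    import os
    import httpx
    from fastapi import FastAPI, Request

    app = FastAPI()
    UPSTREAM = "https://api.openai.com/v1/chat/completions"  # any compatible vendor

    EXTRA_CONTEXT = {"role": "system",
                     "content": "House rules: cite internal doc IDs for every claim."}

    @app.post("/v1/chat/completions")
    async def proxy(request: Request):
        body = await request.json()
        # Context engineering happens here, invisibly to the client agent.
        body["messages"] = [EXTRA_CONTEXT] + body.get("messages", [])
        async with httpx.AsyncClient(timeout=120) as client:
            upstream = await client.post(
                UPSTREAM, json=body,
                headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            )
        return upstream.json()

Point the agent's base URL at the proxy and it picks up the extra behavior with no client-side change; tool-behavior rewriting can hook in at the same spot.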
yes, we should all stop experimenting and outsource our agentic workflows to our new overlords...
this will surely end up better than where big tech has already brought our current society...
For real though, where did the dreamers about ai / agentic free of the worst companies go? Are we in the seasons of capitulation?
My opinion... build, learn, share. The frameworks will improve, the time to custom agent will be shortened, the knowledge won't be locked in another unicorn
anecdotally, I've come quite far in just a week with ADK and VS Code extensions, having never done extensions before, which has been a large part of the time spent
I've been building agent type stuff for a couple years now and the best thing I did was build my own framework and abstractions that I know like the back of my hand.
I'd stay clear of any llm abstraction. There are so many companies with open source abstractions offering the panacea of a single interface that are crumbling under their own weight due to the sheer futility of supporting every permutation of every SDK evolution, all while the same companies try to build revenue generating businesses on top of them.
I agree with your take on building your own agent framework to keep some level of control and fewer abstractions. Agents at their core are about programmatically talking to an LLM and performing these basic operations (a bare-bones sketch follows below):
1. Structured Input and String Interpolation in prompts
2. Structured Output and Unmarshalling String response to Structured output (This is getting easier now with LLMs supporting Structured output)
3. Tool registry/discovery (of MCP and Function tools), Tool calls and response looping
4. Composability of Tools
5. Some form of Agent to Agent delegation
I’ve had good luck with using PydanticAI which does these core operations well (due to the underlying Pydantic library), but still struggles with too many MCP servers/Tools and composability.
I’ve built an open-source agent framework called OpusAgents that makes creating Agents, Subagents, and Tools simpler than MCP servers, without overloading the context. Check it out, and the tutorials/demos, to see how it’s more reliable than generic agents with MCP servers in Cursor/Claude Desktop - https://github.com/sathish316/opus_agents
It’s built on top of PydanticAI and FastMCP, so that all non-core operations of Agents are accessible when I need them later.
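The bare-bones sketch referred to above covers operations 1 and 2 with no framework at all, just pydantic: interpolate the schema into the prompt, then unmarshal and validate the reply. The TicketTriage schema and the stubbed model output are made up for illustration:

    import json
    from pydantic import BaseModel, ValidationError

    class TicketTriage(BaseModel):   # made-up example schema
        category: str
        urgency: int                 # 1 (low) .. 5 (high)
        summary: str

    PROMPT = ("Triage this support ticket and reply with JSON only, "
              "matching this schema:\n{schema}\n\nTicket:\n{ticket}")

    def build_prompt(ticket: str) -> str:
        # Operation 1: structured input / string interpolation.
        return PROMPT.format(
            schema=json.dumps(TicketTriage.model_json_schema(), indent=2),
            ticket=ticket,
        )

    def parse_response(raw: str) -> TicketTriage:
        # Operation 2: unmarshal the string response into a typed object.
        try:
            return TicketTriage.model_validate_json(raw)
        except ValidationError:
            raise  # or re-prompt the model with the validation error appended

    fake_llm_output = '{"category": "billing", "urgency": 4, "summary": "double charge"}'
    print(build_prompt("I was charged twice this month")[:120], "...")
    print(parse_response(fake_llm_output))

Tool registry and composability (operations 3 and 4) are then just a dict of callables plus the loop discussed elsewhere in this thread.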
I also recommend this. I have tried all of the frameworks, and still deploy some for clients - but for my personal agents, it's my own custom framework that is dead simple and very easy to spin up, extend, etc.
This sounds interesting. What about the agent behavior itself? How it decides how to come at a problem, what to show the user along the way, and how it decides when to stop? Are these things you have attempted to grapple with in your framework?
Excellent write-up. I’ve been thinking a lot about caching and agents, so this was right up my alley.
Have you experimented with using a semantic cache on the chain of thought (what we get back from the providers anyway) and sending that to a dumb model for similar queries to “simulate” thinking?
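For anyone curious, a rough sketch of that idea with the embedding call stubbed out (swap in whichever embedding endpoint you use); cached chain-of-thought gets reused when a new query lands close enough in embedding space:

    import math

    def embed(text: str) -> list[float]:
        """Stub: replace with a real embedding call (hosted API or local model)."""
        padded = text[:64].ljust(64)
        return [float(ord(c) % 7) for c in padded]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    _cache: list[tuple[list[float], str]] = []   # (query embedding, cached reasoning)

    def cached_reasoning(query: str, threshold: float = 0.92):
        """Return cached chain-of-thought for a near-duplicate query, else None."""
        qv = embed(query)
        best = max(_cache, key=lambda item: cosine(qv, item[0]), default=None)
        if best and cosine(qv, best[0]) >= threshold:
            return best[1]   # prepend this to the cheap model's prompt as "thinking"
        return None

    def store_reasoning(query: str, reasoning: str) -> None:
        _cache.append((embed(query), reasoning))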
Yes, this is great advice. It also applies to interfaces. When we designed a support "chat bot", we went with a different architecture than what's out there already. We designed the system with "chat rooms" instead (rough sketch below), and the frontend just dumps messages to a chat room (with a session id). Then on the backend we can do lots of things, incrementally adding functionality, while the frontend doesn't have to keep up. We can also do things like group messages, have "system" messages that other services can read, etc. It also feels more natural, as the client can type additional info while the server is working, etc.
If you have to use some of the client side SDKs, another good idea is to have a proxy where you can also add functionality without having to change the frontend.
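A rough sketch of that chat-room shape, purely illustrative (field names and the in-memory store are made up; in practice the room would live in a database or queue):

    import time
    import uuid
    from dataclasses import dataclass, field

    @dataclass
    class Message:
        session_id: str
        role: str       # "user", "assistant", "system", "service:billing", ...
        content: str
        ts: float = field(default_factory=time.time)
        msg_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    class ChatRoom:
        """Frontend and backend services all just append and poll."""
        def __init__(self) -> None:
            self._messages: list[Message] = []

        def post(self, msg: Message) -> None:
            self._messages.append(msg)

        def since(self, session_id: str, ts: float = 0.0) -> list[Message]:
            return [m for m in self._messages
                    if m.session_id == session_id and m.ts > ts]

    room = ChatRoom()
    room.post(Message("sess-1", "user", "my invoice looks wrong"))
    room.post(Message("sess-1", "service:billing", "flagged invoice #123 for review"))
    print([m.role for m in room.since("sess-1")])

Neither side blocks on the other; both only ever append messages and read whatever is new.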
Creativity is an underrated hard part of building agents. The fun part of building right now is knowing how little of the design space for building agents has been explored.
This is not so unlike the coding agent I'm building for vs code. One of the things I'm doing is keeping a snapshot of the current vs code state (files open, terminal history, etc) in the agent server. Similarly, I track the file changes without actually writing them until the user approves the diff, so there are some "filesystem" like things that need to be carefully managed on each side.
tl;dr, Both sides are broadcasting messages and listening for the ones they care about.
This is a huge undertaking, though. Yes, it is quite simple to build some basic abstraction on top of openai.complete or similar, but that is like 1% of what an agent needs to do.
My bet is that agent frameworks and platforms will become more like game engines. You can spin up your own engine for sure, and it is fun and rewarding, but AAA studios will most likely decide to use a ready-to-go platform with all the batteries included.
In totality, yes. But you don't need every feature at once. You add to it once you hit boundaries. But I think the most important thing about this exercise is that you leave nothing to the imagination when building agents.
The non-deterministic nature of LLMs already makes the performance of agents so difficult to interpret. Building agents on top of code that you cannot mentally trace through leads to so much frustration when addressing model underperformance and failure.
It's hard to say whether, after the dust settles, companies will default to batteries-included frameworks, but right now a lot of people I've seen have regretted adopting a large framework off the bat.
We're repeating the same overengineering cycle we saw with early LangChain/RAG stacks. Just a couple of months ago the term agent was hard to define, but I've realized the best mental model is just a standard REPL:
Read: Gather context (user input + tool outputs).
Eval: LLM inference (decides: do I need a tool, or am I done?).
Print: Execute the tool (the side effect) or return the answer.
Loop: Feed the result back into the context window.
Rolling a lightweight implementation around this concept has been significantly more robust for me than fighting with the abstractions in the heavy-weight SDKs.
I don't think this has much to do with SDKs. I've developed my own agent code from scratch (starting from the simple loop) and eventually- unless your use case is really simple- you always have to deal with the need for subagents specialised for certain tasks, that share part of their data (but not all) with the main agent, with internal reasoning and reinforcement messages, etc.
Interestingly, sticking to the "Agent = REPL" mental model is actually what helped me solve those specific scaling problems (sub-agents and shared data) without the SDK bloat.
1. Sub-agents are just stack frames. When the main loop encounters a complex task, it "pushes" a new scope (a sub-agent with a fresh, empty context). That sub-agent runs its own REPL loop, returns only the clean result without any context pollution, and is then "popped" (see the sketch below).
2. Shared Data is the heap. Instead of stuffing "shared data" into the context window (which is expensive and confusing), I pass a shared state object by reference. Agents read/write to the heap via tools, but they only pass "pointers" in the conversation history. In the beginning this was just a Python dictionary and the "pointers" were keys.
My issue with the heavy SDKs isn't that they try to solve these problems, but that they often abstract away the state management. I’ve found that explicitly managing the "stack" (context) and "heap" (artifacts) makes the system much easier to debug.
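The sketch referenced above: llm_complete is a stand-in for whatever model call you use, the heap is just a dict, and only pointers (keys) travel through the transcript:

    import uuid

    HEAP: dict[str, object] = {}      # shared artifact store ("heap")

    def heap_put(value: object) -> str:
        key = f"ptr-{uuid.uuid4().hex[:8]}"
        HEAP[key] = value
        return key                    # only this pointer travels in the transcript

    def heap_get(key: str) -> object:
        return HEAP[key]

    def llm_complete(messages: list) -> str:
        """Stand-in for a real model call."""
        return f"(summary of {len(messages)} messages)"

    def run_subagent(task: str, input_ptr: str) -> str:
        # "Push a stack frame": fresh, empty context, seeded only with what it needs.
        frame = [
            {"role": "system", "content": "You are a focused sub-agent."},
            {"role": "user", "content": f"{task}\n\nInput:\n{heap_get(input_ptr)}"},
        ]
        result = llm_complete(frame)  # the sub-agent would run its own loop here
        return heap_put(result)       # "pop": hand back a clean pointer, not the blob

    raw_ptr = heap_put("...200 pages of crawled text...")
    summary_ptr = run_subagent("Summarize the key findings", raw_ptr)
    print(summary_ptr, "->", heap_get(summary_ptr))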
This is why I use the agent I use. I won't name the company, because I don't want people to think I'm a shill for them (I've already been accused of it before, but I'm just a happy, excited customer). But it's an agentic coding company that isn't associated with any of the big model providers.
I don't want to keep up with all the new model releases. I don't want to read every model card. I don't want to feel pressured to update immediately (if it's better). I don't want to run evals. I don't want to think about when different models are better for different scenarios. I don't want to build obvious/common subagents. I don't want to manage N > 1 billing entities.
I just want to work.
Paying an agentic coding company to do this makes perfect sense for me.
I’ve been surprised at the lack of discussion about sourcegraph’s Amp here which I’m pretty sure you’re referring to - it started a bit rough but these days I find that it’s really good
So, I tried to sign up for Amp. I saw a livestream that mentioned you can sign up for their community Buildcrew on Discord and get $100 of credits. I tried signing up, and got an email that I was accepted and would soon get the credits. The Discord link did not work (it was expired) and the email was a noreply, so I tried emailing Amp support. This was last Friday (8 days ago.) As of today, no updated Discord link, no human response, no credits. If this is their norm, people probably aren't talking about it because they just haven't been able to try it.
> We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.
I'm curious about the solutions the op has tried so far here.
"Because there’s too much you need to feed into it" - what does the author mean by this? If it is the amount of data, then I would say sampling needs to be implemented. If that's the extent of the information required from the agent builder, I agree that an LLM-as-a-judge e2e eval setup is necessary.
In general, a more generic eval setup is needed, with minimal requirements on AI engineers, if we want to move forward, as a sector, from vibes-based reliability engineering practices.
Likewise. I have a nasty feeling that most AI agent deployments happen with nothing more than some cursory manual testing. Going with the ‘vibes’ (to use an overused term in the industry).
I can confirm this after hundreds of talks about the topic over the last 2 years. 90% of cases are simply not high-volume or high-stakes enough for the devs to care enough. I'm a founder of an evaluation automation startup, and our challenge is spotting teams right as their usage starts to grow and quality issues are about to escalate. Since that’s tough, we're trying to make the getting-to-first-evals so simple that teams can start building the mental models before things get out of hand.
A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.
In my case I was until recently working on TTS and this was a huge barrier for us. We used all the common signal quality and MOS-simulation models that judged so called "naturalness" and "expressiveness" etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning as well as commercial planning extremely difficult and we suffered greatly for it. (Notice my use of past tense here..)
Having good metrics is just really key and I'm now at the point where I'd go as far as to say that if good metrics don't exist it's almost not even worth working on something. (Almost.)
What are the main shortcomings of the solutions you tried out?
We believe you need to both automatically create the evaluation policies from OTEL data (data-first) and to bring in rigorous LLM judge automation from the other end (intent-first) for the truly open-ended aspects.
It's a 2-day project at best to create your own bespoke LLM-as-judge e2e eval framework. That's what we did. Works fine. Not great. Still need someone to write the evals though.
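For anyone weighing that 2-day build: the core of such a judge really is small; the hard part is writing the eval cases and trusting the rubric. A sketch assuming the OpenAI Python SDK, with the judge model and rubric as placeholders:

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = """You are grading an AI agent's answer.
    Rubric: the answer must {rubric}.
    Question: {question}
    Agent answer: {answer}
    Reply with JSON: {{"pass": true or false, "reason": "..."}}"""

    def judge(question: str, answer: str, rubric: str) -> dict:
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",                     # placeholder judge model
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                rubric=rubric, question=question, answer=answer)}],
            response_format={"type": "json_object"},  # force parseable output
        )
        return json.loads(resp.choices[0].message.content)

    # One hand-written eval case; in practice you loop over a file of these,
    # ideally built from observability traces.
    case = {"question": "What is our refund window?",
            "answer": "30 days from delivery.",
            "rubric": "state the 30-day window explicitly"}
    print(judge(**case))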
My tip is: don’t use SDKs for agents. Use a while loop, craft your own JSON, and handle context size and faults yourself. You will in practice need this level of control if you are not doing something trivial.
I wager the same for AI agent techniques.
You have a positive cash flow from sales of agents? Your revenue exceeds your operating costs?
I've been very skeptical that it is possible to make money from agents, having seen how difficult it was for the current well-known players to do so.
What is your secret sauce?
Imo the key is to serve one use case really well rather than overgeneralize.
Here's a nice article about it: https://www.oneusefulthing.org/p/the-lazy-tyranny-of-the-wai...
What technology shifts have happened for LLMs in the last 2 years?
Any framework you build around the model is just behaviour that can be trained into the model itself
A "sub agent" is just a tool. It's implantation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc, is meaningless outside of the main tool contract (i.e Params in Params out, SLA, etc)
Up to a point. You're obviously right in principle, but if that task itself has the ability to call into "adjacent" tools then the behavior changes quite a bit. You can see this a bit with how the Oracle in Amp surfaces itself to the user. The oracle as sub-agent has access to the same tools as the main agent, and the state changes (rare!) that it performs are visible to itself as well as the main agent. The tools that it invokes are displayed similarly to the main agent loop, but they are visualized as calls within the tool.
I think this is a meaningful distinction, because it impacts control flow, regardless what they are called. The lexicon are quite varied vendor-to-vendor
If you want to rewrite the behavior per instance you totally can, but there is a definite concept here that is different than “get_weather”.
I think that existing tools don’t work very well here or leave much of this as an exercise for the user. We have tasks that can take a few days to finish (just a huge volume of data and many non deterministic paths). Most people are doing way too much or way too little. Having subagents with traits that can be vended at runtime feels really nice.
Deleted Comment
It has a ton of day 2 features, really nice abstractions, and positioned itself well in terms of the building blocks and constructing workflows.
ADK supports working with all the vendors and local LLMs
You will be happy you did
https://huggingface.co/docs/smolagents/conceptual_guides/int...
https://ai-sdk.dev/docs/agents/overview
Codex is stronger out of the box but properly customized Claude can't be matched at the moment
https://google.github.io/adk-docs/evaluate/
tl;dr - challenging because different runs produce different output; also, how do you decide pass/fail? (Another LLM/agent as judge is what people do.)