_pdp_ · 22 days ago
I started a company in this space about 2 years ago. We are doing fine. What we've learned so far is that a lot of these techniques are simply optimisations to tackle some deficiency in LLMs that is a problem "today". These are not going to be problems tomorrow because the technology will shift, as it has many times over the span of the last 2 years.

So yah, cool, caching all of that... but give it a couple of months and a better technique will come out - or more capable models.

Many years ago, when disk encryption on AWS was not an option, my team and I had to spend 3 months coming up with a way to encrypt the disks, and to do it well, because at the time there was no standard way. It was very difficult, as it required pushing encrypted images (as far as I remember). Soon after we started, AWS introduced standard disk encryption that you can turn on by clicking a button. We wasted 3 months for nothing. We should have waited!

What I've learned from this is that oftentimes it is better to do absolutely nothing.

siva7 · 22 days ago
This is the most important observation. I'm getting so many workshop invitations from my corporate colleagues about AI and agents. What most people don't get is that the clever patterns they "invented" will be obsolete next week. That nice company blog post about agents - the one that went viral recently - will be obsolete next month. It's hard for my colleagues to swallow that this isn't like studying the Gang of Four or a software architecture pattern book, where you learned a common language - these days the half-life of an AI pattern is about a week. Ask 10 professionals what an agent actually is and you'll get 10 different answers, yet each assumes that the way they use it is the common understanding.
Vinnl · 22 days ago
This is also why it's perfectly fine to wait out this AI hype and see what sticks afterward. It probably won't cost too much time to catch up, because at that point everyone who knows what they're doing only learned that a month or two ago anyway.
lowbloodsugar · 21 days ago
Counterpoint to these two posts: a journeyman used to have to make his own tools. He could easily have bought them, or his master could have made them. Making your own tools gives you vastly greater skill when using them. So I know how fast AI agents and model APIs are evolving, but I'm writing my own anyway. Every break in my career has come from someone telling me something is impossible and me doing it anyway. If you use an agent framework, you really have no idea how artificially constrained you are. You're so constrained, and yet you are oblivious to it.

On the "wasting three months" remark (GP): if it's a key value proposition, just do it. Don't wait. If it's not a key value prop, then don't do it at all. Oftentimes what I've built has been better tailored to our product than what AWS built.

hibikir · 22 days ago
Note that even many of those "long knowledge" things people learned are obsolete today; the people who follow them just haven't figured it out yet. See how many of those object-oriented design patterns look very silly the minute you use immutable data structures and have access to functional programming constructs in your language - and nowadays most languages have them. Many seminal books on how to program from the early 2000s, especially those covering "pure" OO, look quite silly today.
AYBABTME · 21 days ago
And yet, despite being largely obsolete in the specifics, the Gang of Four remains highly relevant and useful in the generalities. All these books continue to be absolutely great foundations if you look past their immediate advice.

I wager the same holds for AI agent techniques.

lelanthran · 22 days ago
> I started a company in this space about 2 years ago. We are doing fine.

You have a positive cash flow from sales of agents? Your revenue exceeds your operating costs?

I've been very skeptical that it is possible to make money from agents, having seen how difficult it was for the current well-known players to do so.

What is your secret sauce?

yayitswei · 21 days ago
I'm cash-flow positive on my SMS sales agent. It serves just one client, and my revenue is at least 3x the cost of inference/hosting.

Imo the key is to serve one use case really well rather than overgeneralize.

nvader · 21 days ago
Bumping for interest too. Would love to hear what you believe is correlated to success.
gchamonlive · 22 days ago
I think knowing when to do nothing comes down to being able to evaluate whether the problem the team is tackling is essential or tangential to the core focus of the project, and whether the problem is something new or has been around for a while with still no standard way to solve it.
gessha · 22 days ago
Yeah, that will be the make-it-or-break-it moment: if it's too essential, it will get implemented upstream anyway, but if it's not, it may become a competitive advantage.
ramraj07 · 21 days ago
I vehemently disagree. We implemented our own context-editing features 4 months back. Last month, Claude released a very similar feature set to the one we'd had all along. We were still glad we did it, because (A) it took me half a day to do that work, (B) our solution is still more powerful for our use case, and (C) our solution works on other models as well.

It all comes down to trying to predict what your vendors' roadmap will be (or, if you're savvy, getting a peek into it) and whether the feature you want to create is fundamental to your application's behavior (I doubt encryption is, unless you're a storage company).

popcorncowboy · 21 days ago
This is the "Wait Calculation" and it's fiendish because there exists only some small, finite window in which it is indeed better to start before the tech is "better" in order to "win" (i.e. get "there" first, wherever "there" is in your scenario).

Here's a nice article about it: https://www.oneusefulthing.org/p/the-lazy-tyranny-of-the-wai...

exe34 · 22 days ago
If we wait long enough, we just end up dead, so it turns out we didn't need to do anything at all whatsoever. Of course there's a balance - oftentimes starting out and growing up with the technology gives you background and experience that become an advantage when it hits escape velocity.
nrhrjrjrjtntbt · 21 days ago
If you wait long enough in AI, they may not need your agent at all - they'll just use OpenAI directly.
DrewADesign · 21 days ago
These days it seems like training yourself into a specialty that provides steadyish income for a year before someone obliterates your professional/corporate/field’s scaffolding with AI and you have to start over is kind of a win. Doesn’t it feel like a win? Look at the efficiency!
nowittyusername · 22 days ago
I agree with the sentiment. Things are moving so fast that waiting is now a legitimate strategy, though it's also easy to fall into the trap of "well, if we continue along these lines, we might as well wait 4-5 years and get AGI" - which, even if true, feels off, because you aren't participating in the process.
an0malous · 22 days ago
> These are not going to be problems tomorrow because the technology will shift, as it has many times over the span of the last 2 years.

What technology shifts have happened for LLMs in the last 2 years?

dcre · 22 days ago
One example is that there used to be a whole complex apparatus around getting models to do chain of thought reasoning, e.g., LangChain. Now that is built in as reasoning and they are heavily trained to do it. Same with structured outputs and tool calls — you used to have to do a bunch of stuff to get models to produce valid JSON in the shape you want, now it’s built in and again, they are specifically trained around it. It used to be you would have to go find all relevant context up front and give it to the model. Now agent loops can dynamically figure out what they need and make the tool calls to retrieve it. Etc etc.
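
To make that concrete: structured output is now a first-class API feature rather than prompt gymnastics. A rough sketch with the OpenAI Python SDK (recent versions; the model name and schema here are purely illustrative):

    # Hedged sketch: built-in structured output, no hand-rolled JSON parsing.
    from pydantic import BaseModel
    from openai import OpenAI

    class Invoice(BaseModel):
        vendor: str
        total: float
        currency: str

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Acme billed us $1,200 for hosting."}],
        response_format=Invoice,  # the API enforces the schema
    )
    invoice = completion.choices[0].message.parsed  # a typed Invoice instance
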
postalcoder · 22 days ago
If we expand this to 3 years, the single biggest shift that totally changed LLM development is the increase in size of context windows from 4,000 to 16,000 to 128,000 to 256,000.

When we were at 4,000 and 16,000 context windows, a lot of effort was spent on nailing down text splitting, chunking, and reduction.

For all intents and purposes, the size of current context windows obviates all of that work.

What else changed?

- Multimodal LLMs - Text extraction from PDFs was a major issue for RAG/document intelligence. A lot of time was wasted trying to figure out custom text extraction strategies for documents. Now you can just feed the image of a PDF page into an LLM and get back a better transcription.

- Reduced emphasis on vector search. People have found that for most purposes, having an agent grep your documents is cheaper and better than using a more complex RAG pipeline. Boris Cherny created a stir when he talked about Claude Code doing it that way[0]

[0]: https://news.ycombinator.com/item?id=43163011#43164253

throwaway13337 · 22 days ago
I'm amazed at this question and the responses you're getting.

These last few years, I've noticed that the tone around AI on HN changes quite a bit by waking time zone.

EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.

It's really puzzling to me. This is the first time I've noticed such a disconnect in the community about what the reality of things actually is.

To answer your question personally, genAI has changed the way I code drastically about every 6 months in the last two years. The subtle capability differences change what sorts of problems I can offload. The tasks I can trust them with get larger and larger.

It started with better autocomplete, and now, well, agents are writing new features as I write this comment.

deepdarkforest · 22 days ago
On the foundational level: test-time compute (reasoning), heavy RL post-training, 1M+ context lengths, etc.

On the application layer, connecting with sandboxes/VMs is one of the biggest shifts (Cloudflare's Code Mode, etc.). Giving an LLM a sandbox unlocks on-the-fly computation, calculations, RPA - anything, really.

MCPs, or rather standardized function calling, are another one.

Also, local LLMs are becoming almost viable because of better and better distillation, relying on quick web search for facts, etc.

WA · 22 days ago
Not the LLMs. The APIs got more capabilities such as tool/function calling, explicit caching etc.
echelon · 22 days ago
We started putting them in image and video models and now image and video models are insane.

I think the next period of high and rapid growth will be in media (image, video, sound, 3D), not text.

It's much harder to adapt LLMs to solving business use cases with text. Each problem is niche, you have to custom tailor the solution, and the tooling is crude.

The media use cases, by contrast, are low hanging fruit and result in 10,000x speedups and cost reductions almost immediately. The models are pure magic.

I think more companies would be wise to ignore text for now and focus on visual domain problems.

Nano Banana has so much more utility than agents. And there are so many low hanging fruit ways to make lots of money.

Don't sleep on image and video. That's where the growth salient is.

sethev · 21 days ago
I suspect you're right, but it's a bit discouraging to consider that an alternative way of framing this is that companies like OpenAI have a huge advantage in this landscape and anything that works will end up behind their API.
toddmorey · 22 days ago
In some ways, the fact that the technology will shift is the problem, as model behavior keeps changing. It's rather maddeningly unstable ground to build on. It's really hard to gauge the impact on customer experience from a new model.
ares623 · 22 days ago
For a JS dev, it’s just another Tuesday
verdverm · 22 days ago
Vendor choice matters.

You could use the likes of Amazon/Anthropic, or use Google, which has had transparent disk encryption for 10+ years, and Gemini, which already has the transparent caching discussed here built in.

te_chris · 22 days ago
If you'd spent any time with the Vertex LLM APIs you wouldn't be so enthusiastic about using Google's platform (I say this as someone who prefers GCP to AWS for compute and networking).
wolttam · 21 days ago
This has been my intuition with these models since close to the beginning.

Any framework you build around the model is just behaviour that can be trained into the model itself

jFriedensreich · 22 days ago
Exactly my experience too. We focus all our energy on the parts that will not be solved by someone else in a few months.
moinism · 22 days ago
Amen. I'd been seeing these agent SDKs coming out left and right for a couple of years and thought it'd be a breeze to build an agent. I've now been trying to build one for ~3 weeks, and I've tried three different SDKs and a couple of architectures.

Here's what I found:

- Claude Code SDK (now called Agent SDK) is amazing, but I think they are still in the process of decoupling it from Claude Code, and that's why a few things are weird. E.g., you can define a subagent programmatically, but not skills; skills have to be placed in the filesystem and then referenced in the plugin. Also, only Anthropic models are supported :(

- OpenAI's SDK's tight coupling with their platform is a plus point, i.e., you get agent and tool-use traces by default in your dashboard, which you can later use for evaluation, distillation, or fine-tuning. But: 1. It's not easy to use a third-party model provider - their docs provide sample code, but it's not as simple as that. 2. They have agent handoffs (which work in some cases), but not subagents; you can use tools as subagents, though.

- Google Agent Kit doesn't provide a TypeScript SDK yet, so I didn't try it.

- Mastra, even though it looks pretty sweet, spins up a server for your agent, which you can then use via a REST API. Umm... why?

- SmythOS SDK is the one I'm currently testing because it provides flexibility in terms of choosing the model provider and defining your own architecture (handoffs or subagents, etc.). It has its quirks, but I think it'll work for now.

Question: If you don't mind sharing, what is your current architecture? Agent -> SubAgents -> SubSubAgents? Linear? or a Planner-Executor?

I'll write a detailed post about my learnings from architectures (fingers crossed) soon.

copypaper · 21 days ago
Every single SDK I've used was a nightmare once you get past the basics. I ended up just using an OpenRouter client library [1] and writing agents by hand without an abstraction layer. Is it a little more boilerplatey? Yea. Does it take more LoC to write? Yea. Is it worth it? 100%. Despite writing more code, the mental model is much easier (personally) to follow and understand.

As for the actual agent I just do the following:

- Get metadata from initial query

- Pass relevant metadata to agent

- Agent is a reasoning model with tools and output

- Agent runs in a loop (max of n times). It will reason about which tool calls to use

- If there is a tool call, execute it and continue the loop

- Once the agent outputs content, the loop is effectively finished and you have your output

This is effectively a ReAct agent. Thanks to the reasoning being built in, you don't need an additional evaluator step.

Tools can be anything: a subagent with its own subagents, a database query, etc. Need to do an agent handoff? Just feed the output of one agent into a different agent. You don't need an SDK to do a workflow.

I've tried some other SDKs/frameworks (Eino and langchaingo), and personally found it quicker to do it manually (as described above) than fight against the framework.
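
Stripped down, the whole thing is roughly this (sketched in Python against an OpenAI-compatible endpoint rather than the Go client I actually use; the tool, model id, and endpoint are illustrative):

    # Hedged sketch of the hand-rolled loop described above; not production code.
    import json
    from openai import OpenAI

    # Any OpenAI-compatible endpoint works here (OpenRouter, a vendor API, etc.).
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "search_docs",  # illustrative tool
            "description": "Full-text search over internal documents.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    def search_docs(query: str) -> str:
        return "search results for: " + query  # stand-in implementation

    def run_agent(messages, max_steps=10):
        for _ in range(max_steps):
            resp = client.chat.completions.create(
                model="anthropic/claude-sonnet-4",  # illustrative model id
                messages=messages,
                tools=TOOLS,
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                return msg.content  # plain content means the loop is finished
            for call in msg.tool_calls:  # otherwise execute each tool and continue
                args = json.loads(call.function.arguments)
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": search_docs(**args),
                })
        return "Gave up after max_steps."

    print(run_agent([{"role": "user", "content": "Find our refund policy."}]))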

[1]: https://github.com/reVrost/go-openrouter

peab · 22 days ago
I think the term sub-agent is almost entirely useless. An agent is an LLM loop that has reasoning and access to tools.

A "sub agent" is just a tool. It's implantation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc, is meaningless outside of the main tool contract (i.e Params in Params out, SLA, etc)

moinism · 22 days ago
I agree that, technically, a "sub agent" is just another tool. But I think it's important to differentiate tools with deterministic input/output from those with reasoning ability. A simple 'tool' will take the input and try to execute, but a 'subagent' might reason that the action is unnecessary and that the required output already exists in the shared context. Or it can ask the main agent a clarifying question before using its tools.
the_mitsuhiko · 21 days ago
> Its implementation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc., is meaningless outside of the main tool contract (i.e., params in, params out, SLA, etc.)

Up to a point. You're obviously right in principle, but if that task itself has the ability to call into "adjacent" tools then the behavior changes quite a bit. You can see this a bit with how the Oracle in Amp surfaces itself to the user. The oracle as sub-agent has access to the same tools as the main agent, and the state changes (rare!) that it performs are visible to itself as well as the main agent. The tools that it invokes are displayed similarly to the main agent loop, but they are visualized as calls within the tool.

verdverm · 22 days ago
ADK differentiates between tools and subagents based on the ability to escalate or transfer control (subagents), whereas tools are more basic.

I think this is a meaningful distinction, because it impacts control flow, regardless of what they are called. The lexicon varies quite a bit vendor to vendor.

nostrebored · 21 days ago
Nah, when working on anything sufficiently complicated you will have many parallel subagents that need their own context window, ability to mutate shared state, sandboxing differences, durability considerations, etc.

If you want to rewrite the behavior per instance you totally can, but there is a definite concept here that is different than “get_weather”.

I think that existing tools don’t work very well here or leave much of this as an exercise for the user. We have tasks that can take a few days to finish (just a huge volume of data and many non deterministic paths). Most people are doing way too much or way too little. Having subagents with traits that can be vended at runtime feels really nice.

Vinnl · 22 days ago
What does "has reasoning" mean? Isn't that just a system prompt that says something like "make a plan" and includes that in the loop?
ColinEberhardt · 22 days ago
Oh, so _that_ is what a sub-agent is. I have been wondering about that for a while now!
blancm · 22 days ago
Hello - about Claude Code only supporting Anthropic models: you can actually use Claude Code Router (https://github.com/musistudio/claude-code-router) to use other models in Claude Code. I've been using it for a few weeks with open-source models and it works pretty well. You can even use "free" models from OpenRouter.
moinism · 22 days ago
Thank you. But the main blocker for me right now is their skill definition: https://platform.claude.com/docs/en/agent-sdk/skills#how-ski...
verdverm · 22 days ago
Google's ADK is pretty nice. I'm using the Go version, which is less mature than the Python one. I've been at it a bit over a week and progress is great. This weekend I'm aiming for tracking file changes in the session history to allow rewinding/forking.

It has a ton of day 2 features, really nice abstractions, and positioned itself well in terms of the building blocks and constructing workflows.

ADK supports working with all the vendors and local LLMs

dragonwriter · 22 days ago
I really wish ADK had a local persistent memory implementation, though.
mountainriver · 22 days ago
The frameworks are all pointless; just use AI assist to create agents in Python or, ideally, a language with concurrency.

You will be happy you did

moinism · 22 days ago
How do you deal with the different APIs/tool-use schemas in a custom build? As other people have mentioned, it's a bigger undertaking than it sounds.
moduspol · 21 days ago
You will undoubtedly be recreating what already exists in LangGraph. And you'll probably be doing it worse.
otterley · 22 days ago
Have you tried AWS’s Strands Agents SDK? I’ve found it to be a very fluent and ergonomic API. And it doesn’t require you to use Bedrock; most major vendor native APIs are supported.

(Disclaimer: I work for AWS, but not for any team involved. Opinions are my own.)

moinism · 22 days ago
This looks good. Even though it's only in Python, I think it's worth a try. Thanks.
kordlessagain · 22 days ago
If you are still open to trying Codex, I'm working on a containerized version with various features: https://github.com/DeepBlueDynamics/codex-container
moinism · 22 days ago
This looks good, but a bit overkill for what I'm trying to build tbh.
ph4rsikal · 22 days ago
My favourite is Smolagents from Huggingface. You can easily mix and match their models in your agents.
moinism · 22 days ago
Dude, it looks great, but I just spent half an hour learning about its 'CodeAgents' feature, which is essentially 'actions written as code'.

This idea has been floating around in my head, but it wasn't refined enough to implement. It's so wild that what you're thinking of may have already been done by someone else on the internet.

https://huggingface.co/docs/smolagents/conceptual_guides/int...

For those who are wondering, it's kind of similar to the 'Code Mode' idea implemented by Cloudflare and now being explored by Anthropic: write code to discover and call MCPs instead of stuffing the context window with their definitions.

thewhitetulip · 22 days ago
Did you try LangChain/LangGraph? Or am I confusing what the OP means by agents?
langitbiru · 22 days ago
What about AI SDK from Vercel?

https://ai-sdk.dev/docs/agents/overview

moinism · 22 days ago
Haven't tried it yet, but it looks similar to OpenAI's. What is your experience?
mritchie712 · 22 days ago
Some things we've[0] learned on agent design:

1. If your agent needs to write a lot of code, it's really hard to beat Claude Code (cc) / Agent SDK. We've tried many approaches and frameworks over the past 2 years (e.g. PydanticAI), but using cc is the first that has felt magic.

2. Vendor lock-in is a risk, but the bigger risk is having an agent that is less capable than what a user gets out of ChatGPT because you're hand-rolling every aspect of your agent.

3. cc is incredibly self-aware. When you ask cc how to do something in cc, it instantly nails it. If you ask cc how to do something in framework xyz, it will take much more effort.

4. Give your agent a computer to use. We use e2b.dev, but Modal is great too. When the agent has a computer, it makes many complex features feel simple.

0 - For context, Definite (https://www.definite.app/) is a data platform with agents to operate it. It's like Heroku for data with a staff of AI data engineers and analysts.

CuriouslyC · 22 days ago
Be careful about what you hand off to Claude versus another agent. Claude is a vibe project monster, but it will fail at hard things, come up with fake solutions, and then lie to you about them. To the point that it'll add random sleeps and do pointless work to cover up the fact that it's reward hacking. It's also very messy.

For brownfield work, hard problems, or big complex codebases, you'll save yourself a lot of pain if you use Codex instead of CC.

wild_egg · 22 days ago
Claude is amazing at brownfield if you take the time to experiment with your approach.

Codex is stronger out of the box but properly customized Claude can't be matched at the moment

smcleod · 22 days ago
It's quite worrying that several times in the last few months I've had to really drive home why people should probably not be building bespoke agentic systems that essentially act as half-baked versions of an agentic coding tool, when they could just go use Claude Code and instead focus their efforts on creating value rather than instant technical debt.
CuriouslyC · 22 days ago
You can pretty much completely reprogram agents just by passing them through a smart proxy. You don't need to rewrite Claude/Codex - just add context engineering and tool behaviors at the proxy layer.
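
A toy sketch of what that proxy layer can look like - a passthrough in front of an OpenAI-style chat endpoint that injects extra system context before forwarding (FastAPI + httpx assumed; streaming, auth, and error handling omitted):

    # Toy sketch only; path, upstream URL, and injected prompt are illustrative.
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    UPSTREAM = "https://api.openai.com/v1/chat/completions"
    EXTRA_SYSTEM = "Consult the team style guide before proposing any file edits."

    app = FastAPI()

    @app.post("/v1/chat/completions")
    async def proxy(request: Request):
        body = await request.json()
        # The "context engineering" hook: prepend system context, rewrite tool
        # definitions, trim stale history, etc., before the model ever sees it.
        body["messages"] = [{"role": "system", "content": EXTRA_SYSTEM}] + body.get("messages", [])
        async with httpx.AsyncClient(timeout=120) as client:
            upstream = await client.post(
                UPSTREAM,
                json=body,
                headers={"Authorization": request.headers.get("authorization", "")},
            )
        return JSONResponse(upstream.json(), status_code=upstream.status_code)

Point the agent at the proxy via its base-URL setting and it picks up the new behavior without the agent itself changing.
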
faxmeyourcode · 22 days ago
Point 2 is very often overlooked. Building products that are worse than the baseline chatgpt website is very common.
verdverm · 22 days ago
yes, we should all stop experimenting and outsource our agentic workflows to our new overlords...

this will surely end up better than where big tech has already brought our current society...

For real though, where did the dreamers of AI and agentic systems free of the worst companies go? Are we in the season of capitulation?

My opinion... build, learn, share. The frameworks will improve, the time to custom agent will be shortened, the knowledge won't be locked in another unicorn

anecdotally, I've come quite far in just a week with ADK and VS Code extensions, having never done extensions before, which has been a large part of the time spent

postalcoder · 22 days ago
I've been building agent type stuff for a couple years now and the best thing I did was build my own framework and abstractions that I know like the back of my hand.

I'd steer clear of any LLM abstraction. There are so many companies with open-source abstractions offering the panacea of a single interface that are crumbling under their own weight from the sheer futility of supporting every permutation of every SDK evolution, all while the same companies try to build revenue-generating businesses on top of them.

sathish316 · 22 days ago
I agree with your analysis of building your own agent framework to have some level of control and fewer abstractions. Agents at their core are about programmatically talking to an LLM and performing these basic operations:

1. Structured input and string interpolation in prompts

2. Structured output and unmarshalling the string response into structured data (this is getting easier now that LLMs support structured output)

3. Tool registry/discovery (of MCP and function tools), tool calls, and response looping

4. Composability of tools

5. Some form of agent-to-agent delegation

I’ve had good luck with using PydanticAI which does these core operations well (due to the underlying Pydantic library), but still struggles with too many MCP servers/Tools and composability.

I've built an open-source agent framework called OpusAgents that makes the process of creating agents, subagents, and tools simpler than MCP servers, without overloading the context. Check it out, along with the tutorials/demos, to see how it's more reliable than generic agents with MCP servers in Cursor/Claude Desktop: https://github.com/sathish316/opus_agents

It’s built on top of PydanticAI and FastMCP, so that all non-core operations of Agents are accessible when I need them later.

spacecadet · 22 days ago
I also recommend this. I have tried all of the frameworks, and still deploy some for clients - but for my personal agents, it's my own custom framework that is dead simple and very easy to spin up, extend, etc.
drittich · 22 days ago
This sounds interesting. What about the agent behavior itself? How it decides how to come at a problem, what to show the user along the way, and how it decides when to stop? Are these things you have attempted to grapple with in your framework?
wizhi · 21 days ago
So you advise people to build their own framework, then advertise your own?
the_mitsuhiko · 22 days ago
Author here. I'm with you on the abstractions part. I dumped a lot of my thoughts on this into a follow-up post: https://lucumr.pocoo.org/2025/11/22/llm-apis/
thierrydamiba · 22 days ago
Excellent write-up. I've been thinking a lot about caching and agents, so this was right up my alley.

Have you experimented with using a semantic cache on the chain of thought (what we get back from the providers anyway) and sending that to a dumber model for similar queries to "simulate" thinking?

msp26 · 19 days ago
really nice post, will share!
NitpickLawyer · 22 days ago
Yes, this is great advice. It also applies to interfaces. When we designed a support "chat bot", we went with a different architecture than what's out there already. We designed the system around "chat rooms" instead, and the frontend just dumps messages to a chat room (with a session id). Then on the backend we can do lots of things, incrementally adding functionality, while the frontend doesn't have to keep up. We can also do things like group messages, or have "system" messages that other services can read, etc. It also feels more natural, as the client can type additional info while the server is working, etc.
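
A rough sketch of the kind of message record this boils down to (field names are illustrative, not our actual schema):

    # Illustrative only: the frontend just appends these to a room and renders
    # whatever shows up; all the smarts live in backend consumers of the room.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class RoomMessage:
        room_id: str      # the chat room / session id
        sender: str       # "user", "agent", or "system" (other services)
        content: str
        created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))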

If you have to use some of the client side SDKs, another good idea is to have a proxy where you can also add functionality without having to change the frontend.

postalcoder · 22 days ago
Creativity is an underrated hard part of building agents. The fun part of building right now is knowing how little of the design space for building agents has been explored.
verdverm · 22 days ago
This is not so unlike the coding agent I'm building for vs code. One of the things I'm doing is keeping a snapshot of the current vs code state (files open, terminal history, etc) in the agent server. Similarly, I track the file changes without actually writing them until the user approves the diff, so there are some "filesystem" like things that need to be carefully managed on each side.

tl;dr, Both sides are broadcasting messages and listening for the ones they care about.

_pdp_ · 22 days ago
This is a huge undertaking though. Yes, it is quite simple to build some basic abstraction on top of openai.complete or similar, but that is like 1% of what an agent needs to do.

My bet is that agent frameworks and platforms will become more like game engines. You can spin up your own engine for sure, and it is fun and rewarding. But AAA studios will most likely decide to use a ready-to-go platform with all the batteries included.

postalcoder · 22 days ago
In totality, yes. But you don't need every feature at once. You add to it once you hit boundaries. But I think the most important thing about this exercise is that you leave nothing to the imagination when building agents.

The non-deterministic nature of LLMs already makes the performance of agents so difficult to interpret. Building agents on top of code that you cannot mentally trace through leads to so much frustration when addressing model underperformance and failure.

It's hard to argue that companies won't default to batteries-included frameworks after the dust settles but, right now, a lot of people have regretted adopting a large framework off the bat.

eclipsetheworld · 22 days ago
We're repeating the same overengineering cycle we saw with early LangChain/RAG stacks. Just a couple of months ago the term agent was hard to define, but I've realized the best mental model is just a standard REPL:

Read: Gather context (user input + tool outputs).

Eval: LLM inference (decides: do I need a tool, or am I done?).

Print: Execute the tool (the side effect) or return the answer.

Loop: Feed the result back into the context window.

Rolling a lightweight implementation around this concept has been significantly more robust for me than fighting with the abstractions in the heavy-weight SDKs.

throw310822 · 22 days ago
I don't think this has much to do with SDKs. I've developed my own agent code from scratch (starting from the simple loop), and eventually - unless your use case is really simple - you always have to deal with the need for subagents specialised for certain tasks, which share part of their data (but not all) with the main agent, with internal reasoning and reinforcement messages, etc.
eclipsetheworld · 22 days ago
Interestingly, sticking to the "Agent = REPL" mental model is actually what helped me solve those specific scaling problems (sub-agents and shared data) without the SDK bloat.

1. Sub-agents are just stack frames. When the main loop encounters a complex task, it "pushes" a new scope (a sub-agent with a fresh, empty context). That sub-agent runs its own REPL loop, returns only the clean result without any context pollution, and is then "popped".

2. Shared Data is the heap. Instead of stuffing "shared data" into the context window (which is expensive and confusing), I pass a shared state object by reference. Agents read/write to the heap via tools, but they only pass "pointers" in the conversation history. In the beginning this was just a Python dictionary and the "pointers" were keys.

My issue with the heavy SDKs isn't that they try to solve these problems, but that they often abstract away the state management. I’ve found that explicitly managing the "stack" (context) and "heap" (artifacts) makes the system much easier to debug.
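
In code, the split can stay very small. A bare-bones sketch - run_agent_loop(messages, tools) stands in for whatever basic tool-calling loop you already have (a hypothetical name, not a real API), and only the stack/heap handling is shown:

    # Sketch of the stack/heap split; run_agent_loop is assumed, not a real library call.
    HEAP: dict[str, object] = {}  # the "heap": shared artifacts, addressed by key

    def heap_put(key: str, value: object) -> str:
        """Store an artifact; only the key (the 'pointer') enters the context window."""
        HEAP[key] = value
        return key

    def heap_get(key: str) -> object:
        """Dereference a pointer back into the full artifact."""
        return HEAP[key]

    def research_subagent(question: str) -> str:
        # "Push" a stack frame: a fresh, empty context for the sub-task.
        messages = [{"role": "user", "content": question}]
        result = run_agent_loop(messages, tools={"heap_put": heap_put, "heap_get": heap_get})
        # "Pop": only the clean result returns; the sub-agent's intermediate
        # messages are discarded, so they never pollute the parent's context.
        return result

    # To the parent loop, the sub-agent is just one more tool alongside the heap tools.
    MAIN_TOOLS = {"heap_put": heap_put, "heap_get": heap_get, "research": research_subagent}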

mitchellh · 22 days ago
This is why I use the agent I use. I won't name the company, because I don't want people to think I'm a shill for them (I've already been accused of it before, but I'm just a happy, excited customer). But it's an agentic coding company that isn't associated with any of the big model providers.

I don't want to keep up with all the new model releases. I don't want to read every model card. I don't want to feel pressured to update immediately (if it's better). I don't want to run evals. I don't want to think about when different models are better for different scenarios. I don't want to build obvious/common subagents. I don't want to manage N > 1 billing entities.

I just want to work.

Paying an agentic coding company to do this makes perfect sense for me.

pjm331 · 22 days ago
I've been surprised at the lack of discussion about Sourcegraph's Amp here, which I'm pretty sure you're referring to - it started a bit rough, but these days I find it's really good.
SatvikBeri · 21 days ago
So, I tried to sign up for Amp. I saw a livestream that mentioned you can sign up for their community Buildcrew on Discord and get $100 of credits. I tried signing up, and got an email that I was accepted and would soon get the credits. The Discord link did not work (it was expired) and the email was a noreply, so I tried emailing Amp support. This was last Friday (8 days ago.) As of today, no updated Discord link, no human response, no credits. If this is their norm, people probably aren't talking about it because they just haven't been able to try it.
ReDeiPirati · 22 days ago
> We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.

I'm curious about the solutions the op has tried so far here.

hommes-r · 21 days ago
"Because there’s too much you need to feed into it" - what does the author mean by this? If it is the amount of data, then I would say sampling needs to be implemented. If that's the extent of the information required from the agent builder, I agree that an LLM-as-a-judge e2e eval setup is necessary.

In general, a more generic eval setup is needed, with minimal requirements on AI engineers, if we want to move forward from vibes-based reliability engineering practices as a sector.

ColinEberhardt · 22 days ago
Likewise. I have a nasty feeling that most AI agent deployments happen with nothing more than some cursory manual testing - going with the 'vibes' (to use an overused term in the industry).
heljakka · 20 days ago
I can confirm this after hundreds of talks about the topic over the last 2 years. 90% of cases are simply not high-volume or high-stakes enough for the devs to care. I'm a founder of an evaluation-automation startup, and our challenge is spotting teams right as their usage starts to grow and quality issues are about to escalate. Since that's tough, we're trying to make getting to first evals so simple that teams can start building the mental models before things get out of hand.
radarsat1 · 22 days ago
A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.

In my case, I was until recently working on TTS, and this was a huge barrier for us. We used all the common signal-quality and MOS-simulation models that judge so-called "naturalness" and "expressiveness", etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning, as well as commercial planning, extremely difficult, and we suffered greatly for it. (Notice my use of the past tense here...)

Having good metrics is just really key and I'm now at the point where I'd go as far as to say that if good metrics don't exist it's almost not even worth working on something. (Almost.)

heljakka · 21 days ago
What are the main shortcomings of the solutions you tried out?

We believe you need to both automatically create the evaluation policies from OTEL data (data-first) and to bring in rigorous LLM judge automation from the other end (intent-first) for the truly open-ended aspects.

ramraj07 · 21 days ago
It's a 2-day project at best to create your own bespoke LLM-as-judge e2e eval framework. That's what we did. Works fine. Not great. Still need someone to write the evals, though.
verdverm · 22 days ago
ADK has a few pages and some API for evaluating agentic systems

https://google.github.io/adk-docs/evaluate/

tl;dr - it's challenging because different runs produce different output, and also: how do you decide pass/fail? (Another LLM/agent as judge is what people do.)

havkom · 22 days ago
My tip is: don't use SDKs for agents. Use a while loop and craft your own JSON; handle context size and handle faults yourself. You will in practice need this level of control if you are not doing something trivial.
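
A bare-bones sketch of that, assuming an OpenAI-compatible endpoint (URL, model, and limits are illustrative; tool execution is left as a stub):

    # Raw-JSON agent loop with naive context trimming and fault handling. Sketch only.
    import time
    import requests

    URL = "https://api.openai.com/v1/chat/completions"
    HEADERS = {"Authorization": "Bearer YOUR_KEY"}
    MAX_MESSAGES = 40  # crude context-size control: system prompt + most recent turns

    def trim(messages):
        if len(messages) <= MAX_MESSAGES:
            return messages
        return messages[:1] + messages[-(MAX_MESSAGES - 1):]

    messages = [{"role": "system", "content": "You are a helpful agent."},
                {"role": "user", "content": "Summarise yesterday's error logs."}]

    retries = 0
    while True:
        try:
            resp = requests.post(URL, headers=HEADERS, timeout=120,
                                 json={"model": "gpt-4o-mini", "messages": trim(messages)})
            resp.raise_for_status()
        except requests.RequestException:
            retries += 1
            if retries > 3:
                raise          # fault handling: give up after a few attempts
            time.sleep(2 * retries)
            continue
        retries = 0
        message = resp.json()["choices"][0]["message"]
        messages.append(message)
        if not message.get("tool_calls"):
            print(message["content"])  # no tool calls left: this is the final answer
            break
        # ...execute each tool call here, append results as "tool" messages, and loop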