madrox · 3 months ago
I've been saying for two years that "any sufficiently advanced agent is indistinguishable from a DSL."

Rather than asking an agent to internalize its algorithm, you should teach it an API and then ask it to design an algorithm which you can run in user space. There are very few situations where I think it makes sense (for cost or accuracy) for an LLM to internalize its algorithm. It's like asking an engineer to step through a function in their head instead of just running it.
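
A minimal sketch of the idea, assuming a hypothetical llm_complete() helper and a toy orders API (none of these names are a real library):

    API_DOC = """
    get_orders(status: str) -> list[dict]  # each dict has 'id' and 'total'
    refund(order_id: str) -> None
    """

    def build_tool(task: str) -> str:
        # Ask the model to write the algorithm once, instead of
        # stepping through it in its head turn by turn.
        return llm_complete(
            f"Using only this API:\n{API_DOC}\n"
            f"Write a Python function run() that does: {task}. Return only code."
        )

    code = build_tool("refund every cancelled order over $100")
    namespace = {"get_orders": get_orders, "refund": refund}
    exec(code, namespace)  # the algorithm now runs in user space
    namespace["run"]()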

ianbicking · 3 months ago
I think I understand what you're proposing, but I'm not sure.

So in concrete terms I'm imagining:

1. Create a prompt that gives the complete API specification and some general guidance about what role the agent will have.

2. In that prompt, ask it to write a function that the agent can use concisely, written from the agent's perspective and for the agent's consumption. The body of that function translates the agent-oriented function definition into an API call.

3. Now the agent can use these modified versions of the API that expose only what's really important from its perspective.

4. But there's no reason APIs and functions have to map 1:1. You can wrap multiple APIs in one function, or break things up however makes the most sense.

5. Now the API-consuming agent is just writing library routines for other agents, and creating a custom environment for those agents.

6. This is all really starting to look like a team of programmers building a platform.

7. You could design the whole thing top-down as well, speculating then creating the functions the agents will likely want, and using whatever capabilities you have to implement those functions. The API calls are just an available set of functionality.

And really you could have multiple APIs being used in one function call, and any number of ways to rephrase the raw capabilities as more targeted and specific capabilities.
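
A rough sketch of what one of those agent-oriented wrappers (points 3 and 4) might look like, assuming hypothetical crm_api and billing_api clients:

    def customer_summary(email: str) -> dict:
        """Return only what the agent needs to know about a customer."""
        person = crm_api.lookup(email=email)       # raw API call 1
        invoices = billing_api.list(person["id"])  # raw API call 2
        return {
            "name": person["name"],
            "plan": person["plan"],
            "unpaid": sum(i["amount"] for i in invoices if not i["paid"]),
        }

Two raw APIs collapse into one function that exposes only what matters from the agent's perspective.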

symbolicAGI · 3 months ago
Evidence that the path to ASI is not extending the capabilities of LLMs, but instead distilling out and compiling self-improving algorithms externally into a symbolic application.
fooker · 3 months ago
Can you point to evidence of widespread use of the word 'agent' in this context from two years ago?
lolinder · 3 months ago
Here are the top articles for the month of May 2023 on HN with "agent" in the title [0]. Looks like early days for the term but with a few large hits (like the HuggingFace announcement), which suggests OP was surprisingly precise in their citation of two years as the time window.

Also, since you're implicitly questioning OP's claim to have been saying this all along, here's a comment from September 2023 where they first said the same quote and said they'd been building agents for 3 months by that point [1]. That's close enough to 2 years in my book.

[0] https://hn.algolia.com/?dateEnd=1685491200&dateRange=custom&...

[1] https://news.ycombinator.com/item?id=37626877

nitwit005 · 3 months ago
I'm sure you can find it in chatbot documentation from the 90s. It's a generic term carried over from non-AI chat. People responding to support chats were called agents.
jacob019 · 3 months ago
I've been building agentic systems for my ecommerce business. I evaluated smolagents. It's elegant and has a lot of appealing qualities, but adds a lot of complexity to the system. For some tasks it's perfect; dynamic reporting environments that can sort and aggregate data without a schema might be a good one. For most tasks it's just overkill. Gemini and OpenAI both offer Python interpreters as tools, which can cover a lot of the use cases for code agents.

It's true that cramming endless messages onto a stack of tool calls and interactions is not scalable; that is not a good way to use these tools. Most agentic workflows are short-lived. Complexity is managed with structure and discipline. These are well-known problems in software development, and the old lessons still apply to the new tools. Function calls can absolutely scale well in an agentic system, or they can become a mess, just like they can in any codebase.

Personally, building a system that works well is as much about managing cognitive load as a developer as it is about managing control flow and runtime performance. A simple solution that works well enough is usually superior to a clever solution with great performance. Composing function calls is the simple solution. Structured data can still be parsed and transformed the old-fashioned way. If the structure is unknown, even the cheap models are great at parsing.

Managing complexity in an agentic system can be broken down into a problem of carefully managing application state. The message stack can be manipulated as needed to feed the models the active context. It's memory management in a constrained environment.
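
As a sketch of that last point, "manipulating the message stack" can be as simple as keeping the system prompt plus the most recent turns and summarizing the rest (summarize() standing in for a cheap-model call):

    def compact(messages: list[dict], keep_last: int = 8) -> list[dict]:
        system, rest = messages[:1], messages[1:]
        if len(rest) <= keep_last:
            return messages
        # Collapse older turns into one summary message to keep the
        # active context small.
        summary = {"role": "user",
                   "content": "Summary of earlier context: " + summarize(rest[:-keep_last])}
        return system + [summary] + rest[-keep_last:]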
qu0b · 3 months ago
Great summary of the trade-offs in agentic systems. We've tackled these exact challenges while building our conversational product discovery tool for e-commerce at IsarTech [0].

I agree that function composition and structured data are essential for keeping complexity in check. In our experience, well-defined structured outputs are the real scalability lever in tool calling. Typed schemas keep both cognitive load and system complexity manageable. We rely on deterministic behavior wherever possible, and reserve LLM processing for cases involving schema-less data or ambiguity. It's a great tool for mapping fuzzy user requests onto a more structured, deterministic system.

That said, finding the right balance between taking complexity out of high-entropy input and introducing complexity through chained tool calling is a tradeoff that needs to be struck carefully. In real-world commerce settings, you rarely get away with just one approach. Structured outputs are great until you hit ambiguous intents; then things get messy and you need fallback strategies.
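
To make the typed-schema idea concrete, a sketch with pydantic (the fields are illustrative, not our actual schema):

    from pydantic import BaseModel

    class ProductQuery(BaseModel):
        category: str | None = None
        max_price: float | None = None
        keywords: list[str] = []

    # llm_response_json: the model's structured output (a JSON string).
    # The LLM maps a fuzzy request into this schema; everything
    # downstream stays deterministic.
    query = ProductQuery.model_validate_json(llm_response_json)
    results = search_catalog(**query.model_dump())  # hypothetical deterministic search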

[0] https://isartech.io/

jacob019 · 3 months ago
Ambiguity must be explicitly handled, like uncertainty in predictive modeling, and that can be challenging. I run into trouble with task complexity. At a certain point even the best models start making dumb mistakes, and it's tough to draw the line for decomposing tasks. Role-playing to induce planning and reflection helps, but I feel that upper bound. I've noticed that model performance declines when using constrained outputs. Last year I would go to all this trouble decomposing tasks in ways that seem silly given the current models. At the pace things are moving, I expect to see models soon that can handle 10x the complexity and 10 MB of context; I just hope I can afford to use them.
mehdibl · 3 months ago
The issue is not in function calls, but in how MCP got designed and how you are using it.

Most MCPs just replicate an API, returning blobs of data.

1. It uses a lot of input context on formatting, escaping JSON inside JSON that is already JSON.

2. It contains a lot of irrelevant information that you could trim.

So the issue is the MCP tool. It should instead flatten the data as much as possible, since it goes back through JSON encoding again, and remove some fields if needed.

MCP SaaS vendors here are mainly API gateways. That brings this noise! And most of all, they are not optimizing their MCPs.
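
A minimal sketch of that flattening, done inside the MCP tool itself (the field names are made up for illustration):

    def flatten_order(raw: dict) -> dict:
        # Return only what the model needs, one level deep.
        return {
            "id": raw["id"],
            "status": raw["status"],
            "customer": raw["customer"]["email"],   # pull the nested field up
            "total": raw["totals"]["grand_total"],
            # audit metadata, pagination links, etc. are dropped entirely
        }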

jensneuse · 3 months ago
This is what GraphQL was designed for: you only select the fields you really need. We've built an OSS gateway that turns a collection of GraphQL queries into an MCP server to make this simple: https://wundergraph.com/mcp-gateway
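
For illustration, a query like this returns exactly three fields per ticket and nothing else (the endpoint and schema are made up):

    import requests

    query = """
    query StalledTickets {
      tickets(status: STALLED) { id title assignee }
    }
    """
    resp = requests.post("https://api.example.com/graphql", json={"query": query})
    tickets = resp.json()["data"]["tickets"]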
jokethrowaway · 3 months ago
MCP doesn't help but filtering is not always a good solution - sometimes you just need the agent to process a lot of data.

In that scenario, running code on the data with minimal evaluation of the data (e.g. a schema with explanation) is a much better approach, and it will scale up to use cases of a certain complexity.

Even this system is not perfect: once your data definitions and orchestration grow too big, you'll face the same problems.

This should allow you to scale to pretty complex problems, though, while the naive approach of just embedding API responses in the chat fails quickly (I run into this issue frequently, maintaining a relatively simple system with a few tool calls).

The only proper solution is reproducing the granularity of human decisions in code and calling this "decisional system" from an LLM (which would then be reduced to a mere language interface between human language and the internal system). Easier said than done, though.

never_inline · 3 months ago
> 1. It uses a lot of input context on formatting, escaping JSON inside JSON that is already JSON.

Isn't it a model problem that they don't respect complex JSON schemas?

devoutsalsa · 3 months ago
Just for fun, I used ChatGPT to reverse a string as my first test of their API. I was amused at how much work it took to get the LLM to give me just the reversed string, and even then I didn't feel I could fully trust it. I learned my lesson, and now I have multiple LLMs check whether the string has actually been reversed. Soon I'll be spinning up a data center to host the GPUs necessary to correctly count the number of Rs in strawberry.
obiefernandez · 3 months ago
My team at Shopify just open sourced Roast [1] recently. It lets us embed non-deterministic LLM jobs within orchestrated workflows. Essential when trying to automate work on codebases with millions of lines of code.

[1] https://github.com/shopify/roast

TheTaytay · 3 months ago
Wow - Roast looks fantastic. You architected and put names and constraints on some things that I've been wrestling with for a while. I really like how you are blending determinism and non-determinism. (One thing that is not obvious to me after reading the README a couple of times, quickly, is whether/how the LLM can orchestrate multiple tool calls when necessary and decide which tools to call in which order. It seems like it does when you tell it to refactor, but I couldn't tell if this would be suitable for the task of "improve, then run tests; repeat until done.")
drewda · 3 months ago
Nice to see Ruby continuing to exist and deliver... even in the age of "AI"
crakhamster01 · 3 months ago
This looks pretty cool! I'm curious how these sort of workflows are being used internally at Shopify. Any examples you can share?
bandoti · 3 months ago
This is great! Reading the docs tickles my brain. Nice way to package up LLM functionality in a declarative way!
The_Blade · 3 months ago
Good stuff!

I just broke Claude Code Research Preview, and I've crashed ChatGPT 4.5 Pro Deep Research. And I have the receipts :), so I'm looking for tools that work.

hintymad · 3 months ago
I feel that the optimal solution is hybrid, not polarized. That is, we use deterministic approaches as much as we can, but leverage LLMs to handle the remaining complex parts that are hard to spec out or describe deterministically.
jngiam1 · 3 months ago
Yes - in particular, I think one interesting angle is to use the LLM to generate deterministic approaches (code). Then, if the code works, save it for future use, and it becomes deterministic moving forward.
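
A minimal sketch of that "generate once, reuse deterministically" loop (llm_write_function() is a hypothetical helper that returns Python source defining run()):

    import hashlib

    _tool_cache: dict[str, str] = {}

    def get_tool(task_description: str):
        key = hashlib.sha256(task_description.encode()).hexdigest()
        if key not in _tool_cache:
            _tool_cache[key] = llm_write_function(task_description)  # one LLM call
        ns: dict = {}
        exec(_tool_cache[key], ns)  # same cached code every time => deterministic
        return ns["run"]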
hintymad · 3 months ago
Yes, and the other way around: use deterministic methods to generate the best possible input to the LLM.
nowittyusername · 3 months ago
I agree. You want to use as little LLM as possible in your workflows.
mort96 · 3 months ago
I've been developing software for decades without LLMs, turns out you can get away with very little!
padjo · 3 months ago
Sorry I’ve been out of the industry for the last year or so, is this madness really what people are doing now?
_se · 3 months ago
No, not most people. But some people are experimenting.

No one has found anything revolutionary yet, but there are some useful applications to be sure.

padjo · 3 months ago
Or, we have a hammer and we’re hitting things with it to see if they’re nails.
tobyhinloopen · 3 months ago
Some people believe that if you're not doing this now, you might be out of the industry again pretty soon.
czechdeveloper · 3 months ago
My daily job by now is massively using AI to develop an AI agent designer, which means a lot of stuff like this.

I really did not even want this, it just happened.

codyb · 3 months ago
I'm slightly confused as to why you'd use an LLM to sort structured data in the first place?
jngiam1 · 3 months ago
The goal is to do more complex data processing: build dashboards, agentically figure out which tickets are stalled, do a quarterly review of things done, etc. Sorting is a tiny task within the bigger ones, but hopefully it exemplifies the problem more simply.
kikimora · 3 months ago
I don’t understand how this can work. Given the probabilistic nature of LLMs, the more steps you have, the more chances something goes off. What good is a dashboard if you cannot be sure it was not partially hallucinated?
risyachka · 3 months ago
Everything you described is already solved by Metabase and a few other tools. It takes a few hours to make daily reports there, along with the dashboard of your dreams.

And it's not like it changes every day. KPIs etc. stay the same for months, and then you can easily update them in an hour.

So what exactly does the LLM solve here?

avereveard · 3 months ago
That's kind of the entire premise of Hugging Face's smolagents, and while it does work really well when it works, it also increases the challenge of rolling back failed actions.

I guess one could in principle wrap the entire execution block in a distributed transaction, but LLMs try to make code that is robust, which works against this pattern because it makes failures hard to understand.

jngiam1 · 3 months ago
Agree, the smolagents premise is good, but the hard part is handling execution, errors, etc.

For example, when code execution fails midway, we really want the model to be able to pick up from where it failed (with the state of the variables at the time of failure) and continue from there.

We've found that the LLM is able to generate correct code that picks up gracefully. The hard part now is building the runtime that makes that possible; we have something that works pretty well in many cases, now in production at Lutra.
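
Very roughly, the shape of it (llm_fix() is a hypothetical helper, not Lutra's actual API):

    def run_with_resume(code: str, env: dict) -> None:
        try:
            exec(code, env)
        except Exception as e:
            # Hand the live variable state back to the model so it can
            # write a continuation instead of restarting from scratch.
            state = {k: repr(v) for k, v in env.items() if not k.startswith("__")}
            continuation = llm_fix(code=code, error=str(e), variables=state)
            exec(continuation, env)  # picks up with the surviving state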

avereveard · 3 months ago
I think in principle you can make the entire API exposed to the LLM idempotent, so that it becomes irrelevant to the backend whether the LLM replays the whole action or just the failed steps.
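
Sketched with an idempotency key per side effect (the in-memory store is just for illustration):

    _done: dict[str, object] = {}

    def idempotent(key: str, action, *args):
        if key in _done:
            return _done[key]  # a full replay hits the cache, no double effect
        _done[key] = action(*args)
        return _done[key]

    # e.g. idempotent("refund:order-42", refund, "order-42")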
hooverd · 3 months ago
Could you implement an actual state machine and have your agent work with that?
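
One shape that could take, sketched minimally: the agent only picks actions, and the machine enforces which transitions are legal (states and actions are made up):

    TRANSITIONS = {
        ("draft", "submit"): "review",
        ("review", "approve"): "done",
        ("review", "reject"): "draft",
    }

    def step(state: str, action: str) -> str:
        nxt = TRANSITIONS.get((state, action))
        if nxt is None:
            raise ValueError(f"illegal action {action!r} in state {state!r}")
        return nxt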