baalimago · 3 months ago
For every example-agent they gave, an ordinary 'dumb' (as in 'non-intelligent') service would've sufficed...

So, to give an example of what's worked really well for me: I'm working for an app hosting startup named Wasmer, and we host a decent number of apps. Some of these are malicious. To detect the malicious apps effectively, we have an app-vetting agent named Herman. Herman reads the index page of every newly created app, alongside a screenshot of that page, and flags whether he thinks the app is malicious. Then some human (usually me) inspects the app and makes the final decision on whether it should be banned or not.

This allows us to scan quite a large number of apps and filter out the noise of non-malicious ones. Doing this with a 'dumb' service wouldn't really be feasible, and the context of an LLM fits perfectly, since it gets both an image and the source code. An LLM is also quite 'omniscient', in that it knows, for example, that DANA is a bank in Malaysia, something I personally had no idea about.
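The Herman-style flow above (LLM flags, human decides) can be sketched roughly as follows. This is a minimal illustration, not Wasmer's actual code: `call_llm` stands in for a multimodal API call (the screenshot would travel as an image attachment, which is elided here), and the JSON schema is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    suspicious: bool
    reason: str


def build_vetting_prompt(index_html: str) -> str:
    # The screenshot would be attached as an image part of a multimodal
    # request; only the text side of the prompt is shown here.
    return (
        "You are an app reviewer. Given the index page below (and the "
        "attached screenshot), decide whether the app looks malicious "
        "(phishing, brand impersonation, malware). Answer as JSON: "
        '{"suspicious": true|false, "reason": "..."}\n\n'
        f"--- INDEX PAGE ---\n{index_html}"
    )


def parse_verdict(raw: str) -> Verdict:
    data = json.loads(raw)
    return Verdict(suspicious=bool(data["suspicious"]), reason=str(data["reason"]))


def vet_app(index_html: str, call_llm) -> Verdict:
    """Flag an app for human review; the final ban decision stays with a person."""
    return parse_verdict(call_llm(build_vetting_prompt(index_html)))
```

The key property is that the agent only ever produces a structured flag for a human queue; it never bans anything itself.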

I think tedious and time-consuming chores like this are a great use of agents. Next in line for my experimentation is to use agents for 'fuzzy' integration testing, where the LLM simply has access to a browser, CLI tools, and UAT specifications, and may (in an isolated environment) do whatever it wants. It should then report back any findings and suggested improvements through an MCP integration with our ticketing system. The idea is to harness the hallucinations to find issues.

baalimago · 3 months ago
I tried doing LLM-based tests, coined them "agentic tests", and it worked quite well:

The idea was to use Stagehand [1] as the testing framework and integrate it with Linear, our ticketing system. During a hackathon I whipped something together: first the 'agent' read a UAT from Linear, then it passed this into a quite heavily prompted Stagehand. The prompts instructed Stagehand to run the UAT to the best of its ability and to take very structured notes on what failed at each step. Once the Stagehand process was done, the 'agent' reported back into Linear which steps succeeded and which failed.

Fundamentally the idea was sound, but there were some limitations in both the Linear SDK and in Stagehand. With some better tooling (or a novel system), I predict this sort of agentic testing will work very well, especially for exploratory testing where the agent may be prompted to act like either a 90-year-old grandma or a 16-year-old turbogamer. Privacy-safe usage heatmaps could also be generated automatically to test the UX, since each run yielded slightly different approaches to achieving the UATs.
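The run-and-report loop described above might look something like this. Note this is a language-agnostic sketch (Stagehand itself is a TypeScript library); `try_step` is a hypothetical stand-in for the browser-driving Stagehand call, and the report format is invented:

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    step: str
    passed: bool
    notes: str


def run_uat(steps, try_step):
    """Run each UAT step via a browser-driving agent (try_step stands in
    for the Stagehand-like call) and collect structured notes per step."""
    results = []
    for step in steps:
        passed, notes = try_step(step)
        results.append(StepResult(step, passed, notes))
    return results


def format_report(uat_title, results):
    """Render the results as a comment for the ticketing system (e.g. Linear)."""
    lines = [f"Agentic test run for: {uat_title}"]
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        lines.append(f"- [{status}] {r.step}: {r.notes}")
    return "\n".join(lines)
```

Keeping the per-step structure explicit is what makes the report usable as a ticket comment rather than free-form LLM prose.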

Testing teams need not apply!

[1]: https://www.stagehand.dev/

ednite · 3 months ago
Thanks for sharing this. I'm actually starting to explore integrating an agent into one of my SaaS solutions, based on a client request.

To be honest, my experience with agents is still pretty limited, so I’d really appreciate any advice, especially around best practices or a roadmap for implementation. The goal is to build something that can learn and reflect the company’s culture, answer situational questions like “what to do in this case,” assist with document error checking, and generally serve as a helpful internal assistant.

All of this came from the client's desire to have a tool that aligns with their internal knowledge and workflows.

Is something like this feasible in terms of quality and reliability? And beyond hallucinations, are there major security concerns or roadblocks I should be thinking about?

ursaguild · 3 months ago
Ingesting documents and using natural language to search your org docs with an internal assistant sounds more like a good use case for RAG[1]. Agents are best when you need to autonomously plan and execute a series of actions[2]. You can combine the two but knowing when depends on the use case.
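The RAG pattern mentioned above boils down to: retrieve the most relevant documents, then stuff them into the prompt as context. A minimal sketch follows; real systems score relevance with embeddings and a vector store, so the word-overlap scorer here is only a dependency-free stand-in:

```python
def score(query: str, doc: str) -> int:
    # Stand-in relevance score: count shared words. A real RAG system
    # would use embedding similarity instead.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most relevant to the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Ground the LLM's answer in retrieved org documents."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The point of the structure: the model never has to "know" the org's documents; it only has to read the ones retrieved for each question.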

I really like the OpenAI approach and how they outlined the thought process of when and how to use agents.

[1] https://www.willowtreeapps.com/craft/retrieval-augmented-gen...

[2] https://www.willowtreeapps.com/craft/building-ai-agents-with...

ednite · 3 months ago
Interesting, and thanks for the explanations.

In this case, the agent would also need to learn from new events, like project lessons learned, for example.

Just curious: can a RAG[1] system actually learn from new situations over time in this kind of setup, or is it purely pulling from what's already there?

eric-burel · 3 months ago
The complexity of an agent may range from something relatively simple to whatever level of complexity you want. So your project sounds doable, but you'll have to run some exploration to get proper answers. Regarding reliability, quality, and security: it is as important to learn how to observe an agent system as it is to learn how to implement one. An agent/LLM-based solution is proven to work only if you observe that it actually works; experiments, tests, and monitoring are not optional, unlike in, e.g., web development. As for security concerns, you'd want to take a look at the OWASP Top 10 for LLMs: https://owasp.org/www-project-top-10-for-large-language-mode... LLMs/agents indeed have their own new set of vulnerabilities.
ednite · 3 months ago
That’s sound advice, really appreciate the link. Regarding your point about continuous monitoring, that’s actually the first thing I mentioned to the client.

It’s still highly experimental and needs to be observed, corrected, and tweaked constantly, kind of like teaching a child, where feedback and reinforcement are key.

I may share my experience with the HN community down the line. Thanks again!

abelanger · 3 months ago
I'm a big fan of https://github.com/humanlayer/12-factor-agents because I think it gets at the heart of engineering these systems for usage in your app rather than a completely unconstrained demo or MCP-based solution.

In particular, you can address most concerns around security and reliability when you treat your LLM call as a library method with structured output (Factor 4) and own your own control flow (Factor 8). There should never be a case where your agent is calling a tool with unconstrained input.
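Those two factors can be sketched in a few lines. This is a minimal illustration under assumed names (the action vocabulary, `call_llm`, and the JSON shape are all hypothetical), not the 12-factor-agents reference implementation:

```python
import json

# Factor 4: the LLM returns structured output, validated against an
# explicit whitelist before anything is executed.
ALLOWED_ACTIONS = {"search_issues", "post_comment", "done"}


def next_action(call_llm, transcript):
    """Treat the LLM call as a library method returning structured output."""
    data = json.loads(call_llm(transcript))
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"disallowed action: {action!r}")
    return action, data.get("args", {})


def run(call_llm, tools, max_steps=5):
    """Factor 8: the loop, the step cap, and the dispatch live in ordinary
    code that you own, not inside the model."""
    transcript = []
    for _ in range(max_steps):
        action, args = next_action(call_llm, transcript)
        if action == "done":
            return transcript
        transcript.append((action, tools[action](**args)))
    return transcript
```

Because the dispatch table and the whitelist are plain code, there is no path by which the model can call a tool you did not explicitly expose.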

ednite · 3 months ago
I guess I’ve got some reading and research ahead of me. I definitely would rather support the idea of treating LLM calls more like structured library functions, rather than letting them run wild.

Definitely bookmarking this for reference. Appreciate you sharing it.

glompers · 3 months ago
If the goal is a solid internal culture at the company, this would give the company an automatic dart thrower that reliably hits a target that is the wrong target altogether.

Supposing you hired a consultant to be "culture keeper" for this company -- and she or he said, "I'm just going to reason about context by treating this culture as a body of text" -- you would instantly assume that they didn't have skin in the game and didn't understand how culture actually grows and accretes, let alone monitoring and validating eventual quality or reliability. We can't read about what rules apply in some foreign culture's situations and then remotely prescribe what to do socially in a foreign culture we've never set foot in. We can't accurately anticipate even the second-order effects of our recommendations in that situation.

We simply have to participate first. It would be better for this to be a role that involves someone inside of the company who does participate in navigating the culture themselves so that they make accurate observations from experience. A person trustworthy enough to steward this culture would also necessarily be trustworthy enough not to alarm the chief of HR. Based on my model of how work works, from experience, I am wondering if they imagine they want this sensitive role filled with a nonhuman 'trusted' advisor so that it can't ever become a social shadow power center within the firm.

Or maybe they don't want to admit that modeling culture is beyond the reach of their matter-of-fact internal process models and simulations, and they're just wishfully hoping you can abstract away all of the soft elements without producing social fever dreams or ever having to develop a costly true soft element model. But then you absolutely abstract away where the rubber meets the road! That's quite a roadblock, to be honest with you.

ednite · 3 months ago
Oh, your response basically captures hours of conversation I’ve had with that client and many others. I completely agree, no AI tool can replace years of experience within a company. Trying to model that too literally risks creating misunderstandings, or worse, damaging trust and reputation.

It definitely will never be a replacement for HR or top executive thinking. At best, I’ll be proposing something much lighter, more like a glorified internal search tool over real user examples. To be honest, I’m still figuring it all out. Best case: a helpful resource guide. Worst case: it adds no real value.

The tricky part is, if I don’t provide something, even just a prototype, they’re already looking at other consultants who’ll happily promise the moon. And that’s my bigger concern: if I’m not involved, someone else might introduce a half-baked solution that interferes with the SaaS I’ve already built for them.

So now I’m in a position where I need to put together a clear, honest demo that shows what this tech can and can’t do, just to avoid further complications down the line.

Ironically, this all started when another “AI expert” sold them the idea.

I’ve been saying the same thing all along, we’re not quite there yet. Maybe one day, but not now.

I also get that businesses want to take full advantage of this tech when it’s pitched as a money-saving opportunity, the pressure to act fast is real.

I wonder how many other devs and consultants are facing similar situations?

trevinhofmann · 3 months ago
Others have given some decent advice based on your comment, but would you be interested in a ~30 minute (video) call to dive a bit deeper so I can give more tailored suggestions?
_pdp_ · 3 months ago
IMHO this guide should have been called "a theoretical guide for building agents". In practice, you cannot build agents like that if you want them to do useful things.

Also, the examples provided are not only impractical but potentially bad practice. Why do you need a manager pattern to control a bunch of language-translation agents when most models will do fine on their own, especially for Latin-based languages? In practice a single LLM will not only be more cost-effective but also better for the overall user experience.

Also, prompting is the real unsung hero that barely gets a mention. In practice you cannot get away with just a couple of lines describing the problem and solution at a high level. Prompts are complex and very much an art form; frankly, let's be honest, there is no science whatsoever behind them, just intuition. But in practice they have an enormous effect on overall agent performance.

This guide is not really aimed at developers who want to learn how to build agents, but at business executives and decision-makers who need a high-level understanding without the practical implementation details. It glosses over the technical challenges and complexity that developers actually face when building useful agent systems in production environments.

3abiton · 3 months ago
Do you have any good practical guide in mind?
HaZeust · 3 months ago
Bumped and following
anshumankmr · 3 months ago
>Also, the examples provided are not only impractical but potentially bad practice. Why do you need a manager pattern to control a bunch of language-translation agents when most models will do fine on their own, especially for Latin-based languages? In practice a single LLM will not only be more cost-effective but also better for the overall user experience.

Usually that's how these agent tutorials work. I don't think there are any open-sourced, large-scale agent applications that anyone has published online yet, especially the sort of apps where one agent hands off to another.

helsinki · 3 months ago
Has anyone solved scoped permissions in multi-agent systems? For instance, if a user asks an orchestrator agent to:

1) Search GitHub for an issue in their repo.

2) Fix the issue and push it to GitHub.

3) Search Jira for any tasks related to this bug, and update their status.

4) Post a message to Slack, notifying the team that the issue is fixed.

Now, let’s assume this agent is available to 1000 users at a company. How does the system obtain the necessary GitHub, Jira, and Slack permissions for a specific user?

The answer is fairly obvious if the user approves each action as the task propagates between agents, but how do you do this in a hands-free manner? Let’s assume the user is only willing to approve the necessary permissions once, after submitting their initial prompt and before the orchestrator agent attempts to call the GitHub agent.

If anyone could offer any advice on this, I would really appreciate it. Thank you!

simonw · 3 months ago
I would solve this using the equivalent of a service account - I would give that "agent" an identity - "CodeBot" or whatever - and then treat that as an actor which has permission to read things on Jira, permission to send notifications to Slack, permission to access the GitHub API, etc.

Then I would control who had permission to tell it what to do, and log everything in detail.
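A minimal sketch of that service-account pattern, under assumed names (the grant table, the "CodeBot" actor, and the log shape are all illustrative, not any particular product's API):

```python
import datetime

# Grants held by the service account itself, not by the requesting user.
GRANTS = {
    "codebot": {("jira", "read"), ("slack", "notify"), ("github", "write")},
}

AUDIT_LOG = []


def authorized_call(actor, service, scope, on_behalf_of, fn, *args):
    """Run fn only if the service-account actor holds the (service, scope)
    grant, and log every attempt with the human who triggered it."""
    allowed = (service, scope) in GRANTS.get(actor, set())
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "actor": actor,
        "service": service,
        "scope": scope,
        "on_behalf_of": on_behalf_of,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{actor} lacks {scope} on {service}")
    return fn(*args)
```

The user's identity only appears in the audit trail; the permission check is always against the agent's own fixed grant set, which is what keeps the model from escalating beyond what the bot was provisioned with.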

evantahler · 3 months ago
We do this at Arcade.dev with OAuth scopes and token persistence for each agent. There is a grant required for each tool-user tuple the first time, but then it's remembered.
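The grant-once-then-remember pattern is easy to picture as a token cache keyed by the (user, tool) tuple. This is a generic in-memory sketch of the idea, not Arcade.dev's implementation (which would persist tokens and handle refresh/expiry):

```python
# (user_id, tool_name) -> token; a real system would persist this securely.
_tokens = {}


def get_token(user_id, tool_name, run_oauth_flow):
    """First use of a tool by a user triggers an interactive OAuth grant;
    later calls reuse the remembered token for that (user, tool) tuple."""
    key = (user_id, tool_name)
    if key not in _tokens:
        _tokens[key] = run_oauth_flow(user_id, tool_name)  # interactive step
    return _tokens[key]
```

After the first grant, the orchestrator can act hands-free on the user's behalf because every tool call resolves its token from the cache rather than re-prompting.
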
CMCDragonkai · 3 months ago
Yea we have been developing Polykey for this purpose. Sent you an email to discuss.
ramesh31 · 3 months ago
Tools are the only thing that matters, and are what you should focus on, not "agents" as a separate concept. Locking yourself into any particular agent framework is silly; they are nothing but LLM-calling while-loops connected to JSON/XML parsers. Tools define and shape the entirety of an agent's capability to do useful things, and through MCP they can be trivially shared with virtually any agentic process.
gavmor · 3 months ago
Yes, I wonder if it ever makes sense to partition an agent's toolkit across multiple "agents"—besides horizontal scaling. Why should one process have access to APIs that another doesn't? Authorization and secrets, maybe, but functionality?
DarmokJalad1701 · 3 months ago
Depending on the capability or context size of the model, it is easy to imagine a situation where specifying too many tools (or MCPs) can overwhelm it, affecting the accuracy.
