rchaves (u/rchaves) - Readit News

rchaves commented on Show HN: Simulation-Based Testing for Agents Using AG-UI Protocol github.com/langwatch/scen... · Posted by u/0xdeafcafe

rchaves · 8 months ago

Hello HN!

tl;dr: We built Scenario, an open-source testing library for AI agents. It simulates real conversations with your agent, its code-driven, and lets you assert anything mid-dialogue. Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/

I'm Rogerio, founder of LangWatch, I've been helping many customers building LLM applications in this past two years and worked with Alex on this.

Most of the efforts for LLM quality so far were about evaluations, single-turn, there was nothing actually good to test agents, it all felt forced, but we believe we cracked it now, we have built an agent testing library that test your agent by simulating a user and playing a conversation back and forth with it.

One of the key challenges there was that we had to make it compatible with all the 273+ AI frameworks (and counting) there are. Luckliy AG-UI protocol popped up recently, standardizing agents frameworks and UI interactions, this is perfect, because at the end of the day, we want our user simulator to "see" just the same that the user sees.

So we made Scenario in a way that is really easy to connect to any agent no matter the tech stack, from a simple string <-> string connection, to openai standard messages format, to AG-UI.

The other key challenge was to balance testing the open-endedness of agents vs having reliable cases you want to test, so we worked a lot on thinking through the autopilot simulation vs the fully scripted one, and here again, the goal was complete interoperability. At the end of the day, the design we achieved was simply having lambdas, that you can call at any point of the test, so it's just code, where you can connect any other evaluation or assertion tool you want, we are not restrictive.

Check out the repo and the docs, we would love to get some feedback in here!

Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/

rchaves commented on Show HN: Rowboat – Open-source IDE for multi-agent systems github.com/rowboatlabs/ro... · Posted by u/segmenta

nurettin · 10 months ago

The sentence should read;

"It is becoming clear that agentic systems which run a prompt per work node is becoming a curiosity so we should hype it as the correct solution in order to make a buck despite all the efforts that have been spent trying to one-shot complex problems."

rchaves · 10 months ago

well I think hype is not bad per se, I'd do it even if not trying to make a buck, it's okay (up to a point) to hype up something so that eventually it finds a problem where it fits well, but yeah, I'm still waiting on this one

rchaves commented on Show HN: Rowboat – Open-source IDE for multi-agent systems github.com/rowboatlabs/ro... · Posted by u/segmenta

simonw · 10 months ago

"It’s becoming clear that real-world agentic systems work best when multiple agents collaborate, rather than having one agent attempt to do everything."

I'll be honest: I don't buy that premise (yet). It's clearly a popular idea and I see a lot of excitement about it (see Google's A2A thing) but it feels to me like a pattern that, in many cases, will make LLM code even harder to get reliable results from.

I worry it's the AI-equivalent of microservices: useful in a small set of hyper complex systems, the vast majority of applications that adopt it would have been better off without.

If there are strong arguments counter to what I've said here I'd love to hear them!

rchaves · 10 months ago

same here, but I would even avoid "strong arguments" because that's what we all have been doing so far

what I want is real use cases, show me real-world production examples from established companies where multi-agent collaboration helped them better than a simple agent + tools and deterministic workflows

rchaves commented on Show HN: Rowboat – Open-source IDE for multi-agent systems github.com/rowboatlabs/ro... · Posted by u/segmenta

danenania · 10 months ago

A few concrete examples of multi-agent collaboration being useful in my project Plandex[1]:

- While it uses Sonnet 3.7 by default for creating the edit snippet when writing code, calls related to applying the snippet and validating the result (and falling back to a whole file write if needed) use o3-mini (soon to be o4-mini) which is 1/3 the cost, much faster, and actually more accurate and reliable than Sonnet for this particular narrow task.

- If Sonnet 3.7's context limit is exceeded in the planning stages, it can switch to a Gemini model for planning, then go back to Sonnet again for the implementation steps (since these only need the files relevant to each step).

- It eagerly summarizes the conversation after each response so that the summary can be used later if the conversation gets too long. This is only practical because much smaller models than the main planning/coding models are sufficient for a good summary. Otherwise it would be way too expensive.

It's definitely more complex, but I think in these cases at least, there's a real payoff for the trouble.

1 - https://github.com/plandex-ai/plandex

rchaves · 10 months ago

is this multi-agent collaboration though, or is it just a workflow? All examples you listed seem to have pretty deterministic control flows (write then validade, context exceeded, after each response, etc)

when I think of multi-agent collaboration I think of also the control flow and handover to be defined by the agents themselves, this is the thing I have yet to see examples of in production, and the premise that I also don't buy yet

rchaves commented on GPT-5 is behind schedule wsj.com/tech/ai/openai-gp... · Posted by u/owenthejumper

rchaves · a year ago

Nah it's just a marketing problem, "GPT" and "ChatGPT" names is the biggest asset OpenAI has, people have expectations so high for GPT-5 that they cannot burn this name unless it's something truly majestic, bordering AGI at the very least. Until they are confident enough that people will be blown off by it, it's better to continue building up the hype