nyellin commented on Benchmarking OpenTelemetry: Can AI trace your failed login?   quesma.com/blog/introduci... · Posted by u/stared
smithclay · 11 days ago
We need more rigorous benchmarks for SRE tasks, which is much easier said than done.

The only other benchmark I've come across is https://sreben.ch/ ... certainly there must be others by now?

nyellin · 11 days ago
We publish the benchmarks for HolmesGPT (CNCF sandbox project) at https://holmesgpt.dev/development/evaluations/
nyellin commented on Benchmarking OpenTelemetry: Can AI trace your failed login?   quesma.com/blog/introduci... · Posted by u/stared
nyellin · 11 days ago
HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers, including Fortune 500 companies using SRE agents in incredibly complex production environments.

We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet, but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations, Sonnet sometimes even wins.

[1] https://holmesgpt.dev/development/evaluations/history/

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
terminalshort · a month ago
For one thing, it seems to be splitting up the work and making some determination of complexity, then allocating it to a model based on that complexity to save resources. When I run Claude with Opus 4.5 and run /cost, I see tokens for Opus 4.5, but also a lot for Sonnet and Haiku, with the majority of tokens actually being used by Haiku.
nyellin · a month ago
Haiku is called often, but not always the way you think. E.g. every time you write something, CC invokes Haiku multiple times to generate the 'delightful 1-2 word phrase used to indicate progress to the user' (Doing Stuff, Wizarding, etc.)
nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
prodigycorp · a month ago
I'm not a good representative for Claude Code because I'm primarily a Codex user now, but I know that if Codex had subagents it would be at least twice as productive. Time spent is an important aspect of performance, so yup, the complexity improved performance.
nyellin · a month ago
Not necessarily true. Subagents allow for parallelization, but they can decrease accuracy dramatically if you're not careful, because there are often dependencies between tasks and swapping context windows with a summary is extremely lossy.

For the longest time, Claude Code itself didn't really use subagents much by default, other than supporting them as a feature eager users could configure. (Source: reverse engineering we did on Claude Code using the fantastic CC tracing tool Simon Willison wrote about once. This is no longer true on the latest versions, which have e.g. an Explore subagent that is actively used.)
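
A toy sketch of why that hand-off is lossy (hypothetical helpers, not Claude Code internals; run_llm is a stub that takes a prompt string and returns the model's text reply):

```python
# Toy illustration of why subagents can hurt accuracy: each subagent works in
# its own context window, and the parent only ever sees a lossy summary.

def run_subagent(task: str, run_llm) -> str:
    """Run an isolated agent loop and hand back ONLY a summary of what it did."""
    transcript = [f"task: {task}"]
    # ... tool calls, intermediate findings, and dead ends accumulate here ...
    transcript.append(run_llm(f"Work on: {task}"))
    # Everything above is discarded; the parent gets one compressed paragraph.
    return run_llm("Summarize the result briefly:\n" + "\n".join(transcript))

def parent_agent(tasks: list[str], run_llm) -> str:
    # Tasks often depend on each other, but each subagent only receives
    # summaries of earlier work, so details it needs can silently go missing.
    summaries = []
    for task in tasks:
        prior = "\n".join(summaries)
        summaries.append(run_subagent(f"{task}\n\nPrior findings:\n{prior}", run_llm))
    return run_llm("Combine these findings into an answer:\n" + "\n".join(summaries))
```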

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
hazrmard · a month ago
This reflects my experience. Yet, I feel that getting reliability out of LLM calls with a while-loop harness is elusive.

For example

- how can I reliably have a decision block to end the loop (or keep it running)?

- how can I reliably call tools with the right schema?

- how can I reliably summarize context / excise noise from the conversation?

Perhaps, as the models get better, they'll approach some threshold where my worries just go away. However, I can't quantify that threshold myself and that leaves a cloud of uncertainty hanging over any agentic loops I build.

Perhaps I should accept that it's a feature and not a bug? :)

nyellin · a month ago
Forgot to address the easiest part:

> - how can I reliably call tools with the right schema?

This is typically done by enabling strict mode for tool calling, which is a hermetic solution: it makes the LLM unable to generate tokens that would violate the schema (i.e. the LLM samples only from the subset of tokens that lead to valid schema output).
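
A minimal sketch of what that looks like with the OpenAI Python SDK as one example (the tool name and schema here are made up for illustration, not taken from HolmesGPT):

```python
# Minimal sketch of strict tool calling with the OpenAI Python SDK.
# The tool name and schema are illustrative only.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_pod_logs",
        "description": "Fetch logs for a Kubernetes pod",
        "strict": True,  # constrain sampling so arguments always match the schema
        "parameters": {
            "type": "object",
            "properties": {
                "namespace": {"type": "string"},
                "pod_name": {"type": "string"},
            },
            "required": ["namespace", "pod_name"],
            "additionalProperties": False,  # required by strict mode
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Why is the checkout pod crashing?"}],
    tools=tools,
)

# With strict mode, any tool-call arguments are guaranteed to parse and match the schema.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```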

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover

nyellin · a month ago
Re (1): use a TODOs system like Claude Code.

Re (2): also fairly easy! It's just a summarization prompt. E.g. this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b...

Or just use the Claude Code SDK, which does all of this for you! (You can also use various provider-specific features for 2, like automatic compaction on OpenAI's Responses endpoint.)
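
For reference, a rough sketch of what such a compaction step can look like (the threshold and prompt wording here are invented, not the linked HolmesGPT prompt; run_llm is a stub that takes a message list and returns the model's text reply):

```python
# Rough sketch of context compaction: when the transcript gets too long,
# replace older turns with an LLM-written summary and keep recent turns verbatim.

SUMMARIZE_PROMPT = (
    "Summarize the conversation so far, preserving the task, key findings, "
    "open questions, and any file paths or commands that were mentioned."
)

def maybe_compact(messages: list[dict], run_llm, max_messages: int = 40) -> list[dict]:
    if len(messages) <= max_messages:
        return messages
    head, recent = messages[:-10], messages[-10:]   # keep the last few turns as-is
    summary = run_llm([{"role": "system", "content": SUMMARIZE_PROMPT}, *head])
    return [{"role": "system", "content": f"Summary of earlier conversation:\n{summary}"}, *recent]
```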

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
nyellin · a month ago
There's a bit more to it!

For example, the agent in the post will demonstrate 'early stopping', where it finishes before the task is really done. You'd think you could solve this with reasoning models, but it doesn't actually work on SOTA models.

To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM what tasks remain open. (If you're curious, somewhere in the public repo for HolmesGPT we have benchmarks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)
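
A minimal sketch of the TODO-injection idea (simplified, with hypothetical run_llm / parse_todos helpers; not the actual Claude Code or HolmesGPT implementation):

```python
# Minimal sketch of TODO injection to prevent early stopping.
# run_llm: takes a message list, returns the model's text reply (stub).
# parse_todos: reads the reply and returns the updated open-task list (stub).

def agent_loop(goal: str, run_llm, parse_todos, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": goal}]
    todos = [goal]                                   # open tasks

    for _ in range(max_steps):
        # Re-inject the open TODO list into every turn so the model is
        # reminded of unfinished work and can't declare victory early.
        reminder = "Open TODOs:\n" + "\n".join(f"- {t}" for t in todos)
        reply = run_llm(messages + [{"role": "system", "content": reminder}])
        messages.append({"role": "assistant", "content": reply})

        todos = parse_todos(reply, todos)            # model marks items done or adds new ones
        if not todos:
            return reply                             # nothing left open: actually finished
        messages.append({"role": "user", "content": "Tasks remain open: " + ", ".join(todos)})

    raise RuntimeError("hit step limit with open TODOs")
```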

Still, good article. Agents really are just tools in a loop. It's not rocket science.

nyellin commented on Meta Uses LLMs to Improve Incident Response   tryparity.com/blog/how-me... · Posted by u/wilson090
nyellin · a year ago
I know there are already a number of comments here about proprietary solutions.

If you're looking for something open source: https://github.com/robusta-dev/holmesgpt/

u/nyellin

Karma: 3552 · Cake day: December 22, 2010
About
Founder at robusta.dev

CNCF SRE Agent, adopted by Microsoft and many more

natan@robusta.dev
