nyellin commented on Benchmarking OpenTelemetry: Can AI trace your failed login?   quesma.com/blog/introduci... · Posted by u/stared
smithclay · 11 days ago
We need more rigorous benchmarks for SRE tasks, which is much easier said than done.

The only other benchmark I've come across is https://sreben.ch/ ... certainly there must be others by now?

nyellin · 11 days ago
We publish the benchmarks for HolmesGPT (CNCF sandbox project) at https://holmesgpt.dev/development/evaluations/
nyellin commented on Benchmarking OpenTelemetry: Can AI trace your failed login?   quesma.com/blog/introduci... · Posted by u/stared
nyellin · 11 days ago
HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers, including Fortune 500 companies using SRE agents in incredibly complex production environments.

We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet, but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations, Sonnet sometimes even wins.

[1] https://holmesgpt.dev/development/evaluations/history/

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
terminalshort · a month ago
For one thing, it seems to be splitting up the work and making some determination of complexity, then allocating it to a model based on that complexity to save resources. When I run Claude with Opus 4.5 and run /cost, I see tokens for Opus 4.5, but also a lot for Sonnet and Haiku, with the majority of tokens actually being used by Haiku.
nyellin · a month ago
Haiku is called often, but not always the way you think. E.g. every time you write something, CC invokes Haiku multiple times to generate the 'delightful 1-2 word phrase used to indicate progress to the user' (Doing Stuff, Wizarding, etc.)
nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
prodigycorp · a month ago
I'm not a good representative for Claude Code because I'm primarily a Codex user now, but I know that if Codex had subagents it would be at least twice as productive. Time spent is an important aspect of performance, so yup, the complexity improved performance.
nyellin · a month ago
Not necessarily true. Subagents allow for parallelization, but they can decrease accuracy dramatically if you're not careful, because there are often dependencies between tasks and swapping context windows with a summary is extremely lossy.

For the longest time, Claude Code itself didn't really use subagents much by default, other than supporting them as a feature eager users could configure. (Source: reverse engineering we did on Claude Code using the fantastic CC tracing tool Simon Willison wrote about once. This is no longer true on the latest versions, which have e.g. an Explore subagent that is actively used.)
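
A toy sketch of why that hand-off is lossy (hypothetical helpers, not Claude Code internals; run_llm is a stub that takes a prompt string and returns the model's text reply):

```python
# Toy illustration of why subagents can hurt accuracy: each subagent works in
# its own context window, and the parent only ever sees a lossy summary.

def run_subagent(task: str, run_llm) -> str:
    """Run an isolated agent loop and hand back ONLY a summary of what it did."""
    transcript = [f"task: {task}"]
    # ... tool calls, intermediate findings, and dead ends accumulate here ...
    transcript.append(run_llm(f"Work on: {task}"))
    # Everything above is discarded; the parent gets one compressed paragraph.
    return run_llm("Summarize the result briefly:\n" + "\n".join(transcript))

def parent_agent(tasks: list[str], run_llm) -> str:
    # Tasks often depend on each other, but each subagent only receives
    # summaries of earlier work, so details it needs can silently go missing.
    summaries = []
    for task in tasks:
        prior = "\n".join(summaries)
        summaries.append(run_subagent(f"{task}\n\nPrior findings:\n{prior}", run_llm))
    return run_llm("Combine these findings into an answer:\n" + "\n".join(summaries))
```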

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
hazrmard · a month ago
This reflects my experience. Yet, I feel that getting reliability out of LLM calls with a while-loop harness is elusive.

For example

- how can I reliably have a decision block to end the loop (or keep it running)?

- how can I reliably call tools with the right schema?

- how can I reliably summarize context / excise noise from the conversation?

Perhaps, as the models get better, they'll approach some threshold where my worries just go away. However, I can't quantify that threshold myself and that leaves a cloud of uncertainty hanging over any agentic loops I build.

Perhaps I should accept that it's a feature and not a bug? :)

nyellin · a month ago
Forgot to address the easiest part:

> - how can I reliably call tools with the right schema?

This is typically done by enabling strict mode for tool calling, which is a hermetic solution: it makes the LLM unable to generate tokens that would violate the schema (i.e. the LLM samples only from the subset of tokens that lead to valid schema output).
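
A minimal sketch of what that looks like with the OpenAI Python SDK as one example (the tool name and schema here are made up for illustration, not taken from HolmesGPT):

```python
# Minimal sketch of strict tool calling with the OpenAI Python SDK.
# The tool name and schema are illustrative only.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_pod_logs",
        "description": "Fetch logs for a Kubernetes pod",
        "strict": True,  # constrain sampling so arguments always match the schema
        "parameters": {
            "type": "object",
            "properties": {
                "namespace": {"type": "string"},
                "pod_name": {"type": "string"},
            },
            "required": ["namespace", "pod_name"],
            "additionalProperties": False,  # required by strict mode
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Why is the checkout pod crashing?"}],
    tools=tools,
)

# With strict mode, any tool-call arguments are guaranteed to parse and match the schema.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```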

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover

nyellin · a month ago
Re (1): use a TODOs system like Claude Code.

Re (2): also fairly easy! It's just a summarization prompt. E.g. this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b...

Or just use the Claude Code SDK, which does all of this for you! (You can also use various provider-specific features for 2, like automatic compaction on OpenAI's Responses endpoint.)
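
For reference, a rough sketch of what such a compaction step can look like (the threshold and prompt wording here are invented, not the linked HolmesGPT prompt; run_llm is a stub that takes a message list and returns the model's text reply):

```python
# Rough sketch of context compaction: when the transcript gets too long,
# replace older turns with an LLM-written summary and keep recent turns verbatim.

SUMMARIZE_PROMPT = (
    "Summarize the conversation so far, preserving the task, key findings, "
    "open questions, and any file paths or commands that were mentioned."
)

def maybe_compact(messages: list[dict], run_llm, max_messages: int = 40) -> list[dict]:
    if len(messages) <= max_messages:
        return messages
    head, recent = messages[:-10], messages[-10:]   # keep the last few turns as-is
    summary = run_llm([{"role": "system", "content": SUMMARIZE_PROMPT}, *head])
    return [{"role": "system", "content": f"Summary of earlier conversation:\n{summary}"}, *recent]
```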

nyellin commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
nyellin · a month ago
There's a bit more to it!

For example, the agent in the post will demonstrate 'early stopping', where it finishes before the task is really done. You'd think you could solve this with reasoning models, but it doesn't actually work on SOTA models.

To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM what tasks remain open. (If you're curious, somewhere in the public repo for HolmesGPT we have benchmarks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)
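
A minimal sketch of the TODO-injection idea (simplified, with hypothetical run_llm / parse_todos helpers; not the actual Claude Code or HolmesGPT implementation):

```python
# Minimal sketch of TODO injection to prevent early stopping.
# run_llm: takes a message list, returns the model's text reply (stub).
# parse_todos: reads the reply and returns the updated open-task list (stub).

def agent_loop(goal: str, run_llm, parse_todos, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": goal}]
    todos = [goal]                                   # open tasks

    for _ in range(max_steps):
        # Re-inject the open TODO list into every turn so the model is
        # reminded of unfinished work and can't declare victory early.
        reminder = "Open TODOs:\n" + "\n".join(f"- {t}" for t in todos)
        reply = run_llm(messages + [{"role": "system", "content": reminder}])
        messages.append({"role": "assistant", "content": reply})

        todos = parse_todos(reply, todos)            # model marks items done or adds new ones
        if not todos:
            return reply                             # nothing left open: actually finished
        messages.append({"role": "user", "content": "Tasks remain open: " + ", ".join(todos)})

    raise RuntimeError("hit step limit with open TODOs")
```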

Still, good article. Agents really are just tools in a loop. It's not rocket science.

nyellin commented on Meta Uses LLMs to Improve Incident Response   tryparity.com/blog/how-me... · Posted by u/wilson090
nyellin · a year ago
I know there are already a number of comments here about proprietary solutions.

If you're looking for something open source: https://github.com/robusta-dev/holmesgpt/

u/nyellin

Karma: 3552 · Cake day: December 22, 2010
About
Founder at robusta.dev

CNCF SRE Agent, adopted by Microsoft and many more

natan@robusta.dev
