If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
(Submitted title was "OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)")
For context: I felt the original title wasn't landing (the first time I posted it, it got a few upvotes but not more). At the same time, I shouldn't have editorialized it, as it's a slippery slope from "just a slightly better title", through optimization, to clickbait.
Thank you dang for keeping the spirit and quality of HN.
From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code!
Some of the instructions don't give any guidance on how to do it; some specify which libraries to use.
"Use standard OTEL patterns" ... that's about as useful as saying "go write some code". There are a lot of ways to do instrumentation....
I'd be very curious HOW exactly the models fail.
Are the test sets just incredibly specific about what output they expect, so you get a lot of failures from tiny, subtle mismatches? Or do they get the instrumentation categorically wrong?
Also important: do the models have access to a web search tool to read the library docs?
Otel libraries are often complicated to use... without reading latest docs or source code this would be quite tricky.
Some models have gotten better at adding dependencies, installing them and then reading the code from the respective directory where dependencies get stored, but many don't do well with this.
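For reference, this is roughly what a minimal manual setup looks like with the Python SDK, and it's easy to see how a model working from a stale memory of these packages gets it subtly wrong. The packages and calls below are the standard OTel ones as far as I know; the service and span names are made up, and whether this is the shape the benchmark actually grades for is exactly the open question:

    # Manual setup: tracer provider + OTLP exporter. With no endpoint argument,
    # the exporter falls back to the standard OTEL_EXPORTER_OTLP_ENDPOINT env var.
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)

    def process_order(order_id: str) -> None:
        # Span and attribute names are illustrative, not prescribed anywhere.
        with tracer.start_as_current_span("process_order") as span:
            span.set_attribute("app.order.id", order_id)
            ...  # business logic

And that's before auto-instrumentation packages, which are a different (and arguably more idiomatic) way to get much of the same.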
All in all, I'm quite skeptical that this is useful as a benchmark as it stands.
I'd be much more interested in tasks like:
Here are trace/log outputs, here is the source code, find and fix the bug.
+1. Tasks like "Add OTel instrumentation" arguably belong more in a coding bench than an SRE bench. I came here expecting to see something like: here's how models perform at finding the root cause in 50 complicated microservice failure scenarios.
For AI-SRE tasks like finding the root cause of bugs and errors, I believe the key is to give the agent tools to query metrics, logs, and traces and understand the problem. I'm working on a similar OSS framework and benchmark (work in progress, using metrics and logs; demo: https://youtube.com/playlist?list=PLKWJ03cHcPr3Od1rwL7ErHW1p...). The context layer is semantics plus Text2SQL to query the right metrics and logs, and the benchmark is a set of Skills that Claude Code or other agents can run with these tools to find the root cause of errors:
Codd Semantic/Text2SQL engine: https://github.com/sathish316/codd_query_engine
PreCogs skills and simulated scenarios: https://github.com/sathish316/precogs_sre_oncall_skills
I'm surprised by how many people think that an SRE's job is to debug.
An SRE's job is to make the software reliable: for instance, by adding telemetry, and by understanding and improving the failure modes, the behavior under load, etc.
So a better SRE test would not be "read the logs and fix the bug", but rather "read the code and identify potential issues".
>Some of the instructions don't give any guidance how to do it, some specify which libraries to use.
Supporting a piece of cloud software with a lot of microservices, I think this is a more general problem for humans too. The app I work on mandated a few logging requirements, like which library to use, but that was it; different parts built by different teams ended up with all kinds of different behaviors.
As for the AI side, this is something where I see our limited context sizes causing issues when developing architecture across multiple products.
This is definitely not a context problem. Very simple things like checking for running processes and killing the correct one are something a model like Opus 4.5 can't do consistently correctly, instead of recognizing that it needs to systematize that sort of thing -- one and done. Like, probably 50% of the time it kills the wrong thing. About 25% of the time after that it recognizes that it didn't kill the correct thing, rewrites the ps or lsof from scratch, and has the problem again. Then if I kill the process myself out of frustration, it checks whether the process is running, sees that it's not, gets confused, and sets its new task to rewriting the ps or lsof... again. It does the same thing with tests, where it decides, without any doubt in its rock brain, to just delete the test and replace it with a print statement.
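To make "systematize that sort of thing, one and done" concrete, the kind of helper I'd want it to write once and reuse looks roughly like this (hypothetical sketch using psutil; the function name is mine, not the model's):

    # Hypothetical "write it once" helper: find processes whose command line
    # matches a pattern, excluding ourselves, and terminate only those.
    import os
    import psutil

    def kill_matching(pattern: str) -> list[int]:
        killed = []
        for proc in psutil.process_iter(["pid", "cmdline"]):
            try:
                cmdline = " ".join(proc.info["cmdline"] or [])
                if pattern in cmdline and proc.pid != os.getpid():
                    proc.terminate()          # polite SIGTERM first
                    proc.wait(timeout=5)      # confirm it actually exited
                    killed.append(proc.pid)
            except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.TimeoutExpired):
                continue
        return killed

Nothing fancy; the point is to stop eyeballing raw ps output from scratch every single time.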
Context size isn't the issue. You couldn't effectively leverage an infinite context even if you had one. The general solution is to recursively decompose the problem into smaller ones and solve them independently of each other, returning the results back up the stack. Recursion is the key here. A bunch of parallel agents on separate call stacks that don't block on their logical callees is a slop factory.
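A toy sketch of that recursive shape (hypothetical; `llm.complete` stands in for whatever model call you actually use):

    # Each recursive call works in its own fresh context; only a condensed
    # result flows back up to the caller, never the child's full context.
    from dataclasses import dataclass

    @dataclass
    class Task:
        description: str
        subtasks: list["Task"]

    def solve(task: Task, llm) -> str:
        if not task.subtasks:
            return llm.complete(task.description)             # leaf: solve directly
        results = [solve(sub, llm) for sub in task.subtasks]  # depth-first, blocking
        summary = "\n".join(results)
        return llm.complete(f"{task.description}\n\nSub-results:\n{summary}")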
People say to say things like "Use best practices" in your prompts all the time, and chide people who don't.
Are these the same people who say it doesn't work well? I've been experimenting with writing down what I actually mean by that (with the help of an LLM, funny enough), and it seems to be giving me much better code than the typical AI soup. E.g.:
- functional core, imperative shell. prefer pure helpers.
- avoid methods when a standalone function suffices
- use typed errors. avoid stringly errors.
- when writing functions, create a "spine" for orchestration
- spine rules: one dominant narrative, one concept per line, named values.
- orchestration states what happens and in what order
- implementation handles branching, retries, parsing, loops, concurrency, etc.
- apply recursively: each function stays at one abstraction level
- names describe why something exists, not how it is computed
etc.
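To make that concrete, here is roughly what I mean by the "spine" and typed-error rules (a sketch in Python; the function and error names are made up for illustration):

    # Spine: one dominant narrative, one concept per line, named values.
    # Branching, parsing and validation details live in the helpers below it.
    import json

    class ConfigError(Exception):
        """Typed error instead of a stringly one."""
        def __init__(self, path: str, reason: str) -> None:
            super().__init__(f"{path}: {reason}")
            self.path = path
            self.reason = reason

    def load_service_config(path: str) -> dict:
        raw_text = read_config_file(path)           # imperative shell: I/O
        parsed = parse_config(raw_text, path)       # functional core: pure
        validated = validate_required_keys(parsed, path)
        return validated

    def read_config_file(path: str) -> str:
        with open(path, encoding="utf-8") as f:
            return f.read()

    def parse_config(raw_text: str, path: str) -> dict:
        try:
            return json.loads(raw_text)
        except json.JSONDecodeError as e:
            raise ConfigError(path, f"invalid JSON: {e}") from e

    def validate_required_keys(parsed: dict, path: str) -> dict:
        for key in ("service_name", "otlp_endpoint"):
            if key not in parsed:
                raise ConfigError(path, f"missing key {key!r}")
        return parsed

Each function stays at one abstraction level: the spine reads as the narrative, the helpers handle the mess.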
This is no different from writing a style guide for your team/org. You don't just say "write clean code" and expect that you'll get something you like.
I hate that it's true, but things like this make the outputs night-and-day different for me. It's the difference between, e.g., a model writing appropriate test harnesses or pushing back on requirements, versus writing the most absolutely horrible code and test/dependency injection I've ever seen in pursuit of the listed goals.
Similar to adjacent commenters, I've tried to get better at enumerating what I consider best practice, but I couldn't argue in good faith that instructions like these produce no noticeable improvement.
(As with all things AI, it could all be perception on my end, so YMMV; I wish there were a better way to concretely evaluate the effects of different rule sets / instructions / ... on outcomes.)
Like with robotaxis: OK, the thing is not perfect, but how does it compare to a human? I'm interviewing Ops / SRE candidates at the moment, and I'm not so happy with what I see...
If you're interviewing Ops, don't expect them to know anything about OTEL. Ops is about platforms, systems, and the operations surrounding and supporting the application.
Integrating OTEL into an application stack requires explicit knowledge of the code - that's the developers.
Original title: Benchmarking OpenTelemetry: Can AI trace your failed login?
HN Editorialized: OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
The task:
> Your task is: Add OTEL tracing to all microservices.
> Requirements:
> Instrumentation should match conventions and well-known good practices.
> Instrumentation must match the business domain of the microservices.
> Traces must be sent to the endpoint defined by a standard OTEL environment variable.
> Use the recent version of the OTEL SDK.
I really don't think anything involving multiple microservices can be called 'simple', even for humans. Perhaps it is for an expert who knows the specific business's domain.
As someone whose job is support more than SWE, I agree with this.
I've had to work in systems where events didn't share correlation IDs, and I had to filter entries down to microsecond windows to get a small enough set that I could trace what actually happened between a group of services.
From what I've seen on the enterprise software side of the world, a lot of companies are particularly bad at SRE, and there isn't a great amount of standardization.
Enterprise app observability is purely the responsibility of each individual application/project manager. There is virtually no standardization or even shared infra; a team just stuffing plaintext logs into an unconfigured Elasticsearch instance is probably above the median already. There is no visibility into anything across departments, and more often than not, not even across apps within a department.
Having done app support across many environments: um, yes, multiple microservices is usually pretty simple. Just look at the open file/network handles and go from there. It's absolutely maddening to watch these models flail at something as basic as "check if the port is open" or "check if the process is running... and don't kill Firefox this time".
These aren't challenging things for an experienced human at all, but they're a huge pain point for these models. It's hard for me to wrap my head around how these models can write surprisingly excellent code but fall down on these relatively simple troubleshooting paths.
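For reference, the kind of check I mean is this small (sketch; psutil is one option, lsof/ss from the shell work just as well):

    # Basic troubleshooting primitives the models keep fumbling: is the port
    # open, and which process is listening on it?
    import socket
    import psutil

    def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def pid_listening_on(port: int) -> int | None:
        for conn in psutil.net_connections(kind="tcp"):
            if conn.laddr and conn.laddr.port == port and conn.status == psutil.CONN_LISTEN:
                return conn.pid
        return None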
There isn't much posted in the way of "bash history and terminal output of successful sysadminning" on the web.
I would wager the main reason for this is the same reason it's also hard to teach these skills to people: there's not a lot of high-quality training for distributed debugging of complex production issues. Competence comes from years of experience fighting fires.
Very few people start their careers as SREs, it’s generally something they migrate into after enjoying it and showing aptitude for it.
With that said, I wouldn't expect this wall to hold up for too long. There has been a lot of low-hanging fruit in teaching models how to code. When that is saturated, the frontier companies will likely turn their attention to honing training environments for SRE-style debugging.
There is definitely more to models' inability to perform well at SRE than that. For one, it is not engineering; it is next-token prediction, it is vibes. They could call it Site Reliability Vibing or something like that.
When we ask it to generate an image, any image will do. We couldn't care less. Try to sculpt it, try to rotate it 45 degrees, and all hell breaks loose. The image gets rotated, but the hair color might change as well. Pure vibes!
When you ask it to refactor your code, any pattern will do. You could rearrange the code in infinite ways and rename variables in infinite ways without fundamentally breaking the logic. You could make as many arbitrary bullshit abstractions as you like and call it good, as people have done for years with OOP. It does not matter at all; any result will do in these cases.
When you want to hit a specific gRPC endpoint, you need a specific address, and the method expects a specific contract to be honored. It either matches or it doesn't.
When you want the LLM to implement a solution that captures specific syscalls from specific hosts and sends traces to a specific platform, using a specific protocol, consolidating records in a specific bucket... you have one state that satisfies your needs and 100 requirements that all have to be fulfilled. It either meets all the requirements or it's no good.
That truly is different from vibing, and LLMs on their own will never be able to do it. Maybe agents will, depending on the harnesses and the systems in place, but a bare model just generates words, words, words, with no regard for anything else.
> I would wager the main reason for this is the same reason it’s also hard to teach these skills to people: there’s not a lot of high quality training for distributed debugging of complex production issues. Competence comes from years of experience fighting fires.
The search space for a cause beyond a certain size can also be big. Very big.
Like, at work we're at the beginning of where the power law starts going nuts: somewhere around 700-1000 services in production, across several datacenters, with a few dozen infrastructure clusters behind them. For each bug, if you looked into it, there'd probably be 20-30 changes, 10-20 anomalies, and 5 weird things someone noticed in the 30 minutes around it.
People already struggle to triage the relevance of everything in this context. That's somewhere I can see AI start helping, and there were some talks about Meta doing just that: ranking changes and anomalies in order of relevance to a bug ticket so people don't chase the wrong things.
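I have no idea how Meta actually does it, but just to make the idea concrete, even a crude relevance score over the candidates would go a long way (hypothetical sketch; the scoring is invented):

    # Hypothetical illustration of "rank changes/anomalies by relevance":
    # score each candidate by time proximity to the incident and by whether
    # it touches a service on the failing request path.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Candidate:
        description: str
        service: str
        timestamp: datetime

    def rank_candidates(candidates: list[Candidate],
                        incident_start: datetime,
                        affected_services: set[str]) -> list[Candidate]:
        def score(c: Candidate) -> float:
            minutes_away = abs((c.timestamp - incident_start).total_seconds()) / 60
            recency = 1.0 / (1.0 + minutes_away)           # closer in time -> higher
            on_path = 1.0 if c.service in affected_services else 0.2
            return recency * on_path
        return sorted(candidates, key=score, reverse=True)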
That is, however, just the reactive part of Ops and SRE work. The proactive part is much harder and oftentimes not technical. What if most negatively rated support cases disappear into a dark hole in a certain service, but the responsible team never allocates time to improve monitoring, because sales is on their butt for features? LLMs can maybe identify this, or help the team implement the tracing faster, but those 10 minutes could also be spent on features for money.
And what AI model told you to collect the metrics about support cases and resolution to even have that question?
AI works better as a tool for teaching humans than as a way to do the work itself.
While someone experienced in fighting fires can take intuitive leaps, the basic idea is still to synthesize a hypothesis from signals, validate the hypothesis, and come up with mitigations and longer-term fixes. This is a learned skill, and a team of people/AI will work better than someone working solo.
https://hazelweakly.me/blog/stop-building-ai-tools-backwards...
> With that said, I wouldn’t expect this wall to hold up for too long.
The models are already so good at the traditionally hard stuff: collecting that insane amount of detailed knowledge across so many different domains, languages, and software stacks.
I wouldn't touch this with a pole if our MTTR depended on it being successful, though.
Speaking as someone who has been doing this job for a while, I can say it's starting to be useful in many SRE-related areas and makes parts of the job easier.
MCP servers for monitoring tools are making our developers more competent at finding metrics and issues.
It'll get there, but nobody is going to type "fix my incident" in production and have a nice time today, outside of the simplest things, which, if they can be fixed like that, could have been automated already anyway. But going from writing a runbook to automating it sometimes takes a while, so those use cases will grow.
HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers, including Fortune 500 companies using SRE agents in incredibly complex production environments.
We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet, but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations, Sonnet even wins sometimes.
[1] https://holmesgpt.dev/development/evaluations/history/
We've been experimenting with combining durable execution with debugging tasks, and it's working incredibly well. With the added context of actual execution data, where the developer defines which functions are important (instead of tracking individual calls), it gives the LLM the data it needs.
I know there are AI SRE companies that have discovered the same -- that you can't just throw a bunch of data at a regular LLM and have it "do SRE things". It needs more structured context, and their value add is knowing what context and what structure is necessary.
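I'm not affiliated with any of them, but my mental model of that "structured context" looks something like this (purely hypothetical sketch, not anyone's actual product): the developer marks which functions matter, their inputs and outputs get recorded, and that record, rather than a raw pile of telemetry, is what the LLM sees.

    # Hypothetical "developer-defined important functions": a decorator records
    # inputs/outputs of only the marked functions, and that trace is the
    # structured context handed to the LLM.
    import functools
    import json

    EXECUTION_LOG: list[dict] = []

    def important(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                result = fn(*args, **kwargs)
                EXECUTION_LOG.append({"fn": fn.__name__, "args": repr(args),
                                      "kwargs": repr(kwargs), "result": repr(result)})
                return result
            except Exception as e:
                EXECUTION_LOG.append({"fn": fn.__name__, "args": repr(args),
                                      "kwargs": repr(kwargs), "error": repr(e)})
                raise
        return wrapper

    @important
    def charge_card(order_id: str, amount_cents: int) -> str:
        # business logic elided; only this level of the call tree is recorded
        return "txn-placeholder"

    def context_for_llm() -> str:
        return json.dumps(EXECUTION_LOG, indent=2)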
The 29% score tells us more about benchmark design than model capability IMO.
These benchmarks conflate two very different problems: (1) understanding what needs to be done, and (2) correctly implementing it in a specific library ecosystem.
A human SRE who's never touched OTel would also struggle initially - not because they can't reason about traces, but because the library APIs have quirks that take time to learn.
The more interesting question is whether giving the model access to relevant docs/examples during the task significantly changes the scores. If it does, that suggests the bottleneck is recall not reasoning. If it doesn't, the reasoning gap is real.
FWIW I've found that models do much better on ops tasks when you can give them concrete examples of working instrumentation in the same codebase rather than asking them to generate from scratch.