I share a lot of this sentiment, although I struggle more with the setup and maintenance than the diagnosis.
It's baffling to me that it can still take _so_much_work_ to set up a good baseline of observability (not to mention the time we spend on tweaking alerting). I recently spent an inordinate amount of time trying to make sense of our telemetry setup and fill in the gaps. It took weeks. We had data in many systems, many different instrumentation frameworks (all stepping on each other), noisy alerts, etc.
Part of my problem is that the ecosystem is big. There's too much to learn:
OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I don't know, maybe I'm biased by the JVM-heavy systems I've been working in.
I worked for New Relic for years, and even inside an observability company, observability was still a lot of work to maintain; even then, traces were not heavily used.
I can definitely imagine having Claude debug an issue faster than I can type and click around dashboards and query UIs. That sounds fun.
I completely agree w/ your points about why observability sucks:
- Too much setup
- Too much maintenance
- Too steep of a learning curve
This isn't the whole picture, but it's a huge part of the picture. IMO, observability shouldn't be so complex that it warrants specialized experience; it should be something that any junior product engineer can do on their own.
> Part of my problem is that the ecosystem is big. There's too much to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I don't know, maybe I'm biased by the JVM-heavy systems I've been working in.
We've had success keeping things simple with the VictoriaMetrics stack, avoiding what we perceive as unnecessary complexity in some of the fancier tools/standards.
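To be concrete about what "simple" can look like, here's a minimal sketch; the import endpoint and default port come from the VictoriaMetrics docs, while the host, metric name, and label are made up:

```python
# Sketch: push one sample straight to a single-node VictoriaMetrics instance
# over its Prometheus text import endpoint. No SDK, no collector, no agent.
import time
import urllib.request

VM_URL = "http://localhost:8428/api/v1/import/prometheus"  # default port; host is an example

# Prometheus exposition format, with an optional millisecond timestamp.
line = f'http_requests_total{{service="checkout"}} 42 {int(time.time() * 1000)}\n'

req = urllib.request.Request(VM_URL, data=line.encode(), method="POST")
urllib.request.urlopen(req)
```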
First. Love that more tools like Honeycomb (amazing) are popping up in the space. I agree with the post.
But IMO, statistics and probability can't be replaced with tooling, just as software engineering can't be replaced with no-code services for building applications…
If you need to profile a bug or troubleshoot complex systems (distributed systems, databases), you must do your math homework consistently as part of the job.
If you don't understand the distribution of your data, its seasonality, and noise vs. signal, how can you measure anything valuable? How can you ask the right questions?
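Here's a toy illustration of the point (all numbers are synthetic): the mean can look healthy while the distribution is screaming.

```python
# Synthetic latency mix: 99% fast requests, 1% slow ones.
import random

random.seed(42)
latencies = [random.gauss(20, 5) for _ in range(9900)]      # ~20 ms typical
latencies += [random.gauss(2000, 200) for _ in range(100)]  # ~2 s stragglers
latencies.sort()

mean = sum(latencies) / len(latencies)
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99)]

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p99={p99:.0f}ms")
# mean ~40 ms and p50 ~20 ms look fine; p99 lands in the slow mode.
# A dashboard (or alert) built on the mean sleeps through a 2-second tail.
```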
We need more automation. Less data, more insight. We're at the firehose stage, and nobody's got time for that. ML-based anomaly detection is not widespread, and automated RCA barely exists. We'll have solved the problem when AI detects the problem and submits the bug fix before the engineers wake up.
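For calibration, the floor for "anomaly detection" is something like the rolling z-score below (a toy sketch: the window size and 3-sigma threshold are arbitrary choices, and real detectors also have to handle trend and seasonality):

```python
# Toy detector: flag points more than k standard deviations from the mean of
# the preceding window. This is the crude baseline that "ML-based" should beat.
def anomalies(series, window=60, k=3.0):
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = sum(recent) / window
        std = (sum((x - mean) ** 2 for x in recent) / window) ** 0.5
        if std == 0:
            if series[i] != mean:   # any deviation from a flat baseline
                flagged.append(i)
        elif abs(series[i] - mean) > k * std:
            flagged.append(i)
    return flagged

signal = [100.0] * 100
signal[80] = 150.0              # inject one spike
print(anomalies(signal))        # -> [80]
```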
Of course that sucks. Just enable full time-travel recording in production, and then you can use a standard multi-program trace visualizer and time-travel debugger to pin down the exact execution, down to the instruction, and precisely identify root causes in the code.
Everything is then instrumented automatically and exhaustively analyzable using standard tools. At most you might need to add in some manual instrumentation to indicate semantic layers, but even that can frequently be done after the fact with automated search and annotation on the full (instruction-level) recording.
You're not the first person I've met who has articulated an idea like this. It sounds amazing. Do you have an idea why this approach isn't broadly popular?
Cost and compliance are non-trivial for non-trivial applications. Universal instrumentation and recording creates a meaningful fixed cost for every transaction, and you must record ~every transaction; you can't sample & retain post-hoc. If you're processing many thousands of TPS on many thousands of nodes, that quickly adds up to a very significant aggregate cost, even if the individual cost is small.
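The back-of-envelope math makes the point quickly. These numbers are all invented for illustration, but they're plausible round figures for a large fleet:

```python
# Cost of "record everything" when you can't sample.
tps_per_node = 2_000        # transactions/sec per node (assumption)
nodes = 5_000               # fleet size (assumption)
bytes_per_txn = 10_000      # 10 KB of recording per transaction (assumption)
seconds_per_day = 86_400

daily_pb = tps_per_node * nodes * bytes_per_txn * seconds_per_day / 1e15
print(f"{daily_pb:.1f} PB/day")  # -> 8.6 PB/day, before replication or retention
```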
For compliance (or contractual agreements) there are limitations on data collection, retention, transfer, and access. I certainly don't want private keys, credentials, or payment instruments inadvertently retained. I don't want confidential material distributed out of band or in an uncontrolled manner (like onto your dev laptop). I probably don't even want employees to be able to _see_ "customer data." That runs headlong into a bunch of challenges, because low-level trace/sampling/profiling tools have more or less open access to record and retain arbitrary bytes.
Edit: I'm a big fan of continuous and pervasive observability and tracing data. Enable and retain that at ~debug level and filter + join post-hoc as needed.
My skepticism above is about continuous profiling and recording (a la VTune/perf/eBPF), which is where "you" need to be cognizant of risks and costs.
Mostly agree, although I think ease of instrumentation is getting pretty good. At least in the Python ecosystem, you set some env vars and run opentelemetry-bootstrap and it spits out a list of packages to add. Then you run your code with the OTel CLI wrapper and it just works.
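Concretely, the flow is roughly the following. The commands and env vars are from the OpenTelemetry Python docs; the app, service name, and endpoint are placeholders:

```python
# Setup steps, as comments (they're the whole trick):
#
#   pip install opentelemetry-distro opentelemetry-exporter-otlp
#   opentelemetry-bootstrap -a install      # scans installed packages and
#                                           # installs matching instrumentation
#   OTEL_SERVICE_NAME=checkout \
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
#   opentelemetry-instrument python app.py  # the CLI wrapper mentioned above
#
# app.py itself needs zero tracing code:
from flask import Flask

app = Flask(__name__)

@app.route("/ping")
def ping():
    # The wrapper patches Flask (and requests, database drivers, etc., when
    # the corresponding instrumentation packages are installed), so every
    # request becomes a span without touching this file.
    return "pong"

if __name__ == "__main__":
    app.run(port=8080)
```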
Datadog is equally easy.
That alone gets you pretty good canned dashboards on vendors that have built-in APM views.
The rest definitely rings true, and I suspect some of it has come with the ease of software development: you need to know less about computer fundamentals and debugging, given the proliferation of high-level frameworks, codegen, and AI.
I've also noticed a trend that brings observability closer to development via IDE integration, which I think is a good direction. Having the info siloed in an opaque mega data store isn't useful.
> with the rise of [...] microservices, apps were becoming [...] too complex for any individual to fully understand.
But wasn't the idea of microservices that these services would be developed and deployed independently and owned by different teams? Building a single app out of multiple microservices and expecting a single individual to debug it sounds like holding it wrong, which then requires this distributed tracing solution to fix it.
> I can definitely imagine having Claude debug an issue faster than I can type and click around dashboards and query UIs. That sounds fun.
Working on it :)
> IMO, statistics and probability can't be replaced with tooling, just as software engineering can't be replaced with no-code services…
I don't see why the same isn't true for "vibe-fixers" and their data (telemetry).
I believe the author is in the former camp.
> We'll have solved the problem when AI detects the problem and submits the bug fix before the engineers wake up.
Working on it :)
2015 - Ben Sigelman (one of the Dapper folks) cofounds Lightstep
Huge fan of historical artifacts like Cantrill's ACM paper