I share a lot of this sentiment, although I struggle more with the setup and maintenance than the diagnosis.
It's baffling to me that it can still take _so_much_work_ to set up a good baseline of observability (not to mention the time we spend on tweaking alerting). I recently spent an inordinate amount of time trying to make sense of our telemetry setup and fill in the gaps. It took weeks. We had data in many systems, many different instrumentation frameworks (all stepping on each other), noisy alerts, etc.
Part of my problem is that the ecosystem is big. There's too much to learn:
OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I don't know, maybe I'm biased by the JVM-heavy systems I've been working in.
I worked for New Relic for years, and even inside an observability company, observability was still a lot of work to maintain; even then, traces were not heavily used.
I can definitely imagine having Claude debug an issue faster than I can type and click around dashboards and query UIs. That sounds fun.
I completely agree w/ your points about why observability sucks:
- Too much setup
- Too much maintenance
- Too steep of a learning curve
This isn't the whole picture, but it's a huge part of the picture. IMO, observability shouldn't be so complex that it warrants specialized experience; it should be something that any junior product engineer can do on their own.
> Part of my problem is that the ecosystem is big. There's too much to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I don't know, maybe I'm biased by the JVM-heavy systems I've been working in.
We've had success keeping things simple with the VictoriaMetrics stack, avoiding what we perceive as unnecessary complexity in some of the fancier tools/standards.
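To be concrete about what "simple" can look like, here's a minimal sketch; the import endpoint and default port come from the VictoriaMetrics docs, while the host, metric name, and label are made up:

```python
# Sketch: push one sample straight to a single-node VictoriaMetrics instance
# over its Prometheus text import endpoint. No SDK, no collector, no agent.
import time
import urllib.request

VM_URL = "http://localhost:8428/api/v1/import/prometheus"  # default port; host is an example

# Prometheus exposition format, with an optional millisecond timestamp.
line = f'http_requests_total{{service="checkout"}} 42 {int(time.time() * 1000)}\n'

req = urllib.request.Request(VM_URL, data=line.encode(), method="POST")
urllib.request.urlopen(req)
```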
First. Love that more tools like Honeycomb (amazing) are popping up in the space. I agree with the post.
But IMO, statistics and probability can't be replaced with tooling, just as software engineering can't be replaced with no-code services for building applications…
If you need to profile a bug or troubleshoot complex systems (distributed systems, databases), you must do your math homework consistently as part of the job.
If you don't understand the distribution of your data, its seasonality, and noise vs. signal, how can you measure anything valuable? How can you ask the right questions?
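Here's a toy illustration of the point (all numbers are synthetic): the mean can look healthy while the distribution is screaming.

```python
# Synthetic latency mix: 99% fast requests, 1% slow ones.
import random

random.seed(42)
latencies = [random.gauss(20, 5) for _ in range(9900)]      # ~20 ms typical
latencies += [random.gauss(2000, 200) for _ in range(100)]  # ~2 s stragglers
latencies.sort()

mean = sum(latencies) / len(latencies)
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99)]

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p99={p99:.0f}ms")
# mean ~40 ms and p50 ~20 ms look fine; p99 lands in the slow mode.
# A dashboard (or alert) built on the mean sleeps through a 2-second tail.
```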
We need more automation. Less data, more insight. We're at the firehose stage, and nobody's got time for that. ML-based anomaly detection is not widespread, and automated RCA barely exists. We'll have solved the problem when AI detects the problem and submits the bug fix before the engineers wake up.
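For calibration, the floor for "anomaly detection" is something like the rolling z-score below (a toy sketch: the window size and 3-sigma threshold are arbitrary choices, and real detectors also have to handle trend and seasonality):

```python
# Toy detector: flag points more than k standard deviations from the mean of
# the preceding window. This is the crude baseline that "ML-based" should beat.
def anomalies(series, window=60, k=3.0):
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = sum(recent) / window
        std = (sum((x - mean) ** 2 for x in recent) / window) ** 0.5
        if std == 0:
            if series[i] != mean:   # any deviation from a flat baseline
                flagged.append(i)
        elif abs(series[i] - mean) > k * std:
            flagged.append(i)
    return flagged

signal = [100.0] * 100
signal[80] = 150.0              # inject one spike
print(anomalies(signal))        # -> [80]
```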
Of course that sucks. Just enable full time-travel recording in production, and then you can use a standard multi-program trace visualizer and time-travel debugger to pin down the exact execution, down to the instruction, and precisely identify root causes in the code.
Everything is then instrumented automatically and exhaustively analyzable using standard tools. At most you might need to add in some manual instrumentation to indicate semantic layers, but even that can frequently be done after the fact with automated search and annotation on the full (instruction-level) recording.
You're not the first person I've met who has articulated an idea like this. It sounds amazing. Do you have an idea why this approach isn't broadly popular?
Cost and compliance are non-trivial for non-trivial applications. Universal instrumentation and recording creates a meaningful fixed cost for every transaction, and you must record ~every transaction; you can't sample & retain post-hoc. If you're processing many thousands of TPS on many thousands of nodes, that quickly adds up to a very significant aggregate cost, even if the individual cost is small.
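The back-of-envelope math makes the point quickly. These numbers are all invented for illustration, but they're plausible round figures for a large fleet:

```python
# Cost of "record everything" when you can't sample.
tps_per_node = 2_000        # transactions/sec per node (assumption)
nodes = 5_000               # fleet size (assumption)
bytes_per_txn = 10_000      # 10 KB of recording per transaction (assumption)
seconds_per_day = 86_400

daily_pb = tps_per_node * nodes * bytes_per_txn * seconds_per_day / 1e15
print(f"{daily_pb:.1f} PB/day")  # -> 8.6 PB/day, before replication or retention
```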
For compliance (or contractual agreements) there are limitations on data collection, retention, transfer, and access. I certainly don't want private keys, credentials, or payment instruments inadvertently retained. I don't want confidential material distributed out of band or in an uncontrolled manner (like onto your dev laptop). I probably don't even want employees to be able to _see_ "customer data." That runs headlong into a bunch of challenges, because low-level trace/sampling/profiling tools have more or less open access to record and retain arbitrary bytes.
Edit: I'm a big fan of continuous and pervasive observability and tracing data. Enable and retain that at ~debug level and filter + join post-hoc as needed.
My skepticism above is about continuous profiling and recording (a la VTune/perf/eBPF), which is where "you" need to be cognizant of risks and costs.
Mostly agree, although I think ease of instrumentation is getting pretty good. At least in the Python ecosystem, you set some env vars and run opentelemetry-bootstrap and it spits out a list of packages to add. Then you run your code with the OTel CLI wrapper and it just works.
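Concretely, the flow is roughly the following. The commands and env vars are from the OpenTelemetry Python docs; the app, service name, and endpoint are placeholders:

```python
# Setup steps, as comments (they're the whole trick):
#
#   pip install opentelemetry-distro opentelemetry-exporter-otlp
#   opentelemetry-bootstrap -a install      # scans installed packages and
#                                           # installs matching instrumentation
#   OTEL_SERVICE_NAME=checkout \
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
#   opentelemetry-instrument python app.py  # the CLI wrapper mentioned above
#
# app.py itself needs zero tracing code:
from flask import Flask

app = Flask(__name__)

@app.route("/ping")
def ping():
    # The wrapper patches Flask (and requests, database drivers, etc., when
    # the corresponding instrumentation packages are installed), so every
    # request becomes a span without touching this file.
    return "pong"

if __name__ == "__main__":
    app.run(port=8080)
```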
Datadog is equally easy.
That alone gets you pretty good canned dashboards on vendors that have built-in APM views.
The rest definitely rings true, and I suspect some of it has come with the ease of software development: you need to know less about computer fundamentals and debugging, given the proliferation of high-level frameworks, codegen, and AI.
I've also noticed a trend that brings observability closer to development via IDE integration, which I think is a good direction. Having the info siloed in an opaque mega data store isn't useful.
> with the rise of [...] microservices, apps were becoming [...] too complex for any individual to fully understand.
But wasn't the idea of microservices that these services would be developed and deployed independently and owned by different teams? Building a single app out of multiple microservices and expecting a single individual to debug it sounds like holding it wrong, which then requires this distributed tracing solution to fix it.
> I can definitely imagine having Claude debug an issue faster than I can type and click around dashboards and query UIs. That sounds fun.
Working on it :)
> IMO, statistics and probability can't be replaced with tooling, just as software engineering can't be replaced with no-code services…
I don't see why the same isn't true for "vibe-fixers" and their data (telemetry).
I believe the author is in the former camp.
> We'll have solved the problem when AI detects the problem and submits the bug fix before the engineers wake up.
Working on it :)
2015 - Ben Sigelman (one of the Dapper folks) cofounds Lightstep
Huge fan of historical artifacts like Cantrill's ACM paper