This isn't an unknown idea outside of Meta; it's just really expensive, especially if you're using a vendor rather than building your own tooling. Prohibitively so, even with sampling.
> Unlike with prometheus, however, with Wide Events approach we don’t need to worry about cardinality
This is hinting at the hidden reason why not everyone does it. You have to 'worry' about cardinality because Prometheus pre-aggregates data so you can visualize it fast, and optimizes storage around that. If you want the same speed on a massive PB-scale data lake, with an effectively unbounded amount of unstructured data, and in the cloud instead of your own datacenters, it's gonna cost you a lot, and for most companies it is not a sensible expense.
It does work at smaller scale though; we once had an in-house system like this that worked well. Eventually user events were moved to Mixpanel, and everything else to Datadog (metrics/logs/traces, plus a migration to OpenTelemetry). It took months and added two-digit monthly bills, and in the end debugging and resolving incidents wasn't much improved over having instant access to events and business metrics. Whoever figures out a system that can do "wide events" in a cost-effective way from startup to unicorn scale will absolutely make a killing.
Not everything emits wide events. Maybe you can get the entire application layer like that, but there is also value in logs and metrics emitted from the rest of the infra stack.
To be fair, you could probably store and represent everything as wide events, and build visualization tools on top that can combine everything together, even if some of it is sourced from somewhere else.
I worked on Scuba, inside and outside of Meta (Interana), and yeah - It was expensive AF. I recommend focusing on metrics first. Use analytics logging sparingly, and understand the statistics of how metrics work, because without understanding those statistics you'll misread your events anyway.
This is not to say that wide events aren't worth it - For many things, something like Scuba or Bigquery are invaluable. There's ways to optimize. But we're talking about "One of AWS's largest machines" vs "A couple cores", and I suggest learning Prometheus first.
Haha, since you worked on Scuba I’ll mention IMO this point was by far the biggest flaw of ODS. No one ever performed the metric rollups correctly. Average of averages? And at what granularity? ODS downsampled the older time series data but now perhaps you’re taking a percentile over a “max of maxes”. Except it only sometimes used that method of downsampling automatically.
And I seem to recall the labels “daily”, “weekly”, and “monthly” not being intuitive either, and two of them meant the same thing... that was quite a mess to work with.
A lot of the autoscaling systems were wonky because the ODS metrics they were based upon didn’t represent what people thought they did.
I don’t know that that’s true. My last two very-not-Meta-sized companies have both had systems that were very cost-effective and essentially what the article describes. It’s not the simplest thing to put in place, but far from unapproachable.
I think one of the big hills is moving to a culture that values observability (or whatever you choose to call it; I prefer forensic debugging). It’s another thing to understand and worry about, and it helps tremendously if there are good, highly visible examples of it.
While I don't have an opinion on wide events (AKA spans) replacing logs, there are benefits to metrics that warrant their existence:
1. They're incredibly cheap to store. In Prometheus, it may cost you as little as 1 byte per sample (ignoring series overheads). Because they're cheap, you can keep them for much longer and use them for long-term analysis of traffic, resource use, performance, etc. Most tracing vendors seem to cap storage at 1-3 months, while metric vendors can offer multi-year storage.
2. They're far more accurate than metrics derived from wide events in higher-throughput scenarios. While wide events are incredibly flexible, their higher storage cost means there's an upper limit on the sample rate. The sampled nature of wide events means that deriving accurate counts is far more difficult; metrics really shine in this role (unless you're operating over datasets with very high cardinality). The problem only gets worse when you bring tail sampling into the mix and bias your data towards errors/slow requests.
For point (2), you can derive accurate counts from sampled data if the sampling rate is captured as metadata on every sampled event. Some tools do support this (I work for Honeycomb, and our sampling proxy + backend work like this, can't speak for others).
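The arithmetic behind that is plain inverse-probability weighting. A minimal sketch in Python (the `sample_rate` field name is illustrative, not any vendor's actual schema):

```python
import random

def estimate_count(events, predicate):
    """Estimate the true count of matching events from a sampled
    stream by weighting each kept event by 1 / its sample rate."""
    return sum(1.0 / e["sample_rate"] for e in events if predicate(e))

# Simulate 10,000 real requests kept at a 1-in-10 sample rate,
# with the rate recorded as metadata on every kept event.
random.seed(42)
sampled = [
    {"status": 200, "sample_rate": 0.1}
    for _ in range(10_000)
    if random.random() < 0.1
]

est = estimate_count(sampled, lambda e: e["status"] == 200)
# est lands close to the true 10,000: each kept event stands in for 10
```

With per-event rates, the same formula still works when different request classes are sampled at different rates (e.g. keep all errors, 1-in-100 successes).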
The issue is there are still limits to that, though. I can still get a count of events, or an AVG(duration_ms). But if I have a custom tag I can't get accurate counts of that. And if I want distinct counts of values, I'm out of luck; estimating that from sampled data is an active research problem.
It's an interesting point. We are actually running a test with Honeycomb's Refinery later this week. I'm slightly skeptical, but curious to see if they can overcome this bias.
On top of that, metrics can have exemplars, which give you more (and dynamic) dimensions for buckets without increasing the cardinality of the metric vectors themselves. It's pretty much a wide event, with the sampling rate on this extra information just being the scrape interval you were already using anyway.
Not every library or tool supports exemplars, but they're a big part of the Prometheus & Grafana value proposition that many users entirely overlook.
This is exactly right. This kind of structured logging is great, but it doesn’t replace metrics. You really want to have both, and simple unsampled metrics are actively better for e.g. automated alerting for exactly those reasons. They’re complements more than substitutes.
This is essentially Amazon Coral’s service log format, except service logs include cumulative metrics between log events. This surfaces in CloudWatch Logs as metrics extraction and in Logs Insights as structured log queries. Meta’s Scuba is like a janky imitation of that toolchain.
People point to Splunk and ELK, but they fail to realize that inverted-index-based solutions algorithmically can’t scale to arbitrary sizes. I would rather point people to Grafana Loki and CloudWatch Logs Insights, and the compromises they entail, as the right model for “wide events” or structured-logging-based events and metrics. Their architectures allow you to scale at low cost to PB or even exabyte-scale monitoring.
As far as design and ergonomics go, I'd compare servicelogs to a pile of trash that may yet grow massive enough to accrete into a planetoid.
A text based format whose sole virtue is descending from a system that was composed mainly of bugs that had coalesced into perl scripts.
It's not the basis of something you could even give away, let alone have people willingly pay you for their agony. CloudWatch being rather alike in this regard.
One thing that really gets under my skin when I think about observability data is the abject waste we incur by shipping all this crap around as UTF-8 bytes. This post (from 1996!) puts us all to shame: https://lists.w3.org/Archives/Public/www-logging/1996May/000...
Knowing the type of each field unlocks some interesting possibilities. If we can classify fields as STRING, INTEGER, UUID, FLOAT, TIMESTAMP, IP, etc we could store (and transmit!) them optimally. In particular, knowing whether we can delta-encode is important--if you have a timestamp column, storing the deltas (with varint or vbyte encoding) is way cheaper than storing each and every timestamp. Only store each string once, in a compressed way, and refer to it by ID (with smaller IDs for more frequent strings).
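As a rough illustration of how much delta-plus-varint encoding buys on a timestamp column (a sketch; it assumes monotonically increasing timestamps, since negative deltas would need zigzag encoding first):

```python
def varint_encode(n: int) -> bytes:
    """LEB128-style varint: 7 bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_timestamps(ts: list[int]) -> bytes:
    """Store the first timestamp raw, then varint-encoded deltas."""
    out = bytearray(ts[0].to_bytes(8, "big"))
    for prev, cur in zip(ts, ts[1:]):
        out += varint_encode(cur - prev)
    return bytes(out)

# Millisecond timestamps arriving every ~100 ms.
stamps = [1_700_000_000_000 + 100 * i for i in range(1000)]
encoded = encode_timestamps(stamps)
raw_size = 8 * len(stamps)   # 8000 bytes as fixed 64-bit integers
# encoded is 1007 bytes: one 8-byte header plus one byte per delta
```

Roughly 8x smaller before any general-purpose compression even touches it, which is the kind of win you give up by shipping timestamps as UTF-8 strings.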
It's sickening to imagine how much could be saved by exploiting redundancy in these data if we could just know the type of each field. You get some of this with formats like protocol buffers, but not enough.
Another thing, as you mention, is optimizing for search. Indexing everything seems like the wrong move. Maybe some partial indexing strategy? Rollups? Just do everything with mapreduce jobs? I don't know what the right answer is but fully indexing data which are mostly write-only is definitely wrong.
Storing by delta can bite you quite hard in the event of data corruption. Instead of 1 data point being affected it would cascade down.
Selecting specific ranges with a concrete bottom and top, as in "give me everything between 1-2 pm last Saturday", might also become problematic.
I'm sure there's a tradeoff to be had here; weaving data dependencies throughout your file leaves a redundancy hole that not everyone is willing to accept.
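The cascade is easy to demonstrate: corrupt a single stored delta, and every value reconstructed after it is wrong, not just one.

```python
def decode_deltas(first: int, deltas: list[int]) -> list[int]:
    """Rebuild the original values from a first value plus deltas."""
    values, cur = [first], first
    for d in deltas:
        cur += d
        values.append(cur)
    return values

deltas = [100] * 9
good = decode_deltas(1000, deltas)

corrupted = deltas.copy()
corrupted[2] = 999                    # flip one stored delta...
bad = decode_deltas(1000, corrupted)

wrong = sum(g != b for g, b in zip(good, bad))
# ...and 7 of the 10 decoded values are now wrong, not just 1
```

Real columnar formats bound the blast radius by restarting the encoding at block boundaries, so corruption only cascades within one block.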
Which compromises in CloudWatch Log Insights makes it not the right model for "wide events"?
I have the impression it does a good job providing visibility tools (search, filter, aggregation...) over structured logs.
The ergonomics are bad, though, with the custom query language and low processing speed, depending on the amount of data you're processing during an investigation.
> This surfaces in cloudwatch logs as metrics extraction and Logs Insights as structured log queries. The meta scuba is like a janky imitation of that tool chain
I don't have any experience with scuba besides this article, but I think you've missed the point. Wide events, based on my understanding, are a combination of traditional logs and something akin to service logs.
This provides two crucial improvements. The first is flexible, arbitrary associations as a first-class feature. As I interpret it, wide events give you the ability to associate a free-form traditional log message with additional dimensions, which is similar to what service logs offer but more flexible. E.g. if you log "caught unhandled FooException, returning ServerException" but only emit a metric for ServerException=1, service logs can't help you.
The other major benefit that you seem to have overlooked is a good UI to explore those events. I think most people would agree that the CloudWatch UI is somewhere between bad and mediocre, but the monitor portal UI is nothing short of an unmitigated disaster. And neither gives you the ability described in this article, to point-and-click graph events that match certain criteria. As I read it, it's the equivalent functionality to simple Insights queries, except it doesn't require any typing, searching for the right dimension names, or writing stats queries to get graphs.
A few issues come up. First, inverted indices can be sharded, but the index insert patterns aren’t uniformly distributed; they follow a Zipf distribution, which means your sharding scales proportional to the frequency of the most common token in the log. There are patches, but in the end it sort of boils down to this.
Another issue is that indexing up front is crazy expensive versus doing almost nothing: packing and time indexing, maybe some bloom indices. This is really important because the vast majority of log, event, and telemetry data in general is never accessed. Like 99.99% of it or more.
The technique of something like Loki is to batch data into micro-batches, index the data within each batch into a columnar format (like Parquet or ORC), and time-index the micro-batches. The query path is highly parallel and fairly expensive, but given the cost savings up front it’s a lot cheaper than up-front indexing. You can turn the fan-out knob on queries to any size, and, similar to MPP scale-out databases such as Snowflake, there’s not really much of an upper limit. Effectively everything from ingestion to query scales out linearly, without the uneven heat problems you see in a sharded index.
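Loosely, that architecture can be sketched like this (a toy in-memory version; the real systems seal batches to object storage and fan the scan out across workers):

```python
class MicroBatchStore:
    """Toy sketch of time-indexed micro-batches: no per-event
    index, just a (min_ts, max_ts) range per sealed batch.
    Queries prune whole batches by time, then brute-force scan
    the survivors."""

    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.batches = []          # list of (min_ts, max_ts, events)
        self.open = []

    def append(self, event):       # event = (ts, payload)
        self.open.append(event)
        if len(self.open) >= self.batch_size:
            self._seal()

    def _seal(self):
        ts = [e[0] for e in self.open]
        self.batches.append((min(ts), max(ts), self.open))
        self.open = []

    def query(self, lo, hi, pred=lambda e: True):
        if self.open:
            self._seal()
        hits = []
        for mn, mx, events in self.batches:
            if mx < lo or mn > hi:     # time prune: skip the whole batch
                continue
            hits.extend(e for e in events if lo <= e[0] <= hi and pred(e))
        return hits

store = MicroBatchStore()
for ts in range(5000):
    store.append((ts, "payload"))
hits = store.query(2500, 2600)
# only 1 of the 5 sealed batches is scanned; 101 events match
```

Ingest does almost no work per event, which is exactly the bet: pay at query time, on the tiny fraction of data anyone ever asks about.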
I've seen situations where the cost of indexing all the logs (which were otherwise just written to a hierarchical structure in HDFS and queried with mapreduce jobs) in ES would have been highly significant--think like an uncomfortable fraction of total infrastructure spend. So, sure, you can make it scale linearly by adding enough nodes to keep up with write volume but that doesn't mean it's affordable. And then consider what that's actually accomplishing for those dollars. You're optimizing for quick search queries on data which you'll mostly never query. Worth it?
EDIT: as a user, being able to just run mapreduce jobs over logs is a heck of a lot better experience IMO than trying to torture Kibana into giving me the answers I want.
This thread has a lot of discussion about Wide Events / Structured Logs (same thing) being too big at scale, and you should use metrics instead.
Why does it have to be an either/or thing? Couldn't you hook up a metrics extractor to the event stream and convert your structured logs to compact metrics in-process before expensive serde/encoding? With this your choice doesn't have to affect the code; just write slogs all the time. If you want structured logs, then output them, but if you only want metrics, switch to the metrics-extractor slog handler.
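Something like that is straightforward to sketch with Python's stdlib logging in place of Go's slog (handler and field names here are made up for illustration): the call sites just log, and the attached handler decides whether records become serialized log lines or in-process counters.

```python
import logging
from collections import Counter

class MetricsExtractorHandler(logging.Handler):
    """Fold each structured log record into in-process counters
    keyed by (message, label values) instead of serializing it.
    Call sites stay identical whichever handler is attached."""

    def __init__(self):
        super().__init__()
        self.counters = Counter()

    def emit(self, record):
        labels = getattr(record, "labels", {})
        key = (record.getMessage(), tuple(sorted(labels.items())))
        self.counters[key] += 1

log = logging.getLogger("app")
log.setLevel(logging.INFO)
log.propagate = False          # keep records out of the root logger
metrics = MetricsExtractorHandler()
log.addHandler(metrics)

for status in (200, 200, 500):
    log.info("request_done", extra={"labels": {"status": status}})

count_200 = metrics.counters[("request_done", (("status", 200),))]
# count_200 == 2: two events aggregated, nothing serialized or shipped
```

Attach both this handler and a normal stream/JSON handler and you get metrics and wide events from the same call sites, paying serialization cost only for the latter.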
Further, has nobody tried writing structured logs to Parquet files and shipping out 1MB blocks at once? Way less serde/encoding overhead, and the column-oriented layout compresses like crazy with built-in dictionary and delta encodings.
The isomorphism of traces and logs is clear. You can flatten a trace to a log and you can perfectly reconstruct the trace graph from such a log. I don't see the unifying theme that brings metrics into this framework, though. Metrics feels fundamentally different, as a way to inspect the internal state of your program, not necessarily driven by exogenous events.
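The isomorphism is concrete: if every log line carries span and parent identifiers, the trace tree falls out directly. A minimal sketch (field names like `span_id`/`parent_id` are illustrative):

```python
def build_trace(log_lines):
    """Rebuild the span tree from flat log lines, assuming each
    line carries span_id and parent_id (parent_id=None for root)."""
    children = {}
    root = None
    for line in log_lines:
        if line["parent_id"] is None:
            root = line["span_id"]
        else:
            children.setdefault(line["parent_id"], []).append(line["span_id"])
    return root, children

lines = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout"},
    {"span_id": "b", "parent_id": "a",  "name": "auth"},
    {"span_id": "c", "parent_id": "a",  "name": "charge"},
    {"span_id": "d", "parent_id": "c",  "name": "db.insert"},
]
root, children = build_trace(lines)
# root == "a"; children == {"a": ["b", "c"], "c": ["d"]}
```

Flattening is the reverse walk, which is why "traces vs logs" is a storage/UI question more than a data-model one.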
But I definitely agree with the theme of the article that leaving a big company can feel like you got your memory erased in a time machine mishap. Inside a FANG you might become normalized to logging hundreds of thousands of informational statements, per second, per core. You might have got used to every endpoint exposing thirty million metric time series. As soon as you walk out the door some guy will chew you out about "cardinality" if you have 100 metrics.
I think all metrics can be reconstructed as “wide events” since they’re just a bunch of arbitrary data? Counts, gauges, and histograms at least seem pretty straightforward to me.
It seems like the main motivation for metrics is that sending + storing + querying wide events for everything is cost prohibitive and/or performance intensive. If you can afford it and it works well, wide events are definitely more flexible. A metric is kinda just a pre-aggregation on the event stream.
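As a toy illustration of that pre-aggregation, here's a counter and a Prometheus-style cumulative histogram derived from a handful of made-up wide events:

```python
from collections import Counter

events = [
    {"event": "http_request", "status": 200, "duration_ms": 12},
    {"event": "http_request", "status": 200, "duration_ms": 48},
    {"event": "http_request", "status": 500, "duration_ms": 3},
]

# Counter: just a group-by-label count over the event stream.
requests_total = Counter(e["status"] for e in events)

# Histogram: Prometheus-style cumulative buckets
# (le = "less than or equal to" the bucket bound).
buckets = [5, 10, 25, 50]
hist = {le: sum(e["duration_ms"] <= le for e in events) for le in buckets}

# requests_total == Counter({200: 2, 500: 1})
# hist == {5: 1, 10: 1, 25: 2, 50: 3}
```

A gauge is the same idea with `last value wins` instead of a sum, so every standard metric type really is a fold over the events.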
If you think of a metric as an event representing the act of measuring (along with the result of that measurement), then it becomes the same as any other event.
True. I guess the thing I normally want from metrics is to have a huge number of them that I can look at when I want, but without paying to collect and aggregate them all the time. So in the scenario where they are just events, I'd need some other control system that can trigger the collection of events that aren't normally emitted.
It took the world decades to develop widely accepted standards for working with relational data and SQL. I believe we are at the early stages of doing the same with event data and sequence analytics. It is starting to simultaneously emerge in many different fields:
- eng observability (traces at Datadog, Sumologic, etc)
- operational research (process mining at Celonis)
- product analytics (funnels at Amplitude, Mixpanel)
As with every new field, there are a lot of different and overlapping terms being suggested and explored at the same time.
We are trying to contribute to the field with a deep fundamental approach at Motif Analytics, including a purpose-built set of core sequence operations, rich flow visualizations, a pattern matching query engine, and foundational AI models on event sequences [1].
Fun fact: the creators of Scuba turned it into a startup, Interana (acquired by Twitter), which we took a lot of inspiration from for Motif's query engine.

[1] https://motifanalytics.com
At the company I work for we send JSON to Kafka and subsequently to Elasticsearch, to great effect. That's basically 'wide events'. The magical thing about hooking up a bunch of pipelines with Kafka is that all of a sudden your observability/metrics system becomes an amazing API for extending systems with additional automations. Want to do something when a router connects to a network? Just subscribe to this Kafka topic here. It doesn't matter that the topic was originally intended just to log some events. We even created an open source library for writing and running these pipelines in Jupyter. Here's a super simple example: https://github.com/bitswan-space/BitSwan/blob/master/example...
People tend to think Kafka is hard, but as you can see from the example, it can be extremely easy.
This works well for a while. But eventually you get big, and have little to no idea of what is in your downstream. Then every single format change in any event you write must be treated like open heart surgery, because tracing your data dependencies is unreliable.
Sometimes it seems fixable by 'just having a list of people listening', and then you look, and all some of them do is mildly transform your data and pass it along. It doesn't take long before people realize that 'just logging some events' is making future promises to other teams you don't know about, and people start being terrified of emitting anything.
This is a story I've seen in at least 4 places in my career. Making data available to other people is not any less scary in Kafka than it was back in the days when applications shared a giant database, and you'd see year-long projects to make some mild changes to a data model that was originally designed in 5 minutes.
As for Kafka being easy: it's not quite as hard as some people say, but it's both a pub/sub system and a distributed database. When your clusters get large, it definitely isn't easy.
> This works well for a while. But eventually you get big, and have little to no idea of what is in your downstream. Then every single format change in any event you write must be treated like open heart surgery, because tracing your data dependencies is unreliable.
Yeah, I'd always use protobuf or similar rather than JSON for that reason, and if you need a truly breaking change I'd emit a new version of the events to a new topic rather than trying to migrate the existing one in place. It's not actually so costly to keep writing events to an old topic (and if you really want you can move that part into a separate adapter process that reads your new topic and writes to your old one). Or you can do the whole avro/schema-registry stuff if you prefer.
> Making data available to other people is not any less scary in kafka than it was back in the days where applications shared a giant database
It should be significantly less scary: it's impossible to mutate data in-place; foreign key issues are something you go back and fix and reprocess, rather than something that takes down your OLTP system; schema changes are better understood and less big-bang; and event streams generated by transforming another event stream are completely indistinguishable from "original" event streams, as opposed to views being sort-of-like-tables but having all sorts of caveats and gotchas.
> As for kafka being easy, It's not quite as hard as some people say, but it's both a pub sub system and a distributed database. When your clusters get large, it definitely isn't easy.
There are hard parts but also parts that are easier than a traditional database. There's no query planner, no MVCC, no locks, no deadlocks, no isolation levels, indices are not magic, ...
I think this is the crux of it: if something works for a while, then actually that's fine. As an industry we over-index on complexity and scare new developers towards it. The converse is true too: what works at scale doesn't work at non-scale, not because of the tech, but because holistically you're asking for a lot: a lot of knowledge, a lot of complex tech to be deployed by a small team.
I'm glad that works for you but to me it sounds really expensive. At small scale you can do this any way you want but if you build an observability system with linear cost and a high coefficient it will become an issue if you run into some success.
The only expensive part is the hardware for the Elasticsearch servers. Kafka is cheap to run. We have an on-prem Elasticsearch cluster pulling in tens of thousands of events per second. On-prem servers aren't that expensive; it's really just 6 servers with 20TB each and another 40TB for backups. And it's not like you have to store everything forever... Compare that data flow to everyone watching YouTube all the time. It's really nothing...
We use wide events at work (or really “structured logging” or really “your log system has fields”) and they are great.
But they aren’t a replacement for metrics because metrics are so god damn cheap.
And while I’ve never used a log system with traces, every logging setup I’ve ever used has had request/correlation IDs to reconstruct a trace, because sometimes you just wanna look up a flow and see it without spending time digging through wide events / your log system. If you aren’t looking up logs very often, then yeah, browsing through structured logs isn’t that bad, but do it often and it’s just annoying…
Curious why you think so? An inverted index can be sharded and built/updated/queried in parallel, so it scales linearly.
Bigger than meta? o_0
All you have to do is pass around trace baggage headers, right?