Binance built a 100PB log service with Quickwit

A word of caution here: This is very impressive, but almost entirely wrong for your organisation.

Most log messages are useless 99.99% of the time. Best likely outcome is that its turned into a metric. The once in the blue moon outcome is that it tells you what went wrong when something crashed.

Before you get to shipping _petabytes_ of logs, you really need to start thinking in metrics. Yes, you should log errors, you should also make sure they are stored centrally and are searchable.

But logs shouldn't be your primary source of data, metrics should be.

things like connection time, upstream service count, memory usage, transactions a second, failed transactions, upsteam/downstream end point health should all be metrics emitted by your app(or hosting layer), directly. Don't try and derive it from structured logs. Its fragile, slow and fucking expensive.

comparing, cutting and slicing metrics across processes or even services is simple, with logs its not.

reisse · 2 years ago

Metrics are only good when you can disregard some amount of errors without investigation. But they're a financial organization, they have a certain amount of liability. Generalized metrics won't help to understand what happened to that one particular transaction that failed in a cumbersome way and caused some money to disappear.

KaiserPro · 2 years ago

You can still have logs. What I'm suggesting is that vast amounts of unstructured logs, are worse than useless.

Metics tell you where and when something when wrong. Logs tell you why.

However, a logging framework, which is generally lossy, and has the lowest level of priority in terms of deliverability is not an audit mechanism. especially as nowhere are ACLs or verifiability is mentioned. How do they prove that those logs originates from that machine?

If you're going to have an audit mechanism, some generic logging framework is almost certainly a bad fit.

pavlov · 2 years ago

> "But they're a financial organization, they have a certain amount of liability."

In the loosest possible sense. Binance is an organization that pretended it doesn't have any physical location in any jurisdiction. Its founder is currently in jail in the United States.

fells · 2 years ago

It's always struck me that these are two wildly different concerns though.

Use metrics & SLOs to help diagnose the health of your systems. Derive those directly from logs/traces, keep a sample of the raw data, and now you can point any alert to the sampled data to help go about understanding a client-facing issue.

But, for auditing of a particular transaction, you don't need full indexing of the events? You need a transactional journal for every account/user, likely with a well-defined schema to describe successful changes and failed attempts. Perhaps these come from the same stream of data as the observability tooling, but I can only imagine it must be a much smaller subset of the 100PB that you can avoid doing full inverse indexes on this, because your search pattern is simply answering "what happened to this transaction?"

andrewf · 2 years ago

As an engineer I generally want logs so I can dive into problems that weren't anticipated. Debugging.

I get a lot of pushback from ops folks. They often don't have the same use case. The logs are for the things that'll be escalated beyond the ops folks to the people that wrote the bug.

Yes, most (> 99.99%) of them will never be looked at. But storage is supposed to be cheap, right? If we can waste bytes on loading a copy of Chromium for each desktop application, surely we can waste bytes on this.

My argument is completely orthogonal to "do we want to generate metrics from structured logs".

andmarios · 2 years ago

Most probably, said ops folks have quite a few war stories to share about logs.

Maybe a JVM-based app went haywire, producing 500GB of logs within 15 minutes, filling the disk, and breaking a critical system because no one anticipated that a disk could go from 75% free to 0% free in 15 minutes.

Maybe another JVM-based app went haywire inside a managed Kubernetes service, producing 4 terabytes of logs, and the company's Google Cloud monthly usage went from $5,000 to $15,000 because storing bytes is supposed to be cheap when they are bytes and not when they are terabytes.

I completely agree that logs are useful, but developers often do not consider what to log and when. Check your company's cloud costs. I bet you the cost of keeping logs is at least 10%, maybe closer to 25% of the total cost.

jiggawatts · 2 years ago

Something I’ve discovered is that Azure App Insights can capture memory snapshots when an exception happens. You can download these with a button press and open in Visual Studio with a double-click.

It’s magic!

The stack variables, other threads and most of the heap is right there as-if you had set a breakpoint and it was an interactive debug session.

IMHO this eliminates the need for 99% of the typical detailed tracing seen in large complex apps.

sgarland · 2 years ago

I simply doubt that most of these logs (or anyone’s, usually) are that useful.

I worked at a SaaS observability company (Datadog competitor) that was ingesting, IIRC, multiple GBps of metrics, spread across multiple regions, dozens upon dozens of cells, etc. Our log budget was 650 GB/day.

I have seen – entirely too many times – DEBUG logs running in prod endlessly, messages that are clearly INFO at best classified as ERROR, etc. Not to mention where a 3rd party library is spamming the same line continuously, and no one bothers to track down why and stop it.

ansgri · 2 years ago

You probably don't need full text search, but only exact match search and very efficient time-based retrieval of contiguous log fragments. As an engineer spending quite a lot of time debugging and reading logs, our Opensearch has been almost useless for me (and a nightmare for our ops folks), since it can miss searches on terms like filenames and OSD UX is slow and generally unpleasant. I'd rather have a 100MB of text logs downloaded locally.

Please enlighten me, what are use cases for real full-text search (with fuzzy matching, linguistic normalization etc.) in logs and similar machine-generated transactional data? I understand its use for dealing with human-written texts, but these are rarely in TB range, unless you are indexing the Web or logs of some large-scale communication platform.

9dev · 2 years ago

My response to that would be that you can enable logging locally, or in your staging environment, but not in production. If an error occurs, your telemetry tooling should gather a stack trace and all related metadata, so you should be able to reproduce or at least locate the error.

But all other logs produced at runtime are breadcrumbs that are only ever useful when an exception occurs, anyway. Thus, you don’t need them otherwise.

mianos · 2 years ago

Storage is not cheap at this scale. That would be 100s of thousands a year at the very least. (How I know, I work in an identical area and have huge budget problems with rando verbose logging).

__0x01 · 2 years ago

Error level logging can exist with a metrics focused approach.

Log_out_ · 2 years ago

My system has a version number and input + known starting state dbwise. Now assuming i have determenistic reprodible state, is a log just a replay of that game engine at work?

zzyzxd · 2 years ago

If it crashes, it's probably some scenario that was not properly handled. If it's not properly handled, it's also likely not properly logged. That's why you need verbose logs -- once in a blue moon you need to have the ability to retrospectively investigate something in the past that was not thought through, without using a time machine.

This is more common in the financial world where audit trail is required to be kept long term for regulation. Some auditor may ask you for proof that you have done a unit test for a function 3 years ago.

Every organization needs to find their balance between storage cost and quality of observability. I prefer to keep as much data as we are financially allowed. If Binance is happy to pay to store 100PB logs, good for them!

"Do we absolutely need this data or not" is a very tough question. Instead, I usually ask "how long do we need to keep this data" and apply proper retention policy. That's a much easier question to answer for everyone.

jen20 · 2 years ago

It is quite unlikely that a regulator will ask you for proof you have a unit test for anything (also, that's not what a unit test is - see [1] for a good summary of why).

It _is_ likely a regulator will ask you to prove that you are developing within the quality assurance framework you have claimed you are, though.

Finally though, logs are not an audit trail, and almost no-one can prove their logs are correct with respect to the state of the system at any given time.

[1]: https://www.youtube.com/watch?v=EZ05e7EMOLM

KaiserPro · 2 years ago

> If it's not properly handled, it's also likely not properly logged

Then you're blue moon probability if it being useful rapidly drops. Verbose logs are simply a pain in the arse, unless you have a massive processing system. but even then it just either kneecaps your observation window, or makes your queries take ages.

I am lucky enough to work at a place that has really ace logging capability, but, and I cannot stress this enough, it is colossally expensive. literal billions.

but, logging is not an audit trail. Even here where we have fancy PII shields and stuff, logging doesn't have the SLA to record anything critical. If there is a capacity crunch, logging resolution gets turned down. Plus logging anything of value to the system gets you a significant bollocking.

If you need something that you can hand to a government investigator, if you're pulling logs, you're already in deep shit. An audit framework needs to have a super high SLA, incredible durability and strong authentication for both people and services. All three of those things are generally foreign to logging systems.

Logging is useful, you should log things, but, you should not use it as a way to generate metrics. verbose logs are just a really efficient way to burn through your infrastructure budget.

_boffin_ · 2 years ago

Hogwash. I’ll agree that it’s not as simple with logs, but amazingly powerful, and even more so with distributed tracing.

They both have their places and are both needed.

Without logs, I would not have been able to pinpoint multiple issues that plagued our systems. With logs, we were able to tell google, Apigee, it was there problem, not ours. With tracing, we were able to tell a legacy team they had an issue and was able to pinpoint it after them telling us for 6 months that it was our fault. Without logging and tracing, we wouldn’t have been able to tell our largest client, that we never received a 1/3 of their requests they sent us as our company was running around frantically.

They’re both needed, but for different things…ish.

KaiserPro · 2 years ago

You're missing my main point: logs should not be your primary source of information.

> Without logs, I would not have been able to pinpoint multiple issues that plagued our systems.

Logs are great for finding out what went wrong, but terrible at telling there is a problem. This is what I mean by primary information source. If you are sifting through TBs logs to pinpoint a issue, it sucks. Yes, there are tools, but its still hard.

Logs are shit for deriving metrics, it usually requires some level of bespoke processing which is easy to break silently, especially for rarer messages.

Deleted Comment

p-o · 2 years ago

I would say from my experience, for _application logs_, it's the exact opposite. When you deal with a few GB/day of data, you want to have logs, and metrics can be derived from those logs.

Logs are expensive compared to metrics, but they convey a lot more information about the state of your system. You want to move towards metrics over time only one hotspot at a time to reduce cost while keeping observability of your overall system.

I'll take logs over metrics any day of the week, when cost isn't prohibitive.

KaiserPro · 2 years ago

I was at a large financial news site, They were a total splunk shop. We had lots real steel machines shipping and chunking _loads_ of logs. Every team had a large screen showing off key metrics. Most of the time they were badly maintained and broken, so only the _really_ key metrics worked. Great for finding out what went wrong, terrible at alerting when it went wrong.

However, over the space of about three years we shifted organically over to graphite+grafana. There wasn't a top down push, but once people realised how easy it was to make a dashboard, do templating and generally keep things working, they moved in droves. It also helped that people put metrics emitting system into the underlying hosting app library.

What really sealed the deal was the non-tech business owners making or updating dashboards. They managed to take pure tech metrics and turn them into service/business metrics.

FridgeSeal · 2 years ago

> Logs are expensive compared to metrics, but they convey a lot more information about the state of your system.

My experience has been the kind of opposite.

Yes, you can put more fields in a log, and you can nest stuff. In my experience however, attics tend to give me a clearer picture into the overall state (and behaviour) of my systems. I find them easier and faster to operate, easier to get an automatic chronology going, easier to alert on, etc.

Logs in my apps are mostly relegated to capturing warning error and error states for debugging reference as the metrics give us a quicker and easier indicator of issues.

lmpdev · 2 years ago

I’m not well versed in QA/Sysadmin/Logs but surely metrics suffer from Simpson’s paradox compared to properly probed questions only answered through having access to the entirety of the logs?

If you average out metrics across all log files you’re potentially reaching false or worse inverse conclusions about multiple distinct subsets of the logs

It’s part of the reason why statisticians are so pedantic about the wording of their conclusions and to which subpopulation their conclusions actually apply to

BonusPlay · 2 years ago

When performing forensic analysis, metrics don't usually help that much. I'd rather sift 2PB of logs, knowing that information I'm looking for is in there, than sit at the usual "2 weeks of nginx access logs which roll over".

Obviously running everything with debug logging just burns through money, but having decent logs can help a lot other teams, not just the ones working on the project (developers, sysadmins, etc.)

Deleted Comment

ryukoposting · 2 years ago

Metrics are useful when you know what to measure, which implies that you already have a good idea for what can go wrong. If your entire product exists in some cloud servers that you fully control, that's probably feasible. Binance probably could have done something more elegant than storing extraordinary amounts of logs.

However, if you're selling a physical product, and/or a service that integrates deeply with third party products/services, it becomes a lot more difficult to determine what's even worth measuring. A conservative approach to metrics collection will limit the usefulness of the metrics, for obvious reasons. A "kitchen sink" approach will take you right back to the same "data volume" problem you had with logs, but now your developers have to deal with more friction when creating diagnostics. Neither extreme is desirable, and finding the middle ground would require information that you simply don't have.

On a related note, one approach I've found useful (at a certain scale) is to shove metrics inside of the logs themselves. Put a machine-readable suffix on your human-readable log messages. The resulting system requires no more infrastructure than what your logs are already using, and you get a reliable timeline of when certain metrics appear vs. when certain log messages appear.

temporarely · 2 years ago

Any system has a 'natural set' of metrics. And metrics are not about "what [went] wrong" rather system health. So Metrics -> Alert -> Log Diagnostics.

londons_explore · 2 years ago

When you have metrics, you should also keep sampled logs.

Ie. 1 per million log entries is kept. Write some rules to try and keep more of the more interesting ones.

One way to do this is to have your logging macro include the source file and line number the logline came from, and then, for each file and line number emit/store no more than 1 logline per minute.

That way you get detailed records of rare events, while filtering most of the noise.

hooverd · 2 years ago

There are also different types of logs. Maybe you want every transaction action but don't need a full fidelity copy of every load balancer ping from the last ten years.

zarathustreal · 2 years ago

I’ve got to disagree here, especially with memoization and streaming, deriving metric from structured logs is extremely flexible, relatively fast, and can be configured to be as cheap as you need it to be. With streaming you can literally run your workload on a raspberry pi. Granted, you need to write the code to do so yourself, most off-the-shelf services probably are expensive

KaiserPro · 2 years ago

> memoization and streaming,

memoization isn't free in logs, you're basically deduping an unbounded queue and its difficult to scale from one machine. Its both CPU and Memory heavy. I mean sure you can use scuba, which is great, but that's basically a database made to look like a log store.

> deriving metric from structured logs is extremely flexible

Assuming you can actually generate structured logs reliably. but even if you do, its really easy to silently break it.

> With streaming you can literally run your workload on a raspberry pi

no, you really can't. Streaming logs to a centralised place is exceptionally IO heavy. If you want to generate metrics from it, its CPU heavy as well. If you need speed, then you'll also need lots of RAM, otherwise searching your logs will cause logging to stop. (either because you've run out of CPU, or you've just caused the VFS cache to drop because you're suddenly doing no predictable IO. )

greylog exists for streaming logs. hell, even rsyslog does it. Transporting logs is fairly simple, storing and generating signal from it is very much not.

tuyguntn · 2 years ago

> Most log messages are useless 99.99% of the time.

Things are useless until first crash happens, same thing applies to replication, you don't need replication until your servers start crashing.

> But logs shouldn't be your primary source of data, metrics should be.

There are different types of data related to the product:

    * product data - what's in your db
    * logs - human readable details of a journey for a single request
    * metrics - approximate health state of overall system, where storing high cardinality values are bad (e.g. customer_uuid)
    * traces - approximate details of a single request to be able to analyze request journey through systems, where storing high cardinality values might still be bad.

Logs are useful, but costly. Just like everything else which makes system more reliable

galkk · 2 years ago

Just to be sure, I'm speaking below about application/system logs, not as "our event sourcing uses log storage"

Yes, you probably don't want to store debug logs of 2 years ago, but logs and metrics solve very different problems.

Logs need to have determined lifecycle, e.g. most detailed logs are stored for 7/14/30/release cadence days, then discarded. But when you need to troubleshoot something, metrics give you signal, but logs give you information about what was going on.

giancarlostoro · 2 years ago

> Most log messages are useless 99.99% of the time. Best likely outcome is that its turned into a metric. The once in the blue moon outcome is that it tells you what went wrong when something crashed. Wonder if just keeping timestamps in a more efficient table of each unique textual log entry would be better? Or rather log entry text template. Then store the arguments also separate.

anonygler · 2 years ago

"Once in a blue moon" -- you mean the thing that constantly happens? If you're not using logs, you're not practicing engineering. Metrics can't really diagnose problems.

It's also a lot easier to inspect a log stream that maps to an alert with a trace id than it is to assemble a pile of metrics for each user action.

hot_gril · 2 years ago

I think the above comment is just saying that you shouldn't use logs to do the job of metrics. Like, if you have an alert that goes off when some HTTP server is sending lots of 5xx, that shouldn't rely on parsing logs.

pphysch · 2 years ago

> But logs shouldn't be your primary source of data, metrics should be.

Metrics, logs, relational data, KVs, indexes, flat files, etc. are all equally valid forms of data for different shapes of data and different access patterns. If you are building for a one-size-fits-all database you are in for a nasty surprise.

Deleted Comment

pawelduda · 2 years ago

With logs you can get an idea of what events happened in what order during some complex process, stretched over long timeframe, and so on. I don't think you can do this with a metric

KaiserPro · 2 years ago

> With logs you can get an idea of what events happened in what order

Again, if you're at that point, you need logs. But thats never going to be your primary source of information. if you have more than a few services running at many transactions a second, you can't scale that kind of understanding using logs.

This is my point, if you have >100 services, each with many tens or hundreds of processes, your primary (well it shouldn't be, you need pre SLA fuckup alerts)alert to something going wrong is something breeching an SLA. That's almost certainly a metric. Using logs to derive that metric means you have a latency of 60-1500 seconds

Getting your apps to emit metrics directly means that you are able to make things much more observable. It also forces your devs to think about _how_ their app is observed.

Deleted Comment

derefr · 2 years ago

I would note that a notional "log store", doesn't have to just be used for things that are literally "logs."

You know what else you could call a log store? A CQRS/ES event store.

(Specifically, a "log store" is a CQRS/ES event store that just so happens to also remember a primary-source textual representation for each structured event-document it ingests — i.e. the original "log line" — so that it can spit "log lines" back out unchanged from their input form when asked. But it might not even have this feature, if it's a structured log store that expects all "log lines" to be "structured logging" formatted, JSON, etc.)

And you know what the most important operation a CQRS/ES event store performs is? A continuous streaming-reduction over particular filtered subsets of the events, to compute CQRS "aggregates" (= live snapshot states / incremental state deltas, which you then continuously load into a data warehouse to power the "query" part of CQRS.)

Most CQRS/ES event stores are built atop message queues (like Kafka), or row-stores (like Postgres). But neither are actually very good backends for powering the "ad-hoc-filtered incremental large-batch streaming" operation.

• With an MQ backend, streaming is easy, but MQs maintain no indices for events per se, just copies of events in different topics; so filtered streaming would either have the filtering occur mostly client-side; or would involve a bolt-on component that is its own "client-side", ala Kafka Streams. You can use topics for this — but only if you know exactly what reduction event-type-sets you'll need before you start publishing any events. Or if you're willing to keep an archival topic of every-event-ever online, so that you can stream over it to retroactively build new filtered topics.

• With a row-store backend, filtered streaming without pre-indexing is tenable — it's a query plan consisting of a primary-key-index-directed seq scan with a filter node. But it's still a lot more expensive than it'd be to just be streaming through a flat file containing the same data, since a seq scan is going to be reading+materializing+discarding all the rows that don't match the filtering rule. You can create (partial!) indices to avoid this — and nicely-enough, in a row-store, you can do this retroactively, once you figure out what the needs of a given reduction job are. But it's still a DBA task rather than a dev task — the data warehouse needs to be tweaked to respond to the needs of the app, every time the needs of the app change. (I would also mention something about schema flexibility here, but Postgres has a JSON column type, and I presume CQRS/ES event-store backends would just use that.)

A CQRS/ES event store built atop a fully-indexed document store / "index store" like ElasticSearch (or Quickwit, apparently) would have all the same advantages of the RDBMS approach, but wouldn't require any manual index creation.

Such a store would perform as if you took the RDBMS version of the solution, and then wrote a little insert-trigger stored-procedure that reads the JSON documents out of each row, finds any novel keys in them, and creates a new partial index for each such novel key. (Except with much lower storage-overhead — because in an "index store" all the indices share data; and much better ability to combine use of multiple "indices", as in an "index store" these are often not actually separate indices at all, but just one index where the key is part of the index.)

---

That being said, you know what you can use the CQRS/ES model for? Reducing your literal "logs" into metrics, as a continuous write-through reduction — to allow your platform to write log events, but have its associated observability platform read back pre-aggregated metrics time-series data, rather than having to crunch over logs itself at query time.

And AFAIK, this "modelling of log messages as CQRS/ES events in a CQRS/ES event store, so that you can do CQRS/ES reductions to them to compute metrics as aggregates" approach is already widely in use — but just not much talked about.

For example, when you use Google Cloud Logging, Google seems to be shoving your log messages into something approximating an event-store — and specifically, one with exactly the filtered-streaming-cost semantics of an "index store" like ElasticSearch (even though they're actually probably using a structured column-store architecture, i.e. "BigTable but append-only and therefore serverless.") And this event store then powers Cloud Logging's "logs-based metrics" reductions (https://cloud.google.com/logging/docs/logs-based-metrics).

I wonder how much their setup costs. Naively, if one were to simply feed 100 PB into Google BigQuery without any further engineering efforts, it would cost about 3 million USD per month.

francoismassot · 2 years ago

Good question.

Let's estimate the costs of compute.

For indexing, they need 2800 vCPUs[1], and they are using c6g instances; on-demand hourly price is $0.034/h per vCPU. So indexing will cost them around $70k/month.

For search, they need 1200 vCPUs, it will cost them around $30k/month.

For storage, it will cost them $23/TB * 20000 = $460k/month.

Storage costs are an issue. Of course, they pay less than $23/TB but it's still expensive. They are optimizing this either by using different storage classes or by moving data to cheaper cloud providers for long term storage (less requests mean you need less performant storage and usually you can get a very good price on those object storages).

On quickwit side, we will also improve the compression ratio to reduce the storage footprint.

[1]: I fixed the num vCPUs number of indexing, it was written 4000 when I published the post, but it corresponded to the total number of vCPUs for search and indexing.

rcaught · 2 years ago

Savings plans, spot, EDP discounts. Some of these have to be applied, right?

onlyrealcuzzo · 2 years ago

A lot.

1PB with triple redundancy costs around ~$20k just in hard drive costs per year. That's ~$2.5M per year just in disks.

I'd be impressed if they're doing this for less than $1.5M per month (including SWE costs).

Obviously, if they can, saving $1.5M a month vs BigQuery seems like maybe a decent reason to DIY.

BiteCode_dev · 2 years ago

Why per year? If they buy their own server, they keep the disk several years.

The money motivation to self host on bare metal at this scale is huge.

the_arun · 2 years ago

DIY also comes with the cost of managing it. We need a team to maintain, bug fix etc., not hard but cost

AJSDfljff · 2 years ago

Good question. I thought it would be a no brainer to put it on s3 or similiar but thats already way to expensive at 2m/month without api requests.

Backplace storage pods are an initial investment of 5 Million, thats probably the best bet you could do and on that savings level, having 1-3 good people dedicated to this is probably still cheaper.

But you could / should start talking to the big cloud providers to see if they are flexible enough going lower on the price.

I have seen enough companies, including big ones, being absolut shitty in optimizing these types of things. At this level of data, i would optimize everyting including encoding, date format etc.

But i said it in my other comment: the interesting questions are not answered :D

orf · 2 years ago

The compressed size is 20pb, so it’s about 500k per month in S3 fees

Daviey · 2 years ago

"Object storage as the primary storage: All indexed data remains on object storage, removing the need for provisioning and managing storage on the cluster side."

So the underlying storage is still Object storage, so base that around your calculations depending if you are using S3, GCP Object Storage, self hosted Ceph, MinIO, Garage or SeaweedFS.

Aurornis · 2 years ago

They provide some big hints about the number of vCPUs and the size of the compressed data set on S3:

> Size on S3 (compressed): 20 PB

There are also charts about vCPUs and RAM for the indexing and searching clusters.

gaogao · 2 years ago

Yeah, doing some preferred cloud Data Warehouse with an indexing layer seems fine for this sort of thing. That has an advantage over something specialized like this of still being able to easily do stream processing / Spark / etc, plus probably saves some money.

Maybe Quickwit is that indexing layer in this case? I haven't dug too much into the general state of cloud dw indexing.

fulmicoton · 2 years ago

Quickwit is designed to do full-text search efficiently with an index stored on an object storage.

There are no equivalent technology, apart maybe:

- Chaossearch but it is hard to tell because they are not opensource and do not share their internals. (if someone from chaossearch wants to comment?)

- Elasticsearch makes it possible to search into an index archived on S3. This is still a super useful feature as a way to search punctually into your archived data, but it would be too slow and too expensive (it generates a lot of GET requests) to use as your everyday "main" log search index.