So yeah, this is only really relevant for collecting logs from ClickHouse itself, not for logs from anything else. Good for them, and I really love ClickHouse, but it's not broadly applicable.
> If a service is crash-looping or down, SysEx is unable to scrape data because the necessary system tables are unavailable. OpenTelemetry, by contrast, operates in a passive fashion. It captures logs emitted to stdout and stderr, even when the service is in a failed state. This allows us to collect logs during incidents and perform root cause analysis even if the service never became fully healthy.
We use k8s + the OTel filelog receiver. In that case you don't have to connect to the ClickHouse instance to collect what it's writing to stdout/stderr; you just tail /var/log/pods/*/*/*.log.
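A minimal sketch of that passive approach in plain Python, just to make the idea concrete (this is roughly what the filelog receiver automates for you; the glob pattern assumes the standard kubelet log layout, and the offset handling is deliberately naive, ignoring rotation and partial lines):

    # toy pod-log tailer: emit whatever new lines appeared since the last poll
    import glob
    import time

    offsets = {}  # path -> byte position we have already read up to

    def poll_pod_logs(pattern="/var/log/pods/*/*/*.log"):
        for path in glob.glob(pattern):
            try:
                with open(path, "rb") as f:
                    f.seek(offsets.get(path, 0))
                    chunk = f.read()          # everything written since last poll
                    offsets[path] = f.tell()
            except OSError:
                continue  # file rotated or deleted between glob() and open()
            for line in chunk.decode("utf-8", errors="replace").splitlines():
                yield path, line

    if __name__ == "__main__":
        while True:
            for path, line in poll_pod_logs():
                print(path, line)
            time.sleep(5)

The point is that the service itself never has to be healthy or reachable for this to work.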
> Can you search log data in this volume? ElasticSearch has query capabilities for small scale log data I think.

(Context: I work at this scale)

Yes. However, as you can imagine, the processing costs can be enormous. If your indexing/ordering/clustering strategy isn't set up well, a single query can easily end up costing you on the order of $1-$10 to do something as simple as "look for records containing this string".
My experiences line up with theirs: at the scale where you are moving petabytes of data, the best optimizations are, unsurprisingly, "touch as little data as possible, as few times as possible" and "move as little data as possible". Every time you serialize/deserialize, and every time you perform disk/network I/O, you add a lot of performance cost and therefore overall cost to your wallet.
Naturally, this can put OTel directly at odds with efficiency because the OTel collector is an extra I/O and serialization hop. But then again, if you operate at the petabyte scale, the amount of money you save by throwing away a single hop can more than pay for an engineer whose only job is to write serializer/deserializer logic.
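To make the "ordering strategy" point concrete, a rough sketch assuming a local ClickHouse and the clickhouse-connect Python driver (table layout and filters are made up for illustration):

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # sort key (service, ts) means data on disk is clustered by those columns
    client.command("""
        CREATE TABLE IF NOT EXISTS logs
        (
            ts      DateTime,
            service LowCardinality(String),
            message String
        )
        ENGINE = MergeTree
        ORDER BY (service, ts)
    """)

    # cheap-ish: the sort key prunes most parts before the substring match runs
    client.query("""
        SELECT count() FROM logs
        WHERE service = 'checkout'
          AND ts >= now() - INTERVAL 1 HOUR
          AND message LIKE '%timeout%'
    """)

    # expensive: no usable prefix of the sort key, so every part gets scanned
    client.query("SELECT count() FROM logs WHERE message LIKE '%timeout%'")

The second query is the "$1-$10 to grep for a string" case once the table is petabytes.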
How do engineers troubleshoot then? Our engineers would throw hands if they were asked not to parse through two months' worth of log volume for a single issue.
> Why would I use ClickHouse instead of storing log data as json file for historical log data?
There are multiple reasons:
1. Databases optimized for logs (such as ClickHouse or VictoriaLogs) store logs in compressed form, where the values of each log field are grouped and compressed individually (aka column-oriented storage). This takes less storage space than plain files with JSON logs, even when those files are compressed (a toy comparison is sketched after this list).
2. Databases optimized for logs run typical queries much faster than grep over JSON files. Performance gains can be 1000x or more, because these databases skip reading unneeded data. See https://chronicles.mad-scientist.club/tales/grepping-logs-re...
3. How are you going to grep 100 petabytes of JSON files? Databases optimized for logs can query such volumes of logs because they scale horizontally by adding more storage nodes and storage space.
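Here is a toy sketch of point 1, purely illustrative and not a benchmark: the same records compress better when the values are grouped per field before compression, because repeated keys disappear and similar values sit next to each other.

    import json
    import zlib

    records = [
        {"ts": 1700000000 + i, "level": "INFO", "service": "api",
         "message": f"request {i % 50} served"}
        for i in range(10_000)
    ]

    # row-oriented: one JSON object per line, compressed as a whole
    row_blob = zlib.compress("\n".join(json.dumps(r) for r in records).encode())

    # column-oriented: group the values of each field, compress each column separately
    columns = {k: [str(r[k]) for r in records] for k in records[0]}
    col_blobs = [zlib.compress("\n".join(v).encode()) for v in columns.values()]

    print("compressed JSON lines:", len(row_blob))
    print("compressed columns:   ", sum(len(b) for b in col_blobs))

Real log databases go much further than this (typed columns, dictionary and delta encodings, sparse indexes), but the grouping-by-field effect is the core of it.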
Scale and costs. We're facing this kind of logging scale at my work. A naive "push JSON into Splunk" would cost us over $6M/year, but I can only get maybe 5-10% of that approved.
In the article, they talk about needing 8k CPUs to process their JSON logs, but only 90 CPUs afterward.
A couple of years ago ClickHouse wasn't that good at full-text search; to me that was the biggest drawback. Yes, it's faster and can handle ES-level scale, but depending on your use case it's way faster to query ES when you do FTS or grouping without a pre-built index.
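For what it's worth, the usual ClickHouse workaround is a token-based bloom-filter skip index plus hasToken(); a rough sketch assuming a local server and the clickhouse-connect driver (table and index names are made up):

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # tokenbf_v1 is a bloom filter over the tokens in the column; it lets the
    # engine skip granules that cannot contain the searched word.
    client.command("""
        CREATE TABLE IF NOT EXISTS logs_fts
        (
            ts      DateTime,
            message String,
            INDEX message_tokens message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
        )
        ENGINE = MergeTree
        ORDER BY ts
    """)

    rows = client.query(
        "SELECT ts, message FROM logs_fts WHERE hasToken(message, 'timeout') LIMIT 10"
    ).result_rows

As far as I know it's still not a real inverted index like ES has; there's no relevance ranking or fuzzy matching, it just makes exact-token filters cheap.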
Do wide events really have to take up this much space? I mean, observability is largely a sampling problem where the goal is to maximize your ability to reconstruct the state of the environment at a given time using a minimal amount of storage. You can accomplish that either by reducing the number of samples taken or by improving your compression.
For the latter, I have a very hard time believing we’ve squeezed most of the juice out of compression already. Surely there’s an absolutely massive amount of low-rank structure in all that redundant data. Yeah, I know these companies already use inverted indices and various sorts of trees, but I would have thought there are more research-y approaches (e.g. low rank tensor decomposition) that if we could figure out how to perform them efficiently would blow the existing methods out of the water. But IDK, I’m not in that industry so maybe I’m overlooking something.
> Do wide events really have to take up this much space?
100PB is the total volume of the raw, uncompressed data for the full retention period (180 days). Compression is what makes it cost-efficient: on this dataset we see ~15x compression, so we only store around 6.5PB at rest.
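The rounded figures line up:

    raw_pb = 100           # raw volume over the 180-day retention period
    ratio = 15             # observed compression ratio (~15x)
    print(raw_pb / ratio)  # ~6.7 PB, i.e. the "around 6.5PB at rest" above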
There isn't much information about correlation. What are the state-of-the-art tools and techniques for observability in stateful use cases?
Let's take the example of an SFU-based video conferencing app, where user devices go through multiple API calls to join a session. Now imagine a user reports that they cannot see video from another participant. How can such problems be effectively traced?
Of course, I can manually filter logs and traces by the first user, then by the second user, and look at the signaling exchange and frontend/backend errors. But are there better approaches?
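For context, the most concrete thing we do today is stamp every span (and log line) with a shared session id, so both participants show up under one filter instead of two separate searches. A minimal sketch with the OpenTelemetry Python API; the attribute names are made up:

    from opentelemetry import trace

    tracer = trace.get_tracer("conference.signaling")

    def handle_join(session_id: str, user_id: str):
        with tracer.start_as_current_span("join_session") as span:
            # every participant's join carries the same session id, so a single
            # filter (conference.session.id = X) pulls up the whole exchange
            span.set_attribute("conference.session.id", session_id)
            span.set_attribute("conference.user.id", user_id)
            # ... signaling / SFU subscribe calls happen inside this span ...

But that still leaves the cross-user media-path correlation to eyeballing, hence the question.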
When I go back from ClickHouse to Postgres, I am always shocked. Like, what is it doing for several minutes importing this 20G dump? Shouldn't it take seconds?
Every time I use ClickHouse I want to blow my brains out, especially knowing that Postgres exists. I'm not saying ClickHouse doesn't have its place or that Postgres can do everything that ClickHouse can.
What I am saying is that I really dislike working in ClickHouse with all of its weird footguns. Unless you are using it in a very specific and, in my opinion, limited way, it feels worse than Postgres in every way.
Anything in my life that uses ZooKeeper or its dumbass etcd friend means I'm going to have a real bad time. I am thankful they're at least shipping their own ZK-ish replacement, but it seems to have fallen into the same trap as etcd, where membership has to be managed like the precious little pets that they are: https://clickhouse.com/docs/guides/sre/keeper/clickhouse-kee...
Just don't use ClickHouse for OLTP tasks. ClickHouse is an analytical database and isn't optimized for transactional workloads. Keep calm and use PostgreSQL for OLTP and ClickHouse for OLAP.
Yes, this is what the people who will curse you out and judge you for not using wide events omit: it will greatly increase storage costs compared to the conventional combination of metrics + traces + sampled logging. It has both a benefit and a cost, and the cost part is always omitted.
Properly implemented wide events usually reduce storage costs compared to the typical chaotic logging of everything. The expectation is that a single external request produces exactly one wide event containing all the information about that request which may be needed for later debugging and analytics. See https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide... .
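A minimal sketch of that one-event-per-request shape, with made-up field names (the guide linked above goes much deeper):

    import json
    import time

    def handle_request(req):
        started = time.time()
        event = {
            "http.method": req["method"],
            "http.path": req["path"],
            "user.id": req.get("user_id"),
        }
        try:
            # the handler keeps enriching the same dict as it learns things
            event["cache.hit"] = False
            event["db.rows_read"] = 42
            event["http.status"] = 200
        except Exception as exc:
            event["http.status"] = 500
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = round((time.time() - started) * 1000, 2)
            print(json.dumps(event))  # emitted exactly once, at the end

    handle_request({"method": "GET", "path": "/checkout", "user_id": "u123"})

One request, one row; there are no separate "started handling", "cache miss", "done" lines to join back together later.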
How would you add an outgoing request to an external system into the wide event?
For example, I receive a request, and while handling it I make an HTTP call to http://example.com. In tracing that would be a separate span, but how do you handle that in a single wide event?