Readit News
hagen1778 commented on I can't recommend Grafana anymore   henrikgerdes.me/blog/2025... · Posted by u/gpi
solatic · a month ago
Mimir is just architected for a totally different order of magnitude of metrics. At that scale, yeah, Kafka is actually necessary. There are no other open-source solutions offering the same scalability, period.

That's beside the point that most customers will never need that level of scale. If you're not running Mimir on a dedicated Kubernetes cluster (or at least a cluster dedicated to Grafana / observability), then it's probably over-engineered for your use case. Just use Prometheus.

hagen1778 · a month ago
Using "period" triggers me :)

If Mimir is the only one, why isn't Roblox, a GrafanaLabs customer, using Mimir for monitoring? They're using VictoriaMetrics at a scale of roughly 5 billion active time series. See https://docs.victoriametrics.com/victoriametrics/casestudies....

No solution is perfect. Each one has its own trade-offs. That's why it triggers me when I see statements like this one.

hagen1778 commented on I can't recommend Grafana anymore   henrikgerdes.me/blog/2025... · Posted by u/gpi
akvadrako · a month ago
VictoriaMetrics, which has a database design similar to ClickHouse, is good for metrics.

But it doesn't have a complete dashboard UI like Grafana.

hagen1778 · a month ago
Just to add, VictoriaMetrics covers all 3 signals:

- VictoriaMetrics for metrics. With Prometheus API support, so it integrates with Grafana via the Prometheus datasource. It also has its own Grafana datasource with extra functionality.

- VictoriaLogs for logs. Integrates natively with Grafana via the VictoriaLogs datasource.

- VictoriaTraces for traces. With Jaeger API support, so it integrates with Grafana via the Jaeger datasource.

All three solutions support alerting, are maintained by the same team, are Apache-2.0 licensed, and are focused on resource efficiency and simplicity.

hagen1778 commented on I can't recommend Grafana anymore   henrikgerdes.me/blog/2025... · Posted by u/gpi
didierbreedt · a month ago
I have found Grafana to be a decent product, but Prom needs a better horizontally scalable solution. We use Vector and ClickHouse for logging and it works really well.
hagen1778 · a month ago
There are plenty of ways to scale Prometheus:

- Thanos

- Mimir

- VictoriaMetrics

All of them provide a way to scale monitoring to insane numbers. The differences are in architecture, maintainability and performance. But make your own choices here.

There also used to be m3db from Uber, but the project seems pretty dead now.

And there was the Cortex project, mostly maintained by GrafanaLabs. But at some point they forked Cortex and named it Mimir. Cortex is now maintained by Amazon and, as I understand, is powering Amazon Managed Prometheus. However, I would avoid using Cortex exactly because it is now maintained by Amazon.

hagen1778 commented on I can't recommend Grafana anymore   henrikgerdes.me/blog/2025... · Posted by u/gpi
stym06 · a month ago
> "But I got it all working; now I can finally stop explaining to my boss why we need to re-structure the monitoring stack every year."

Prometheus and Grafana have been progressing in their own ways, each trying to offer a full-stack solution, and then the OTEL thingy came and ruined the party for everyone.

hagen1778 · a month ago
I think OTEL has made things worse for metrics. Prometheus was so simple and clean before the long journey toward OTEL support began. Now Prometheus is much more complicated:

- all the delta-vs-cumulative counter confusion

- push support for Prometheus, and the resulting out-of-order errors

- the {"metric_name"} syntax changes in PromQL

- resource attributes and the new info() function needed to join them

I just don’t see how any of these OTEL requirements make my day-to-day monitoring tasks easier. Everything has only become more complicated.

And I haven’t even mentioned the cognitive and resource cost everyone pays just to ship metrics in the OTEL format - see https://promlabs.com/blog/2025/07/17/why-i-recommend-native-...

hagen1778 commented on Datadog's $65M/year customer mystery solved   blog.pragmaticengineer.co... · Posted by u/thunderbong
hagen1778 · 5 months ago
My understanding is that with Prometheus+Grafana, and the rest of their stack, you can achieve the same functionality as Datadog (or even more) at a much lower cost. But it requires engineering time to set up these tools, monitor them, and build dashboards and alerts. Build an observability platform at home, in other words.

But what about other open-source solutions that are already trying very hard to become an out-of-the-box observability solution? Things like Netdata, HyperDX, Coroot, etc. are already platforms for all telemetry signals, with fancy UIs and a lot of presets. Why don't people use them instead of Datadog?

hagen1778 commented on Datadog's $65M/year customer mystery solved   blog.pragmaticengineer.co... · Posted by u/thunderbong
secondcoming · 6 months ago
We are moving from Datadog to Prometheus/Grafana and it's really not all a bed of roses. You'll need monitoring on your monitoring.
hagen1778 · 5 months ago
Ofc you need to monitor your monitoring, because you run it. Datadog runs their own systems and monitors them; that's why they charge you so much. I can barely imagine a critical piece of software that I'd run and not monitor at the same time.
hagen1778 commented on Netdata vs. Prometheus: A 2025 Performance Analysis   netdata.cloud/blog/netdat... · Posted by u/Aliki92
hagen1778 · 10 months ago
> Our tests revealed that Prometheus v3.1 requires 500 GiB of RAM to handle this workload, despite claims from its developers that memory efficiency has improved in v3.

AFAIK, starting from v3, Prometheus has `auto-gomemlimit` enabled by default. It means "Prometheus v3 will automatically set GOMEMLIMIT to match the Linux container memory limit", which effectively postpones garbage collection until the process approaches the specified limit. This is why, I think, Prometheus shows the increased, flatlined memory usage in the article.
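
For context, GOMEMLIMIT is Go's soft memory limit (runtime/debug.SetMemoryLimit): the garbage collector stays lazy until the heap approaches it, so memory usage tends to flatline near the limit. A rough sketch of what such an auto-gomemlimit mechanism does, assuming cgroup v2 (this is an illustration, not the actual Prometheus code):

```go
// Rough sketch of an auto-GOMEMLIMIT mechanism: read the container memory
// limit from cgroup v2 and set Go's soft memory limit slightly below it.
// Illustration only, not the actual Prometheus implementation.
package main

import (
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

func applyAutoMemLimit() {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max") // "max" means no limit
	if err != nil {
		return
	}
	limit, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return // no numeric limit set, leave GOMEMLIMIT at its default
	}
	// Use ~90% of the container limit as the soft limit: the GC stays lazy
	// until the heap approaches this value, then collects aggressively.
	debug.SetMemoryLimit(limit / 10 * 9)
}

func main() {
	applyAutoMemLimit()
	// ... start the application ...
}
```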

> Query the average over the last 2 hours, of system.ram of all nodes, grouped by dimension (4 dimensions in total), providing 120 (per-minute) points over time.

The query used for Prometheus here is an instant query: `query=avg_over_time(netdata_system_ram{dimension=~".+"}[2h:60s])`. This is a rather weird subquery that probably was never used by any real Prometheus user. Effectively, it instructs Prometheus to evaluate `netdata_system_ram{dimension=~".+"}` 120 times across the 2h interval (`2h/60s=120`), reading `120 * 7200 * 4k series = 3.5Bil` data samples. Normally, Prometheus users don't do this. They would rather run a /query_range query `avg_over_time(netdata_system_ram{dimension=~".+"}[5m])` with step=5m over a 2h time interval, reading `7200*4k=29Mil` samples.

Another weird thing about this query is that in response Prometheus will send 4k time series, with all their labels, in JSON format. I wonder how much of the 1.8s was spent just transferring data over the network.
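
To illustrate, here's roughly how the /query_range alternative could be issued via the official Go API client (the endpoint address is made up; this mirrors the suggested query, not the article's benchmark setup):

```go
// Sketch: issue the range query via the Prometheus Go API client instead of
// the 120-step instant subquery. Endpoint and error handling are placeholders.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	end := time.Now()
	r := v1.Range{
		Start: end.Add(-2 * time.Hour), // 2h window
		End:   end,
		Step:  5 * time.Minute, // step=5m
	}
	// /query_range: one pass over the raw samples for the whole window.
	result, warnings, err := promAPI.QueryRange(context.Background(),
		`avg_over_time(netdata_system_ram{dimension=~".+"}[5m])`, r)
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```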

hagen1778 commented on Prometheus 3.0   prometheus.io/blog/2024/1... · Posted by u/dmazin
never_inline · a year ago
I am curious to hear from people on this forum: at what point do you practically cross the limits of Prometheus, so that straightforward division (e.g. a separate Prometheus per cluster and environment) no longer works?
hagen1778 · a year ago
It usually comes with an increase in active series and churn rate. Of course, you can scale Prometheus horizontally by adding more replicas and by sharding scrape targets. But at some point you'd like to achieve the following:

1. Global query view. The ability to get metrics from all Prometheis with one request, or simply not having to think about which Prometheus has the data you're looking for.

2. Resource usage management. No matter how hard you try, scrape targets can't be sharded perfectly, so you'll end up with some Prometheis using more resources than others. This could backfire in weird ways in the future, reducing the stability of the whole system.

hagen1778 commented on Prometheus 3.0   prometheus.io/blog/2024/1... · Posted by u/dmazin
oulipo · a year ago
Hmmm, but the documentation seems poorly written. What team is behind it?
hagen1778 · a year ago
What makes you think that about the docs? Of course, they were written by developers, not tech writers. But anyway, what do you think could be improved?
hagen1778 commented on Build a serverless ACID database with this one neat trick (atomic PutIfAbsent)   notes.eatonphil.com/2024-... · Posted by u/todsacerdoti
yencabulator · a year ago
The filesystem's supposedly-atomic file creation is buggy. The putIfAbsent implementation will leave partial commit files visible to others (forever, on errors), making other users see corrupt transactions. For filesystems, the correct answer is to write to a tempfile, sync that, then use link(2) to claim a transaction id.
hagen1778 · a year ago
We use the same approach in the time series database I'm working on. While file creation and fsync aren't atomic, the rename [1] syscall is. So we create a temporary file, write the data, call fsync, and if all is good, rename it atomically so it becomes visible to other users. I gave a talk about this [2] a few months ago; there's a rough sketch of the pattern below.

[1] https://man7.org/linux/man-pages/man2/rename.2.html

[2] https://www.youtube.com/watch?v=1gkfmzTdPPI
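
A minimal Go sketch of that commit pattern (generic code, not the actual VictoriaMetrics implementation; for full crash safety you would typically also fsync the parent directory after the rename, omitted here):

```go
// Minimal sketch of the tempfile + fsync + rename commit pattern.
package main

import (
	"os"
	"path/filepath"
)

// atomicWriteFile writes data to path so that readers either see the old
// content or the complete new content, never a partially written file.
func atomicWriteFile(path string, data []byte) error {
	dir := filepath.Dir(path)

	// 1. Write to a temporary file in the same directory
	//    (rename must not cross filesystems).
	tmp, err := os.CreateTemp(dir, ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // clean up if we fail before the rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// 2. fsync the data so it is durable before the file becomes visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// 3. Atomically publish the file under its final name.
	return os.Rename(tmp.Name(), path)
}

func main() {
	if err := atomicWriteFile("/tmp/example.dat", []byte("hello")); err != nil {
		panic(err)
	}
}
```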

u/hagen1778

Karma: 51 · Cake day: February 9, 2017
About
Works at VictoriaMetrics

Github: https://github.com/hagen1778
