I'm also seriously considering dropping Grafana for good for the same reasons stated in the post. Every year I need to rebuild a dashboard, reconfigure alerts, use the shiny new toy, etc etc. I'm tired.
I just want the thing to alert me when something's down, and ideally if the check doesn't change and the datasource and metric don't change, the dashboard definition and the alert definition should be the same for the last and the next 10 years.
The UI used to have the 4-5 most important links in the sidebar; now it's 10 menus with submenus of submenus, and I never know where to find the basics: Dashboards and Alerts. When something goes off, I don't have time to re-learn a UI I look at maybe once a month.
Building an elaborate pile of technical debt is a great way to have an elaborate pile of technical debt, and with service lifespans of 2-3 years it gets painful: compose a stack out of enough products and every quarter you'll need to replace something big.
There are products that fight against software bloat and bells and whistles, and against breaking backwards compatibility with every new release. They make user experience and stability the top priority - https://docs.victoriametrics.com/victoriametrics/goals/
FTA > "I know for a fact that that pace is partially driven by career-driven development."
This isn't a Grafana problem, it's an industry-wide problem. Resume-driven product design, resume-driven engineering, resume-driven marketing. Do your 2-3 years, pump out something big to inflate your resume, apply elsewhere to get the pay bump that almost no company is handing out. After the departures there's no one left who knows the system, and the next people in want to replace the things they don't understand to pad their resumes for the next job.
Wash, rinse, repeat.
Loyalty simply goes unrewarded in a lot of places in our industry (and at many corporations). And the people who do stay... in many cases they turn into furniture that ends up holding potentially good evolution back. They lose out to the technological magpies who bring shiny things to management because it will "move the needle".
Sadly this is just one facet of the problems we're facing; from how we interview to how we run (or rent) our infrastructure, things have gotten rather silly...
Without any stability, you really can't blame the player for playing this game.
The days when you could devote your career to a firm and retire with a pension are long gone.
The author of this article wants a boring tech stack that just works, and honestly after everything we’ve been through in the last five years, I kinda want a boring job I can keep until I retire, too
Mimir is just architected for a totally different order of magnitude of metrics. At that scale, yeah, Kafka is actually necessary. There are no other open-source solutions offering the same scalability, period.
That's beside the point that most customers will never need that level of scale. If you're not running Mimir on a dedicated Kubernetes cluster (or at least a cluster dedicated to Grafana/observability), it's probably over-engineered for your use case. Just use Prometheus.
Have a look at VictoriaMetrics - I've run it at relatively high scale with much more success than any other metrics store. It's one of those things that just work. It's extremely easy to run in single-instance mode and handles much more than you'd expect. Scaling it is a breeze too.
(I'm not affiliated, but a very happy user across multiple orgs and personal projects)
The project where I looked at Mimir was a 500+ million timeseries project, with the desire to support scaling to the ten-figure level of timeseries (working for a BigCo supporting hundreds of product development teams).
All of these systems that store metrics in object storage - you have to remember that object storage is not file storage. Generally speaking (stuff like S3 Express One Zone being a relatively recent exception), you cannot append to objects. Metrics queries are resolved by querying historical metrics in object storage plus a stateful service hosting the latest ~2 hours of data, before that data can be compressed and uploaded to object storage as a single block.

At a certain scale, you simply need to choose which is more important: being able to answer queries or being able to ingest more timeseries. If you don't prioritize ingestion, the backlog just gets bigger and bigger, and in the eventual case of a sudden flood of metrics (Murphy's Law guarantees it) you end up with ingestion delays of several hours, during which you are blind. If you do prioritize ingestion, the component simply won't respond to queries, which makes you blind anyway. Lose-lose.
Mimir added Kafka because it's quite literally necessary at that scale. You need the stateful query component (with the latest 2 hours) to prioritize queries, then pull from the Kafka topic on a lower-priority thread when there's spare time to do so. Kafka soaks up the sudden ingestion floods so that they don't result in the stateful query component getting DoS'd.
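To make that concrete, here's a toy Go sketch of the pattern - a buffered channel standing in for Kafka, with reads given strict priority over draining the write backlog. Illustrative only, not Mimir's actual code:

```go
// Toy illustration of "buffer writes, prioritize reads". A buffered
// channel stands in for Kafka; a real system uses a durable log.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	series string
	value  float64
	ts     time.Time
}

func main() {
	ingest := make(chan sample, 100000) // the "Kafka" buffer: soaks up floods
	query := make(chan string)
	results := make(chan int)

	// Stateful component holding the recent head block (the "latest 2 hours").
	go func() {
		head := map[string][]sample{}
		for {
			// Serve a waiting query first, if there is one...
			select {
			case name := <-query:
				results <- len(head[name])
				continue
			default:
			}
			// ...otherwise drain one buffered sample, or pick up a query,
			// whichever becomes ready next.
			select {
			case name := <-query:
				results <- len(head[name])
			case s := <-ingest:
				head[s.series] = append(head[s.series], s)
			}
		}
	}()

	// A sudden ingestion flood...
	for i := 0; i < 50000; i++ {
		ingest <- sample{series: "up", value: 1, ts: time.Now()}
	}
	// ...doesn't block the read path: the query is answered promptly,
	// possibly before the backlog is fully drained.
	query <- "up"
	fmt.Println("samples visible for 'up':", <-results)
}
```

The trade-off the sketch makes explicit: the query returns promptly even mid-flood, at the cost of the ingest backlog waiting its turn.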
I took a quick look at VictoriaMetrics - no Kafka or Kafka-like component to soak up ingestion floods? DOA.
Again, most companies are not BigCos. If you're a startup/scaleup with one VP supervising several development teams, you likely don't need that scale, probably VictoriaMetrics is just fine, you're not the first person I've heard recommend it. But I would say 80% of companies are small enough to be served with a simple Prometheus or Thanos Query over HA Prometheus setup, 17% of companies will get a lot of value out of Victoria Metrics, the last 3% really need Mimir's scalability.
What's your preferred solution for observability and monitoring of tiny apps?
I'm looking for something with really compact storage, really simple deployment (preferably a single statically linked binary that does everything), and compatible with OpenTelemetry (including metrics and distributed tracing). If/when I outgrow it, I can switch to another OpenTelemetry provider (but realistically this will not happen)
I'm personally not convinced OpenTelemetry is the future. I get the desire to not be vendor-locked to a single provider, but Prometheus and Jaeger are very solid, battle-hardened, popular, well-maintained, easily self-hosted open-source projects. For small deployments you do not need to overthink things here - Grafana, Prometheus, Jaeger (with local disk storage), logging depends on how many machines you're talking about and where they're hosted (e.g. GCP Cloud Logging is fine for GCP-hosted projects, the 50 GB free tier is a lot for a small project) but as a default Loki is also just fine and much better than Elastic/OpenSearch.
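And the boring path really is small - instrumenting a Go service with the Prometheus client is a handful of lines (the metric name and port here are made up):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counter registered with the default registry via promauto.
var requests = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "myapp_http_requests_total",
	Help: "HTTP requests served, by path.",
}, []string{"path"})

func main() {
	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		requests.WithLabelValues("/hello").Inc()
		w.Write([]byte("hello\n"))
	})
	// Prometheus scrapes this endpoint directly - no collector pipeline.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```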
OpenTelemetry is, last I looked at it, way too immature, unstable, and resource-hungry to be such a foundational part of infrastructure.
AMP is an internal proprietary fork of Cortex; they're not upstreaming their changes, in large part due to the scalability limits of Cortex's design. It has the same scalability limitation I described earlier: the lack of a Kafka-like component to soak up ingest floods.
> Multi-petabyte
Sheer storage size is a meaningless point here, as longer retention simply requires more storage. There may or may not be compaction components that help speed up queries over larger windows, but that's beside the point that the queries will still succeed. I have no doubt that any of the solutions on the table can store that much data.
The real scaling question is how many active timeseries the system can handle, at which resolution (per 15 seconds? per 60 seconds? worse?), and no, "we scale horizontally" doesn't mean much without more serious benchmarks.
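To put rough numbers on it: 500 million active series at 15-second resolution is a sustained ingest of about 33 million samples per second (at 60-second resolution, still ~8 million/s). That's the figure a benchmark has to speak to, regardless of how many petabytes pile up behind it.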
What's the most promising alternative to Prometheus/Grafana if you're developing a new solution around OTEL? If you could start today and pick tools, what would you go for?
We also started with the typical kube-prometheus-stack, but we don't like Prometheus/PromQL. Moreover, it only solves the "metrics" part - to handle logs and traces, several more heavy and complex components have to be added to the observability stack.
This didn't feel right, so we looked around and found GreptimeDB https://github.com/GreptimeTeam/greptimedb, which simplifies the whole stack. It's designed to handle metrics, logs, and traces. We collect metrics and logs via OpenTelemetry and visualize them with Grafana. It provides endpoints for Postgres, MySQL, and PromQL; we're happy to be able to build dashboards using SQL, as that's where we have the most knowledge.
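As a taste of the SQL angle: anything that speaks the Postgres wire protocol can query it, e.g. plain database/sql in Go. The table and columns below are made up, and port 4003 is what we believe is GreptimeDB's default Postgres endpoint - verify against your deployment:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // standard Postgres driver; GreptimeDB speaks the PG wire protocol
)

func main() {
	// 4003 should be GreptimeDB's default Postgres port, "public" its default database.
	db, err := sql.Open("postgres",
		"host=127.0.0.1 port=4003 dbname=public sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical table written by the OTel pipeline.
	rows, err := db.Query(`
		SELECT host, avg(cpu_usage)
		FROM host_metrics
		WHERE ts > now() - INTERVAL '5 minutes'
		GROUP BY host`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var host string
		var avgCPU float64
		if err := rows.Scan(&host, &avgCPU); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %.1f%%\n", host, avgCPU)
	}
}
```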
The benchmarks look promising, but our k8s clusters aren't huge anyway. As platform engineers, we appreciate the simplicity of our observability stack.
Any other happy greptimedb users around here? Together with OTel, we think we can handle all future obs needs.
Thank you for giving GreptimeDB a shout-out—it means a lot to us. We created GreptimeDB to simplify the observability data stack with an all-in-one database, and we’re glad to hear it’s been helpful.
OpenTelemetry-native is a requirement, not an option, for the new observability data stack. I believe otel-arrow (https://github.com/open-telemetry/otel-arrow) has strong future potential, and we are committed to supporting and improving it.
FYI: I think SQL is great for building everything—dashboards, alerting rules, and complex analytics—but PromQL still has unique value in the Prometheus ecosystem. To be transparent, GreptimeDB still has some performance issues with PromQL, which we’ll address before the 1.0 GA.
Check out OpenObserve https://github.com/openobserve/openobserve. It was built precisely to solve the challenges around Grafana and Elastic. It's not a stack you have to weave together - a single binary/container suffices for most users' needs: logs, metrics, traces, dashboards, alerts.
I've been happy with OpenObserve for personal use. It's a minimal deployment, so I'm not sure how well it scales, but I really like that it's self-contained and easy to deploy and manage. Not having to think about integrating a half-dozen different tools is great. I just set up otel-collector on each node and point it at the OpenObserve host. Easy.
I've been doing monitoring since before it was called observability with good old Nagios, and the modern observability stack is insane. I'm glad that tools like OpenObserve and SigNoz exist.
After using Grafana community/cloud for a number of years, my new gig is currently moving everything to SigNoz. Mostly slick, under active development, communicative team, open source... what's not to love?
I don't know why software developers feel the urge to stay on the bleeding edge of every product and update their setup every week, then turn around and complain "this stuff isn't stable!"
I've had a grafana + prometheus setup on my servers since like 2017. It worked then and works today. I log in maybe once every year or two to update to a newer LTS version. Every dashboard is still pristine, and nothing has ever broken.
I don't understand most of the words in the linked post and don't need to. The core package is the boring solution that 99% of people here need, and that works great.
Not sure what the alternative to Grafana is in the open source world for building o11y dashboards - I'm not aware of one, and Grafana is used very extensively at my company...
I mentioned it in another reply, but https://perses.dev/ is probably the most promising alternative.
Besides that, if you're feeling masochistic you could use Prometheus' console templates or VictoriaMetrics' built-in dashboards.
Though these are all obviously nowhere near as feature-rich and capable as Grafana, and they can only display metrics for the single Prom/VM node they run on. Might be enough for some users.
How come I've never heard of Perses yet? A really open-source, standardising Grafana clone to go alongside Prometheus for self-hosted deployments sounds just perfect!
From a cursory reading of the article, I don't see that the author's problems are with Grafana in its best use case (metrics); they're with other products from the Grafana company, for which there are plenty of alternatives.
Grafana dashboards itself (paired with VictoriaMetrics and occasionally Clickhouse) is one of the most pleasant web apps IMO. Especially when you don’t try to push the constraints of its display model, which are sometimes annoying but understandable.
I remember that alternative free/FOSS products existed before Grafana (c. 2015), but many died; Grafana was everywhere. Now I can't find the old alternatives either. Vague memories of RRD and Nagios...
Munin was what we used for a while, along with a smattering of smokeping.
We're using a combination of Zabbix (alerting) and local Grafana/Prometheus/Loki (observability) at this point, but I've been worried about when Grafana will rug-pull for a while now. Hopefully enough people using their cloud offering sates their appetite and they leave the people running locally alone.
https://github.com/opensearch-project/OpenSearch-Dashboards (Kibana fork) is one. But Grafana is still way better if you just stay away from anything that isn't the core product: data visualization and exploration (explorer and traces).
I use SigNoz for my private purposes. It's not a 100% match, but you can do Prometheus metrics, log analysis, dashboards, alerts, and OTEL spans, so depending on your use case it can be enough.
We're OpenTelemetry-native, and apart from many out-of-the-box charts for APM, infra monitoring, and logs, you can also build customized dashboards with lots of visualization options.
> "But I got it all working; now I can finally stop explaining to my boss why we need to re-structure the monitoring stack every year."
Prometheus and Grafana have been progressing in their own ways, each trying to become a full-stack solution - and then the OTEL thingy came and ruined the party for everyone.
I think OTEL has made things worse for metrics. Prometheus was so simple and clean before the long journey toward OTEL support began. Now Prometheus is much more complicated:
- all the delta-vs-cumulative counter confusion
- push support for Prometheus, and the resulting out-of-order errors
- the {"metric_name"} syntax changes in PromQL
- resource attributes and the new info() function needed to join them
I just don't see how any of these OTEL requirements make my day-to-day monitoring tasks easier. Everything has only become more complicated.
And I haven't even mentioned the cognitive and resource cost everyone pays just to ship metrics in the OTEL format - see https://promlabs.com/blog/2025/07/17/why-i-recommend-native-...
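A toy illustration of the first list item - the same three observation intervals encoded both ways (not any SDK's real code):

```go
// Same events, two encodings: OTEL-style deltas vs. Prometheus-style
// cumulative totals.
package main

import "fmt"

func main() {
	deltas := []float64{5, 4, 7} // increments observed per interval

	cumulative := 0.0
	for i, d := range deltas {
		cumulative += d
		fmt.Printf("interval %d: delta=%v cumulative=%v\n", i, d, cumulative)
	}
	// Delta stream:      5, 4, 7  - drop or reorder one point and the
	// reconstructed total is silently wrong.
	// Cumulative stream: 5, 9, 16 - a missed scrape is recovered at the
	// next one, and rate() still computes the increase.
}
```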
I still haven't got my head around how OTEL fits into a good open-source monitoring stack. Afaik, it is a protocol for metrics, traces, and logs. And we want our open-source monitoring services/dbs to support it, so they become pluggable. But, afaik, there's no one good DB for logs and metrics, so most of us use Prometheus for metrics and OpenSearch for logs.
Does OTEL mean we just need to replace all our collectors (like logstash for logs and all the native metrics collectors and pushgateway crap) and then reconfigure Prometheus and OpenSearch?
Logs, spans, and metrics are stored as time-stamped stuff. Sure, simple fixed-width columnar storage is faster, and it makes sense to special-case numbers (add downsampling, aggregations, histogram maintenance, and whatnot), but any write-optimized storage engine can handle this; it's not the hard part (basically LevelDB, and if there's a need for scaling out it'll look like Cassandra, Aerospike, ScyllaDB, or ClickHouse... see also https://docs.greptime.com/user-guide/concepts/data-model/ and specialized storage engines https://docs.greptime.com/reference/about-greptimedb-engines...)
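For illustration, the write-optimized idea in miniature - a toy LSM-style memtable, nothing engine-specific:

```go
package main

import (
	"fmt"
	"sort"
)

// A toy memtable: writes land in memory; when the buffer fills, it's
// flushed as one immutable, sorted run (a file on disk in a real engine).
type memtable struct {
	buf   map[string][]float64
	count int
	limit int
}

func newMemtable(limit int) *memtable {
	return &memtable{buf: map[string][]float64{}, limit: limit}
}

func (m *memtable) insert(series string, v float64) {
	m.buf[series] = append(m.buf[series], v)
	m.count++
	if m.count >= m.limit {
		m.flush()
	}
}

// flush emits series in sorted order - sequential writes are why
// LSM-style engines handle ingest-heavy telemetry so well.
func (m *memtable) flush() {
	keys := make([]string, 0, len(m.buf))
	for k := range m.buf {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("run: %s -> %v\n", k, m.buf[k])
	}
	m.buf = map[string][]float64{}
	m.count = 0
}

func main() {
	mt := newMemtable(4)
	mt.insert("cpu", 0.3)
	mt.insert("mem", 0.7)
	mt.insert("cpu", 0.4)
	mt.insert("mem", 0.8) // fourth write hits the limit and flushes a run
}
```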
I think the answer is that it doesn't fit any definition of a _good_ monitoring stack, but we are stuck with it. It has largely become the blessed protocol, specification, and standard for OSS monitoring, along every axis (logging, tracing, collecting, instrumentation, etc)... it's a bit like the efforts that resulted in J2EE and EJBs back in the day, only more diffuse and with more varied implementations.
And we don't really have a simpler alternative in sight... at least in the Java days there was the disgust and reaction via Struts, Spring, EJB3+, and of course other languages and communities.
Not sure how exactly we got into such an over-engineered monoculture in operations, monitoring, and deployment for 80%+ of the industry (k8s + graf/loki/tempo + endless supporting tools or flavors), but it is really a sad state.
Then you have endless implementations handling bits and pieces of various parts of the spec, and of course you have the tools to actually ingest and analyze and report on them.
Isn't OpenTelemetry very slow?
If Mimir is the only one, why isn't Roblox - a Grafana Labs customer - using Mimir for monitoring? They run VictoriaMetrics at a scale of roughly 5 billion active time series. See https://docs.victoriametrics.com/victoriametrics/casestudies....
No solution is perfect; each one has its own trade-offs. That's why it triggers me when I see statements like this one.
I love it when people take a hard stand like this, using the word "period".
BTW, Cortex is used as Amazon Managed Prometheus by AWS, probably at a much larger scale than Mimir.
OpenObserve, too, is already being used at multi-petabyte scale.
Disclosure: I am a maintainer of OpenObserve
Open source and OpenTelemetry-native. Lots of our users have migrated from Grafana to overcome challenges like having to handle multiple backends.
P.S. - I am one of the maintainers.
How did you handle the Angular deprecation in Grafana? Or are you just staying on an older version that still supports it?
Though I won't say I loved doing it.
Disclaimer: I am affiliated with them.
I'm out of that game now though so don't have the challenge.
https://www.centreon.com/
The kicker for me recently was hearing someone say "ally" out loud for o11y.