For example, we were running a 20-node k8s cluster for our Cortex (distributed Prometheus) install, monitoring about 30k servers around the world, and it was generating a bit over a TB of data a day. It was a lot more cost-effective and performant to create a MinIO cluster for that data than to use S3.
Also, you can get durability with MinIO through multi-cluster replication.
> career-driven development
we don't have this. we promote and reward as frequently for "I've done solid operations" as we do for "I've added this feature" (I'm on promotion committees and can state this confidently).
what we do have is high autonomy for engineers. Engineers are free to identify problems they feel are important and to work on them; they do not need permission, and leadership does not veto this. Some of the best features in the last few years have been a direct result of this autonomy, and it's one of the things that makes working here so attractive to many of the engineers. But with autonomy comes a little chaos, and not everything that is done is going to satisfy every end user of OSS or every paid customer (paid customers being a small percent of the whole).
a lot of the innovation speed is just in the DNA of the company. even the creation of Grafana can be traced to a desire to get things done: Torkel wanted Kibana to also work for Prometheus, Kibana declined to add this, and Torkel didn't stand still. he added things to a fork of Kibana, now called Grafana, and hasn't stopped adding things since.
> They also deprecated Angular within Grafana and switched to React for dashboards. This broke most existing dashboards.
we did. I think the entire journey was 7 years long, communicated many times, over at least 6 major releases. maintaining dashboards in two frameworks increased complexity, reduced compatibility, and gave us a very large security surface to worry about. we communicated clearly, provided migration tools, put it in release notes, updated docs, and repeated it at conferences and on community calls.
arguably we went too slow and should've ripped the band-aid off, but we were sensitive to the fact that it was a breaking change, so we proceeded with extreme caution. it's done now; it was finally completed in the last version, and thanks to the time and care taken, only a very small number of users reported any impact.
> I just hope OTEL settles, gets stable and boring fast
this is distinct from Grafana, but it's a good point... OTel is the product of virtually every vendor at this point, and a hell of a lot of engineers, it now has a lot of momentum and the pace is unlikely to ease up due to the sheer number of contributions and things that OTel as a community wishes to achieve.
the most likely eventuality is that enough stability emerges to allow vendors (including but not limited to Grafana Labs) to abstract away the pace of innovation occurring underneath, but this is in tension with providing the benefits of the innovation to the people that use it.
what I would say is that for most people the boring and slow path does still exist, and it's still good... just use Prometheus, a logging option of your choice, and simple Grafana dashboards and alerts. that combination hasn't varied in years, and those on it today don't have to care about the pace of innovation and change in OTel and across the Observability industry. OTel is being used in production at massive scale by lots of companies, but whether your project or company needs to move to it now depends on your priorities. many are adopting it to gain independence from vendors, or just control over their telemetry, but many customers are also saying they're happy to stay on the slow and boring path, with everything working predictably and a low cost to keep pace... it works too.
If this migration turned out to be so painful, why did you decide to finish it (and make users unhappy) instead of cancelling it at an early stage? What are the benefits of this migration?
This is the worst reason to migrate to the OTEL format for metrics, since every vendor and every metrics solution has its own set of transformation rules for ingested OTEL metrics before saving them into internal storage (this is needed in order to align OTEL metrics with the internal data model, which is unique to each vendor / service). These transformation rules are incompatible among vendors and services. Also, every vendor / service may have its own querying API. This means that users cannot easily migrate from one vendor / service to another just by switching from the old format to the OTEL format for metric transfer. Read more about this at https://x.com/valyala/status/1982079042355343400
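To make that concrete, here's a minimal sketch of how the same OTEL metric can end up under different stored names. Both functions are hypothetical illustrations, not any vendor's actual rules: `prometheus_style` is a simplified take on the Prometheus-style convention (dots become underscores, unit and `_total` suffixes get appended), and `keep_dots_style` is an imagined backend that stores the OTEL name unchanged.

```python
import re

# Simplified, assumed unit mapping; real translators cover many more units.
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def prometheus_style(name: str, unit: str, monotonic_sum: bool) -> str:
    """Hypothetical Prometheus-like renaming of an OTEL metric."""
    # Replace every character that is not valid in this naming scheme.
    out = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    suffix = UNIT_SUFFIXES.get(unit, "")
    if suffix and not out.endswith("_" + suffix):
        out += "_" + suffix
    # Monotonic sums (counters) get a _total suffix.
    if monotonic_sum and not out.endswith("_total"):
        out += "_total"
    return out


def keep_dots_style(name: str, unit: str, monotonic_sum: bool) -> str:
    """Hypothetical backend that stores the OTEL name as-is."""
    return name


if __name__ == "__main__":
    metric = ("system.network.io", "By", True)
    print(prometheus_style(*metric))  # name a query must use on backend A
    print(keep_dots_style(*metric))   # name a query must use on backend B
```

The sender emits the same OTEL metric in both cases, but dashboards and alerts written against one stored name won't match the other, and that's before label handling and query language differences come into it.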