> Our tests revealed that Prometheus v3.1 requires 500 GiB of RAM to handle this workload, despite claims from its developers that memory efficiency has improved in v3.
AFAIK, starting from v3 Prometheus enables `auto-gomemlimit` by default. It means "Prometheus v3 will automatically set GOMEMLIMIT to match the Linux container memory limit", which effectively defers garbage collection until the process approaches that limit. This is why, I think, Prometheus shows the increased, flatlined memory usage in the article.
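For context, here is a minimal sketch (not Prometheus's actual code) of what such an auto-GOMEMLIMIT mechanism boils down to on cgroup v2: read the container memory limit and set the Go runtime's soft memory limit to a fraction of it. The `/sys/fs/cgroup/memory.max` path is the standard cgroup v2 location, and the 0.9 headroom ratio is purely illustrative.

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// setGoMemLimitFromCgroup reads the cgroup v2 memory limit and sets the Go
// runtime's soft memory limit (GOMEMLIMIT) to ratio * limit, so the GC only
// becomes aggressive as the heap approaches that boundary.
func setGoMemLimitFromCgroup(ratio float64) {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // not in a cgroup v2 container; keep the default (no limit)
	}
	s := strings.TrimSpace(string(raw))
	if s == "max" {
		return // no container memory limit configured
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return
	}
	debug.SetMemoryLimit(int64(float64(limit) * ratio))
}

func main() {
	setGoMemLimitFromCgroup(0.9) // illustrative headroom ratio
}
```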
> Query the average over the last 2 hours, of system.ram of all nodes, grouped by dimension (4 dimensions in total), providing 120 (per-minute) points over time.
The query used for Prometheus here is an instant query, `query=avg_over_time(netdata_system_ram{dimension=~".+"}[2h:60s])`. This is a rather odd subquery that probably no Prometheus user has ever run in practice. Effectively, it instructs Prometheus to evaluate `netdata_system_ram{dimension=~".+"}` once per 60s step across the 2h window (`2h/60s = 120` evaluations), reading `120 * 7200 * 4k series = 3.5Bil` data samples.
Normally, Prometheus users don't do this. They would instead run a /query_range query, `avg_over_time(netdata_system_ram{dimension=~".+"}[5m])` with step=5m over the 2h interval, reading `7200 * 4k = 29Mil` samples, as sketched below.
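To make the difference concrete, here is a rough sketch of the two request shapes against the standard Prometheus HTTP API (`/api/v1/query` vs `/api/v1/query_range`); the `localhost:9090` address is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

const base = "http://localhost:9090" // placeholder Prometheus address

// instantSubquery is the article's variant: an instant query whose subquery
// re-evaluates the selector every 60s across the whole 2h window.
func instantSubquery() (*http.Response, error) {
	q := url.Values{}
	q.Set("query", `avg_over_time(netdata_system_ram{dimension=~".+"}[2h:60s])`)
	return http.Get(base + "/api/v1/query?" + q.Encode())
}

// rangeQuery is the idiomatic variant: a range query that asks the engine
// for one point per step over the same 2h window.
func rangeQuery() (*http.Response, error) {
	end := time.Now()
	start := end.Add(-2 * time.Hour)
	q := url.Values{}
	q.Set("query", `avg_over_time(netdata_system_ram{dimension=~".+"}[5m])`)
	q.Set("start", fmt.Sprintf("%d", start.Unix()))
	q.Set("end", fmt.Sprintf("%d", end.Unix()))
	q.Set("step", "5m")
	return http.Get(base + "/api/v1/query_range?" + q.Encode())
}

func main() {
	resp, err := rangeQuery()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```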
Another odd thing about this query is that in response Prometheus will send 4k time series, with all their labels, in JSON format. I wonder how much of the 1.8s was spent just transferring that data over the network.
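If anyone wants to check, a quick and rough way to bound that is to time the request and count the response bytes. Note this measures the whole round trip, query evaluation included, so it only gives an upper bound on the transfer cost; the address is again a placeholder.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// Measure how large (and how slow to stream) the JSON response for ~4k
// labelled series is for the article's query.
func main() {
	q := url.Values{}
	q.Set("query", `avg_over_time(netdata_system_ram{dimension=~".+"}[2h:60s])`)

	start := time.Now()
	resp, err := http.Get("http://localhost:9090/api/v1/query?" + q.Encode())
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	// io.Copy into io.Discard counts the payload without keeping it in memory.
	n, _ := io.Copy(io.Discard, resp.Body)
	fmt.Printf("status=%s bytes=%d total=%s\n", resp.Status, n, time.Since(start))
}
```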