I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file. You can of course do this using kubectl but only for the most recent two instances of a given pod which isn't helpful when investigating an incident that happened a while ago.
It seems nobody else cares about this use case and wants you to use LogQL and the incredibly clunky Grafana web UI instead, because it makes it possible to aggregate across many different processes, slice and dice by various labels, etc., which as I said, I have never (or almost never) actually wanted to do.
Hopefully this new UI is a step in the right direction as people won't need to futz around with LogQL anymore, but it seems like it still doesn't quite do what I want.
Just want to chip in and say that I wholeheartedly agree with you. I'm not a cloud developer either, but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs. Compared to working with Linux journals using the tried and true text manipulation tools, it feels like looking through a straw with oven mitts on. I'd happily download a 500 MB text file instead, but there is an arbitrary limit to how much I can grab (10k lines IIRC). Maybe we're just out of touch.
> but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs
Is this google cloud logging? If so, personally I quite like it, especially for looking through logs from multiple sources at the same time. Being able to put all your logs through there, and then search them with a simple query language, feels very convenient.
Fwiw this is how I use Loki most of the time. Pick an app label, pick a time period, look at raw logs. The LogQL for this ends up something like `{app="workload-foo"}`. Loki is excellent at that.
Then if I know which pod I'll filter down to a specific pod with `{pod="workload-foo-1234"}`, sometimes I'll search for a specific term (error message etc) with `{pod="workload-foo-1234"} |= "error message"` then look at the logs around that. There's really no point writing complicated queries unless you need to.
That will, if I understand correctly, get the logs for one pod, not for one process. For example if the pod restarted 10 times you will not get 10 separate files from that query.
I'm an old fart so I use things like "cat" and "grep", and maybe "sed" and "cut" if the lines are particularly long.
I have one log file per day per host on my syslog server and can use "sort" to order across multiple files.
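That merge can also be scripted. A minimal sketch (the file layout and timestamp format are my assumptions), equivalent to running `sort -m` over already-sorted per-host files:

```python
# Sketch: merge per-host daily syslog files into one time-ordered stream.
# Assumes each file is already sorted and every line starts with a
# sortable timestamp (e.g. ISO 8601), so plain string comparison
# matches chronological order -- like `sort -m` on the command line.
import heapq

def merged_lines(paths):
    """Yield lines from several sorted log files in timestamp order."""
    files = [open(p) for p in paths]
    try:
        # heapq.merge lazily merges already-sorted inputs without
        # loading whole files into memory.
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
```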
Loki was sold to me at fosdem a couple of years ago as this, but I still haven't got round to working it out, seems a very high barrier to entry compared with running cat.
> seems a very high barrier to entry compared with running cat
It really isn't. It's a single binary with a relatively simple configuration file, you throw logs at it via an API (which a bunch of logging agents support, and syslogs can be sent to it).
Then the actual queries aren't all that complex, it's just a difference of cd-ing to the correct folder for the date/server to be able to cat and grep vs writing a query that selects by server name and filters by date.
The learning curve and maintenance of Loki are quite minimal, but the value add is quite significant in most cases. Being able to do cross-host queries, metrics from logs (how many times did error X occur in the logs), as well as easy visualisations is pretty useful.
You may be amazed at how hard these tools are to get started with relative to that. I have been thoroughly unimpressed with and unable to really get started with any of these tools because of the overemphasis on cloud. Not sure what people were doing before, but sshing to the prod box kinda sucks.
If you’re debugging something simple or non-distributed, this product isn’t for you.
If you’re working on anything distributed, log aggregation becomes a must. But, also, if you’re working on anything distributed and you’re looking at logs, you’re desperate. Distributed traces are so much higher quality.
When I formed these opinions I was working on Materialize, which is basically the polar opposite of "simple and non-distributed". However it was still quite common that I knew exactly which process was doing something weird and unexpected.
Yup and the reason no one markets something like "tail the logs for server X" is because, if you're talking in the context of an individual server, you're too small for anyone to care about.
Sorry, did plenty of "distributed" tracing back in the day and this is just not the case. I can't help but feel like you're after-the-fact rationalizing as if you need this for diagnosing anything "distributed" or "complicated".
Distributed anything is actually easier in most cases because you will always have input and output. Sure, if you're debugging a complicated and coordinated "dance" between two concurrent threads/processes then yeah fully agreed, but then you're deep in uncharted territory and you need all the help you can get.
> maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.
This is still a valid use case, but pretend for a minute you have thousands or millions of log lines to inspect. Even after filtering for ERROR level only, you still have too many errors that devs swear are normal (but do not fix). And maybe the data you need to diagnose isn't even at ERROR level!
The solution? Use log queries to compare a normal and an abnormal process or cluster, group the lines by some kind of fingerprint, then apply Laplace smoothing or other Bayesian techniques to score fingerprints by strength of association with the abnormal set. This lets me rapidly identify problems at scale that would otherwise take hours of poring through logs to exclude stuff by hand.
This works any time you can divide logs into "good" and "bad." Example scenarios:
- canary analysis, comparing canary and baseline
- single faulty pod in a deploy, comparing the bad container to the n good ones
- one AZ or region in a multi-region deploy
- now versus yesterday, or versus an hour ago, etc
- Android versus iPhone
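The scoring idea above can be sketched in a few lines. This is a toy illustration, not any particular vendor's implementation; the digit-masking fingerprint and the add-one smoothing constant are assumptions:

```python
# Toy sketch of good/bad log scoring with Laplace smoothing.
# "Fingerprint" here is just the line with digits masked out;
# real systems use fancier log templating.
import math
import re
from collections import Counter

def fingerprint(line):
    return re.sub(r"\d+", "<N>", line)

def suspicious_fingerprints(good_lines, bad_lines, alpha=1.0):
    """Score fingerprints by association with the "bad" log set, using
    Laplace (add-alpha) smoothing so zero counts don't blow up."""
    good = Counter(fingerprint(l) for l in good_lines)
    bad = Counter(fingerprint(l) for l in bad_lines)
    n_good, n_bad = sum(good.values()), sum(bad.values())
    scores = {}
    for fp in set(good) | set(bad):
        p_bad = (bad[fp] + alpha) / (n_bad + 2 * alpha)
        p_good = (good[fp] + alpha) / (n_good + 2 * alpha)
        # log-likelihood ratio: > 0 means "more typical of the bad set"
        scores[fp] = math.log(p_bad / p_good)
    # Highest score first: most associated with failure.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For example, comparing a healthy pod's logs against a crashing pod's logs puts the fingerprint unique to the crashing pod at the top of the list.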
> I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.
That's the rub that I think you are missing. In distributed and/or cloud environments it is quite unusual for there to be a single end-to-end process, and thus we need new ways to trace across a system.
In harmony with tracing, we also need the aggregated view _across_ the estate to understand system hotspots, levels of throughput, redundant infrastructure, error rates, etc.
Dump the logs into Elastic, Loki or whatever, along with the pod name as a label. Usually I use Kibana, so I don't want to speak for Loki, but it seems pretty straightforward.
You missed the key criterion, which is being able to see the logs from that process "as a text file", or the way I'd rephrase it "with the same ease of a text file."
Kibana is ok (definitely beats grep) when you want to look across a fleet and determine if a specific thing is happening. But when you have a specific symptom that happens on a particular instance, what you want to do is see logs in the order they happened, and Kibana isn't close. Querying and viewing logs are just slow and cumbersome relative to less/grep.
If you get a chance, please check out kubetail (https://github.com/kubetail-org/kubetail). It's an open source log viewer for Kubernetes. Currently you can use it to look at pod logs from beginning to end, grouped together by workload (e.g. Deployment, CronJob) with basic filtering available (e.g. node-id, AZ). It doesn't let you look at historical logs yet but that's where we're headed. We just launched so we're eager for feedback and we like to build out new features quickly.
yes it can, if you tag your log stream correctly - either by having the stream externally tagged via attributes, or internally by following certain conventions in the log line.
You can also do something like
select client_ip from requests where elapsed_ms > 10000

which is incredibly powerful.
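As a toy illustration, that exact query can be exercised against an in-memory SQLite table standing in for a real log store (the `requests` table and its columns come from the comment above; the sample rows are made up):

```python
# Minimal sketch: run the slow-requests query from the comment
# against an in-memory SQLite table standing in for a log backend.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (client_ip TEXT, elapsed_ms INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("10.0.0.1", 120), ("10.0.0.2", 15000), ("10.0.0.3", 9800)],
)
slow = conn.execute(
    "SELECT client_ip FROM requests WHERE elapsed_ms > 10000"
).fetchall()
print(slow)  # only the 15-second request matches
```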
Yep, with the caveat that you probably don't want the backend of whatever log system you use (not exactly sure how Loki does it) to keep an index on something as high-cardinality as a session id, so that query could get slow.
But these log query systems can also optimize these queries, for instance by sampling, or by using distributed trace ids to ensure you get shown the corresponding logs, allowing you to get only logs where at least one step in the trace errored, etc.
strace and gdb can trace a process, and close and reopen its file handles 0, 1, 2 (stdin, stdout, stderr).
ldpreloadhook has an example of hooking write() with LD_PRELOAD=, which e.g. golang programs built without libc don't support.
When systemd is /sbin/init, it owns all subprocesses' file handles already, so there's no need to close(0), time, open(0) with gdb.
Without having to logship (copy buffers that are flushed and/or have newline characters in the stream) to a network or to local Arrow database files and/or SQLite vtables,
journalctl (journald) supports pattern matching with: -t syslogidentifier, -u unit; and -g grepexpr of the MESSAGE= field:
journalctl -u <TAB>
journalctl -u init.scope --reverse
journalctl -u unit.scope -g "Reached target" # and then "/sleep" to search and highlight with less
journalctl -u auditd.service
# this is slow because it's a full table scan, because
# journald does not index the logfiles;
# and -g/--grep is case insensitive if the query is all lowercase:
journalctl -g avc --reverse
journalctl -g AVC --reverse
# this is faster:
journalctl -t audit -g AVC -r
# this is still faster,
# because it only searches the current boot:
journalctl -b 0 -t audit -g AVC
# these are equivalent:
journalctl -b 0 --dmesg -t kernel
journalctl -k
#
journalctl -b 0 --user | grep -i -C "xyz123"
There is a GNOME Logs viewer that has 'All' and a few mutually exclusive filter/reports in a side pane, and a search expression field to narrow a filter/report like All or Important.

There is a Grafana Loki Docker Driver that logships from all containers visible on that DOCKER_HOST docker socket to Grafana for querying with Loki: https://grafana.com/docs/loki/latest/send-data/docker-driver...
Podman with Systemd doesn't need the Grafana Docker Driver (or other logshippers like logstash, loggly, or fluentd) because systemd spawns containers and optionally pipes their stdout/stderr logs to journald.
Influx has Telegraf, InfluxDB, Chronograf, and Kapacitor. Chronograf is their WebUI which provides a query interface for configurable chart dashboards and InfluxQL.
Grafana supports SQL, PromQL, InfluxQL, and LogQL.
Graylog2 also indexes logfiles.
But you can't query stdout and stderr that you or /sbin/init haven't logged to a file.
I use LogQL a fair amount. Often times even just negative filtering is quite useful.
I do a fair amount of tracking down of issues with LogQL. Looking for logs specific to a customer support ticket. Filtering for logs by a traceId for distributed traces.
I have serious doubts this new UI is something I will care about at all.
The explore ui for setting labels is atrocious and painful, and I'd rather just give me the text input for LogQL*
*: Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL
> Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL
Quick update: that has been fixed and will be available in Grafana 11.1.
I recently set up Victoria Metrics + https://github.com/prometheus/snmp_exporter + Grafana to start tracking bandwidth on the top-of-rack switches in my datacenter rack, which has been a pretty awesome setup. The way you can auto-generate a config for your SNMP MIBs with SNMP Exporter was unexpectedly not a terrible experience.
My next task is to get centralized logging going with Victoria Logs + Vector; I'll have to check this out once I get everything set up. I believe I can use LogQL with Victoria Logs but I haven't tried it out yet. https://docs.victoriametrics.com/victorialogs/logsql/
Here you have Vector in aggregator + agent mode and several sources:

https://github.com/nklmilojevic/home/blob/main/kubernetes/ap...
https://github.com/nklmilojevic/home/tree/main/kubernetes/ap...

VictoriaLogs also recently added a Grafana datasource, so it is fairly easy to set it up:

https://github.com/nklmilojevic/home/blob/main/kubernetes/ap...
I've been eyeing a VictoriaLogs setup for my docker container fleet, but I haven't quite spotted where docker's remote logging export options overlap with VictoriaLogs ingestion options.
Wrinkle: two docker remote logging plugins I tried (e.g. loki, elastic) didn't seem to work on ARM processors out of the box.
I'll preface this with the fact that I haven't looked at Loki in a bit, so maybe this has changed. But I found the documentation in need of a lot of work, and the promtail configuration obtuse and not very user-friendly. I haven't used it for those reasons, not because of the query language.
Our team uses loki and I have to say I think their collected helm charts are pretty easy to use - my problem is more that it seems to be quite slow to run on-prem. Very often my loki query times out and I have to do more work filtering down the log lines or selecting a narrower time range.
I'm kind of amazed the UI doesn't select small time ranges iteratively to build up the response, especially since I believe this is what the CLI does. Perhaps this is something their cloud offering provides and it is part of their marketing strategy. Not a good one, because if it came down to a decision I would start by looking for something else, having been p'ed off by Loki.
But I guess it still works pretty well considering it is free.
Loki UI in Grafana Explore seems to only select 1000 lines by default for me?
Also Loki on the backend splits/parallelizes requests if it can.
The Grafana backend products Mimir / Loki / Tempo all appear to be architected pretty similarly. I'm more experienced operating Mimir, but the answer to read load often just has to do with right-sizing the deployment scale and using caches aggressively.
Loki OSS is just a sales pitch for their managed service. It doesn't work well without dedicating significant time tweaking and configuring it. Documentation is confusing at best if you want to do anything serious. You have to also be ready to handle support calls if you open it up for others to use, because it WILL have issues fairly regularly if you have a good volume of logs and query range is more than a day or two.
Unless you have the bank to go with their managed service, don't bother.
Why have explore logs as a separate app instead of bundled with Loki? It would be nice if Loki had the same kind of barebones querying/debugging functionality as Prometheus...
Yeah but Prometheus has a web ui where you can run PromQL queries and it'll give you basic graphs back, which is handy for throwing a quick query at it before putting it into something more long-term like a Grafana dashboard or an alerting rule.
LogQL so far just does not click for me. I get that it's trying to be like Prometheus, but logs are not the same as time series - we have each and every log! So why am I forced to query it like a time series data source?
I want to query my logs like a SQL table, not a time series database.
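For a rough approximation of that, plain-text logs can be loaded into an in-memory SQLite table and queried with ordinary SQL. A minimal sketch; the `timestamp level message` line format and the column names here are assumptions:

```python
# Sketch: load plain-text log lines into SQLite so they can be
# queried like a SQL table. The line format is an assumption.
import sqlite3

def load_logs(lines):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (ts TEXT, level TEXT, message TEXT)")
    for line in lines:
        # "timestamp level rest-of-line" -> three columns
        ts, level, message = line.split(" ", 2)
        conn.execute("INSERT INTO logs VALUES (?, ?, ?)", (ts, level, message))
    return conn

conn = load_logs([
    "2024-06-01T10:00:01Z INFO started worker",
    "2024-06-01T10:00:02Z ERROR connection refused",
    "2024-06-01T10:00:03Z INFO retrying",
])
errors = conn.execute(
    "SELECT ts, message FROM logs WHERE level = 'ERROR' ORDER BY ts"
).fetchall()
```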
I thought it was a standalone web app, but it's integrated into Grafana. I'm confused a bit. There's already Explore functionality in Grafana for Loki. Seems like spreading the efforts for no reason.
After having used Datadog for several years, going back to Grafana / Loki / Prometheus felt like regressing by two decades. As much as I appreciate free solutions, I feel like Grafana has really fallen behind when it comes to developer experience.
Grafana cloud is better for querying logs.
Grafana cloud is probably a bit better for querying metrics. Grafana cloud is terrible at finding traces or even loading them. Datadog is lightyears ahead. For alerting I feel datadog has better features but is overwhelming with all the different options.
grafana is very quirky for searching for traces. And has a huge learning curve.
Could you provide more details? Although I've never had the opportunity to use Datadog at any of my previous positions, I am quite familiar with Grafana and I'm generally pretty happy with it.
There is a CLI tool with the same name (kubetail) that does something similar - https://github.com/johanhaleby/kubetail
I'm a big fan of VictoriaMetrics as well and we use it extensively in my company at high scale.
I use Podman for all of my container stuff and there are issues with how Podman produces JSON logs
https://github.com/vectordotdev/vector/issues/6807
https://github.com/containers/podman/issues/16317
which needs to get fixed before I can use it for my workloads.
What's the TL;DR for why Datadog is better?