I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file. You can of course do this using kubectl but only for the most recent two instances of a given pod which isn't helpful when investigating an incident that happened a while ago.
It seems nobody else cares about this use case and wants you to use LogQL and the incredibly clunky Grafana web UI instead, because it makes it possible to aggregate across many different processes, slice and dice by various labels, etc., which as I said, I have never (or almost never) actually wanted to do.
Hopefully this new UI is a step in the right direction as people won't need to futz around with LogQL anymore, but it seems like it still doesn't quite do what I want.
Just want to chip in and say that I wholeheartedly agree with you. I'm not a cloud developer either, but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs. Compared to working with Linux journals using the tried and true text manipulation tools, it feels like looking through a straw with oven mitts on. I'd happily download a 500 MB text file instead, but there is an arbitrary limit to how much I can grab (10k lines IIRC). Maybe we're just out of touch.
> but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs
Is this google cloud logging? If so, personally I quite like it, especially for looking through logs from multiple sources at the same time. Being able to put all your logs through there, and then search them with a simple query language, feels very convenient.
Fwiw this is how I use Loki most of the time. Pick an app label, pick a time period, look at raw logs. The LogQL for this ends up something like `{app="workload-foo"}`. Loki is excellent at that.
Then if I know which pod I'll filter down to a specific pod with `{pod="workload-foo-1234"}`, sometimes I'll search for a specific term (error message etc) with `{pod="workload-foo-1234"} |= "error message"` then look at the logs around that. There's really no point writing complicated queries unless you need to.
That will, if I understand correctly, get the logs for one pod, not for one process. For example if the pod restarted 10 times you will not get 10 separate files from that query.
I'm an old fart so I use things like "cat" and "grep", and maybe "sed" and "cut" if the lines are particularly long.
I have one log file per day per host on my syslog server and can use "sort" to order across multiple files.
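That merge can also be scripted. A minimal sketch (the file layout and timestamp format are my assumptions), equivalent to running `sort -m` over already-sorted per-host files:

```python
# Sketch: merge per-host daily syslog files into one time-ordered stream.
# Assumes each file is already sorted and every line starts with a
# sortable timestamp (e.g. ISO 8601), so plain string comparison
# matches chronological order -- like `sort -m` on the command line.
import heapq

def merged_lines(paths):
    """Yield lines from several sorted log files in timestamp order."""
    files = [open(p) for p in paths]
    try:
        # heapq.merge lazily merges already-sorted inputs without
        # loading whole files into memory.
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
```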
Loki was sold to me at fosdem a couple of years ago as this, but I still haven't got round to working it out, seems a very high barrier to entry compared with running cat.
> seems a very high barrier to entry compared with running cat
It really isn't. It's a single binary with a relatively simple configuration file, you throw logs at it via an API (which a bunch of logging agents support, and syslogs can be sent to it).
Then the actual queries aren't all that complex, it's just a difference of cd-ing to the correct folder for the date/server to be able to cat and grep vs writing a query that selects by server name and filters by date.
The learning curve and maintenance of Loki are quite minimal, but the value add is quite significant in most cases. Being able to do cross-host queries, metrics from logs (how many times did error X occur in the logs), as well as easy visualisations is pretty useful.
You may be amazed at how hard these tools are to get started with relative to that. I have been thoroughly unimpressed with and unable to really get started with any of these tools because of the overemphasis on cloud. Not sure what people were doing before, but sshing to the prod box kinda sucks.
If you’re debugging something simple or non-distributed, this product isn’t for you.
If you’re working on anything distributed, log aggregation becomes a must. But, also, if you’re working on anything distributed and you’re looking at logs, you’re desperate. Distributed traces are so much higher quality.
When I formed these opinions I was working on Materialize, which is basically the polar opposite of "simple and non-distributed". However it was still quite common that I knew exactly which process was doing something weird and unexpected.
Yup and the reason no one markets something like "tail the logs for server X" is because, if you're talking in the context of an individual server, you're too small for anyone to care about.
Sorry, did plenty of "distributed" tracing back in the day and this is just not the case. I can't help but feel like you're after-the-fact rationalizing as if you need this for diagnosing anything "distributed" or "complicated".
Distributed anything is actually easier in most cases because you will always have input and output. Sure, if you're debugging a complicated and coordinated "dance" between two concurrent threads/processes then yeah fully agreed, but then you're deep in uncharted territory and you need all the help you can get.
> maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.
This is still a valid use case, but pretend for a minute you have thousands or millions of log lines to inspect. Even after filtering for ERROR level only, you still have too many errors that devs swear are normal (but do not fix). And maybe the data you need to diagnose isn't even at ERROR level!
The solution? Use log queries to compare a normal and an abnormal process or cluster, group the lines by some kind of fingerprint, then apply Laplace smoothing or other Bayesian techniques to score fingerprints by strength of association with the abnormal set. This lets me rapidly identify problems at scale that would otherwise take hours of poring through logs to exclude stuff by hand.
This works any time you can divide logs into "good" and "bad." Example scenarios:
- canary analysis, comparing canary and baseline
- single faulty pod in a deploy, comparing the bad container to the n good ones
- one AZ or region in a multi-region deploy
- now versus yesterday, or versus an hour ago, etc
- Android versus iPhone
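The scoring idea above can be sketched in a few lines. This is a toy illustration, not any particular vendor's implementation; the digit-masking fingerprint and the add-one smoothing constant are assumptions:

```python
# Toy sketch of good/bad log scoring with Laplace smoothing.
# "Fingerprint" here is just the line with digits masked out;
# real systems use fancier log templating.
import math
import re
from collections import Counter

def fingerprint(line):
    return re.sub(r"\d+", "<N>", line)

def suspicious_fingerprints(good_lines, bad_lines, alpha=1.0):
    """Score fingerprints by association with the "bad" log set, using
    Laplace (add-alpha) smoothing so zero counts don't blow up."""
    good = Counter(fingerprint(l) for l in good_lines)
    bad = Counter(fingerprint(l) for l in bad_lines)
    n_good, n_bad = sum(good.values()), sum(bad.values())
    scores = {}
    for fp in set(good) | set(bad):
        p_bad = (bad[fp] + alpha) / (n_bad + 2 * alpha)
        p_good = (good[fp] + alpha) / (n_good + 2 * alpha)
        # log-likelihood ratio: > 0 means "more typical of the bad set"
        scores[fp] = math.log(p_bad / p_good)
    # Highest score first: most associated with failure.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For example, comparing a healthy pod's logs against a crashing pod's logs puts the fingerprint unique to the crashing pod at the top of the list.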
> I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.
That's the rub that I think you are missing. In distributed and/or cloud environments it is quite unusual for there to be a single end-to-end process, and thus we need new ways to trace across a system.
In harmony with tracing, we also need the aggregated view _across_ the estate to understand system hotspots, levels of throughput, redundant infrastructure, error rates, etc.
Dump the logs into Elastic, Loki or whatever, along with the pod name as a label. Usually I use Kibana, so I don't want to speak for Loki, but it seems pretty straightforward.
You missed the key criterion, which is being able to see the logs from that process "as a text file", or the way I'd rephrase it "with the same ease of a text file."
Kibana is ok (definitely beats grep) when you want to look across a fleet and determine if a specific thing is happening. But when you have a specific symptom that happens on a particular instance, what you want to do is see logs in the order they happened, and Kibana isn't close. Querying and viewing logs are just slow and cumbersome relative to less/grep.
If you get a chance, please check out kubetail (https://github.com/kubetail-org/kubetail). It's an open source log viewer for Kubernetes. Currently you can use it to look at pod logs from beginning to end, grouped together by workload (e.g. Deployment, CronJob) with basic filtering available (e.g. node-id, AZ). It doesn't let you look at historical logs yet but that's where we're headed. We just launched so we're eager for feedback and we like to build out new features quickly.
yes it can, if you tag your log stream correctly - either by having the stream externally tagged via attributes, or internally by following certain conventions in the log line.
You can also do something like
select client_ip from requests where elapsed_ms > 10000

which is incredibly powerful.
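As a toy illustration, that exact query can be exercised against an in-memory SQLite table standing in for a real log store (the `requests` table and its columns come from the comment above; the sample rows are made up):

```python
# Minimal sketch: run the slow-requests query from the comment
# against an in-memory SQLite table standing in for a log backend.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (client_ip TEXT, elapsed_ms INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("10.0.0.1", 120), ("10.0.0.2", 15000), ("10.0.0.3", 9800)],
)
slow = conn.execute(
    "SELECT client_ip FROM requests WHERE elapsed_ms > 10000"
).fetchall()
print(slow)  # only the 15-second request matches
```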
Yep, with the caveat that you probably don't want the backend of whatever log system you use (not exactly sure how Loki does it) to keep an index on something as high-cardinality as a session id, so that query could get slow.
But these log query systems can also optimize these queries, for instance by sampling, or by using distributed trace ids to ensure you get shown the corresponding logs, allowing you to get only logs where at least one step in the trace errored, etc.
strace and gdb can trace a process, and close and reopen its file handles 0, 1, 2 (stdin, stdout, stderr).
ldpreloadhook has an example of hooking write() with LD_PRELOAD=, which e.g. golang programs built without libc don't support.
When systemd is /sbin/init, it owns all subprocesses' file handles already, so there's no need to close(0), time, open(0) with gdb.
Without having to logship (copy buffers that are flushed and/or have newline characters in the stream) to a network or to local Arrow database files and/or SQLite vtables,
journalctl (journald) supports pattern matching with: -t syslogidentifier, -u unit; and -g grepexpr of the MESSAGE= field:
journalctl -u <TAB>
journalctl -u init.scope --reverse
journalctl -u unit.scope -g "Reached target" # and then "/sleep" to search and highlight with less
journalctl -u auditd.service
# this is slow because it's a full table scan, because
# journald does not index the logfiles;
# and -g/--grep is case insensitive if the query is all lowercase:
journalctl -g avc --reverse
journalctl -g AVC --reverse
# this is faster:
journalctl -t audit -g AVC -r
# this is still faster,
# because it only searches the current boot:
journalctl -b 0 -t audit -g AVC
# these are equivalent:
journalctl -b 0 --dmesg -t kernel
journalctl -k
#
journalctl -b 0 --user | grep -i -C "xyz123"
There is a GNOME Logs viewer that has 'All' and a few mutually exclusive filter/reports in a side pane, and a search expression field to narrow a filter/report like All or Important.

There is a Grafana Loki Docker Driver that logships from all containers visible on that DOCKER_HOST docker socket to Grafana for querying with Loki: https://grafana.com/docs/loki/latest/send-data/docker-driver...
Podman with Systemd doesn't need the Grafana Docker Driver (or other logshippers like logstash, loggly, or fluentd) because systemd spawns containers and optionally pipes their stdout/stderr logs to journald.
Influx has Telegraf, InfluxDB, Chronograf, and Kapacitor. Chronograf is their WebUI which provides a query interface for configurable chart dashboards and InfluxQL.
Grafana supports SQL, PromQL, InfluxQL, and LogQL.
Graylog2 also indexes logfiles.
But you can't query stdout and stderr that you or /sbin/init haven't logged to a file.
I use LogQL a fair amount. Often times even just negative filtering is quite useful.
I do a fair amount of tracking down of issues with LogQL. Looking for logs specific to a customer support ticket. Filtering for logs by a traceId for distributed traces.
I have serious doubts this new UI is something I will care about at all.
The explore ui for setting labels is atrocious and painful, and I'd rather just give me the text input for LogQL*
*: Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL
> Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL
Quick update: that has been fixed and will be available in Grafana 11.1.
I recently set up Victoria Metrics + https://github.com/prometheus/snmp_exporter + Grafana to start tracking bandwidth on the top-of-rack switches in my datacenter rack, which has been a pretty awesome setup. The way you can auto-generate a config for your SNMP MIBs with SNMP Exporter was unexpectedly not a terrible experience.
My next task is to get centralized logging going with Victoria Logs + Vector; I'll have to check this out once I get everything set up. I believe I can use LogQL with Victoria Logs but I haven't tried it out yet. https://docs.victoriametrics.com/victorialogs/logsql/
Here you have Vector in aggregator + agent mode and several sources:

https://github.com/nklmilojevic/home/blob/main/kubernetes/ap...
https://github.com/nklmilojevic/home/tree/main/kubernetes/ap...

VictoriaLogs also recently added a Grafana datasource, so it is fairly easy to set it up:

https://github.com/nklmilojevic/home/blob/main/kubernetes/ap...
I've been eyeing a VictoriaLogs setup for my docker container fleet, but I haven't quite spotted where docker's remote logging export options overlap with VictoriaLogs ingestion options.
Wrinkle: two docker remote logging plugins I tried (e.g. loki, elastic) didn't seem to work on ARM processors out of the box.
I'll preface this with the fact that I haven't looked at Loki in a bit, so maybe this has changed. But I found the documentation in need of a lot of work, and the promtail configuration obtuse and not very user-friendly. I haven't used it for those reasons, not because of the query language.
Our team uses loki and I have to say I think their collected helm charts are pretty easy to use - my problem is more that it seems to be quite slow to run on-prem. Very often my loki query times out and I have to do more work filtering down the log lines or selecting a narrower time range.
I'm kind of amazed the UI doesn't select small time ranges iteratively to build up the response, especially since I believe this is what the CLI does. Perhaps this is something their cloud offering provides and it is part of their marketing strategy. Not a good one, because if it came down to a decision I would start by looking for something else, having been p'ed off by Loki.
But I guess it still works pretty well considering it is free.
Loki UI in Grafana Explore seems to only select 1000 lines by default for me?
Also Loki on the backend splits/parallelizes requests if it can.
The Grafana backend products Mimir / Loki / Tempo all appear to be architected pretty similarly. I'm more experienced operating Mimir, but the answer to read load often just has to do with right-sizing the deployment scale and using caches aggressively.
Loki OSS is just a sales pitch for their managed service. It doesn't work well without dedicating significant time tweaking and configuring it. Documentation is confusing at best if you want to do anything serious. You have to also be ready to handle support calls if you open it up for others to use, because it WILL have issues fairly regularly if you have a good volume of logs and query range is more than a day or two.
Unless you have the bank to go with their managed service, don't bother.
Why have explore logs as a separate app instead of bundled with Loki? It would be nice if Loki had the same kind of barebones querying/debugging functionality as Prometheus...
Yeah but Prometheus has a web ui where you can run PromQL queries and it'll give you basic graphs back, which is handy for throwing a quick query at it before putting it into something more long-term like a Grafana dashboard or an alerting rule.
LogQL so far just does not click for me. I get that it's trying to be like Prometheus, but logs are not the same as time series - we have each and every log! So why am I forced to query it like a time series data source?
I want to query my logs like a SQL table, not a time series database.
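For a rough approximation of that, plain-text logs can be loaded into an in-memory SQLite table and queried with ordinary SQL. A minimal sketch; the `timestamp level message` line format and the column names here are assumptions:

```python
# Sketch: load plain-text log lines into SQLite so they can be
# queried like a SQL table. The line format is an assumption.
import sqlite3

def load_logs(lines):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (ts TEXT, level TEXT, message TEXT)")
    for line in lines:
        # "timestamp level rest-of-line" -> three columns
        ts, level, message = line.split(" ", 2)
        conn.execute("INSERT INTO logs VALUES (?, ?, ?)", (ts, level, message))
    return conn

conn = load_logs([
    "2024-06-01T10:00:01Z INFO started worker",
    "2024-06-01T10:00:02Z ERROR connection refused",
    "2024-06-01T10:00:03Z INFO retrying",
])
errors = conn.execute(
    "SELECT ts, message FROM logs WHERE level = 'ERROR' ORDER BY ts"
).fetchall()
```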
I thought it was a standalone web app, but it's integrated into Grafana. I'm confused a bit. There's already Explore functionality in Grafana for Loki. Seems like spreading the efforts for no reason.
After having used Datadog for several years, going back to Grafana / Loki / Prometheus felt like regressing by two decades. As much as I appreciate free solutions, I feel like Grafana has really fallen behind when it comes to developer experience.
Grafana cloud is better for querying logs.
Grafana cloud is probably a bit better for querying metrics. Grafana cloud is terrible at finding traces or even loading them. Datadog is lightyears ahead. For alerting I feel datadog has better features but is overwhelming with all the different options.
grafana is very quirky for searching for traces. And has a huge learning curve.
Could you provide more details? Although I've never had the opportunity to use Datadog at any of my previous positions, I am quite familiar with Grafana and I'm generally pretty happy with it.
There is a CLI tool with the same name (kubetail) that does something similar - https://github.com/johanhaleby/kubetail
I'm a big fan of VictoriaMetrics as well and we use it extensively in my company at high scale.
I use Podman for all of my container stuff and there are issues with how Podman produces JSON logs
https://github.com/vectordotdev/vector/issues/6807
https://github.com/containers/podman/issues/16317
which needs to get fixed before I can use it for my workloads.
What's the TL;DR for why Datadog is better?