Readit News
Posted by u/Nevin1901 3 years ago
Show HN: Log collector that runs on a $4 VPS
github.com/Nevin1901/erlo...
Hey guys, I'm building erlog to try to solve problems with logging. While adding logs to my own application, I couldn't find any lightweight log platform that was easy to set up without adding tons of dependencies to my code or configuring 10,000 files.

ErLog is just a simple Go web server that batch-inserts JSON logs into an SQLite database. By tuning SQLite and batching inserts, I can get around 8k log insertions/sec, which is fast enough for small projects.

This is just an MVP, and I plan to add more features once I talk to users. If anyone has any problems with logging, feel free to leave a comment and I'd love to help you out.
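
The batching idea in a minimal Python sketch (not the actual erlog code, which is Go; the table shape and pragma choices here are illustrative):

```python
import json
import sqlite3

def make_db(path=":memory:"):
    """Open a SQLite database tuned for high-throughput inserts."""
    conn = sqlite3.connect(path)
    # WAL mode + relaxed syncing trade a little durability for insert speed.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, body TEXT)")
    return conn

def insert_batch(conn, logs):
    """Insert a whole batch in one transaction instead of one commit per log."""
    rows = [(log.get("ts", ""), json.dumps(log)) for log in logs]
    with conn:  # single BEGIN/COMMIT around the whole batch
        conn.executemany("INSERT INTO logs (ts, body) VALUES (?, ?)", rows)

conn = make_db()
insert_batch(conn, [{"ts": "2023-01-01T00:00:00Z", "level": "info", "msg": "hi"}] * 1000)
print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])  # 1000
```

The per-batch transaction is where most of the throughput comes from; committing each log individually forces a sync per row.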

Dachande663 · 3 years ago
I’ve found the hard part is not so much the collection of logs (especially at this scale), but the eventual querying. If you’ve got an unknown set of fields being logged, queries very quickly devolve into lots of slow table scans, or into materialised views that start hampering your ingest rate.

I settled on a happy/OK midpoint recently whereby I dump logs into a Redis queue using Filebeat, as it’s very simple. Then a really simple queue consumer dumps the logs into ClickHouse using a schema Uber detailed (split keys and values), so queries can be pretty quick even over arbitrary fields. 30,000 logs an hour, and I can normally search for anything in under a second.

mr-karan · 3 years ago
I've a similar pipeline to yours (for the storage part). I use vector.dev for collecting and aggregating logs, enriching them with metadata (cluster, env, AWS tags), and finally storing them in a ClickHouse table.

Do you use any particular UI/Frontend tool for querying these logs?

mekster · 3 years ago
I've got that same logging backend, and I use Metabase for querying. It's far cleaner and easier to use/learn than Kibana or Graylog.

I've also considered Grafana, but it's not good for viewing raw logs.

Dachande663 · 3 years ago
We already have a back-office tool, so there’s just an extra screen with a query builder and outputs. Every few months I’ll tweak and add bits, but it’s nice to be able to see a user or entity ID and click on it to view the full resource.
metadat · 3 years ago
What are the hardware requirements/resources dedicated to pull this off?
Dachande663 · 3 years ago
Runs off a 2GB DigitalOcean box, which I think is $10 now?

It’s probably incredibly boring to describe, but I think that’s why it just tends to work. The whole thing took an afternoon to write (in PHP of all things too).

FrenchTouch42 · 3 years ago
Can you share more information about the schema you're mentioning? Thank you!
ignoramous · 3 years ago
Not OP but they might be referring to Uber moving from ES to ClickHouse to store their schema-flexible, structured logs, mostly to improve ingestion performance: https://archive.is/bFsTF / https://www.uber.com/blog/logging/

The gist of it is:

- Structured logs (JSON) are stored as kv pairs in parallel arrays, alongside metadata (host, timestamp, id, geo, namespace, etc).

- Log fields (i.e. kv pairs) are materialized (indexed) depending on query patterns, and vacuumed up if unused.

- Authoring queries and Kibana dashboard support is not trivial but handled with a query translation layer.
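
A rough sketch of that key/value split in Python (the field names and type buckets here are illustrative, not Uber's actual schema):

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into dotted key paths, e.g. {"http": {"status": 500}} -> {"http.status": 500}."""
    items = {}
    for k, v in obj.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            items.update(flatten(v, key))
        else:
            items[key] = v
    return items

def to_kv_row(log, metadata_fields=("timestamp", "host")):
    """Split a structured log into well-known metadata columns plus
    parallel key/value arrays, one pair of arrays per value type."""
    flat = flatten(log)
    row = {f: flat.pop(f, None) for f in metadata_fields}
    row["string.keys"], row["string.values"] = [], []
    row["number.keys"], row["number.values"] = [], []
    for k, v in sorted(flat.items()):
        if isinstance(v, (int, float)) and not isinstance(v, bool):
            row["number.keys"].append(k)
            row["number.values"].append(v)
        else:
            row["string.keys"].append(k)
            row["string.values"].append(str(v))
    return row

row = to_kv_row({"timestamp": "2023-01-01", "host": "web-1",
                 "http": {"status": 500, "path": "/login"}})
print(row["number.keys"])   # ['http.status']
print(row["string.keys"])   # ['http.path']
```

Because every row has the same fixed columns regardless of what fields the log carried, ingest never needs schema changes, and the column store can still filter on arbitrary keys by scanning the key arrays.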

folmar · 3 years ago
Sorry, but I don't see the selling point yet. Rsyslog has an omlibdbi module that sends your data to SQLite. It can consume pretty much any standard protocol on input, is already available, and is battle-proven.
remram · 3 years ago
Or just keep it in the log file? I'm not sure what the advantage of putting it in SQLite is, if all you're going to do is run unindexed `json_extract()` queries on it.
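
For what it's worth, if you do put them in SQLite, an indexed generated column makes those queries cheap. A minimal sketch (assuming SQLite >= 3.31 for generated columns; JSON1 is built in):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE logs (
        body TEXT,
        -- extract the field once, at write time, into a generated column...
        level TEXT GENERATED ALWAYS AS (json_extract(body, '$.level')) VIRTUAL
    )
""")
# ...and index it, so level lookups stop being full-table scans
conn.execute("CREATE INDEX idx_level ON logs(level)")

logs = [{"level": "error" if i % 100 == 0 else "info", "msg": f"event {i}"}
        for i in range(1000)]
conn.executemany("INSERT INTO logs (body) VALUES (?)",
                 [(json.dumps(log),) for log in logs])

# Both queries return the same rows; the second form can use idx_level.
scan = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE json_extract(body, '$.level') = 'error'"
).fetchone()[0]
indexed = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE level = 'error'").fetchone()[0]
print(scan, indexed)  # 10 10
```

Without the generated column and index, every filter re-parses the JSON of every row, which is exactly the repeated-scan problem.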
djbusby · 3 years ago
Or syslog-ng ? And syslog is crazy easy to integrate into nearly any code.
unxdfa · 3 years ago
I see your idea but you could drop the JSON and use rsyslogd + logrotate + grep? You can grep 10 gig files on a $5 VPS easily and quickly! I can't speak for a $4 one ;)
tiagod · 3 years ago
If you use grep, you'll be doing the same expensive operation every time; following files naively will fail after rotation; etc. And if you use something like Loki, it's easier to integrate with other tools to react to the logs.
unxdfa · 3 years ago
It’s potentially a premature optimisation to not do that expensive operation every time. Loki and brethren have a significant infrastructure cost and cognitive load to consider. I speak from experience and know where the ROI appears and it’s far from the use case specified here.
ilyt · 3 years ago
> following files naively will fail after rotation,

...so what you're saying they have to write "tail -F" instead of "tail".

> If you use grep you'll be doing the same expensive operation every time

If your ingest is that low, it barely matters. Modern grep replacements are pretty fast.

mekster · 3 years ago
Why do people like to stick to an inefficient, ancient method like grep for log viewing?

Try tools like Metabase and see how it makes your log reading far better.

Thaxll · 3 years ago
You could have just used Filebeat? It's also in Go and it's pretty easy to use.

https://www.elastic.co/guide/en/beats/filebeat/current/fileb...

mekster · 3 years ago
I think Vector really shines with its VRL language to parse and enrich data. It's well thought out with buffering for network errors and throwing errors on parsing instead of silently discarding.

https://vector.dev/docs/reference/vrl/

rsdbdr203 · 3 years ago
This is exactly why I built log-store. It can easily handle 60k logs/sec, but more important, I think, is the query interface: commands to help you extract value from your logs, including custom commands written in Python.

Free through '23 is my motto... Just a solo founder looking for feedback.

recck · 3 years ago
I came across this a few months ago and have been following it pretty closely. Using it locally in a Docker container has been painless. The UI is definitely iterating quickly, but the time-to-first-log was impressive! Happy to keep using it.
spsesk117 · 3 years ago
Disclaimer: I am friends with the founder of log-store.

I have been beta testing it for a while for small-scale (~50 million non-nested JSON objects) log aggregation, and it's working beautifully for this case.

It's a no-nonsense solution that is seamless to integrate and operate. On the ops side, it's painless to set up, maintain, and push logs to. On the user side, it's extremely fast and straightforward. End users aren't fumbling their way through a monster UI like Kibana; access to the information they need is direct and uncluttered.

I can't speak to its suitability in a 1TB-of-logs-per-day situation, but as a small-scale, straightforward log aggregation tool I can't recommend it enough.

binwiederhier · 3 years ago
log-store [1] is pretty neat. Thanks for making it. It's super powerful and easy to use. There's a learning curve with the query language, but it's super cool once you figure it out.

[1] https://log-store.com/

remram · 3 years ago
May be more widely applicable for personal servers: lnav, an advanced log file viewer for the terminal: https://lnav.org/

It uses SQLite internally but can parse log files in many formats on the fly. C++, BSD license, discussed 1 month ago: https://news.ycombinator.com/item?id=34243520

keroro · 3 years ago
If anyone's looking for similar services: I'm using vector.dev to move logs around, and it works great, with a ton of sources/destinations pre-configured.
Hamuko · 3 years ago
I feel like if you're going to use "$4 VPS" as a quantifier, you could at least specify which $4 VPS is being used.
sgt · 3 years ago
Look at this:

https://www.hetzner.com/cloud

More like $5 but still, 1 vCPU, 2GB RAM, 20GB NVMe storage. Closer to $4 USD if you let go of IPv4 in favor of IPv6 only.

Edit: Looks like that's also a shared vCPU.

teruakohatu · 3 years ago
DO's 512MB basic VPS starts at $4, so I am guessing it is that.
benatkin · 3 years ago
I don't think it is. That one is shared vCPU and I've been hearing about a single vCPU one.