Vector is fantastic software. I'm currently running a multi-GB/s log pipeline with it: Vector agents as DaemonSets collect pod and journald logs, then forward via Vector's protobuf protocol to a central Vector aggregator Deployment with various sinks (S3, GCS/BigQuery, Loki, Prometheus).
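For anyone looking for a starting point, the agent side of that topology looks roughly like the following in Vector's YAML config. This is a sketch from memory of Vector's component names (`kubernetes_logs`, `journald`, and the `vector` sink are real components, but verify the options against the current reference docs); the aggregator address is made up:

```yaml
# Agent (DaemonSet) side -- hedged sketch, check options against the docs
sources:
  pod_logs:
    type: kubernetes_logs
  host_journal:
    type: journald

sinks:
  aggregator:
    type: vector    # Vector's native protobuf-based protocol
    inputs: [pod_logs, host_journal]
    address: "vector-aggregator.observability.svc:6000"
```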
The documentation is great but it can be hard to find examples of common patterns, although it's getting better with time and a growing audience.
My pro-tip has been to prefix your searches with "vector dev <query>" for best results on google. I think "vector" is/was just too generic.
I feel like the ecosystem is very, very close to ready for what I would consider to be a really nice medium-to-long-term queryable log storage system. In my mind, it works like this:
1. Logs get processed (by a tool like vector) and stored to a sink that consists of widely-understood files in an object store. Parquet format would be a decent start. (Yscope has what sounds like a nifty compression scheme that could layer in here.)
2. Those log objects are (transactionally!) enrolled into a metadata store so things can find them. Delta Lake or Iceberg seem credible. Sure, these tools are meant for Really Big Data, but I see no reason they couldn’t work at any scale. And because the transaction layer exists as a standalone entity, one could run multiple log processing pipelines all committing into the same store.
3. High-performance and friendly tools can read them. Think ClickHouse, DuckDB, Spark, etc. Maybe everything starts to support this as a source for queries.
4. If you want to switch tools, no problem — the formats are standard. You can even run more than one at once.
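Step 3 is nearly free once the files are Parquet: engines like DuckDB (with its httpfs extension loaded) can query the objects in place. A sketch, with a hypothetical bucket layout and schema:

```sql
-- hypothetical layout: s3://logs/dt=YYYY-MM-DD/part-*.parquet
SELECT level, count(*) AS n
FROM read_parquet('s3://logs/dt=2024-06-*/part-*.parquet')
WHERE message ILIKE '%timeout%'
GROUP BY level
ORDER BY n DESC;
```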
Has anyone actually put the pieces together to make something like this work?
I work on something where we use Vector similar to this.
The application writes directly to a local Vector instance running as a daemon set, using the TCP protocol. That instance buffers locally in case of upstream downtime. It also augments each payload with some metadata about the origin.
The local one then sends to a remote Vector using Vector's internal Protobuf-based framing protocol. That Vector has two sinks, one which writes the raw data in immutable chunks to an object store for archival, and another that ingests in real time into ClickHouse.
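The aggregator side of a setup like that is roughly one `vector` source feeding two sinks. A sketch with hypothetical bucket names and endpoints; the option names are from memory of Vector's docs, so double-check them before copying:

```yaml
# Aggregator side: one source, two sinks (archive + real-time)
sources:
  agents:
    type: vector
    address: "0.0.0.0:6000"

sinks:
  archive:
    type: aws_s3
    inputs: [agents]
    bucket: "log-archive"          # hypothetical bucket
    key_prefix: "raw/%Y/%m/%d/"
    encoding:
      codec: json
  realtime:
    type: clickhouse
    inputs: [agents]
    endpoint: "http://clickhouse:8123"
    table: "logs"
```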
This all works pretty great. The point of having a local Vector is so applications can be thin clients that just "firehose" out their data without needing a lot of complex buffering, retrying, etc. and without a lot of overhead, so we can emit very fine-grained custom telemetry data.
There is a tiny bit of retrying logic with a tiny bit of in-memory buffering (Vector can go down or be restarted and the client must handle that), but it's very simple, and designed to sacrifice messages to preserve availability.
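The client-side behavior described here (a tiny bounded buffer that prefers dropping messages over blocking the app) can be sketched in a few lines. This is an illustrative stand-in, not the actual client; `FirehoseClient` and its API are made up:

```python
from collections import deque

class FirehoseClient:
    """Best-effort emitter: tiny in-memory buffer, drops oldest on overflow.

    Illustrative sketch; in practice `send` would wrap a TCP socket
    connected to the local Vector instance.
    """

    def __init__(self, send, max_buffered=1000):
        self._send = send                          # callable(bytes); raises OSError when Vector is down
        self._buffer = deque(maxlen=max_buffered)  # appending to a full deque evicts the oldest entry
        self.dropped = 0

    def emit(self, line: str) -> None:
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1                      # about to evict the head: availability over completeness
        self._buffer.append(line.encode() + b"\n")
        self._flush()

    def _flush(self) -> None:
        # Drain whatever we can; give up silently on the first failure.
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except OSError:
                return                             # Vector down/restarting; keep buffering
            self._buffer.popleft()
```

The key design point matches the comment above: on overflow the client sacrifices the oldest messages rather than blocking or growing memory without bound.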
Grafana is a nice way to use ClickHouse. ClickHouse is a bit more low level than I'd like (it often feels more like a "database construction kit" than a database), but the design is fantastic.
Depending on your use case, and if you can afford to miss a few logs, sending log data via UDP is helpful so you don’t interrupt the app. I have done this to good effect, though not with Vector; our stuff was custom and aggregated many things into 1s chunks.
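A minimal version of that UDP approach, batching a chunk of lines into one fire-and-forget datagram. The local receiver below exists only to make the demo self-contained:

```python
import socket

def send_batch_udp(sock, dest, lines):
    """Send a batch of log lines as a single UDP datagram.

    sendto() never blocks waiting on a slow collector, so the app is
    never interrupted; the trade-off is that datagrams can be lost.
    """
    sock.sendto("\n".join(lines).encode(), dest)

# Self-contained demo: throwaway receiver on an ephemeral loopback port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5)

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_batch_udp(sender, receiver.getsockname(), ["line 1", "line 2"])
data, _ = receiver.recvfrom(65535)
```

A real aggregator along the lines described above would accumulate lines for roughly a second, then call something like `send_batch_udp` once per chunk.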
Quickwit is very similar to what is described here.
Unfortunately, the files are not in Parquet, so even though Quickwit is open source, it is difficult to tap into the file format.
We did not pick Parquet because we want to be able to search and do analysis efficiently, so we ship an inverted index, a row-oriented store, and a columnar format that allows for random access.
We are planning to eventually add ways to tap into the files and get data out in the Apache Arrow format.
For a quick skim through the docs, it wasn’t clear to me: can I run a stateless Quickwit instance or even a library to run queries, such that the only data accessed is in the underlying object store? Or do I need a long-running search instance or cluster?
Would this fit your medium-to-long-term need? It's a weekend of work to automate: JSON logs go to Kafka, a Logstash consumer stores batches as Hive-partitioned data in S3 with gzip compression, Athena tables sit over those S3 prefixes, and Presto SQL is used to query/cast/aggregate the data.
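The Athena end of a pipeline like that is just an external table over the gzip'd, Hive-partitioned prefixes. A sketch; table, column names, and the bucket path are hypothetical:

```sql
-- External table over Hive-partitioned JSON batches in S3
CREATE EXTERNAL TABLE logs (
  ts      string,
  level   string,
  message string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-log-bucket/logs/';

-- Pick up new dt= partitions after each batch lands
MSCK REPAIR TABLE logs;
```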
Much more reliable than Beats and vendor-specific forwarders (Chronicle Forwarder and FDR) in our experience. VRL is also pretty useful for "pre-parsing" massive logs, e.g. AWS CloudTrail and Imperva ABP.
I’ve used this before, to great success. It's nice and straightforward to configure, the VRL language is just powerful enough for its needs, and the CLI’s handy “check” feature helps you catch a bunch of config issues. Performance-wise it’s never missed a beat, and it’s resource-efficient. Strongly recommend.
We had to push metrics we scrape via Prometheus into DataDog (coincidence that they acquired this) and do a custom transform to map to a set of custom metrics.
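For reference, that Prometheus-to-Datadog path looks roughly like this in Vector config. The component names (`prometheus_scrape`, `remap`, `datadog_metrics`) are real Vector components, but the endpoint, the renaming rule, and option details here are illustrative; check the reference docs:

```yaml
sources:
  prom:
    type: prometheus_scrape
    endpoints: ["http://app:9090/metrics"]   # hypothetical scrape target

transforms:
  to_custom:
    type: remap
    inputs: [prom]
    source: |
      # VRL: prefix the metric name so it lands as a custom metric
      .name = "custom." + .name

sinks:
  dd:
    type: datadog_metrics
    inputs: [to_custom]
    default_api_key: "${DD_API_KEY}"
```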
Very straightforward in how it runs, and the Helm chart had all the right things in there.
OTel support in Vector is an often-requested feature, across multiple threads. There have been good noises and the occasional "we'll get to it", but so far there's just OTel log ingest support, which has been there for a while now. https://github.com/vectordotdev/vector/issues/17307
I'm excited for these front-end telemetry routers to keep going. Really hoping Vector can co-evolve with and grow with the rest of the telemetry ecosystem. Otel itself has really started in on the next front with OpAMP, Open Agent Management Protocol, to allow online reconfiguration. I'd love to know more about Vector's online management... quick scan seems to say it's rewriting your JSON config & doing a SIGHUP. https://opentelemetry.io/docs/specs/opamp/
Vector's configurability and fast-and-slim promise look amazing. Everyone would be so much better off if it can grow to interop well with OTel. Really hoping here.
I’m personally still waiting for OTel stuff to just…evolve a bit more? There are some sharp edges, and a bunch of “bits” in that ecosystem where it isn’t clear how we’re supposed to hold them, and things don’t _quite_ work well enough yet.
Don’t get me wrong, I want to use OTEL, but it’s a struggle. In the meantime, I’ve still got normal apps and libraries outputting normal logs and normal prom metrics, so I’ve got to stick with that.
Vector, to me, is more than just "high-performance": it's a true Swiss Army knife for metrics and logging. We regularly use it to transform logs into metrics, convert metrics to different formats, push them to different datastores, filter them, etc. It's wild how flexible this program is. It has become my first choice for anything regarding gathering/aggregating/filtering/preprocessing observability data.
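As one example of that logs-to-metrics flexibility, Vector's `log_to_metric` transform can turn matching log events into counters. A sketch; the input name and tag template are hypothetical, so verify the options against the transform's reference page:

```yaml
transforms:
  errors_to_counter:
    type: log_to_metric
    inputs: [app_logs]           # hypothetical upstream component
    metrics:
      - type: counter
        field: message           # increment once per event carrying this field
        name: app_errors_total
        tags:
          service: "{{ service }}"   # templated from the log event
```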
Is there a way to temporarily connect to Vector and select sources or sinks to be duplicated into a stream (say, stdout or a TCP socket)? I'd love to find a use case for Logdy where I can stream whatever is landing in Vector to Logdy[1] and literally see everything through a web UI. The use case would be debugging a complex observability pipeline, as Logdy serves a UI for everything that lands on it and makes it easy to parse and filter.
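Vector does have a built-in hook for roughly this: with its GraphQL API enabled, the `vector tap` subcommand streams events flowing through matching components to stdout, which could in turn be piped into Logdy (Logdy can read from stdin). A sketch; verify the pattern syntax and flags against the current docs:

```yaml
# vector.yaml: the tap command talks to this API
api:
  enabled: true
  address: "127.0.0.1:8686"
```

Then something like `vector tap 'my_transform'` prints matching events as they flow, and piping that into Logdy would give the web UI view described above.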
I'm just getting to know Vector. I've noticed that most Vector examples and discussions are targeted at databases or complex multi-tenant applications. And it looks really cool!
Has anyone tried Vector in the context of autonomous vehicles, essentially a distributed system, where Vector would serve the purpose of aggregating the op-logs, system state, and input and output of every application at every instance?
I only learned about Vector after I had set up a new fluent-bit pipeline, and I have to say there's a lot in Vector that looks interesting; I wish I'd had time to play with it earlier. I might still do so when I have some downtime. It looks very capable and could be fun to try on a new project.
A nice recent contribution added an alternative to prometheus pushgateway that handles counters better: https://github.com/vectordotdev/vector/issues/10304#issuecom...
Here’s a post on how to do this with fly.io which uses vector: https://scratchdata.com/blog/fly-logs-to-clickhouse/
This is my actual production vector.yaml: https://gist.github.com/poundifdef/293bf2c4cd5aaa734b0b8e25e...
You could literally download my product (it’s open source) and set it up in 5 minutes: scratchdata.com
Very easy to get setup locally too as a POC.
Timber definitely intended to just rock out & demolish everything else out there with their agent/forwarder/aggregator tech. But it wasn't a competitive play against OTel, in my humble opinion. Timber's whole shtick is that it integrates with everything, with really flexible/good glue logic in-between. A competent multi-system (logging, metrics, eventually traces) fluentd++. OTel - I want to believe - would have been part of that original vision.
It's just taking a really really long time. One can speculate how direction & velocity might have changed since the Datadog acquisition. The lack of tracing (anywhere except Datadog, so far) materializing has been a hard hard hard & sad thing to see. OG https://github.com/vectordotdev/vector/issues/1444 and newer https://github.com/vectordotdev/vector/issues/17307
[1] https://logdy.dev/