1. Logs get processed (by a tool like vector) and stored in a sink consisting of widely understood files in an object store. Parquet format would be a decent start. (Yscope has what sounds like a nifty compression scheme that could layer in here.)
2. Those log objects are (transactionally!) enrolled into a metadata store so things can find them. Delta Lake or Iceberg seem credible. Sure, these tools are meant for Really Big Data, but I see no reason they couldn’t work at any scale. And because the transaction layer exists as a standalone entity, one could run multiple log processing pipelines all committing into the same store.
3. High-performance, friendly tools can read them. Think ClickHouse, DuckDB, Spark, etc.; a rough sketch of how the pieces fit follows below. Maybe everything starts to support this as a source for queries.
4. If you want to switch tools, no problem — the formats are standard. You can even run more than one at once.
Has anyone actually put the pieces together to make something like this work?
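For what it’s worth, here’s a minimal, local sketch of what steps 1-3 could look like, assuming the deltalake and duckdb Python packages. The table path, schema, and sample records are invented for illustration (a hand-built batch stands in for whatever vector would emit), and a real deployment would point at an object-store URI instead of a local directory:

```python
# Illustrative sketch only -- paths and fields are made up.
import duckdb
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

TABLE_PATH = "./logs_delta"  # in practice an object-store URI, e.g. s3://bucket/logs

# Step 1: a processed batch of log events (standing in for the output of a
# tool like vector); this is what ends up as Parquet in the store.
batch = pa.table({
    "ts": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:01Z"],
    "level": ["info", "error"],
    "message": ["service started", "upstream timed out"],
})

# Step 2: transactional enrollment -- write_deltalake writes the Parquet data
# files and commits them to the Delta transaction log atomically, which is
# what would let several pipelines append to the same table safely.
write_deltalake(TABLE_PATH, batch, mode="append")

# Step 3: anything that understands the format can read it back. Here the
# Delta log resolves the current set of files and DuckDB queries the result.
logs = DeltaTable(TABLE_PATH).to_pyarrow_table()
print(duckdb.sql("SELECT level, count(*) AS n FROM logs GROUP BY level"))
```

Step 4 would mostly fall out of the same choice: Spark (and, as far as I know, recent ClickHouse) can read Delta tables directly, so swapping or mixing query engines shouldn’t require touching the data.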
There’s a natural flow of outputs becoming inputs, and I’m struggling to identify a situation where I would feed things back into the source. Also, named pipes kind of solve that already.
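On the named-pipes point, here’s a tiny sketch of the mechanism as I understand it: a FIFO lets one process’s output be consumed as another process’s input without landing in a regular file first. The pipe name and the commands are made up for illustration, and it assumes a Unix-like system:

```python
# Illustrative only: route one process's output into another via a named pipe.
import os
import subprocess
import tempfile

fifo = os.path.join(tempfile.mkdtemp(), "loop.fifo")
os.mkfifo(fifo)  # create the named pipe

# Producer writes into the pipe; it runs in the background because opening a
# FIFO for writing blocks until a reader opens the other end.
producer = subprocess.Popen(["sh", "-c", f"printf 'a\\nb\\nc\\n' > {fifo}"])

# Consumer reads the pipe exactly as if it were an ordinary input file.
consumer = subprocess.run(["sh", "-c", f"wc -l < {fifo}"],
                          capture_output=True, text=True)
producer.wait()
print(consumer.stdout.strip())  # prints 3
```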
Everyone just seems to be running experiments independently and then randomly dropping some results, with basically no documentation. Sometimes the motivation is clearly VC money or paper exposure, but sometimes there is no apparent motivation... or even no model card. Then when something works, others copy the script.
Not that I don’t enjoy it. I find the sea of finetune generations fascinating.
The more rational voices in my mind, though, grow more and more afraid of a world where the only thing you can trust is the people sitting right in front of you. That makes the world of information pretty small again.