A small nitpick, but having loads of data that “most likely no-one will look at ever again” is OK to an extent when that data is there to diagnose incidents. It’s not useful most of the time, until it’s really, really useful. But it’s a matter of degree, and dumping the same information redundantly is pointless and infuriating.
This is one reason why it’s nice to create readable specs from telemetry, with traces/spans initiated from test drivers and passed through the stack (rather than trying to make natural language executable the way Cucumber does; that’s a lot of work and complexity for non-production code). Then our observability data gets looked at many times, to diagnose test failures, before there’s ever a production incident. And hopefully the attributes we add to diagnose tests are also useful for similar diagnostics in prod.
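A minimal sketch of the test-driver side of that idea, using OpenTelemetry’s Python API (the endpoint, payload, and `spec.*` attribute names are made up for illustration, not an established convention): the test opens the root span, records the scenario as attributes, and injects the trace context into the outgoing request so spans emitted by the services under test join the same trace.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.propagate import inject
import requests

# Export spans to stdout for the sketch; a real setup would point an OTLP
# exporter at whatever backend renders the "readable spec" view.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("spec-tests")

def test_checkout_applies_discount():
    # The test driver owns the root span; its attributes describe the scenario,
    # so the exported trace doubles as a readable record of what was exercised.
    with tracer.start_as_current_span("spec: checkout applies discount") as span:
        span.set_attribute("spec.given", "cart with 2 items and promo code SAVE10")
        span.set_attribute("spec.expected", "total reduced by 10%")

        headers = {}
        inject(headers)  # adds the W3C traceparent header so downstream spans attach to this trace

        resp = requests.post("http://localhost:8080/checkout",  # hypothetical service endpoint
                             json={"items": 2, "promo": "SAVE10"},
                             headers=headers)

        span.set_attribute("spec.actual_status", resp.status_code)
        assert resp.status_code == 200
```

When a test fails, you read the same kind of trace you’d read in production, and the span attributes tell you which step of the scenario diverged from the spec.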
GitHub here - hope the tool can help some folks in this thread: https://github.com/coroot/coroot
Event features:
- Experts from Google, Microsoft, Oracle, Qdrant, Manticore Search, Weaviate sharing real-world applications, best practices, and future directions in high-performance search and retrieval systems
- Presentations for all skill levels
- Live Q&A with industry leaders, plus virtual networking
A few of the presenting speakers:
- Gunjan Joyal (Google): “Indexing and Searching at Scale with PostgreSQL and pgvector – from Prototype to Production”
- Maxim Sainikov (Microsoft): “Advanced Techniques in Retrieval-Augmented Generation with Azure AI Search”
- Ridha Chabad (Oracle): “LLMs and Vector Search unified in one Database: MySQL HeatWave's Approach to Intelligent Data Discovery”
If you can’t make it live but want to learn from one of these talks, the sessions will also be recorded. Registration is free at https://vsearchcon.com/register/ - hope you learn something from the event!
I went spelunking around in the codebase trying to get the actual answer to your question, and it seems it's like many things: theoretically yes, with enough energy expended, but by default it SSHes into the target hosts and runs a pseudo-agent that talks its own protocol back over SSH. So, "no".
Then, after you go through all that effort, most of the data is utterly ignored, and the business insights are rarely much better than the trailer-park version: SSHing into a box and grepping a log file to find the error output.
We put so much effort into this ecosystem, but I don't think it has paid us back with any significant increase in uptime, performance, or ergonomics.
Coroot is an open-source project I'm working with to try to tackle this. It uses eBPF to automatically gather your data into a centralized service map, and then provides root-cause-analysis (RCA) insights (with things like mapped incident timeframes) to help you implement fixes more quickly and improve uptime.
GitHub here and we'd love any feedback if you think it can help: https://github.com/coroot/coroot