"All of this is made possible with the inclusion of frame pointers in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient)"
This makes things so, so, so much easier. Otherwise, a lot of effort has to be put into building an unwinder in eBPF code, essentially porting the .eh_frame CFA/RA/BP calculations.
They claim to have event profilers for non-native languages (e.g. python). Does this mean that they use something similar to https://github.com/benfred/py-spy ? Otherwise, it's not obvious to me how they can read python state.
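To make the frame-pointer point concrete, here is a minimal sketch of what "walking the stack" amounts to when every frame saves its caller's frame pointer. It is a toy simulation over a dict standing in for process memory, not Strobelight's code; the x86-64 style layout (saved frame pointer at `fp`, return address at `fp + 8`), the function name, and the addresses are all assumptions for illustration.

```python
from typing import Dict, List

WORD = 8  # bytes per pointer on a 64-bit target (assumption)


def walk_frame_pointers(memory: Dict[int, int], fp: int, max_depth: int = 64) -> List[int]:
    """Collect return addresses by following the saved-frame-pointer chain.

    Assumed layout: memory[fp] holds the caller's saved frame pointer and
    memory[fp + WORD] holds the return address into the caller.
    """
    return_addresses: List[int] = []
    for _ in range(max_depth):  # bounded loop, as an eBPF program would need
        if fp == 0 or fp not in memory or (fp + WORD) not in memory:
            break  # end of chain or unreadable memory
        return_addresses.append(memory[fp + WORD])
        next_fp = memory[fp]
        # The stack grows down, so a caller's frame must sit at a higher
        # address; anything else means the chain is corrupt.
        if next_fp != 0 and next_fp <= fp:
            break
        fp = next_fp
    return return_addresses


# Toy stack with three frames (made-up addresses).
memory = {
    0x1000: 0x1020, 0x1008: 0x400123,  # innermost frame
    0x1020: 0x1040, 0x1028: 0x400456,
    0x1040: 0x0,    0x1048: 0x400789,  # outermost frame, chain ends
}
stack = walk_frame_pointers(memory, 0x1000)
```

Each step is two pointer-sized reads, which is why it is so much cheaper than evaluating .eh_frame unwind rules per frame.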
Thanks! Those blogs are incredibly useful. Nice work on the profiler. :)
I have multiple questions if you don’t mind answering them:
Is there significant overhead to native unwinding and Python unwinding in eBPF? eBPF needs to constantly read and copy data from user space to inspect data structures.
I ask this because unwinding with frame pointers can be done by reading without copying in userland.
Python can be run with different engines (CPython, PyPy, etc.) and versions (3.7, 3.8, …), and compilers can reorganize offsets. Reading from fixed offsets seems handwavy to me. Does this work well in practice, and when has it failed?
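As an illustration of the concern about offsets, here is a hedged sketch of how a profiler might key struct offsets by runtime and version, and what goes wrong when the wrong table is used. The offset values, the field name `frame_code`, and the helper are all hypothetical; they are not real CPython layouts and not taken from any profiler.

```python
import struct
from typing import Dict, Tuple

Runtime = Tuple[str, int, int]  # (implementation, major, minor)

# Hypothetical, made-up offsets: NOT real CPython struct layouts.
FIELD_OFFSETS: Dict[Runtime, Dict[str, int]] = {
    ("cpython", 3, 7): {"frame_code": 32},
    ("cpython", 3, 8): {"frame_code": 40},
}


def read_pointer_field(snapshot: bytes, runtime: Runtime, field: str) -> int:
    """Read a little-endian 64-bit pointer from a copied chunk of process memory."""
    try:
        offset = FIELD_OFFSETS[runtime][field]
    except KeyError:
        # Unknown runtime or field: refuse rather than guess an offset.
        raise LookupError(f"no offset known for {runtime!r} / {field!r}") from None
    (value,) = struct.unpack_from("<Q", snapshot, offset)
    return value


# Fake 64-byte memory snapshot laid out the way the "3.8" table expects.
buf = bytearray(64)
struct.pack_into("<Q", buf, 40, 0xDEADBEEF)
snapshot = bytes(buf)

right = read_pointer_field(snapshot, ("cpython", 3, 8), "frame_code")
wrong = read_pointer_field(snapshot, ("cpython", 3, 7), "frame_code")
```

Here `right` is the stored pointer, while `wrong` silently reads zeroed bytes from the stale offset. That silent failure mode is what the question is getting at, and it is why the sketch refuses to fall back to a default when a runtime is missing from the table.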
I would assume the name is a reference to the use of strobes in examining high-speed periodic motion, like that in motors or on production lines, e.g.:
https://www.checkline.com/inspection_stroboscope
This is really cool! I've always thought that one thing preventing major competitors to AWS/Azure/GCP is the lack of easy-to-use tooling for machine level monitoring like this. When I was at Microsoft, we built a tool like this that used Windows Firewall filters to track all the network traffic between our services and it was incredibly useful for debugging.
That said, as with anything from Meta, I approach this with a grain of salt and the fact that I can't tell what they stand to gain from this makes me suspicious.
> the fact that I can't tell what they stand to gain from this makes me suspicious.
Meta is one of the biggest contributors to FOSS in the world. (React, PyTorch, Llama, …). They stand to gain what every big company does, a community contributing to their infra.
You’ll note that nobody is open sourcing their ad recommender; that is the one you should be skeptical about if you ever see it. You don’t share your secret sauce.
That's really cool. I only wish open source projects were this integrated. (Imagine if making a PR would estimate your AWS cost increase after running canary Kubernetes.)
Also, what's really cool to see is that Facebook's internal UI actually looks decent. I've never worked at a company anywhere close to that size, and the tooling always looked like it was puked up by a dog.
They have support for many languages: https://grafana.com/docs/pyroscope/latest/configure-client/l... (also based on eBPF).
C++ from Meta/FB is much more pleasant to read than code from ... other, older big tech companies. I appreciate that.
Lastly, the GitHub repo https://github.com/facebookincubator/strobelight is pretty barebones. I wonder when they'll update it.
1) native unwinding: https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...
2) python: https://www.polarsignals.com/blog/posts/2023/10/04/profiling...
Both available as part of the Parca open source project.
https://www.parca.dev/
(Disclaimer: I work on Parca and am the founder of Polar Signals.)
Actually... (2019) https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-l...
Source code:
https://github.com/facebookresearch/dlrm
Paper:
https://arxiv.org/abs/1906.00091
Updated 2023 blog post, solely about content recommendation, though ads recommendation is ~90% the same:
https://engineering.fb.com/2023/08/09/ml-applications/scalin...
It's a little out of date, but the internal one is built with the same concepts, just more advanced modeling techniques and data.
Seeing the title and the domain, I thought this was about user profiling, and I was wondering why Meta would be publishing this.
Perhaps a contributing factor is how HN shows only the final non-eTLD [0] label of the domain. If it showed all labels, you'd have seen "engineering.fb.com" which, while not a dead giveaway, implies that the problem space is technical.
It would be nice if this aggressive truncation were applied only above a certain threshold of length.
[0] https://en.wikipedia.org/wiki/Public_Suffix_List