IME, a very large number of impactful incidents aren't strictly tied to "a" code change, if any at all. It _feels_ like there's an implied solution here of tying the running version back to the deployment rev, the deployment artifacts, and VCS.
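A minimal sketch of what that tying-back could look like, assuming the CI/CD pipeline injects the commit SHA and a deploy ID at build time (the env var names here are hypothetical, not any particular deploy system's API):

```python
# Sketch: stamp the build with its VCS revision and deploy ID, then expose
# them at runtime so a running instance can be traced back to a commit and
# a deployment. BUILD_GIT_SHA / BUILD_DEPLOY_ID are hypothetical env vars
# your pipeline would need to inject.
import os

def build_info() -> dict:
    """Return the provenance of the running binary."""
    return {
        "git_sha": os.environ.get("BUILD_GIT_SHA", "unknown"),
        "deploy_id": os.environ.get("BUILD_DEPLOY_ID", "unknown"),
    }

# Serve this from a /version or health endpoint so incident tooling can
# answer "what exactly is running here?" without guessing.
if __name__ == "__main__":
    print(build_info())
```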
Boundary conditions and state changes in the distributed system were the biggest bugbear I ran into at AWS. Below that were all of the "infra" style failures like network faults, latency, API quota exhaustion, etc. And for all the CloudFormation/CDK/Terraform in the world, it's non-trivial to really discover those effects and tie them to a "code change." That's totally ignoring older tools that may be managed via the CLI or the ol' point and click.
At Meta, incidents almost always trace back to one of three big buckets:
- Code changes
- Configuration changes (this includes the equivalent of server topology changes, like CloudFormation updates or quota changes)
- Experimentation rollout changes
There have been issues that are external (like a user behavior change for New Year's Eve or a World Cup final, or a physical connection between datacenters being severed…), but they tend to be a lot less frequent.
All three big buckets are tied to a single trackable change with an ID, which is what enables this kind of automated root-cause analysis at scale.
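A minimal sketch of that change-correlation step, assuming a unified change log where code pushes, config changes, and experiment rollouts each land with an ID, a timestamp, and a scope (the schema and field names here are hypothetical):

```python
# Sketch: rank recent changes as suspects for an incident by correlating
# landing time and blast radius. The Change schema is illustrative only.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    change_id: str      # e.g. a diff number, config version, or gate ID
    bucket: str         # "code" | "config" | "experiment"
    landed_at: datetime
    scope: set[str]     # services/regions the change touched

def suspects(changes: list[Change], incident_start: datetime,
             impacted: set[str],
             window: timedelta = timedelta(hours=2)) -> list[Change]:
    """Return changes that landed shortly before the incident and
    overlap with the impacted services, most recent first."""
    candidates = [
        c for c in changes
        if incident_start - window <= c.landed_at <= incident_start
        and c.scope & impacted  # non-empty overlap with blast radius
    ]
    return sorted(candidates, key=lambda c: c.landed_at, reverse=True)
```

Because every bucket shares the same "one change, one ID" property, the same ranking loop works across code, config, and experiments without per-bucket special-casing.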
Now, Meta is mostly a closed loop where all the infra and product are controlled as one entity, so those results may not be applicable outside.
Let me know if I can be of any help. vjeuxx@gmail.com