The root cause of this incident was leadership driving velocity by cutting corners. That has been going on for years, and it finally went over the cliff.
This specific failure mode is known as query of death. A query triggers an existing bug that causes the server to crash. It is inevitable for C++ servers.
Service Control is in C++. It uses a comprehensive set of engineering guidelines to minimize and tolerate queries of death and other failure modes. Before this incident, it had gone a decade without a major one.
This incident is related to a new global quota policy. It was built quickly under leadership pressure, cutting corners. Such features should be built in a secondary service, or at least built following the established engineering guidelines.
The established engineering guidelines far exceed the action items mentioned in the report, and the team has been keeping up with that standard as much as it can.
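To make the failure mode concrete, here is a minimal, entirely hypothetical C++ sketch (not the actual Service Control code) of how a policy row with a blank field turns every request that touches it into a query of death, and what a defensive version looks like:

```cpp
#include <memory>
#include <string>

// Hypothetical policy record; the blank field from the report is modeled
// here as a pointer that was never populated.
struct QuotaPolicy {
  std::unique_ptr<std::string> limit_name;  // may be null if the column was blank
};

// Query of death: every request that evaluates this policy dereferences a
// null pointer and crashes the whole binary, not just the one request.
bool CheckQuota(const QuotaPolicy& policy) {
  return !policy.limit_name->empty();  // crashes if limit_name is null
}

// Defensive version: treat a malformed policy as "no decision" (fail open)
// instead of taking the server down with it.
bool CheckQuotaSafely(const QuotaPolicy& policy) {
  if (policy.limit_name == nullptr) return true;
  return !policy.limit_name->empty();
}
```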
You can't pin this entirely on leadership; allowing global-blast-radius deployments without an extreme level of scrutiny is a failure of engineering culture.
At the very least the global policy should have been deployed prior to the regional service control deployments.
Engineering culture needs leadership and executive support to succeed and thrive. Blaming the culture for failures that stem from top-down mandates and directives is unfair, because if the grunts don't follow orders they'll be given poor performance reviews or expediently managed out (a.k.a. fired).
this is really amateur level stuff: NPEs, no error handling, no exponential backoff, no test coverage, no testing in staging, no gradual rollout, fail deadly
Nearly every global outage at Google has looked vaguely like this. I.e. a bespoke system that rapidly deploys configs globally gets a bad config.
All the standard tools for binary rollouts and config pushes will typically do some kind of gradual rollout.
In some ways Google Cloud had actually greatly improved the situation since a bunch of global systems were forced to become regional and/or become much more reliable. Google also used to have short global outages that weren't publicly remarked on (at the time, if you couldn't connect to Google, you assumed your own ISP was broken), so this event wasn't as rare as you might think. Overall I don't think there is a worsening trend unless someone has a spreadsheet of incidents proving otherwise.
> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds
If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag that “quota policy changes must use the standard gradual rollout tooling.”
The baseline can’t be “the trend isn’t worsening;” the baseline should be that if global config rollouts are commonly the cause of problems, there should be increasingly elevated standards for when config systems can bypass best practices. Clearly that didn’t happen here.
As an outsider, my quick guess is that at some point, after enough layoffs and the CEO accusing everyone of being lazy, people started focusing on speed/perceived output over quality. After a while the culture shifts so that if you block such things, you're the problem and will be ostracized.
As an outsider, what I perceive is quite different:
HN likes to pretend that FAANG is the pinnacle of existence. The best engineers, the best standards, the most “that wouldn’t have happened here,” the yardstick by which all companies should be measured for engineering prowess.
Incidents like this happening repeatedly reveal that that's mostly a myth. They aren't much smarter, their standards are somewhat wishful thinking, and their accomplishments are mostly rooted in the problems they needed to solve, just like any other company.
IMO, it's that any defense, human or automated, is imperfect, and life is a series of tradeoffs.
You can write as many unit tests as you want, and integration tests that your system works as you expect on sample data, static analysis to scream if you're doing something visibly unsafe, staged rollout from nightly builds to production, and so on and so on, but eventually, at large enough scale, you're going to find a gap in those layered safety measures, and if you're unlucky, it's going to be a gap in all of them at once.
It's the same reasoning from said book as why getting another nine is always going to involve much more work than the previous ones - eventually you're doing things like setting up complete copies of your stack running stable builds from months ago and replaying all the traffic to them in order to be able to fail over to them on a moment's notice, meaning that you also can't roll out new features until the backup copies support it too, and that's a level of cost/benefit that nobody can pay if the service is large enough.
When working on OpenZFS, a number of bugs have come from things like "this code in isolation works as expected, but an edge case we didn't know about in data written 10 years ago came up", or "this range of Red Hat kernels from 3 Red Hat releases ago has buggy behavior, and since we test on the latest kernel of that release, we didn't catch it".
Eventually, if there's enough complexity in the system, you cannot feasibly test even all the variation you know about, so you make tradeoffs based on what gets enough benefit for the cost.
(I'm an SRE at Google, not on any team related to this incident, all opinions unofficial/my own, etc.)
...the constant familiarity with even the most dangerous instruments soon makes men lose their first caution in handling them; they readily, therefore, come to think that the rules laid down for their guidance are unnecessarily strict - report on the explosion of a gunpowder magazine at Erith, 1864
I wish they would share more details here. Your take isn't fully correct. There was testing, just not for the bad input (the blank fields in the policy). They also didn't say there was no testing in staging, just that a flag would have caught it.
At Google scale, if their standards were not sky high, such incidents would be happening daily. That it happens once in a blue moon indicates that they are really meticulous with all those processes and safeguards almost all the time.
- You do know their AZs are just firewalls across the same datacenter?
- And they used machines without ECC and their index got corrupted because of it? And instead of hanging their heads in shame and taking lessons from IBM old-timers, they published a paper about it?
- What really accelerated the demise of Google+ was that an API issue allowed the harvesting of private profile fields for millions of users, and they hid that for months fearing the backlash...
Don't worry, you will have plenty more outages from the land of "we only hire the best"...
I wonder if machines without ECC could perhaps explain why our apps periodically see TCP streams with scrambled contents.
On GKE, we see different services (like Postgres and NATS) running on the same VM in different containers receive/send stream contents (e.g. HTTP responses) where the packets of the stream have been mangled with the contents of other packets. We've been seeing it since 2024, and all the investigation we've done points to something outside our apps and deeper in the system. We've only seen it in one Kubernetes cluster, and it lasts 2-3 hours and then magically resolves itself; draining the node also fixes it.
If there are physical nodes with faulty RAM, I bet something like this could happen. Or there's a bug in their SDN or their patched version of the Linux kernel.
That book was written when the company had about 40% of the engineers it had by the time I left a couple of years ago (not sure how many now, with the layoffs). I'm guessing those newer hires haven't read it yet. So yeah, reads like standards slipping to me.
Google’s standards, and from what I can tell, most FAANG standards are like beauty filters on Instagram. Judging yourself, or any company, against them is delusional.
I work on Cloud, but not this service. In general:
- All the code has unit tests and integration tests
- Binary and config file changes roll out slowly job by job, region by region, typically over several days. Canary analysis verifies these slow rollouts.
- Even panic rollbacks are done relatively slowly to avoid making the situation worse, for example by globally overloading databases with job restarts. A 40-minute outage is better than a 4-hour outage.
I have no insider knowledge of this incident, but my read of the PM is: The code was tested, but not this edge case. The quota policy config is not rolled out as a config file, but by updating a database. The database was configured for replication which meant the change appeared in all the databases globally within seconds instead of applying job by job, region by region, like a binary or config file change.
I agree on the frustration with null pointers, though if this was a situation the engineers thought was impossible it could have just as likely been an assert() in another language making all the requests fail policy checks as well.
Rewriting a critical service like this in another language seems way higher risk than making sure all policy checks are flag guarded, that all quota policy checks fail open, and that db changes roll out slowly region by region.
Disclaimer: this is all unofficial and my personal opinions.
That's fair, though `if (isInvalidPolicy) reject();` causes the same outage. So the eng process policy change seems to be failing open and slow rollouts to catch that case too.
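A rough sketch of what that combination might look like, with hypothetical names (this is not how Service Control is actually structured): the new check sits behind a gradually ramped flag, and any policy it cannot interpret yields "allow" rather than a rejection or a crash:

```cpp
#include <optional>
#include <string>

// Hypothetical parsed policy and result types, for illustration only.
struct ParsedQuotaPolicy { std::string limit_name; long long limit; };
struct CheckResult { bool allowed; std::string reason; };

CheckResult EvaluateQuotaCheck(const std::optional<ParsedQuotaPolicy>& policy,
                               bool new_quota_check_enabled /* rollout flag */) {
  // Flag guard: until the flag has been ramped region by region,
  // the new code path is simply not exercised.
  if (!new_quota_check_enabled) return {true, "new check disabled by flag"};

  // Fail open: a policy that failed to parse (e.g. blank fields) must not
  // reject traffic or crash; it is ignored and reported elsewhere.
  if (!policy.has_value()) return {true, "malformed policy ignored (fail open)"};

  return {policy->limit > 0, "quota evaluated"};
}
```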
How is the fact that it was a database change and not a binary or a config change supposed to make it OK? A change is a change; global changes that go everywhere at once are a recipe for disaster, no matter what kind of change we're talking about. This is a second CrowdStrike.
This is the core point. A canary deployment that is not preceded by deploying data that exercises the relevant code path in the binary in question will prove nothing useful at all, while promoting a false sense of security.
The incident report is interesting. Fast reaction time by the SRE team (2 minutes), then the "red button" rollout. But then "Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load."
In my experience this happens more often than not: In an exceptional situation like a recovery of many nodes quotas that make sense in regular operations get exceeded quickly and you run into another failure scenario. As long as the underlying infrastructure can cope with it, it's good if you can disable quotas temporarily and quickly. Or throttle the recovery operations that naturally take longer in that case.
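For reference, the "randomized exponential backoff" the report says was missing is a small amount of code. A generic sketch (illustrative, not Google's implementation) with full jitter, so restarting tasks spread their retries out instead of hitting the backing store in lockstep:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

// Retry an operation with capped exponential backoff and full jitter, so a
// herd of restarting tasks does not synchronize its retries.
template <typename Fn>
bool RetryWithJitteredBackoff(Fn attempt, int max_attempts = 8) {
  std::mt19937_64 rng(std::random_device{}());
  const auto base = std::chrono::milliseconds(100);
  const auto cap = std::chrono::milliseconds(30000);
  for (int i = 0; i < max_attempts; ++i) {
    if (attempt()) return true;  // e.g. a read of the policy table
    const auto ceiling = std::min(cap, base * (1 << std::min(i, 8)));
    std::uniform_int_distribution<int64_t> jitter(0, ceiling.count());
    std::this_thread::sleep_for(std::chrono::milliseconds(jitter(rng)));
  }
  return false;
}
```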
(Throwaway since I was part of a related team a while back)
Service Control (Chemist) is a somewhat old service, been around for about a decade, and is critical for a lot of GCP APIs for authn, authz, auditing, quota etc. Almost mandated in Cloud.
There's a proxy in the path of most GCP APIs, that calls Chemist before forwarding requests to the backend. (Hence I don't think fail open mitigation mentioned in post-mortem will work)
Both Chemist and the proxy are written in C++, and have picked up a ton of legacy cruft over the years.
The teams have extensive static analysis & testing, gradual rollouts, feature flags, red buttons and strong monitoring/alerting systems in place. The SREs in particular are pretty amazing.
Since Chemist handles a lot of policy checks like IAM, quotas, etc., other teams involved in those areas have contributed to the codebase. Over time, shortcuts have been taken so those teams don’t have to go through Chemist's approval for every change.
However, in the past few years the organization has seen a lot of churn, and a lot of offshoring too, which has led to a bigger focus on flashy new projects led by L8/L9s to justify headcount instead of prioritizing quality, maintenance, and reliability. This shift has contributed to a drop in quality standards and increased pressure to ship things out faster (and is one of the reasons I ended up leaving Cloud).
Also many of the servers/services best practices common at Google are not so common here.
That said, in this specific case, it seems like the issue is more about lackluster code and code review. (iirc code was merged despite some failures). And pushing config changes instantly through Spanner made it worse.
> This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop.
Another example of Hoare’s “billion-dollar mistake” in multiple Google systems:
- Why is it possible to insert unintended “blank fields” (nulls)? The configuration should have a schema type that doesn’t allow unintended nulls. Unfortunately Spanner itself is SQL-like and so fields must be declared NOT NULL explicitly, the default is nullable fields.
- Even so, the program that manages these policies will have its own type system and possibly an application level schema language for the configuration. This is another opportunity to make invalid states unrepresentable.
- Then in Service Control, there’s an opportunity to enforce “schema on read” as you deserialize policies from the data store into application objects; again, either a programming language type or an application level schema could be used to validate that policy rows have the expected shape before they leave the data layer. Perhaps the null pointer error occurred in this layer, but since this issue occurred in a new code path, it sounds more likely the invalid data escaped the data layer into application code.
- Finally, the Service Control application is written in a language that allows for null pointer references.
If I were a maintainer of this system, the minimally invasive change I would be thinking about is how to introduce an application level schema to the policy writer and the policy reader that uses a “tagged enum type” or “union type” or “sum type” to represent policies that cannot express null. Ideally each new kind of policy could be expressed as a new variant added to the union type. You can add this in app code without rewriting the whole program in a safe language. Unfortunately it seems proto3, Google’s usual schema language, doesn’t have this constraint.
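A hedged sketch of that idea in C++ terms (hypothetical types, not the real policy schema): the variant can only be produced by a parse step, so a row with blank, nullable-by-default columns either becomes a fully populated policy or an explicit parse failure, and nothing downstream ever sees a null:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Raw row shape: every column is optional, mirroring nullable-by-default SQL.
struct PolicyRow {
  std::optional<std::string> kind;
  std::optional<std::string> metric;
  std::optional<int64_t> limit_per_minute;
};

// The sum type: each alternative is a fully-validated kind of policy.
// Adding a new kind of policy means adding a new alternative here.
struct RateLimitPolicy { std::string metric; int64_t limit_per_minute; };
struct RegionAllowlistPolicy { std::vector<std::string> regions; };
using QuotaPolicy = std::variant<RateLimitPolicy, RegionAllowlistPolicy>;

// "Schema on read": blank or unknown fields are rejected here, at the data
// layer, instead of surfacing as null pointers deep inside serving code.
std::optional<QuotaPolicy> ParsePolicy(const PolicyRow& row) {
  if (row.kind == "RATE_LIMIT" && row.metric && row.limit_per_minute) {
    return QuotaPolicy{RateLimitPolicy{*row.metric, *row.limit_per_minute}};
  }
  // Other kinds elided; anything else is an explicit parse failure.
  return std::nullopt;
}
```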
Google post-mortems never cease to amaze me, from seeing them inside the company to outside. The level of detail is amazing. The thing is, they will never make the same mistake again. They learn from it, put in the correct protocols and error handling, and then create an even more robust system. At the scale of Google there is always something going wrong; the point is how it is handled so it doesn't affect the customer/user and other systems. Honestly, it's an ongoing thing you don't see unless you're inside, and even then, on a per-team basis, you might see things no one else is seeing. It is probably the closest we're going to come to the most complex systems in the universe, because we as humans will never do better than this. Maybe AGI does, but we won't.
But this is a whole series of junior level mistakes:
* Not dealing with null data properly
* Not testing it properly
* Not having test coverage showing your new thing is tested
* Not exercising it on a subset of prod after deployment to show it works without falling over before it gets pushed absolutely everywhere
Standards in this industry have dropped over the years, but by this much? If you had done this 10 years ago as a Google customer for something far less critical everyone on their side would be smugly lolling at you, and rightly so.
This is _hardly_ a "junior level mistake". That kind of bug is pervasive in all the languages they're likely using for this service (Go, Java, C++) written even by the most "senior" developers.
As I understand it the outage was caused by several mistakes:
1) A global feature release that went everywhere at the same time
2) Null pointer dereference
3) Lack of appropriate retry policies that resulted in a thundering herd problem
All of these are absolutely standard mistakes that everyone who's worked in the industry for some time has seen numerous times. There is nothing novel here, no weird distributed-systems logic, no Google scale, just rookie mistakes all the way.
They rolled out a change without feature flagging, didn’t implement exponential backoffs in the clients, didn’t implement load shedding in the servers.
This is all in the google SRE book from many years ago.
For a company the size and quality of Google to be bringing down the majority of their stack with this type of error really suggests they do not implement appropriate mitigations after serious issues.
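On the load-shedding point, a minimal illustrative sketch (again, not Google's actual mechanism): cap in-flight work and reject the excess cheaply, so an overloaded task degrades instead of falling over and amplifying the restart herd:

```cpp
#include <atomic>

// Simple concurrency-cap load shedder: admit up to max_in_flight requests,
// shed the rest immediately with a cheap "overloaded, retry later" answer.
class LoadShedder {
 public:
  explicit LoadShedder(int max_in_flight) : max_(max_in_flight) {}

  bool TryAcquire() {
    if (in_flight_.fetch_add(1, std::memory_order_acq_rel) >= max_) {
      in_flight_.fetch_sub(1, std::memory_order_acq_rel);
      return false;  // shed: caller should answer UNAVAILABLE / 503
    }
    return true;
  }
  void Release() { in_flight_.fetch_sub(1, std::memory_order_acq_rel); }

 private:
  const int max_;
  std::atomic<int> in_flight_{0};
};
```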
This is literally the same mistake that has been made many times before. Of course it will be made again. “New feature is rolled out carefully with a bug that remains latent until triggered by new data” could summarize most global outages.
The thing is, nobody is perfect. Except armchair HN commenters on threads about FAANG outages, of course.
I read their SRE books, all of this stuff is in there: https://sre.google/sre-book/table-of-contents/ https://google.github.io/building-secure-and-reliable-system...
have standards slipped? or was the book just marketing
Opinions are my own.
But if you change a schema, be it DB, protobuf, whatever, this is the major thing your tests should be covering.
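For instance, reusing the hypothetical PolicyRow/ParsePolicy sketch from earlier in the thread, the regression test for exactly this class of change is tiny (GoogleTest-style, illustrative only):

```cpp
#include <gtest/gtest.h>

// A row with every column blank must be rejected at parse time and must not
// crash or cause requests to be denied; this is the missing-field edge case
// the postmortem describes.
TEST(QuotaPolicySchemaTest, BlankFieldsAreRejectedGracefully) {
  PolicyRow blank_row;  // all optional columns left unset
  EXPECT_FALSE(ParsePolicy(blank_row).has_value());
}
```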
This is why people are so amazed by it.
Asserts are much easier to forbid by policy than null pointer dereferences are.
so... it wasn't tested
So like, the requirements are unknown? Or this service isn't critical enough to staff a careful migration?
A better fix for that recovery overload is to quickly spread the load to backup databases that already exist. There are other options too.
We must be at the trillion dollar mistake by now, right?
Example of a schema language that does support this kind of constraint: https://github.com/stepchowfun/typical
This is 100% a process problem.
Amateur hour in Mountain View.