You probably cannot achieve this with a single node, so you'll at least need to replicate it a few times to get past the 2-3 9s you typically get from a single node. But then you've got load balancers and DNS, which can also become single points of failure, as seen with Cloudflare.
It varies depending on the database type and deployment. With a single node of Postgres you can likely never achieve more than 2-3 9s (AWS guarantees 3 9s for multi-AZ RDS). With a multi-master setup like CockroachDB, or with Spanner, you can maybe achieve 5 9s on the database layer alone. But end-to-end 5 9s means quite a bit of redundancy in every layer between your users and your data, with the database and DNS being the most difficult.
A reliable DNS provider with 5 9s of uptime guarantees -> multi-master load balancers, each with 3 9s -> each load balancer serving 3 or more app instances, each with 3 9s of availability -> a database (or databases) with 5 9s.
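To make the math concrete, here is a rough sketch (assuming independent failures, which is optimistic, and illustrative SLA numbers): n redundant replicas with availability a give 1 - (1 - a)^n, and serial layers multiply.

    import math

    # Availability of n redundant replicas, assuming independent failures
    # (optimistic: correlated failures are the usual killer).
    def parallel(a: float, n: int) -> float:
        return 1 - (1 - a) ** n

    # Convert an availability into a "number of nines".
    def nines(a: float) -> float:
        return -math.log10(1 - a)

    dns  = 0.99999              # DNS provider, five 9s
    lbs  = parallel(0.999, 2)   # two multi-master LBs, three 9s each
    apps = parallel(0.999, 3)   # three app instances, three 9s each
    db   = 0.99999              # database layer, five 9s

    # Serial layers multiply: every layer has to be up at once.
    end_to_end = dns * lbs * apps * db
    print(f"{end_to_end:.6f} (~{nines(end_to_end):.1f} nines)")
    # -> roughly 0.999979, i.e. ~4.7 nines: even this chain falls just
    #    short of five 9s, dragged down by the two serial five-9s layers.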
This page from Google shows their uptime guarantees for Bigtable: 3 9s for a single cluster in one region, 4 9s for multi-cluster, and 5 9s for multi-region:
https://docs.cloud.google.com/architecture/infra-reliability...
In general it doesn't really matter what you're running; it's all about redundancy, whether that's instances, cloud vendors, regions, zones, etc.
Instead of focusing on the technical reasons why, they should answer how such a change was able to bubble out and cause such a massive impact.
Why: The proxy failed requests
Why: Handlers crashed because of OOM
Why: ClickHouse returned too much data
Why: A change was introduced that doubled the amount of data
Why: A central change was rolled out immediately to all clusters (single point of failure)
Why: There are exemptions to the standard operating procedure (no gate) for releasing changes to the hot path of Cloudflare's network infra
While the ClickHouse change is important, I personally think it is crucial that Cloudflare tackles the process, and gates / controls rollouts for hot-path systems no matter what kind of change it is; at their scale that should be possible. But that is probably enough back-seat driving. To me it seems like a process issue more than a technical one.
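For what it's worth, such a gate doesn't have to be heavyweight. A minimal sketch (all names here are hypothetical, nothing to do with Cloudflare's actual tooling): deploy to a small fraction of clusters, let metrics bake, and only widen if the canaries stay healthy.

    import time

    STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of clusters per stage
    BAKE_SECONDS = 15 * 60            # soak time between stages

    def deploy(change_id: str, clusters: list[str]) -> None:
        print(f"deploying {change_id} to {clusters}")  # stub

    def rollback(change_id: str, clusters: list[str]) -> None:
        print(f"rolling back {change_id} from {clusters}")  # stub

    def healthy(clusters: list[str]) -> bool:
        # Stub: in reality, check error rates / OOM counts for these clusters.
        return True

    def rollout(change_id: str, clusters: list[str]) -> None:
        deployed: list[str] = []
        for fraction in STAGES:
            target = clusters[: max(1, round(len(clusters) * fraction))]
            batch = [c for c in target if c not in deployed]
            deploy(change_id, batch)
            deployed += batch
            time.sleep(BAKE_SECONDS)  # let metrics accumulate
            if not healthy(deployed):
                rollback(change_id, deployed)
                raise RuntimeError(f"{change_id} failed at {fraction:.0%}")

Even a crude version of the first stage would have limited the blast radius to a handful of clusters instead of the whole network.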