One of the most satisfying feature degradation steps I did with FastComments was making it so that if the DB went offline completely, the app would still function:
1. It auto restarts all workers in the cluster in "maintenance mode".
2. A "maintenance mode" message shows on the homepage.
3. The top 100 pages by comment volume will still render their comment threads, as a job on each edge node recalculates and stores this on disk periodically.
4. Logging in is disabled.
5. All db calls to the driver are stubbed out with mocks to prevent crashes.
6. Comments can still be posted and are added into an on-disk queue on each edge node.
7. When the system is back online the queue is processed (and stuff checked for spam etc like normal).
It's not perfect, but it means that in a lot of cases I can completely turn off the DB for a few minutes without panic. I haven't had to use it in over a year, though, since the DB doesn't really go down. But it's useful for upgrades.
built it on my couch during a Jurassic park marathon :P
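For anyone curious what points 5-7 could look like mechanically, here's a rough sketch in TypeScript (not FastComments' actual code; the path and names are invented): while the DB is down, comments are appended to an on-disk queue on the edge node, and once it's back they're replayed through the normal spam-checked save path.

    // Hypothetical sketch of the queue-and-replay idea from points 6-7.
    import { appendFileSync, existsSync, readFileSync, writeFileSync } from "fs";

    const QUEUE_PATH = "/var/lib/app/comment-queue.ndjson"; // hypothetical location

    interface PendingComment { pageId: string; author: string; body: string; postedAt: number; }

    // Assumed to exist elsewhere: the normal save path (spam checks, DB write, notifications).
    declare function saveCommentNormally(c: PendingComment): void;

    export function postComment(c: PendingComment, dbIsUp: boolean): { queued: boolean } {
      if (!dbIsUp) {
        appendFileSync(QUEUE_PATH, JSON.stringify(c) + "\n"); // durable, one JSON object per line
        return { queued: true };
      }
      saveCommentNormally(c);
      return { queued: false };
    }

    export function drainQueue(): void {
      if (!existsSync(QUEUE_PATH)) return;
      const lines = readFileSync(QUEUE_PATH, "utf8").split("\n").filter(Boolean);
      for (const line of lines) {
        saveCommentNormally(JSON.parse(line)); // spam checks etc. run like normal on replay
      }
      writeFileSync(QUEUE_PATH, ""); // clear the queue after replay
    }

Appending one JSON object per line keeps the replay logic trivial.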
Joining Google a few years ago, one thing I was impressed with is the amount of effort that goes into graceful degradation. For user facing services it gets quite granular, and is deeply integrated into the stack – from application layer to networking.
Previously I worked on a big web app at a growing startup, and it's probably the sort of thing I'd start adding in small ways from the early days. Being able to turn off unnecessary writes, turn down the rate of more expensive computation, turn down rates of traffic amplification, these would all have been useful levers in some of our outages.
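To make "levers" concrete, here's a minimal sketch of the kind of thing I mean (all names invented, not any particular company's system): a handful of runtime flags an operator can flip during an incident without a deploy.

    // Hypothetical degradation levers, tweakable at runtime via a config/flag service.
    const levers = {
      nonCriticalWrites: true,          // e.g. analytics counters, "last seen" timestamps
      expensiveComputeSampleRate: 1.0,  // fraction of requests that run the costly path
    };

    function recordAnalytics(event: string): void {
      if (!levers.nonCriticalWrites) return; // lever: drop unnecessary writes under load
      // ...write to the analytics store...
    }

    function shouldRunExpensiveRanking(): boolean {
      // lever: probabilistically turn down expensive computation
      return Math.random() < levers.expensiveComputeSampleRate;
    }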
It's really great to have such capabilities, but adding them has a cost that only a few can afford: the investment in building them cuts into your feature velocity, and then there's the ongoing maintenance.
Can you be specific about the cost of building these?
I've run into many situations where something was deemed too costly, the need became apparent later, and the team ultimately had to implement it anyway, all while hoping no one groks that it was predicted. "Nobody ever gets credit for fixing problems that never happened" (https://news.ycombinator.com/item?id=39472693) is related.
So at my previous place we had a monolith with roughly 700 different URL handlers. Most of the problem with things like this was understanding what they all did.
Applying rate limiting, selective dropping of traffic, even just monitoring things by how much they affect the user experience, all require knowing what each one is doing. Figuring that out for one takes very little time. Figuring it out for 700 made it a project we'd never do.
The way I'd start with this is just by tagging things as I go. I'd build a lightweight way to attach a small amount of metadata to URL handlers/RPC handlers/GraphQL resolvers/whatever, and I'd decide a few facts to start with about each one – is it customer facing, is it authenticated, is it read or write, is it critical or nice to have, a few things like that. Then I'd do nothing else. That's probably a few hours of work, and would add almost no overhead.
Now when it comes to needing something like this, you've got a base of understanding of the system to start from. You can incrementally use these, you can incrementally enforce that they are correct through other analysis, but the point is that I think it's low effort as a starting point with a potentially very high payoff.
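As a sketch of what that tagging could look like (every name here is made up for illustration):

    type HandlerMeta = {
      customerFacing: boolean;
      authenticated: boolean;
      kind: "read" | "write";
      criticality: "critical" | "nice-to-have";
    };

    const registry = new Map<string, HandlerMeta>();

    // Wrap however you normally register handlers, recording the metadata as you go.
    function route(path: string, meta: HandlerMeta, handler: (req: unknown) => unknown) {
      registry.set(path, meta);
      return handler; // hand off to your actual framework here
    }

    declare function getComments(req: unknown): unknown;
    declare function exportAccountData(req: unknown): unknown;

    route("/comments", { customerFacing: true, authenticated: false, kind: "read", criticality: "critical" }, getComments);
    route("/export", { customerFacing: true, authenticated: true, kind: "read", criticality: "nice-to-have" }, exportAccountData);

    // Later, an overload policy can act on the tags instead of on 700 special cases:
    function shouldShed(path: string, overloaded: boolean): boolean {
      return overloaded && registry.get(path)?.criticality === "nice-to-have";
    }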
Facebook makes over 300 requests for me just loading the main logged in page while showing me exactly 1 timeline item. Hovering my mouse over that item makes another 100 requests or so. Scrolling down loads another item at the cost of over 100 requests again. It's impressive in a perverse way just how inefficient they can be while managing to make it still work, and somewhat disturbing that their ads bring in enough money to make them extremely profitable despite it.
I can't comment on the numbers, but think of how many engineers work there and how many users Facebook, Whatsapp, Instagram have. Each engineer is adding new features and queries every day. You're going to get a lot of queries.
We’ve really wasted an incredible amount of talent-hours over the last couple of decades. Imagine if we’d worked on, like, climate change or something instead of ad platforms.
I think about 10 years ago when I was working there I checked the trace to load my own homepage. Just one page, just for myself, and there were 100,000 data fetches.
> Am I reading the second figure right? Facebook can do 130*10^6 queries/second == 130,000,000 queries/second?!
That sounds totally plausible to me.
Also keep in mind they didn't say what system this is. It's often true that 1 request to a frontend system becomes 1 each to 10 different backend services owned by different teams and then 20+ total to some database/storage layer many of them depend on. The qps at the bottom of the stack is in general a lot higher than the qps at the top, though with caching and static file requests and such this isn't a universal truth.
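Purely made-up numbers, just to show how the amplification multiplies out:

    2,000,000 frontend requests/s
      x 10 backend calls per frontend request  =  20,000,000 backend requests/s
      x ~6.5 storage ops per backend call      = ~130,000,000 storage queries/s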
This is how information slowly changes. The original numbers from facebook needed to be taken with a grain of salt. 2 billion a day raises it more.
Facebook claims to have 2 billion accounts, but nowhere near 2 billion unique accounts. I don't know what Facebook calls an active user, but it used to mean logging in at least once in the past 30 days.
- The comment is referring to a graph that used 10^6 on the vertical axis, which is a very common way to format graphs with large numbers (not just "lately"). It's also the default for a lot of plotting libraries.
- 10^n is more compact than million/billion/etc, more consistent, easier to translate, and doesn't suffer from regional differences (e.g. "one billion" is a different number in Britain than in the US).
I'm not saying it's clearly better than "million" in this specific case, but it's definitely not clearly worse.
A custom JIT + language + web framework + DB + queues + orchestrator + hardware built to your precise specifications + DCs all over the world go a long way ;)
Off-topic but: I love the font on the website. At first I thought it was the classic Computer Modern font (used in LaTeX). But nope. Upon inspection of the stylesheet, it's https://edwardtufte.github.io/et-book/ which is a font designed by Dmitry Krasny, Bonnie Scranton, and Edward Tufte. The font was originally designed for Tufte's book Beautiful Evidence. But people showed interest in the font, see the bulletin board on ET's website: https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=... Initially he was reluctant to go to the trouble of releasing it digitally, but eventually he did make it available on GitHub.
They're the same thing or close to it. "Load shedding" might be a bit more general. A couple possible nuances:
* Perhaps "graceful feature degradation" as a choice of words is a way of noting there's immediate user impact (but less than ungracefully running out of capacity). "Load shedding" could also mean something less impactful, for example some cron job that updates some internal dashboard skipping a run.
* "feature degradation" might focus on how this works at the granularity of features, where load shedding might mean something like dropping request hedges / retries, or individual servers saying they're overloaded and the request should go elsewhere.
The situation is that A depends on B, but B is overloaded; if we allow B to do load shedding, we must also write A to gracefully degrade when B is not available.
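A minimal sketch of that A-and-B shape (hypothetical service and cache names): B says it's overloaded, and A falls back to a stale or empty result instead of failing the whole page.

    const staleRecsCache = new Map<string, string[]>();

    // A (this service) depends on B (an invented recommendations service).
    async function fetchRecommendations(userId: string): Promise<string[]> {
      try {
        const res = await fetch(`https://recs.internal/users/${userId}`); // service B
        if (res.status === 429 || res.status === 503) {
          return degraded(userId); // B is shedding load: degrade gracefully
        }
        const recs = (await res.json()) as string[];
        staleRecsCache.set(userId, recs);
        return recs;
      } catch {
        return degraded(userId); // B unreachable: same degraded path
      }
    }

    function degraded(userId: string): string[] {
      return staleRecsCache.get(userId) ?? []; // render the page without the feature
    }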
Defcon: Preventing Overload with Graceful Feature Degradation - https://news.ycombinator.com/item?id=36923049 - July 2023 (1 comment)
https://github.com/fluxninja/aperture
We built a similar tool at Netflix but the degradations could be both manual and automatic.
The manual part of Defcon is more "holy crap, we lost a datacenter and the whole site is melting, turn stuff off to bring the load down ASAP"
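For flavor, a toy sketch of a manual switch like that (levels and feature names invented here): an operator sets one global level, and features check whether they're still allowed to run.

    type DefconLevel = 1 | 2 | 3; // 3 = normal operation, 1 = "the site is melting"

    let currentLevel: DefconLevel = 3; // flipped by an operator tool (or automation)

    // The lowest level at which each feature stays enabled.
    const keepEnabledDownTo: Record<string, DefconLevel> = {
      "core-playback": 1,            // keep even in the worst case
      "recommendations": 2,          // drop when things get bad
      "artwork-personalization": 3,  // first thing to go
    };

    function featureEnabled(name: keyof typeof keepEnabledDownTo): boolean {
      return currentLevel >= keepEnabledDownTo[name];
    }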
* Perhaps "graceful feature degradation" as a choice of words is a way of noting there's immediate user impact (but less than ungracefully running out of capacity). "Load shedding" could also mean something less impactful, for example some cron job that updates some internal dashboard skipping a run.
* "feature degradation" might focus on how this works at the granularity of features, where load shedding might mean something like dropping request hedges / retries, or individual servers saying they're overloaded and the request should go elsewhere.
The situation is that A depends on B, but B is overloaded; if we allow B to do load shedding, we must also write A to gracefully degrade when B is not available.
Deleted Comment