One of the most satisfying feature degradation steps I did with FastComments was making it so that if the DB went offline completely, the app would still function:
1. It auto restarts all workers in the cluster in "maintenance mode".
2. A "maintenance mode" message shows on the homepage.
3. The top 100 pages by comment volume will still render their comment threads, as a job on each edge node recalculates and stores this on disk periodically.
4. Logging in is disabled.
5. All db calls to the driver are stubbed out with mocks to prevent crashes.
6. Comments can still be posted and are added into an on-disk queue on each edge node.
7. When the system is back online the queue is processed (and stuff checked for spam etc like normal).
It's not perfect, but it means that in a lot of cases I can completely turn off the DB for a few minutes without panic. I haven't had to use it in over a year, though, since the DB doesn't really go down. But it's useful for upgrades.
built it on my couch during a Jurassic park marathon :P
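For anyone curious what points 5-7 could look like mechanically, here's a rough sketch in TypeScript (not FastComments' actual code; the path and names are invented): while the DB is down, comments are appended to an on-disk queue on the edge node, and once it's back they're replayed through the normal spam-checked save path.

    // Hypothetical sketch of the queue-and-replay idea from points 6-7.
    import { appendFileSync, existsSync, readFileSync, writeFileSync } from "fs";

    const QUEUE_PATH = "/var/lib/app/comment-queue.ndjson"; // hypothetical location

    interface PendingComment { pageId: string; author: string; body: string; postedAt: number; }

    // Assumed to exist elsewhere: the normal save path (spam checks, DB write, notifications).
    declare function saveCommentNormally(c: PendingComment): void;

    export function postComment(c: PendingComment, dbIsUp: boolean): { queued: boolean } {
      if (!dbIsUp) {
        appendFileSync(QUEUE_PATH, JSON.stringify(c) + "\n"); // durable, one JSON object per line
        return { queued: true };
      }
      saveCommentNormally(c);
      return { queued: false };
    }

    export function drainQueue(): void {
      if (!existsSync(QUEUE_PATH)) return;
      const lines = readFileSync(QUEUE_PATH, "utf8").split("\n").filter(Boolean);
      for (const line of lines) {
        saveCommentNormally(JSON.parse(line)); // spam checks etc. run like normal on replay
      }
      writeFileSync(QUEUE_PATH, ""); // clear the queue after replay
    }

Appending one JSON object per line keeps the replay logic trivial.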
Joining Google a few years ago, one thing I was impressed with is the amount of effort that goes into graceful degradation. For user facing services it gets quite granular, and is deeply integrated into the stack – from application layer to networking.
Previously I worked on a big web app at a growing startup, and it's probably the sort of thing I'd start adding in small ways from the early days. Being able to turn off unnecessary writes, turn down the rate of more expensive computation, turn down rates of traffic amplification, these would all have been useful levers in some of our outages.
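To make "levers" concrete, here's a minimal sketch of the kind of thing I mean (all names invented, not any particular company's system): a handful of runtime flags an operator can flip during an incident without a deploy.

    // Hypothetical degradation levers, tweakable at runtime via a config/flag service.
    const levers = {
      nonCriticalWrites: true,          // e.g. analytics counters, "last seen" timestamps
      expensiveComputeSampleRate: 1.0,  // fraction of requests that run the costly path
    };

    function recordAnalytics(event: string): void {
      if (!levers.nonCriticalWrites) return; // lever: drop unnecessary writes under load
      // ...write to the analytics store...
    }

    function shouldRunExpensiveRanking(): boolean {
      // lever: probabilistically turn down expensive computation
      return Math.random() < levers.expensiveComputeSampleRate;
    }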
It's really great to have such capabilities, but adding them has a cost that only a few can afford: the investment in building them cuts into your feature velocity, and then there's the ongoing maintenance.
Can you be specific about the cost of building these?
I've run into many situations where something was deemed too costly, the need became apparent later, and the team ultimately had to implement it anyway, all while hoping no one groks that it was predicted. "Nobody ever gets credit for fixing problems that never happened" (https://news.ycombinator.com/item?id=39472693) is related.
So at my previous place we had a monolith with roughly 700 different URL handlers. Most of the problem with things like this was understanding what they all did.
Applying rate limiting, selective dropping of traffic, even just monitoring things by how much they affect the user experience, all require knowing what each one is doing. Figuring that out for one takes very little time. Figuring it out for 700 made it a project we'd never do.
The way I'd start with this is just by tagging things as I go. I'd build a lightweight way to attach a small amount of metadata to URL handlers/RPC handlers/GraphQL resolvers/whatever, and I'd decide a few facts to start with about each one – is it customer facing, is it authenticated, is it read or write, is it critical or nice to have, a few things like that. Then I'd do nothing else. That's probably a few hours of work, and would add almost no overhead.
Now when it comes to needing something like this, you've got a base of understanding of the system to start from. You can incrementally use these, you can incrementally enforce that they are correct through other analysis, but the point is that I think it's low effort as a starting point with a potentially very high payoff.
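As a sketch of what that tagging could look like (every name here is made up for illustration):

    type HandlerMeta = {
      customerFacing: boolean;
      authenticated: boolean;
      kind: "read" | "write";
      criticality: "critical" | "nice-to-have";
    };

    const registry = new Map<string, HandlerMeta>();

    // Wrap however you normally register handlers, recording the metadata as you go.
    function route(path: string, meta: HandlerMeta, handler: (req: unknown) => unknown) {
      registry.set(path, meta);
      return handler; // hand off to your actual framework here
    }

    declare function getComments(req: unknown): unknown;
    declare function exportAccountData(req: unknown): unknown;

    route("/comments", { customerFacing: true, authenticated: false, kind: "read", criticality: "critical" }, getComments);
    route("/export", { customerFacing: true, authenticated: true, kind: "read", criticality: "nice-to-have" }, exportAccountData);

    // Later, an overload policy can act on the tags instead of on 700 special cases:
    function shouldShed(path: string, overloaded: boolean): boolean {
      return overloaded && registry.get(path)?.criticality === "nice-to-have";
    }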
Facebook makes over 300 requests for me just loading the main logged in page while showing me exactly 1 timeline item. Hovering my mouse over that item makes another 100 requests or so. Scrolling down loads another item at the cost of over 100 requests again. It's impressive in a perverse way just how inefficient they can be while managing to make it still work, and somewhat disturbing that their ads bring in enough money to make them extremely profitable despite it.
I can't comment on the numbers, but think of how many engineers work there and how many users Facebook, Whatsapp, Instagram have. Each engineer is adding new features and queries every day. You're going to get a lot of queries.
We’ve really wasted an incredible amount of talent-hours over the last couple of decades. Imagine if we’d worked on, like, climate change or something instead of ad platforms.
I think about 10 years ago when I was working there I checked the trace to load my own homepage. Just one page, just for myself, and there were 100,000 data fetches.
> Am I reading the second figure right? Facebook can do 130*10^6 queries/second == 130,000,000 queries/second?!
That sounds totally plausible to me.
Also keep in mind they didn't say what system this is. It's often true that 1 request to a frontend system becomes 1 each to 10 different backend services owned by different teams and then 20+ total to some database/storage layer many of them depend on. The qps at the bottom of the stack is in general a lot higher than the qps at the top, though with caching and static file requests and such this isn't a universal truth.
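Purely made-up numbers, just to show how the amplification multiplies out:

    2,000,000 frontend requests/s
      x 10 backend calls per frontend request  =  20,000,000 backend requests/s
      x ~6.5 storage ops per backend call      = ~130,000,000 storage queries/s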
This is how information slowly changes. The original numbers from facebook needed to be taken with a grain of salt. 2 billion a day raises it more.
Facebook claims to have 2 billion accounts, but nowhere near 2 billion unique accounts. I don't know what Facebook calls an active user, but it used to mean logging in at least once in the past 30 days.
- The comment is referring to a graph that used 10^6 on the vertical axis, which is a very common way to format graphs with large numbers (not just "lately"). It's also the default for a lot of plotting libraries.
- 10^n is more compact than million/billion/etc, more consistent, easier to translate, and doesn't suffer from regional differences (e.g. "one billion" is a different number in Britain than in the US).
I'm not saying it's clearly better than "million" in this specific case, but it's definitely not clearly worse.
A custom JIT + language + web framework + DB + queues + orchestrator + hardware built to your precise specifications + DCs all over the world go a long way ;)
Off-topic but: I love the font on the website. At first I thought it was the classic Computer Modern font (used in LaTeX). But nope. Upon inspection of the stylesheet, it's https://edwardtufte.github.io/et-book/ which is a font designed by Dmitry Krasny, Bonnie Scranton, and Edward Tufte. The font was originally designed for Tufte's book Beautiful Evidence. But people showed interest in the font, see the bulletin board on ET's website: https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=... Initially he was reluctant to go to the trouble of releasing it digitally, but eventually he did make it available on GitHub.
They're the same thing or close to it. "Load shedding" might be a bit more general. A couple possible nuances:
* Perhaps "graceful feature degradation" as a choice of words is a way of noting there's immediate user impact (but less than ungracefully running out of capacity). "Load shedding" could also mean something less impactful, for example some cron job that updates some internal dashboard skipping a run.
* "feature degradation" might focus on how this works at the granularity of features, where load shedding might mean something like dropping request hedges / retries, or individual servers saying they're overloaded and the request should go elsewhere.
The situation is that A depends on B, but B is overloaded; if we allow B to do load shedding, we must also write A to gracefully degrade when B is not available.
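A minimal sketch of that A-and-B shape (hypothetical service and cache names): B says it's overloaded, and A falls back to a stale or empty result instead of failing the whole page.

    const staleRecsCache = new Map<string, string[]>();

    // A (this service) depends on B (an invented recommendations service).
    async function fetchRecommendations(userId: string): Promise<string[]> {
      try {
        const res = await fetch(`https://recs.internal/users/${userId}`); // service B
        if (res.status === 429 || res.status === 503) {
          return degraded(userId); // B is shedding load: degrade gracefully
        }
        const recs = (await res.json()) as string[];
        staleRecsCache.set(userId, recs);
        return recs;
      } catch {
        return degraded(userId); // B unreachable: same degraded path
      }
    }

    function degraded(userId: string): string[] {
      return staleRecsCache.get(userId) ?? []; // render the page without the feature
    }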
Defcon: Preventing Overload with Graceful Feature Degradation - https://news.ycombinator.com/item?id=36923049 - July 2023 (1 comment)
https://github.com/fluxninja/aperture
We built a similar tool at Netflix but the degradations could be both manual and automatic.
The manual part of Defcon is more "holy crap, we lost a datacenter and the whole site is melting, turn stuff off to bring the load down ASAP"
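For flavor, a toy sketch of a manual switch like that (levels and feature names invented here): an operator sets one global level, and features check whether they're still allowed to run.

    type DefconLevel = 1 | 2 | 3; // 3 = normal operation, 1 = "the site is melting"

    let currentLevel: DefconLevel = 3; // flipped by an operator tool (or automation)

    // The lowest level at which each feature stays enabled.
    const keepEnabledDownTo: Record<string, DefconLevel> = {
      "core-playback": 1,            // keep even in the worst case
      "recommendations": 2,          // drop when things get bad
      "artwork-personalization": 3,  // first thing to go
    };

    function featureEnabled(name: keyof typeof keepEnabledDownTo): boolean {
      return currentLevel >= keepEnabledDownTo[name];
    }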
* Perhaps "graceful feature degradation" as a choice of words is a way of noting there's immediate user impact (but less than ungracefully running out of capacity). "Load shedding" could also mean something less impactful, for example some cron job that updates some internal dashboard skipping a run.
* "feature degradation" might focus on how this works at the granularity of features, where load shedding might mean something like dropping request hedges / retries, or individual servers saying they're overloaded and the request should go elsewhere.
The situation is that A depends on B, but B is overloaded; if we allow B to do load shedding, we must also write A to gracefully degrade when B is not available.
Deleted Comment