cortesoft · 2 years ago
As someone who has worked on very large networks for 15 years, the author left out some important bits about using caching with thundering herd problems.

You really need to make sure that your cache settings will not generate a ton of requests to the origin when new content is requested.

If you have a bunch of clients that are going to request content at nearly the exact same time (this happens a lot with live video streaming, but can also happen when your clients are configured to download something as soon as it becomes available, or on a schedule, etc.), you need to make sure your caches are set up so that the second request will wait for the first request to fill from origin rather than send its own cache fill request. Ideally, if 1000 clients all request the same content from your cache, only ONE request should go back to the origin, and the other 999 should wait and use the content retrieved by the first request. Otherwise, your cache isn't going to do anything, since every request is going to require pulling from origin anyway.

In nginx, you do this with something like the proxy_cache_lock setting.
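
For example, a minimal sketch (the cache zone name, sizes, and the `origin_backend` upstream are placeholders):

```nginx
# Only one request per cache key fills from origin; concurrent requests
# for the same key wait for that fill instead of stampeding the origin.
proxy_cache_path /var/cache/nginx keys_zone=edge_cache:100m max_size=10g;

server {
    location / {
        proxy_cache              edge_cache;
        proxy_cache_valid        200 10s;
        proxy_cache_lock         on;
        proxy_cache_lock_timeout 5s;
        # Optionally serve slightly stale content while one request refreshes it.
        proxy_cache_use_stale    updating error timeout;
        proxy_pass               http://origin_backend;
    }
}
```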

bitexploder · 2 years ago
Caching is generally way harder to get right than a lot of people probably think. Having delved into building caching systems that get the behavior and needs of the system just right, I found it took way more effort and research than I thought it would. You start with something simple, and by the time you have chased down all the edge cases and tuned the cache behavior to your situation, it is often not so simple after all.
omeze · 2 years ago
Yep, I've found that, generally, caching logic should be a separate flow from business logic. E.g. wrap your RPC logic so that an in-memory cache transparently sits in your RPC stack. This lets you implement locking on the cache in a single place, and makes it easy to do things like eager prefetching or offloading cache logic to another thread.
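
A minimal single-flight sketch of that idea (no TTLs or eviction; `fetch` stands in for the real RPC call):

```python
import threading

class CoalescingCache:
    """Wraps a fetch function so that concurrent misses on the same key
    result in only one backend call; the other callers wait for it."""

    def __init__(self, fetch):
        self._fetch = fetch            # the real RPC / origin call
        self._cache = {}               # key -> value
        self._locks = {}               # key -> per-key fill lock
        self._guard = threading.Lock() # protects the lock table

    def get(self, key):
        if key in self._cache:         # fast path: already cached
            return self._cache[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                     # only one thread fills each key
            if key not in self._cache:
                self._cache[key] = self._fetch(key)
            return self._cache[key]
```

The important property is the per-key lock: if 1000 callers miss at once, 999 of them block on the lock instead of each calling the backend.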
01HNNWZ0MV43FF · 2 years ago
I recall seeing that called "Request coalescing" in Faster Than Lime's article https://fasterthanli.me/articles/request-coalescing-in-async...
oops · 2 years ago
It is also sometimes called request collapsing.
rollcat · 2 years ago
Adding a little bit of random delay to any scheduled / triggered actions is always a good idea, and is usually very cheap to implement.

If you're on OpenBSD, their cron supports randomized ranges using the ~ operator (see <https://man.openbsd.org/crontab.5>). Otherwise, you can use something like 'sleep $(($RANDOM % 60)) && some-task', but beware that $RANDOM only has a range of 0-32767; you won't get the full, uniform 86400s range for daily jobs.

If you are unable to add a different random delay on every invocation (e.g. you have an RSS reader that only allows you to specify an exact interval), pick some nearby prime number, e.g. 37 instead of 30 or 45, or 71 instead of 60 (71-60=11 being a prime is also great).

If you're deploying a whole fleet at once, you can also vary the exact timings of cron.hourly, cron.daily, etc between machines.
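
If you control the scheduling loop yourself rather than cron, the same random-delay idea is only a few lines (the interval and jitter values here are arbitrary):

```python
import random
import time

def run_periodically(task, interval_s=3600, jitter_s=300):
    """Run `task` on a fixed interval plus a fresh random offset each time,
    so a fleet of hosts started simultaneously doesn't fire in lockstep."""
    while True:
        time.sleep(interval_s + random.uniform(0, jitter_s))
        task()
```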

traceroute66 · 2 years ago
For those who don't have the weird systemd allergy, systemd has many great features, including random scheduling, see `RandomizedDelaySec`.
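
A sketch of a timer unit using it (the service name is hypothetical):

```ini
# nightly-task.timer
[Unit]
Description=Run nightly-task.service daily with up to an hour of random delay

[Timer]
OnCalendar=daily
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```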
rollcat · 2 years ago
Oh yes, systemd in general and timers in particular do have a whole bunch of awesome stuff - I just tend to forget any of it exists; it's my coping mechanism for dealing with "heterogeneous" environments. I still have CentOS 6 machines in prod...

https://www.freedesktop.org/software/systemd/man/latest/syst...

PaulDavisThe1st · 2 years ago
Once again, what TFA describes is NOT the thundering herd problem.

https://en.wikipedia.org/wiki/Thundering_herd_problem

> In computer science, the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event.

(emphasis mine)

gilbetron · 2 years ago
Thundering herd has evolved beyond the original definition; that happens. Sometimes terms even flip their meaning (like how "one bad apple" has flipped its meaning, or how "premature optimization is the root of all evil" as used these days isn't what the original author actually meant). But in this case I think most of the new usages fit the basic idea, which is that a large amount of unusual traffic happens outside of normal usage patterns.
PaulDavisThe1st · 2 years ago
Fair point that terms like this can change meanings.

But this version absolutely does not fit the original, which had nothing to do with amounts of traffic.

It was a problem with kernel APIs: if N threads sleep waiting for an event on a socket (i.e. a thread pool), there was no way to wake up only one of them when a single msg arrived. They would all wake, all check if there was work to do, then all (but one) would go back to sleep. This is a behavioral flaw even in the case of next to no traffic (though it may not have a material effect).

The problem has been solved with better kernel APIs.
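
In application-level terms, that original flaw looks like waking every waiter when only one of them can take the work. A toy sketch (not TFA's scenario):

```python
import threading
from collections import deque

jobs, cond = deque(), threading.Condition()

def worker():
    while True:
        with cond:
            while not jobs:
                cond.wait()          # N idle workers all sleep here
            job = jobs.popleft()     # only one of them gets the job
        job()

def submit(job):
    with cond:
        jobs.append(job)
        # notify_all() wakes every worker for a single job -- the classic
        # thundering herd. notify() wakes just one and avoids it.
        cond.notify_all()
```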

Rapzid · 2 years ago
Okay but that's actually not the most general definition I'm familiar with.

It's a synchronization and resource constraint issue with various symptoms depending on domain specifics.

Let's discuss specifics about how the post isn't discussing a thundering herd problem in the general sense.

PaulDavisThe1st · 2 years ago
Please read the comment from user gilbetron appearing really close to yours, and my reply.
matsemann · 2 years ago
Once had to interface with an external API for many of our client requests (booking requests for transportation). Due to certain combinations of queries, it could sometimes be slow to get a response if the requested journey would contain multiple legs and combinations. So we needed to have a high timeout for our requests to that system. Which was fine most of the time.

Unfortunately, if that system started to struggle, all our calls would be slow, or even time out. And of course us continuously spamming them wouldn't help (we were like 95% of their traffic). But that didn't just wreak havoc on their side; it also completely exhausted all our incoming connections, thread pools, etc., which were just waiting, with long timeouts, for responses that would never come.

A circuit breaker here was golden. When too many requests in a row were slow, we would break and stop requests from our side. That would allow them to recover on their side, and things on our side to fail fast and let other services still work.

A key here was that the circuit breaker would still allow a small percentage of requests through. How else would you detect that things are back up again? It would then slowly let more and more through if things were going fine (so not just open completely all at once).
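
A rough sketch of that kind of breaker (the threshold and probe ratio are made-up numbers; a slow call would be turned into a failure by a timeout wrapper):

```python
import random

class CircuitBreaker:
    """Trips after too many consecutive failures. While tripped, only a
    small fraction of calls are let through as probes; each success raises
    that fraction, so traffic ramps back up gradually."""

    def __init__(self, failure_threshold=5, probe_ratio=0.05):
        self.failure_threshold = failure_threshold
        self.probe_ratio = probe_ratio
        self.allow_ratio = 1.0          # 1.0 == all traffic allowed
        self.consecutive_failures = 0

    def call(self, fn, *args, **kwargs):
        if random.random() > self.allow_ratio:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.allow_ratio = self.probe_ratio   # trip the breaker
            raise
        else:
            self.consecutive_failures = 0
            # Let more traffic through gradually, not all at once.
            self.allow_ratio = min(1.0, self.allow_ratio * 2)
            return result
```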

cbsmith · 2 years ago
The thundering herd problem comes from throughput decreasing due to overhead when you need to scale up capacity (traditionally due to increased context switching, but these days often from autoscaling activity). While rate-limiting, circuit-breaking, load-shedding, etc. are decent band-aids for the problem, the better solution tends to be finding ways to increase amortization of request-processing overhead as request load increases. Caching is a way to do it, but if done improperly it can exacerbate problems. Another common approach, which I didn't see mentioned in the article, is microbatching, where you pay your overhead costs on a per-batch basis (often using streaming interfaces like Kafka). As you fall behind, your batch sizes increase, which increases the number of requests you process per batch, spreading the batch-processing overhead over more and more requests.
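
A toy version of that pattern (the in-process queue and `handle_batch` stand in for something like a Kafka consumer and its handler):

```python
import queue

def microbatch_worker(q: "queue.Queue", handle_batch, max_batch=500):
    """Drain whatever has accumulated (up to max_batch) and process it as
    one batch, so per-batch overhead is amortized over more requests as
    the backlog grows."""
    while True:
        batch = [q.get()]                  # block until at least one item
        while len(batch) < max_batch:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        handle_batch(batch)                # pay setup/commit costs once per batch
```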
pphysch · 2 years ago
The single biggest way to mitigate this is to switch from "push" to "pull" architecture where possible. Let the central servers reach out to the "clients" at their own honest pace. Not every process can be modeled this way, but a surprising amount can, particularly internal processes where you control all of the agents.
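
A sketch of the shape of that (all names here are made up): the central loop owns the pace, instead of reacting to whatever the agents push at it.

```python
import time

def poll_agents(agents, fetch_status, interval_s=30):
    """Central server pulls from each agent at a pace it controls, rather
    than every agent pushing on its own schedule and stampeding the server."""
    while True:
        for agent in agents:
            fetch_status(agent)    # the server decides when and how fast
        time.sleep(interval_s)
```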
cbsmith · 2 years ago
Ironically, stream processing, which we tend to call "push" architecture, is largely the same solution.
pphysch · 2 years ago
Do you mean something like "at most once" delivery over UDP?
Berniek · 2 years ago
Philosophically, the "thundering herd problem" actually CAUSES congestion.

There are 2 ways to handle it.

1. Stop the thundering herd (make all the clients do something different). That may make things worse. Congestion in networks is usually exponential: you can't fulfill a request, so you repeat the request; it can't be fulfilled, so it repeats. You can add a random delay at the client end, but that is just the US government's answer to the debt problem: it kicks the can down the road in the short term, but it will almost certainly come back and bite you. Mathematically, it is very easy for this scenario to become catastrophically exponential once a threshold is reached.

2. Stop the congestion (make the handling process faster, or add processes).

The system already has a cache to handle this, but if the content is not in the cache it doesn't help. There needs to be an extra request cache exclusively for congestion scenarios. The existing cache is already doing some processing on each request, so extend that to route "thundering herd" requests to this second cache, which does a bit more processing as well. As each new request is routed to it, it checks whether this requestor is already present and removes or overwrites the old entry; it should never contain more than one entry per client.

When no more additions are being made to this congestion cache (or the rate has slowed significantly), the requests can be forwarded and processed via the original cache system.

Under this configuration, the congestion does not become exponential and should only delay the thundering herd requests. All other requests will be handled as per normal.

Once the original cache has the information there is no need for any thundering herd requests to be routed to the congestion cache.

Some clients will encounter some delays, but not all of them, and only on the "thundering herd" requests.

_whiteCaps_ · 2 years ago
When working on an embedded product, we used the last few digits of the MAC address to control timing of API accesses. Obviously doesn't work unless you control the whole stack, but for us, it was an easy fix since we had our own OUI.
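
Roughly that idea, as a sketch (the MAC and the window size are example values):

```python
def splay_seconds(mac: str, window_s: int = 3600) -> int:
    """Derive a stable per-device offset from the last digits of the MAC,
    so a fleet spreads its API calls across the window instead of all
    phoning home at the same instant."""
    return int(mac.replace(":", "")[-4:], 16) % window_s

# e.g. each device sleeps its own fixed offset before its hourly check-in
offset = splay_seconds("00:11:22:aa:bb:cc")
```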