stingraycharles · 5 years ago
Tangent, but I always had a different understanding of the “thundering herd” problem; that is, if a service is down for whatever reason, and it’s brought back online, it immediately grinds to a halt again because there are a bazillion requests waiting to be handled.

And the solution to this problem is to bring the service back online slowly, with rate limiting, rather than letting the whole thundering herd through the door at once.

toast0 · 5 years ago
That's really not the traditional meaning of thundering herd, which is about waking up all the processes when a connection comes in, then they all try to accept it and it's a lot of work for nothing. You get much better results if only a single process is woken up for each event.

Your problem is a real problem though. Where I worked, we would call that backlog, and we would manage it with 'floodgates' ... When the system is broken, close the gates, and you need to open them slowly.

In an ideal world, your system would self-regulate from dead to live, shedding load as necessary, but always making headway. But sometimes a little help is needed to avoid the feedback loop of timed out client requests that still get processed on the server keeping the server in overload.

Ozzie_osman · 5 years ago
Yeah, you're right. It could be a service being down and requests piling up, or a cache key expiring and many processes trying to regenerate the value at the same time, etc.

I think the article just used this phrase to describe something else. (Great article otherwise).

fanf2 · 5 years ago
There is an explanation of this kind of thundering herd about 3/4 down this article https://httpd.apache.org/docs/trunk/misc/perf-scaling.html

The short version is that when you have multiple processes waiting on listening sockets and a connection arrives, they all get woken up and scheduled to run, but only one will pick up the connection, and the rest have to go back to sleep. These futile wakeups can be a huge waste of CPU, so on systems without accept() scalability fixes, or with more tricky server configurations, the web server puts a lock around accept() to ensure only one process is woken up at a time.
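A minimal sketch of that serialized-accept fix (threads and a `threading.Lock` stand in for the pre-fork processes and cross-process lock a real server would use; all names here are illustrative):

```python
import socket
import threading

def worker(lock, listener):
    # Only the lock holder blocks in accept(), so an incoming connection
    # wakes exactly one worker instead of the whole herd. Apache 1.3 does
    # the same with processes and a cross-process accept mutex.
    while True:
        with lock:
            conn, _ = listener.accept()
        conn.sendall(b"ok\n")
        conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(16)
accept_lock = threading.Lock()
for _ in range(4):
    threading.Thread(target=worker, args=(accept_lock, listener), daemon=True).start()

# Exercise it once as a client.
port = listener.getsockname()[1]
client = socket.create_connection(("127.0.0.1", port))
reply = client.recv(8).decode().strip()
print(reply)
```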

The term (and the fix) dates back to the performance improvement work on Apache 1.3 in the mid-1990s.

taylorhughes · 5 years ago
Phrase borrowed from excellent uWSGI docs https://uwsgi-docs.readthedocs.io/en/latest/articles/Seriali...
lookACamel · 5 years ago
That's not the thundering herd. If someone rings the door (request), only one person (agent, process) needs to answer the door. But what might happen is that everyone in the house rushes to answer the door. The people "thundering" to the door (and making a mess as they do so) are the "herd". This can quickly become a problem if there are a lot of people in the house and the doorbell keeps ringing.

Deleted Comment

thaumasiotes · 5 years ago
> but I always had a different understanding of the “thundering herd” problem; that is, if a service is down for whatever reason, and it’s brought back online, it immediately grinds to a halt again because there are a bazillion requests waiting to be handled.

That... doesn't have much to do with the thundering herd problem. It also doesn't make much sense as a concept on its own merits -- say you come in to work and your inbox is full enough for three inboxes. Does that fact, in itself, mean that you decide you're done for the day? No, it just means you have a much longer queue to work through than usual.

The thundering herd problem refers to what happens when (1) a bunch of agents come to you for something while you're busy; (2) you tell them all "I'm busy, go away and come back later"; and (3) the come-back-later time you give to each of them is identical, so they all come back simultaneously.

And that's exactly what's happening here, except that instead of giving each worker thread a come-back-later time when it asks for work, you're receiving work, sending out individual messages to every worker saying "hey, I'm not busy anymore, come back RIGHT NOW and get some more work", and then rejecting all but one of the thundering herd that shows up. The reason the Gunicorn docs and the uWSGI docs both refer to this as a "thundering herd" problem is that it's a near-perfect match for the problem prototype. The only difference is that, instead of giving out identical come-back-later times to worker threads as they ask you for work, you tell them to wait for a notification that includes a come-back-later time, and then when you get one piece of work you fire off that notification separately to every sleeping thread, including identical come-back-later times in each one.
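The standard fix for (3) is to randomize the come-back-later time. A sketch of "full jitter" backoff (the function name and parameters are my own, not from the docs being discussed):

```python
import random

def retry_delay(attempt, base=0.5, cap=30.0):
    # Exponential backoff with full jitter: instead of handing every
    # client the same come-back-later time, draw a random delay from
    # [0, base * 2**attempt] so retries spread out rather than arriving
    # as a simultaneous herd.
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [retry_delay(3) for _ in range(5)]
print(all(0.0 <= d <= 4.0 for d in delays))  # True
```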

toast0 · 5 years ago
> That... doesn't have much to do with the thundering herd problem. It also doesn't make much sense as a concept on its own merits -- say you come in to work and your inbox is full enough for three inboxes. Does that fact, in itself, mean that you decide you're done for the day? No, it just means you have a much longer queue to work through than usual.

If my SLA is 24-hour response time, and the inbox is FIFO, and I can't drop old messages, I'm most likely not hitting the SLA. If they all came in overnight, I'll hit the SLA for day 1, but I will be busy all of day 2 and 3 and never respond on time. If after day 1, I get a day's worth of messages every day, I'll never catch up.
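The arithmetic behind that: once daily arrivals match daily capacity, a standing backlog never shrinks (illustrative numbers, not from the thread):

```python
# One day's capacity equals one day's arrivals, so the initial
# three-inbox backlog is carried forward forever and every message
# waits ~3 days, blowing a 24-hour SLA.
capacity_per_day = 100
arrivals_per_day = 100
backlog = 300  # three inboxes' worth
for day in range(10):
    backlog += arrivals_per_day
    backlog -= capacity_per_day
print(backlog)  # unchanged: 300
```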

c_o_n_v_e_x · 5 years ago
This reminds me of inrush current when starting large motors... You get a huge current spike when you initially turn on the motor, so large that it can trip the breaker.

One solution is to use a soft starter, which slowly brings the motor up to speed.

luhn · 5 years ago
Unfortunately HAProxy doesn't buffer requests*, which is necessary for a production deployment of gunicorn. And for anybody using AWS, ALB doesn't buffer requests either. Because of this I'm actually running both HAProxy and nginx in front of my gunicorn instances—nginx in front for request buffering and HAProxy behind that for queuing.

If anybody is interested, I've packaged both as Docker containers:

HAProxy queuing/load shedding: https://hub.docker.com/r/luhn/spillway

nginx request buffering: https://hub.docker.com/r/luhn/gunicorn-proxy

* It does have an http_buffer_request option, but this only buffers the first 8kB (?) of the request.
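For reference, the nginx side of that front tier can be sketched like this (the directive names are real nginx options; the upstream name and sizes are illustrative, not taken from my actual containers):

```nginx
server {
    listen 80;
    location / {
        # nginx reads the full request body before proxying, so a slow
        # client can't tie up a gunicorn worker (on by default).
        proxy_request_buffering on;
        client_body_buffer_size 1m;
        client_max_body_size 10m;
        proxy_pass http://haproxy_tier;  # hypothetical upstream
    }
}
```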

Twirrim · 5 years ago
Couldn't Apache httpd just do all of that for you? mod_buffer provides request buffering, and mod_proxy_balancer provides load balancing capabilities.
luhn · 5 years ago
Can Apache do request queuing?
jhgg · 5 years ago
This is somewhat suspect. At my place of work, we operate a rather large Python API deployment (over an order of magnitude more QPS than the OP's post). However, our setup is... pretty simple. We only run nginx + gunicorn (gevent reactor), 1 master process + 1 worker per vCPU. In front of that we have an envoy load-balancing tier that does p2c backend selection to each node. I actually think the nginx is pointless now that we're using envoy, so that'll probably go away soon.

Works amazingly well! We run our python API tier at 80% target CPU utilization.

lddemi · 5 years ago
glad you are seeing such awesome performance with gevent+envoy! which part of our experience do you think is suspect?
jhgg · 5 years ago
So, in gunicorn's default mode (sync), the mode I'm assuming you're using, you really have 1 process handling 1 request at a time. The "thundering herd" problem really only applies to connection acceptance. Which is to say, that in the process of accepting a connection, it is possible to wake all idle processes that are waiting for a connection to come in (they will wake, hit EAGAIN, and then go back to waiting). Busy processes that are servicing requests (not waiting on the accept call) will not be woken, since they aren't waiting on a new request to come in. The "thundering herd" problem as I understand it can indeed waste CPU cycles, but only on processes that aren't doing much anyways. I do however believe that `accept()` calls have been synchronized between processes on Linux for a while now to prevent spurious wakeups. You should verify you're actually seeing spurious wakeups by using `strace` and checking for a bunch of `accept()` calls returning EAGAIN.
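If you don't want to go straight to strace, the failure mode itself is easy to reproduce in a few lines (my sketch, not from the post): a non-blocking accept() with no pending connection fails immediately with EAGAIN, which Python surfaces as BlockingIOError - that's the "futile wakeup" each extra worker pays.

```python
import socket

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
listener.setblocking(False)
try:
    # No client has connected, so the kernel has nothing to hand us.
    listener.accept()
    result = "accepted"
except BlockingIOError:  # errno EAGAIN / EWOULDBLOCK
    result = "EAGAIN"
print(result)
```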

In gunicorn, `sync` mode does exhibit a rather pathological connection churn, because it does not support keep-alive. Generally, most load balancing layers already will do connection pooling to the upstream, meaning your gunicorn processes won't really be accepting many connections after they've "warmed up". This doesn't apply in sync mode unfortunately :(. Connection churn can waste CPU.

Another thing to also note is that if you have 150 worker processes, but your load balancer only allows 50 connections per upstream, chances are 100 of your processes will be sitting there idle.

Something just doesn't feel quite right here.

EDIT: I do see mention of the `gthread` worker - so you might already be able to support HTTP keep-alives. If this is the case, then you should really have no big thundering herd problem after the LB establishes connections to all the workers.
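For what it's worth, the keep-alive setup I'm describing is only a few lines in a gunicorn config file (the option names are real gunicorn settings; the values are illustrative):

```python
# gunicorn.conf.py sketch: threaded workers that hold keep-alive
# connections open, so a pooling LB reuses connections instead of
# churning them.
workers = 4            # typically 1 per vCPU
worker_class = "gthread"
threads = 8
keepalive = 30         # seconds to keep idle connections open
```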

mac-chaffee · 5 years ago
Could the discrepancy be explained by the type of responses?

Sounds like an app like clubhouse might have lots of small, fast responses (like direct messaging), where very little of the response time is spent in application code. Does your API happen to do a lot of CPU-intensive stuff in application code?

jhgg · 5 years ago
Our app is also a messaging app. So lots of small & fast responses.
kvalekseev · 5 years ago
HAProxy is a beautiful tool, but it doesn't buffer requests, which is why NGINX is recommended in front of gunicorn; otherwise it's susceptible to slowloris attacks. So either Clubhouse can be easily DDoS'd right now, or they have some tricky setup that prevents slow POST requests from reaching gunicorn. In the blog post they don't mention that problem while recommending others try replacing NGINX with HAProxy.
lddemi · 5 years ago
1. HAProxy does support request buffering https://cbonte.github.io/haproxy-dconv/2.2/configuration.htm...

2. our load balancer buffers requests as well

kvalekseev · 5 years ago
From HAProxy mailing list about http_buffer_request option https://www.mail-archive.com/haproxy@formilux.org/msg23074.h...

> In fact, with some app-servers (e.g. most Ruby/Rack servers, most Python servers, ...) the recommended setup is to put a fully buffering webserver in front. Due to it's design, HAProxy can not fill this role in all cases with arbitrarily large requests.

A year ago I was evaluating a recent version of HAProxy as a buffering web server and successfully ran a slowloris attack against it. Thus switching from NGINX is not a straightforward operation, and your blog post should mention the http-buffer-request option and the slow client problem.

TekMol · 5 years ago
Performance is the only thing holding me back from considering Python for bigger web applications.

Of the 3 main languages for web dev these days - Python, PHP and Javascript - I like Python the most. But it is scary how slow the default runtime, CPython, is. Compared to PHP and Javascript, it crawls like a snake.

PyPy could be a solution, as it seems to be about 6x faster on average.

Is anybody here using PyPy for Django?

Did Clubhouse document somewhere whether they are using CPython or PyPy?

zinodaur · 5 years ago
I feel like people let go of perf too easily.

When using something like Golang, I have apps doing normal CRUD-ish queries at 10k QPS, on 32c/64g machines. For most web apps, 10k QPS is much more than they will ever see, and the fact that it is all done in a single process means you could do really cool things with in-memory datastructures.

Instead, every single web app is written as a distributed system, when almost none of them need to be, if they were written on a platform that didn't eat all of their resources.

midrus · 5 years ago
I could rephrase your comment as: why would anyone use Go when they could just use assembler or C and keep it all in a single node?

People don't use Python because they want performance. People use Python because of productivity, frameworks, libraries, documentation, resources and ecosystems. Most projects don't even need 10k QPS, but most projects do need an ORM, a migrations system, authentication, sessions, etc. Python has battle-tested tools and frameworks for this.

jusssi · 5 years ago
People have been taught to be irrationally afraid of in-process concurrency (including async). Not too long ago the standard approach for concurrency was "it's hard, don't do it".

I've been told off in code review for using Python's concurrent.futures.ThreadPoolExecutor to run some http requests (making the code finish N times faster, in a context where latency mattered) "because it's hard to reason about".
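For the record, the pattern in question is only a few lines (the fetch function is a stand-in for the real HTTP call; the URLs are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for an I/O-bound call like requests.get(url).text.
    return len(url)

urls = ["https://a.example", "https://bb.example", "https://ccc.example"]
# Fan the calls out across a small pool; map() preserves input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
print(results)  # [17, 18, 19]
```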

mixmastamyk · 5 years ago
Backend controller performance is rarely a bottleneck, and if raw compute still is, there are a number of ways to speed it up, such as Cython and/or work queues.
domino · 5 years ago
Clubhouse is using CPython
TekMol · 5 years ago
Interesting. Is there a reason for this?
fnord77 · 5 years ago
The top 5 web app programming languages by market share are PHP, Java, JS, Lua and Ruby.
fnord77 · 5 years ago
lol at the downvotes. sorry your favorite language isn't on the list.

I specifically said "market share", not "best" or "favorite".

https://www.wappalyzer.com/technologies/programming-language...

TekMol · 5 years ago
...said a stranger on the internet without any sources to back up this claim.
IshKebab · 5 years ago
Typescript is a nicer language than Python in many ways and it doesn't suffer from Python's crippling performance issues or dubious static typing situation. Plus you can run it in a browser so there's only one language to learn.
franga2000 · 5 years ago
Typescript would be nice if it weren't essentially just a bunch of macros for JavaScript. As it is now, as soon as you want to run it, you lose all the benefits of it (including many performance optimizations that could be made in a statically typed runtime) and of course, all the usual footguns of vanilla JS still apply. It's a great development tool though, I'll give you that.
TekMol · 5 years ago
You cannot run TS in a browser.

You can compile it to JS or to Webassembly. But you can do that with every language.

petargyurov · 5 years ago
> Which exacerbated another problem: uWSGI is so confusing. It’s amazing software, no doubt, but it ships with dozens and dozens of options you can tweak.

I am glad I am not the only one. I've had so many issues with setting up sockets, both with gevent and uWSGI, only to be left even more confused after reading the documentation.

j4mie · 5 years ago
If you’re delegating your load balancing to something else further up the stack and would prefer a simpler WSGI server than Gunicorn, Waitress is worth a look: https://github.com/pylons/waitress
tbrock · 5 years ago
Aside: AWS only allows registering 1000 targets in a target group… I wonder if that's the limit they hit. If so, it's documented.