> With Prequal we saw dramatic improvements across all metrics [for youtube], including reductions of 2x in tail latency, 5-10x in tail RIF, 10-20% in tail memory usage, 2x in tail CPU utilization, and a near-elimination of errors due to load imbalance. In addition to meeting SLOs, this also allowed us to significantly raise the utilization targets that govern how much traffic we are willing to send to each datacenter, thereby saving significant resources.
This feels like one of those "my company realized savings greater than my entire career's expected compensation."
The question you should ask is how much less others would accept to do it. How much it saves isn't how you price work; you price work based on the people available who can do it.
If the fire department puts out a fire in your house you don't pay them the cost of the building. You don't give your life to a doctor, etc. That way of thinking is weird.
It's not weird, but it is incomplete. It is broadly captured by the economics concept of "willingness to pay": loosely, the maximum price someone or some firm would be willing to pay for something of benefit to them.
This contrasts with "willingness to accept": loosely, the minimum compensation someone or some firm would accept to produce a good or service (or to accept some negative thing).
Neither of these is sufficient to determine the price of something precisely, but, in aggregate, these concepts bound the market price for some good or service.
which should make the employee proud, and I'm sure Google is compensating them very, very well.
let's also not forget that the people involved didn't create this in a vacuum. it cost Google a LOT more than their compensation to make it possible for them to even start working on this project, let alone carrying it forward to completion.
people underestimate how hard and expensive it is to manage a company in a way that allows its employees to do a good job.
> it cost Google a LOT more than their compensation to make it possible for them to even start working on this project
The actual "not a vacuum" context here is an environment that has been basically printing money for google for the last twenty years. It did not "cost" them anything. It's fine to acknowledge that the people who built google, however well paid they are, are creating vastly more value than they are personally receiving.
> let's also not forget that the people involved didn't create this in a vacuum. it cost Google a LOT more than their compensation to make it possible for them to even start working on this project
...let's also not forget that Google didn't manufacture this opportunity in a vacuum. It cost the rest of society a lot more than their revenue to make it possible for them to employ people who can spend their entire days working on software.
Pretty common, really. Huge web properties like this have massive room for cost optimizations, and AFAICT that work is largely bottlenecked by management capacity. On the other hand, it's also pretty common to emphasize pet metrics like tails, which are pretty much by definition a minority of the cost. We give them the benefit of the doubt because of a) Google and b) peer-reviewed publication, but that's not usually a heuristic available to decision makers.
Optimizing tail latency isn't about saving fleet cost. It's about reducing the chance of one tail latency event ruining a page view, when a page view incurs dozens of backend requests.
The more requests you have, the higher the chance one of them hits a tail. So the overall latency a user sees is largely dependent on a) the number of requests and b) the tail latency of each one.
This method improves the tail latency for ALL supported services, in a generic way. That's multiplicative impact across all services, from a user perspective.
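To put rough numbers on the compounding effect (illustrative only; the tail probability is made up): if each backend call independently has a 1% chance of landing in the latency tail, a page that fans out to dozens of calls is quite likely to hit at least one.

    # Chance that at least one of n backend calls hits the latency tail,
    # assuming each call independently lands in the tail with probability p.
    def page_hits_tail(n_calls: int, p_tail: float) -> float:
        return 1.0 - (1.0 - p_tail) ** n_calls

    # Illustrative only: treat p_tail = 0.01 as "a p99 event per call".
    for n in (1, 10, 30, 100):
        print(n, round(page_hits_tail(n, 0.01), 3))
    # 1 0.01, 10 0.096, 30 0.26, 100 0.634

Which is why shaving the per-call tail helps every service that fans out.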
Presumably, the number of requests is harder to reduce if they're all required for the business.
I disagree with toenail being downvoted. In Europe, employees enjoy a great deal of stability. In the Netherlands, it's nigh-impossible to fire someone: you have to fill in 10 pages of justification paperwork and have an independent government agency review and approve it. If someone has a long-term illness, you have to pay 70% of their salary for up to 2 years, even when they do no work at all. Most people don't want to be entrepreneurs: they want clear instructions and stability. When you try to give stock to employees, the tax authorities raise an eyebrow: why would you give stock to employees when they enjoy none of your risks? It makes no sense, so we'll treat it as a form of salary and tax you 52% on the stock's paper value.
At the end of the day, what's left for the entrepreneur? You enjoy all the risk, but you don't get to have a paid 2 year sick leave. Even sympathy for your hard work can be hard to get. The potential of money is all you have.
Things are different in the US of course, where people can be fired the next minute without reason. That looks like just borderline abuse to me. But from a European perspective, the above comment does not deserve downvoting at all.
This is pretty much the only option for a person to have any sort of ownership... at least in the US, where co-ops and the like are extremely rare, with little legal support.
I'm surprised that S3/AWS hasn't blogged or done a white paper about their approach yet. It's been something like 7 years now since they moved away from standard load balancing.
If you think about an object storage platform, much like with YouTube, traditional load balancing is a really bad fit. No two requests are even remotely the same in terms of resource requirements, duration etc.
Nothing specific that I'm aware of. Off the top of my head, and sticking to things that are pretty safe NDA territory: load balancer algorithms typically work round-robin style, or by least current connections, or by speed of response, etc. They don't know anything about the underlying servers or the requests, just what they've sent in which direction. If you have multiple load balancers sitting in front of your fleet, they often don't know anything about what each other is doing either.
With an Object Storage service like S3, no two GET or PUT requests an LB serves are really the same, or have the same impact. They use different amounts of bandwidth, pull data at different speeds, have different latency, and require different amounts of CPU for handling or checksumming, etc. It wasn't unusual to find API servers that were bored stiff while others were working hard, all with approximately the same number of requests going to them.
Smartphones used to be a nightmare, especially with how many of them were on poor-quality connections and/or connecting internationally: millions of live connections just sitting there slowly GETing or PUTing, using up precious connection resources on web servers but not much CPU time.
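A toy snapshot of that mismatch, with entirely made-up numbers: a count-based balancer sees two servers as equally loaded because each has the same number of requests in flight, even though the actual work differs by a couple of orders of magnitude.

    # Made-up snapshot: same number of in-flight requests per server, but the
    # per-request cost (big PUTs, slow clients, checksumming) varies wildly.
    in_flight = {
        "api-server-a": [0.2, 0.3, 0.1],    # three cheap GETs (cost units)
        "api-server-b": [0.2, 40.0, 55.0],  # three requests, two of them huge
    }
    for server, costs in in_flight.items():
        print(server, "in-flight:", len(costs), "approx. work:", sum(costs))
    # A connection-count view calls these servers equal; the work isn't.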
They talked a lot about using probes to select candidate servers, but I struggled to find a good explanation of what exactly a probe was.
However the "Replica selection" section seems to shed some detail, albeit somewhat indirectly. From what I can gather a probe consists of N metrics, which are gathered by the backend servers upon request from the load balancers.
In the paper they used two metrics, requests in flight (RIF) and measured latency for the most recent requests.
I assume the backend server maintains a RIF counter and a circular list of the last N requests, which it uses to compute the average latency of recent requests (so skipping old requests in the list presumably). They mention that responding to a probe should be fast and O(1).
At least that's my understanding after glossing through the paper.
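If that reading is right, the backend side could be as small as something like this (my own sketch of the shape, not the paper's actual implementation; the class and field names are made up):

    import threading
    from collections import deque

    class ProbeState:
        """Per-server state so a probe can be answered without real work."""

        def __init__(self, window: int = 128):
            self.lock = threading.Lock()
            self.rif = 0                                # requests in flight
            self.latencies_ms = deque(maxlen=window)    # recent request latencies

        def request_started(self):
            with self.lock:
                self.rif += 1

        def request_finished(self, latency_ms: float):
            with self.lock:
                self.rif -= 1
                self.latencies_ms.append(latency_ms)    # old entries fall out

        def handle_probe(self):
            # Cheap to serve: a counter read plus a mean over a small fixed
            # window, so the cost is constant-bounded regardless of traffic.
            with self.lock:
                avg = (sum(self.latencies_ms) / len(self.latencies_ms)
                       if self.latencies_ms else 0.0)
                return {"rif": self.rif, "latency_ms": avg}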
The key is that the probes: a) are fast: they admittedly incur the same network cost as a regular request, but beyond that all they do is read two counters, so they're super quick for backends to serve.
b) cheap: they don't do nearly as much work as a "real" request, so the cost of enabling this system is not prohibitive. They simply return two numbers. The probes don't compete with "real" requests for resources.
c) give the load balancer useful information: among all the metrics they could have returned from the backend, the ones they chose led to good prediction outcomes.
One could imagine playing with the metrics used, even using ML to select the best ones, and to adapt them dynamically based on workload and time period.
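On the load balancer side, the selection over a few probed candidates could plausibly be as simple as the sketch below. This is my simplification of the general idea (among lightly loaded candidates pick the lowest reported latency, otherwise pick the lowest RIF), not the paper's exact rule, and the threshold and replica names are made up.

    def pick_replica(probe_results, rif_threshold=10):
        """probe_results: list of (replica, {"rif": int, "latency_ms": float}).

        Simplified heuristic, not the paper's exact rule: among candidates
        that look lightly loaded (RIF below a made-up threshold), take the
        lowest reported latency; otherwise fall back to the lowest RIF.
        """
        light = [(r, p) for r, p in probe_results if p["rif"] < rif_threshold]
        pool = light or probe_results
        key = (lambda rp: rp[1]["latency_ms"]) if light else (lambda rp: rp[1]["rif"])
        return min(pool, key=key)[0]

    # Pretend we probed three randomly chosen replicas and got these replies.
    probes = [
        ("b1", {"rif": 3, "latency_ms": 12.0}),
        ("b2", {"rif": 2, "latency_ms": 30.0}),
        ("b3", {"rif": 40, "latency_ms": 5.0}),   # busy despite low latency
    ]
    print(pick_replica(probes))   # -> "b1": lightly loaded and lowest latency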
I would have liked to read something quantitative about the measured cost (in terms of client and server CPU) other than describing it as "small". I'm trying to imagine doing this in gRPC and it seems like the overhead would be pretty high. I know Stubby is more efficient, but some hard numbers would have been nice.
Need to read this in depth over a weekend. I've been fully immersed in LLMs for the past 2 years and have ignored systems research.
This appears to be a very logical solution, i.e., scheduling based on estimated service quality instead of resource metrics. It has also become more or less common knowledge in recent years: systems are now so complex and so intertwined across machines that host load is only a minor factor in how long a request takes to serve. It's as if you've grown taller: you need to worry less about tripping over hurdles and more about bumping your head on the door frame.
But we do need this kind of research to formalize the practice and get everyone on board.
Google's applied research is absolutely winning here.
Load balancing is a term of art; the actual algorithm for distributing requests need not be load-based. A more accurate term for the component might be "request distributor," but I don't foresee people changing their vocabulary any time soon.
I've never heard of a load balancer that balances CPU load. They balance queuing depth, and that's only a proxy for CPU load, and a pretty terrible one at that.
I really don’t understand how their claim is anything more than a least-conn with a better weighting algorithm.
We don't generally use heterogeneous server clusters anymore. Noisy neighbors and differences from one data center to the next are definitely things, but outside of microservices you've got a lot of requests with different overhead to them. Route B might be five times as expensive as route A. So it's not server predictors that I want, but route predictors. Those need a weight or cost estimator based on previous traffic (rough sketch of what I mean below).
Poor man's version of this: we had ingress load balancers and then a local load balancer, like one does for Ruby or NodeJS or a handful of other languages. I found that we got much better tail latency running a more "square" arrangement. We initially had a little under 3 times as many boxes as cores per box, and I switched to the next biggest EC2 instance, which takes you to a 3:4 ratio. That not only cancelled out a slight latency increase from moving to Docker containers but also let me reduce the cluster size by about 5% and still have slightly better p95 times.
I get two equally weighted attempts to balance the load fairly, instead of one and change.
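Going back to the route-predictor idea above: one cheap way to build that kind of cost estimator (a hypothetical sketch, names made up) is an exponentially weighted moving average of observed latency per route, used as the weight for each outstanding request.

    from collections import defaultdict

    class RouteCostEstimator:
        """Hypothetical per-route cost estimator: EWMA of observed latencies."""

        def __init__(self, alpha: float = 0.1, default_ms: float = 50.0):
            self.alpha = alpha
            self.cost_ms = defaultdict(lambda: default_ms)

        def observe(self, route: str, latency_ms: float):
            # new_estimate = alpha * sample + (1 - alpha) * old_estimate
            old = self.cost_ms[route]
            self.cost_ms[route] = self.alpha * latency_ms + (1 - self.alpha) * old

        def weight(self, route: str) -> float:
            # A balancer could sum these weights over a server's in-flight
            # requests instead of just counting connections.
            return self.cost_ms[route]

    est = RouteCostEstimator()
    est.observe("GET /search", 12.0)
    est.observe("POST /upload", 900.0)
    print(est.weight("GET /search"), est.weight("POST /upload"))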
Requests (in flight or currently processing) are the load in this case. But I guess "queue balancing" captures the intuition better: what matters for latency is future delay more than current delay.
Talk: https://www.youtube.com/watch?v=PSP3GrZP2oo
https://youtu.be/NXehLy7IiPM
> PReQuaL does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight
So, still load balancing
> We present PReQuaL (Probing to Reduce Queuing and Latency), a load balancer for...
Fascinating that 2-3 probes per request is the sweet spot; intuitively it seems like a lot of overhead.
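For a bit of intuition with entirely made-up numbers: if a probe reply really is just a couple of small values, the bytes-on-the-wire overhead of 3 probes per request is tiny; the real question is the per-RPC CPU and framing cost that the gRPC comment above worries about.

    # Back-of-envelope, all numbers made up: wire overhead of 3 probes per
    # request if a probe reply carries little more than RIF + a latency value.
    probe_reply_bytes = 64            # assumed tiny payload plus framing
    probes_per_request = 3
    typical_response_bytes = 50_000   # assumed average "real" response size

    overhead = probes_per_request * probe_reply_bytes / typical_response_bytes
    print(f"{overhead:.2%}")          # ~0.38% extra bytes on the wire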