10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.
It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.
As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.
The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
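For concreteness, here's a minimal single-process sketch of that capped histogram; the 999ms cap, the function names, and the split into partial/merge steps are illustrative stand-ins for the map and reduce phases, not the original pipeline:

```python
from collections import Counter

CAP_MS = 999  # everything slower than this lands in the top bin

def partial_histogram(timings_ms):
    """Per-worker ("map") step: count each integer-millisecond timing, clamped at CAP_MS."""
    counts = Counter()
    for t in timings_ms:
        counts[min(t, CAP_MS)] += 1
    return counts

def merge(histograms):
    """The "reduce" step: histograms from different workers simply add together."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

def median_from_histogram(counts):
    """Walk the bins in order until half of the samples have been seen."""
    n = sum(counts.values())
    seen = 0
    for bucket in sorted(counts):
        seen += counts[bucket]
        if seen * 2 >= n:
            return bucket
    raise ValueError("empty histogram")

# Two "workers", then a single reduce.
h1 = partial_histogram([12, 12, 40, 1500])
h2 = partial_histogram([7, 40, 40])
print(median_from_histogram(merge([h1, h2])))  # 40
```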
Did you actually need to find the true median of billions of values? Or would finding a value between the 49.9th and 50.1st percentiles suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median.
(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)
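(For what it's worth, one rough way to do that statistics is a Hoeffding bound on how far the sample median's rank can drift; the numbers below are my own back-of-the-envelope figures, not anything claimed in the thread.)

```python
import math

def samples_needed(eps, delta):
    # Enough samples so the sample median lands between the (0.5 - eps)
    # and (0.5 + eps) population quantiles with probability >= 1 - delta,
    # via a two-sided Hoeffding bound: 2 * exp(-2 * n * eps**2) <= delta.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_needed(0.016, 0.01))  # ~10,350 samples for roughly a 48.4%-51.6% band
print(samples_needed(0.001, 0.01))  # ~2.65 million samples for a 49.9%-50.1% band
```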
The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.
Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler than trying to understand if we were talking about 49-51% or 40-50%. And if it was too big a margin, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.
> the latter is much easier: sample 10,000 elements uniformly at random and take their median
Do you have a source for that claim?
I don't see how that could possibly be true... For example, if your original points are sampled from two Gaussians centered at -100 and 100, with small but slightly different variances, then the true median can be anywhere between the two centers, and you may need a humongous number of samples to get anywhere close to it.
True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.
I am not sure. But from the outside, it looks like what Prometheus does behind the scenes.
It seems to me that Prometheus works like that because it has a latency limit of around 10s on some systems I worked on.
So when we had requests above that limit, they were all recorded as 10s, even though the real latency could be higher than that.
Interesting.
The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)
I was using the term dictionary for illustration purposes. Remember, this was all in the context of MapReduce. Computation within MapReduce is built around grouping values by keys, which makes dictionaries a natural way to think about many MapReduce oriented algorithms, at least for me. The key/value pairs appear as streams of two-tuples, not as dictionaries or arrays.
Sorry, but I'm trying to keep this account relatively anonymous to sidestep some of my issues with being shy.
But, you're right, I was lucky to work on a bunch of fun problems. That period, in particular, was pretty amazing. I was part of a fun, collaborative team working on hard problems. And management showed a lot of trust in us. We came up with some very interesting solutions, some by skill and some by luck, that set the foundation for years of growth that came after that (both revenue growth and technical platform growth).
> P.S: In 2017 a new paper came out that actually makes the median-of-medians approach competitive with other selection algorithms. Thanks to the paper’s author, Andrei Alexandrescu for bringing it to my attention!
He also gave a talk about his algorithm in 2016. He's an entertaining presenter; I highly recommend it!
Andrei Alexandrescu is awesome; around 2000 he gave a talk on lock-free, wait-free algorithms that I immediately applied to a huge C++ industrial control networking project at the time.
I'd recommend that anyone who writes software listen to and read anything of Andrei's you can find; this one is indeed a Treasure!
That's wild; a bit of a polymath by computer science standards. I know him from his template metaprogramming fame, and here he is shifting from programming languages to algorithms.
I learned about the median-of-medians quickselect algorithm when I was an undergrad and was really impressed by it. I implemented it, and it was terribly slow. Its runtime grew linearly, but that only really mattered if you had at least a few billion items in your list.
I was chatting about this with a grad student friend who casually said something like "Sure, it's slow, but what really matters is that it proves that it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible. Now that we do, we know there might be an even faster linear algorithm." That conversation really got into the philosophy of what Computer Science is about in the first place.
The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.
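For readers who haven't seen it, a compact (and deliberately unoptimized) textbook-style sketch of median-of-medians selection looks roughly like this; it is not the article's code:

```python
def select(xs, k):
    # Return the k-th smallest element of xs (0-based), worst-case O(n).
    if len(xs) <= 5:
        return sorted(xs)[k]
    # Median of each group of at most 5 elements.
    groups = [sorted(xs[i:i + 5]) for i in range(0, len(xs), 5)]
    medians = [g[len(g) // 2] for g in groups]
    # Recursively pick the median of those medians as the pivot.
    pivot = select(medians, len(medians) // 2)
    lows   = [x for x in xs if x < pivot]
    pivots = [x for x in xs if x == pivot]
    highs  = [x for x in xs if x > pivot]
    if k < len(lows):
        return select(lows, k)
    elif k < len(lows) + len(pivots):
        return pivot
    else:
        return select(highs, k - len(lows) - len(pivots))

data = [7, 1, 9, 4, 4, 8, 2, 6, 3, 5]
print(select(data, len(data) // 2))  # 5, the upper median
```

The many list allocations and copies per level are a big part of why naive implementations feel slow in practice, even though the asymptotics are linear.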
Does the fact that a linear time algorithm exists indicate that a faster linear time algorithm exists? Otherwise, what is the gain from that bit of knowledge? You could also think: "We already know that some <arbitrary O(...)> algorithm exists; there might be an even faster <other O(...)> algorithm!"
What makes the existence of an O(n) algorithm any more of an indication than the existence of an O(n log n) algorithm?
I am not the original commenter, but I (and probably many CS students) have had similar moments of clarity. The key part for me isn't
> there might be an even faster linear algorithm,
but
> it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible.
For me, the moment of clarity was understanding that theoretical CS mainly cares about problems, not algorithms. Algorithms are tools to prove upper bounds on the complexity of problems. Lower bounds are equally important and cannot be proved by designing algorithms. We even see theorems of the form "there exists an O(whatever) algorithm for <problem>": the algorithm's existence can sometimes be proven non-constructively.
So if the median problem sat for a long time with a linear lower bound and superlinear upper bound, we might start to wonder if the problem has a superlinear lower bound, and spend our effort working on that instead. The existence of a linear-time algorithm immediately closes that path. The only remaining work is to tighten the constant factor. The community's effort can be focused.
A famous example is the linear programming problem. Klee and Minty proved an exponential worst case for the simplex algorithm, but not for linear programming itself. Later, Khachiyan proved that the ellipsoid algorithm was polynomial-time, but it had huge constant factors and was useless in practice. However, a few years later, Karmarkar gave an efficient polynomial-time algorithm. One can imagine how Khachiyan's work, although inefficient, could motivate a more intense focus on polynomial-time LP algorithms leading to Karmarkar's breakthrough.
If you had two problems, and a linear time solution was known to exist for only one of them, I think it would be reasonable to say that it's more likely that a practical linear time solution exists for that one than for the other one.
We studied (I believe) this algorithm in my senior year of Computer Science. We talked about the theory side of it that you mention, but this algorithm was also used to demonstrate that a "slow linear algorithm" is not faster than a "fast n log n algorithm" in most real-life cases.
I think we got a constant factor of 22 for this algorithm, so maybe it was a related one or something.
One of the fun things about the median-of-medians algorithm is its completely star-studded author list.
Manuel Blum - Turing award winner in 1995
Robert Floyd - Turing award winner in 1978
Ron Rivest - Turing award winner in 2002
Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)
Vaughan Pratt - oh no, the only non-Turing award winner in the list. Oh right but he's emeritus faculty at Stanford, directed the SUN project before it became Sun Microsystems, was instrumental in Sun's early days (director of research and designer of the Sun logo!), and is responsible for all kinds of other awesome stuff (near and dear to me: Pratt certificates of primality).
Four independent Turing awards! SPARCstations! This paper has it all.
Job interview question for an entry-level front end developer: "Reproduce the work of four Turing award winners in the next thirty minutes. You have a dirty whiteboard and a dry pen. Your time begins... now."
I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?
I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.
Perhaps this could be forgiven if you're a Python novice and hadn't realized that the two different operators exist, but this is not the case here, as the article contains this even more baffling code which uses integer division in one branch but float division in the other:
That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.
Well spotted! In Python 2, `/` performed floor division on integers, but in Python 3 it always performs true division; `//` is the explicit floor-division operator.
Indexing an array with a float raises an exception, I believe.
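A quick Python 3 illustration of both points (made-up list, just to show the behavior):

```python
xs = [3, 1, 4, 1, 5, 9, 2, 6]

print(len(xs) / 2)    # 4.0 -- true division always returns a float in Python 3
print(len(xs) // 2)   # 4   -- floor division returns an int

try:
    xs[len(xs) / 2]   # indexing a list with a float...
except TypeError as e:
    print(e)          # list indices must be integers or slices, not float
```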
I do agree that it is a code smell. However, given that this is an algorithms article, I don't think it is entirely fair to judge it on code quality. I think of it as: instead of writing pseudocode, the author chose a real, pseudocode-like programming language, and it (presumably) runs well enough for illustrative purposes.
> Technically, you could get extremely unlucky: at each step, you could pick the largest element as your pivot. Each step would only remove one element from the list and you'd actually have O(n²) performance instead of O(n)
If adversarial input is a concern, doing an O(n) shuffle of the data first guarantees this cannot happen. If the data is really too big to shuffle, then only shuffle once a bucket is small enough to be shuffled.
If you do shuffle, probabilities are here to guarantee that that worst case cannot happen. If anyone says that "technically" it can still happen, I'll answer that then "technically" an attacker could also correctly guess every bit of your 256-bit private key.
Our world is built on probabilities: all our private keys are protected by the mathematical improbability that someone will guess them correctly.
From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.
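A rough sketch of the shuffle-then-quickselect idea (my own minimal version, using a Lomuto partition with the last element as pivot, not the commenter's exact setup):

```python
import random

def median_via_quickselect(xs):
    if not xs:
        raise ValueError("empty input")
    xs = list(xs)
    random.shuffle(xs)            # O(n) Fisher-Yates shuffle defeats adversarial orderings
    k = len(xs) // 2              # upper median for even-length input
    lo, hi = 0, len(xs) - 1
    while True:
        pivot = xs[hi]            # after the shuffle this behaves like a random pivot
        i = lo
        for j in range(lo, hi):   # Lomuto partition of xs[lo..hi]
            if xs[j] < pivot:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
        xs[i], xs[hi] = xs[hi], xs[i]
        if k == i:
            return xs[i]
        elif k < i:
            hi = i - 1
        else:
            lo = i + 1

print(median_via_quickselect([5, 1, 4, 2, 3]))  # 3
```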
You're already using your own randomness to pick the pivot at random, so I don't see why the shuffle helps more. But yes, if your randomness is trustworthy, the probability of more than O(n) runtime is very low.
https://danlark.org/2020/11/11/miniselect-practical-and-gene...
In all use cases I've seen, a close estimate of the median was enough.
There's Treasure Everywhere - Andrei Alexandrescu
https://www.youtube.com/watch?v=fd1_Miy1Clg
That's an impressive list of authors, for sure.
Pratt parsing (HN discussion: https://news.ycombinator.com/item?id=39066465), the "P" in the KMP algorithm.
It doesn't guarantee that you avoid the worst case; it just removes the possibility of forcing the worst case.
However, I never managed to understand how it works.
https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...