10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.
It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.
As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.
The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
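For concreteness, here's a minimal single-process sketch of that capped histogram; the 999ms cap, the function names, and the split into partial/merge steps are illustrative stand-ins for the map and reduce phases, not the original pipeline:

```python
from collections import Counter

CAP_MS = 999  # everything slower than this lands in the top bin

def partial_histogram(timings_ms):
    """Per-worker ("map") step: count each integer-millisecond timing, clamped at CAP_MS."""
    counts = Counter()
    for t in timings_ms:
        counts[min(t, CAP_MS)] += 1
    return counts

def merge(histograms):
    """The "reduce" step: histograms from different workers simply add together."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

def median_from_histogram(counts):
    """Walk the bins in order until half of the samples have been seen."""
    n = sum(counts.values())
    seen = 0
    for bucket in sorted(counts):
        seen += counts[bucket]
        if seen * 2 >= n:
            return bucket
    raise ValueError("empty histogram")

# Two "workers", then a single reduce.
h1 = partial_histogram([12, 12, 40, 1500])
h2 = partial_histogram([7, 40, 40])
print(median_from_histogram(merge([h1, h2])))  # 40
```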
Did you actually need to find the true median of billions of values? Or would finding a value between the 49.9th and 50.1st percentiles suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median.
(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)
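(For what it's worth, one rough way to do that statistics is a Hoeffding bound on how far the sample median's rank can drift; the numbers below are my own back-of-the-envelope figures, not anything claimed in the thread.)

```python
import math

def samples_needed(eps, delta):
    # Enough samples so the sample median lands between the (0.5 - eps)
    # and (0.5 + eps) population quantiles with probability >= 1 - delta,
    # via a two-sided Hoeffding bound: 2 * exp(-2 * n * eps**2) <= delta.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_needed(0.016, 0.01))  # ~10,350 samples for roughly a 48.4%-51.6% band
print(samples_needed(0.001, 0.01))  # ~2.65 million samples for a 49.9%-50.1% band
```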
The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.
Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler than trying to understand if we were talking about 49-51% or 40-50%. And if it was too big a margin, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.
> the latter is much easier: sample 10,000 elements uniformly at random and take their median
Do you have a source for that claim?
I don't see how that could possibly be true... For example, if your original points are sampled from two Gaussians centered at -100 and 100, with small but slightly different variances, then the true median can be anywhere between the two centers, and you may need a humongous number of samples to get anywhere close to it.
True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.
I am not sure. But from the outside, it looks like what Prometheus does behind the scenes.
It seems to me that Prometheus works like that because it has a latency limit of around 10s on some systems I worked on.
So when we had requests above that limit, they were all recorded as 10s, even though the real latency could be higher than that.
Interesting.
The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)
I was using the term dictionary for illustration purposes. Remember, this was all in the context of MapReduce. Computation within MapReduce is built around grouping values by keys, which makes dictionaries a natural way to think about many MapReduce oriented algorithms, at least for me. The key/value pairs appear as streams of two-tuples, not as dictionaries or arrays.
Sorry, but I'm trying to keep this account relatively anonymous to sidestep some of my issues with being shy.
But, you're right, I was lucky to work on a bunch of fun problems. That period, in particular, was pretty amazing. I was part of a fun, collaborative team working on hard problems. And management showed a lot of trust in us. We came up with some very interesting solutions, some by skill and some by luck, that set the foundation for years of growth that came after that (both revenue growth and technical platform growth).
> P.S: In 2017 a new paper came out that actually makes the median-of-medians approach competitive with other selection algorithms. Thanks to the paper’s author, Andrei Alexandrescu for bringing it to my attention!
He also gave a talk about his algorithm in 2016. He's an entertaining presenter; I highly recommend it!
Andrei Alexandrescu is awesome; around 2000 he gave a talk on lock-free, wait-free algorithms that I immediately applied to a huge C++ industrial control networking project at the time.
I'd recommend that anyone who writes software listen to and read anything of Andrei's you can find; this one is indeed a Treasure!
That's wild; a bit of a polymath by computer science standards. I know him from his template metaprogramming fame, and here he is shifting from programming languages to algorithms.
I learned about the median-of-medians quickselect algorithm when I was an undergrad and was really impressed by it. I implemented it, and it was terribly slow. Its runtime grew linearly, but that only really mattered if you had at least a few billion items in your list.
I was chatting about this with a grad student friend who casually said something like "Sure, it's slow, but what really matters is that it proves that it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible. Now that we do, we know there might be an even faster linear algorithm." That conversation really got into the philosophy of what Computer Science is about in the first place.
The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.
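For readers who haven't seen it, a compact (and deliberately unoptimized) textbook-style sketch of median-of-medians selection looks roughly like this; it is not the article's code:

```python
def select(xs, k):
    # Return the k-th smallest element of xs (0-based), worst-case O(n).
    if len(xs) <= 5:
        return sorted(xs)[k]
    # Median of each group of at most 5 elements.
    groups = [sorted(xs[i:i + 5]) for i in range(0, len(xs), 5)]
    medians = [g[len(g) // 2] for g in groups]
    # Recursively pick the median of those medians as the pivot.
    pivot = select(medians, len(medians) // 2)
    lows   = [x for x in xs if x < pivot]
    pivots = [x for x in xs if x == pivot]
    highs  = [x for x in xs if x > pivot]
    if k < len(lows):
        return select(lows, k)
    elif k < len(lows) + len(pivots):
        return pivot
    else:
        return select(highs, k - len(lows) - len(pivots))

data = [7, 1, 9, 4, 4, 8, 2, 6, 3, 5]
print(select(data, len(data) // 2))  # 5, the upper median
```

The many list allocations and copies per level are a big part of why naive implementations feel slow in practice, even though the asymptotics are linear.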
Does the fact that a linear time algorithm exists indicate that a faster linear time algorithm exists? Otherwise, what is the gain from that bit of knowledge? You could also think: "We already know that some <arbitrary O(...)> algorithm exists; there might be an even faster <other O(...)> algorithm!"
What makes the existence of an O(n) algorithm any more of an indication than the existence of an O(n log n) algorithm?
I am not the original commenter, but I (and probably many CS students) have had similar moments of clarity. The key part for me isn't
> there might be an even faster linear algorithm,
but
> it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible.
For me, the moment of clarity was understanding that theoretical CS mainly cares about problems, not algorithms. Algorithms are tools to prove upper bounds on the complexity of problems. Lower bounds are equally important and cannot be proved by designing algorithms. We even see theorems of the form "there exists an O(whatever) algorithm for <problem>": the algorithm's existence can sometimes be proven non-constructively.
So if the median problem sat for a long time with a linear lower bound and superlinear upper bound, we might start to wonder if the problem has a superlinear lower bound, and spend our effort working on that instead. The existence of a linear-time algorithm immediately closes that path. The only remaining work is to tighten the constant factor. The community's effort can be focused.
A famous example is the linear programming problem. Klee and Minty proved an exponential worst case for the simplex algorithm, but not for linear programming itself. Later, Khachiyan proved that the ellipsoid algorithm was polynomial-time, but it had huge constant factors and was useless in practice. However, a few years later, Karmarkar gave an efficient polynomial-time algorithm. One can imagine how Khachiyan's work, although inefficient, could motivate a more intense focus on polynomial-time LP algorithms leading to Karmarkar's breakthrough.
If you had two problems, and a linear time solution was known to exist for only one of them, I think it would be reasonable to say that it's more likely that a practical linear time solution exists for that one than for the other one.
We studied (I believe) this algorithm in my senior year of Computer Science. We talked about the theory side of it that you mention, but this algorithm was also used to demonstrate that a "slow linear algorithm" is not faster than a "fast n log n algorithm" in most real-life cases.
I think we got a constant factor of 22 for this algorithm, so maybe it was a related one or something.
One of the fun things about the median-of-medians algorithm is its completely star-studded author list.
Manuel Blum - Turing award winner in 1995
Robert Floyd - Turing award winner in 1978
Ron Rivest - Turing award winner in 2002
Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)
Vaughan Pratt - oh no, the only non-Turing award winner in the list. Oh right but he's emeritus faculty at Stanford, directed the SUN project before it became Sun Microsystems, was instrumental in Sun's early days (director of research and designer of the Sun logo!), and is responsible for all kinds of other awesome stuff (near and dear to me: Pratt certificates of primality).
Four independent Turing awards! SPARCstations! This paper has it all.
Job interview question for an entry-level front end developer: "Reproduce the work of four Turing award winners in the next thirty minutes. You have a dirty whiteboard and a dry pen. Your time begins... now."
I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?
I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.
Perhaps this could be forgiven if you're a Python novice and hadn't realized that the two different operators exist, but this is not the case here, as the article contains this even more baffling code which uses integer division in one branch but float division in the other:
That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.
Well spotted! In Python 2, `/` performed floor division on integers, but in Python 3 it always performs true division; `//` is the explicit floor-division operator.
Indexing an array with a float raises an exception, I believe.
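A quick Python 3 illustration of both points (made-up list, just to show the behavior):

```python
xs = [3, 1, 4, 1, 5, 9, 2, 6]

print(len(xs) / 2)    # 4.0 -- true division always returns a float in Python 3
print(len(xs) // 2)   # 4   -- floor division returns an int

try:
    xs[len(xs) / 2]   # indexing a list with a float...
except TypeError as e:
    print(e)          # list indices must be integers or slices, not float
```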
I do agree that it is a code smell. However, given that this is an algorithms article, I don't think it is entirely fair to judge it on code quality. I think of it as: instead of writing pseudocode, the author chose a real, pseudocode-like programming language, and it (presumably) runs well enough for illustrative purposes.
> Technically, you could get extremely unlucky: at each step, you could pick the largest element as your pivot. Each step would only remove one element from the list and you'd actually have O(n²) performance instead of O(n)
If adversarial input is a concern, doing an O(n) shuffle of the data first guarantees this cannot happen. If the data is really too big to shuffle, then only shuffle once a bucket is small enough to be shuffled.
If you do shuffle, probabilities are here to guarantee that that worst case cannot happen. If anyone says that "technically" it can still happen, I'll answer that then "technically" an attacker could also correctly guess every bit of your 256-bit private key.
Our world is built on probabilities: all our private keys are protected by the mathematical improbability that someone will guess them correctly.
From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.
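A rough sketch of the shuffle-then-quickselect idea (my own minimal version, using a Lomuto partition with the last element as pivot, not the commenter's exact setup):

```python
import random

def median_via_quickselect(xs):
    if not xs:
        raise ValueError("empty input")
    xs = list(xs)
    random.shuffle(xs)            # O(n) Fisher-Yates shuffle defeats adversarial orderings
    k = len(xs) // 2              # upper median for even-length input
    lo, hi = 0, len(xs) - 1
    while True:
        pivot = xs[hi]            # after the shuffle this behaves like a random pivot
        i = lo
        for j in range(lo, hi):   # Lomuto partition of xs[lo..hi]
            if xs[j] < pivot:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
        xs[i], xs[hi] = xs[hi], xs[i]
        if k == i:
            return xs[i]
        elif k < i:
            hi = i - 1
        else:
            lo = i + 1

print(median_via_quickselect([5, 1, 4, 2, 3]))  # 3
```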
You're already using your own randomness to pick the pivot at random, so I don't see why the shuffle helps more. But yes, if your randomness is trustworthy, the probability of more than O(n) runtime is very low.
https://danlark.org/2020/11/11/miniselect-practical-and-gene...
In all use cases I've seen, a close estimate of the median was enough.
There's Treasure Everywhere - Andrei Alexandrescu
https://www.youtube.com/watch?v=fd1_Miy1Clg
That's an impressive list of authors, for sure.
Pratt parsing (HN discussion: https://news.ycombinator.com/item?id=39066465), the "P" in the KMP algorithm.
It doesn't guarantee that you avoid the worst case; it just removes the possibility of forcing the worst case.
However, I never managed to understand how it works.
https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...