In college in the 1970s, I took a combinatorial algorithms class with Herb Wilf. He explained that choosing a subset of k distinct elements from 1..n uniformly at random likely required k log k time, because you needed to detect collisions, and "sorting" k elements was k log k.
(Always check your assumptions. Sorting has a k log k lower bound if all you can do is compare elements. With integers 1..n, one can do much more.)
I showed up at 5pm at his office, as he was bracing himself for the drive from Swarthmore back to Philly, wondering why he ever took this gig. I proposed a linear algorithm: Break 1..n into k bins, and only sort to detect collisions within each bin. Bins would have an average occupancy of 1, so it didn't matter what sort you used.
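In today's terms, a C++ sketch of the bin idea (the function name, structure, and redraw loop are my reading of the description, not the 1970s version; assumes n >= k):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Draw k values from 1..n, bucket them into k equal ranges, and detect
    // duplicates by sorting only within each bucket (expected occupancy ~1,
    // so any sort is cheap). Replace duplicates with fresh draws and repeat.
    std::vector<uint64_t> sample_distinct(uint64_t n, size_t k, std::mt19937_64& rng) {
        std::uniform_int_distribution<uint64_t> draw(1, n);
        const uint64_t width = (n + k - 1) / k;          // values covered per bin
        std::vector<uint64_t> chosen;
        while (chosen.size() < k) {
            while (chosen.size() < k) chosen.push_back(draw(rng));   // top up
            std::vector<std::vector<uint64_t>> bins(k);
            for (uint64_t v : chosen) bins[(v - 1) / width].push_back(v);
            chosen.clear();
            for (auto& b : bins) {
                std::sort(b.begin(), b.end());           // tiny bins: any sort will do
                b.erase(std::unique(b.begin(), b.end()), b.end());
                chosen.insert(chosen.end(), b.begin(), b.end());
            }
        }
        return chosen;   // k distinct values, sorted as a side effect
    }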
(I found this idea later in Knuth, and was briefly disillusioned that adults didn't read everything.)
I looked like George Harrison post-Beatles, I probably smelled of pot smoke, and he dismissed my idea. Then my dorm hall phone rang at 10pm, a professor of mine relaying an apology. For that and other ideas, the second edition of his book opened with thanks to me.
I'm surprised now that I realized then that bubble sort actually had the least overhead in such an application. People don't use Strassen to multiply matrices unless the matrices are really big. For sorting, "really big" is 2 or 3 elements.
> choosing k elements from 1..n uniformly at random
Can you clarify the problem?
k elements uniformly at random, independently, with replacement, sounds like it’s trivially O(k), or perhaps O(k log n) if n is so big that you need O(k log n) bits to merely represent the answer. (Those logs are pesky and can appear and disappear in different models of computation!)
Maybe you mean k distinct elements? Off the top of my head, if n is not huge (n = O(k), or more concretely, n < ck for your choice of c > 1, and I would probably start with 2 if I were implementing this), I would make an array of all the elements and draw k of them using any of a number of straightforward algorithms. (Fisher-Yates, for example, but I admit I had to look up the name.)
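For the small-n case, a partial Fisher-Yates sketch (names are mine):

    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    // Materialize 1..n and do k steps of Fisher-Yates; after step i, the first
    // i slots hold a uniformly random i-subset in random order.
    std::vector<int> sample_small_n(int n, int k, std::mt19937& rng) {
        std::vector<int> a(n);
        std::iota(a.begin(), a.end(), 1);                // 1..n
        for (int i = 0; i < k; ++i) {
            std::uniform_int_distribution<int> pick(i, n - 1);
            std::swap(a[i], a[pick(rng)]);
        }
        a.resize(k);
        return a;
    }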
For larger n, make a hash table to store the elements selected so far. Now repeat k times: choose an element at random, retry if it's a duplicate, then add it to the hash table and continue. Each draw is a duplicate with probability at most 1/c, so each element needs only a constant expected number of tries and the overall expected time is O(k). (For n < ck, the last few draws may retry quite a lot.)
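And the hash-table version as a sketch:

    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <unordered_set>
    #include <vector>

    // Rejection sampling: while fewer than k of the n values are taken and
    // n >= c*k with c > 1, each draw is a duplicate with probability < 1/c,
    // so the expected cost per element is O(1) and the total is O(k).
    std::vector<uint64_t> sample_large_n(uint64_t n, size_t k, std::mt19937_64& rng) {
        std::uniform_int_distribution<uint64_t> draw(1, n);
        std::unordered_set<uint64_t> seen;
        std::vector<uint64_t> out;
        while (out.size() < k) {
            uint64_t v = draw(rng);
            if (seen.insert(v).second)    // insert() reports whether v was new
                out.push_back(v);
        }
        return out;
    }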
Thinking of this as sorting seems odd, but I admit I’m thinking about this in 2023, not the 1970s :)
Yes, distinct elements. Fixed. That's the only way what I said makes sense, like how one error-corrects while listening to a talk.
They were concerned with the case where n could be absurdly large, and wanted no time or space dependence on n. They improved my idea to cut the multiple of k needed for space.
> Bins would have an average occupancy of 1, so it didn't matter what sort you used.
That's not quite right, though - you only care about the average occupancy if your sorting algorithm is O(b) for b elements. If you're using an O(b^2) sort, you actually care about the root-mean-square occupancy (or in general just the average runtime, or just the total runtime, since you'll multiply the average by the number of bins anyway).
The k-bin approach is definitely much faster than O(k log k), but I'm fairly sure it's slightly slower than truly linear, asymptotically, since the runtime will be something like 1+0+4+1 in typical cases, where linear would have to be 1+0+2+1.
Edit: on some cursory Monte Carlo testing, the sum of squares of occupancies actually seems to still converge on a constant factor (of two) times the sum of plain/linear occupancies, which I was not expecting but does make the above point rather moot. (Also, the sum of cubed occupancies seems to converge on a factor of five, which seems downright bizarre.)
(Tangential edit: The factors seem to be 1 2 5 15 52±2 200±20, which might be https://oeis.org/A000110, though none of the interpretations there seem obviously related.)
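For anyone who wants to poke at this, a sketch of the Monte Carlo in C++. (A plausible explanation, noted in the comments below: the occupancies are approximately Poisson with mean 1, and the m-th moment of a Poisson(1) variable is the m-th Bell number 1, 2, 5, 15, 52, 203, which is A000110.)

    #include <cstdio>
    #include <random>
    #include <vector>

    // Throw k balls into k bins and report the average of occupancy^m.
    // Occupancies are approximately Poisson(1); the m-th moment of Poisson(1)
    // equals the m-th Bell number (1, 2, 5, 15, 52, 203, ...), matching the
    // factors observed above.
    int main() {
        const size_t k = 1000000;
        std::mt19937_64 rng(42);
        std::uniform_int_distribution<size_t> pick(0, k - 1);
        std::vector<unsigned> occ(k, 0);
        for (size_t i = 0; i < k; ++i) ++occ[pick(rng)];
        for (int m = 1; m <= 6; ++m) {
            double sum = 0;
            for (unsigned c : occ) {
                double p = 1;
                for (int j = 0; j < m; ++j) p *= c;   // occupancy^m
                sum += p;
            }
            std::printf("m=%d: average occupancy^m = %.2f\n", m, sum / k);
        }
        return 0;
    }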
There's a fixed, rapidly vanishing probability distribution on bin occupancies, which evaluates to a constant per bin when one applies a sorting algorithm. One gets linear algorithms with different constants, depending on the sorting algorithm. You're thinking "deep end of the pool" worrying about b log b versus b^2. I was in the "shallow end of the pool" where the constants in front of fancy sorts took them out of the running.
Like I said about Strassen and matrix multiplication. Great algorithm, but it takes a while to kick in for the lead.
I've written a sort perhaps a handful of times in the last 40 years. Each of those was a bubble sort. I was nearly-always sorting <4 elements (and often just 1, or even 0).
These were rare cases where I didn't have a sort algorithm in a library -- perhaps because I needed to pass in a comparator that wasn't easily supported in the language. So I banged out whatever was easiest to get me moving on to the actual problem. If the sort became a noticeable part of the running time, I'd deal with that when I had performance numbers.
Bubble sort was easy to remember and quick to bang out. For all I know I really was writing an insertion sort; all O(n^2) sorts look pretty much the same.
I haven't written any sorts at all in well over a decade, since all languages have substantial support. If I'm writing a sort, something has gone very wrong.
In my opinion, a bubble sort is a solid choice for that sort of use case.
> On some very particular hardware bubblesort still ends up better, like in this NVIDIA study, but you probably don't have that hardware.
The author might have misunderstood the hardware in question and the reasons why bubble sort works well here. You probably do have applicable hardware; it’s any GPU. The bubble sort wasn’t running on ray tracing hardware, it was just CUDA code. The reason bubble sort works well is because it’s running on a SIMD machine, and it’s better when all threads in a work unit (a “warp” in CUDA parlance) all do the same thing at all times.
If you try to do insertion sort on a SIMD machine, you're very likely to approach worst-case O(N^2) performance very quickly, even when each individual thread involved would have had average-case perf on its own. Bubble sort is good for the parallel SIMD machine because of its deterministic and regular evaluation order, which is the same across all threads.
Of course, it’s still true that for anything larger than very small arrays of data, bubble sort complexity is unacceptable on SIMD or parallel processors. There are better parallel sorting algorithms for that.
In other words, you are sorting lots of equally-sized slabs at once, and you can do the same operation to each slab efficiently. Algorithms with no data-dependent control flow would be optimal in this situation, and bubblesort is one of them. But there are even better algorithms; I think shellsort with a fixed gap sequence (since the slab size is known in advance) could be a good alternative. On the other hand, if you are sorting a single huge list, you need a parallel sorting algorithm like bitonic sort.
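To make "no data-dependent control flow" concrete, a scalar C++ sketch (not the actual CUDA code from the study): bubble sort with a fixed pass count and branchless compare-exchanges, so the instruction stream is identical regardless of the input.

    #include <algorithm>

    // Fixed number of passes, no early exit, no data-dependent branches: every
    // SIMD lane sorting its own small array executes exactly the same sequence.
    void bubble_sort_fixed(float* a, int n) {
        for (int pass = 0; pass + 1 < n; ++pass) {
            for (int i = 0; i + 1 < n; ++i) {
                float lo = std::min(a[i], a[i + 1]);   // compare-exchange
                float hi = std::max(a[i], a[i + 1]);
                a[i] = lo;
                a[i + 1] = hi;
            }
        }
    }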
>> bubble sort in a production game engine to keep transparent objects mostly sorted.
You do the compare-and-swap when you're iterating over the objects to do some other render pass, so it adds almost nothing to the CPU time. A single pass of bubble sort per frame is often enough to keep things in order. And if something glitches because it takes two passes, it's just a single-frame glitch!
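Roughly like this, as a sketch (the object type and depth_of are placeholders, not any particular engine's API):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // One bubble pass folded into a per-frame loop that visits the objects
    // anyway. The list stays mostly sorted far-to-near across frames; a
    // momentary inversion costs at most a frame or two of wrong blending.
    template <class Object, class DepthFn>
    void per_frame_pass(std::vector<Object>& objs, DepthFn depth_of) {
        for (std::size_t i = 0; i + 1 < objs.size(); ++i) {
            // ... whatever per-object work this render pass already does ...
            if (depth_of(objs[i]) < depth_of(objs[i + 1]))   // want far-to-near: nearer one first, so swap
                std::swap(objs[i], objs[i + 1]);
        }
    }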
> If two threads go to CAS-sort at the same time, the cache coherency overhead would be apocalyptically bad. I'm assuming that's not what you meant?
I think it's apparent they didn't mean the CAS atomic instruction, just the compare-and-swap operation in bubblesort. Nothing in their comment suggested this was a multi-threaded situation.
Bubble sorts are useful in broad-phase collision detection. Each object has an axis-aligned bounding box, and you want to sort those by X, Y, and Z values, then check for overlap in all dimensions. Only if the bounding boxes overlap in all three dimensions do you have to check the object pair's detailed geometry.
Most objects aren't moving that much. The lists are thus almost in order.
So running a bubble sort on each axis to check the ordering is not much worse than O(N).
Startup has a transient. The very first sort will be O(N^2). If the number of objects is large enough it may be worthwhile to do something different at startup.
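As a sketch of the per-axis re-sort (the Box struct is made up for illustration; only the x axis shown):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Keep the boxes sorted by min_x from the previous frame. Objects move a
    // little per frame, so the list is almost in order and bubble sort with an
    // early-out usually finishes in one or two O(N) passes.
    struct Box { float min_x, max_x; int object_id; /* y and z bounds omitted */ };

    void resort_by_min_x(std::vector<Box>& boxes) {
        bool swapped = true;
        while (swapped) {
            swapped = false;
            for (std::size_t i = 0; i + 1 < boxes.size(); ++i) {
                if (boxes[i + 1].min_x < boxes[i].min_x) {
                    std::swap(boxes[i], boxes[i + 1]);
                    swapped = true;
                }
            }
        }
    }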
[citation needed]. Even just one object moving from one end of an axis to the other end would incur catastrophic slowdown, and depending on what you are simulating, this might occur regularly (for example, consider teleporters in a video game). When that happens, you incur the full O(n^2) iterations until the axis is sorted again, even though only one object is actually out of order.
There are certainly simulations where this approach is fine (if you can be absolutely confident that no object ever "skips" around, like when you're running a simulation with an explicit upper bound on velocity), but I think it still deserves this warning label.
Funny, if I were doing that I'd use a spatial index built on octrees, since I have a fairly optimized one. You find the potential overlaps for an object in O(log N), but checking them all would be O(N log N). So by your approach it looks like bubble sort wins!
Bubble sort is great if you have many cores to spare but no extra memory. It parallelizes nicely and works in-place (see the odd-even sketch below).
Insertion sort is great if you don't have any cores or memory to spare. It uses one core and works in-place.
Selection sort is much like insertion sort, runs in-place, single-threaded, but it will be faster if your CPU is much faster than your memory access as it uses more comparisons and fewer swaps.
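For the "parallelizes nicely" point above, the textbook form is odd-even transposition sort: within one phase every compared pair is disjoint, so a phase can be split across cores (or SIMD lanes) with no extra memory. A serial sketch:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Odd-even transposition sort: n phases of disjoint compare-exchanges.
    // Serial here, but each phase's pairs are independent and can run in parallel.
    void odd_even_sort(std::vector<int>& a) {
        const std::size_t n = a.size();
        for (std::size_t phase = 0; phase < n; ++phase)
            for (std::size_t i = phase % 2; i + 1 < n; i += 2)
                if (a[i + 1] < a[i]) std::swap(a[i], a[i + 1]);
    }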
Comb sort tip (see https://www.delftstack.com/tutorial/algorithm/comb-sort/): instead of dividing by 1.3, use 1.333 (repeating).
In practical terms using integers, this makes almost no difference *except* it can be performed using faster shift and add operations:
gap = ((gap << 1) + gap) >> 2;  // gap = gap * 3 / 4, i.e. divide by 4/3 = 1.333...
For even more speed, I'd also recommend inlining this to eliminate the function call. Properly implemented, a really trivial comb sort can achieve respectable speed.
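Putting it together, a minimal comb sort with that gap rule inlined (a sketch, not anyone's production code):

    #include <utility>

    // Comb sort with shrink factor 4/3: gap = gap * 3 / 4 each round, done with
    // shift-and-add. Once the gap reaches 1 it degenerates into bubble sort with
    // an early-out, which finishes the job.
    void comb_sort(int* a, int n) {
        int gap = n;
        bool swapped = true;
        while (gap > 1 || swapped) {
            gap = ((gap << 1) + gap) >> 2;   // gap = gap * 3 / 4
            if (gap < 1) gap = 1;
            swapped = false;
            for (int i = 0; i + gap < n; ++i) {
                if (a[i + gap] < a[i]) {
                    std::swap(a[i], a[i + gap]);
                    swapped = true;
                }
            }
        }
    }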
That link seems to insist on defining bubble sort without a termination condition, which is not how I think of bubble sort and I suspect is not how anyone is defining it in the cases it argues against. Also the use case in this article isn't listed.