So the idea is that you'll eventually arrange n elements into ideally about √n sorted lists, organized so that the endpoints of these lists are also ordered (the "rainbow" shape). A new element goes on the end of one of those lists or starts a new list; which one is generally decided by binary search, but the index is saved to speed up the next insertion if it lands in the same place. So a natural run might benefit by having adjacent values inserted into the same list.
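To make that concrete, here's a much-simplified, single-ended sketch of the insertion step (the structure and names are mine, not the paper's; the real algorithm keeps a two-sided rainbow, inserts at heads as well as tails, and uses a smarter merge order):

```python
import bisect
import heapq
import random

def jessesort_sketch(a):
    piles, tails = [], []  # tails[i] == piles[i][-1], kept in ascending order
    last = 0               # cached pile index from the previous insertion
    for x in a:
        # A natural ascending run tends to land in the same pile
        # repeatedly, so try the cached index before binary searching.
        if tails and tails[last] <= x and (last + 1 == len(tails) or x < tails[last + 1]):
            i = last
        else:
            i = bisect.bisect_right(tails, x) - 1   # rightmost tail <= x
        if i < 0:
            piles.insert(0, [x])   # x is below every tail: start a new pile
            tails.insert(0, x)
            last = 0
        else:
            piles[i].append(x)
            tails[i] = x
            last = i
    return list(heapq.merge(*piles))  # lazy stand-in for the merge phase

print(jessesort_sketch([random.randrange(100) for _ in range(20)]))
```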
Which cases benefit and which don't seem potentially quite erratic. If you have a series of runs that shrink over the course of the array, with increasing start values and decreasing end values, each will become a new sorted list and you end up with O(n) insertion plus a perfect merge. If the runs get wider instead, they'll cross over existing endpoints and end up scattered around before merging.
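If it helps to see the two shapes, here's one way to generate them (function names and parameters are my own, not the paper's):

```python
def nested_runs(n_runs, width):
    # Shrinking runs that nest: start values rise, end values fall,
    # so each run should open a fresh sorted list in the rainbow.
    out = []
    for k in range(n_runs):
        out.extend(range(k * width, (2 * n_runs - k) * width))
    return out

def widening_runs(n_runs, width):
    # The reverse: each run overhangs the previous one on both sides,
    # crossing the existing endpoints.
    out = []
    for k in reversed(range(n_runs)):
        out.extend(range(k * width, (2 * n_runs - k) * width))
    return out
```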
It feels like the paper keeps trying to spin constant-factor improvements as somehow being asymptotically relevant despite acknowledging that mathematically they aren't: for example, the fact that log(√n) is only log(n)/2 "would not reflect the sublinear nature of base array length k", and runs hitting the best case "approximately 50% of the time... reinforces JesseSort's average runtime complexity as somewhere between O(n) and O(n log_2 n)". Powersort provably takes O(n + n log r) worst-case time, where r is the number of runs; I expect this algorithm is O(n log n) in both the worst and average case.
However, on random arrays the given benchmarks do improve relative to Python's default sort as the length increases, with faster performance starting a little under a million elements. There isn't much investigation into why. In particular, I'd be interested to know whether the number of comparisons is reduced. Perhaps this algorithm is more cache-friendly on random input, but that wouldn't be reliable, as many inputs can cause the number of sorted lists to scale as O(n), so that the binary searches have to work on a large array.
The analysis would make a lot more sense if it dropped every big O and computed an expected number of comparisons. Of course, the real issue is that the whole thing relies on an empirically determined length of √8√n for the number of lists in the rainbow, when the theoretical justification isn't there and the worst case is clearly n/2. I'm not seeing any awareness that the expected number of comparisons has to be at least log_2(n!), and I'm starting to see some worrying signs that the author thinks averages like n log log n are possible. I'd initially assumed that the "somewhere between O(n) and O(n log_2 n)" statement meant in the presence of asymptotically large natural runs, but there's a claim that the runs "need not be of any minimum size" and another that the algorithm can "capitalize on natural runs" that seems to be referring to the benchmarks on "purely random values".
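For reference, that bound is easy to evaluate exactly, and per element it works out to about log_2(n) - 1.44 comparisons, which leaves no room for anything like n log log n on average:

```python
import math

def log2_factorial(n):
    # log2(n!) = lgamma(n + 1) / ln(2) -- the information-theoretic
    # lower bound on the average number of comparisons for n elements.
    return math.lgamma(n + 1) / math.log(2)

for n in (10**3, 10**6):
    print(n, log2_factorial(n) / n, math.log2(n))
# n = 1000:    ~8.53 comparisons per element vs log2(n) ~ 9.97
# n = 1000000: ~18.49 comparisons per element vs log2(n) ~ 19.93
```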
Ah, the README prior to commit b98f724 does claim O(n log(1.5 ln(n))) asymptotic runtime. This is based on a flawed analysis assuming the rainbow endpoints will be evenly distributed, which the paper presents before describing the problem that each inserted element pushes one of those endpoints away from the middle. Bit of a perpetual-motion-machine search: initial results showed the promise of the approach, some details got in the way, but surely there's some trick to work around them...
Still, there's the benchmark. I remembered that timsort.txt shows comparison counts relative to the log_2(n!) bound, and it's only about 1% worse, so the >30% improvement over Timsort at the high end has to come from some other factor. I'm thinking it is actually cache effects—not because the algorithm has inherently better cache properties, as it's just a merge sort at large scales, but because it copies the array into native ints internally rather than using a generic comparison. That would be a poor result, as good comparison sorts compiled for 4-byte data are many times faster than Timsort.
If you're going to make a big claim about sort speed, tell me how speed is better or worse for various data. How do the algorithms compare when the data is already ordered, when it's almost (but not quite) ordered, when it's largely ordered, when it's completely random, and when it's in the opposite order? This, along with the size of the dataset, is what we need to know in practice.
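Something like this covers the interesting shapes (names and noise levels are my own choices, tune to taste):

```python
import random

def benchmark_inputs(n, seed=0):
    rng = random.Random(seed)
    base = list(range(n))
    nearly = base[:]
    for _ in range(max(1, n // 100)):        # ~1% random swaps
        i, j = rng.randrange(n), rng.randrange(n)
        nearly[i], nearly[j] = nearly[j], nearly[i]
    return {
        "sorted": base,
        "nearly_sorted": nearly,
        "mostly_sorted": base[: n * 9 // 10]
                         + [rng.randrange(n) for _ in range(n - n * 9 // 10)],
        "random": rng.sample(base, n),
        "reversed": base[::-1],
    }
```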
Rather than, or at least in addition to, raw measured speed on a specific piece of hardware, which is often affected in hard-to-understand ways by niche optimisation choices and hardware nuances, I really like the choice to report how many comparisons your algorithm needed.
For example, Rust's current unstable sort takes ~24 comparisons on average for each of 10^7 truly random elements to sort them all, but if instead all those elements are chosen (still at random) from only 20 possibilities, it needs only a bit more than five comparisons per element, regardless of whether there are 10^3 elements or 10^7.
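That kind of measurement is easy to reproduce. In CPython, for instance (this counts timsort's comparisons rather than Rust's, but the low-diversity effect shows up the same way):

```python
import random

class Counted:
    # Wraps a value so every comparison the sort performs is tallied.
    # CPython's sort only ever calls __lt__.
    count = 0
    def __init__(self, v): self.v = v
    def __lt__(self, other):
        Counted.count += 1
        return self.v < other.v

def comparisons_per_element(data):
    Counted.count = 0
    sorted(Counted(v) for v in data)
    return Counted.count / len(data)

n = 10**5
print(comparisons_per_element([random.random() for _ in range(n)]))       # near log2(n!)/n
print(comparisons_per_element([random.randrange(20) for _ in range(n)]))  # far fewer: low diversity
```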
Unlike "On this Intel i6-9402J Multi Wazoo (Bonk Nugget Edition) here are my numbers" which is not very useful unless you also have the i6-9402J in that specific edition, these comparison counts get to a more fundamental property of the algorithm that transcends micro architecture quirks which will not matter next year.
"My boss wants to buy systems with the Intel i10-101010F Medium Core Platinum (with rowhammer & Sonic & Knuckles), can you buy this $20,000 box and test your program so I can write him a report?"
I want to elaborate on the timing cautions: a sort that specializes to 4-byte machine ints is doing something that can't be expressed directly in CPython. Its lists are heterogeneous, types aren't known until runtime, and there is no native (to Python) concept of a "4-byte int". All CPython ints are stored in "bigint" (unbounded) format, and even zero requires 32 bytes of storage in that format (overhead for a type pointer, a refcount, and 16-byte heap-alignment padding). That format isn't 2's complement either (although it creates the semantic _illusion_ of being such): it's sign+magnitude.
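You can see the boxing overhead directly (sys.getsizeof reports 28 bytes for a small int on a 64-bit build; the 16-byte allocator alignment mentioned above rounds the actual heap cost up to 32):

```python
import sys

print(sys.getsizeof(0))       # 28 on a 64-bit CPython build
print(sys.getsizeof(2**30))   # 32: one more 30-bit "digit" of magnitude
print(sys.getsizeof(-1))      # 28: the sign lives in the size field, not the bits
```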
The general object comparison machinery is very expensive at runtime, essentially repeated from scratch on every compare. To mitigate this, CPython does a prepass over the list (O(n)) to see whether the list is in fact homogeneous and, if so, of a type that can use a more specialized comparison function. Not free, but it pays off hugely if you are in fact sorting a list of "small enough" machine ints. But it doesn't change the storage format: on every compare, cycles are still consumed to transform CPython's sign+magnitude bigint format to native 2's complement for native machine code to compare directly.
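A quick way to see the prepass at work, if I understand it correctly: a single element of a different type forces every comparison back onto the generic path.

```python
import random
import timeit

n = 10**6
ints = [random.randrange(n) for _ in range(n)]
mixed = ints[:-1] + [float(ints[-1])]   # one float defeats the homogeneity check

print(timeit.timeit(lambda: sorted(ints), number=5))   # specialized int compare
print(timeit.timeit(lambda: sorted(mixed), number=5))  # generic object compare
```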
I expect this accounts for most of the observed "speedup" over timsort. Despite semi-heroic efforts to cut the cost, comparisons of native ints in CPython will always be significantly more expensive than code specialized to compare native ints directly, at least until (if ever) CPython learns how to "unbox" ints. The PyPy implementation of Python already does, and sorting lists of native machine ints runs about twice as fast under PyPy. It's not the sorting algorithm that matters here, it's the relative cost of machine-int compares, and almost certainly also the cache pressure of using many more bytes than necessary to store small ints in "boxed" format.
As mentioned elsewhere, JesseSort reminds me most of the old and simpler "patience sorting", with which it has a lot in common. Both keep a collection of sorted sublists (to be merged at the end), both use binary search to pick a collection member to which to add "the next" input, both excel at "low diversity" inputs, both (on average) end up with O(sqrt(n)) sublists on randomly ordered inputs, and neither is "naturally" stable.
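For anyone who hasn't seen it, patience sorting fits in a few lines (a bare-bones sketch, using heapq.merge as the final k-way merge):

```python
import bisect
import heapq

def patience_sort(a):
    piles, tops = [], []   # each pile is descending; tops kept ascending
    for x in a:
        # Deal x onto the leftmost pile whose top is >= x.
        i = bisect.bisect_left(tops, x)
        if i == len(piles):
            piles.append([x])
            tops.append(x)
        else:
            piles[i].append(x)
            tops[i] = x
    # Each pile reversed is ascending; merge them all.
    return list(heapq.merge(*(p[::-1] for p in piles)))

print(patience_sort([5, 1, 4, 1, 5, 9, 2, 6]))
```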
As I noted in listsort.txt at the time, while timsort gets a lot of value out of repeated elements ("low diversity"), it's just not the right approach to exploit that as much as possible. JesseSort and pdqsort (and, by extension, glidesort) will always do better on that. OTOH, timsort gets value out of many kinds of partial ordering, which "reveal themselves" dynamically via one run in a merge "winning" consistently.
Interesting approach, although I'm wondering why they didn't just use a double-ended queue instead of a linked list.
Also, I get the urge to name the algorithm after yourself, since Timsort is named after its creator as well, but if you call your data structure "rainbow" then my first association would be the rainbow flag, since flag sorts also have a tradition in sorting algorithm names. [0][1]

[0] https://en.m.wikipedia.org/wiki/Dutch_national_flag_problem

[1] https://en.m.wikipedia.org/wiki/American_flag_sort
It's true that awful deque implementations are often built from a linked list. If you're taught big-O notation and haven't noticed how modern computers work, you might even do that today.
I would expect that they're thinking of something more like Rust's VecDeque: a conventional growable array (Rust's Vec, Java's ArrayList, C++'s std::vector) used internally, with head and tail tracking to form a circular buffer, so as to deliver the deque API (fast push to head or tail, fast pop from head or tail).
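A minimal Python rendition of that idea, for the curious (no error handling, growth strategy simplified):

```python
class RingDeque:
    # One growable array + head index + length; both ends are O(1) amortized.
    def __init__(self, cap=8):
        self.buf, self.head, self.n = [None] * cap, 0, 0

    def _grow(self):
        old, cap = self.buf, len(self.buf)
        self.buf = [None] * (2 * cap)
        for i in range(self.n):                  # unwrap into the new buffer
            self.buf[i] = old[(self.head + i) % cap]
        self.head = 0

    def push_back(self, x):
        if self.n == len(self.buf): self._grow()
        self.buf[(self.head + self.n) % len(self.buf)] = x
        self.n += 1

    def push_front(self, x):
        if self.n == len(self.buf): self._grow()
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = x
        self.n += 1

    def pop_front(self):
        x, self.buf[self.head] = self.buf[self.head], None
        self.head = (self.head + 1) % len(self.buf)
        self.n -= 1
        return x

    def pop_back(self):
        i = (self.head + self.n - 1) % len(self.buf)
        x, self.buf[i] = self.buf[i], None
        self.n -= 1
        return x
```

Real implementations keep the capacity a power of two so the modulo becomes a bit-mask.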
And what about pdqsort, glidesort, driftsort, and ipnsort? New research is always cool, but there's plenty of other research in this space to compare this against.
I am a bit concerned that the paper's primary distribution channel is ResearchGate. I am somewhat biased against taking it seriously because of this.
I have mixed feelings about the naming. Isn't it kind of off-putting that he named it after himself? Usually it's other people who name your algorithm after you. At least that's how it goes in science and engineering (Fourier didn't call it the "Fourier transform", Laplace didn't call it the "Laplace transform", Kalman didn't name it the "Kalman filter", etc.).
Perhaps ;-) I'm the "Tim" in "timsort". The name was an inside joke. I'm not a self-promoter, and never have been. As the so-called "Zen of Python" author, I thought it would be funny to pick a name I'd never pick ;-)
CPython had a unique (in my experience) combination of cheap data movement (only pointer swaps) and very expensive comparisons. That's what I was aiming at. I never publicized it, never wrote about it outside the CPython repo, and never thought I'd hear about it again.
Of course I'm pleased it found wider use, but that took me wholly by surprise. If I had it to do over again, I would probably have named it, say, gallopsort.
If this new sort catches on, Jesse should rename it before it's too late ;-)
I know it is not possible in our universe, but on a lighter note: in some parallel universe out there, someone has made a sorting algorithm that runs in O(1). I do wonder, though, whether a quantum computer could simultaneously sort all the elements in O(1).
If you back the deque with two arrays, then you can't do O(1) inserts in the middle (like what's going on here).
https://en.wikipedia.org/wiki/Donald_Shell
https://en.m.wikipedia.org/wiki/Quantum_sort