https://danlark.org/2020/11/11/miniselect-practical-and-gene...
(it is named this way because it was an archival effort to collect the information before the dislike feature was removed)
It can be used to find things like the most controversial videos, the top videos with a description in a certain language, etc.
As someone who knows a thing or two about sorting... bullshit. No new algorithms were uncovered, and the work here did not lead to the claimed improvements.
They found a sequence of assembly that saves... one MOV. That's it. And it's not even novel; it's simply an unrolled insertion sort on three elements. That their patch for libc++ is 70% faster for small inputs is only because the library previously lacked an efficient implementation with a *branchless* sorting network. Those are not novel either; they already exist, made by humans.
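For illustration, here is a minimal sketch in C++ of a branchless sorting network for three elements, the well-known technique referred to above (this is not the libc++ or AlphaDev code; the min/max compare-exchanges typically compile to conditional moves):

    #include <algorithm>

    // Sorting network for n = 3: compare/exchange (a,b), (b,c), (a,b).
    // Writing each compare/exchange with min/max lets the compiler emit
    // conditional moves, so no branch depends on the data being sorted.
    template <typename T>
    void sort3(T& a, T& b, T& c) {
        T t = std::min(a, b); b = std::max(a, b); a = t;
        t = std::min(b, c);   c = std::max(b, c); b = t;
        t = std::min(a, b);   b = std::max(a, b); a = t;
    }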
> By open sourcing our new sorting algorithms in the main C++ library, millions of developers and companies around the world now use it on AI applications across industries from cloud computing and online shopping to supply chain management. This is the first change to this part of the sorting library in over a decade and the first time an algorithm designed through reinforcement learning has been added to this library. We see this as an important stepping stone for using AI to optimise the world’s code, one algorithm at a time.
I'm happy for the researchers that the reinforcement learning approach worked and that it produced good code. But the paper and surrounding press release are self-aggrandizing in both their results and their impact. That this is the first change to 'this part' of the sorting routine in a decade is also completely cherry-picked. For example, I would say that my 2014 report (with an ignored patch) of the fact that the libc++ sorting routine was QUADRATIC (https://bugs.llvm.org/show_bug.cgi?id=20837), finally fixed in late 2021 (https://reviews.llvm.org/D113413), is quite the notable change. If anything, it shows that there wasn't particularly active development on the libc++ sorting routine over the past decade.
I was one of the people who expertly reviewed what was done in both sorting and hashing. Overall it's more about assembly, finding missed compiler optimizations, and balancing correctness against distribution (for hashing in particular).
It was not revolutionary in the sense that it did not find completely new approaches; rather, it converged to something incomprehensible to humans but relatively good for performance, which proves the point that optimal programs are very inhuman.
Note that removing instructions from the sorting code does not always lead to better performance: instructions can run in parallel, so the effect can be less pronounced. Benchmarks can lie, and the compiler could do something different when recompiling the sort3 function that was changed.
For hashing it was even funnier: very small strings, up to 64 bits, already used three instructions, roughly add a constant -> 64x64 multiply -> xor of the upper and lower halves. For bigger inputs the question becomes more complicated, which is why the 9-16 byte range was a better spot: there the search simplified the code from two multiplications to just one plus a rotation. Distribution on real workloads was good; it almost passed SMHasher, and we decided it was good enough to try out in prod. We did not roll it back, as you can see from abseil :)
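As a rough sketch of the small-input mix described above (inputs fitting in 64 bits), assuming an illustrative multiplier constant rather than the production one in abseil:

    #include <cstdint>

    // Sketch of "add constant -> 64x64 multiply -> xor upper/lower".
    // kMul is an assumed example constant, not the abseil value.
    // Uses the GCC/Clang __uint128_t extension for the full product.
    inline uint64_t mix_small(uint64_t x) {
        const uint64_t kMul = 0x9ddfea08eb382d69ULL;
        __uint128_t m = (static_cast<__uint128_t>(x) + kMul) * kMul;
        return static_cast<uint64_t>(m) ^ static_cast<uint64_t>(m >> 64);
    }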
But even given all that, it was fascinating to watch how this system searched and was able to find that particular programs could be further simplified. Kudos to everyone involved; it's a great incremental change that can bring more results in the future.
You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...
You can still use groups of 3 or 4, but with slight modifications.
https://arxiv.org/abs/1409.3600
For example, with groups of 4 elements, it's advised to take the lower median for groups in the first half and the upper median for groups in the second half; then the complexity stays linear.
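A minimal sketch of that pivot-selection step, assuming the usual median-of-medians skeleton (std::nth_element stands in for the recursive call, and the function name is illustrative):

    #include <algorithm>
    #include <vector>

    // Pivot selection with groups of 4 (per arXiv:1409.3600): lower
    // median (index 1 in the sorted group) for groups in the first
    // half, upper median (index 2) for groups in the second half.
    // Assumes a non-empty input.
    int pivot_of_medians4(std::vector<int> a) {
        const size_t n = a.size();
        if (n < 4) {  // tiny input: just take the median directly
            std::sort(a.begin(), a.end());
            return a[n / 2];
        }
        std::vector<int> medians;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            std::sort(a.begin() + i, a.begin() + i + 4);
            // Group starts before n/2 -> lower median, else upper.
            medians.push_back(a[i + (i < n / 2 ? 1 : 2)]);
        }
        // The real algorithm recurses on the medians; nth_element
        // stands in for that recursive call in this sketch.
        std::nth_element(medians.begin(),
                         medians.begin() + medians.size() / 2,
                         medians.end());
        return medians[medians.size() / 2];
    }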