danlark (u/danlark) - Readit News

danlark commented on My Favorite Algorithm: Linear Time Median Finding (2018) rcoh.me/posts/linear-time... · Posted by u/skanderbm

SkiFire13 · a year ago

I wonder what's the reason of picking groups of 5 elements instead of 2 or 8.

danlark · a year ago

3 and 4 elements will fail to prove the complexity is linear

You still can do 3 or 4 but with slight modifications

https://arxiv.org/abs/1409.3600

For example, for 4 elements, it's advised to take lower median for the first half and upper median for the second half. Then the complexity will be linear

danlark commented on My Favorite Algorithm: Linear Time Median Finding (2018) rcoh.me/posts/linear-time... · Posted by u/skanderbm

danlark · a year ago

Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-gene...

danlark commented on How big is YouTube? ethanzuckerman.com/2023/1... · Posted by u/MBCook

zX41ZdbW · 2 years ago

I recommend checking the dataset of "YouTube dislikes": https://clickhouse.com/docs/en/getting-started/example-datas...

(it is named this way because it was an archival effort to collect the information before the dislike feature was removed)

It can be used to find things like the most controversial videos; top videos with a description in a certain language, etc.

danlark · 2 years ago

Disclaimer: the author of the comment is the CEO of ClickHouse

danlark commented on Deepmind Alphadev: Faster sorting algorithms discovered using deep RL nature.com/articles/s4158... · Posted by u/anjneymidha

orlp · 2 years ago

> AlphaDev uncovered new sorting algorithms that led to improvements in the LLVM libc++ sorting library that were up to 70% faster for shorter sequences and about 1.7% faster for sequences exceeding 250,000 elements.

As someone that knows a thing or two about sorting... bullshit. No new algorithms were uncovered, and the work here did not lead to the claimed improvements.

They found a sequence of assembly that saves... one MOV. That's it. And it's not even novel, it's simply an unrolled insertion sort on three elements. That their patch for libc++ is 70% faster for small inputs is only due to the library not having an efficient implementation with a *branchless* sorting network beforehand. Those are not novel either, they already exist, made by humans.

> By open sourcing our new sorting algorithms in the main C++ library, millions of developers and companies around the world now use it on AI applications across industries from cloud computing and online shopping to supply chain management. This is the first change to this part of the sorting library in over a decade and the first time an algorithm designed through reinforcement learning has been added to this library. We see this as an important stepping stone for using AI to optimise the world’s code, one algorithm at a time.

I'm happy for the researchers that the reinforcement learning approach worked, and that it gave good code. But the paper and surrounding press release is self-aggrandizing in both its results and impact. That this is the first change to 'this part' of the sorting routine in a decade is also just completely cherry-picked. For example, I would say that my 2014 report and (ignored patch of) the fact that the libc++ sorting routine was QUADRATIC (https://bugs.llvm.org/show_bug.cgi?id=20837) finally being fixed late 2021 https://reviews.llvm.org/D113413 is quite the notable change. If anything it shows that there wasn't a particularly active development schedule on the libc++ sorting routine the past decade.

danlark · 2 years ago

I agree on the sorting front. Removing one cmov is not likely to improve much.

danlark commented on Deepmind Alphadev: Faster sorting algorithms discovered using deep RL nature.com/articles/s4158... · Posted by u/anjneymidha

hajile · 2 years ago

To what extent is this simply working around the weirdness of x86? Do these improvements apply to something like MIPS, ARM64, or RISC-V that have inherently simpler ISAs?

danlark · 2 years ago

In this particular case they were universal but in paper it's said the optimizations were done on x86. One of the ideas was to use LLVM IR but intuition for optimizer over optimizer was unlikely to work properly.

danlark commented on Deepmind Alphadev: Faster sorting algorithms discovered using deep RL nature.com/articles/s4158... · Posted by u/anjneymidha

danlark · 2 years ago

You can see hashing optimizations as well https://www.deepmind.com/blog/alphadev-discovers-faster-sort..., https://github.com/abseil/abseil-cpp/commit/74eee2aff683cc7d...

I was one of the members who reviewed expertly what has been done both in sorting and hashing. Overall it's more about assembly, finding missed compiler optimizations and balancing between correctness and distribution (in hashing in particular).

It was not revolutionary in a sense it hasn't found completely new approaches but converged to something incomprehensible for humans but relatively good for performance which proves the point that optimal programs are very inhuman.

Note that for instructions in sorting, removing them does not always lead to better performance, for example, instructions can run in parallel and the effect can be less profound. Benchmarks can lie and compiler could do something differently when recompiling the sort3 function which was changed.

For hashing it was even funnier, very small strings up to 64 bit already used 3 instructions like add some constant -> multiply 64x64 -> xor upper/lower. For bigger ones the question becomes more complicated, that's why 9-16 was a better spot and it simplified from 2 multiplications to just one and a rotation. Distribution on real workloads was good, it almost passed smhasher and we decided it was good enough to try out in prod. We did not rollback as you can see from abseil :)

But even given all that, it was fascinating to watch how this system was searching and was able to find particular programs can be further simplified. Kudos to everyone involved, it's a great incremental change that can bring more results in the future.

danlark commented on “csinc”, the AArch64 instruction you didn’t know you wanted danlark.org/2023/06/06/cs... · Posted by u/jandeboevrie

jart · 2 years ago

I discovered a really cool ARM64 trick today. One thing about x86 that I've found useful on so many occasions is the PCMPEQB + PMOVMSKB + BSF trick that lets me scan the bytes of a string 10x faster. I couldn't find any information on Google for doing PMOVMSKB with ARM, so I've been studying ARM's "Optimized Routines" codebase where I stumbled upon the answer in their strnlen() implementation. It turns out the trick is to use `shrn dst.8b, src.8h, 4` which turns a 128-bit mask into a 64-bit mask. You can then get the string offset index with fmov, rbit, clz and finally shift by 2.

danlark · 2 years ago

I am the author of this trick as well

You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...