userbinator · 6 years ago
GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.

I wonder how many times that algorithm or its variants have been rediscovered, given that not everyone who writes code reads CS papers. Several decades ago, working in x86 asm, I came up with that "match the last element first and use an indexed skip table" idea when I had to do substring matching. Much later, when I explained to someone how I had made that functionality so fast, he told me I had basically rediscovered the https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Ho... .

As it turns out, BMH is more common in practice (even when the author doesn't realise it) than BM, and on average performs very well:

https://old.blog.phusion.nl/2010/12/06/efficient-substring-s...

I suspect at least some standard libraries' memmem() implementations use BMH, since unlike BM it doesn't require dynamic allocation or other complexity, and is otherwise a drop-in replacement for the naive algorithm.
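For the curious, the core of the Horspool variant fits in a few lines. This is an illustrative Python sketch of the algorithm, not any particular libc's implementation:

```python
def bmh_find(haystack: bytes, needle: bytes) -> int:
    """Boyer-Moore-Horspool: return the index of needle in haystack, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Skip table: for each byte in the needle (except the last), how far the
    # window can shift when that byte is aligned with the needle's last slot.
    skip = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
    i = m - 1
    while i < n:
        # Compare backwards, starting from the last character of the window.
        j, k = i, m - 1
        while k >= 0 and haystack[j] == needle[k]:
            j -= 1
            k -= 1
        if k < 0:
            return j + 1
        # Bytes not in the needle allow skipping the full needle length.
        i += skip.get(haystack[i], m)
    return -1
```

Note the "match the last element first" and "indexed skip table" pieces are exactly the two ideas described above; BMH drops BM's second (good-suffix) table, which is why no dynamic allocation proportional to the needle is needed beyond the 256-entry table.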

burntsushi · 6 years ago
As the author of ripgrep, I'd like to add some more notes to your comment:

* Some libc implementations (at least, glibc and musl) use Two-Way (linked in a sibling comment). In particular, it also skips parts of the input like BM, uses constant space (no allocation needed) and formally has O(n) time.

* The advice in the OP is somewhat outdated. The fact that BM skips some of the input isn't really too important on modern hardware. What matters about BM is its skip loop. Most implementations use `memchr` on the last byte of the needle. If that last byte happens to occur infrequently in the haystack, then the vast majority of your search time will be in an extremely well optimized loop that uses SIMD. For that reason, BM is somewhat of a red herring, and can be easily beaten through heuristics that choose better bytes to feed into its skip loop. (For example, assume a frequency distribution of bytes and pick the rarest one in your needle.)
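That rarest-byte heuristic can be sketched in a few lines. This is a toy illustration only: the frequency table below is invented for the example, and `bytes.find` stands in for a SIMD-accelerated `memchr`; ripgrep's actual machinery is more involved:

```python
# Invented frequency ranks for illustration: pretend these bytes are very
# common in typical haystacks, and everything else is rare.
ASSUMED_FREQ = {b: 100 for b in b" etaoins"}

def rarest_byte_search(haystack: bytes, needle: bytes) -> int:
    """Return the index of needle in haystack, or -1."""
    if not needle:
        return 0
    # Offset of the needle byte we guess occurs least often in the haystack.
    offset = min(range(len(needle)),
                 key=lambda i: ASSUMED_FREQ.get(needle[i], 0))
    rare = needle[offset]
    pos = haystack.find(rare, offset)  # stand-in for a memchr/SIMD scan
    while pos != -1:
        start = pos - offset
        # Candidate position found; verify the whole needle around it.
        if haystack[start:start + len(needle)] == needle:
            return start
        pos = haystack.find(rare, pos + 1)
    return -1
```

For a needle like `b"the"`, this scans for the (assumed-rare) `h` rather than the very common trailing `e`, so most of the time is spent in the fast single-byte scan rather than in failed verifications.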

bberenberg · 6 years ago
OT but seriously thanks for rg, it makes massive log file analysis so much faster for me.
jjice · 6 years ago
Very interesting! Ripgrep is my favorite new wave Unix utility by far. Modern syntax, incredible speeds, and a ridiculous amount of options that come in handy quite a bit. Great work!
zenhack · 6 years ago
It's always amazing to me how much performance work the basic GNU tools have seen in general. Grep makes some sense, but even yes(1) is fairly carefully tuned; in some cases it actually strikes me as kind of excessive, and not clearly worth the readability cost.
smabie · 6 years ago
The GNU people are absolutely obsessed with correctness and performance. It's kind of annoying actually. I assume it has something to do with RMS' ties to MIT and Lisp machines (an example of staggering complexity), see Richard Gabriel's "Worse is Better" essay.

I tend to fall in the NJ school myself, though I do think Emacs is something special. And though it's derided as slow, Emacs has a ton of super crazy optimizations, especially around rendering.

Pretty much every GNU project written in C is totally unreadable and hopelessly baroque: gcc, glibc, coreutils, etc. For fun, compare some OpenBSD Unix utility implementations with their GNU counterparts. The OpenBSD tools' source code is a joy to read and represents the pinnacle of elegant C. The GNU versions are ugly as sin, but functionally superior in pretty much every way. I don't really agree with how they write code, but you've got to respect the almost OCD attention to detail: no edge case goes unhandled.

Worlds apart from the terse and cavalier style of traditional Unix hackers.

throwaway_pdp09 · 6 years ago
> The GNU people are absolutely obsessed with correctness and performance. It's kind of annoying actually

I can't accept this as a criticism without some elaboration.

> The GNU versions are ugly as sin, but functionally superior in pretty much every way. I don't really agree with how they write code, but you got to respect the almost OCD attention to detail: no edge case goes unhandled.

It sounds like you want the GNU people to stop making software that is "functionally superior in pretty much every way".

I don't get what you're saying.

sfoley · 6 years ago
RE yes’s performance, this old reddit post has a good breakdown.

https://old.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes...

milesvp · 6 years ago
Oh, man, I did some firmware writing to an sd card this year. I can’t tell you just how important that 4096 boundary is for sd card performance. Tuning the buffer size to the right multiple of 4096 was important too (especially with only 48k to work with).
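Rounding a buffer down to a multiple of the write granularity is a one-liner; a hypothetical sketch, with 4096 as the assumed page size and the 48 KiB mentioned above standing in for the RAM budget:

```python
PAGE = 4096          # assumed SD card write granularity for this example
BUDGET = 48 * 1024   # assumed RAM available for the buffer

# Largest buffer that fits the budget and is a whole multiple of PAGE,
# so every flush lands exactly on a page boundary.
buf_size = (BUDGET // PAGE) * PAGE
```

With these numbers the budget happens to divide evenly, but the same expression handles awkward budgets (e.g. 50 KiB would round down to 48 KiB).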
glouwbug · 6 years ago
Yup, this article is an oldie but a goodie. I like seeing it reposted because it shows a terrific programmer speaking about their work with great passion, and it reminds me how to answer questions carefully without a condescending tone (be that tone accidental or not, it's very easy to be off-putting when you are "idolized").
rektide · 6 years ago
/me laughs in ripgrep
lcdoutlet · 6 years ago
I am a big fan of ripgrep. I think the tool is great for quick simple searches. On larger data sets with sufficiently complicated regex patterns the performance gap is small. I have also encountered situations using rg on medium-to-large datasets where the files may contain Unicode, regular expressions, random garbage... and rg failed. I don't recall a situation like this where the error handling in grep didn't recover and move on to the next file.

Here is an example showing the performance gap to be small using (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on a ramdisk. I ran this 50 times and the results are similar.

    $ du -sch .
    72G    .
    72G    total

    $ time ls | xargs -i -P24 rg -c "<.?>" {}
    real    0m6.841s
    user    2m8.766s
    sys     0m5.637s

    $ time ls | xargs -i -P24 grep -Ec "<.?>" {}
    real    0m6.168s
    user    1m35.058s
    sys     0m24.177s

burntsushi · 6 years ago
> I have also encountered situations using rg on med-large datasets where the files may contain Unicode, regular expressions, random garbage... and rg failed.

I would really love to have a bug report if you could find a way to reproduce this. ripgrep should certainly not fail regardless of the input. I've actually spent a lot of time trying to make sure that ripgrep's error handling and so forth generally match GNU grep's behavior.

> Here is an example showing the performance gap to be small using (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on a ramdisk. I ran this 50 times and the results are similar.

Could you show me where to get `enwiki-20200401-pages-articles-multistream.xml`? I'd love to run the benchmark myself, although I don't have a ramdisk quite that big. :P I could shrink the data set though a bit.

If I were to guess though, yeah, I'd generally expect GNU grep and ripgrep to perform similarly here. Speculating:

If the match count of those runs is very high, then total search time is likely dominated by per-match overhead. Both ripgrep and GNU grep should be pretty close there, but there's not too much room to improve drastically.

If the match count of those runs isn't high or is very small, then GNU grep and ripgrep will use a search algorithm that is virtually identical in its performance characteristics. They will both probably look for occurrences of `>` using `memchr`, and then run a full search on matching lines. It's hard to get much faster than that, so again, not a lot of room for differentiating yourself.
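That prefilter strategy (scan for a rare literal byte with memchr, then verify the full pattern only on the lines containing it) can be sketched as follows. This is a simplified illustration, not either tool's actual code; the function name and structure are invented, and `bytes.find` stands in for memchr:

```python
import re

def grep_lines(haystack: bytes, pattern: bytes, literal: bytes) -> list:
    """Collect lines matching pattern, prefiltering on a literal byte.

    `literal` must be a byte that every match of `pattern` contains
    (here: b">" for the pattern "<.?>").
    """
    regex = re.compile(pattern)
    out = []
    pos = haystack.find(literal)  # stand-in for a memchr/SIMD scan
    while pos != -1:
        # Expand the hit to its enclosing line and verify with the regex.
        start = haystack.rfind(b"\n", 0, pos) + 1
        end = haystack.find(b"\n", pos)
        if end == -1:
            end = len(haystack)
        line = haystack[start:end]
        if regex.search(line):
            out.append(line)
        # Resume after this line so a line is reported at most once.
        pos = haystack.find(literal, end)
    return out
```

The point is the one made above: almost all the work happens in the single-byte scan, and the (comparatively slow) regex engine only ever sees lines that already contain the rare byte.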

If I can get my hands on the corpus, then I'm pretty sure I could give you (non-pathological) patterns on which searches would show very different performance characteristics between GNU grep and ripgrep.

statquontrarian · 6 years ago
> Since GNU grep passed out of my hands, it appears that use of mmap became non-default, but you can still get it via --mmap. [...] using --mmap can be worth a >20% speedup.

This option was ignored in March 2010 and later removed completely: https://lists.gnu.org/archive/html/grep-commit/2010-03/msg00...

shakna · 6 years ago
And removed for performance reasons, too.

> On modern systems, --mmap rarely if ever yields better performance.

and

> mmap is a bad idea for sequentially accessed file because it will cause a page fault for every read page. Just consider it a failed experiment, and ignore --mmap while accepting it for backwards compatibility.

burntsushi · 6 years ago
That may have been true back then, but it isn't true today. Or at least, it depends on your OS. On Linux, you can actually run this test fairly reliably with ripgrep. Pick a big file (hundreds of MB or a couple GB), make sure it's on a ramdisk (or in cache) and search for a pattern that doesn't exist:

    $ rg --mmap zqzqzqzq big-file
    $ rg --no-mmap zqzqzqzq big-file
The first is faster on my system by a sizeable margin.

There are cases where mmap can be slower. For example, when searching lots of small files simultaneously. The overhead of managing the memory maps in the OS ends up dominating.

EDIT: Memory maps also have the downside of being susceptible to SIGBUS if the file being searched is truncated. In this case, ripgrep aborts. (This is mentioned in the README, and you can avoid this in all scenarios by passing --no-mmap.)
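The two strategies can be compared directly with a small script. A rough sketch (timings will vary by OS, cache state, and file size, as noted above); the buffered version keeps an overlap so matches spanning chunk boundaries aren't lost:

```python
import mmap

def search_mmap(path: str, needle: bytes) -> int:
    # Map the whole file and scan the mapping directly.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m.find(needle)

def search_read(path: str, needle: bytes, bufsize: int = 1 << 20) -> int:
    # Incremental reads into an intermediate buffer, keeping the last
    # len(needle) - 1 bytes so boundary-spanning matches are found.
    overlap = len(needle) - 1
    with open(path, "rb") as f:
        offset = 0   # file position of the start of the current chunk
        tail = b""
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                return -1
            buf = tail + chunk
            pos = buf.find(needle)
            if pos != -1:
                return offset - len(tail) + pos
            tail = buf[-overlap:] if overlap else b""
            offset += len(chunk)
```

Wrapping each call in `time.perf_counter()` on a large cached file reproduces the comparison described above; which one wins depends on the OS and the file sizes involved.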

The_rationalist · 6 years ago
zests · 6 years ago
This part seemed very applicable to the thread at hand:

> It supports searching with either memory maps or by searching incrementally with an intermediate buffer. The former is better for single files and the latter is better for large directories. ripgrep chooses the best searching strategy for you automatically.

zquestz · 6 years ago
Going to just leave this here:

https://sift-tool.org/performance

klyrs · 6 years ago
[2010]

This is an inspiration.

jasonmp85 · 6 years ago
Conversely, your post is depressing.

You’re impressed someone wrote something worth reading after a decade?

klyrs · 6 years ago
One, your interpretation of my comment wildly misses the mark. The first paragraph indicates the title is (was) lacking a date. The second paragraph is how I feel about the article itself.

Two, I said "inspiration" and not "impressive." I aspire to be this diligent about performance. None of the actual tricks are terribly impressive to me, as I'm familiar with them. The communication itself, from a GNU author to the BSD community, clearly explaining how they can achieve similar performance, is an exemplary open source attitude. That's even more inspirational than the actual tricks, IMO