Readit News logoReadit News
ok_dad · 3 months ago
> The regex methods are unbelievably fast regardless of string length.

I immediately pinned regex as the winner, and here is why:

In Python, you can almost always count on specialized functionality in the stdlib to be faster than any Python code you write, because most of it has probably been optimized in CPython by now.

Second, I have to ask, why would someone think that regular expressions, a tool specifically designed to search strings, would be slower than any other tool? Of course it's going to be the fastest! (At least in a language like Python.)

Kranar · 3 months ago
But it's not the fastest, it's actually incredibly slow and in general regexs are very slow. The key insight is that string processing in Python is very slow, and that regex here outperforms everything else because it's the only approach that is implemented in C.

With that insight it should follow that using another implementation in C should outperform even the regex, and indeed the following simple Python method that this article for whatever reason ignored vastly outperforms everything:

    def contains_vowel_find(s):
        for c in "aeiouAEIOU":
            if s.find(c) != -1:
                return True
        return False
That's because s.find(c) is implemented in C.

In my benchmark this approach is 10 times faster than using a regex:

https://gist.github.com/kranar/24323e81ea1c34fb56aff621f6c09...

a_e_k · 3 months ago
Playing around, I found that the generator approach was competitive with this if you permuted the loop nest, and often even slightly faster (at the 1000 character length):

    def any_gen_perm(s):
        return any(c in s for c in "aeiouAEIOU")
I think the crux is that you want the inner loop inside the fast C-implemented primitive to be the one iterating over the longer string, and to leave the outer loop in Python to iterate over the shorter string. With both my version and yours, the Python loop only iterates and calls into the C search loop 10 times, so there's less interpreter overhead.

I suspect that permuting the loop nest in similar variations will also see a good speed up, and indeed trying just now:

    def loop_in_perm(s):
        for c in "aeiouAEIOU":
            if c in s:
                return True
        return False
seems to give the fastest result yet. (Around twice as fast as the permuted generator expression and your find implementation on my machine, with 100 and 1000 character strings.)

jonstewart · 3 months ago
The more you learn about regexes, the more you learn that you can’t make general statements about their performance. Performance is contingent upon the engine, its algorithms, its implementation, the patterns, and the input.
tylerhou · 3 months ago
> in general regexs are very slow

I really don't think this is true. If you assume that the string is ASCII, a well-optimized regex for a pattern of this type should be a tight loop over a few instructions that loads the next state from a small array. The small array should fit completely in cache as well. This is basically branchless. I expect that you could process 1 character per cycle on modern CPUs.

If the string is short (~100 characters or less, guessing), I expect this implementation to outperform the find() implementation by far as find() almost certainly will incur at least one more branch mispredict than the regex. For longer strings, it depends on the data, as the branchless regex implementation will scan through the whole string, so find() will be faster if there is an vowel early on in the string. Find still might be faster even if there are no vowels; the exact time in that case depends on microarchitecture.

For non-ASCII, things are a bit trickier, but one can construct a state machine that is not too much larger.

azhenley · 3 months ago
Very nice find (pun intended). I added an update at the end with your solution and a link to this comment.
ok_dad · 3 months ago
I've always found regexes to do a string search faster than other methods, but there is always more to learn! Thanks for the lesson today.
arp242 · 3 months ago
> Second, I have to ask, why would someone think that regular expressions, a tool specifically designed to search strings, would be slower than any other tool? Of course it's going to be the fastest! (At least in a language like Python.)

I know that in Go doing string operations "manually" is almost always faster than regexps. In a quick check, about 10 times faster for this case (~120ns vs. ~1150ns for 100 chars where the last is a vowel).

Of course Python is not Go, but I wouldn't actually expect simple loops in Python to be that much slower – going from "10x faster" to "2x slower" is quite a jump.

Perhaps "yeah duh obvious" if you're familiar with Python and its performance characteristics, but many people aren't. Or at least I'm not. Based on my background, I wouldn't automatically expect it.

silisili · 3 months ago
It's worth noting here that Go has a notoriously slow regexp implementation.
high_na_euv · 3 months ago
>Second, I have to ask, why would someone think that regular expressions, a tool specifically designed to search strings, would be slower than any other tool? Of course it's going to be the fastest! (At least in a language like Python.)

Flexibly, search string flexibly.

Why you think that we cannot achieve faster search in non flexible fashion?

glangdale · 3 months ago
To paraphrase Arthur Dent, "this must be a strange new usage of the word 'fastest' with which I am not previously familiar".

The fastest way to detect a vowel in a string on any reasonable architecture (Intel or AMD equipped with SIMD of some kind) is using 3-4 instructions which will process 16/32/64 (depends on SIMD length) bytes at once. Obviously access to these will require using a Python library that exposes SIMD.

Leaving SIMD aside, a flat byte array of size 256 will outperform a bitmap since it's always faster to look up bytes in an array than bits in a bitmap, and the size is trivial.

windward · 2 months ago
I don't think the size is trivial, it's 4 times the cache length on x64
glangdale · 2 months ago
I'm not sure what you mean here. A single cache line is 64B, and this table would thus occupy 4 cache lines, but typical x86 cache sizes are 32K or for more recent cores 48K. Whether consuming 1/512th or 1/768th of your level 1 cache is excessive is a value judgement, but most people wouldn't think so.
Rendello · 3 months ago
A few years ago I came across the article "SIMD-friendly algorithms for substring searching". The "Generic SIMD" section into and example is small and quite understandable. I modified it to implement an LZ77 sliding window in Zig.*

http://0x80.pl/notesen/2016-11-28-simd-strfind.html

* Or rather, I tried my best. I burnt out on that project because I kept jumping back and forth between making a proper DEFLATE implementation or something bespoke. The SIMD stuff was really tough and once I got it "working", I figured I got all I needed from the project and let it go.

https://github.com/rendello/compressor/blob/dev/src/str_matc...

DannyB2 · 3 months ago
Assume use of 8 bit characters. Declare a constant 256 entry array pre-filled with all False except for the five (or six) vowel characters. This is baked into the code and not initialized at runtime.

Now for each character c in the input string, simply do an array index and see if it is true (a vowel) or not. This avoids either five conditionals, or a loop over the string 'aeiou'. The vowel test is constant time regardless of the character value.

masklinn · 3 months ago
In Python this is going to be slow, as you’ll have a ton of Python code in your hot loop.

It’s also going to be even more broken than TFA if any non-ascii character is present in the string.

absurdo · 3 months ago
Waiting for a few follow-ups:

“What every programmer should know about vowels”

“You and your vowels”

“What the vowels have wrought”

“We’re hiring! Vowels (YC 26)”

And last but not least:

“Things I Won’t Work With: Vowels”

haiku2077 · 3 months ago
"Falsehoods programmers believe about vowels"

Deleted Comment

Deleted Comment

layer8 · 3 months ago
This doesn’t even cover strings like “fly” and “naïve”. ;)
cenamus · 3 months ago
Yeah, should be called "how to detect aeiou in a string".
zerocrates · 3 months ago
well... it'll work for naïve. twice over, even
s09dfhks · 3 months ago
What am I missing about fly
ninkendo · 3 months ago
The article thinks aeiou are all the vowels, and forgets about y.