This is one of the features that Ruby cribbed directly from Perl. The Ruby documentation seems really bad; in particular, “interpolation mode” is grievously misleading.
Perl’s documentation is far clearer about the consequences:
> o  Compile pattern only once.
> […]
> PATTERN may contain variables, which will be
> interpolated every time the pattern search is
> evaluated, except for when the delimiter is a
> single quote. […] Perl will not recompile the
> pattern unless an interpolated variable that
> it contains changes. You can force Perl to skip
> the test and never recompile by adding a /o
> (which stands for "once") after the trailing
> delimiter. Once upon a time, Perl would recompile
> regular expressions unnecessarily, and this
> modifier was useful to tell it not to do so,
> in the interests of speed. But now, the only
> reasons to use /o are one of:
> [reasons]
> The bottom line is that using /o is almost
> never a good idea.
Before Perl automatically memoized the compilation of regexes with interpolation, back in the 1990s, its documentation said:
> However, mentioning /o constitutes a promise
> that you won't change the variables in the
> pattern. If you change them, Perl won't even
> notice.
Perl 4’s documentation is briefer. It says:
> PATTERN may contain references to scalar
> variables, which will be interpolated
> (and the pattern recompiled) every time the
> pattern search is evaluated. […] If you want
> such a pattern to be compiled only once, add
> an “o” after the trailing delimiter. This
> avoids expensive run-time recompilations, and
> is useful when the value you are interpolating
> won't change over the life of the script.
Nowadays, computers are super fast and super huge and super always available.
But even in the early 2000s, that was not the case. I know this b/c one of my first jobs out of college was to build a regex-based system to analyze spam emails sent to big providers (like Yahoo and Hotmail (Microsoft)).
IIRC, we only had ONE Dell 2600 box to do all of the above with literally millions of emails, and the same box did the parsing and the database storage (via MySQL).
You REALLY learn what makes regexes efficient versus not and I remember reading about "/o" and testing it to see if it made a difference for what I was doing. I don't remember either way what the results were but this was def a blast from the past.
This is a footgun. A language should strive not to add footguns. Every footgun you provide, somebody is going to blow their foot off with it, so that's a high price. If your language is popular it might be a lot of somebodies.
The opposite behaviour (we have a constant regular expression, we re-use it often but the tooling doesn't realise and so it's created each time we mention it) is not a footgun, it results in poor performance, and so you might want (especially in some managed languages) to just magically optimise this case, but if not you won't cause mysterious bugs. An expert, asked "Why is this slow?" can just fix it - you have to supply basic tools for that, but this flag is not a sensible tool.
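As a rough sketch of the kind of "basic tool" that fixes the slow case without the stale-value hazard: cache compiled patterns keyed by the interpolated string. The names `WORD_PATTERNS` and `contains_word?` are made up for illustration, and this is only one possible approach, not something the Ruby docs prescribe.

```ruby
# Hypothetical cache: one compiled Regexp per distinct word.
# A new word builds a new Regexp, so nothing stale is ever reused.
WORD_PATTERNS = Hash.new do |cache, word|
  cache[word] = /\b#{Regexp.escape(word)}\b/
end

# Hypothetical helper that looks up (or builds) the pattern for `word`.
def contains_word?(word, text)
  WORD_PATTERNS[word].match?(text)
end

cat_hit  = contains_word?("cat", "a cat sat")
dog_miss = contains_word?("dog", "a cat sat")
reused   = WORD_PATTERNS["cat"].equal?(WORD_PATTERNS["cat"])
```

Unlike `/o`, the cache key is the value being interpolated, so changing the value changes which Regexp is used.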
Is it really though? There are tons of characters you can add to a regex whose impact on the potential matches is difficult, if not impossible, to mentally comprehend. That's why you need 100 test cases for every 10 characters you write in a regex. Regex itself could all be a footgun by this standard. No one in the history of no one has ever thought "why don't I just add a random character to my regex I don't need or understand". That's just bogeyman-level irrational fear if you think this has any bearing on the ease of use of Ruby.
Regexes are not fundamentally hard. People make regexes hard by trying to parse things by sight rather than finding a spec. If you have a spec, and it can be parsed by a regular expression, they are pretty damn rote to implement.
If you aren’t working from a specified input grammar, the task is going to be borderline impossible no matter the tool and you’re going to have a bad time. If you aren’t working with a regular grammar, this is the wrong tool for the job and again you’re going to have a bad time.
A hint: if you find yourself using `.`, you are probably shooting yourself in the foot.
Ruby is a well-sharpened knife. Not everyone should be given a sharp knife though, especially children. And not all jobs need a sharp knife, like buttering toast. So I think it’s good for dull knives to exist as part of your tool belt. If we can only choose one language though, I’d rather it be a nimble, sharp one.
One of my favorites was Python’s datetime.time() object evaluating to True for every value except exact midnight, which is the sort of thing that makes fine sense when you think about the underlying implementation but is absolutely going to take a toe off of someone.
My favorite part about that one was it got to go through the full feature deprecation cycle before removal because several people argued in the bug thread about it that they were relying on that behavior in their systems.
In the 1990s, with the processing power of the time, /o was a reasonable compromise. The language later evolved to do the smart thing you describe, but you can't just remove features. A warning would be in order though.
In the spirit of "what's old is new again," PowerShell has the same idea, done per function with "begin", "process", "end", and "clean" stanzas that allow setup, for-each-item, teardown, and "finally" behavior: https://learn.microsoft.com/en-us/powershell/module/microsof...
Oh, that’s an interesting take. I’ve long been looking for newer developments on Awk’s clause structure, and this seems like an interesting take (though I’m unclear on whether I can have multiple begin/end clauses, which are the best thing about Awk’s version). It also finally connects this idea to something else in my mind—specifically advice[1] and CLOS’s :before/:after/:around methods[2]. (I guess Go’s defer also counts?)
This isn't the same problem, because this is about whether the regex is instantiated each time the code around the regex is executed, or only the first time and cached for subsequent executions. The same could in theory happen with closures, but I haven't ever seen programming-language semantics where, for example, a function containing the definition of a closure that depends on an argument of that outer function, would use the argument value of the first invocation of the function for all subsequent invocations of the function.
For example, when you have
fn f x = (y -> x + y)
then a sequence of invocations of f
f 1 3
f 2 6
will yield 4 and 8 respectively, but never will the second invocation yield 7 due to reusing the value of x from the first invocation. However, that is precisely what happens in the article's regex example, because the equivalent is for the closure value (y -> x + y) to be cached between invocations, so that the x retains the value of the first invocation of f — regardless of whether x is a reference by name or by value.
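The same contrast can be sketched in Ruby. Here `f` mirrors the pseudocode above, while `g` and `cached` are hypothetical names for a deliberately broken variant that creates the inner closure once and reuses it, which is roughly what caching the regexp amounts to.

```ruby
# f builds a fresh closure on every call, so each closure sees its own x.
f = ->(x) { ->(y) { x + y } }

a = f.(1).(3)   # 4
b = f.(2).(6)   # 8

# Hypothetical "cached closure": the inner lambda is created on the first
# call only and returned unchanged on every later call.
cached = nil
g = ->(x) { cached ||= ->(y) { x + y } }

c = g.(1).(3)   # 4
d = g.(2).(6)   # 7, because x is still 1 from the first call
```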
The parallel is apt, but regex /o is more like a closure that captures by value at declaration time rather than an ambiguity between capture strategies.
> Modifier o means that the first time a literal regexp with interpolations is encountered, the generated Regexp object is saved and used for all future evaluations of that literal regexp.
That is crystal clear to me. It means that on the next execution, the new values of the interpolation will be ignored; the regexp is now "baked" with the first ones.
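A minimal sketch of that "baking" in Ruby, with made-up method names (`matches_once?`, `matches_fresh?`) and inputs chosen only for illustration:

```ruby
# Hypothetical helper: the regexp literal uses the o modifier, so the
# interpolation happens only the first time this literal is evaluated.
def matches_once?(word, text)
  !!(text =~ /\b#{word}\b/o)
end

# Same literal without o: interpolated again on every call.
def matches_fresh?(word, text)
  !!(text =~ /\b#{word}\b/)
end

first  = matches_once?("cat", "a cat sat")    # builds and saves /\bcat\b/
second = matches_once?("dog", "a dog sat")    # still tests /\bcat\b/
fresh  = matches_fresh?("dog", "a dog sat")   # tests /\bdog\b/

puts first, second, fresh
```

The second call looks like it should match, but it is still testing the pattern built from the first call's `word`.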
Like this in C++:
void fun(int arg)
{
    static int once = arg;  // initialized only on the first call
}
If we call this as fun(42) the first time, once gets initialized to 42. If we then call fun(73), once stays 42.
There is a function in POSIX for once-only initializations: pthread_once. C++ compilers for multithreaded environments emit thread-safe code to do something similar to pthread_once to ensure that even if there are several concurrent first invocations of the function, the initialization happens once.
Perl’s current documentation, quoted above: https://perldoc.perl.org/perlop#Regexp-Quote-Like-Operators
Perl 4’s documentation, which is briefer: https://github.com/Perl/perl5/blob/perl-4.0.00/perl.man#L272...
PS If you want to read more about the spam work back then, I have a Twitter thread about it here: https://x.com/alexpotato/status/1208948480867127296
You are 100% correct on the "100 to 10" ratio on test cases.
PLUS, there are the ways in which broker files can break, due to:
- random carriage returns
- different date formats
- time zones
- etc etc
and regexes become both great and terrifying at the same time.
One pattern I did find useful:
regex + if/then
e.g. if (regex is true) then if (regex2 is true) then
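A sketch of that regex + if/then pattern in Ruby. The line format, the field names, and `parse_trade` are all invented for illustration; real broker files will differ.

```ruby
# Hypothetical record format: TRADE,YYYY-MM-DD,SYMBOL,QUANTITY
def parse_trade(line)
  # First regex: cheap check that this is a trade record at all.
  if line =~ /\ATRADE,/
    # Second regex: strict field layout, tolerating a stray carriage return.
    if (m = line.match(/\ATRADE,(\d{4}-\d{2}-\d{2}),([A-Z]+),(\d+)\r?\z/))
      return { date: m[1], symbol: m[2], quantity: m[3].to_i }
    end
  end
  nil
end

good     = parse_trade("TRADE,2020-01-02,ABC,100\r")
bad_date = parse_trade("TRADE,02/01/2020,ABC,100")
other    = parse_trade("NOTE,hello")
```

Splitting the check in two keeps each pattern small, and a line that passes the first test but fails the second is a useful signal that the file format has drifted.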
[1] https://en.wikipedia.org/wiki/Advice_(programming)
[2] https://gigamonkeys.com/book/object-reorientation-generic-fu...
>o - pretend to optimize your code, but actually introduce bugs
1: I still think of it as a relatively new change, but it's from 2013: <https://github.com/Perl/perl5/commit/7cf040c1f649790a4040aec...>