Seems to me this is more about the danger of passing anything derived from user input into the TEMPLATE side of a templating engine. Why in the world would you ever do that?!?
Obviously if you pass data into the variable side of the engine, you hardly have to worry about it at all, since it's already going into a place that was designed for handling arbitrary and possibly-hostile input and has been battle-tested at doing it correctly in production for many years. If you pass it into the template side, you're betting that you can be as good as dozens of templating-engine writers working for a decade at doing that, in exchange for, well, I can't really think of any possible legitimate advantage for doing that.
Ruby seems to be in multiline mode all the time?
EDIT: this is documented behavior for Ruby: https://docs.ruby-lang.org/en/master/Regexp.html#class-Regex... What other languages call multiline mode is the default; you're supposed to use \A and \Z instead. They do have an `/m`, but it only affects the interpretation of `.`
False. "$" does NOT mean end-of-string in Perl, Python, PHP, Ruby, Java, or .NET. In particular, a trailing newline (at least) is accepted in those languages.
A $ does mean end-of-string in Javascript, POSIX, Rust (if using its usual package), and Go.
I'm working with the OpenSSF best practices working group to create some guidance on this stuff. It's a very common misconception. Stay tuned.
If anyone knows of vulnerabilities caused by this, let me know.
`$` does mean end of input in Java, unless you explicitly ask for multiline mode. In the latter case it means `(?=$|\n)` if also in Unix-lines mode, and the horrible `(?=$|(?<!\r)\n|[\r\u0085\u2028\u2029])` otherwise.
I wrote a compiler from Java regex to JavaScript RegExp, in which you'll find that particular compilation scheme [1].
Edit: also quoting from [2]:
> By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
I'm not sure which version of PHP had the behavior you describe, or whether it misbehaves under more specific conditions, but preg_match() is one of the more commonly-used regex functions, all of which share the same engine. The behavior here seems to be "correct" for at least the last 5 years, for varying interpretations of "correct".
edit: https://3v4l.org/N4o8D suggests that the behavior here is identical for all versions of PHP from 4.3 to 8.3.6.
So, you are technically correct when you say PHP accepts a trailing newline, but that doesn't refute the comment in the context we are discussing.
Which all makes sense, as by default PHP doesn't operate in multiline mode. So, by default, PHP is not going to fall prey to the same problem being discussed here. In addition, the first \n would be a part of the first line it's on, so including it as a part of the string would make sense. More to the point, in this context, $ does mean end of the string in PHP. You can prove otherwise by getting the 2nd and 3rd example above to output a 1 instead of a 0 without going into multiline mode.
Interesting that a trailing newline is accepted. Not as bad as what's in the post, at least. Definitely worth breaking out which languages do which of those, though! Python, for instance, only accepts a trailing newline but not additional chars beyond that.
I don't think Java should be in your first list, though? Pattern.matches("^foo$", "foo\n") returns false.
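The Python half of that claim is easy to check (a quick sketch; without re.MULTILINE, `$` can match before a trailing newline, but only at the very end of the string):

```python
import re

pat = re.compile(r"^foo$")  # note: no re.MULTILINE flag

print(bool(pat.search("foo")))       # True
print(bool(pat.search("foo\n")))     # True:  the single trailing newline is tolerated
print(bool(pat.search("foo\nbar")))  # False: anything beyond it is not
```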
The potential trouble with $ (even in single-line mode) is that it matches the end of a string BOTH with AND without a newline at the end. If you're using it to ensure the string has no newline before doing something with it, this can lead to trouble.
$ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo") else "no")'
yes
$ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo\n") else "no")'
yes
$ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo") else "no")'
yes
$ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo\n") else "no")'
no
Even if the newline is not problematic, using \A and \Z makes your intentions clearer to the reader, especially if you add re.X and place comments into the pattern.
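For instance (a sketch; the allowlist pattern here is my own, chosen to echo the article's example):

```python
import re

# \A/\Z anchors plus re.X (verbose mode) with inline comments.
# Note: Python's \Z matches only at the very end of the string,
# so a trailing newline is rejected here.
pat = re.compile(r"""
    \A           # start of the string, not just of a line
    [a-z0-9 ]+   # the allowlist: lowercase letters, digits, spaces
    \Z           # end of the string, no trailing newline allowed
""", re.X)

print(bool(pat.match("now 75 percent off")))  # True
print(bool(pat.match("foo\n")))               # False
```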
Asides:
1. Based on syntax, you appear to be testing with python2.
2. With python, re.match is implicitly anchored to the start, so the ^ is redundant. Use re.search or omit the ^.
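A third option (assuming Python 3.4+) is `re.fullmatch`, which anchors both ends and sidesteps the `$`-vs-`\Z` question entirely:

```python
import re

# fullmatch succeeds only if the whole string matches the pattern.
print(bool(re.fullmatch(r"foo", "foo")))    # True
print(bool(re.fullmatch(r"foo", "foo\n")))  # False: no trailing-newline exception
```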
Correct me if I'm wrong, but if you extract a capture group (^foo$), you would get "foo" without the "\n", right?
If so, it is not "matching the end of a string" at all. Just end of line. That's exactly as expected in single-line mode, so it's good. May mismatch your expectations in multi-line mode though.
Yeah, my takeaway from this was more "the dangers of Ruby" rather than "the dangers of single line regular expressions" (:
I think the simplest fix would be to use "\Z" rather than "$", which means "match end of input" rather than "end of line." This is also Perl-compatible. So weird that the "$" default meaning is different in Ruby.
I guess one could argue that Ruby's way is better since "$" has a fixed meaning, rather than being context-dependent.
> Ruby seems to be in multiline mode all the time?
Ruby does have a "/m" for multiline mode, but it just makes "." match newline, rather than changing the meaning of "$", it seems.
You want \Z in Python, and \z in most other languages, to match on end of string. But in some languages $ really does match end of string. As always, you must check your docs.
Alternatively, don't validate and then use the original. Instead, pull out the acceptable input and use that.
Even better, compare that to the original and fail validation if they're not identical, but that requires maintaining a higher level of paranoia than may be reasonable to expect.
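A minimal Python sketch of both steps (extract the acceptable run, then fail if it differs from the original; the pattern is illustrative):

```python
import re

def sanitize(s: str) -> str:
    # Pull out the acceptable prefix rather than validating the original.
    m = re.match(r"[a-z0-9 ]+", s)
    extracted = m.group(0) if m else ""
    # The paranoid extra step: fail unless the extract IS the original.
    if extracted != s:
        raise ValueError("bad input")
    return extracted

print(sanitize("now 75 percent off"))  # "now 75 percent off"
# sanitize("foo\n<anything>") raises ValueError: the extract stops at the newline
```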
Heh. I wrote up my comment, and then thought "hey, I bet that's what that 'Parse don't validate' article meant, the one I never quite got around to reading." So I pulled it up — great article! — but then didn't post the link because it uses the type system to record the results of the parse. Whereas here, you'd probably parse from a string into another string.
But philosophically I agree, that's exactly the relevant advice.
This was interesting and new to me, but as other commenters indicate, part of the problem is that we're trying to find the bad thing rather than trying to verify it is the good thing.
There's a related concept of "failing open vs failing closed" (fail open: fire exit; fail closed: ranch gate).
In Jurassic Park (an amazing book/film for understanding system failures), when the power goes out, the fence is functionally an open gate.
In this case, we shouldn't assume that we can enumerate all possible bad strings (even with a regex).
I don't think this is a good example, because the regex does just that: it doesn't try to filter out bad input, it specifically only accepts known good input. If the regex did what it was meant to do, only allowing strings composed of ASCII letters, numbers, and spaces, then the code would not have been exploitable.
Still seems like that is broken. Shouldn't they be escaping whatever control characters? Like if your user wanted to highlight "Now 75% off". Seems like it is reasonable to want to allow that.
Escape the output based on the context a string is being used in versus trying to sanitize for all use cases on input.
This will guarantee that you’re safe no matter how a piece of content is used tomorrow (just need a new escaping function for that content type), and prevent awkward things like not letting users use “unsafe” strings as input. JSX and XHP are example templating systems that understand context and escape appropriately.
If a user wants their title to be “hello%0a%3C%25%3D%20File.open%28%27flag.txt%27%29.read%20%25%3E”, so be it.
Use input validation / parsing to ensure data types aren’t violated, but not as an output safety mechanism.
The known problematic characters are different in json, xml, css, html content, html attributes, MySQL, etc. Unless you have output escaping, it is hard to ensure everything gets caught, no matter how the data enters the system.
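In Python's standard library, that per-context escaping looks roughly like this (a sketch; each output sink gets its own escaping function):

```python
import html
import json
import urllib.parse

title = "hello\n<%= File.open('flag.txt').read %>"

# One escaping function per output context, applied at output time:
print(html.escape(title))                  # for HTML text content
print(json.dumps(title))                   # for embedding in JSON
print(urllib.parse.quote(title, safe=""))  # for a URL path/query component
```

For SQL, the equivalent move is a parameterized query rather than any string-level escaping.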
The output is not the problem here, it is the input. And, if you can get away with it, accepting a small set of known-safe characters is much safer than accepting any character and hoping it will be properly escaped at every level.
When the user hands you a string and you then pass this down to other bits of code, you can't know if it will be used in an SQL query, a regex, in an error message that will be rendered into HTML, etc.
Ideally all layers of your code would handle user input with the utmost care, but that is often very hard to achieve. If you take user input and use it in a regex, it's easy to regex-escape it, but it's much harder to remember that now this whole regex is user input and can't be safely used to, say, construct an SQL query. And even if you remember to properly escape it in the SQL query, it may show up in the returned result, and now if you display that result, you need to be careful to escape it before passing it to some HTML engine.
But then none of this works if you did intend to have some SQL syntax in the regex, or some HTML snippets in the DB: you'd need to make all of these technologies aware of which parts of the expressions are safe and which are tainted by user input.
And this is all just to prevent code injection type attacks. I haven't even discussed more subtle attacks, like using Unicode look-like characters to confuse other users.
Raku (perl6) was a chance for Larry Wall to fix some of the limitations of the perl regex syntax; as you would expect from the perl heritage, it behaves similarly.
- ^^ and $$ are the raku flavour of multiline mode
- ~~ the smartmatch operator binds the regex to the matchee and much more
- character classes are now <[...]> (plain [...] does what (...) does in math)
- perl's ternary x ? y : z becomes x ?? y !! z
We can have whitespace in our regexen now (and comments and multiline regexen)
  my $regex = rx/ \d ** 4   #`(match the year YYYY)
                  '-'
                  \d ** 2   # ...the month MM
                  '-'
                  \d ** 2 /;  # ...and the day DD
  say '2015-12-25'.match($regex); # OUTPUT: «「2015-12-25」»
Perl has supported whitespace and comments in regular expressions since approximately forever. Just use the /x modifier. All that Raku did was make that flag a default.
The same thing is available in many other languages. They copied it when they copied from Perl. For example Python's https://docs.python.org/3/library/re.html#flags documents that re.X, also called re.VERBOSE, does the same exact thing.
The fact that people don't use it is because few people care to learn regular expressions well enough to even know that it is an option. One of my favorite examples of astounding people with this was when I was writing a complex stored procedure in PostgreSQL. I read https://www.postgresql.org/docs/current/functions-matching.h.... I looked for flags. And yup, there is an x flag. It turns on "extended syntax". Which does the same exact thing. I needed a complex regular expression that I knew my coworkers couldn't have written themselves. So I commented the heck out of it. They couldn't believe that that was even a thing that you could do!
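For comparison, the raku date example upthread, written with Python's re.X:

```python
import re

date = re.compile(r"""
    \d{4}   # match the year YYYY
    -       # a literal hyphen (whitespace and comments are ignored under re.X)
    \d{2}   # ...the month MM
    -
    \d{2}   # ...and the day DD
""", re.X)

print(bool(date.fullmatch("2015-12-25")))  # True
```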
I think it's fair to say(?) that Larry's adoption of regex as a first-class aspect of perl was one of its unique strengths. My opinion is that none of the subsequent languages (Python, Go, Rust, etc.) that incorporated regex really embraced it - it was more of a bolt-on. So there has been a syntactic barrier to incorporating or improving regex within those languages. Not so with Larry's raku, which had the vision to build on the perl basis and to address many of the inconsistencies that have been baked in elsewhere.
I was a full-time perl coder with plenty of regex back in the day and never appreciated that (sorry) - so maybe making it the default is a good call.
PS. raku has added quite a lot to the regex facilities we are familiar with, not least a straight line to using them in Grammars with rule and token methods that give you control over handling of whitespace in the target
Do it in a sandbox and have aggressive timeouts.
[1] https://github.com/scala-js/scala-js/blob/eb160f1ef113794999...
[2] https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pa...
  if !(m = /^[a-z0-9 ]+$/.match(str))
    return "Bad Input"
  end
  str = m[0]
(\Z allows a trailing newline, \z does not)
Parse don't validate https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
But I guess this is why Python has so many ways of matching a pattern against a string (match, fullmatch, search, findall - they are hard to remember)
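For the record, the main entry points, all in the stdlib `re` module:

```python
import re

print(re.match(r"o", "foo"))           # None: match is anchored at the start
print(re.search(r"o", "foo"))          # first match anywhere in the string
print(re.fullmatch(r"f.*", "foo"))     # must consume the entire string
print(re.findall(r"o", "foo"))         # ['o', 'o']: every non-overlapping match
print(list(re.finditer(r"o", "foo")))  # same matches, but as match objects
```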
That's a good way to horizontally propagate/reflect XSS and other code-as-data vulnerabilities.
Better to strip the known-bad/problematic characters.
https://en.wikipedia.org/wiki/Code_as_data
It's also how you end up with apps that people can't use because they reject their perfectly valid legal name, address etc.
I guess I would tweak my first comment and say input filtering is not enough. You must do output filtering to truly be safe.
The bigger problem here is executing user input.