Seems to me this is more about the danger of passing anything derived from user input into the TEMPLATE side of a templating engine. Why in the world would you ever do that?!?
Obviously if you pass data into the variable side of the engine, you hardly have to worry about it at all, since it's already going into a place that was designed for handling arbitrary and possibly-hostile input and has been battle-tested at doing it correctly in production for many years. If you pass it into the template side, you're betting that you can be as good as dozens of templating-engine writers working for a decade at doing that, in exchange for, well, I can't really think of any possible legitimate advantage for doing that.
Ruby seems to be in multiline mode all the time?
EDIT: this is documented behavior for Ruby: https://docs.ruby-lang.org/en/master/Regexp.html#class-Regex... What other languages call multiline mode is the default; you're supposed to use \A and \Z instead. They do have an `/m`, but it only affects the interpretation of `.`
False. "$" does NOT mean end-of-string in Perl, Python, PHP, Ruby, Java, or .NET. In particular, a trailing newline (at least) is accepted in those languages.
A $ does mean end-of-string in Javascript, POSIX, Rust (if using its usual package), and Go.
I'm working with the OpenSSF best practices working group to create some guidance on this stuff. It's a very common misconception. Stay tuned.
If anyone knows of vulnerabilities caused by this, let me know.
`$` does mean end of input in Java, unless you explicitly ask for multiline mode. In the latter case it means `(?=$|\n)` if also in Unix-lines mode, and the horrible `(?=$|(?<!\r)\n|[\r\u0085\u2028\u2029])` otherwise.
I wrote a compiler from Java regex to JavaScript RegExp, in which you'll find that particular compilation scheme [1].
Edit: also quoting from [2]:
> By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
I'm not sure which version of PHP had the behavior you describe, or whether it misbehaves under more specific conditions, but preg_match() is one of the more commonly-used regex functions, all of which share the same engine. The behavior here seems to be "correct" for at least the last 5 years, for varying interpretations of "correct".
edit: https://3v4l.org/N4o8D suggests that the behavior here is identical for all versions of PHP from 4.3 to 8.3.6.
So, you are technically correct when you say PHP accepts a trailing newline, but that doesn't refute the comment in the context we are discussing.
Which all makes sense, as by default PHP doesn't operate in multiline mode. So, by default, PHP is not going to fall prey to the same problem being discussed here. In addition, the first \n would be a part of the first line it's on, so including it as a part of the string would make sense. More to the point, in this context, $ does mean end of the string in PHP. You can prove otherwise by getting the 2nd and 3rd example above to output a 1 instead of a 0 without going into multiline mode.
Interesting that a trailing newline is accepted. Not as bad as what's in the post, at least. Definitely worth breaking out which languages do which of those, though! Python, for instance, only accepts a trailing newline but not additional chars beyond that.
I don't think Java should be in your first list, though? Pattern.matches("^foo$", "foo\n") returns false.
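The Python half of that claim is easy to check (a quick sketch; without re.MULTILINE, `$` can match before a trailing newline, but only at the very end of the string):

```python
import re

pat = re.compile(r"^foo$")  # note: no re.MULTILINE flag

print(bool(pat.search("foo")))       # True
print(bool(pat.search("foo\n")))     # True:  the single trailing newline is tolerated
print(bool(pat.search("foo\nbar")))  # False: anything beyond it is not
```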
The potential trouble with $ (even in single-line mode) is that it matches the end of a string BOTH with AND without a newline at the end. If you're using it to ensure the string has no newline before doing something with it, this can lead to trouble.
$ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo") else "no")'
yes
$ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo\n") else "no")'
yes
$ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo") else "no")'
yes
$ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo\n") else "no")'
no
Even if the newline is not problematic, using \A and \Z makes your intentions clearer to the reader, especially if you add re.X and place comments into the pattern.
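For instance (a sketch; the allowlist pattern here is my own, chosen to echo the article's example):

```python
import re

# \A/\Z anchors plus re.X (verbose mode) with inline comments.
# Note: Python's \Z matches only at the very end of the string,
# so a trailing newline is rejected here.
pat = re.compile(r"""
    \A           # start of the string, not just of a line
    [a-z0-9 ]+   # the allowlist: lowercase letters, digits, spaces
    \Z           # end of the string, no trailing newline allowed
""", re.X)

print(bool(pat.match("now 75 percent off")))  # True
print(bool(pat.match("foo\n")))               # False
```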
Asides:
1. Based on syntax, you appear to be testing with python2.
2. With python, re.match is implicitly anchored to the start, so the ^ is redundant. Use re.search or omit the ^.
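A third option (assuming Python 3.4+) is `re.fullmatch`, which anchors both ends and sidesteps the `$`-vs-`\Z` question entirely:

```python
import re

# fullmatch succeeds only if the whole string matches the pattern.
print(bool(re.fullmatch(r"foo", "foo")))    # True
print(bool(re.fullmatch(r"foo", "foo\n")))  # False: no trailing-newline exception
```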
Correct me if I'm wrong, but if you extract a capture group (^foo$), you would get "foo" without the "\n", right?
If so, it is not "matching the end of a string" at all. Just end of line. That's exactly as expected in single-line mode, so it's good. May mismatch your expectations in multi-line mode though.
Yeah, my takeaway from this was more "the dangers of Ruby" rather than "the dangers of single line regular expressions" (:
I think the simplest fix would be to use "\Z" rather than "$", which means "match end of input" rather than "end of line." This is also Perl-compatible. So weird that the "$" default meaning is different in Ruby.
I guess one could argue that Ruby's way is better since "$" has a fixed meaning, rather than being context-dependent.
> Ruby seems to be in multiline mode all the time?
Ruby does have a "/m" for multiline mode, but it just makes "." match newline, rather than changing the meaning of "$", it seems.
You want \Z in Python, and \z in most other languages, to match on end of string. But in some languages $ really does match end of string. As always, you must check your docs.
Alternatively, don't validate and then use the original. Instead, pull out the acceptable input and use that.
Even better, compare that to the original and fail validation if they're not identical, but that requires maintaining a higher level of paranoia than may be reasonable to expect.
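A minimal Python sketch of both steps (extract the acceptable run, then fail if it differs from the original; the pattern is illustrative):

```python
import re

def sanitize(s: str) -> str:
    # Pull out the acceptable prefix rather than validating the original.
    m = re.match(r"[a-z0-9 ]+", s)
    extracted = m.group(0) if m else ""
    # The paranoid extra step: fail unless the extract IS the original.
    if extracted != s:
        raise ValueError("bad input")
    return extracted

print(sanitize("now 75 percent off"))  # "now 75 percent off"
# sanitize("foo\n<anything>") raises ValueError: the extract stops at the newline
```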
Heh. I wrote up my comment, and then thought "hey, I bet that's what that 'Parse don't validate' article meant, the one I never quite got around to reading." So I pulled it up — great article! — but then didn't post the link because it uses the type system to record the results of the parse. Whereas here, you'd probably parse from a string into another string.
But philosophically I agree, that's exactly the relevant advice.
This was interesting and new to me, but as other commenters indicate, part of the problem is that we're trying to find the bad thing rather than trying to verify it is the good thing.
There's a related concept of "failing open vs failing closed" (fail open: fire exit; fail closed: ranch gate).
In Jurassic Park (an amazing book/film for understanding system failures), when the power goes out, the fence is functionally an open gate.
In this case, we shouldn't assume that we can enumerate all possible bad strings (even with a regex).
I don't think this is a good example, because the regex does just that: it doesn't try to filter out bad input, it specifically only accepts known good input. If the regex did what it was meant to do, only allowing strings composed of ASCII letters, numbers, and spaces, then the code would not have been exploitable.
Still seems like that is broken. Shouldn't they be escaping whatever control characters? Like if your user wanted to highlight "Now 75% off". Seems like it is reasonable to want to allow that.
Escape the output based on the context a string is being used in versus trying to sanitize for all use cases on input.
This will guarantee that you’re safe no matter how a piece of content is used tomorrow (just need a new escaping function for that content type), and prevent awkward things like not letting users use “unsafe” strings as input. JSX and XHP are example templating systems that understand context and escape appropriately.
If a user wants their title to be “hello%0a%3C%25%3D%20File.open%28%27flag.txt%27%29.read%20%25%3E”, so be it.
Use input validation / parsing to ensure data types aren’t violated, but not as an output safety mechanism.
The known problematic characters are different in json, xml, css, html content, html attributes, MySQL, etc. Unless you have output escaping, it is hard to ensure everything gets caught, no matter how the data enters the system.
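In Python's standard library, that per-context escaping looks roughly like this (a sketch; each output sink gets its own escaping function):

```python
import html
import json
import urllib.parse

title = "hello\n<%= File.open('flag.txt').read %>"

# One escaping function per output context, applied at output time:
print(html.escape(title))                  # for HTML text content
print(json.dumps(title))                   # for embedding in JSON
print(urllib.parse.quote(title, safe=""))  # for a URL path/query component
```

For SQL, the equivalent move is a parameterized query rather than any string-level escaping.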
The output is not the problem here, it is the input. And, if you can get away with it, accepting a small set of known-safe characters is much safer than accepting any character and hoping it will be properly escaped at every level.
When the user hands you a string and you then pass this down to other bits of code, you can't know if it will be used in an SQL query, a regex, in an error message that will be rendered into HTML, etc.
Ideally all layers of your code would handle user input with the utmost care, but that is often very hard to achieve. If you take user input and use it in a regex, it's easy to regex-escape it, but it's much harder to remember that now this whole regex is user input and can't be safely used to, say, construct an SQL query. And even if you remember to properly escape it in the SQL query, it may show up in the returned result, and now if you display that result, you need to be careful to escape it before passing it to some HTML engine.
But then none of this works if you did intend to have some SQL syntax in the regex, or some HTML snippets in the DB: you'd need to make all of these technologies aware of which parts of the expressions are safe and which are tainted by user input.
And this is all just to prevent code injection type attacks. I haven't even discussed more subtle attacks, like using Unicode look-like characters to confuse other users.
Raku (perl6) was a chance for Larry Wall to fix some of the limitations of the perl regex syntax; as you would expect from the perl heritage, it behaves similarly.
- ^^ and $$ are the raku flavour of multiline mode
- ~~ the smartmatch operator binds the regex to the matchee and much more
- character classes are now <[...]> (plain [...] does what (...) does in math)
- perl's ternary x ? y : z becomes x ?? y !! z
We can have whitespace in our regexen now (and comments and multiline regexen)
  my $regex = rx/ \d ** 4   #`(match the year YYYY)
                  '-'
                  \d ** 2   # ...the month MM
                  '-'
                  \d ** 2 /;  # ...and the day DD
  say '2015-12-25'.match($regex); # OUTPUT: «「2015-12-25」»
Perl has supported whitespace and comments in regular expressions since approximately forever. Just use the /x modifier. All that Raku did was make that flag a default.
The same thing is available in many other languages. They copied it when they copied from Perl. For example Python's https://docs.python.org/3/library/re.html#flags documents that re.X, also called re.VERBOSE, does the same exact thing.
The fact that people don't use it is because few people care to learn regular expressions well enough to even know that it is an option. One of my favorite examples of astounding people with this was when I was writing a complex stored procedure in PostgreSQL. I read https://www.postgresql.org/docs/current/functions-matching.h.... I looked for flags. And yup, there is an x flag. It turns on "extended syntax". Which does the same exact thing. I needed a complex regular expression that I knew my coworkers couldn't have written themselves. So I commented the heck out of it. They couldn't believe that that was even a thing that you could do!
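For comparison, the raku date example upthread, written with Python's re.X:

```python
import re

date = re.compile(r"""
    \d{4}   # match the year YYYY
    -       # a literal hyphen (whitespace and comments are ignored under re.X)
    \d{2}   # ...the month MM
    -
    \d{2}   # ...and the day DD
""", re.X)

print(bool(date.fullmatch("2015-12-25")))  # True
```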
I think it's fair to say(?) that Larry's adoption of regex as a first-class aspect of perl was one of its unique strengths. My opinion is that none of the subsequent languages (Python, Go, Rust, etc.) that incorporated regex really embraced it - it was more of a bolt-on. So there has been a syntactic barrier to incorporating or improving regex within those languages. Not so with Larry's raku, which had the vision to build on the perl basis and to address many of the inconsistencies that have been baked in elsewhere.
I was a full-time perl coder with plenty of regex back in the day and never appreciated that (sorry) - so maybe making it the default is a good call.
PS. raku has added quite a lot to the regex facilities we are familiar with, not least a straight line to using them in Grammars with rule and token methods that give you control over handling of whitespace in the target
Do it in a sandbox and have aggressive timeouts.
[1] https://github.com/scala-js/scala-js/blob/eb160f1ef113794999...
[2] https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pa...
  if !(m = /^[a-z0-9 ]+$/.match(str))
    return "Bad Input"
  end
  str = m[0]
(\Z allows a trailing newline, \z does not)
Parse don't validate https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
But I guess this is why Python has so many ways of matching a pattern against a string (match, fullmatch, search, findall - they are hard to remember)
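For the record, the main entry points, all in the stdlib `re` module:

```python
import re

print(re.match(r"o", "foo"))           # None: match is anchored at the start
print(re.search(r"o", "foo"))          # first match anywhere in the string
print(re.fullmatch(r"f.*", "foo"))     # must consume the entire string
print(re.findall(r"o", "foo"))         # ['o', 'o']: every non-overlapping match
print(list(re.finditer(r"o", "foo")))  # same matches, but as match objects
```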
That's a good way to horizontally propagate/reflect XSS and other code-as-data vulnerabilities.
Better to strip the known-bad/problematic characters.
https://en.wikipedia.org/wiki/Code_as_data
It's also how you end up with apps that people can't use because they reject their perfectly valid legal name, address etc.
I guess I would tweak my first comment and say input filtering is not enough. You must do output filtering to truly be safe.
The bigger problem here is executing user input.