But I do know that tools often make it easy for people to do incredibly stupid things by accident. Like the 'sudo rm' ("remove file") command followed by '-rf' in the wrong place.
I rimraf a lot; it's very useful. It's a miracle I haven't wiped out any computers doing so...
I think it'd be possible to inject this kind of thing into your code if you're just starting out with vim, aren't cognisant of all commands you invoke, and then copy/paste all code straight into a browser.
Have to agree. I read a Post Incident Review like this and I'm like "Yep, totally see how that could happen. Thanks for solving it and thanks for letting me know why it happened".
Ha! The same bug happened internally at my company. In that case it was a regex matching a URL taking so much CPU as to cause a DOS of a proxy server. I won't be surprised if it's happened to someone here too.
This is very timely, because minutes ago, I made a link to Russ Cox's articles in my Kernighan awk repo:
https://github.com/andychu/bwk
https://swtch.com/~rsc/regexp/regexp1.html
If you are not familiar with this issue, basically Perl popularized bad computer science... "regexes" are not regular languages.
They say that this particular case triggered quadratic behavior, not exponential, but the point is that there is a linear time algorithm to do this.
The file b.c in the awk repo implements the linear time algorithm:
https://github.com/andychu/bwk/blob/master/b.c
(and rsc's site has some nice sample code too, as well as caveats with regard to capturing and so forth)
"Regular expressions are one of computer science's shining examples of how using good theory leads to good programs ..."
"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."
The alarming thing is that regexes are supposed to be compiled before use, and the class of expressions that needs quadratic or exponential time is distinct from the class that needs linear time. Why don't these popular, widely-used regex implementations perform any optimizations?
At least they keep you warm. Sometimes.
How can it not be a feature that you can heat more space the more spaces you feed it?
Unfortunately, there is a hint as to how this has happened:
"This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions."
What better excuse is there for a poor implementation than standards compliance? In many ways, using regex with backtracking by default is like programming in Lisp without tail-call optimizations. If the POSIX standard is going to require backreferences, then it should also require a non-backtracking implementation for regular expressions without backreferences, just like the Scheme specification requires implementations to support tail-call optimization.
The comparison is valid because both can turn a linear-time algorithm into a quadratic one.
FWIW, the conversion from NFA (non-deterministic, i.e. backtracking) to DFA (deterministic, linear time) can take exponential space. So there's another avenue for DDOS; it's a lot harder to exploit, though, because it requires the attacker to control the input (i.e. regular expression) to the NFA->DFA transformation, rather than merely provide a string that takes a long time for an NFA to recognize.
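A small self-contained way to see that blowup (the language and code below are my own illustration, not from the comment): the language "the n-th symbol from the end is an 'a'" has an (n+1)-state NFA, but the subset construction reaches 2^n DFA states.

from collections import deque

def dfa_state_count(n):
    # NFA over {a, b} for "the n-th symbol from the end is 'a'":
    # state 0 loops on both symbols and also steps to state 1 on 'a';
    # states 1..n-1 step to the next state on either symbol; state n accepts.
    def step(states, sym):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)
                if sym == "a":
                    nxt.add(1)
            elif s < n:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    queue = deque([start])
    while queue:                      # breadth-first subset construction
        cur = queue.popleft()
        for sym in "ab":
            nxt = step(cur, sym)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

for n in range(1, 11):
    print(n, dfa_state_count(n))      # 2, 4, 8, ... doubles with every n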
The name regular expression seems meant to evoke regular languages, whose significance is that they can be recognized in a very straightforward way (by finite state automata) without pathological gotcha cases like this.
Perhaps we ought to call them irregular expressions - this kind of behavior is the literal antithesis of what regular means in automata/parsing theory.
It's a little unfair to complain that they're slower than 30 year old regex engines when the old regex engines were so feature limited that they were nearly useless.
(Abandon all hope, you who enter here.)
https://swtch.com/~rsc/regexp/
In particular there don't seem to be crazy optimizations and non-portable stuff like you see in other production regex implementations and in interpreters. It's pure ANSI C.
The functions are all short -- no 1000 line monsters like I've seen in a lot of old C code lately.
The rsc example code is actually a great example of using pointers in C. People rightly avoid lots of raw pointers for application code, but for this use case, they yield compact and elegant code.
There's a simple trick the real-time and high-assurance communities use to catch stuff like this: enforce a hard limit on the time each task takes to complete. Some tasks you might give extra leeway so you don't DoS your own users. Many (most?) things have a sane, hard limit that can apply along with a log entry or notification to ops. Keeps one user or task from DDoSing the whole system, or forces it to fail fast and noticeably.
Not a lot of developers use this trick anymore, but it's quite powerful. It should get more consideration. Not to mention the real solution of choosing tools or algorithms with predictable, reliable behavior. Those definitely exist for the task being performed here, as other commenters noted.
EDIT to add a modern example from the high-assurance field that shows how awesome it can be (esp. see the GitHub link):
https://leepike.github.io/Copilot/
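A minimal sketch of that hard-deadline idea in Python (the helper names and the one-second budget are mine, not from the comment). Running the task in a separate process means it can actually be killed when it blows its budget:

import logging
import multiprocessing as mp
import re

def _worker(queue, fn, args):
    queue.put(fn(*args))

def run_with_deadline(fn, args=(), seconds=2.0):
    """Run fn(*args) in a child process; kill it if it exceeds the deadline."""
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(queue, fn, args))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()              # fail fast and noticeably
        proc.join()
        logging.warning("task exceeded %.1fs budget and was killed", seconds)
        raise TimeoutError("task exceeded its time budget")
    return queue.get()

def has_trailing_space(s):
    return bool(re.search(r"\s+$", s))   # the outage's pathological pattern

if __name__ == "__main__":
    s = " " * 20000 + "x"                 # spaces, then a non-space at the end
    try:
        print(run_with_deadline(has_trailing_space, (s,), seconds=1.0))
    except TimeoutError:
        print("regex call aborted after 1 second")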
For the specific example, this is how I dealt with a situation in one service I wrote that takes an arbitrary user-controlled regex (for search).
The program was written in python, and as it turns out the GIL prevents a thread from being able to interrupt another thread that's doing a re.match() call - the entire regex engine is in C so the GIL is held for the entire call.
My solution was to use SIGALRM to set a timer before calling, and have the signal handler raise an exception. This causes the regex call to raise that exception, which I then catch and display an appropriate error.
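A sketch of the SIGALRM approach described above (function and exception names are mine). One caveat worth hedging: signal.alarm/setitimer only works in the main thread on Unix, and whether the pending signal interrupts a single long-running C-level match or only fires once it returns depends on the CPython version; if it cannot interrupt mid-call, a process-based deadline like the one shown earlier is the more robust fallback.

import re
import signal

class RegexTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise RegexTimeout()

def search_with_timeout(pattern, text, seconds=1.0):
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return re.search(pattern, text)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)

try:
    search_with_timeout(r"\s+$", " " * 20000 + "x", seconds=0.5)
except RegexTimeout:
    print("regex took too long; return an error to the user instead")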
It doesn't actually take that many lines of code to implement a linear time NFA engine. Most of the code is actually in the regex compiler. That is, there are only a few actual "instructions" or node types in a regex engine (alternation, concatenation, etc.). The rest is just compiling the bizarre syntax to those nodes/instructions. (And dealing with Unicode if you need that.)
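To make that concrete, here is a toy version of such an engine (my own sketch, not from the comment): a handful of instructions simulated with a set of "threads", so the run time is O(len(program) * len(text)). The compiler from regex syntax to these instructions, which is where most of a real engine's code lives, is skipped and the program is written out by hand.

CHAR, ANY, SPLIT, JMP, MATCH = range(5)

def run(prog, text):
    """Return True if prog matches all of text, in O(len(prog) * len(text))."""
    def add(threads, pc):
        # Follow epsilon edges (SPLIT/JMP); the set doubles as a loop guard.
        if pc in threads:
            return
        threads.add(pc)
        op = prog[pc]
        if op[0] == JMP:
            add(threads, op[1])
        elif op[0] == SPLIT:
            add(threads, op[1])
            add(threads, op[2])

    current = set()
    add(current, 0)
    for ch in text:
        nxt = set()
        for pc in current:
            op = prog[pc]
            if op[0] == ANY or (op[0] == CHAR and op[1] == ch):
                add(nxt, pc + 1)
        current = nxt
    return any(prog[pc][0] == MATCH for pc in current)

# Hand-"compiled" program for the regex a+b ('a' one or more times, then 'b'):
#   0: CHAR a    1: SPLIT 0, 2    2: CHAR b    3: MATCH
prog = [(CHAR, "a"), (SPLIT, 0, 2), (CHAR, "b"), (MATCH,)]
print(run(prog, "aaab"))   # True
print(run(prog, "ab"))     # True
print(run(prog, "b"))      # False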
The whole awk implementation is 958 lines, so it can't be that bad. It supports a fairly featureful regex language (though no capturing AFAICT).
This should be the title of a book on software engineering.
I've fixed so many bugs using regex, only to have to fix several more bugs later.
My current stance is: avoid regex if at all possible. It turns out many of the things we use regex for are possible without it. Oftentimes .Substring, .IndexOf, and using LINQ over strings are sufficient.
Regexes are almost always a massive code smell. They should almost never be used: bad idea, bad implementation, hard to grok, hard to debug, hard to test, hard to spot.
Whoever came up with them has surely been given the same honorary place in hell as Jon Postel, who invented the utterly disastrous "be liberal in what you accept" principle that has plagued all web developers for the last 20 years.
I'll write a trivial state machine over using regex any day of the week. They are easier to write and read (especially if complex), aren't a black box in terms of profiling and don't have surprising performance characteristics like regex does. Unless we're talking about a Thompson NFA (which isn't used by many modern stdlibs) or JS, regexes simply do not appear in my belt of solutions.
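For the trimming case at the heart of this outage, the no-regex version is just two explicit linear scans (a sketch; the name trim is mine). Worst case stays O(n) no matter where the spaces sit, and there is nothing hidden to profile.

def trim(s):
    start = 0
    end = len(s)
    while start < end and s[start].isspace():
        start += 1
    while end > start and s[end - 1].isspace():
        end -= 1
    return s[start:end]

print(repr(trim("   hello   world   ")))           # 'hello   world'
print(len(trim(" " * 20000 + "x" + " " * 20000)))  # 1, and still instant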
Heh... as a Java developer, my favorite version of that joke is where you solve the problem with Java, and now you have an `AbstractProblemFactoryFactory`.
http://regex.info/blog/2006-09-15/247
It dates back to August 12, 1997!
(Perhaps someone should encourage jwz to have a 20th birthday party for that at DNA?)
My guess is they're searching for the first non-whitespace character, reverse-searching for the last non-whitespace character, and then using String.Substring to return only what's in the middle.
As to why they're not using String.Trim (https://msdn.microsoft.com/en-us/library/t97s7bs3(v=vs.110)....), maybe it's because String.Trim doesn't seem to know about the 200c whitespace character.
https://archive.org/details/stackexchange
In the past, I have done Load Balancer status checks against a special /status endpoint. I queried all the connected services (e.g. DB, Redis, etc.) with a super fast query (e.g. `SELECT version();`). Monitoring CPU/MEM usage for scaling was separate.
Comparing this to checking the home page, what is the best way to set up a health check for your load balancers?
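A minimal sketch of that kind of /status endpoint (Flask, psycopg2, and redis-py are placeholders here; the comment doesn't say which stack was actually used):

from flask import Flask
import psycopg2
import redis

app = Flask(__name__)

@app.route("/status")
def status():
    try:
        conn = psycopg2.connect("dbname=app", connect_timeout=1)
        try:
            cur = conn.cursor()
            cur.execute("SELECT version();")   # the "super fast query"
        finally:
            conn.close()
        redis.Redis(socket_timeout=1).ping()
    except Exception as exc:
        # A non-200 tells the load balancer to pull this node from rotation.
        return "unhealthy: %s" % exc, 503
    return "ok", 200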
I think you've got the right approach - a vertical slice through the app that checks every layer. You want to know if a user can get useful info from your site, and it tracks (separately!) the common path their query would follow.
The danger is that the endpoint becomes public knowledge and comes under a DDOS attack. Putting an IP address filter on that endpoint is usually enough to stop that.
The concept I try to go for with that status check is 'Can this node connect to everything so it can successfully respond to http requests'. However my approach wouldn't identify an overloaded server, which might be a good thing if we need to scale up - taking down an overloaded server is just going to make the other servers that much more overloaded.
I'm always up for hearing about other ways people solve health checks.
Why is there even direct external IP connectivity to the realserver, sidestepping the loadbalancer?
I'd be VERY careful about including external dependencies in an HTTP health check which results in a web server being removed from service - it's usually an invitation for cascading failures.
1) If you do have a back-end failure, this setup can cloud the root cause during recovery because your downstream web servers are down as well.
2) Transitory back-end failures can cascade, and your health checks can make this worse as things flap around. You need to be VERY careful with your timers and retry logic, and tune them appropriately. eg, how does `/status` handle a single TCP reset from the DB? When does it decide to timeout if the DB doesn't respond? It's hard to get this right - network errors vs protocol errors vs timeouts are often handled differently at different layers of the controller, and the load balancer has its own retries and timeouts.
3) If this health check is being used for an API, recovery can be more difficult because of thundering herds. eg, 5/5 API servers go down because of a temporary DB failure. Queues and requests back up. 1/5 API servers is restored to service a bit earlier than the others, and immediately goes down again because it's not degrading gracefully under heavy load. Wash/rinse/repeat until you figure out how to load shed.
4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken? Ideally, the app will degrade gracefully if certain requests can be served from cache, or don't require the failed backend.
5) The failure case where this type of thing might be useful (network partition affecting DB connectivity from a subset of your webservers) isn't very common in the environments I've been involved with.
A far more productive approach if you absolutely need this sort of thing is to implement circuit breakers[1] at the app level, and explicitly make certain tripped breakers a hard failure for the /status endpoint. This has the advantage of not depending solely on synthetic queries, and having very explicit restoration logic.
For load balancer health checks, I prefer ones that hit the controller and expect a 200, and nothing more. (The DB and overall health of the service cluster are monitored through more organic instrumentation.)
I was laughing to myself when I read the Stack Overflow post mortem because I have a healthy pile of healthcheck-too-expensive war stories, and their experience resonated.
[1] http://martinfowler.com/bliki/CircuitBreaker.html
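A minimal sketch of the circuit-breaker-gated health check idea (thresholds, names, and the simplified reset logic are mine; a real one would implement proper half-open probing as described in the Fowler article):

import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()

    @property
    def open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.reset_after:
            # allow another attempt after the cool-down ("half-open", simplified)
            self.opened_at = None
            self.failures = 0
            return False
        return True

# One breaker per dependency; real request paths feed them, /status only reads them.
db_breaker = CircuitBreaker()

def status_handler():
    if db_breaker.open:
        return 503, "db circuit open"   # explicit hard failure for the LB
    return 200, "ok"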
You make a good point about health checks. It's easy to conflate single node health with overall service health. The LB check should really be about "can this node serve HTTP traffic", even if the non-health-check requests can only send back 5xx responses because their dependencies are failing.
I've also had the opposite problem where the health check would answer 200 OK, but the real app itself had a corrupted internal state because of a poor design/threading bug. If the health check had been the home page the node would have been pulled from the LB. While a more in-depth health check would have helped here, I think it's better to alert/monitor/kill any node with a higher than normal error rate and leave the health check simple.
You make a good point, but I don't see it as black and white.
> 4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken?
Yes, if that's the case you should not respond "unable to serve". However, you should probably fail the setup check and not start serving traffic if you're a newly provisioned node.
But!
I wouldn't call this cascading failure so much as error propagation. If a critical dependency is down, making the node essentially worthless, it should be removed. This can be caused by things like network partitions and whatnot - as you say in 5). You say "it's not likely", which is sadly a common statement, but when it does happen it can be nasty - and the bigger the environment, the more often it happens.
Cascading failure usually (in my experience anyway) refers to having the actual failover cause more outages. In the case of failing /status I can only see this happening if the architecture of the system is broken anyway - or the load is genuinely too high for the failure that just occurred.
You say: "The DB and overall health of the service cluster are monitored through more organic instrumentation". What is acting on that? What happens when you do get that rare network partition and half of your web nodes cannot reach the database?
This is the right way. It's too easy to bork everything otherwise; yesterday I borked a new build server because I set security up to disallow anonymous access, which led to the main page returning a 302 redirect to the login page and thereby failed the load balancer health check.
You live and learn.
I remember the day I learned that Python's "re" module uses backtracking for non-extended regexes. My tests covered lots of corner cases in the regex logic, but were too short for me to notice the performance penalty. Luckily I only caused a partial outage in production.
I actually got to talk to Raymond Hettinger (Python core team) about why re uses a potentially exponential-time algorithm for regexes when there is a famous linear-time algorithm, and (I suspect) most programmers would assume linear complexity. As it turns out, there was an attempt to re-write re to fix this, but the re-write never managed to present exactly the same (extremely large) API as the existing module. He advised me that "the standard library is where code goes to die."
I vaguely remember a replacement for Python's re module, but I can't remember the name, and Googling for it is an exercise in frustration.
Edit: it's "regex" - https://pypi.python.org/pypi/regex. I have no idea if it behaves better with backtracking, a cursory glance at the documentation doesn't indicate any changes in that area.
The regex module shows the same behavior as the re module in this case:
$ python -mtimeit -s "s=' '*20000; import regex as re" "re.search(r'\s+$', s)" # ends with spaces is fast
10000 loops, best of 3: 201 usec per loop
$ python -mtimeit -s "s=' '*20000+'non-space'; import regex as re" "re.search(r'\s+$', s)" # ends with a non-space is slow
10 loops, best of 3: 3.39 sec per loop
I looked at that module a while back. The goto statements make it very difficult to analyze. It seems to integrate well with the signals module, which could be used to interrupt a large class of ReDoS attempts, but determining whether all potential regexps could be interrupted using signals, with the goto statements, is a difficult task.
Weirdly, I've "known" this since I started writing Perl in the mid-'90s. Not sure where I originally read it (or was told it). Funny how that works.
I try to write my regexes such that they anchor at the front of the string or the back, or they describe the whole string; never an either-or anchoring type situation like this example.
Spaces at beginning of string (100,000 iterations):
Rate onestep twostep
onestep 62500/s -- -2%
twostep 63694/s 2% --
real 0m3.093s
user 0m3.066s
sys 0m0.018s
Spaces at end of string (100,000 iterations):
Rate twostep onestep
twostep 55249/s -- -9%
onestep 60976/s 10% --
real 0m3.453s
user 0m3.421s
sys 0m0.022s
Spaces in middle of string (only 500 iterations because I don't want to sit here for four hours):
Rate onestep twostep
onestep 7.11/s -- -100%
twostep 16667/s 234333% --
real 1m10.741s
user 1m10.207s
sys 0m0.228s
You can see in my other post that this doesn't always work. For example, Python's regex engine chokes on `\s+$` if you use `re.search`, but works fine if you use `re.match`.
It's likely that Perl has an optimization for `\s+$` but not `^\s+|\s+$` (the former regex only ever matches at the end of the string, which is amenable to optimizations).
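A quick way to see that difference, in the same spirit as the timeit runs above (numbers vary by machine): re.search retries the `\s+` at every starting offset, so the final mismatch is paid roughly n times, while re.match only ever tries position 0 and gives up after a single backtracking pass.

import re
import timeit

s = " " * 20000 + "x"
print(timeit.timeit(lambda: re.search(r"\s+$", s), number=1))  # slow: every start position retried
print(timeit.timeit(lambda: re.match(r"\s+$", s), number=1))   # fast: only position 0 is tried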
I did see your other post, and upvoted it. This rule of thumb has served me well between different regex dialects and implementations, but it's not surprising that there are some specific cases that are "broken" for lack of a better word.
I haven't done much Python but the documentation for re.search() and re.match() is very clear: use search to find an expression anywhere in a string, use match to find an expression at the beginning of a string. It appears to ignore anchors in both cases? Left undetermined then is how to anchor an expression at the end of a string. You say re.match() works, but this is pretty confusingly described, and it's easy to see how the ambiguity can lead to problems for even the most experienced programmers.
I don't understand something: the regex expected a space character, followed by the end of the string. If the last character wasn't a space, this could never match. Why did the engine keep backtracking, even though it's easy to figure out that it could never match the regex?
Most simple regular expression evaluators are basically state machines. They hold a few variables:
a) what part of the regex am I currently trying to match
b) what point in the string am I currently starting at
c) how much of the string has this piece of the regex consumed so far
Then the state machine basically has three transitions:
* if (b+c) terminally matches (a), increment both (a) and (b) and reset (c)
* if (b+c) matches (a), but more could be consumed, increment (c) and check again
* if (b+c) doesn't match (a), increment (b) and reset (c)
So yeah, you could put in short cuts for things like "well this didn't match because the next piece of the regex is matching on a line ending", but keeping it simple is straightforward and generally smiled upon.
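A toy matcher in that spirit (my own sketch, based on the classic Rob Pike example: literals, '.', '*', and the two anchors, nothing more). The outer loop in search() is (b), the recursion in match_here() advances (a), and match_star() walking along the text is (c); it uses shortest-match-first rather than greedy repetition, but it backtracks the same way and goes quadratic on the same kind of input.

def search(pattern, text):
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    while True:                        # (b): try each starting position
        if match_here(pattern, text):
            return True
        if text == "":
            return False
        text = text[1:]

def match_here(pattern, text):         # (a): match pattern at the start of text
    if pattern == "":
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""
    if text and (pattern[0] == "." or pattern[0] == text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c, pattern, text):      # (c): let c* consume more and more text
    while True:
        if match_here(pattern, text):
            return True
        if not text or (text[0] != c and c != "."):
            return False
        text = text[1:]

print(search("a*b$", "aaab"))     # True
print(search("a*b$", "a" * 30))   # False, but only after trying every start position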
If you are really using a state machine and every match in the regex is at the end of the haystack, then you can reverse the state machine and use the same algorithm you described in reverse. :-)
(Rust's regex engine does this.)
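A tiny illustration of the reversal idea, using Python's re as a stand-in for the state machine (reversing a general pattern mechanically is the engine's job; here the reversed pattern is simply written out by hand): an end-anchored pattern on s becomes a start-anchored pattern on s[::-1], so only one starting position is ever tried.

import re

def has_trailing_whitespace(s):
    # essentially the same answer as re.search(r"\s+$", s), but scanning from the end
    return re.match(r"\s+", s[::-1]) is not None

print(has_trailing_whitespace("hello   "))          # True
print(has_trailing_whitespace(" " * 20000 + "x"))   # False, and instant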
While this might have been caused by a mistake, these types of bugs can be (and are) abused by hackers.
https://www.owasp.org/index.php/Regular_expression_Denial_of...
https://en.wikipedia.org/wiki/ReDoS
The post also links to this video: https://vimeo.com/112065252