The common case of only two pairs of quotes is indeed regular, but if you want to support either all Unicode quotes (about 60 pairs of them) or C++11 raw string literals `R"delim(...)delim"` (intrinsically not regular) you are out of luck.
For any finite set of quotation character pairs you can get away with a strategy like `(left_char1)[^right_char1]*(right_char1)|(left_char2)[^right_char2]*(right_char2)|...`. Escape characters aren't much harder to accommodate.
Won't work if you're already in a string, or if there are escaped quotes in the string. Also won't work if you have two or more double quoted strings that both contain an apostrophe.
Escaping isn't an intrinsic property of all quoted strings (e.g. single-quoted strings in bash), but even so one can handle escapes without backreferences, by searching for anything that's not a quote or a backslash, _or_ any escaped character:
/"([^"\\]|\\.)*"/
Now double that up with a single quote version if you wish.
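As a quick sanity check, here's a minimal Python sketch of exactly that pattern, doubled up for single quotes (the variable names are just for illustration):

```python
import re

# Double-quoted branch: anything that's not a quote or backslash, OR any
# escaped character, repeated, between quotes. Same idea for single quotes.
dq = r'"(?:[^"\\]|\\.)*"'
sq = r"'(?:[^'\\]|\\.)*'"
STRING = re.compile(dq + "|" + sq)

text = r'''He said "hi \"there\"" and 'it\'s fine' today.'''
print(STRING.findall(text))
```

No backreferences involved: each branch is an ordinary regular expression, so a DFA-based engine can run it in linear time.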
What you can't match without backreferences, however, is strings with customisable terminators, e.g. the behaviour in sed that whatever character you use after `s` is the regex terminator (it doesn't have to be `/`), or raw strings in C++.
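With backreferences, though, the sed-style case becomes easy: capture whatever delimiter character follows `s` and demand it again. An illustrative Python sketch (not sed's actual parser, which also handles escaped delimiters):

```python
import re

# Group 1 captures the user's chosen delimiter character;
# the backreference \1 then requires that same character twice more.
SUBST = re.compile(r"s(.)(.*?)\1(.*?)\1")

m = SUBST.fullmatch("s#foo#bar#")
print(m.groups())  # ('#', 'foo', 'bar')
```

The same trick (capture the delimiter, reuse it via `\1`) is what makes C++ raw strings matchable with backreferences but not without.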
Backreferences don't really help with those problems.
> Won't work if you're already in a string
This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.
> or if there are escaped quotes in the string.
Solvable:
/'(\\'|\\\\|[^'\\])*'|"(\\"|\\\\|[^"\\])*"/
> Also won't work if you have two or more double quoted strings that both contain an apostrophe.
As some other replies pointed out, there are straightforward modifications to handle all those scenarios if those are your requirements instead. A place where regex _does_ fail is in arbitrarily nested string interpolations (the key being _arbitrary_ nesting because with enough time anyone can come up with a convoluted enough regex to handle a bounded degree of recursion).
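To make the nesting point concrete: a classic regex, even with backreferences (absent recursion extensions like PCRE's `(?R)`), cannot track unbounded nesting depth, while a few lines of ordinary code can. A minimal sketch, assuming a hypothetical `${...}` interpolation syntax:

```python
def balanced_interpolations(s: str) -> bool:
    """True iff every ${ opener has a matching }, at any nesting depth."""
    depth = 0
    i = 0
    while i < len(s):
        if s.startswith("${", i):
            depth += 1
            i += 2
        elif s[i] == "}" and depth > 0:
            depth -= 1
            i += 1
        else:
            i += 1  # ordinary character (a bare } with no opener is literal)
    return depth == 0
```

A single integer counter suffices here, which is precisely the unbounded state a finite automaton (and hence a regular expression) lacks.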
This line is super smart, and yet, even though I should know a fair amount about regexes and NP-completeness, my head feels dizzy as I try to make full sense of it. A sign I'm getting old or dumb, perhaps :(
Jokes aside: I'd love for you to elaborate a bit more on this. I'm pretty sure I would benefit a lot from a more expanded, "dumber" explanation.
>> That is quite literally a formal proof that "regex+backreferences" is NP-complete, since SAT is the index NP-complete problem.
> I'd love for you to elaborate a bit more on this.
I'm not the original poster, but I'll have a go.
SAT is NP-Hard. In other words, literally any NP problem can be efficiently[0] converted to SAT, and any solution can then be efficiently[0] converted back to a solution to the original.
Example: Think of the problem of factoring integers. Someone gives you an integer to factor; with a little work you can create a SAT instance, solve that, and then read off the factorisation of the original integer. SAT is, in some real sense, at least as hard as integer factoring.
So there is a proof that SAT is at least as hard as every NP problem. That's what we call "NP-Hard".
Now someone has shown that they can solve SAT problems using regex+backtrack. That means that every NP problem can be converted to SAT, then converted to regex+backtrack, solved, and the solution to the original read out from the result.
Thus regex+backtrack is at least as hard as every NP problem.
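A tiny, well-known illustration of the search power that backtracking plus backreferences buys you (not the SAT reduction itself, just a taste of it): a regex that recognises composite numbers written in unary.

```python
import re

# '1' * n matches iff n splits into k >= 2 copies of some block of
# length >= 2 -- i.e. iff n is composite. The engine backtracks over
# candidate factor lengths for us.
COMPOSITE = re.compile(r"^(11+?)\1+$")

def is_composite(n: int) -> bool:
    return COMPOSITE.match("1" * n) is not None

print([n for n in range(2, 20) if is_composite(n)])
# [4, 6, 8, 9, 10, 12, 14, 15, 16, 18]
```

The backtracking engine is effectively trying every possible factor, which is exactly the kind of exhaustive search the SAT construction exploits.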
Now in the case of SAT, it itself is NP. So the combination of being NP and being NP-Hard is called "NP-Complete", or NPC. So SAT is an example of a problem that's NPC.
What has not been shown (I think) is that regex+backtrack is in NP. Showing that a solution to regex+backtrack implies a solution to SAT shows that regex+backtrack is NP-Hard.
If the linked article also shows that regex+backtrack is NP, then it is therefore NPC. But we can see that regex+backtrack is in NP, because verifying an alleged match is a polynomial time operation.
So regex+backtrack is NPC.
[0] For a technical definition of "efficient": https://perl.plover.com/NPC/
Proving that a problem is NP-complete requires proving that it is NP-hard and in NP. Reducing SAT to regex matching with backreferences does the former. The latter requires proving that any solution can be checked in polynomial time.
The author also incorrectly states that only 3SAT problems are solvable. A proof that 3SAT is NP-hard does not exclude broader SAT.
While not a formal proof, it is fairly obvious that verifying a match is in P. Just fill in the captured groups from the answer in the regex and see if it corresponds to the input text.
Every SAT problem can be converted into a 3-SAT problem, so that's not really an issue; they are both NPC.
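The standard conversion splits any long clause using fresh auxiliary variables. A minimal sketch, with clauses as lists of signed integers DIMACS-style (`split_clause` and its chaining scheme are one common textbook variant, not the only one):

```python
def split_clause(clause, next_aux):
    """Split (l1 v l2 v ... v lk), k > 3, into equisatisfiable 3-clauses.

    Produces (l1 v l2 v a1), (-a1 v l3 v a2), ..., (-a_m v l_{k-1} v lk),
    where a1..a_m are fresh variables starting at next_aux.
    Returns (new_clauses, next_unused_aux_var).
    """
    if len(clause) <= 3:
        return [list(clause)], next_aux
    out = [[clause[0], clause[1], next_aux]]
    for lit in clause[2:-2]:
        # Chain: the aux var carries "some earlier literal was true".
        out.append([-next_aux, lit, next_aux + 1])
        next_aux += 1
    out.append([-next_aux, clause[-2], clause[-1]])
    return out, next_aux + 1
```

Each original clause of length k becomes k - 2 clauses of length 3, so the blow-up is only linear.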
I don’t think it’s pedantic to say that a regular language is not the same thing as a regular expression. The difference between syntax and semantics is real and important.
One of the nice properties of SAT is that solving an instance always terminates (if you're patient enough). Aren't regexes, especially with backreferences, Turing-complete though? If so, they could get caught in an infinite loop, meaning they'd be more general than SAT.
Yeah, I meant it the other way: if those regexps are Turing-complete, not all of them have an equivalent CNF representation, contrary to what the article seems to state in its first paragraph (and title).
That being said, regexps were not initially meant to be "programming languages", so I'm not sure about the "feature, not bug" part. I'd rather have a notation that would let me solve, for instance, the "HTML tag matching" problem and would be guaranteed to always terminate, than one that also lets me implement Conway's game of life.
This is actually amazing. My Python programs rarely run at 100% CPU, whereas C++ binaries are usually up there. I always thought Python's inefficiency caused the drop in CPU utilization.
I don't know how efficient it is, but I have always been able to peg all cores with the multiprocessing module. Even something useless like `x * x` is more than enough for 800%.
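Something along these lines (a minimal sketch; the per-item work is deliberately trivial, so real speedups only appear once the work outweighs the process-spawning overhead):

```python
from multiprocessing import Pool

def square(x):
    # Trivial CPU-bound work. Each worker is a separate process with its
    # own interpreter, so the GIL doesn't serialise them.
    return x * x

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

With enough items and heavier work per item, this happily saturates every core.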
> Also won't work if you have two or more double quoted strings that both contain an apostrophe.
The regex in my previous comment already solves that. See: https://repl.it/repls/SolidCapitalProgram
Is there a general consensus to use "regular expression" to refer to the actual regular ones and "regex" to refer to the non-regular variants?
https://docs.raku.org/language/regexes (see the intro paragraph)
Timings for fred.cnf on my machine:
python: 0m53.744s
PHP (no PCRE JIT): (hits backtrack limit in 1m15.994s)
PHP (PCRE JIT): 0m20.109s
# math and/or computer science texts? It's the regular ones.
# pretty much elsewhere? It's the extended ones.
# threads about parsing HTML with regular expressions? People using both and insisting their version is the only correct one.
https://cstheory.stackexchange.com/q/448/362
Personally I make the distinction, but I've noticed many many people do not, hence the question.
Still, yes, you can mess up your "add 1 to the input" program and make it run infinitely.
https://en.m.wikipedia.org/wiki/One-hot
Took 9 min 10 s on an RPi 3 running Ubuntu 20.04, consuming 100% CPU and 1% of RAM (1024 MB).