Readit News logoReadit News
Thorrez · 6 years ago
> Another practical usage I've heard: match "string" or 'string', but not "string'.

You don't need backreferences for that:

    '[^']*'|"[^"]*"

lifthrasiir · 6 years ago
The common case of only two pairs of quotes is indeed regular, but if you want to support either all Unicode quotes (about 60 pairs of them) or C++11 raw string literals `R"delim(...)delim"` (intrinsically not regular) you are out of luck.
hansvm · 6 years ago
For any finite set of quotation character pairs you can get away with a strategy like `(left_char1)[^right_char1](right_char1)|(left_char2)[^right_char2](right_char2)|...`. Escape characters aren't much harder to accommodate.
throw681158 · 6 years ago
Won't work if you're already in a string, or if there are escaped quotes in the string. Also won't work if you have two or more double quoted strings that both contain an apostrophe.
Leszek · 6 years ago
Escaping isn't an intrinsic property of all quoted strings (e.g. single quoted strings in bash), but even so one can work around them without backreferences, by searching for anything that's not a quote or a backslash, _or_ any escaped character:

    /"([^"\\]|\\.)*"/
Now double that up with a single quote version if you wish.

What you can't match without backreferences, however, is strings with customisable terminators, e.g. the behaviour in sed that whatever character you use after `s` is the regex terminator (it doesn't have to be `/`), or raw strings in C++.

Thorrez · 6 years ago
Backreferences don't really help with those problems.

> Won't work if you're already in a string

This doesn't make sense. How can you search for a string if you're already in a string? I can't think of a realistic situation where that would be useful or even really possible.

> or if there are escaped quotes in the string.

Solvable:

    '(\'|\\|[^\'])*'|"(\"|\\|[^\"])*"
> Also won't work if you have two or more double quoted strings that both contain an apostrophe.

The regex in my previous comment already solves that. See: https://repl.it/repls/SolidCapitalProgram

hansvm · 6 years ago
As some other replies pointed out, there are straightforward modifications to handle all those scenarios if those are your requirements instead. A place where regex _does_ fail is in arbitrarily nested string interpolations (the key being _arbitrary_ nesting because with enough time anyone can come up with a convoluted enough regex to handle a bounded degree of recursion).
bottled_poe · 6 years ago
That wasn’t specified in the requirements
quickthrower2 · 6 years ago
At some point you'd want to pull out a parser. Maybe 1 more step after this point.
PaulHoule · 6 years ago
That is quite literally a formal proof that "regex+backreferences" is NP-complete, since SAT is the index NP-complete problem.
simonebrunozzi · 6 years ago
This line is super smart, and yet, despite I should know a lot about regex and NP-complete, my head feels dizzy as I try to make full sense of it. A sign I'm getting old or dumb, perhaps :(

Jokes apart: I'd love for you to elaborate a bit more on this. I'm pretty sure I would benefit a lot from a more expanded, "dumber" explanation.

ColinWright · 6 years ago
>> That is quite literally a formal proof that "regex+backreferences" is NP-complete, since SAT is the index NP-complete problem.

> I'd love for you to elaborate a bit more on this.

I'm not the original poster, but I'll have a go.

SAT is NP-Hard. In other words, any literally any NP problem can be efficiently[0] converted to SAT, and any solution can then be efficiently[0] converted back to a solution to the original.

Example: Think of the problem of factoring integers. Someone gives you an integer to factor, with a little work you can create a SAT instance, solve that, and then read off the factorisation of the original integer. SAT is, in some real sense, at least has hard as INT.

So there is a proof that SAT is at least as hard as every NP problem. That's what we call "NP-Hard".

Now someone has shown that they can solve SAT problems by using regex_backtrack. That means that every NP problem can be converted to SAT, then converted to regex+backtrack, solved, and the solution to the original read out from the result.

Thus regex+backtrack is at least as hard as every NP problem.

Now in the case of SAT, it itself is NP. So the combination of being NP and being NP-Hard is called "NP-Complete", or NPC. So SAT is an example of a problem that's NPC.

What has not been shown (I think) is that regex+backtrack is in NP. Showing that a solution to regex+backtrack implies a solution to SAT shows that regex+backtrack is NP-Hard.

If the linked article also shows that regex+backtrack is NP, then it is therefore NPC. But we can see that regex+backtrack is in NP, because verifying an alleged match is a polynomial time operation.

So regex-backtrack is NPC.

             +--------------------+
             |                    |
  NP-Hard -> |                    |
             |   ,------------.   |
              \ /              \ /
               X  NP-Complete   X
              / \              / \
             /   `------------'   \
      NP -> |                      |
            |                      | 
            .    +------------+    ,
             \   | Polynomial |   /
              `--+------------+--'
[0] For a technical definition of "efficient"

pradn · 6 years ago
I assume this result is already known in the literature?
klyrs · 6 years ago
This doesn't show a date but archive.org has snapshots dating back to 2001

https://perl.plover.com/NPC/

gbacon · 6 years ago
Proving that a problem is NP-complete requires proving that it is NP-hard and in NP. Reducing SAT to regex matching with backreferences does the former. The latter requires proving that any solution can be checked in polynomial time.

The author is also incorrect in stating However the author incorrectly states that only 3SAT problems are solvable. Proof that 3SAT is NP-hard does not exclude broader SAT.

dtech · 6 years ago
While not a formal proof, it is fairly obvious that verifying a match is in P. Just fill in the captured groups from the answer in the regex and see if it corresponds to the input text.

Every SAT problem can be converted into a 3-SAT problem so that's also not really an issue, they are both NPC

awirth · 6 years ago
This reduction is really cool. I love reductions like this.

Is there a general consensus to use "regular expression" to refer to the actual regular ones and "regex" to refer to the non-regular variants?

zokier · 6 years ago
I think Raku (neé Perl 6) has been spearheading that distinction

https://docs.raku.org/language/regexes (see the intro paragraph)

robinhouston · 6 years ago
This distinction was introduced by Jeffrey Friedl’s book _Mastering Regular Expressions_ (2006), and it seems to be fairly commonly used now.
awirth · 6 years ago
Too late to edit, but out of curiosity I re-implemented this to use the PCRE JIT (in PHP) to see what kind of speedup it would provide: https://gist.github.com/allanlw/69df509519335b88db886d48503a...

Timings for fred.cnf on my machine:

python: 0m53.744s

PHP (no PCRE JIT): (hits backtrack limit in 1m15.994s)

PHP (PCRE JIT): 0m20.109s

chubot · 6 years ago
I wouldn't say so, but I use the term "regular language" if I mean the mathematical concept.
robinhouston · 6 years ago
I don’t think it’s pedantic to say that a regular language is not the same thing as a regular expression. The difference between syntax and semantics is real and important.
gnulinux · 6 years ago
But a "regular language" is not the same as "regular expression" as mathematical concepts.
praptak · 6 years ago
I don't think so. Usually you can tell from the context:

# math and/or computer science texts? It's the regular ones.

# pretty much elsewhere? It's the extended ones.

# threads about parsing HTML with regular expressions? People using both and insisting their version is the only correct one.

gbacon · 6 years ago
Yes, for at least ten years and likely much longer.

https://cstheory.stackexchange.com/q/448/362

awirth · 6 years ago
Interesting, I wasn't aware of the history.

Personally I make the distinction, but I've noticed many many people do not, hence the question.

stevefan1999 · 6 years ago
Isn't it PCRE that we started to call it regex?
sacado2 · 6 years ago
One of the cool features of SAT problems is that they always terminate (if you're patient enough). Aren't regex, especially with backreferences, Turing-complete though? If so, they could be caught in an infinite loop, meaning they are more general than the SAT problem.
dmichulke · 6 years ago
Programming languages are more general than the problems they solve. (= feature, not bug)

Still, yes, you can mess up your "add 1 to the input" program and make it run infinitely.

sacado2 · 6 years ago
Yeah, I meant it the other way, if those regexps are Turing-complete, not all of them have an equivalent CNF representation, contrarily to what the article seems to state in its first paragraph (and title).

That being said, regexps were not initially meant to be "programming languages", so I'm not sure about the "feature, not bug" part. I'd rather have a notation that would let me solve, for instance, the "HTML tag matching" problem and would be guaranteed to always terminate, than one that also lets me implement Conway's game of life.

klyrs · 6 years ago
That "popcnt1" is also known as a 1-hot constraint.

https://en.m.wikipedia.org/wiki/One-hot

punnerud · 6 years ago
time python3 solver.py fred.cnf

Took 9min and 10seconds on RPi 3 running Ubuntu 20.04. Consuming 100% CPU and 1% RAM (1024MB).

nurettin · 6 years ago
This is actually amazing. My python programs rarely run at 100% cpu, whereas C++ binaries are usually up there. Always thought python's inefficiency causes the drop in cpu utilization.
nromiun · 6 years ago
I don't know about how efficient it is but I have always been able to peg all cores with the multiprocessing module. Even something useless like "x * x" is more then enough for 800%.
ygra · 6 years ago
Python's regex implementation is probably not written in Python, so while it's trying to match, no Python code runs; it's all /C(++)?/.