yen223 · 5 months ago
Parser combinators are one of those ideas that are really powerful in practice, but will never be mainstream because they had the bad fortune of coming from the functional programming world. It's a shame too, because parser combinators are what made parsers click for me.

Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.
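To make that concrete, here is a minimal sketch in Haskell (the names `satisfy`, `andThen`, `orElse`, and `oneOrMany` are illustrative, not any particular library's API):

    -- A parser takes input and either fails or returns a result plus
    -- the remaining input.
    newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

    -- Match a single character satisfying a predicate.
    satisfy :: (Char -> Bool) -> Parser Char
    satisfy p = Parser $ \s -> case s of
      (c:rest) | p c -> Just (c, rest)
      _              -> Nothing

    -- "and": run two parsers in sequence, pairing their results.
    andThen :: Parser a -> Parser b -> Parser (a, b)
    andThen pa pb = Parser $ \s -> do
      (a, rest)  <- runParser pa s
      (b, rest') <- runParser pb rest
      Just ((a, b), rest')

    -- "or": try the first parser, fall back to the second on failure.
    orElse :: Parser a -> Parser a -> Parser a
    orElse (Parser p) (Parser q) = Parser $ \s ->
      case p s of
        Nothing -> q s
        ok      -> ok

    -- "one_or_many": apply a parser repeatedly, collecting results.
    oneOrMany :: Parser a -> Parser [a]
    oneOrMany p = Parser $ \s -> do
      (x, rest) <- runParser p s
      case runParser (oneOrMany p) rest of
        Nothing          -> Just ([x], rest)
        Just (xs, rest') -> Just (x:xs, rest')

Everything fancier (sepBy, between, choice, ...) is built out of pieces like these.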

egonschiele · 5 months ago
Hey! I love parser combinators and wish they were more mainstream. That's why I wrote Tarsec, which is Parsec for TypeScript: https://github.com/egonSchiele/tarsec

I have a short and a long intro, both illustrated: https://github.com/egonSchiele/tarsec/blob/main/tutorials/5-... and https://github.com/egonSchiele/tarsec/blob/main/tutorials/th...

It also has a neat debug mode, and a way to benchmark performance. I've loved using it, hopefully someone reading will use it too! Use this as a chance to learn something cool (parser combinators)!

furyofantares · 5 months ago
These are great.

I accidentally invented parser combinators for a project once. I was quite fond of the approach but it ended up being pretty nearly unusable for anyone else on the team. This was problematic because the project was such that we were always adding new things to be parsed. The whole reason I took the approach was that I wanted people to be able to freely add parsers as new requirements came in.

It being my first shot at it, it was for sure not as usable as it could be. But looking at your stuff, which is delightful, I am filled with doubt that my goal was even possible. Your stuff looks about as clean as possible, and yet seems likely to be almost as impenetrable to someone not versed in parsing.

That said, I am gonna look for a chance to use this in appropriate side projects.

yen223 · 5 months ago
Love the intro articles you wrote!
mrkeen · 5 months ago
> You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.

I do this every now and then. Usually in the days leading up to advent of code.

"A parser is a function from an input to either failure or a (remaining input, result) pair" is all you need to remember to build the rest.

idahoduncan · 5 months ago
I do this from time to time as well, although I tend to get hung up on error handling. I would say that it's a simple enough exercise if you don't care too much about reporting parse errors in a meaningful way, although I'd be happy for someone to point out an implementation of error handling that fits into one's head.
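One scheme that does seem to fit in one's head (a sketch, assuming you thread an input offset through the parser; not any particular library's design): make failure carry the offset it happened at plus what was expected, and have the choice combinator keep whichever failure got furthest.

    type ParseError = (Int, String)   -- (offset where it failed, what was expected)

    newtype Parser a =
      Parser { runParser :: Int -> String -> Either ParseError (Int, String, a) }

    -- "or" that, when both sides fail, reports whichever failure got furthest.
    orElse :: Parser a -> Parser a -> Parser a
    orElse (Parser p) (Parser q) = Parser $ \pos s ->
      case p pos s of
        Right ok          -> Right ok
        Left e1@(pos1, _) -> case q pos s of
          Right ok          -> Right ok
          Left e2@(pos2, _) -> Left (if pos1 >= pos2 then e1 else e2)

Reporting "expected <thing> at offset <n>" from the deepest failure gets you surprisingly far.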
Tabular-Iceberg · 5 months ago
I’ve seen it in practice. I decided to use a Java parser combinator library for one problem. As soon as I was off the project they ripped out my code and replaced it with regexes.

The fact that what was to be parsed wasn't regular apparently did not matter.

zahlman · 5 months ago
> break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

The popular `pyparsing` library for Python (https://pypi.org/project/pyparsing/) expresses these concepts, just written more in traditional OO style rather than functional style (i.e. there are classes like OneOrMore that you instantiate, rather than binding a single-token-matcher argument to a lambda or whatever).

fooker · 5 months ago
Parser combinators are already mainstream. All/most real languages are implemented with hand-written recursive descent parsers that closely resemble parser combinators, or at the very least borrow a bunch of concepts.

It's the name that's unlikely to become mainstream. That's squarely upon FP folks giving sci-fi names to common sense concepts.

antonvs · 5 months ago
Presumably you're not objecting to the word "parser".

What name would you use for "combinator"? It's a pretty simple word, based on the root "combine".

The common sense concept here is functions whose results depend only on their arguments, which makes them well-suited for being used in "combination" with other combinators, as in `f . g` or `f(g(x))`. That's called function composition, which can be defined as the act of combining functions.

It's a term that has existed in logic for over a century, from before computers existed.

It seems to me that no amount of renaming of terms will fix people's lack of curiosity or lack of desire to learn or understand. That's on them.

foldr · 5 months ago
As a sibling points out, recursive descent parsers have been mainstream for a long time. Higher order parser combinators are only moderately useful in an imperative language with regular looping constructs. (For example, the code to parse a list of Xs separated by Ys is a pretty straightforward loop.) A lot of parser combinator libraries also make the mistake of encouraging "short circuit" error handling, which is rarely what you want for a production-grade parser.
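For comparison, the combinator version of "a list of Xs separated by Ys" is typically a one-liner over the standard Applicative/Alternative interface (a sketch, not any specific library's definition):

    import Control.Applicative (Alternative, many)

    -- One or more `x`s separated by `sep`s, keeping only the `x` results.
    sepBy1 :: Alternative f => f a -> f sep -> f [a]
    sepBy1 x sep = (:) <$> x <*> many (sep *> x)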
bazoom42 · 5 months ago
> will never be mainstream because they had the bad fortune of coming from the functional programming world

Nah, mainstream developers don't care where an idea comes from as long as it solves the problem and helps them hit the deadline. E.g. LINQ in C# has been broadly adopted.

Regexes are disliked by many developers, so parser combinators should be an easy sell if you can show it solves realistic problems with simpler code.

skybrian · 5 months ago
Never is a long time. Other ideas from functional language have become mainstream.
yen223 · 5 months ago
I would be very happy to be proven wrong here.

But I view parser combinators in the same lens as lenses, another powerful approach to solving a specific problem that is unlikely to receive mainstream recognition because of all the Haskell.

worldsayshi · 5 months ago
I remember logstash having a quite intuitive parser language, called grok I think, where recursive descent was built on top of regex. We should use something like that!
thaumasiotes · 5 months ago
> Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

That's a terrible example for "parser combinators beat regexes"; those are the three regular operations. (Except that you use zero_or_many in preference to one_or_many.)

mrkeen · 5 months ago
Can I see the (implied) counter-example?

A regex which turns a language's source code into an AST?

meinersbur · 5 months ago
Flang (LLVM Fortran Compiler) uses parser combinators.

https://github.com/llvm/llvm-project/blob/main/flang/lib/Par...

wruza · 5 months ago
It's never about the origin but about ease of install/use/mentoring and compatibility across environments.
conaclos · 5 months ago
I once tried to use parser combinators and I was quite disappointed by the result. I find them harder to read and to handle errors in than regular recursive descent parsers.
fooker · 5 months ago
The trick is to set up your grammar or combinators with error productions for all the common errors you can think of.

That way you get nice, targeted error messages.
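A sketch of what an error production can look like with Parsec-style combinators (the `assignOp` rule and its message are made up for illustration):

    import Text.Parsec (parserFail, string, (<|>))
    import Text.Parsec.String (Parser)

    -- Hypothetical rule: alongside the real operator, match the common
    -- mistake and fail with a targeted message instead of a generic error.
    assignOp :: Parser ()
    assignOp =  (() <$ string ":=")
            <|> (string "=" *> parserFail "'=' is comparison; did you mean ':='?")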

austin-cheney · 5 months ago
There are numerous posts, comments, articles and so forth about when to use regex versus using a parser. The general rule is this:

If you need to search a string for something, such as needle(s) in a haystack, regex is probably a better fit than a parser. If you need anything more intelligent than a list of search results, you probably want a full formal parser.

Most languages allow a form of nested regex that allows for increased search precision: a method that applies a regex passes its matching string result to another function that applies a regex of its own, which is why regex is probably enough when the task is primitive. There is a tremendously higher cost to using a full parser, considering the lexer and tokenizer plus rules, but it's so much more intelligent that it's not even comparable.

giraffe_lady · 5 months ago
The key thing for me in making this decision is of course predicting the future.

Parsers are more work up front but in my experience much easier to debug, maintain and extend. So if I predict something will be around for a while I'll want to do a parser and if it's just a one-off I usually go with regex.

Of course this predictably leads to me having a collection of parsers that I've never needed to touch again and a handful of monstrously complex regex grammars that I'll rewrite into a parser "the next time" I need to modify them.

I still think it's the right thing to base the decision on I just need to keep improving my prophesying.

kleiba · 5 months ago
Of course you could also pull out the good old Chomsky hierarchy and make an argument against regexes based on whatever the nature of your data is.

But the thing is: the beauty of regexes lies in their compactness (which, in turn, can make them quite difficult to debug). So, of course, if you want to optimize for some other measure, you'd use an alternative approach (e.g. a parser). And there are a number of worthwhile measures, such as e.g. the already mentioned debuggability, appropriateness for the problem at hand in terms of complexity, processing speed, ease of using the match result, the ability to get multiple alternative matches, support of regexes in your language of choice, etc.

But simply stating "X beats regexes" without saying in what respect leaves something to be desired.

kazinator · 5 months ago
> It uses the pcre-heavy Haskell library which in turn calls out to the system-wide pcre C library for actually compiling and running the regex.

That's not true regular expressions, but a backtracker with eclectic features.

The regex used in the regex solution is just:

  mul\((\d+),(\d+)\)
not requiring PCRE.

If Haskell happens to provide bindings to the POSIX regcomp/regexec stuff, that would be something to try instead. \d has to be replaced with [0-9], needless to say.

arn3n · 5 months ago
“In other languages, it would be considered overkill to write a full parser when a simple regex can do the same thing. In Haskell, writing a parser is no big deal. We just do it and move on with our lives.”

I see a long code file filled with comments and long paragraph-level explanations. I think I’d rather just learn and use regex.

layer8 · 5 months ago
Whenever I write a regex, I end up with comments roughly ten times longer than the regex. That being said, regular expressions are often the right tool for the job (i.e. parsing a regular language, as opposed to a context-free language or whatever), just the syntax becomes unreadable rather quickly. I’m sure you could build a nicer regular-expression syntax in Haskell.
CBLT · 5 months ago
I love the verbose flag[0] to regex, so I can write comments inline.

[0] https://docs.python.org/3/library/re.html#re.VERBOSE

f1shy · 5 months ago
Yes. Regexes tend to become write-only rather fast. One solution is commenting, but that's still complex. What I like to do now (in C) is define parts of the regex separately. Just a crude example to get the idea:

   #include <regex.h>
   
   // STRing: matches anything inside quotes (single or double)
   #define STR "[\"'](.*)[\"']"
   // NUMber: matches decimal or hexadecimal numbers
   #define NUM "([[:digit:]]x?[[:xdigit:]]*)"
   
   regex_t reg_exp;
   regcomp(&reg_exp, STR NUM, REG_EXTENDED | REG_ICASE);
So at the end I compose the RE with the various parts, which are documented separately.

zokier · 5 months ago
> just the syntax becomes unreadable rather quickly. I’m sure you could build a nicer regular-expression syntax in Haskell.

Of course regular expressions are really more of a category of expressions, and the traditional Kleene star notation is only one of many options; regular expressions do not somehow inherently need to use that specific syntax.

Pomsky and VerbalExpressions are just some examples of alternative syntaxes for regex. Apparently there is even a port of VerbalExpressions for Haskell:

https://github.com/VerbalExpressions/HaskellVerbalExpression...

bazoom42 · 5 months ago
Sounds like you think the comments and explanations are the problem? You can write regexes with comments and parsers without. Regexes are not generally known to be self explanatory, except for trivial cases like \d+
OneDeuxTriSeiGo · 5 months ago
I mean that's the nature of article code no? You are writing for a generic audience and want to be very explicit as to what your code is doing to teach the audience.

For your average haskell programmer you could drop all of those comments since the code isn't really doing anything that couldn't be determined by just reading it.

codebje · 5 months ago
The main advantage of recursive descent parsing is readability. Forget the machinery of the parser, which is (a) trivial enough that AI will generate it correctly for you with next to no prompting, and (b) widely available in libraries anyway. The win is writing a parser that reads like the grammar it implements, which means changes to your grammar are easy to apply to your parser.

Regexes are effectively write-only. If you change the grammar parsed by a regex, you're probably going to discard the regex and make a new one.

o11c · 5 months ago
Note that most implementations of both parser combinators and regexes can fail very badly (exponential time). Never use either on untrusted input, unless you can prove your implementation lets you stay linear.
internet_points · 5 months ago
This is one of the reasons I've been afraid to use parser combinators too heavily. With regular (finite-state) languages I know their time usage, with parser combinators I have no clue, and there are so many different libraries and implementations and some are monadic and some are applicative and few mention worst-case. There are benchmarks https://gitlab.com/FinnBender/haskell-parsing-benchmarks#byt... but what kinds of "dangerous operations" do I have to watch out for when writing with parser combinators?

(I guess part of the problem is just that regular languages have been studied since Kleene had a full head of hair, while parser combinators were more or less unknown until the 80's.)

wyager · 5 months ago
The evaluation model of most of these parser combinator libraries is simple enough that you can just think directly about the call tree. It's difficult to get "surprised" like you can with PCREs. E.g. (x+x+)+y as a regex may blow up, but "many1 (many1 x >> many1 x) >> y" will just evaluate forwards a single time, greedily, and fail, at least with any haskell parser combinator library I can think of.
mrkeen · 5 months ago
> I've been afraid to use parser combinators

> With regular (finite-state) languages I know their time usage

Are you talking about parsers or grammars?

giraffe_lady · 5 months ago
PEGs don't have this problem right? Linear time over length of input? Though I guess most peg libraries probably aren't "pure" pegs to get around its limitations and may reintroduce this issue?
o11c · 5 months ago
No, PEG is prone to exactly this in its default form. Parser combinator usually means something PEG-ish anyway.

If you memoize (packrat), it improves to polynomial time (not sure what, but it's not linear; that's false advertising and a fraudulent claim), but you're still stuck with the vulnerability to bugs in the grammar.

A better idea is to create a parser-combinator-style API on top of an LL or LR backend, for guaranteed linear time and only enough stack space for the deepest part of the AST.

yen223 · 5 months ago
PEGs don't have this problem only because they place restrictions on the grammar.

In practice, this isn't a problem, but it does require you to write grammar rules in a specific way (e.g. no left recursion).

vrighter · 5 months ago
Only if a packrat parser is implemented, which requires a lot of memory.

thaumasiotes · 5 months ago
Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

They can take exponential space, though, so I'm not sure why knowing you'll be able to process the data in linear time is supposed to keep you safe.

moregrist · 5 months ago
It’s not only PCRE. Any dialect with back references will require a backtracking engine that can go exponential with the wrong expression and input.

This includes most flavors of regexp you find in the wild: Python’s re module, JavaScript regular expressions, Ruby’s regular expressions, Perl, PCRE, and even basic and extended REs used in grep.

Russ Cox has written some very accessible posts on the linear (in input) properties of NFAs and the equivalence of Thompson regular expressions [0]. There’s also quite a bit of literature on the Glushkov construction of regular expressions (and its NFA equivalence) [1] that’s worth reading if you find the topic interesting.

Both Go and Rust have non-backtracking regular expression libraries, and you can find solid non-backtracking C and C++ libraries for regular expressions (eg: libfsm and Hyperscan).

0: https://swtch.com/~rsc/regexp/
1: My favorite is _Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences_ by Gonzalo Navarro

pjscott · 5 months ago
This depends on the implementation of the regex engine; most are potentially superlinear in time, since that’s the easiest way of doing it, and quite fast until suddenly it isn’t. I always check the algorithm being used before I use regular expressions in production. I was surprised how many use a recursive descent strategy!

(Also, BTW, you can deal with the exponential space issue by using an NFA instead of a DFA – it’s slower that way, but the memory space required is reliably linearly proportional to the size of the regex.)

dullcrisp · 5 months ago
I don’t believe space > time is possible on a Turing machine?
masklinn · 5 months ago
> Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

Any re dialect which supports backtracking necessarily has a non-linear worst case, and while a select few have very high resilience against exponential backtracking (e.g. never managed to make postgres fall over) most can be made to fail with a pattern a few characters long.

FA-based engines are getting more popular, but they’re far from universal.

pklausler · 5 months ago
Parser combinators work well for big messy problems, not just the small ones you'd reach for regexes to solve; I used a custom library of combinators to write the Fortran parser for LLVM. They’re great for error recovery and for disambiguating tricky situations, of which Fortran has more than its share; I ended up parsing cooked characters without a tokenization phase.

One hard part is getting good performance when the same production is being tried as part of multiple higher level alternatives and backtracking is taking place. You can get into trouble for bad input cases, like expressions with thousands of operands being used in a context that isn’t clear until you get to characters after the expression. But there’s ways to solve that problem, so I recommend combinators to you if you need to write a parser.

pklausler · 5 months ago
(adding to the above)

Another downside of parser combinators, at least the ones that I wrote as a basis for Fortran parsing, is that if you implement them as C++ template classes, the types get astonishingly long in their names. This leads to long compilation times and makes debuggers very hard to use. Both of these concerns only affect the developer, of course, and seemed to me to be acceptable so long as the parser is fast and robust.

I'd use parser combinators again if I ever need to write a big parser, although I might not use overloaded operators (">>", "/") and confusing function names ("many" vs. "some") inherited from Haskell's Parsec library.

maxbond · 5 months ago
I've been working a lot with parser combinators lately, they are a really nice compromise between using grammar languages (I'd put regexes in this category, as well as parser generators) and hand-rolled imperative parsers. The parser combinator library ends up being a DSL for describing grammars in your preferred language, but the boundary is fluid, and you can move back and forth between writing grammars and writing code in your preferred language.

The bold implementer could integrate regexes into their parser combinator library, and you could move between all three.

henning · 5 months ago
You can also use simple string search functions for this Advent of Code problem.
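For the mul(x,y) puzzle, the same idea looks roughly like this (a hypothetical sketch in Haskell to stay consistent with the rest of the thread; `mulsByHand` is a made-up name):

    import Data.Char (isDigit)
    import Data.List (isPrefixOf)

    -- Walk the input by hand: at each position, try to read "mul(", a number,
    -- a comma, a number, and a closing paren; otherwise skip one character.
    mulsByHand :: String -> [(Int, Int)]
    mulsByHand [] = []
    mulsByHand s@(_:rest)
      | "mul(" `isPrefixOf` s
      , (a, ',':s1) <- span isDigit (drop 4 s)
      , (b, ')':_)  <- span isDigit s1
      , not (null a), not (null b)
      = (read a, read b) : mulsByHand rest
      | otherwise = mulsByHand rest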

I used this approach for implementing orthography rules in a stenography application (implementing rules like `artistic + ly = artistically`) and in the benchmarks I did, a few lines of searching for character indexes was an order of magnitude faster than regexes. Each of my functions is about 3-10 ish lines of simple code compared to a 1-line regex.

You do have to handle cases like the character not being found, but I've had zero issues with the code and, in terms of what actually executes, it is vastly simpler. This is why the saying goes that if you make something twice as fast, you may have done something clever, but if you make it 10x faster, you stopped doing something dumb.

kqr · 5 months ago
Basic string search and indexing operations (especially when there is language/library support for cheap spans) are definitely underappreciated.

I wouldn't reach for them as a first step though, because they'd take more automated tests to give me confidence of correctness, and most problems I end up solving are not CPU bound but programmer time bound.