JonChesterfield · a month ago
Pet subject of the week here.

Big choices are hand-rolled recursive descent vs LALR, probably backed by the bison or lemon generator and re2c for a lexer.

Passing the lalr(1) check, i.e. having bison actually accept the grammar without complaining about ambiguities, is either very annoying or requires thinking clearly about your language, depending on your perspective.

I claim that a lot of the misfires in language implementations come from not doing that work, and instead using a hand-rolled approximation to the parser you had in mind, because that's nicer/easier than the formal grammar.

The parser generators emit useless error messages, yes. So if you want nice user feedback, that'll be handrolled in some fashion. Sure.

Sometimes people write a grammar and use a hand rolled parser, hoping they match. Maybe with tests.

The right answer, used by no one as far as I can tell, is to parse with the LALR-generated parser, then if that rejects your string because the program was ill-formed, call the hand-rolled one for guesswork/diagnostics. Never feed the parse tree from the hand-rolled parser into the rest of the compiler, that way lies all the bugs.

As alternative phrasing, your linter and your parser don't need to be the same tool, even if it's convenient in some senses to mash them together.
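
Roughly the shape I mean, sketched in OCaml; `Gen`, `Lexer` and `Diagnostics` here are stand-ins for a generated parser, its lexer, and the hand-rolled fallback, and the exception name is whatever your generator raises:

```ocaml
(* Sketch only: Gen.program is the LALR entry point emitted by the
   generator, Diagnostics.explain is the hand-rolled parser used purely
   for error reporting. Only the generated parser's tree reaches the
   rest of the compiler. *)
let parse_for_compiler source =
  let lexbuf = Lexing.from_string source in
  match Gen.program Lexer.token lexbuf with
  | ast -> Ok ast
  | exception Gen.Error ->
      (* ill-formed input: ask the hand-rolled parser for a message,
         then throw away anything else it produced *)
      Error (Diagnostics.explain source)
```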

mrkeen · a month ago
> parse with the lalr generated parser, then if that rejects your string because the program was ill formed, call the hand rolled one for guesswork/diagnostics

This feels like a recipe for disaster. If the hand-rolled parser won't match a formal grammar, why would it match the generated parser?

The poor programmer will be debugging the wrong thing.

It reminds me of my short stint writing C++ where I'd read undefined memory in release mode, but when I ran it under debug mode it just worked.

senkora · a month ago
> It reminds me of my short stint writing C++ where I'd read undefined memory in release mode, but when I ran it under debug mode it just worked.

I assume it’s far too late at this point, but that almost always means that you’re invoking UB. Your next step should be enabling UBSan.

JonChesterfield · a month ago
The generated parser will match the grammar.

The hand rolled parser might do, but also might not, what with software being difficult and testing being boring and so forth.

8n4vidtmkvmk · a month ago
There's risk, but it seems like you could run both parsers against the same unit tests to help mitigate.
jasperry · a month ago
> If I was routinely working in a language that had a well respected de facto standard parser generator and lexer, and regularly building parsers for little languages for my programs, it would probably be worth mastering these tools.

In OCaml, a language highly suited for developing languages in, that de facto standard is the Menhir LR parser generator. It's a modern Yacc with many convenient features, including combinator-like library functions. I honestly enjoy the work of mastering Menhir, poring over the manual, which is all one page: https://gallium.inria.fr/~fpottier/menhir/manual.html

debugnik · a month ago
I gave up on Menhir after I understood how allocation-heavy it is during the hot path, at least in the incremental API which is needed for proper errors; and how much of a giant hack you need to force extra lookahead, which shouldn't be such a big deal for parser generators.

These days I just handroll recursive descent parsers with a mutable stream record, `raise_notrace` and maybe some combinators inspired by FParsec for choices, repetition and error messages. I know it's not as rigorous, but at least it's regular code without unexpected limitations.
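
Roughly this shape, as a minimal sketch (all names made up):

```ocaml
(* Mutable cursor over a token array; raise_notrace keeps failure cheap. *)
type token = Ident of string | Equals | Int of int | Eof

type stream = { toks : token array; mutable pos : int }

exception Syntax_error of string

let fail st msg =
  raise_notrace (Syntax_error (Printf.sprintf "token %d: %s" st.pos msg))

let peek st = st.toks.(st.pos)
let advance st = st.pos <- st.pos + 1

let expect st t =
  if peek st = t then advance st else fail st "unexpected token"

(* binding ::= IDENT '=' INT *)
let binding st =
  match peek st with
  | Ident name ->
      advance st;
      expect st Equals;
      (match peek st with
       | Int v -> advance st; (name, v)
       | _ -> fail st "expected an integer")
  | _ -> fail st "expected an identifier"
```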

jasperry · a month ago
Could be, I'm not that far along yet. I've only just peeked into the incremental API. I'm still using the error token to try to improve my messages. It's just for syntax errors anyway, right?
fuzztester · a month ago
>In OCaml, a language highly suited for developing languages in,

What makes OCaml suited for that?

mjburgess · a month ago
algebraic datatypes (tagged unions + pattern matching); compiled, garbage collected (you dont really need memory management for a compiler), statically typed with inference
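
As a toy illustration of why that combination is comfortable (an invented three-constructor AST):

```ocaml
(* The AST is a tagged union; the compiler checks the match is exhaustive. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

let rec eval = function
  | Num n -> n
  | Add (a, b) -> eval a + eval b
  | Mul (a, b) -> eval a * eval b

let () = print_int (eval (Add (Num 1, Mul (Num 2, Num 3))))  (* prints 7 *)
```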
greggyb · a month ago
ML, the language heritage from which OCaml derives, was explicitly designed with interpreters and compilers in mind.
nicoburns · a month ago
I wonder who it is that likes other kinds of parser. Over the last ~10 years or so I've read several articles on HN arguing that recursive descent parsers are in fact great. And they seem to be both the easiest to get started with and what almost all production-grade systems use. I've seen very little in the way of anything arguing for any other approaches.
o11c · a month ago
Recursive descent is fine if you trust that you won't write buggy code. If you implement a generator for it (easy enough), this may be a justifiable thing to trust (though this is not a given). I am assuming you're willing to put up with the insanity of grammar rewriting, one way or another.

LR however is more powerful, though this mostly matters if you don't have access to automatic grammar rewriting for your LL. More significantly, however, there's probably more good tooling for LR (or perhaps: you can assume that if tooling exists, it is good at what it is designed for); one problem with LL being so "simple" is that there's a lot of bad tooling out there.

The important things are 1. that you meaningfully eliminate ambiguities (which is easy to enforce for LR and doable for LL if your tooling is good), and 2. that you keep linear time complexity. Any parser other than LL/LR should be rejected because it fails at least one of these, and often both.

Within the LL and LR families there are actually quite a few members. SLR(1) is strong enough to be interesting but too weak for anything I would call a "language". LALR(1) is probably fine; I have never encountered a useful language that must resort to LR(1) (though note that modern tooling can do an optimistic fallback, avoiding the massive blowups of ancient LR tools). SLL(1) I'm not personally familiar with. X(k), where X is one of {SLL, LL, SLR, LALR, LR} and where k > 1, are not very useful; k=1 suffices. LL(*) however should be avoided due to backtracking, but in some cases consider whether you can parse token trees first (this is currently poorly represented in the literature; you want to be doing some form of this for error recovery anyway - automated error recovery is a useless lie) and/or defer the partial ambiguity until the AST is built (often better for error messages anyway, independent of using token trees).

kerkeslager · a month ago
> Recursive descent is fine if you trust that you won't write buggy code. If you implement a generator for it (easy enough), this may be a justifiable thing to trust (though this is not a given).

The idea that you're going to hand-roll a parser generator and then use that to generate a parser and the result is going to be less buggy than just hand-rolling a recursive descent parser, screams "I've never written code outside of an academic context".

rstuart4133 · 21 days ago
> LALR(1) is probably fine; I have never encountered a useful language that must resort to LR(1)

An LR(1) parser can have many more states in its DFA than LALR(1). That was important back in the 1970s when I was fighting for every byte of RAM, but now it's a total non-issue. I don't know why you would bother with LALR(1) now if you had an LR(1) parser generator.

nrds · a month ago
> ambiguities

It's important to note that ambiguities are something which exist in service of parser generators and the restricted formal grammars that drive them. They do not actually exist in the language to be parsed (unless that language is not well-specified, but then all bets are off and it is meaningless to speak of parsing), because they can be eliminated by side-conditions.

For example, one famous ambiguity is the dangling 'else' problem in C. But this isn't an actual ambiguity in the C language: the language has a side-condition which says that 'else' matches to the closest unmatched 'if'. This is completely unambiguous and so a recursive descent parser for C simply doesn't encounter this problem. It is only because parser generators, at least in their most academic form, lack a way to specify this side-condition that their proponents have to come up with a whole theory of "ambiguities". (Shockingly, Wikipedia gets this exactly right in the article on dangling else which I just thought to look up: "The dangling else is a problem in programming of parser generators".)
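
A hedged sketch of that side-condition in OCaml (token and AST names invented, conditions omitted to keep it short): after parsing the then-branch the code immediately checks for 'else', so an 'else' always attaches to the innermost unmatched 'if'.

```ocaml
type tok = If | Else | Print
type stmt = SIf of stmt * stmt option | SPrint

(* stmt ::= 'if' stmt ('else' stmt)? | 'print' *)
let rec stmt = function
  | If :: rest ->
      let then_s, rest = stmt rest in
      (match rest with
       | Else :: rest' ->
           let else_s, rest'' = stmt rest' in
           SIf (then_s, Some else_s), rest''   (* nearest 'if' claims the 'else' *)
       | _ -> SIf (then_s, None), rest)
  | Print :: rest -> SPrint, rest
  | _ -> failwith "expected a statement"

(* stmt [If; If; Print; Else; Print]
   = SIf (SIf (SPrint, Some SPrint), None), []
   -- the 'else' went to the inner 'if' without any special handling *)
```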

Likewise goes the problem of left-recursion. Opponents of recursive descent always present left-recursion as a gotcha which requires some special handling. Meanwhile actual programmers writing actual recursive descent parsers don't have any idea what these academics are talking about because the language that they're parsing (as it exists in their mind) doesn't feature left-recursion, but instead iteration. Left-recursion is only introduced in service of restricted formal grammars in which recursion is the only available primitive and iteration either doesn't exist or is syntactic sugar for recursion. For the recursive descent user, iteration is a perfectly acceptable primitive. The reason for the discrepancy goes back to side-conditions: iteration requires a side-condition stating how to build the parse tree; parser generators call this "resolving the ambiguity" because they can't express this in their restricted grammar, not because the language was ambiguous.
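
For what it's worth, the textbook left-recursive rule `expr ::= expr '-' term | term` really does just become a loop in the hand-written version (here written as OCaml's tail-recursive idiom for a loop; names are illustrative), and the loop builds exactly the left-associative tree the grammar intended:

```ocaml
type tok = Num of int | Minus
type ast = Lit of int | Sub of ast * ast

let term = function
  | Num n :: rest -> Lit n, rest
  | _ -> failwith "expected a number"

(* expr ::= term ('-' term)*  -- iteration standing in for left recursion *)
let expr toks =
  let rec loop left rest =
    match rest with
    | Minus :: more ->
        let right, more' = term more in
        loop (Sub (left, right)) more'   (* fold into the left operand *)
    | _ -> left, rest
  in
  let left, rest = term toks in
  loop left rest

(* expr [Num 1; Minus; Num 2; Minus; Num 3]
   = Sub (Sub (Lit 1, Lit 2), Lit 3), []   -- i.e. (1 - 2) - 3 *)
```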

jasperry · a month ago
But remember that the articles arguing for recursive descent parsers are arguing against the long-dominant paradigm of using LR parsers. Plenty of us still like LR parser generators (see my other comment.)

In between "easiest to get started with" and "what production-grade systems use", there is "easy to actually finish a medium-sized project with." I think LR parsers still defend that middle ground pretty well.

nicoburns · a month ago
> But remember that the articles arguing for recursive descent parsers are arguing against the long-dominant paradigm of using LR parsers

That was part of my question I think. I wouldn't have been able to tell you that the dominant paradigm being argued against was LR parsers, because I've never come across even one that I'm aware of (I've heard of them, but that's about it). Perhaps it's academia where they're popular?

userbinator · a month ago
> I wonder who it is that likes other kinds of parser.

It seems to be mainly academics and others interested in parsing theory, and those who like complexity for the sake of complexity.

masklinn · a month ago
Pratt parsers are really fun, if slightly mind-bending. Their ability to handle odd associativities, precedences, and arities is basically unmatched, making them really useful to embed inside recursive descent for when you reach expressions, if you need infix and mixfix operators anyway.
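
For anyone curious, the core is small. A hedged, minimal sketch of the idea (invented token and AST types): each operator carries a binding power, and a right-associative operator recurses with a slightly lower minimum power than a left-associative one.

```ocaml
type tok = TNum of int | TOp of char
type ast = Num of int | Bin of char * ast * ast

let power = function '+' | '-' -> 10 | '*' -> 20 | '^' -> 30 | _ -> 0
let right_assoc = function '^' -> true | _ -> false

(* parse an expression whose operators all bind at least as tightly as min_bp *)
let rec expr min_bp toks =
  let lhs, toks =
    match toks with
    | TNum n :: rest -> Num n, rest
    | _ -> failwith "expected a number"
  in
  climb lhs min_bp toks

and climb lhs min_bp toks =
  match toks with
  | TOp op :: rest when power op >= min_bp ->
      (* left-assoc: the right side must bind strictly tighter;
         right-assoc: equally tight is allowed, so it nests rightwards *)
      let next_bp = if right_assoc op then power op else power op + 1 in
      let rhs, rest = expr next_bp rest in
      climb (Bin (op, lhs, rhs)) min_bp rest
  | _ -> lhs, toks

(* expr 0 [TNum 1; TOp '+'; TNum 2; TOp '*'; TNum 3]
   = Bin ('+', Num 1, Bin ('*', Num 2, Num 3)), [] *)
```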
lenkite · a month ago
The literature on incremental parsing doesn't appear to have much for recursive descent. Everyone appears to use the LR-based Tree-sitter approach.
cxr · a month ago
The post by Laurence Tratt, which this piece is a response to, argues for another approach and is mentioned in the first sentence.
o11c · a month ago
In terms of language-agnosticism, you can use Bison to calculate the tables (the hard part) and dump an xml file, then implement the machine yourself trivially.

I get really annoyed when people still complain about YACC while ignoring the four decades of practical improvement that Bison has given us if you bother to configure it.

randomNumber7 · a month ago
The paper "Top Down Operator Precedence" also called "Pratt's Paper" introduced a very elegant algorithm for recursive descent parsers in 1973.

It is also written in a badass style and argues that this is superior to parser generators.

https://dl.acm.org/doi/pdf/10.1145/512927.512931

pratt4the_win · a month ago
Pratt parsers are elegant. I really like them.

For those to whom they are new: I found them a little tricky to implement directly from Pratt's paper or even Crockford's JavaScript that popularized them.

So, through trial and error I figured out how to actually implement them in regular languages (i.e. not in Lisp).

If it helps, examples in C and Go are here:

https://github.com/glycerine/PrattParserInC

https://github.com/glycerine/zygomys/blob/master/zygo/pratt....

I find them easier to work with than the cryptic LALR(1) bison/yacc tools, but then I never really felt like I mastered yacc to begin with.

ufo · a month ago
A middle ground that I think is sometimes useful is to use an LR parser generator to check if the grammar is ambiguous, but use recursive descent for the actual implementation. Since we won't actually use any code from the LR parser generator, you can pick whatever one you prefer regardless of the programming language.
sirwhinesalot · a month ago
It's trivial to get a recursive descent parser without any ambiguities hidden in it if you don't go the PEG route (which is only unambiguous because you always pick the first choice, which might not be what you want). Just always branch on the current token. No way to have an ambiguity like that.
ufo · a month ago
I disagree. When writing recursive descent by hand, it's easy to miss an ambiguity because of miscomputed FIRST and FOLLOW sets.

In practice most recursive descent parsers use if-else liberally. Thus, they effectively work like PEGs where the first match wins (but without the limited backtracking of PEGs). They are deterministic in the sense that the implementation always returns a predictable result. But they are still ambiguous in the sense that this behavior might not have been planned by the language designer, and the ambiguity may not have been resolved the way the programmer expected.
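
A made-up two-line grammar shows the trap: with `arg ::= IDENT | IDENT ':' IDENT`, both alternatives start with IDENT, and a peek-one-token if/else silently commits to whichever branch is written first. An LR generator would have reported the overlap instead of leaving it to code order.

```ocaml
type tok = Ident of string | Colon | Eof
type arg = Bare of string | Labelled of string * string
type stream = { toks : tok array; mutable pos : int }

let peek st = st.toks.(st.pos)
let advance st = st.pos <- st.pos + 1

(* arg ::= IDENT | IDENT ':' IDENT *)
let arg st =
  match peek st with
  | Ident name ->
      advance st;
      Bare name   (* first branch wins: Labelled is never produced, and
                     nothing in the code itself flags the dead alternative *)
  | _ -> failwith "expected an argument"
```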

thechao · a month ago
I've been having thoughts along these lines. Earley parsers match recursive descent really nicely. In my head there'd be an Earley parser "oracle": you'd tell the oracle about the operations you've performed (in terms of terminal consumption); and, then, you can ask the oracle which recursive descent subfunctions are safe to call (based on the prediction phase).
marssaxman · a month ago
I have never found parser generators to be worth the hassle. Recursive descent with a little Pratt-style precedence climbing is all you need.
derriz · a month ago
Agree completely and I've used a bunch of them and also functional combinator libraries. I'd go further and say the recursive descent and Pratt approach is the way if you want to offer useful error messages and feedback to the user. They're also trivial to debug and test, unlike any generation-based approach.
fuzztester · a month ago
>functional combinator libraries

By that, do you mean parser combinators?