For serious projects, I typically find myself resorting to a hand-written parser (usually recursive descent), because it's the only way I can really get excellent, contextual error reporting and recovery. Parser generators are computer science marvels, but for the most part they're too hard to customize well enough to produce a sensible error for the user, with line numbers, token-stream provenance, and suggested recovery information. I can't imagine the difficulty of trying to get good messages for something as large and pervasive as, say, C.
I think that with enough work, we can adapt and evolve parser generators to mesh well with better error reporting, and give the programmer more options and control, but it’ll take some elbow grease and probably breaking backwards compatibility with the production syntax of these parser generator systems. There’s also a risk, of course, that if there’s too much customization, you sort of lose the value of having a parser generator DSL in the first place.
One of my favorite PLT-related books is Programming Language Pragmatics, which spends a non-trivial section near the beginning on just this issue. Among other things, it's very useful to continue a best-effort parse on the remainder after a parse error, so you can report more syntax errors than merely the first. But accomplishing this is something of a nebulous art.
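To make that concrete, here is a minimal sketch of the usual "panic mode" approach (the token shapes and the toy grammar are invented for illustration, not taken from the book): record the error, skip ahead to a synchronizing token such as ';', and resume, so a single run reports every statement's error rather than just the first.

    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens          # list of (kind, text, line) tuples
            self.pos = 0
            self.errors = []              # collected instead of aborting

        def peek(self):
            return self.tokens[self.pos]

        def expect(self, kind):
            tok = self.peek()
            if tok[0] != kind:
                raise SyntaxError(f"line {tok[2]}: expected {kind}, got {tok[0]}")
            self.pos += 1

        def parse_program(self):
            while self.peek()[0] != "EOF":
                try:
                    self.parse_assignment()        # toy grammar: IDENT '=' NUM ';'
                except SyntaxError as e:
                    self.errors.append(str(e))     # remember the error...
                    self.synchronize()             # ...and keep going

        def synchronize(self):
            # Panic mode: discard tokens until a statement boundary.
            while self.peek()[0] not in ("SEMI", "EOF"):
                self.pos += 1
            if self.peek()[0] == "SEMI":
                self.pos += 1

        def parse_assignment(self):
            self.expect("IDENT")
            self.expect("EQ")
            self.expect("NUM")
            self.expect("SEMI")

    toks = [("IDENT", "x", 1), ("EQ", "=", 1), ("NUM", "1", 1), ("SEMI", ";", 1),
            ("IDENT", "y", 2), ("NUM", "2", 2), ("SEMI", ";", 2),   # missing '='
            ("IDENT", "z", 3), ("EQ", "=", 3), ("SEMI", ";", 3),    # missing NUM
            ("EOF", "", 4)]
    p = Parser(toks)
    p.parse_program()
    print("\n".join(p.errors))   # reports both errors in one pass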
I agree that generators could address this issue; they simply haven't.
Also, I'm particularly fond of PEGs, because they match the intuitiveness of recursive descent parsers with a generator-suitable formalism (though they have their rough edges as well).
In my programming language MethodScript (which has a handwritten parser), I fail immediately during lexing (because an error there is a really bad one), but during compilation I only fail immediately on things that are likely to cause a cascade of errors, such as a missing bracket.
But yeah, it makes the compiler output way easier to work through when you have everything at once.
I'm always shocked to find major projects that use parser generators. A hand-written recursive descent parser is easier to write, read, understand, and modify. It's almost trivial to translate a language specification's grammar productions into a recursive descent parser.
So looking at the list I'm dumbfounded that CPython and Ruby use generators.
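To illustrate how mechanical that translation is, here's a toy sketch (the two-rule grammar is invented, not from any real spec): each terminal becomes an eat(), each alternative an if, each repetition a loop, and each nonterminal a function call.

    # stmt ::= 'print' expr | 'while' expr 'do' stmt
    # expr ::= NUMBER ('+' NUMBER)*

    def parse(tokens):
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat(tok):
            nonlocal pos
            if peek() != tok:
                raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
            pos += 1

        def stmt():
            if peek() == "print":            # stmt ::= 'print' expr
                eat("print")
                return ("print", expr())
            eat("while")                     #       | 'while' expr 'do' stmt
            cond = expr()
            eat("do")
            return ("while", cond, stmt())

        def expr():
            nonlocal pos
            node = int(peek()); pos += 1     # expr ::= NUMBER ...
            while peek() == "+":             #          ('+' NUMBER)*
                eat("+")
                node = ("+", node, int(peek())); pos += 1
            return node

        return stmt()

    print(parse(["while", "1", "+", "2", "do", "print", "3"]))
    # ('while', ('+', 1, 2), ('print', 3))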
It's harder to ensure that the grammar doesn't have unintended ambiguities when using a hand-written parser. That doesn't matter as much when "just" developing another implementation of an existing, fully specced language, but for cases like the examples you cite, no such spec exists.
That issue is the primary reason why postgres continues to use bison. The SQL standard frequently introduces potential parsing ambiguities, and leaves a large portion of the things necessary for a functioning RDBMS unspecified.
I'd really like to know why people choose one parsing algorithm over another.
I implemented an Earley parser because, from what I read on Wikipedia, it seemed to be more advanced:
"Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages."
However, I seldom see languages use Earley parsers. There must be a reason, but I've never seen anybody explain why they chose one algorithm over another.
Earley and similar algorithms give you a parse forest rather than a parse tree. For a programming language you don't want your grammar to be ambiguous: when Earley gives you a bunch of alternative parse trees, how do you disambiguate? For example, with the ambiguous grammar E → E + E | int, the input 1 + 2 + 3 comes back as both (1 + 2) + 3 and 1 + (2 + 3).
Most languages have a simple, unambiguous syntax, so LL or LR is fine. LL or LR is almost certainly faster than Earley, since in general more restricted = faster.
As the above commenter mentioned, most language designers make hand-rolled parsers. Making a hand-rolled LL or LR parser is easier too.
In general, most people think advanced = bad; they want the easiest solution that gets the job done well.
From what I infer from articles like https://jeffreykegler.github.io/personal/timeline_v3 , the original Earley paper had a bug, and wasn't a good fit for 1960s hardware, and had poor performance for some types of grammars.
By 1991, "Most researchers see the Parsing Problem as "solved" -- a closed issue. Earley parsing is almost forgotten, and Leo's discovery is ignored. Two decades will pass before anyone attempts a practical implementation of Leo 1991."
It took Aycock and Horspool's work in 2002 and Kegler's 2010 work on Marpa to get a "practical implementation" (quoting that link).
(I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.)
jq uses bison, and it takes a startling amount of work to diagnose parsing problems so as to give helpful error messages. Last time I checked, the issues were full of new users stopped by something an appropriate error message would resolve.
I took the compilers class at Stanford and never really understood the algorithms of bottom up parsing, or even really how grammars worked. I just made the tool work.
I then went to work at a private company, and an older guy who had gone to a state school that taught recursive descent (his was the last class to teach it) taught me how to do it. In a month or so I had learned more about how grammars actually work, what ambiguity is, and so forth, than in my whole class at Stanford.
I now teach compilers at a university, and I teach recursive descent.
I recommend you also teach "precedence climbing", which is basically a slight refactoring of RD to parameterise the recursive functions by level, so that grammars with many levels don't require proportionally deep call stacks; it also massively reduces the duplication of code at each level:
https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing
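For anyone who hasn't seen it, a rough sketch of the idea (operator table invented; simplified to left-associative binary operators over plain numbers; see the link above for the full treatment): one recursive function, parameterised by a minimum precedence, replaces a whole chain of per-level functions.

    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}   # made-up table; all left-assoc

    def parse_expr(tokens, pos=0, min_prec=1):
        lhs, pos = int(tokens[pos]), pos + 1       # atom (numbers only here)
        while (pos < len(tokens) and tokens[pos] in PREC
               and PREC[tokens[pos]] >= min_prec):
            op = tokens[pos]
            # parse the right-hand side one level tighter (left associativity)
            rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs, pos

    tree, _ = parse_expr("1 + 2 * 3 - 4".split())
    print(tree)   # ('-', ('+', 1, ('*', 2, 3)), 4)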
I still recommend spending the time to (re-)learn bottom-up parsing though, even if you never use it again. Specifically, LR and then GLR (no need to bother with LALR/SLR/all those variants). Maybe even recursive ascent too, or recursive ascent-descent if you find those interesting. It's painful and confusing and takes a long time to pick them up, but it's quite gratifying once you do learn it. The algorithms are super clever and cool, and help you gain a lot of appreciation and intuition for what is/isn't possible to do (whether efficiently or inefficiently).
Using recursive descent for everything is like using Python (or Go or pick your favorite easy language to learn) for everything. Yeah you can totally do it, and a heck of a lot of people do precisely because of how nice and convenient it is, but it's good to have a sense what else is out there and how you can think about a problem differently IMO, even if you rarely use other languages in practice.
I agree, but for an undergraduate understanding of parsing and grammars, recursive descent is perfect.
I think the more complex parsing algorithms were good for producing papers, which explains their success in academia but relative lack of success in industry.
Recursive descent parsing is a completely valid concept, but it's unnatural at first to hand-write parsers with it. I have tried it - even wrote a prototype DSL in Haskell (the CPS monad is very useful for that) - but what I end up doing is using the DSL to write down the LR state machine, bottom up. This is eerily similar to embedded and GUI programming, where internal state matters and the program reacts to input. It might even be a better way to incorporate ad-hoc code that has side effects. However, most people think about programming languages top-down and use grammars to describe them.
as an aside, the parser for https://hyperscript.org is recursive descent, and our motto is: "remember what our parser professors taught us, then do the opposite of that."
There's some ugly stuff in there, but it's fun, all because we wrote it in recursive descent and can do whatever we darn well please.
I think the best way to understand it is that you're creating an AST (a tree describing the program): a bottom-up parser recognises little treelets and stitches them together into larger ones until it reaches the top/root, while recursive descent builds the tree from the root down.
The main difference is that LR (bottom-up) family grammars can describe more languages than LL (top-down) ones, largely because bottom-up parsing has slightly more contextual information available when it puts things together (often small treelets rather than just the next token). For example, a left-recursive production like E → E + T is natural for LR but sends a naive top-down parser into an infinite loop.
Get a lexer (like, just steal one if you don't think you'll be able to write one without getting bored, they're usually fairly self-contained) and just start writin' code. They're basically that simple.
The lovely thing about a nicely written recursive descent parser is that the code pretty much is the grammar (topologically), so any grammar you can actually imagine that can be parsed by recursive descent doesn't take much thought to make work.
Ultimately, the reason many languages use manually written parsers is that the code generated by parser generators just isn't worth the time saved by not writing a parser manually. The underlying assumption is that writing a parser by hand is somehow super hard, error-prone, and time-consuming. I think that assumption is simply wrong.
I had my compiler course at the University of Utrecht in the mid nineties. The local faculty had some functional programming pioneers and did a lot of work on parser generators. So, I learned how to write a parser generator in Gofer (a Haskell predecessor). Very nice course, and definitely a bit of a mind fuck, since I also had to wrap my head around monads at the same time. And of course it was not really that useful for real-world stuff; I've since forgotten most of what I learned, as I have not used it in over 25 years.
A few years ago, I had to sit down and write a parser for a little query language that we came up with. I spent some time looking at all sorts of convoluted tools for this and refreshing some decades-old knowledge of grammars to try to understand what the tools were trying to do. I then found a useful example of a lexer that simply used regular expressions and suddenly realized I was wasting time and that this stuff was stupidly simple. I knocked the little project out in a few hours using that. Once you have a lexer, the rest gets easy; parsers are not that hard to write. I think this is why people end up doing these things manually. It's just not hard enough to be worth automating, and it's rare enough that it's not worth investing time in learning new tools.
Also, modern languages are a lot better at this stuff. String handling in C was always a bit tedious, but modern languages come with so much built-in stuff for it that it's not really a burden. So, in short, parser generators solve a problem that's not really that big of a problem anymore.
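The lexer half of that really can be just regular expressions. A minimal sketch in Python (the token set is invented for illustration): one alternation of named groups and a loop.

    import re

    TOKEN_RE = re.compile(r"""
        (?P<NUM>    \d+)
      | (?P<IDENT>  [A-Za-z_]\w*)
      | (?P<OP>     [+\-*/=()])
      | (?P<SKIP>   \s+)
    """, re.VERBOSE)

    def lex(text):
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if not m:
                raise SyntaxError(f"bad character {text[pos]!r} at offset {pos}")
            pos = m.end()
            if m.lastgroup != "SKIP":             # drop whitespace
                yield (m.lastgroup, m.group())
        yield ("EOF", "")

    print(list(lex("x = 40 + 2")))
    # [('IDENT', 'x'), ('OP', '='), ('NUM', '40'), ('OP', '+'), ('NUM', '2'), ('EOF', '')]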
> Developers often think parser generators are the sole legit way to build programming language frontends, possibly because compiler courses in university teach lex/yacc variants.
They're far from the only way; handwritten parsers are often mentioned as having much better error recovery. But if you're making a new configuration language, programming language, or something like that, ensuring that it has an LR-compliant grammar, and that this grammar is the source of truth, can avoid a lot of pain later down the road.
The C++ grammar is not LALR(n) for any n. It requires unbounded lookahead. This means that it does not fit the yacc/bison model, and back when bison was used, this was dealt with by adding a third pass, called "spew", between the lexer and the parser to look far enough ahead to tell types from variables from functions. It was a nasty hack which made it very difficult to generate good error messages. But the language is a much better fit for a good hand-written recursive descent parser.
I built an Algol68 compiler way back when. It was a mix: recursive descent to parse the bracketing structure and type definitions, and LALR (with a homegrown generator) for the second pass. It fit well, provided you allowed the parser generator to resolve one of the shift/reduce conflicts on the fly at runtime (to handle the language's ability to change the priority of operators).
I think more programmers should learn how to write parsers by hand. You need them everywhere: for protocols, file formats, user input, little languages. People often reach for regular expressions or other crutches when a parser would be the better choice.
Sometimes, when the problem space is simple enough, you can get the best of both worlds by using regexes to perform your tokenization when lexing and then passing the tokens to a traditional stack-based parser. It's possible for this approach to be bad performance-wise, but it depends on the use case.
The lexer generator that complements yacc, lex, is indeed regex based and, as I also recently discovered, was originally co-written by a little-known developer by the name of Eric Schmidt [1]
[1] https://en.m.wikipedia.org/wiki/Lex_(software)
Start with a simple language like parsing JSON or S-expressions and do it in a language you already know. There are a ton of tutorials you can find online along these lines!
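For a taste of how small that exercise is, here's a minimal S-expression reader (a sketch with no error handling; the split-on-parentheses tokenizer is the usual toy shortcut):

    def tokenize(src):
        return src.replace("(", " ( ").replace(")", " ) ").split()

    def parse_sexpr(tokens, pos=0):
        tok = tokens[pos]
        if tok == "(":                       # list: read items until ')'
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = parse_sexpr(tokens, pos)
                items.append(item)
            return items, pos + 1            # skip ')'
        return tok, pos + 1                  # atom

    tree, _ = parse_sexpr(tokenize("(define (sq x) (* x x))"))
    print(tree)   # ['define', ['sq', 'x'], ['*', 'x', 'x']]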
Do you know Haskell? If not, I suggest you get accustomed to the language, and then read about monadic parsing [1] through Graham Hutton's work. Graham is a famous CS professor at the University of Nottingham, appears often on Computerphile [3,4], and wrote a book on Haskell [2].
[1] www.cs.nott.ac.uk/~pszgmh/pearl.pdf
[2] https://www.amazon.com/Programming-Haskell-Graham-Hutton/dp/...
[3] https://www.youtube.com/channel/UC9-y-6csu5WGm29I7JiwpnA
[4] https://www.youtube.com/watch?v=eis11j_iGMs
I had to write an interpreter, optimizer and engine for a declarative language plus bottom up knowledge base in Haskell as part of an assignment, and an exam in a graduate course on advanced programming. Haskell made the problem significantly easier compared to languages I am much more comfortable with, like Python or C.
The Ruby yacc file is scary to look at: 13,000+ lines in a single file.
https://github.com/ruby/ruby/blob/v3_0_2/parse.y
Would it be better hand-rolled, where they could have abstracted and organized some things, or does it all make sense in its current format if you are familiar with it?
I suspect not, because Scala is super flexible and probably more difficult to parse, but has a better organized parser.
Parser generators generally suffer on at least speed, error recovery, error reporting, and debuggability.
They also require you to learn a new language: the parser generator's language.
Last but not least, what you actually end up wanting to build is an AST, since at some point you want to do something with the input; with most parser generators you then have to implement even more code to build up the AST.
It is much easier to hand-write it. In the end it's faster to write and usually faster to run.
Every few years I evaluate the new options for the languages I use (C#, Pascal); every time so far I have been disappointed with the tooling. Maybe one year.
I think you have to see them a bit like regexes. It may be a bit annoying to learn the grammar at first, but it's cross-language learning in a way, as you can use parser generator libraries in any language.
It's also potentially quite slow.
Part of the course linked below is understanding exactly when regexps are good enough and when you need something more powerful.
http://anthony-zhang.me/University-Notes/CS241/CS241.html
Just would like to add a link: https://craftinginterpreters.com/contents.html
It explains parsing and other topics in a really clear and accessible way.