Parsing: The Solved Problem That Isn't (2011)

Parsing computer languages is an entirely self-inflicted problem. You can easily design a language so it doesn't require any parsing techniques that were not known and practical in 1965, and it will greatly benefit the readability also.

dkjaudyeqooe · 2 years ago

This is entirely the case. Given a sensible grammar stated in a sensible way, it's very easy to write a nice recursive decent parser. They are fast and easy to maintain. It doesn't limit the expressiveness of your grammar unduly.

Both GCC and LLVM implement recursive decent parsers for their C compilers.

Parser generators are an abomination inflicted upon us by academia, solving a non problem, and poorly.

derriz · 2 years ago

Agree completely. Having used a bunch of parser generators (Antlr and bison most extensively) and written a parser combinator library, I came to the conclusion that they're a complete waste of time for practical applications.

A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators) solves all the problems that parser generators struggle with. The big/tricky "issue" mentioned in the article - composing or embedding one parser in another - is a complete non-issue with recursive-descent - it's just a function call. Other basic features of parsing: informative/useful error messages, recovery (i.e. don't just blow up with the first error), seamless integration with the rest of the host language, remain issues with all parser generators but are simply not issues with recursive descent.

And that's before you consider non-functional advantages of recursive-descent: debuggability, no change/additions to the build system, fewer/zero dependencies, no requirement to learn a complex DSL, etc.

LgWoodenBadger · 2 years ago

Just like email addresses. The specification/rfc/whatever could have defined a reg-ex that determines a valid address, instead of the essential impossibility we have today.

paulddraper · 2 years ago

> and it will greatly benefit the readability also

This is the controversial part, Lisp aficionados to the contrary.

kazinator · 2 years ago

People are misunderstanding my original comment.

You can parse quite a lot more than Lisp with techniques from 1965.

loevborg · 2 years ago

Do you mean lisp? If yes I agree

kazinator · 2 years ago

Including, but not limited to.

wglb · 2 years ago

Smalltalk is another.

amelius · 2 years ago

End-comment

lgas · 2 years ago

)

Dead Comment

Legend2440 · 2 years ago

But I don't want to be able to parse only highly restricted languages. I want to be able to parse anything, including natural language or even non-languages like raw audio.

My brain can do it, why can't my computer?

throwbadubadu · 2 years ago

Yes, never do humans misunderstand each other, or instructions are not clear to everyone and totally unambiguous, and luckily no language has pure differentiation of meaning by intonation, and, and.. and...

reissbaker · 2 years ago

Your computer can do it, if it has a beefy enough Nvidia card. That's what LLMs are for!

Grammar-based parsing for natural language isn't anywhere close to working, sadly, and may never be.

Deleted Comment

Have there been any notable innovations in parsing since this was written?

thechao · 2 years ago

Aycock & Horspool came up with a 'practical' method for implementing Earley parsing (conversion to a state-machine) that has pretty humorously good performance delta over "naive" Earley, and is still reasonable to implement. Joop Leo figured out how to get the worst-case of Earley parsing down to either O(n) (left-recursive, non-ambiguous) or O(n^2) (right-recursive, non-ambiguous). That means the Earley algorithm is only O(n^3) on right-recursive, ambiguous grammars; and, if you're doing that, you're holding your language wrong.

A somewhat breathless description of all of this is in the Marpa parser documentation:

    https://jeffreykegler.github.io/Marpa-web-site/

In practice, I've found that computers are so fast, that with just the Joop Leo optimizations, 'naive' Earley parsing is Good Enough™:

    https://loup-vaillant.fr/tutorials/earley-parsing/

marcusf · 2 years ago

An extremely layman answer is that most interesting innovation in parsing in relatively modern times has happened seems to be in the context of IDE's. I.e. incremental, high-performance parsing to support syntax highlighting, refactoring, etc. etc.

(I may be talking out of my ass here.)

ReleaseCandidat · 2 years ago

Actually the most important step of parsers (as even non-incremental, slow (or better: not fast) parsers are fast enough) is error recovery (error resilience) from syntax errors (mostly half written or half deleted code). What is time consuming is e.g. type-checking. Semantic checking in general, like exhaustiveness checks of pattern matches, syntax checking is fast.

o11c · 2 years ago

Not sure, but I at least am certainly aware of possibilities that such writeups exclude.

In particular, you can do (a subset of) the following in sequence:

* write your own grammar in whatever bespoke language you want

* compose those grammars into a single grammar

* generate a Bison grammar from that grammar

* run `bison --xml` instead of actually generating code

* read the XML file and implement your own (trivial) runtime so you can easily handle ownership issues

In particular, I am vehemently opposed to the idea of implementing parsers separately using some non-proven tool/theory, since that way leads to subtle grammar incompatibilities later.

sse · 2 years ago

The one I'm most excited about is improved error recovery:

https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/

https://drops.dagstuhl.de/storage/00lipics/lipics-vol166-eco...

danielvaughn · 2 years ago

I'm not super familiar with the space, but tree-sitter seems to take an interesting approach in that they are an incremental parser. So instead of re-parsing the entire document on change, it only parses the affected text, thereby making it much more efficient for text editors.

I don't know if that's specific to tree-sitter though, I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.

IshKebab · 2 years ago

In my experience incremental parsing doesn't really make much sense. Non-incremental parsing can easily parse huge documents in milliseconds.

Also Tree Sitter only does half the parsing job - you get a tree on nodes, but you have to do your own parse of that tree to get useful structures out.

I prefer Chumsky or Nom which go all the way.

ReleaseCandidat · 2 years ago

> [incremental parsing] I don't know if that's specific to tree-sitter though

No, it isn't. And incremental parsing is older than 2011 too (like at least the 70s).

For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066

troupo · 2 years ago

Yes. You roll your own manual parser. It's not as difficult as people make it out to be.

davidkellis · 2 years ago

These are not new, but my takeaways from https://tratt.net/laurie/blog/2020/which_parsing_approach.ht... and https://rust-analyzer.github.io/blog/2020/09/16/challeging-L... are to embrace various forms of LR parsing. https://github.com/igordejanovic/parglare is a very capable GLR parser, and I've been keeping a close eye on it for use in my projects.

bsder · 2 years ago

Yes, we resdesigned our programming languages to be easy to parse with limited lookahead.

norir · 2 years ago

I feel that most of the time the two options are presented as either write a handwritten parser or use a parser generator. A nice third way is to write a custom parser generator for the language you wish to parse. Handwritten parsers do tend to get unwieldy and general purpose parser generators can have inscrutable behavior for any specific language.

Because the grammar for a parser generator is usually much simpler than most general purpose programming languages, it is typically relatively straightforward to handwrite a parser for it.

dboreham · 2 years ago

You could write a language with which to specify that custom parser. Oh wait...