For those unfamiliar with it, tree-sitter (https://emacs-tree-sitter.github.io/) aims to be a foundational package that understands code structurally (think abstract syntax trees). This was previously done with regexes, which have their limitations.
This HN post though is about a (new) core tree-sitter implementation in Emacs itself, which is not the same as the third party package[1] you linked. To give credit where credit is due though, it was obviously inspired by this work and what it allowed in community-maintained packages.
The new implementation has been authored by Yuan Fu in close collaboration with the core Emacs maintainers and the rest of the community. It has been an ongoing effort for many, many months.
This is great news. It means that the language support shipped as part of Emacs itself will now be able to use tree-sitter based parsers as well, which would not have happened if it had to depend on a third-party package for the bindings.
I've been somewhat involved in the process, although not as a major player, but needless to say I'm very excited about this news and can't wait to see what sort of improvements it enables across the board once people start using it.
So we still have to wait for each major-mode maintainer to update their code in order to benefit from this change? In that case, how big would the change be for a "typical" mode? Is it going to happen for C/Python/TypeScript/etc. anytime soon?
One thing I'll add, because I think it's an interesting insight about the priorities of code parsing for text editors: Tree-sitter is specifically designed to be very effective at parsing code that's in an invalid state. E.g., think about adding a new line to a program, the new line you're adding is typically invalid for the majority of time until you've finished typing it out.
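To make the "invalid while typing" point concrete, here is a minimal Python sketch. It is not how tree-sitter recovers from errors; it is a much cruder stand-in that parses each line on its own with Python's standard `ast` module and marks failures as `ERROR`, so it only works for single-line statements. The names `parse_lines` and `BUFFER` are made up for the illustration.

```python
import ast

def parse_lines(source):
    """Parse each line separately; a line that fails becomes an ERROR entry.

    Assumption: every statement fits on one line. A real error-tolerant
    parser works on the whole tree, not line by line.
    """
    result = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        try:
            tree = ast.parse(line)
            kind = type(tree.body[0]).__name__ if tree.body else "Blank"
        except SyntaxError:
            kind = "ERROR"
        result.append((lineno, kind))
    return result

# Line 2 is half-typed: the call is missing its closing parenthesis.
BUFFER = "x = 1\ny = foo(\nz = x + 2\n"

def whole_buffer_parses(source):
    """Return True if a strict, all-or-nothing parse of the buffer succeeds."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

if __name__ == "__main__":
    print(whole_buffer_parses(BUFFER))
    print(parse_lines(BUFFER))
```

The strict parse rejects the whole buffer because of one unfinished line, while the tolerant version confines the damage to that line and still reports structure for the lines around it.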
> E.g., think about adding a new line to a program, the new line you're adding is typically invalid for the majority of time until you've finished typing it out.
That doesn't really bother me. The line of code you're currently typing is generally so invalid that there's not much point trying to color it.
But what does bother me is the code below where you're typing changing its color as the IDE tries to make your partial line fit.
Unfortunately this is a really hard problem, and I’ve used projects with tree-sitter and it actually chokes up on invalid syntax lines.
Granted, maybe the projects weren’t using tree-sitter correctly. But regex parsing is surprisingly practical, so despite being really amazing at its goals, tree-sitter may not have a definitive advantage.
It's not related to tree-sitter, but recent work on using text properties instead of overlays for folded regions in org has improved performance opening org files with folded regions from O(n^2) to O(nlogn). See https://blog.tecosaur.com/tmio/2022-05-31-folding.html It's a big improvement in practice.
Incremental parsing of incorrect code is one of those things that is literally impossible in the general case, but tree-sitter has found a lot of good ways to do it that are not just possible for a large fraction of reality, but also performant. It's hard to overstate how impressive a piece of engineering this is.
That point you make about syntax highlighting being slow while using eglot/LSP-mode is a great one. I've been a bit underwhelmed with eglot, and I think that must be the reason: it feels like I'm programming in a bowl of oatmeal with every keystroke.
Do you have any tips or guides for using treesitter for syntax highlighting/structural editing and eglot/LSP-mode for everything else?
AFAIK eglot/lsp-mode don't do syntax highlighting. The article's just explaining why that is (i.e. because it would be too slow).
If you don't have tree-sitter your syntax highlighting will be done by the regex based font-lock-mode. I don't think eglot/lsp-mode make that slower, and I believe tree-sitter should speed it up (and make it more correct) without affecting them. I haven't tried it yet, though.
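As a rough illustration of "more correct", here is a small Python sketch comparing a keyword regex with a token-aware pass. It uses Python's standard `tokenize` module as a stand-in for a real parser; it is not font-lock or tree-sitter code, and `regex_keywords`, `token_keywords`, and `SOURCE` are names invented for the example.

```python
import io
import re
import tokenize

# A tiny buffer where keywords also appear inside a string and a comment.
SOURCE = 'label = "if else"  # if only\nif label:\n    pass\n'

KEYWORDS = {"if", "else", "pass"}
KEYWORD_RE = re.compile(r"\b(if|else|pass)\b")

def regex_keywords(source):
    """Return (line, column, text) for every keyword-looking regex match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in KEYWORD_RE.finditer(line):
            hits.append((lineno, match.start(), match.group()))
    return hits

def token_keywords(source):
    """Return (line, column, text) only for keywords that are real tokens."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in KEYWORDS:
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits

if __name__ == "__main__":
    print(regex_keywords(SOURCE))
    print(token_keywords(SOURCE))
```

The regex pass also flags the `if` and `else` inside the string and the `if` in the comment, while the token-aware pass only reports the two keywords that are actually code. Real regex-based highlighters add extra rules to patch over cases like this, which is where much of their complexity comes from.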
I'm really impressed with the strides Emacs has made recently: native compilation, project.el, eglot, and now tree-sitter?
As a user who hadn't kept up with development news until recently, I'd always mentally sorted Emacs into the same taxonomy as stuff like `find`: old, powerful, with a clunky interface and a stodgy resistance to updating how it does things (though not without reason).
I'm increasingly feeling like that's an unfair classification on my part--I'm genuinely super excited to see where Emacs is in 5 years.
Yes, it feels like there has been a lot of momentum recently.
Both neovim and Emacs are being improved at breakneck pace, and it is quite incredible for such an old piece of software with, dare I say, a quirky contribution model. The maintainers are working really hard on keeping it current and competitive.
I'm really hoping that Emacs becomes multithreaded somehow. Or at least improves some operations so that they're non-blocking.
I've been using Emacs primarily for org-mode/roam/babel for a few years now. I'm very glad for its existence, I really think I've become a more effective DevOps person because of it.
I'll be entirely satisfied with a process/event queue/loop that we can submit tasks to, like JavaScript's. There is already a command loop in Emacs, we just can't use it for anything other than input events and commands. Once we have a good event loop, we can build a state machine like Redux on it, then we can start rebuilding the display machinery, then we can start deleting all those hooks that constantly interfere with each other...
I don't think it needs to become multithreaded, it just needs better support for async/event loop style concurrency!
Right now we can run subprocesses without blocking anything with "make-process", but interacting with the process is pretty clunky, and you have to use the process sentinel and filter to perform callbacks when the process changes state or exits. There is quite a lot of boilerplate to set up for all of this and the control flow is pretty confusing IMO.
A nice "async/await" style interface to these things could really go a long way I think!
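For a feel of what that style looks like, here is a minimal sketch in Python's `asyncio` (not Emacs Lisp, and not an existing Emacs API). Starting the process, collecting its output, and reacting to its exit read as one straight-line function instead of being split across a filter and a sentinel. The helper name `run_and_collect` is made up for the example.

```python
import asyncio
import sys

async def run_and_collect(*argv):
    """Start a subprocess and wait for it without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE
    )
    # communicate() gathers all of stdout and waits for the process to exit.
    out, _ = await proc.communicate()
    return proc.returncode, out.decode().strip()

async def main():
    # Use the current Python interpreter as a portable child process.
    return await run_and_collect(sys.executable, "-c", "print('hello')")

if __name__ == "__main__":
    print(asyncio.run(main()))
```

While `run_and_collect` is waiting, the event loop is free to run other tasks, which is the property the callback-based approach also gives you, just with the control flow written in order.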
Indeed, I'm using Emacs for Code, reading/writing documents and emails, as well as consuming RSS feeds. The ecosystem and values that underpin Emacs are fantastic - in my personal case the only downside to heavy use of Emacs is that it can struggle to utilise my hardware. This tends to be particularly noticeable when using TRAMP and Eglot, or producing large org tables.
There is one more, possibly gigantic, thing though: Better handling of very long strings. I know the data structures for strings have various tradeoffs, but properly abstracted, it should be possible to even give a choice, no? So users could choose the data structure, based on their use cases. But I know little about the internals and maybe that is all too low level to be something a user could choose from the user interface or configuration.
I hope the string data structure is properly abstracted, so that it can be exchanged for another data structure, but I have my doubts. I would like to be surprised by anyone credibly telling me that the string data structure in Emacs has an abstraction barrier around it and is actually exchangeable, by implementing basic string functions like "get nth character" or "get substring" in terms of another data structure.
If it is not properly abstracted, then of course it could be a nightmare to change the data structure.
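To sketch what such an abstraction barrier could look like (in Python, and saying nothing about how Emacs actually stores text), here are two interchangeable stores behind the same small interface. `TextStore`, `FlatText`, `ChunkedText`, and `first_word` are all hypothetical names for the illustration.

```python
from bisect import bisect_right
from typing import Protocol

class TextStore(Protocol):
    """The only operations client code is allowed to rely on."""
    def length(self) -> int: ...
    def char_at(self, index: int) -> str: ...
    def substring(self, start: int, end: int) -> str: ...

class FlatText:
    """One contiguous string."""
    def __init__(self, text: str) -> None:
        self._text = text
    def length(self) -> int:
        return len(self._text)
    def char_at(self, index: int) -> str:
        return self._text[index]
    def substring(self, start: int, end: int) -> str:
        return self._text[start:end]

class ChunkedText:
    """The same text split into fixed-size chunks (a very crude rope)."""
    def __init__(self, text: str, chunk_size: int = 4) -> None:
        self._length = len(text)
        # starts[i] is the offset of the first character of chunks[i].
        self._starts = list(range(0, len(text), chunk_size))
        self._chunks = [text[i:i + chunk_size] for i in self._starts]
    def length(self) -> int:
        return self._length
    def char_at(self, index: int) -> str:
        i = bisect_right(self._starts, index) - 1
        return self._chunks[i][index - self._starts[i]]
    def substring(self, start: int, end: int) -> str:
        return "".join(self.char_at(i) for i in range(start, end))

def first_word(store: TextStore) -> str:
    """Client code written only against the interface."""
    end = 0
    while end < store.length() and store.char_at(end) != " ":
        end += 1
    return store.substring(0, end)

if __name__ == "__main__":
    print(first_word(FlatText("tree sitter")))
    print(first_word(ChunkedText("tree sitter")))
```

`first_word` never learns which representation it is talking to, so swapping one for the other is a local change. The hard part in a real codebase is everything that reaches past the interface and touches the raw storage directly.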
I use IntelliJ products but still prefer Emacs as an editor. I moved off it for code because of the IDE features: even where I managed to get some of that convenience in Emacs, it ran synchronously, which meant the experience could be pretty laggy at times, versus "at worst the popup with extra info will be delayed" in IDEA.
I have a huge belief in tree-sitter. I think it's going to continue to grow and become an important tool, especially in security/code tooling contexts.
The main innovation of tree-sitter, even more than incremental parsing, as I see it is that it provides a uniform api for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem though is that the tree-sitter grammar is nearly always going to be an approximation to the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing. To me, this is problematic for tooling because it is always possible for a tree-sitter based tool to be flat out wrong relative to the actual language. For syntax highlighting, this is generally not a huge deal (and tree-sitter does generally work well, though there are exceptions), but I'd be more cautious with security tools based on tree-sitter.
If all languages changed their reference parsers to tree-sitter, this would be moot, but that seems unlikely. Language parsers are often optimized beyond what is possible in a general purpose parser generator like tree-sitter and/or have ambiguities that cannot be resolved with the tree-sitter dsl.
What feels perhaps likely in the future is that a standard parse tree api emerges, analogous to lsp, and then language parsers could emit trees traversable by this api. Maybe it's just the tree-sitter c api with an alternate front end? Hard to say, but I suspect either something better than (but likely at least partially inspired by) tree-sitter will emerge or we will get stuck in a local minimum with tooling based on slightly incorrect language parsers.
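A minimal sketch of that idea, in Python: the language's own parser (here Python's standard `ast` module, standing in for any "official" parser) emits its tree into a small shared node shape, and tools are written only against that shape. `Node`, `from_python`, `walk`, and `count_kind` are hypothetical names, not part of tree-sitter or any existing standard.

```python
import ast
from dataclasses import dataclass, field

@dataclass
class Node:
    """A language-neutral tree node: just a kind and ordered children."""
    kind: str
    children: list["Node"] = field(default_factory=list)

def from_python(source: str) -> Node:
    """Front end: convert the reference parser's tree into the shared shape."""
    def convert(n: ast.AST) -> Node:
        return Node(type(n).__name__,
                    [convert(c) for c in ast.iter_child_nodes(n)])
    return convert(ast.parse(source))

def walk(node: Node):
    """Yield every node in the tree, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)

def count_kind(root: Node, kind: str) -> int:
    """A tiny 'tool' that only knows about the shared Node shape."""
    return sum(1 for n in walk(root) if n.kind == kind)

if __name__ == "__main__":
    tree = from_python("def f(x):\n    return g(x) + g(1)\n")
    print(count_kind(tree, "Call"))
```

Because the tree comes from the reference parser, the tool cannot disagree with the language about what is valid; the cost is that every language needs its own front end, and the node kinds still differ per language.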
> as I see it is that it provides a uniform api for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem though is that the tree-sitter grammar is nearly always going to be an approximation to the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing.
Author of DiffLens (https://marketplace.visualstudio.com/items?itemName=DiffLens...) here. A uniform API for traversing a parse tree for all languages would be amazing for DiffLens! However, I fear languages are different enough that this ideal may never be reached :) Or maybe there would be a core set of APIs and extensions for the idiosyncrasies of each language. For DiffLens though, we try to use the language's official parser/compiler if it exposes an AST
> unless the language compiler/interpreter uses tree-sitter for parsing
Doubtful. Last time I tried, tree-sitter would parse invalid inputs without even tagging any errors in the parse tree. For example, it would silently accept extra tokens, or keywords in the place of identifiers. Replacing the built-in lexer and then validating the parse tree for correctness would be close to writing the grammar twice.
And accepting partially correct inputs within the compiler toolchain isn't too hard, so I don't really see the advantage of agreeing on tree-sitter and not just on a parse tree representation that editors can then query, as you then suggested. If the big deal is having it execute client-side or being sandboxed, I feel that's orthogonal to parsing algorithms.
tree-sitter is a bit better than regexps, but it is not an actual parser of the language grammars. A fast, actual parser for every language is the future of syntax coloring, I think; tree-sitter is a pragmatic middle ground while we wait for the prime solution.
tree-sitter creates parsers, e.g. for programming languages, config formats, etc.
Emacs modes can use those parsers on buffer contents, e.g. for syntax colouring/highlighting, finding matching delimiters (e.g. moving the cursor over an `if`, and having all the corresponding clauses (e.g. else/elif/fi) highlighted), for contextual editing (e.g. escaping " when inside a string), etc.
This can be remarkably tricky to get right; e.g. consider languages which can splice expressions inside strings (which can themselves contain strings, containing spliced expressions, etc.)
Using tree-sitter should make this easier and more robust (i.e. less time spent implementing parsers; more time spent implementing features!). I think it would also allow grammars to be re-used across different tools, which should improve support for obscure/niche languages.
Does this mean that every Emacs language package will automatically make use of this once it is built in? Or will it rather make it possible to write or rewrite programming language modes so they use tree-sitter, because they can assume it is available in the default Emacs install from then on?
Congratulations, Emacs! I hope it will be as much of a success story as it was in Neovim. If more systems use it, the question "should my programming language provide a Tree-Sitter parser" becomes a no-brainer.
This talk: https://www.thestrangeloop.com/2018/tree-sitter---a-new-pars... by the author is quite instructive as well.
[1] https://github.com/emacs-tree-sitter/elisp-tree-sitter
Slow parsing is why emacs used regexps for syntax highlighting instead of parsers in the first place.
https://www.masteringemacs.org/article/tree-sitter-complicat...
Thanks, your articles and your book are the best guides into the world of Emacs.
What lang do you use?
> Emacs is now capable of editing files with very long lines.
> The display of long lines has been optimized, and Emacs should no
> longer choke when a buffer on display contains long lines.
> ...
Think highlighting for html/JS/CSS in a single file or fully featured highlighting inside markdown code snippets.