signa11 · 3 years ago
For those unfamiliar with it, tree-sitter (https://emacs-tree-sitter.github.io/) aims to be a foundational package that understands code structurally (think abstract syntax trees). Previously this was done with regexes, which have their limitations.

This talk: https://www.thestrangeloop.com/2018/tree-sitter---a-new-pars... by the author is quite instructive as well.

josteink · 3 years ago
This HN post though is about a (new) core tree-sitter implementation in Emacs itself, which is not the same as the third party package[1] you linked. To give credit where credit is due though, it was obviously inspired by this work and what it allowed in community-maintained packages.

The new implementation has been authored by Yuan Fu in close collaboration with the core Emacs maintainers and the rest of the community. It has been an ongoing effort for many, many months.

This is great news. It means the language support provided as part of Emacs itself will now be able to use tree-sitter based parsers too, something that wouldn't have happened if it had to depend on a third-party package for those bindings.

I've been somewhat involved in the process, although not as a major player, but needless to say I'm very excited about this news and can't wait to see what sort of improvements it enables across the board once people start using it.

[1] https://github.com/emacs-tree-sitter/elisp-tree-sitter

phtrivier · 3 years ago
So, we still have to wait for each major mode's maintainers to update their code in order to benefit from this change? In that case, how big would the change be for a "typical" mode? Is it going to happen for C/Python/TypeScript/etc. anytime soon?
robenkleene · 3 years ago
One thing I'll add, because I think it's an interesting insight about the priorities of code parsing for text editors: Tree-sitter is specifically designed to be very effective at parsing code that's in an invalid state. E.g., think about adding a new line to a program, the new line you're adding is typically invalid for the majority of time until you've finished typing it out.
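
A toy illustration of that idea in Python, not tree-sitter's real recovery algorithm: the hypothetical `parse_tolerant` below accepts only statements shaped like `name = number;` and wraps anything else in an ERROR node, so a half-typed line doesn't stop the lines after it from parsing. The grammar, `Node`, and the function names are all made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str  # "assign" or "ERROR"
    text: str


def parse_statement(line: str) -> Node:
    """Parse `name = number;`; anything else becomes an ERROR node."""
    text = line.strip()
    body, semi, rest = text.partition(";")
    name, eq, value = body.partition("=")
    ok = (
        semi == ";" and rest == ""
        and eq == "="
        and name.strip().isidentifier()
        and value.strip().isdigit()
    )
    return Node("assign" if ok else "ERROR", text)


def parse_tolerant(source: str) -> list[Node]:
    # Recovery is per line in this toy: one bad statement yields one
    # ERROR node and parsing simply resumes at the next line.
    return [parse_statement(l) for l in source.splitlines() if l.strip()]


# The middle line is still being typed; its neighbours are unaffected.
half_typed = "x = 1;\ny = \nz = 3;"
kinds = [node.kind for node in parse_tolerant(half_typed)]
print(kinds)
```

The point is only the shape of the result: the broken line is isolated as its own node, and the valid code around it keeps its structure (and so its highlighting).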
thaumasiotes · 3 years ago
> E.g., think about adding a new line to a program, the new line you're adding is typically invalid for the majority of time until you've finished typing it out.

That doesn't really bother me. The line of code you're currently typing is generally so invalid that there's not much point trying to color it.

But what does bother me is the code below where you're typing changing its color as the IDE tries to make your partial line fit.

srcreigh · 3 years ago
That and very fast incremental parsing as you type.

Slow parsing is why emacs used regexps for syntax highlighting instead of parsers in the first place.
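
A very rough sketch of why incremental work matters, in Python. This is not how tree-sitter does it (tree-sitter reuses unchanged subtrees of the previous syntax tree); the hypothetical `LineCache` below just memoizes per-line tokenization, which only works for a made-up language where lines are independent. It does show the payoff: after a one-line edit, only one line is re-processed.

```python
class LineCache:
    """Tokenizes a buffer line by line, reusing results for unchanged lines."""

    def __init__(self) -> None:
        self.cache: dict[str, list[str]] = {}
        self.lines_tokenized = 0  # how much real work has been done so far

    def tokenize(self, text: str) -> list[list[str]]:
        result = []
        for line in text.splitlines():
            if line not in self.cache:
                # Cache miss: this is the only place real work happens.
                self.lines_tokenized += 1
                self.cache[line] = line.split()
            result.append(self.cache[line])
        return result


buffer = [f"x{i} = {i}" for i in range(1000)]
lexer = LineCache()
lexer.tokenize("\n".join(buffer))
first_pass = lexer.lines_tokenized  # every line tokenized once

buffer[500] = "x500 = 42"  # the user edits a single line
tokens = lexer.tokenize("\n".join(buffer))
second_pass = lexer.lines_tokenized - first_pass  # work done after the edit
print(first_pass, second_pass)
```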

armchairhacker · 3 years ago
Unfortunately this is a really hard problem, and I’ve used projects with tree-sitter and it actually chokes up on invalid syntax lines.

Granted, maybe the projects weren’t using tree-sitter correctly. But regex parsing is surprisingly practical, so despite being really amazing at its goals, tree-sitter may not have a definitive advantage.

tmalsburg2 · 3 years ago
Is there a chance that this is going to make the parsing of large org mode files faster?
AlanYx · 3 years ago
It's not related to tree-sitter, but recent work on using text properties instead of overlays for folded regions in org has improved the performance of opening org files with folded regions from O(n^2) to O(n log n), which is a big improvement in practice. See https://blog.tecosaur.com/tmio/2022-05-31-folding.html
aidenn0 · 3 years ago
Incremental parsing of incorrect code is one of those things that is literally impossible in the general case, but tree-sitter has found a lot of good ways to do it that are not just possible for a large fraction of reality, but also performant. It's hard to understate how impressive a piece of engineering this is.
Jenk · 3 years ago
I think you mean "hard to overstate" :)
aidenn0 · 3 years ago
Indeed. Too late to edit now though.
mickeyp · 3 years ago
If you're wondering what Tree-Sitter is and why Emacs would want it, I wrote about it a while ago:

https://www.masteringemacs.org/article/tree-sitter-complicat...

sph · 3 years ago
There is not an Emacs topic I'd like to know more about that you haven't already covered on your website.

Thanks, your articles and your book are the best guides into the world of Emacs.

mickeyp · 3 years ago
Thank you :) I'm glad you like my site and my book!
afry1 · 3 years ago
That point you make about syntax highlighting being slow while using eglot/LSP-mode is a great one. I've been a bit underwhelmed with eglot, and I think that must be the reason: it feels like I'm programming in a bowl of oatmeal with every keystroke.

Do you have any tips or guides for using treesitter for syntax highlighting/structural editing and eglot/LSP-mode for everything else?

omnicognate · 3 years ago
AFAIK eglot/lsp-mode don't do syntax highlighting. The article's just explaining why that is (i.e. because it would be too slow).

If you don't have tree-sitter your syntax highlighting will be done by the regex based font-lock-mode. I don't think eglot/lsp-mode make that slower, and I believe tree-sitter should speed it up (and make it more correct) without affecting them. I haven't tried it yet, though.

srcreigh · 3 years ago
Try tree-sitter-hl-mode from the tree-sitter Emacs package.

What lang do you use?

erganemic · 3 years ago
I'm really impressed with the strides Emacs has made recently: native compilation, project.el, eglot, and now tree-sitter?

As a user who hadn't kept up with development news until recently, I'd always mentally sorted Emacs into the same taxonomy as stuff like `find`: old, powerful, with a clunky interface and a stodgy resistance to updating how it does things (though not without reason).

I'm increasingly feeling like that's an unfair classification on my part--I'm genuinely super excited to see where Emacs is in 5 years.

sph · 3 years ago
Yes, it feels like there has been a lot of momentum recently.

Both neovim and Emacs are being improved at breakneck pace, and it is quite incredible for such an old piece of software with, dare I say, a quirky contribution model. The maintainers are working really hard on keeping it current and competitive.

bloopernova · 3 years ago
I'm really hoping that Emacs becomes multithreaded somehow. Or at least improves some operations so that they're non-blocking.

I've been using Emacs primarily for org-mode/roam/babel for a few years now. I'm very glad for its existence, I really think I've become a more effective DevOps person because of it.

wyuenho · 3 years ago
I'll be entirely satisfied with a process/event queue/loop that we can submit tasks to, like JavaScript's. There is already a command loop in Emacs, we just can't use it for anything other than input events and commands. Once we have a good event loop, we can build a state machine like Redux on it, then we can start rebuilding the display machinery, then we can start deleting all those hooks that constantly interfere with each other...
chlorion · 3 years ago
I don't think it needs to become multithreaded, it just needs better support for async/event loop style concurrency!

Right now we can run subprocesses without blocking anything with "make-process", but interacting with the process is pretty clunky, and you have to use the process sentinel and filter to perform callbacks when the process changes state or exits. There is quite a lot of boilerplate to set up for all of this, and the control flow is pretty confusing IMO.

A nice "async/await" style interface to these things could really go a long way I think!
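
For comparison, here is roughly what that style looks like in Python's asyncio. This is not an Emacs API, just an illustration of the shape being asked for; `run_and_capture` is a made-up helper name. Starting the process, collecting its output, and reacting to its exit read as straight-line code instead of being split across a filter and a sentinel.

```python
import asyncio
import sys


async def run_and_capture(*cmd: str) -> tuple[int, str]:
    """Run a command without blocking the event loop; return (exit code, stdout)."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
    )
    # communicate() waits for the process to exit and gathers its output,
    # the job a filter plus a sentinel would share in callback style.
    stdout, _ = await proc.communicate()
    return proc.returncode, stdout.decode()


async def main() -> tuple[int, str]:
    # Use the current Python interpreter as a portable child process.
    code, out = await run_and_capture(sys.executable, "-c", "print('hello')")
    return code, out.strip()


result = asyncio.run(main())
print(result)
```

While `main` is suspended at an `await`, the event loop is free to run other tasks, which is the non-blocking behaviour people want for things like language servers.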

s0l1dsnak3123 · 3 years ago
Indeed, I'm using Emacs for Code, reading/writing documents and emails, as well as consuming RSS feeds. The ecosystem and values that underpin Emacs are fantastic - in my personal case the only downside to heavy use of Emacs is that it can struggle to utilise my hardware. This tends to be particularly noticeable when using TRAMP and Eglot, or producing large org tables.
ilyt · 3 years ago
Yeah the extra micro-waits introduced by some IDE-like features were annoying last time I used it.
zelphirkalt · 3 years ago
I have the same feeling.

There is one more, possibly gigantic, thing though: Better handling of very long strings. I know the data structures for strings have various tradeoffs, but properly abstracted, it should be possible to even give a choice, no? So users could choose the data structure, based on their use cases. But I know little about the internals and maybe that is all too low level to be something a user could choose from the user interface or configuration.

I hope the string data structure is properly abstracted, so that it is exchangeable for another data structure, but I have my doubts. I would like to be surprised here by anyone credibly telling me that the string data structure in Emacs has an abstraction barrier around it and is actually exchangeable, by implementing basic string functions like "get nth character" or "get substring" in terms of another data structure.

If it is not properly abstracted, then of course it could be a nightmare to change the data structure.

b3morales · 3 years ago
This was also something that was enhanced recently and will be in Emacs 29: https://github.com/emacs-mirror/emacs/blob/21b387c39bd9cf07c...

> Emacs is now capable of editing files with very long lines.

> The display of long lines has been optimized, and Emacs should no longer choke when a buffer on display contains long lines.

> ...

ilyt · 3 years ago
I use IntelliJ products but still prefer Emacs as an editor. I moved off it for code because of IDE features: even when I managed to get some of that convenience in Emacs, it ran synchronously, which meant the experience could be pretty laggy at times, versus "at worst the popup with extra info will be delayed" in IDEA.
bjourne · 3 years ago
Check out the emacs-devel@gnu.org list sometime. It's incredibly well run and is in my opinion the secret sauce that keeps the project running.
mcqueenjordan · 3 years ago
I have a huge belief in tree-sitter. I think it's going to continue to grow and become an important tool, especially in security/code tooling contexts.
norir · 3 years ago
The main innovation of tree-sitter, even more than incremental parsing, as I see it is that it provides a uniform api for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem though is that the tree-sitter grammar is nearly always going to be an approximation to the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing. To me, this is problematic for tooling because it is always possible for a tree-sitter based tool to be flat out wrong relative to the actual language. For syntax highlighting, this is generally not a huge deal (and tree-sitter does generally work well, though there are exceptions), but I'd be more cautious with security tools based on tree-sitter.

If all languages changed their reference parsers to tree-sitter, this would be moot, but that seems unlikely. Language parsers are often optimized beyond what is possible in a general purpose parser generator like tree-sitter and/or have ambiguities that cannot be resolved with the tree-sitter dsl.

What feels perhaps likely in the future is that a standard parse tree api emerges, analogous to lsp, and then language parsers could emit trees traversable by this api. Maybe it's just the tree-sitter c api with an alternate front end? Hard to say, but I suspect either something better than (but likely at least partially inspired by) tree-sitter will emerge or we will get stuck in a local minimum with tooling based on slightly incorrect language parsers.
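
To illustrate what I mean by a uniform traversal API, here is a minimal Python sketch. The `Node` class and the node type names are assumptions made up for the example, not tree-sitter's actual bindings or grammars. The point is that `identifiers` never needs to know which language produced the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def walk(node: Node):
    """Yield every node in the tree, depth first, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


def identifiers(root: Node) -> list[str]:
    # A language-agnostic "tool": it relies only on the shared node shape.
    return [n.text for n in walk(root) if n.type == "identifier"]


# Hand-built stand-ins for what two different grammars might produce.
python_tree = Node("module", children=[
    Node("assignment", children=[Node("identifier", "x"), Node("integer", "1")]),
])
c_tree = Node("translation_unit", children=[
    Node("declaration", children=[Node("primitive_type", "int"), Node("identifier", "y")]),
])

print(identifiers(python_tree), identifiers(c_tree))
```

The caveat from above still applies: the tool is only as right as the grammar that built the tree.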

difflens · 3 years ago
> as I see it is that it provides a uniform api for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem though is that the tree-sitter grammar is nearly always going to be an approximation to the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing.

Author of DiffLens (https://marketplace.visualstudio.com/items?itemName=DiffLens...) here. A uniform API for traversing a parse tree for all languages would be amazing for DiffLens! However, I fear languages are different enough that this ideal may never be reached :) Or maybe there would be a core set of APIs and extensions for the idiosyncrasies of each language. For DiffLens though, we try to use the language's official parser/compiler if it exposes an AST

debugnik · 3 years ago
> unless the language compiler/interpreter uses tree-sitter for parsing

Doubtful. Last time I tried, tree-sitter would parse invalid inputs without even tagging any errors in the parse tree. For example, it would silently accept extra tokens, or keywords in the place of identifiers. Replacing the built-in lexer and then validating the parse tree for correctness would be close to writing the grammar twice.

And accepting partially correct inputs within the compiler toolchain isn't too hard, so I don't really see the advantage of agreeing on tree-sitter and not just on a parse tree representation that editors can then query, as you then suggested. If the big deal is having it execute client-side or being sandboxed, I feel that's orthogonal to parsing algorithms.

cjohansson · 3 years ago
tree-sitter is a bit better than regexps, but it is not an actual parser of grammars. A fast, actual parser of all languages for syntax coloring is the future, I think; tree-sitter is a pragmatic middle ground while we wait for the prime solution.
antipaul · 3 years ago
What's the "explain it like I'm 5 years old" (ELI5) for tree-sitter? Why should I, an emacs user but not lisp hacker, care about it?
chriswarbo · 3 years ago
tree-sitter creates parsers, e.g. for programming languages, config formats, etc.

Emacs modes can use those parsers on buffer contents, e.g. for syntax colouring/highlighting, finding matching delimiters (e.g. moving the cursor over an `if`, and having all the corresponding clauses (e.g. else/elif/fi) highlighted), for contextual editing (e.g. escaping " when inside a string), etc.

This can be remarkably tricky to get right; e.g. consider languages which can splice expressions inside strings (which can themselves contain strings, containing spliced expressions, etc.)

Using tree-sitter should make this easier and more robust (i.e. less time spent implementing parsers; more time spent implementing features!). I think it would also allow grammars to be re-used across different tools, which should improve support for obscure/niche languages.
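
A small Python sketch of the nested-string problem, using a made-up syntax where `${...}` splices an expression into a double-quoted string. The regex and the two scanner functions are hypothetical, written only for this example. The naive regex cuts the string at the first inner quote, while a scanner that tracks nesting finds the whole literal.

```python
import re

# The kind of pattern a regex-based highlighter might use for strings.
NAIVE_STRING = re.compile(r'"[^"]*"')


def scan_string(src: str, start: int) -> int:
    """Return the index just past the string literal starting at src[start]."""
    assert src[start] == '"'
    i = start + 1
    while i < len(src):
        if src.startswith("${", i):
            i = scan_interpolation(src, i + 2)  # descend into the splice
        elif src[i] == '"':
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated string")


def scan_interpolation(src: str, i: int) -> int:
    """Return the index just past the `}` that closes a splice body starting at i."""
    while i < len(src):
        if src[i] == '"':
            i = scan_string(src, i)  # a string nested inside the splice
        elif src[i] == "}":
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated interpolation")


code = 'print("a ${f("b")} c")'
naive = NAIVE_STRING.findall(code)
start = code.index('"')
nested = code[start:scan_string(code, start)]
print(naive)
print(nested)
```

Even this toy ignores escapes and braces nested inside the splice, which is a hint at how much a real grammar has to cover.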

2pEXgD0fZ5cF · 3 years ago
Does this mean that every Emacs language package would automatically make use of this once it is built in? Or will this rather make it possible to write/rewrite programming language modes so they use tree-sitter, because they can assume it is available in the default Emacs install from then on?
lawn · 3 years ago
Another useful feature is that it makes it easier to support mixing languages in the same file.

Think highlighting for html/JS/CSS in a single file or fully featured highlighting inside markdown code snippets.

giraffe_lady · 3 years ago
You know how emacs typically has the worst syntax highlighting of all mainstream editors for a given language? This makes it better.
davidkunz · 3 years ago
Congratulations, Emacs! I hope it will be as much of a success story as it was in Neovim. If more systems use it, the question "should my programming language provide a Tree-Sitter parser" becomes a no-brainer.