For those who don't already know, this is built on tree-sitter (https://tree-sitter.github.io/tree-sitter/) which does for parsing what LSP does for analysis. That is, it provides a standard interface for turning code into an AST and then making that AST available to clients like editors and diff tools. Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports. And if you're developing a new language, you can create a tree-sitter parser for it, and now every tool that speaks tree-sitter knows how to support your language.
Those 2 massive innovations are leading to an explosion of tooling improvements like this. Now every editor, diff tool, or whatever can support dozens or hundreds of languages without having to duplicate all the work of every other similar tool. That's freaking amazing.
Absolutely agreed, and copying from a comment I wrote last year: I think the fact that tree-sitter is dependency-free is worth highlighting. For context, some of my teammates maintain the OCaml tree-sitter bindings and often contribute to grammars as part of our work on Semgrep (Semgrep uses tree-sitter for searching code and parsing queries that are code snippets themselves into AST matchers).
Often when writing a linter, you need to bring along the runtime of the language you're targeting. E.g., in Python, if you're writing a parser using the built-in `ast` module, you need to match the language version & features. So you can't parse Python 3 code with Pylint running on Python 2.7, for instance. This ends up being more obnoxious than you'd think at first, especially if you're targeting multiple languages.
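A small illustration of that coupling, as a sketch rather than anything Pylint-specific: the `ast` module only accepts the syntax of the interpreter it runs on, so the mismatch also bites in the other direction, with a Python 2 print statement failing to parse on Python 3. The helper name `parses` is made up for this example.

```python
import ast

def parses(source: str) -> bool:
    """Return True if the running interpreter's own grammar accepts `source`."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Python 2 print statement: rejected when this runs on Python 3.
print(parses("print 'hello'"))
# Python 3 call syntax: accepted.
print(parses("print('hello')"))
```

A parser that ships its own grammar, independent of the host runtime, avoids this mismatch.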
Before tree-sitter, using a language's built-in AST tooling was often the best approach because it is guaranteed to keep up with the latest syntax. IMO the genius of tree-sitter is that it's made it way easier than with traditional grammars to keep the language parsers updated. Highly recommend Max Brunsfeld's Strange Loop talk if you want to learn more about the design choices behind tree-sitter: https://www.youtube.com/watch?v=Jes3bD6P0To
And this has resulted in a bunch of new tools built on tree-sitter. Off the top of my head, in addition to difftastic: neovim, Zed, Semgrep, and GitHub code search!
Okay, but how does that work with language versions? Like, if I get a "C++ parser" for tree-sitter, how do I know if it's C++03, C++17, C++20 or what? Last time I checked (which was months ago, to be fair), this wasn't documented anywhere, nor were there any apparent mechanisms to support language versions and variants.
While I agree tree-sitter is an amazing tool, I found that writing the grammar can be incredibly difficult. I tried writing a grammar and highlighting query set for VHDL with tree-sitter, and ran into a lot of difficulties expressing the VHDL grammar in it.
I don't believe this is correct - there's no such thing as "speaking tree-sitter." Every tree-sitter parser emits a different concrete syntax tree, not a standard abstract syntax tree.
LSP truly solves the problem of M editors and N languages needing M * N integrations by using a standard interface for a query-oriented compiler. Tree-sitter doesn't solve this problem; it just makes it way easier to write the N integrations for your editor/tool.
That depends on how deep you want to go with the result. I use the Nova editor which uses tree-sitter for syntax highlighting, and I've packaged several languages for it. Each time it goes like this:
1. Clone someone's tree-sitter grammar off GitHub.
2. Build it into a Mac .dylib.
3. Create a Nova extension that says "use this .dylib to highlight that language."
4. Use it.
I don't have to make any changes to Nova itself, and the amount of configuration I have to write is so tiny that Nova could have a DIY wizard if they wanted it to.
The source for Difftastic discussed here (at https://github.com/Wilfred/difftastic/blob/master/src/parse/...) is also very simple: for each of a list of supported languages, import the tree-sitter parser and wrap a teensy amount of configuration around it.
The main issue I have with tree-sitter is that its approach can’t work for many languages I care about: Common Lisp cannot be parsed without a full Lisp implementation; Haskell’s syntax is complicated enough that the grammar is incomplete; C/C++ can’t be parsed accurately if only because of the pre-processor; parsing Perl is Turing-complete, etc. I think the suggestion elsewhere makes sense: don’t make us write parsers in a new ecosystem, but instead define a format for existing parsers to produce as a side-output.
> C/C++ can’t be parsed accurately if only because of the pre-processor
Yeah, decided to check this out to see if it could help review in our massive C-based project. Unfortunately, in a recent patch, of the 90 "hunks", 88 of them had fallen back to "normal diff" because of "$N C parse errors, exceeded DFT_PARSE_ERROR_LIMIT".
Can one write a tree-sitter grammar for English (or any other natural language), that basically labels each sentence as a statement, so I can use difftastic to show changes on sentences rather than visual lines?
This is because visual line diffs for an essay are bonkers. Usually the changed sentence starts in the middle of a visual line.
The common advice[0] is to just write one sentence per line. I usually split at commas etc as well. Then use editor soft wrapping instead of fixing a maximum line length - but if your lines get longer than the screen width that might be a sign your sentences are too complex.
[0]: anyone have a good source for this? I’m not sure where I first encountered it
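For what it's worth, the sentence-level idea can be approximated without a grammar at all. Here is a rough sketch that splits prose into sentences with a naive regex and diffs those instead of visual lines. The splitter and the function names are assumptions made up for illustration; real text (abbreviations, quotes, decimals) needs something smarter.

```python
import difflib
import re

def sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sentence_diff(old: str, new: str) -> list[str]:
    """Unified diff in which each 'line' is one sentence, with no context lines."""
    return list(difflib.unified_diff(sentences(old), sentences(new),
                                     lineterm="", n=0))

old = "The cat sat on the mat. It was warm. Nobody minded."
new = "The cat sat on the mat. It was very warm. Nobody minded."
for line in sentence_diff(old, new):
    print(line)
```

Only the edited sentence shows up as removed and added; the untouched sentences on the same visual line stay out of the diff.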
As soon as you said tree sitter I immediately understood. Yes, I can’t believe I never realized that you could totally build a syntax-aware VCS on top of it. That’s brilliant.
I just wrote a language parser a few months ago in tree sitter and it’s probably the most delightful software I’ve used apart from ffmpeg.
Tree-sitter is nice, but I would like parsers that make a better effort on invalid inputs. Something like an Earley parser that maximizes some quality function. This would be useful for parsing (for example) C and C++ where the preprocessor prevents true parsing of unpreprocessed code. I understand that tree-sitter is intended for interactive use in editors where it can't spend too much time parsing.
(1) The top comment is from the author of difftastic (the subject here), saying that the treesitter Nim plugin can't be merged, because it's 60 MB of generated C source code. There's a scalability problem supporting multiple languages.
The author of Treesitter proposes using the WASM runtime, which is new.
(2) The original blog post concludes with some Treesitter issues, preferring Syntect (a Rust library that accepts Textmate grammars):
> Because of these issues I’ll evaluate what highlighter to use on a case-by-case basis, with Syntect as the default choice.
(3) The idea of a uniform api for querying syntax trees is a good one and tree-sitter deserves credit for popularizing it. It's unfortunately not a great implementation of the idea
(4) [It] segfaults constantly ... More than any NPM module I've ever used before. Any syntax that doesn't precisely match the grammar is liable to take down your entire thread.
---
I think some of the feedback was rude and harsh, and maybe even using Treesitter outside its intended use cases. But as someone who's been interested in Treesitter, but hasn't really used it, it seems real.
One problem I see is that Treesitter is meant to be incremental, so it can be used in an editor/IDE. And that's a significantly harder problem than batch syntax highlighting, parsing, semantic understanding.
---
That is, difftastic is a batch tool, i.e. you run it with git diff.
So to me the obvious thing for difftastic is to throw out the GLR algorithm, and throw out the heinous external lexers written in C that are constrained by it, and just use normal batch parsers written in whatever language, with whatever algorithm. Recursive descent.
These parsers can output a CST in the TreeSitter format, which looks pretty simple.
They don't even need to be linked into the difftastic binary -- you could emit a CST / S-expression format and match it with the text.
Unix style! Parsers can live in different binaries and still be composed.
The blog post use case can also just use batch parsers that output a CST. You don't need Treesitter's incremental features to render HTML for your blog.
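To make the "batch parser that emits an S-expression" idea concrete, here is a minimal sketch using Python's own `ast` module as the existing parser. The output shape (node name plus `@line:col-line:col` span) is an assumption invented for this example and is not tree-sitter's actual format; note also that `ast` yields an abstract tree, so comments and exact whitespace are not in it.

```python
import ast

def to_sexp(node: ast.AST) -> str:
    """Render a node as (Name@startline:startcol-endline:endcol child ...)."""
    label = type(node).__name__
    if hasattr(node, "lineno"):
        # Source span, so a consumer can match nodes back to the text.
        label += (f"@{node.lineno}:{node.col_offset}"
                  f"-{node.end_lineno}:{node.end_col_offset}")
    children = [to_sexp(child) for child in ast.iter_child_nodes(node)
                if not isinstance(child, ast.expr_context)]  # skip Load/Store markers
    return "(" + " ".join([label] + children) + ")"

print(to_sexp(ast.parse("x = 4")))
```

A separate process could print this on stdout for a diff tool to consume, which is the "parsers can live in different binaries" arrangement described above.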
As one of the harsh and rude commentators, I would say I basically agree with your interpretation. You also correctly inferred that I have experience with working with it in an area that is arguably outside of its true use case.
At the same time, I believe that there needs to be a corrective about what tree-sitter should and should not be used for. There are companies building security products on top of tree-sitter which I think is an objectively bad idea given its problems and limitations. Difftastic is to me a grey area because it could lead hypothetically to a security issue if it generates an incorrect diff due to an incorrect tree-sitter grammar. Unlikely but not impossible.
Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill (I have written a recursive descent parser for a language in C that can do 3 million lines per second on a decent laptop, which is about 60k lines per 20 ms, which is the window I look to for reactivity). How many non-generated source files exceed say 100k lines? Incremental parsing feels like taking on a lot of complexity for rather limited benefit except in fairly niche use cases (granting that one person's niche is another's common case).
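The back-of-the-envelope numbers above can be checked in a few lines. The throughput and latency budget are the figures quoted in the comment; the function names are just for illustration.

```python
LINES_PER_SECOND = 3_000_000  # throughput quoted in the comment
BUDGET_MS = 20                # reactivity window quoted in the comment

def lines_within_budget(lines_per_second: int, budget_ms: int) -> int:
    """How many lines a batch parser gets through inside the latency budget."""
    return lines_per_second * budget_ms // 1000

def full_parse_ms(file_lines: int, lines_per_second: int) -> float:
    """Time to reparse a whole file from scratch, in milliseconds."""
    return file_lines / lines_per_second * 1000

print(lines_within_budget(LINES_PER_SECOND, BUDGET_MS))
print(full_parse_ms(100_000, LINES_PER_SECOND))
```

At that rate a 100k-line file takes a little over 33 ms to reparse in full, so only files beyond roughly 60k lines miss the 20 ms window.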
That being said, it is impressive that their incremental algorithm works as well as it does but the cost is that grammar writers are forced to mold a language grammar that might not fit into the GLR algorithm. When it doesn't work as expected, which is not uncommon in my experience, the error messages are inscrutable and debugging either the generator or the generated code is nigh impossible.
Most of the happy users have no idea how the sausage is made, they just see the prettier syntax highlighting that works with multiple tools. I get that my criticism is as welcome as a wet blanket, but I just think there is something much better possible which your comment hints at.
This question is coming from a place of total ignorance:
One appeal of the general idea of a structural diff tool, for me, is ignoring the ordering of things for which ordering makes no difference.
x = 4
y = 7
are independent statements and the code will be no different if I replace those two statements with
y = 7
x = 4
However, this information is not actually present in the abstract syntax tree. If I instead consider these two statements:
x += 3
x *= 7
it is apparent that reordering them will cause changes to the meaning of the code. But as far as the AST goes, this is the same thing as the example where reordering was fine.
What kinds of things are we doing with our new AST tooling?
> x = 4
> y = 7
>
>are independent statements and the code will be no different if I replace those two statements with
>
> y = 7
> x = 4
Not always, e.g. in a multithreaded situation where x and y are shared atomics. Then, unless we authorize C++ to take more liberties in reordering, another thread will never see y as 7 while x is not yet 4 in the first example, but it can in the second. This kind of subtlety can't be determined from syntax alone.
In a sense, plain old diff is a structural diff. The grammar is a sequence of lines of characters.
All tree-sitter gives you is a _different_ grammar, so that a structural diff can operate on different trees given the same text as diff.
A parse tree still doesn't know anything about the meaning of a program, which is what you need to know in order to determine that those assignments to x and y are unordered.
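A quick way to see the "same text, different grammar" point: run the same diff algorithm over two different decompositions of the same source. This sketch uses Python's `difflib` and `tokenize` as stand-ins; it is not how difftastic or tree-sitter work internally, and the helper names are made up.

```python
import difflib
import io
import tokenize

# Token types that only describe layout (comments are dropped too).
LAYOUT = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
          tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}

def as_lines(source: str) -> list[str]:
    """The 'grammar' plain diff uses: a sequence of lines."""
    return source.splitlines()

def as_tokens(source: str) -> list[str]:
    """A different grammar for the same text: a sequence of Python tokens."""
    readline = io.StringIO(source).readline
    return [t.string for t in tokenize.generate_tokens(readline)
            if t.type not in LAYOUT]

def changes(a: list[str], b: list[str]) -> list[tuple[str, list[str], list[str]]]:
    """Non-equal opcodes from difflib, whatever the units are."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

old = "result = compute(alpha, beta)\n"
new = "result = compute(\n    alpha,\n    beta)\n"

print(changes(as_lines(old), as_lines(new)))    # line units: one replace covering everything
print(changes(as_tokens(old), as_tokens(new)))  # token units: empty list
```

The diff algorithm is identical in both calls; only the units it compares change.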
To determine this, what you want is not an AST but a Program Dependence Graph (PDG), which does encode this information. Creating one is not nearly as simple as creating an AST, and for many languages it either requires assumptions that will be broken or results in something very similar to an AST (every node has a dependency on the previous node).
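As a toy version of the dependence information a PDG carries, here is a sketch that computes the names each statement reads and writes and uses them to decide whether two statements may be swapped. It is a deliberately naive assumption-laden check: it only looks at plain variable names and ignores function calls, attributes, aliasing and threads, which is exactly where real languages force the conservative answers mentioned above.

```python
import ast

def reads_and_writes(stmt: ast.stmt) -> tuple[set[str], set[str]]:
    """Collect plain names read and written by one statement."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            # Store context counts as a write; everything else as a read.
            (writes if isinstance(node.ctx, ast.Store) else reads).add(node.id)
    # An augmented assignment such as `x += 3` also reads its target.
    if isinstance(stmt, ast.AugAssign) and isinstance(stmt.target, ast.Name):
        reads.add(stmt.target.id)
    return reads, writes

def can_swap(first: str, second: str) -> bool:
    """True if neither statement writes a name the other reads or writes."""
    r1, w1 = reads_and_writes(ast.parse(first).body[0])
    r2, w2 = reads_and_writes(ast.parse(second).body[0])
    return not (w1 & (r2 | w2)) and not (w2 & r1)

print(can_swap("x = 4", "y = 7"))
print(can_swap("x += 3", "x *= 7"))
```

The first pair is reported as swappable and the second is not, even though both pairs look alike in the syntax tree.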
How close are we to being able to copy a function in to the clipboard, then highlight some lines of code and paste the function around it (like highlight > quote marks)?
I don't know what exactly you mean by pasting a function around the selection, but you can paste selections, registers or even files at specific lines with some vim-fu. If it's generic enough you could write a function or even a keyboard shortcut if it's very simple.
I have set ",',(,[,{ in visual mode to cut the selection, insert the pair, then paste it back as a very hacky solution, but it gets the job done. If you want something more advanced to add or change anything around the selection, tpope has solved that with vim-surround[1].
> Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports.
https://github.com/ajeetdsouza/zoxide is a fantastic cd replacement, which stores where you cd to, and you can then do a partial match like "z hel" might take you to "~/projects/helloworld".
Also, cargo-binstall (cargo binary install) which allows you to not have to compile every single time you cargo install and instead allows you to just install the binaries for a specific program. It also integrates with cargo install-update.
I’m almost not sure why tools like git don’t ship with this as default. Been using difft for about a year now, and my main complaint is that it makes it hard to go back and use other diff tools when I don’t have difft available :).
I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same). It seems like an intractable problem in general but maybe it’s doable and/or useful for smaller DSLs or subsets of some languages?
I think shipping good ol' diff as the default makes sense. It's going to be there already on any system you might want to run git on, it's fast, it's tiny, and everyone knows the basics of how to use it.
> I am curious if there's been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)
We are working on https://semanticdiff.com/ which detects basic semantic changes like converting a literal from decimal to hex or reordering keys within JSON objects. It is not a command line utility but a VS Code extension and GitHub App. You can check out https://semanticdiff.com/blog/semanticdiff-vs-difftastic/ if you want to learn more about how it works and how it differs from difftastic.
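To give a feel for the two kinds of change mentioned (not how SemanticDiff itself is implemented, which this sketch makes no claim about), both checks can be approximated by comparing decoded values instead of text. The function names are hypothetical.

```python
import json

def same_json(old: str, new: str) -> bool:
    """True if both documents decode to equal values (object key order is ignored)."""
    return json.loads(old) == json.loads(new)

def same_int_literal(old: str, new: str) -> bool:
    """True if two integer literals denote the same number, e.g. '255' and '0xff'."""
    return int(old, 0) == int(new, 0)

print(same_json('{"name": "difft", "tags": ["a", "b"]}',
                '{"tags": ["a", "b"], "name": "difft"}'))
print(same_json('{"tags": ["a", "b"]}', '{"tags": ["b", "a"]}'))
print(same_int_literal("255", "0xff"))
```

Reordered keys compare equal while reordered array elements do not, since array order is significant in JSON.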
Sorry, I should've been clearer. I'm interested if there's any tool that does this kind of thing statically, without running the code. I guess a simple approach is to compile both programs and see if the generated code is the same, but I'd guess reasoning at the generated-code level will probably produce a lot more false positives (i.e. tool will report a change when there isn't one) than if you reason about the original program.
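A static middle ground between text and generated code is to compare the parsed trees themselves. This sketch uses Python's `ast` as the example language; `same_syntax_tree` is a made-up name. It treats whitespace, comments and literal spelling as noise but does no reasoning about behaviour.

```python
import ast

def same_syntax_tree(old: str, new: str) -> bool:
    """Compare two sources by their ASTs, without running either one."""
    return ast.dump(ast.parse(old)) == ast.dump(ast.parse(new))

# Different spelling of the same literal, different spacing, an added comment.
print(same_syntax_tree("x = 0xff  # mask", "x   =   255"))
# Same value at runtime, but a different tree: the parser does not fold constants.
print(same_syntax_tree("x = 1 + 2", "x = 3"))
```

The second case is where comparing compiled output would say "no change" while the tree comparison still reports one, so the two levels disagree in both directions.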
Some linters and formatters are effectively compilers already, so that doesn’t seem completely implausible in itself. Finding canonical representations of common coding patterns so you can quickly and reliably determine that they are equivalent is a different question, though.
I'm trying difft for git now and I also really like it. One reason why I think it shouldn't be the default is that it hides whitespace differences. Maybe that's configurable though, haven't looked into that
> I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)
So when using such a diff tool you can spend hours refactoring something, and then git will refuse to commit your changes because your refactoring was successful in not changing the behavior of the code? I understand what you mean, but if we arrive at that point maybe we should stop calling it "diff", to avoid confusion...
20 years staring at it and no, i don't. i usually have to work it out from context. i think if you have vi or ed sensory organs it might work better for ya. which.. i still chuckle that vi is the visual editor, bc ed lol
It seems like a major lapse in product innovation that Github has not come out with something like this. They don't even have something to help you when the indentation changes, they usually just show it as a giant add & remove. Their diff viewer can and should be smarter.
Github can't even recognize syntax, let alone provide semantic diffs! In fact, Github can't even tell that foo.cpp.in is different from foo.mk.in! Any foo.t is declared to be Perl, with no way to fix it… There are decade-old tickets!
Tree-sitter optimizes for performance (to use in editors), not for correctness. In fact even TS' core developers advocate for not bothering too much with correctness of grammars[1]. I imagine this constraint would be a deal-breaker for GitHub or anyone else in their position.
[1] https://www.masteringemacs.org/article/how-to-get-started-tr...
https://news.ycombinator.com/item?id=39762495
https://www.jonashietala.se/blog/2024/03/19/lets_create_a_tr...
Other feedback:
[1]: https://github.com/tpope/vim-surround
Built on the shoulders of giants.
It feels kind of as foundational as YACC.
https://mise.jdx.dev/ mise-en-place, a drop-in replacement for asdf https://asdf-vm.com/ that is really fast and flexible.
https://github.com/bootandy/dust is a complement to "du", shows which directories are using the most disk space.
- https://github.com/eza-community/eza (ls with some added visual sugar)
- https://github.com/ClementTsang/bottom (htop but with graphs)
- https://github.com/sharkdp/bat (cat with syntax highlight)
Looks great, thank you for the recommendation.
Now, it works great! I tried it with a mise update, and it pulled down the right binary with no proxy problems.
Thank you for the reminder and recommendation, much appreciated!!
But I'm glad it's easy to change that default.
if you do this your difftool becomes a compiler
Preach!
Just dropped it in and did a git diff.. works like a charm!
Do you not?
[1] https://github.com/tree-sitter/tree-sitter/issues/130#issuec...
[0]: https://github.com/jeffkaufman/icdiff
This just does diff but not merge, but at least it's open source - and the diffs look a lot nicer, I've already made it my default.
Any plans to extend it to merging?
[1] https://docs.plasticscm.com/semanticmerge
The GitHub readme:
> Can difftastic do merges?
> No. AST merging is a hard problem that difftastic does not address.
> AST diffing is a also lossy process from the perspective of a text diff. Difftastic will ignore whitespace that isn't syntactically significant, but merging requires tracking whitespace.
https://news.ycombinator.com/item?id=27768861 (297 points | 3 years ago | 61 comments)
https://news.ycombinator.com/item?id=32746258 (698 points | 2 years ago | 90 comments)
https://news.ycombinator.com/item?id=30841244 (983 points | 2 years ago | 219 comments)