Difftastic, a structural diff tool that understands syntax

For those who don't already know, this is built on tree-sitter (https://tree-sitter.github.io/tree-sitter/) which does for parsing what LSP does for analysis. That is, it provides a standard interface for turning code into an AST and then making that AST available to clients like editors and diff tools. Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports. And if you're developing a new language, you can create a tree-sitter parser for it, and now every tool that speaks tree-sitter knows how to support your language.

Those 2 massive innovations are leading to an explosion of tooling improvements like this. Now every editor, diff tool, or whatever can support dozens or hundreds of languages without having to duplicate all the work of every other similar tool. That's freaking amazing.

ievans · a year ago

Absolutely agreed, and copying from a comment I wrote last year: I think the fact that tree-sitter is dependency-free is worth highlighting. For context, some of my teammates maintain the OCaml tree-sitter bindings and often contribute to grammars as part of our work on Semgrep (Semgrep uses tree-sitter for searching code and parsing queries that are code snippets themselves into AST matchers).

Often when writing a linter, you need to bring along the runtime of the language you're targeting. E.g., in Python, if you're writing a parser using the builtin `ast` module, you need to match the language version & features. So you can't parse Python 3 code with Pylint running on Python 2.7, for instance. This ends up being more obnoxious than you'd think at first, especially if you're targeting multiple languages.
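To make that coupling concrete, here is a minimal sketch using only the standard library. The helper name `can_parse` is made up for illustration. The point is that `ast.parse` only accepts the grammar of the interpreter that is running it, so the same problem shows up in the other direction too: a Python 3 process rejects Python 2's `print` statement.

```python
import ast


def can_parse(source):
    """Return True if the running interpreter's own grammar accepts `source`."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


py2_style = 'print "hello"'    # Python 2 print statement
py3_style = 'print("hello")'   # Python 3 print function call

print(can_parse(py2_style), can_parse(py3_style))
```

On a Python 3 interpreter the first snippet is rejected and the second is accepted, which is exactly why a linter built on `ast` has to run on a matching interpreter.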

Before tree-sitter, using a language's built-in AST tooling was often the best approach because it is guaranteed to keep up with the latest syntax. IMO the genius of tree-sitter is that it's made it way easier than with traditional grammars to keep the language parsers updated. Highly recommend Max Brunsfeld's Strange Loop talk if you want to learn more about the design choices behind tree-sitter: https://www.youtube.com/watch?v=Jes3bD6P0To

And this has resulted in a bunch of new tools built on tree-sitter. Off the top of my head, in addition to difftastic: neovim, Zed, Semgrep, and GitHub code search!

drcongo · a year ago

Don't forget Zed! https://zed.dev

TeMPOraL · a year ago

Okay, but how does that work with language versions? Like, if I get a "C++ parser" for tree-sitter, how do I know if it's C++03, C++17, C++21 or what? Last time I checked (which was months ago, to be fair), this wasn't documented anywhere, nor were there any apparent mechanisms to support language versions and variants.

ossusermivami · a year ago

don't forget old man emacs is now using tree sitter

jrave · a year ago

helix (https://helix-editor.com/) is using treesitter and LSP as well

bfrog · a year ago

While I agree tree-sitter is an amazing tool, I found that writing the grammar out can be incredibly difficult. I tried writing a grammar and highlighting query set for VHDL with tree-sitter, and found that there were a lot of difficulties in expressing VHDL grammar in tree-sitter.

kstrauser · a year ago

No argument from me on that. The upside is that one person, somewhere, has to get it right one time and then we can all use it.

duped · a year ago

I don't believe this is correct - there's no such thing as "speaking tree-sitter." Every tree-sitter parser emits a different concrete syntax tree, not a standard abstract syntax tree.

LSP truly solves the M editors to N languages needing M * N many integrations by using a standard interface for a query oriented compiler. Tree sitter doesn't solve this problem, it just makes it way easier to write N many integrations for your editor/tool.
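The M * N point is just arithmetic, but a tiny sketch makes the gap visible. The function names and the example counts (5 editors, 20 languages) are invented for illustration and are not taken from any real survey.

```python
def integrations_without_protocol(editors, languages):
    """Every editor needs its own integration for every language."""
    return editors * languages


def integrations_with_protocol(editors, languages):
    """Each editor and each language implements the shared protocol once."""
    return editors + languages


# Hypothetical numbers, chosen only to show the scale of the difference.
print(integrations_without_protocol(5, 20))
print(integrations_with_protocol(5, 20))
```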

kstrauser · a year ago

That depends on how deep you want to go with the result. I use the Nova editor which uses tree-sitter for syntax highlighting, and I've packaged several languages for it. Each time it goes like this:

1. Clone someone's tree-sitter grammar off GitHub.

2. Build it into a Mac .dylib.

3. Create a Nova extension that says "use this .dylib to highlight that language."

4. Use it.

I don't have to make any changes to Nova itself, and the amount of configuration I have to write is so tiny that Nova could have a DIY wizard if they wanted it to.

The source for Difftastic discussed here (at https://github.com/Wilfred/difftastic/blob/master/src/parse/...) is also very simple: for each of a list of supported languages, import the tree-sitter parser and wrap a teensy amount of configuration around it.

fiddlerwoaroof · a year ago

The main issue I have with tree-sitter is that its approach can’t work for many languages I care about: Common Lisp cannot be parsed without a full Lisp implementation; Haskell’s syntax is complicated enough that the grammar is incomplete; C/C++ can’t be parsed accurately if only because of the pre-processor; parsing Perl is Turing-complete, etc. I think the suggestion elsewhere makes sense: don’t make us write parsers in a new ecosystem, but instead define a format for existing parsers to produce as a side-output.

gwd · a year ago

> C/C++ can’t be parsed accurately if only because of the pre-processor

Yeah, decided to check this out to see if it could help review in our massive C-based project. Unfortunately, in a recent patch, of the 90 "hunks", 88 of them had fallen back to "normal diff" because "$N C parse errors, exceeded DFT_PARSE_ERROR_LIMIT".

amelius · a year ago

C++ also can't be parsed like that because you need to process a declaration before you know what role a symbol plays in the grammar.

abdullahkhalids · a year ago

Can one write a tree-sitter grammar for English (or any other natural language), that basically labels each sentence as a statement, so I can use difftastic to show changes on sentences rather than visual lines?

This is because visual line diffs for an essay are bonkers. Usually the changed sentence starts in the middle of a visual line.

pxeger1 · a year ago

The common advice[0] is to just write one sentence per line. I usually split at commas etc as well. Then use editor soft wrapping instead of fixing a maximum line length - but if your lines get longer than the screen width that might be a sign your sentences are too complex.

[0]: anyone have a good source for this? I’m not sure where I first encountered it
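A rough sketch of what that buys you, using only the standard library. The splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace, so abbreviations like "e.g." will confuse it), and the function name and sample text are made up for illustration.

```python
import difflib
import re


def one_sentence_per_line(text):
    """Naively split prose into sentences, one list entry per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


old = "The cat sat. It was warm. The end."
new = "The cat sat. It was very warm. The end."

# With one sentence per line, an ordinary line diff isolates the changed sentence.
diff = list(
    difflib.unified_diff(
        one_sentence_per_line(old),
        one_sentence_per_line(new),
        lineterm="",
        n=0,  # no context lines, just the change itself
    )
)
print("\n".join(diff))
```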

fragmede · a year ago

word diff gets you halfway without that complexity

danielvaughn · a year ago

As soon as you said tree sitter I immediately understood. Yes, I can’t believe I never realized that you could totally build a syntax-aware VCS on top of it. That’s brilliant.

I just wrote a language parser a few months ago in tree sitter and it’s probably the most delightful software I’ve used apart from ffmpeg.

bonki · a year ago

The endless capabilities of ffmpeg are delightful indeed, but its use? Forgive me but what are you smoking?

emporas · a year ago

Was reading about emacs and tree-sitter today [1]. Tree-sitter is a force to be reckoned with.

[1] https://www.masteringemacs.org/article/how-to-get-started-tr...

pfdietz · a year ago

Tree-sitter is nice, but I would like parsers that make a better effort on invalid inputs. Something like an Earley parser that maximizes some quality function. This would be useful for parsing (for example) C and C++ where the preprocessor prevents true parsing of unpreprocessed code. I understand that tree-sitter is intended for interactive use in editors where it can't spend too much time parsing.

chubot · a year ago

BTW there is interesting feedback from 4 people on a Treesitter post yesterday:

https://news.ycombinator.com/item?id=39762495

(1) The top comment is from the author of difftastic (the subject here), saying that the Treesitter Nim plugin can't be merged, because it's 60 MB of generated C source code. There's a scalability problem supporting multiple languages.

The author of Treesitter proposes using the WASM runtime, which is new.

(2) The original blog post concludes with some Treesitter issues, preferring Syntect (a Rust library that accepts Textmate grammars)

Because of these issues I’ll evaluate what highlighter to use on a case-by-case basis, with Syntect as the default choice.

https://www.jonashietala.se/blog/2024/03/19/lets_create_a_tr...

Other feedback:

(3) The idea of a uniform api for querying syntax trees is a good one and tree-sitter deserves credit for popularizing it. It's unfortunately not a great implementation of the idea

(4) [It] segfaults constantly ... More than any NPM module I've ever used before. Any syntax that doesn't precisely match the grammar is liable to take down your entire thread.

---

I think some of the feedback was rude and harsh, and maybe even came from using Treesitter outside its intended use cases. But as someone who's been interested in Treesitter, but hasn't really used it, the problems seem real to me.

One problem I see is that Treesitter is meant to be incremental, so it can be used in an editor/IDE. And that's a significantly harder problem than batch syntax highlighting, parsing, semantic understanding.

---

That is, difftastic is a batch tool, i.e. you run it with git diff.

So to me the obvious thing for difftastic is to throw out the GLR algorithm, and throw out the heinous external lexers written in C that are constrained by it, and just use normal batch parsers written in whatever language, with whatever algorithm. Recursive descent.

These parsers can output a CST in the TreeSitter format, which looks pretty simple.

They don't even need to be linked into the difftastic binary -- you could emit a CST / S-expression format and match it with the text.

Unix style! Parsers can live in different binaries and still be composed.
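As a rough illustration of the "batch parser that prints a tree" idea, here is a tiny recursive descent parser for integer arithmetic that emits an S-expression. Everything here is an assumption made for the sketch: the toy grammar, the node names (`binary`, `paren`, `number`) and the output layout are invented and are not the actual Treesitter format.

```python
import re

# One token per match: an integer or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(src):
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.i += 1
        return tok

    def expr(self):
        # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = f"(binary {node} {op} {self.term()})"
        return node

    def term(self):
        # term := atom (('*' | '/') atom)*
        node = self.atom()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = f"(binary {node} {op} {self.atom()})"
        return node

    def atom(self):
        # atom := NUMBER | '(' expr ')'
        tok = self.take()
        if tok == "(":
            inner = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return f"(paren {inner})"
        if tok is not None and tok.isdigit():
            return f"(number {tok})"
        raise SyntaxError(f"unexpected token {tok!r}")


def to_sexp(src):
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at token {parser.peek()!r}")
    return tree


print(to_sexp("1 + 2 * 3"))
```

A real version would also need to record byte offsets for each node so the tree can be matched back to the text, which this sketch leaves out.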

The blog post use case can also just use batch parsers that output a CST. You don't need Treesitter's incremental features to render HTML for your blog.

diffxx · a year ago

As one of the harsh and rude commentators, I would say I basically agree with your interpretation. You also correctly inferred that I have experience with working with it in an area that is arguably outside of its true use case.

At the same time, I believe that there needs to be a corrective about what tree-sitter should and should not be used for. There are companies building security products on top of tree-sitter which I think is an objectively bad idea given its problems and limitations. Difftastic is to me a grey area because it could lead hypothetically to a security issue if it generates an incorrect diff due to an incorrect tree-sitter grammar. Unlikely but not impossible.

Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill (I have written a recursive descent parser for a language in C that can do 3 million lines per second on a decent laptop, which is about 60k lines per 20 ms, which is the window I look to for reactivity). How many non-generated source files exceed, say, 100k lines? Incremental parsing feels like taking on a lot of complexity for rather limited benefit except in fairly niche use cases (granting that one person's niche is another's common case).
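The back-of-the-envelope numbers can be checked in a few lines. The throughput and the 20 ms window are the figures claimed above; the helper names are made up, and the 100k-line file is just the threshold mentioned in the question.

```python
LINES_PER_SECOND = 3_000_000  # parser throughput claimed above
BUDGET_MS = 20                # reactivity window claimed above


def lines_within_budget(lines_per_second, budget_ms):
    """How many lines a full re-parse covers inside the time budget."""
    return lines_per_second * budget_ms // 1000


def full_parse_ms(total_lines, lines_per_second):
    """Time in milliseconds to re-parse a whole file from scratch."""
    return total_lines * 1000 / lines_per_second


print(lines_within_budget(LINES_PER_SECOND, BUDGET_MS))
print(full_parse_ms(100_000, LINES_PER_SECOND))
```

So at that throughput a 100k-line file would take roughly 33 ms to re-parse in full, a bit over the 20 ms window, while anything under 60k lines fits inside it.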

That being said, it is impressive that their incremental algorithm works as well as it does but the cost is that grammar writers are forced to mold a language grammar that might not fit into the GLR algorithm. When it doesn't work as expected, which is not uncommon in my experience, the error messages are inscrutable and debugging either the generator or the generated code is nigh impossible.

Most of the happy users have no idea how the sausage is made, they just see the prettier syntax highlighting that works with multiple tools. I get that my criticism is as welcome as a wet blanket, but I just think there is something much better possible which your comment hints at.

thaumasiotes · a year ago

This question is coming from a place of total ignorance:

One appeal of the general idea of a structural diff tool, for me, is ignoring the ordering of things for which ordering makes no difference.

    x = 4
    y = 7

are independent statements and the code will be no different if I replace those two statements with

    y = 7
    x = 4

However, this information is not actually present in the abstract syntax tree. If I instead consider these two statements:

    x += 3
    x *= 7

it is apparent that reordering them will cause changes to the meaning of the code. But as far as the AST goes, this is the same thing as the example where reordering was fine.
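A small sketch of the gap, using Python's own `ast` module as a stand-in for any syntax tree. The helper names and the starting value `x = 1` are assumptions made for the demo. Syntactically both pairs are just two sibling statement nodes, and only running them shows that order matters for one pair and not the other.

```python
import ast


def statement_kinds(stmts):
    """Node type of each top-level statement: all the tree tells us here."""
    tree = ast.parse("\n".join(stmts))
    return [type(node).__name__ for node in tree.body]


def run(src):
    """Execute the snippet with x starting at 1 and return the final variables."""
    namespace = {"x": 1}
    exec(src, {}, namespace)
    return namespace


def order_matters(stmts):
    forward = run("\n".join(stmts))
    backward = run("\n".join(reversed(stmts)))
    return forward != backward


independent = ["x = 4", "y = 7"]
dependent = ["x += 3", "x *= 7"]

print(statement_kinds(independent), order_matters(independent))
print(statement_kinds(dependent), order_matters(dependent))
```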

What kinds of things are we doing with our new AST tooling?

etbebl · a year ago

>     x = 4
>     y = 7
>
> are independent statements and the code will be no different if I replace those two statements with
>
>     y = 7
>     x = 4

Not always, e.g. in a multi-threaded situation where x and y are shared atomics. Then, unless we authorize C++ to take more liberties in reordering, another thread will never see y as 7 while x is not yet 4 in the first example, but it can in the second. This kind of subtlety can't be determined from syntax alone.

MathMonkeyMan · a year ago

In a sense, plain old diff is a structural diff. The grammar is a sequence of lines of characters.

All tree-sitter gives you is a _different_ grammar, so that a structural diff can operate on different trees given the same text as diff.

A parse tree still doesn't know anything about the meaning of a program, which is what you need to know in order to determine that those assignments to x and y are unordered.

libre-man · a year ago

To determine this you want not an AST but a Program Dependence Graph (PDG), which does encode this information. Creating one is nowhere near as simple as creating an AST, and for many languages it either requires assumptions that will be broken or results in something very similar to an AST (every node has a dependency on the previous node).
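For a feel of what the data-dependence part involves, here is a deliberately simplified sketch for straight-line Python statements. It is nowhere near a real PDG: it only looks at plain variable names and ignores calls, attributes, aliasing, control flow and threads, which are exactly the "assumptions that will be broken". The function names are made up.

```python
import ast


def reads_and_writes(stmt_src):
    """Collect the plain variable names a single statement reads and writes."""
    tree = ast.parse(stmt_src)
    reads, writes = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
        # `x += 3` stores to x but also reads its old value.
        if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            reads.add(node.target.id)
    return reads, writes


def can_swap(first, second):
    """True if neither statement writes a name the other one reads or writes."""
    reads_a, writes_a = reads_and_writes(first)
    reads_b, writes_b = reads_and_writes(second)
    conflict = (writes_a & (reads_b | writes_b)) or (writes_b & reads_a)
    return not conflict


print(can_swap("x = 4", "y = 7"))
print(can_swap("x += 3", "x *= 7"))
print(can_swap("x = 4", "y = x + 1"))
```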

joshspankit · a year ago

How close are we to being able to copy a function into the clipboard, then highlight some lines of code and paste the function around it (like highlight > quote marks)?

worksonmine · a year ago

I don't know what exactly you mean by pasting a function around the selection, but you can paste selections, registers or even files at specific lines with some vim-fu. If it's generic enough you could write a function or even a keyboard shortcut if it's very simple.

I have set ",',(,[,{ in visual mode to cut the selection, insert the pairs, then paste it back as a very hacky solution, but it gets the job done. If you want something more advanced to add or change anything around the selection, tpope has solved that with vim-surround[1].

[1]: https://github.com/tpope/vim-surround

OJFord · a year ago

What does it mean to paste a function around some lines of code? As in what're the manual steps you do because that's not possible today?

fransje26 · a year ago

> Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports.

Built on the shoulders of giants.

epistasis · a year ago

I'm imagining what I could have done in my compilers class with something like tree-sitter...

It feels kind of as foundational as YACC.

ivanjermakov · a year ago

It is literally an alternative to YACC and other parser generators.

mlavrent · a year ago

I’m almost not sure why tools like git don’t ship with this as default. Been using difft for about a year now, and my main complaint is that it makes it hard to go back and use other diff tools when I don’t have difft available :).

I am curious if there’s been any work on _semantic_ diff tools as well (for when e.g. the syntax changes but the meaning is the same). It seems like an intractable problem in general, but maybe it’s doable and/or useful for smaller DSLs or subsets of some languages?

kstrauser · a year ago

I think shipping good ol' diff as the default makes sense. It's going to be there already on any system you might want to run git on, it's fast, it's tiny, and everyone knows the basics of how to use it.

But I'm glad it's easy to change that default.

DarkPlayer · a year ago

> I am curious if there's been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)

We are working on https://semanticdiff.com/ which detects basic semantic changes like converting a literal from decimal to hex or reordering keys within JSON objects. It is not a command line utility but a VS Code extension and GitHub App. You can check out https://semanticdiff.com/blog/semanticdiff-vs-difftastic/ if you want to learn more about how it works and how it differs from difftastic.
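To give a flavor of those two kinds of change, here is a toy sketch with the Python standard library. It is not how SemanticDiff is implemented (the blog post linked above describes that), and the function names are made up. It only shows that both changes disappear once you compare parsed values instead of text.

```python
import ast
import json


def same_json(a, b):
    """Compare two JSON documents by value, so object key order is ignored."""
    return json.loads(a) == json.loads(b)


def same_python_ast(a, b):
    """Compare two Python snippets by syntax tree, ignoring literal spelling."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


print(same_json('{"a": 1, "b": 2}', '{"b": 2, "a": 1}'))
print(same_python_ast("x = 255", "x = 0xff"))
print(same_python_ast("x = 255", "x = 256"))
```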

Izikiel43 · a year ago

Thank you, you just simplified my life greatly, will use it for a demo tomorrow.

otherjason · a year ago

Difftastic is a useful tool, but in my experience, it's far too slow to be suitable as the default selection for a ubiquitous tool like git.

drcongo · a year ago

I'm finding it instantaneous here on a large dirty codebase. In what way is it slow for you?

ruined · a year ago

>I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same).

if you do this your difftool becomes a compiler

mlavrent · a year ago

Sorry, I should've been clearer. I'm interested if there's any tool that does this kind of thing statically, without running the code. I guess a simple approach is to compile both programs and see if the generated code is the same, but I'd guess reasoning at the generated-code level will probably produce a lot more false positives (i.e. the tool will report a change when there isn't one) than if you reason about the original program.
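A minimal sketch of that "compile both and compare" approach, using CPython bytecode as the stand-in for generated code. The helper names are invented, and the fingerprint (bytecode, constants, names) is an assumption about what is worth comparing for small top-level snippets. It shows both sides: formatting and comment changes vanish, but a meaning-preserving rewrite like swapping the operands of an integer addition is still reported as a change.

```python
def fingerprint(source):
    """Compile a snippet and keep the parts of the code object that matter here."""
    code = compile(source, "<snippet>", "exec")
    return (code.co_code, code.co_consts, code.co_names)


def same_generated_code(a, b):
    return fingerprint(a) == fingerprint(b)


# Whitespace and comments do not survive compilation.
print(same_generated_code("x = 1", "x   =   1  # set x"))

# Equivalent for numbers, but the compiled output differs: a false positive.
print(same_generated_code("y = a + b", "y = b + a"))
```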

Chris_Newton · a year ago

> if you do this your difftool becomes a compiler

Some linters and formatters are effectively compilers already, so that doesn’t seem completely implausible in itself. Finding canonical representations of common coding patterns so you can quickly and reliably determine that they are equivalent is a different question, though.

hobs · a year ago

That's exactly what I have done with diffing SQL in lazy mode - just use a server and diff the AST/plan.

more-coffee · a year ago

I'm trying difft for git now and I also really like it. One reason why I think it shouldn't be the default is that it hides whitespace differences. Maybe that's configurable though, haven't looked into that.

rob74 · a year ago

> I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)

So when using such a diff tool you can spend hours refactoring something, and then git will refuse to commit your changes because your refactoring was successful in not changing the behavior of the code? I understand what you mean, but if we arrive at that point maybe we should stop calling it "diff", to avoid confusion...

kstrauser · a year ago

Git doesn't use the output of `diff` to determine whether anything has changed.
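As far as I understand git's object model, it identifies file contents by hash: in a SHA-1 repository a blob's ID is the SHA-1 of a small header plus the raw bytes, which is what `git hash-object` prints. A sketch of that, with a made-up function name:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 blob ID git assigns to these bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


before = b"x = 4\ny = 7\n"
after = b"y = 7\nx = 4\n"

# Any byte-level change yields a different ID, whatever a diff tool decides to show.
print(git_blob_id(before))
print(git_blob_id(after))
print(git_blob_id(before) != git_blob_id(after))
```

So a refactoring that a semantic diff displays as "no change" still produces new blob IDs and can be committed as usual.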