kstrauser · a year ago
For those who don't already know, this is built on tree-sitter (https://tree-sitter.github.io/tree-sitter/) which does for parsing what LSP does for analysis. That is, it provides a standard interface for turning code into an AST and then making that AST available to clients like editors and diff tools. Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports. And if you're developing a new language, you can create a tree-sitter parser for it, and now every tool that speaks tree-sitter knows how to support your language.

Those 2 massive innovations are leading to an explosion of tooling improvements like this. Now every editor, diff tool, or whatever can support dozens or hundreds of languages without having to duplicate all the work of every other similar tool. That's freaking amazing.

ievans · a year ago
Absolutely agreed, and copying from a comment I wrote last year: I think the fact that tree-sitter is dependency-free is worth highlighting. For context, some of my teammates maintain the OCaml tree-sitter bindings and often contribute to grammars as part of our work on Semgrep (Semgrep uses tree-sitter for searching code and parsing queries that are code snippets themselves into AST matchers).

Often when writing a linter, you need to bring along the runtime of the language you're targeting. E.g., in Python, if you're writing a parser using the built-in `ast` module, you need to match the language version & features. So you can't parse Python 3 code with Pylint running on Python 2.7, for instance. This ends up being more obnoxious than you'd think at first, especially if you're targeting multiple languages.
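To make the version coupling concrete, here is a minimal sketch using only Python's standard `ast` module. The helper name `can_parse` is made up for illustration, and the example assumes the script itself runs under Python 3.

```python
import ast


def can_parse(source: str) -> bool:
    """Return True if the running interpreter's own grammar accepts `source`."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# A Python 2 print statement is rejected by a Python 3 interpreter's parser.
py2_style_ok = can_parse("print 'hello'")
# The Python 3 function-call form is accepted.
py3_style_ok = can_parse("print('hello')")
```

The parser here is whatever the host interpreter ships with, which is exactly why a linter built this way is tied to the runtime it runs on.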

Before tree-sitter, using a language's built-in AST tooling was often the best approach because it is guaranteed to keep up with the latest syntax. IMO the genius of tree-sitter is that it's made it way easier than with traditional grammars to keep the language parsers updated. Highly recommend Max Brunsfeld's Strange Loop talk if you want to learn more about the design choices behind tree-sitter: https://www.youtube.com/watch?v=Jes3bD6P0To

And this has resulted in a bunch of new tools built on tree-sitter. Off the top of my head, in addition to difftastic: neovim, Zed, Semgrep, and GitHub code search!

drcongo · a year ago
Don't forget Zed! https://zed.dev
TeMPOraL · a year ago
Okay, but how does that work with language versions? Like, if I get a "C++ parser" for tree-sitter, how do I know if it's C++03, C++17, C++21 or what? Last time I checked (which was months ago, to be fair), this wasn't documented anywhere, nor were there any apparent mechanisms to support language versions and variants.
ossusermivami · a year ago
don't forget old man emacs is now using tree sitter
jrave · a year ago
helix (https://helix-editor.com/) is using treesitter and LSP as well
bfrog · a year ago
While I agree tree-sitter is an amazing tool, I found that writing the grammar out can be incredibly difficult. I tried writing out a grammar and highlighting query set for VHDL with tree-sitter, and found that there were a lot of difficulties in expressing VHDL grammar in tree-sitter.
kstrauser · a year ago
No argument from me on that. The upside is that one person, somewhere, has to get it right one time and then we can all use it.
duped · a year ago
I don't believe this is correct - there's no such thing as "speaking tree-sitter." Every tree-sitter parser emits a different concrete syntax tree, not a standard abstract syntax tree.

LSP truly solves the M editors to N languages needing M * N many integrations by using a standard interface for a query oriented compiler. Tree sitter doesn't solve this problem, it just makes it way easier to write N many integrations for your editor/tool.
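The M * N versus M + N arithmetic can be sketched as below. The counts (10 editors, 50 languages) are arbitrary numbers chosen for illustration, not figures from this thread.

```python
def integrations_without_protocol(editors: int, languages: int) -> int:
    """Every editor needs its own integration for every language."""
    return editors * languages


def integrations_with_protocol(editors: int, languages: int) -> int:
    """Each editor and each language implements the shared protocol once."""
    return editors + languages


# Hypothetical ecosystem size, for illustration only.
pairwise = integrations_without_protocol(10, 50)  # 500
shared = integrations_with_protocol(10, 50)       # 60
```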

kstrauser · a year ago
That depends on how deep you want to go with the result. I use the Nova editor which uses tree-sitter for syntax highlighting, and I've packaged several languages for it. Each time it goes like this:

1. Clone someone's tree-sitter grammar off GitHub.

2. Build it into a Mac .dylib.

3. Create a Nova extension that says "use this .dylib to highlight that language."

4. Use it.

I don't have to make any changes to Nova itself, and the amount of configuration I have to write is so tiny that Nova could have a DIY wizard if they wanted it to.

The source for Difftastic discussed here (at https://github.com/Wilfred/difftastic/blob/master/src/parse/...) is also very simple: for each of a list of supported languages, import the tree-sitter parser and wrap a teensy amount of configuration around it.
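The "list of languages plus a teensy amount of configuration" shape can be sketched roughly like this. It is not Difftastic's actual code (which is Rust); `LanguageConfig`, `REGISTRY`, `config_for`, and the field names and node names are all hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical per-language wrapper around a prebuilt grammar."""
    parser_name: str                      # which grammar to load
    extensions: tuple                     # file suffixes handled by it
    atom_nodes: frozenset = frozenset()   # node kinds treated as indivisible


# One small entry per supported language; no parser logic lives here.
REGISTRY = [
    LanguageConfig("python", (".py",), frozenset({"string"})),
    LanguageConfig("rust", (".rs",), frozenset({"string_literal"})),
    LanguageConfig("json", (".json",), frozenset({"string"})),
]


def config_for(filename: str):
    """Pick the first registered language whose extension matches, else None."""
    for config in REGISTRY:
        if filename.endswith(config.extensions):
            return config
    return None
```

Adding a language under this scheme means appending one entry, which matches the experience described above of packaging grammars without touching the host tool.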

fiddlerwoaroof · a year ago
The main issue I have with tree-sitter is that its approach can’t work for many languages I care about: Common Lisp cannot be parsed without a full Lisp implementation; Haskell’s syntax is complicated enough that the grammar is incomplete; C/C++ can’t be parsed accurately if only because of the pre-processor; parsing Perl is Turing-complete, etc. I think the suggestion elsewhere makes sense: don’t make us write parsers in a new ecosystem, but instead define a format for existing parsers to produce as a side-output.
gwd · a year ago
> C/C++ can’t be parsed accurately if only because of the pre-processor

Yeah, I decided to check this out to see if it could help with review in our massive C-based project. Unfortunately, in a recent patch, of the 90 "hunks", 88 of them had fallen back to "normal diff" because of "$N C parse errors, exceeded DFT_PARSE_ERROR_LIMIT".

amelius · a year ago
C++ also can't be parsed like that because you need to process a declaration before you know what role a symbol plays in the grammar.
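The classic instance of this is `a * b;`, which is a pointer declaration if `a` names a type and a multiplication otherwise. A toy sketch of that dependency, where `classify` and `known_types` are hypothetical names and the "parser" handles only this one statement shape:

```python
def classify(statement: str, known_types: set) -> str:
    """Decide how to read 'left * right;' given the type names seen so far."""
    left, _, right = statement.rstrip(";").partition("*")
    left, right = left.strip(), right.strip()
    if left in known_types:
        return f"declaration of pointer '{right}'"
    return f"multiplication of '{left}' by '{right}'"


# Same text, different parse, depending on earlier declarations.
as_declaration = classify("a * b;", known_types={"a"})
as_expression = classify("a * b;", known_types=set())
```

A grammar that sees only the text of the statement has no access to `known_types`, so it has to guess.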
abdullahkhalids · a year ago
Can one write a tree-sitter grammar for English (or any other natural language), that basically labels each sentence as a statement, so I can use difftastic to show changes on sentences rather than visual lines?

This is because visual line diffs for an essay are bonkers. Usually the changed sentence starts in the middle of a visual line.

pxeger1 · a year ago
The common advice[0] is to just write one sentence per line. I usually split at commas etc as well. Then use editor soft wrapping instead of fixing a maximum line length - but if your lines get longer than the screen width that might be a sign your sentences are too complex.

[0]: anyone have a good source for this? I’m not sure where I first encountered it
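For existing prose, a naive reflow into one sentence per line could look like the sketch below. The function name is made up, and the regex simply splits after `.`, `!`, or `?` followed by whitespace, so abbreviations such as "e.g. this" will be split incorrectly.

```python
import re


def one_sentence_per_line(text: str) -> str:
    """Naively put each sentence on its own line (breaks on abbreviations)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(sentences)


reflowed = one_sentence_per_line("First point. Second point! Third?")
```

After this, an ordinary line-based diff reports changes per sentence rather than per wrapped visual line.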

fragmede · a year ago
word diff gets you halfway without that complexity
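As a rough illustration of what a word-level diff computes, here is a sketch using Python's standard `difflib` (not git's own `--word-diff` implementation). The helper name `word_diff` is hypothetical.

```python
import difflib


def word_diff(old: str, new: str) -> list:
    """Return (operation, old_words, new_words) for each changed run of words."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


changes = word_diff("the quick brown fox", "the quick red fox")
```

Because the text is split on whitespace first, where the lines happen to wrap no longer affects the result.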
danielvaughn · a year ago
As soon as you said tree sitter I immediately understood. Yes, I can’t believe I never realized that you could totally build a syntax-aware VCS on top of it. That’s brilliant.

I just wrote a language parser a few months ago in tree sitter and it’s probably the most delightful software I’ve used apart from ffmpeg.

bonki · a year ago
The endless capabilities of ffmpeg are delightful indeed, but its use? Forgive me but what are you smoking?
emporas · a year ago
Was reading about emacs and tree-sitter today [1]. Tree-sitter is a force to be reckoned with.

[1] https://www.masteringemacs.org/article/how-to-get-started-tr...

pfdietz · a year ago
Tree-sitter is nice, but I would like parsers that make a better effort on invalid inputs. Something like an Earley parser that maximizes some quality function. This would be useful for parsing (for example) C and C++ where the preprocessor prevents true parsing of unpreprocessed code. I understand that tree-sitter is intended for interactive use in editors where it can't spend too much time parsing.
chubot · a year ago
BTW there is interesting feedback from 4 people on a Treesitter post yesterday:

https://news.ycombinator.com/item?id=39762495

(1) The top comment is from the author of difftastic (the subject here), saying that treesitter Nim plugin can't be merged, because it's 60 MB of generated C source code. There's a scalability problem supporting multiple languages.

The author of Treesitter proposes using the WASM runtime, which is new.

(2) The original blog post concludes with some Treesitter issues, preferring Syntect (a Rust library that accepts Textmate grammars)

Because of these issues I’ll evaluate what highlighter to use on a case-by-case basis, with Syntect as the default choice.

https://www.jonashietala.se/blog/2024/03/19/lets_create_a_tr...

Other feedback:

(3) The idea of a uniform api for querying syntax trees is a good one and tree-sitter deserves credit for popularizing it. It's unfortunately not a great implementation of the idea

(4) [It] segfaults constantly ... More than any NPM module I've ever used before. Any syntax that doesn't precisely match the grammar is liable to take down your entire thread.

---

I think some of the feedback was rude and harsh, and maybe even using Treesitter outside its intended use cases. But as someone who's been interested in Treesitter, but hasn't really used it, it seems real.

One problem I see is that Treesitter is meant to be incremental, so it can be used in an editor/IDE. And that's a significantly harder problem than batch syntax highlighting, parsing, semantic understanding.

---

That is, difftastic is a batch tool, i.e. you run it with git diff.

So to me the obvious thing for difftastic is to throw out the GLR algorithm, and throw out the heinous external lexers written in C that are constrained by it, and just use normal batch parsers written in whatever language, with whatever algorithm. Recursive descent.

These parsers can output a CST in the TreeSitter format, which looks pretty simple.

They don't even need to be linked into the difftastic binary -- you could emit a CST / S-expression format and match it with the text.

Unix style! Parsers can live in different binaries and still be composed.
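A sketch of the "separate parser binary that prints a tree" idea, using Python's built-in `ast` module as a stand-in batch parser. Two loud assumptions: Python's `ast` yields an abstract tree rather than a true CST, and the S-expression layout printed by the hypothetical `to_sexp` is invented here, not tree-sitter's actual format.

```python
import ast


def to_sexp(node: ast.AST) -> str:
    """Render a parse tree as a nested S-expression of node type names."""
    name = type(node).__name__
    children = [to_sexp(child) for child in ast.iter_child_nodes(node)]
    if not children:
        return f"({name})"
    return f"({name} {' '.join(children)})"


# In a Unix-style pipeline this string would be written to stdout
# and consumed by a separate diff tool.
sexp = to_sexp(ast.parse("x = 4"))
```

A real version would also need to emit byte ranges for each node so the consumer can match the tree back to the text.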

The blog post use case can also just use batch parsers that output a CST. You don't need Treesitter's incremental features to render HTML for your blog.

diffxx · a year ago
As one of the harsh and rude commentators, I would say I basically agree with your interpretation. You also correctly inferred that I have experience with working with it in an area that is arguably outside of its true use case.

At the same time, I believe that there needs to be a corrective about what tree-sitter should and should not be used for. There are companies building security products on top of tree-sitter which I think is an objectively bad idea given its problems and limitations. Difftastic is to me a grey area because it could lead hypothetically to a security issue if it generates an incorrect diff due to an incorrect tree-sitter grammar. Unlikely but not impossible.

Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill (I have written a recursive descent parser for a language in C that can do 3 million lines per second on a decent laptop, which is about 60k lines per 20 ms, which is the window I look to for reactivity). How many non-generated source files exceed say 100k lines? Incremental parsing feels like taking on a lot of complexity for rather limited benefit except in fairly niche use cases (granting that one person's niche is another's common case).
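The throughput arithmetic above can be checked with a few lines. The constants come from the comment; the helper name `parse_time_ms` is made up, and the calculation assumes parse time scales linearly with line count.

```python
LINES_PER_SECOND = 3_000_000  # throughput quoted in the comment
BUDGET_MS = 20                # reactivity window quoted in the comment

# Lines that fit in one reactivity window at that throughput.
lines_within_budget = LINES_PER_SECOND * BUDGET_MS // 1000  # 60000


def parse_time_ms(line_count: int) -> float:
    """Estimated full-parse time in milliseconds, assuming linear scaling."""
    return line_count / LINES_PER_SECOND * 1000


# A 100k-line file would take roughly 33 ms under the same assumption.
big_file_ms = parse_time_ms(100_000)
```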

That being said, it is impressive that their incremental algorithm works as well as it does but the cost is that grammar writers are forced to mold a language grammar that might not fit into the GLR algorithm. When it doesn't work as expected, which is not uncommon in my experience, the error messages are inscrutable and debugging either the generator or the generated code is nigh impossible.

Most of the happy users have no idea how the sausage is made, they just see the prettier syntax highlighting that works with multiple tools. I get that my criticism is as welcome as a wet blanket, but I just think there is something much better possible which your comment hints at.

thaumasiotes · a year ago
This question is coming from a place of total ignorance:

One appeal of the general idea of a structural diff tool, for me, is ignoring the ordering of things for which ordering makes no difference.

    x = 4
    y = 7
are independent statements and the code will be no different if I replace those two statements with

    y = 7
    x = 4
However, this information is not actually present in the abstract syntax tree. If I instead consider these two statements:

    x += 3
    x *= 7
it is apparent that reordering them will cause changes to the meaning of the code. But as far as the AST goes, this is the same thing as the example where reordering was fine.

What kinds of things are we doing with our new AST tooling?

etbebl · a year ago
>     x = 4
>     y = 7
>
> are independent statements and the code will be no different if I replace those two statements with
>
>     y = 7
>     x = 4

Not always, e.g. in a multi threaded situation where x and y are shared atomics. Then unless we authorize C++ to take more liberties in reordering, another thread will never see y as 7 while x is not yet 4 in the first example, but not the second. This kind of subtlety can't be determined from syntax alone.

MathMonkeyMan · a year ago
In a sense, plain old diff is a structural diff. The grammar is a sequence of lines of characters.

All tree-sitter gives you is a _different_ grammar, so that a structural diff can operate on different trees given the same text as diff.

A parse tree still doesn't know anything about the meaning of a program, which is what you need to know in order to determine that those assignments to x and y are unordered.
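A small sketch of "same text, different grammar": the two snippets below differ under a line-based diff but produce identical trees. It uses Python's standard `difflib` and `ast` as stand-ins for `diff` and tree-sitter, and the helper names are hypothetical.

```python
import ast
import difflib

old = "total = add(1, 2)\n"
new = "total = add(\n    1,\n    2,\n)\n"


def changed_lines(before: str, after: str) -> list:
    """Added/removed lines as a line-oriented diff sees them (headers dropped)."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def same_tree(before: str, after: str) -> bool:
    """True when both sources parse to structurally identical trees."""
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))


line_level = changed_lines(old, new)   # non-empty: every line "changed"
tree_level = same_tree(old, new)       # True: only layout changed
```

Neither view says anything about what the call to `add` does, which is the point made above about meaning.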

libre-man · a year ago
What you want for determining this is not an AST but a Program Dependence Graph (PDG), which does encode this information. Creating one is nowhere near as simple as creating an AST, and for many languages it either requires assumptions that will be broken, or results in something very similar to an AST (every node has a dependency on the previous node).
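For the two examples upthread, the dependence information can be approximated by comparing which names each statement reads and writes. This is only a sketch for straight-line Python assignments: `reads_writes` and `can_reorder` are hypothetical helpers, and the check ignores aliasing, function calls with side effects, and the threading issue raised above.

```python
import ast


def reads_writes(stmt: ast.stmt):
    """Collect the variable names a single statement reads and writes."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
    # 'x += 3' also reads x, although the AST marks the target as a store.
    if isinstance(stmt, ast.AugAssign) and isinstance(stmt.target, ast.Name):
        reads.add(stmt.target.id)
    return reads, writes


def can_reorder(first_src: str, second_src: str) -> bool:
    """True if neither statement writes a name the other reads or writes."""
    first = ast.parse(first_src).body[0]
    second = ast.parse(second_src).body[0]
    reads_a, writes_a = reads_writes(first)
    reads_b, writes_b = reads_writes(second)
    conflict = (writes_a & writes_b) or (writes_a & reads_b) or (reads_a & writes_b)
    return not conflict


independent = can_reorder("x = 4", "y = 7")     # True
dependent = can_reorder("x += 3", "x *= 7")     # False
```

Each such conflict would become an edge in a PDG; the hard part in real languages is computing the read and write sets soundly.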
joshspankit · a year ago
How close are we to being able to copy a function into the clipboard, then highlight some lines of code and paste the function around it (like highlight > quote marks)?
worksonmine · a year ago
I don't know what exactly you mean by pasting a function around the selection, but you can paste selections, registers or even files at specific lines with some vim-fu. If it's generic enough you could write a function or even a keyboard shortcut if it's very simple.

I have set ",',(,[,{ in visual mode to cut the selection, insert the pair, then paste it back, as a very hacky solution, but it gets the job done. If you want something more advanced to add or change anything around the selection, tpope has solved that with vim-surround[1].

[1]: https://github.com/tpope/vim-surround

OJFord · a year ago
What does it mean to paste a function around some lines of code? As in what're the manual steps you do because that's not possible today?
fransje26 · a year ago
> Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports.

Built on the shoulders of giants.

epistasis · a year ago
I'm imagining what I could have done in my compilers class with something like tree-sitter...

It feels kind of as foundational as YACC.

ivanjermakov · a year ago
It is literally an alternative to YACC and other parser generators.
bloopernova · a year ago
Related, updating difftastic and friends if you installed via cargo:

  cargo install cargo-update
  cargo install-update --list
  cargo install-update --all
Other fun Rust projects available via cargo:

https://mise.jdx.dev/ mise-en-place, a drop-in replacement for asdf https://asdf-vm.com/ that is really fast and flexible.

https://github.com/ajeetdsouza/zoxide is a fantastic cd replacement, which stores where you cd to, and you can then do a partial match like "z hel" might take you to "~/projects/helloworld".

https://github.com/bootandy/dust is a complement to "du", shows which directories are using the most disk space.

arlort · a year ago
Another three very neat ones are

- https://github.com/eza-community/eza (ls with some added visual sugar)

- https://github.com/ClementTsang/bottom (htop but with graphs)

- https://github.com/sharkdp/bat (cat with syntax highlight)

kstrauser · a year ago
I love zoxide! Also for your list: lsd, a prettier ls.
bloopernova · a year ago
so... many... colours!

Looks great, thank you for the recommendation.

qmmmur · a year ago
Wow, I installed mise-en-place. It's exactly what I wanted asdf to be.
bloopernova · a year ago
It's so much faster than asdf, the dev did a really great job.
letmeinhere · a year ago
My favorite of the new `du`s is dua-cli, an ncdu clone (an interactive TUI). I went hunting because the latter didn't have a light-mode.
IshKebab · a year ago
ncdu is the best du replacement by far.
polygamous_bat · a year ago
I've always used dust as a replacement, and so I am curious to know if you have tried both tools: do you have thoughts on what makes ncdu better?
satvikpendem · a year ago
Also, cargo-binstall (cargo binary install) which allows you to not have to compile every single time you cargo install and instead allows you to just install the binaries for a specific program. It also integrates with cargo install-update.
bloopernova · a year ago
I originally tried binstall but it seemed to not work with our corporate proxy. However, I decided to give it another go after seeing your comment.

Now, it works great! I tried it with a mise update, and it pulled down the right binary with no proxy problems.

Thank you for the reminder and recommendation, much appreciated!!

tomatao · a year ago
How do these compare to https://github.com/moonrepo/proto ?
bloopernova · a year ago
I haven't used proto so unfortunately I can't answer your question, sorry about that.
mlavrent · a year ago
I’m almost not sure why tools like git don’t ship with this as default. Been using difft for about a year now, and my main complaint is that it makes it hard to go back and use other diff tools when I don’t have difft available :).

I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same). It seems like an intractable problem in the general but maybe it’s doable and/or useful for smaller DSLs or subsets of some languages?

kstrauser · a year ago
I think shipping good ol' diff as the default makes sense. It's going to be there already on any system you might want to run git on, it's fast, it's tiny, and everyone knows the basics of how to use it.

But I'm glad it's easy to change that default.

DarkPlayer · a year ago
> I am curious if there's been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)

We are working on https://semanticdiff.com/ which detects basic semantic changes like converting a literal from decimal to hex or reordering keys within JSON objects. It is not a command line utility but a VS Code extension and GitHub App. You can check out https://semanticdiff.com/blog/semanticdiff-vs-difftastic/ if you want to learn more about how it works and how it differs from difftastic.
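To illustrate the two kinds of change mentioned (a literal rewritten in another base, and reordered JSON keys), here is a toy equivalence check. It is not how SemanticDiff is implemented; the function names are hypothetical and only Python's standard library is used.

```python
import json


def same_json(old: str, new: str) -> bool:
    """Objects compare equal regardless of key order; array order still matters."""
    return json.loads(old) == json.loads(new)


def same_int_literal(old: str, new: str) -> bool:
    """Compare integer literals by value, accepting 0x, 0o, and 0b prefixes."""
    return int(old, 0) == int(new, 0)


keys_reordered = same_json('{"a": 1, "b": 2}', '{"b": 2, "a": 1}')  # True
base_changed = same_int_literal("255", "0xff")                      # True
items_reordered = same_json("[1, 2]", "[2, 1]")                     # False
```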

Izikiel43 · a year ago
Thank you, you just simplified my life greatly, will use it for a demo tomorrow.
otherjason · a year ago
Difftastic is a useful tool, but in my experience, it's far too slow to be suitable as the default selection for a ubiquitous tool like git.
drcongo · a year ago
I'm finding it instantaneous here on a large dirty codebase. In what way is it slow for you?
ruined · a year ago
>I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same).

if you do this your difftool becomes a compiler

mlavrent · a year ago
Sorry, I should've been clearer. I'm interested if there's any tool that does this kind of thing statically, without running the code. I guess a simple approach is to compile both programs and see if the generated code is the same, but I'd guess reasoning at the generated-code level will probably produce a lot more false positives (i.e. tool will report a change when there isn't one) than if you reason about the original program.
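The "compile both and compare" approach can be tried on flat Python snippets with the built-in `compile`. This is a rough sketch: `compiled_form` and `same_generated_code` are hypothetical names, the comparison covers only bytecode, constants, and names, and snippets containing nested functions or classes are out of scope.

```python
def compiled_form(source: str):
    """Bytecode plus the constants and names it refers to, for a flat snippet."""
    code = compile(source, "<snippet>", "exec")
    # repr() keeps 1 and True distinct, which plain tuple equality would not.
    return (code.co_code, tuple(map(repr, code.co_consts)), code.co_names)


def same_generated_code(old: str, new: str) -> bool:
    """True when both snippets compile to the same code under this comparison."""
    return compiled_form(old) == compiled_form(new)


parens_only = same_generated_code("x = (1)", "x = 1")   # True
value_changed = same_generated_code("x = 1", "x = 2")   # False
```

As the comment anticipates, this reports a change for many edits a human would call equivalent, since the compiler normalizes very little.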
Chris_Newton · a year ago
> if you do this your difftool becomes a compiler

Some linters and formatters are effectively compilers already, so that doesn’t seem completely implausible in itself. Finding canonical representations of common coding patterns so you can quickly and reliably determine that they are equivalent is a different question, though.

hobs · a year ago
That's exactly what I have done with diffing SQL in lazy mode - just use a server and diff the AST/plan.
more-coffee · a year ago
I'm trying difft for git now and I also really like it. One reason why I think it shouldn't be the default is that it hides whitespace differences. Maybe that's configurable though, haven't looked into that
rob74 · a year ago
> I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)

So when using such a diff tool you can spend hours refactoring something, and then git will refuse to commit your changes because your refactoring was successful in not changing the behavior of the code? I understand what you mean, but if we arrive at that point maybe we should stop calling it "diff", to avoid confusion...

kstrauser · a year ago
Git doesn't use the output of `diff` to determine whether anything has changed.
pmayrgundter · a year ago
"Do you know how to read @@ -5,6 +5,7 @@ syntax? Difftastic shows the actual line numbers from your files, both before and after."

Preach!

Just dropped it in and did a git diff.. works like a charm!

neuromanser · a year ago
> Do you know how to read @@ -5,6 +5,7 @@ syntax?

Do you not?

pmayrgundter · a year ago
20 years staring at it and no, i don't. i usually have to work it out from context. i think if you have vi or ed sensory organs it might work better for ya. which.. i still chuckle that vi is the visual editor, bc ed lol
wffurr · a year ago
No, I sure don’t.
michaelcampbell · a year ago
No one knows anything until they learn it.
snthpy · a year ago
Same here. If there's a great ELI5 explanation somewhere, please post a link. Thanks
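For anyone after the short version: in `@@ -5,6 +5,7 @@`, the `-5,6` part means the hunk covers 6 lines of the old file starting at line 5, and `+5,7` means it covers 7 lines of the new file starting at line 5 (so one line was added on balance). When the count is omitted it defaults to 1. A small parser sketch, with a made-up function name:

```python
import re

# @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str) -> dict:
    """Split a unified-diff hunk header into start lines and line counts."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return {
        "old_start": int(old_start),
        "old_count": int(old_count) if old_count is not None else 1,
        "new_start": int(new_start),
        "new_count": int(new_count) if new_count is not None else 1,
    }


hunk = parse_hunk_header("@@ -5,6 +5,7 @@")
```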
teaearlgraycold · a year ago
Never needed to.
hrdwdmrbl · a year ago
It seems like a major lapse in product innovation that GitHub has not come out with something like this. They don't even have something to help you when the indentation changes; they usually just show it as a giant add & remove. Their diff viewer can and should be smarter.
neuromanser · a year ago
GitHub can't even recognize syntax, let alone provide semantic diffs! In fact, GitHub can't even tell that foo.cpp.in is different from foo.mk.in! Any foo.t is declared to be Perl, with no way to fix it… There are decade-old tickets!
bPspGiJT8Y · a year ago
Tree-sitter optimizes for performance (to use in editors), not for correctness. In fact even TS' core developers advocate for not bothering too much with correctness of grammars[1]. I imagine this constraint would be a deal-breaker for GitHub or anyone else in their position.

[1] https://github.com/tree-sitter/tree-sitter/issues/130#issuec...

sroussey · a year ago
GitHub has the option to ignore whitespace in a diff.
mbork_pl · a year ago
Which is useful, but too crude.
jmholla · a year ago
I tried switching to this, but I found it noisy, and it uses weird formatting for things that didn't change. I went back to using icdiff[0].

[0]: https://github.com/jeffkaufman/icdiff

sanity · a year ago
Interesting, I found Semantic Merge [1] years ago but it was never open source.

This just does diff but not merge, but at least it's open source - and the diffs look a lot nicer, I've already made it my default.

Any plans to extend it to merging?

[1] https://docs.plasticscm.com/semanticmerge

OJFord · a year ago
> Any plans to extend it to merging?

The GitHub readme:

> Can difftastic do merges?

> No. AST merging is a hard problem that difftastic does not address.

> AST diffing is a also lossy process from the perspective of a text diff. Difftastic will ignore whitespace that isn't syntactically significant, but merging requires tracking whitespace.

rideontime · a year ago
Was going to suggest this myself, this was a godsend when I was working with a big team on a C# project going through a messy refactor.
asicsp · a year ago
Previous discussions:

https://news.ycombinator.com/item?id=27768861 (297 points | 3 years ago | 61 comments)

https://news.ycombinator.com/item?id=32746258 (698 points | 2 years ago | 90 comments)

https://news.ycombinator.com/item?id=30841244 (983 points | 2 years ago | 219 comments)