Difftastic: A diff that understands syntax

I really like the idea of focusing on producing patches for human consumption. I studied the problem of merging AST-level patches during my PhD (https://github.com/VictorCMiraldo/hdiff) and can confirm: not simple! :)

stavros · 4 years ago

Please tell me the final output of your PhD was a differtation.

vcmiraldo · 4 years ago

omg!! I really should have left that typo somewhere in there! What a missed opportunity! xD

bool3max · 4 years ago

Should've named that repo "phdiff".

Groxx · 4 years ago

I'll vote for "diphph"

pdimitar · 4 years ago

Best pun I've heard in a long time. Well done. <3

munk-a · 4 years ago

To be pronounced "Doctor-iff" in speech?

einpoklum · 4 years ago

So I looked at the paper and it seems interesting. Basic idea: Instead of the operations to consider being "insert", "delete" and "copy", one adds "reorder", "contract subtree" and "duplicate" (although I didn't quite get the subtlety of copy vs duplicate on a short skim); and even though extra ops increase the search space, they actually let you search more effectively. I can buy that argument.

The practical problem, though, is that the Haskell compiler is limited/buggy, so you couldn't implement this for C, and you settled on a small language like Lua. If you _do_ extend this to other languages (perhaps port your implementation from Haskell to something else?), please post it on HN and elsewhere!

arianvanp · 4 years ago

Some of the GHC performance bugs that we ran into during the research have been fixed as far as I know! Though I'd have to double-check.

vcmiraldo · 4 years ago

Copy just copies once. The need for duplicate is clear if you're trying to diff something like `t = [a]` and `u = [a, a]`. You could copy `a`, but you'd have to decide whether to copy it to the first or the second position; the second one would be classified as an "insertion" by any ins/del/cpy algorithm. If you instead opt NOT to make that choice, you can say: pick the source `a` and duplicate it instead.
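
To make the copy/duplicate distinction concrete, here is a rough Python sketch. It is not hdiff's actual representation: a change is modelled as a pair of flat list patterns that share metavariables, and `Var` and `apply_change` are hypothetical names made up for illustration. A duplication is then simply a metavariable that is bound once on the deletion side and used twice on the insertion side, so nobody has to pick the "first" or "second" position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A metavariable standing for an arbitrary subtree (hypothetical helper)."""
    name: str


def apply_change(delete_ctx, insert_ctx, source):
    """Match `source` against `delete_ctx`, then rebuild it from `insert_ctx`.

    Both contexts are flat lists here; real ASTs would need recursion.
    Returns None when the source does not match the deletion context.
    """
    if len(delete_ctx) != len(source):
        return None
    bindings = {}
    for pattern, value in zip(delete_ctx, source):
        if isinstance(pattern, Var):
            # A variable seen twice must bind to the same subtree both times.
            if bindings.setdefault(pattern.name, value) != value:
                return None
        elif pattern != value:
            return None
    return [bindings[p.name] if isinstance(p, Var) else p for p in insert_ctx]


x = Var("x")
# "Duplicate": x is bound once on the left and used twice on the right.
duplicate = ([x], [x, x])
print(apply_change(*duplicate, ["a"]))  # ['a', 'a']
```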

narush · 4 years ago

Can you give a little color on where the difficulties lie? Is it an efficiency question, or is determining "which changes" hard in the first place?

vanderZwan · 4 years ago

Early in the linked thesis there is a one-page argument about the shortcomings of traditional approaches, which technically isn't what you asked but might still answer the side of the question that deals with human usage at least:

https://victorcmiraldo.github.io/data/MiraldoPhD.pdf#page=24

scythmic_waves · 4 years ago

Not OP, but the docs call out some "Tricky Cases" [1].

[1] https://difftastic.wilfred.me.uk/tricky_cases.html

vcmiraldo · 4 years ago

Efficiency is not the issue at this point. My prototype diffing algorithm was linear and there have been improvements on it already (I think something called "truediff" is linear but an order of magnitude better! I could be misremembering the name, don't quote me :) ).

The really difficult part is how you represent AST-level changes, which limits what your merging algorithm can do. In particular, working around moving "the same subtree" into different places is difficult. Imagine the following conflict:

([1,3], [4,2,5]) <-- q -- ([1,2,3], [4,5]) -- p --> ([1,3], [2,4,5])

Both p and q move the same thing to different places, so they need a human to decide which merge is correct. Depending on your choice of "what is a change", even detecting this type of conflict will be difficult. And that's before we add insertions or deletions. Suppose p were instead:

([1,2,3], [4,5]) -- p --> ([1,3], [2,5])

One could argue that we can now merge, because '4' was also deleted hence the position in which we insert '2' in the second list is irrelevant.

If we extrapolate from lists of integers to arbitrary ASTs the difficulties become even worse :)
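
For what it's worth, the first conflict above can be reproduced with a deliberately naive Python sketch. It assumes every element is unique and describes an element's position by the list it lives in plus its left neighbour; `placements` and `conflicting_elements` are hypothetical helper names, not anything taken from hdiff.

```python
def placements(state):
    """Map each element to (list index, left neighbour or None)."""
    result = {}
    for list_index, items in enumerate(state):
        left = None
        for item in items:
            result[item] = (list_index, left)
            left = item
    return result


def conflicting_elements(base, p, q):
    """Elements that both patches re-placed, but in different ways."""
    at_base, at_p, at_q = placements(base), placements(p), placements(q)
    conflicts = []
    for element, origin in at_base.items():
        in_p, in_q = at_p.get(element), at_q.get(element)  # None means deleted
        if in_p != origin and in_q != origin and in_p != in_q:
            conflicts.append(element)
    return conflicts


base = ([1, 2, 3], [4, 5])
p = ([1, 3], [2, 4, 5])
q = ([1, 3], [4, 2, 5])
print(conflicting_elements(base, p, q))  # [2]
```

With the second version of p, `([1, 3], [2, 5])`, this check still reports `2`, even though one could argue the merge is fine. Whether that counts as a conflict depends entirely on the chosen notion of "change".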

infogulch · 4 years ago

How does your work relate to tree-sitter, which also manages patches (which it describes as "incremental parsing") as well as error states?

This looks really cool and I can't wait to try it, tho... a bit of a PITA to get running. ;) Took a while to figure out how to build, and had to install 400MB of dependencies first....

Edit: And after installing cargo, watching it fail to build, then determining I must need a newer version of cargo, so I built that from source... it fails. Apparently I need to install `rustc-mozilla` and not `rustc`. "obviously".

This is all a testament to how much I want to try this tool...

MOAR EDIT: even with rustc-mozilla cargo fails to build. running `cargo install difftastic` gives me an error about my version of cargo being too old ;.;

Dear author: Let us run your tool.

gkfasdfasdf · 4 years ago

Using ubuntu 20.04, I first installed cargo:

  curl https://sh.rustup.rs -sSf | sh

Restart shell to get $HOME/.cargo/bin in PATH, then did:

  cargo install difftastic

And ~4 minutes later, difft executable is ready.

Agree though that some pre-built binaries would be fantastic!

loxias · 4 years ago

Ah, well, if you're willing to accept having a frankensystem with a mix of packaged and unpackaged software, sure. ;) I used to do that, back in Slackware days.

It's considered really sloppy and unmaintainable to admin a system like that. Things quickly get out of hand.

That strategy _does_ work if you isolate it to a chroot or a container, but littering /usr/local with all sorts of locally compiled upstream is just asking for future pain. Security updates, library incompatibilities, &c.

Prebuilt binaries might be nice, but I don't expect them for random projects. (and I wouldn't have used them if offered) I do think it's a reasonable expectation to be able to build software w/o essentially setting up a new userland just for that tool though. :)

vlunkr · 4 years ago

A huge part of the appeal of Rust and Go tools is that you can just ship a binary, it's frustrating that it's not available here.

easrng · 4 years ago

Not sure about Go, but Rust still links against glibc, so I sometimes have to recompile things to make them work on my Debian systems if they're built against newer glibc.

ducktective · 4 years ago

Same here. Looked into repo -> no binary in release or Github actions

spun up an Ubuntu 18.04 instance -> git clone, git checkout 0.24.0

installed rust using curl | sh method

build fails:

https://termbin.com/29xy

removed the instance and gonna check it again 6 months later

adwn · 4 years ago

In another comment you're asking about vim support. So let me get this straight: You're using vim, yet you're unable to resolve the error message

    = note: /usr/bin/ld: cannot find Scrt1.o: No such file or directory
            /usr/bin/ld: cannot find crti.o: No such file or directory

Have you tried googling for "ubuntu crti.o: No such file or directory" ?

unhammer · 4 years ago

If you have nix (package manager) installed, it takes like half a second. For tools I want to install through nixpkgs I make a starter like this:

    $ cat /usr/local/bin/difftastic
    #!/bin/sh
    . "$HOME/.nix-profile/etc/profile.d/nix.sh"
    nix run nixpkgs.difftastic -c difftastic "$@"

and then it'll install on first run:

    $ difftastic
    these paths will be fetched (1.17 MiB download, 9.38 MiB unpacked):
      /nix/store/wn74xn0w60xcwsly6nqaibn205hh2qms-difftastic-0.8
    copying path '/nix/store/wn74xn0w60xcwsly6nqaibn205hh2qms-difftastic-0.8' from 'https://cache.nixos.org'...
    Difftastic 0.8.0
    Wilfred Hughes
    A syntax aware diff.
    
    USAGE:
    [etc.]

YetAnotherNick · 4 years ago

Used `cargo install difftastic`? Finished in a minute for me.

lopatin · 4 years ago

Build errors for me. Apparently I'm on some nightly build of cargo, but I need 2021 version. The pain begins...

Edit: Reinstalling Cargo worked!

loxias · 4 years ago

Sure, but first I had to figure out wtf "cargo" is. :P

Also, `cargo install difftastic` AIUI pulls it from a central location, if I'm gonna poke at software for the first time, I enjoy building it myself first, so I can get my hands dirty in the source. :)

EDIT: Also, the build fails. :(

    error: unexpected token: `include_str`
     --> /home/loxias/.cargo/registry/src/github.com-1ecc6299db9ec823/radix-heap-0.4.2/src/lib.rs:2:10
      |
    2 | #![doc = include_str!("../README.md")]
      |          ^^^^^^^^^^^

    error: aborting due to previous error

    error: could not compile `radix-heap`.

sad trombone

Wilfred · 4 years ago

The getting started section of the manual should help: https://difftastic.wilfred.me.uk/getting_started.html

I've documented the minimum rust version required today, although I'm looking at lowering the minimum version.

estebank · 4 years ago

Honest question: how did you arrive at the conclusion that you needed rustc-mozilla? I would love to make sure whatever flow led you to that is made clearer for other newcomers, because that is definitely not something anyone who isn't working on Firefox should even try.

Wilfred · 4 years ago

I imagine it's a misunderstanding with the rustc-hash dependency used in difftastic for faster hashing.

emacsen · 4 years ago

This looks absolutely amazing.

One thing I do find interesting (and wish were different) is that only programming languages are supported, rather than data formats as well.

For example, two JSON documents may be valid but formatted slightly differently, or a common task for me is comparing two YAML files.

Comparing config files that have a well-defined syntax and/or can be abstracted into a tree (JSON, YAML, TOML, etc.) would be absolutely lovely, even including (if possible) Markdown and its ilk.

JSON and CSS are supported today, and I'm interested in adding more structured text formats.

If a format has a tree-sitter parser, it can be added to difftastic. The TOML tree-sitter parser looks good, but there isn't a mature markdown parser for tree-sitter. There are other markdown parsers available, so in principle difftastic could support markdown that way.

The display logic might need a little tuning for prose-heavy formats like markdown though. I'm not happy with how difftastic handles block comments yet either.

I'm not sure about formats that contain more prose, such as markdown or HTML.

zmix · 4 years ago

I think supporting XML would be something a lot of people would appreciate. That XML is difficult to diff comes up again and again... However, one would need to decide whether one wants to compare by syntax or by meaning. The latter may be preferable, but would require the XML to be canonicalized on both sides first.
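
As a small illustration of the "canonicalize both sides first" idea, here is a sketch using the C14N 2.0 support in Python's standard library (`xml.etree.ElementTree.canonicalize`, Python 3.8+). The sample documents and the `same_xml` helper are made up for this example, and canonical form only covers things like attribute order, quoting and empty-element syntax, not every notion of "same meaning".

```python
import xml.etree.ElementTree as ET


def same_xml(left: str, right: str) -> bool:
    """Compare two XML documents after C14N 2.0 canonicalization."""
    return ET.canonicalize(left) == ET.canonicalize(right)


# Same content: attribute order, quote style and empty-tag syntax differ.
a = '<config b="2" a="1"><item/></config>'
b = "<config a='1'   b='2'><item></item></config>"

# Different content: one attribute value changed.
c = '<config a="1" b="3"><item/></config>'

print(same_xml(a, b))
print(same_xml(a, c))
```

Whitespace inside text nodes is significant by default; `canonicalize` accepts `strip_text=True` if that should be ignored as well.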

simonw · 4 years ago

I would naively expect that this problem is easiest to solve for languages like JSON that have an unambiguous way to be pretty printed.

chockchocschoir · 4 years ago

Indeed. One could just do `diff <(jq . "$fileOne") <(jq . "$fileTwo")` (process substitution, so this needs bash or zsh) and you'll end up with a "nice enough" diff even if $fileOne and $fileTwo were very differently formatted.
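
The same pretty-print-then-diff trick can be sketched with nothing but Python's standard library. One assumption that goes beyond plain `jq .`: keys are sorted here, so key order is treated as insignificant. `normalized_json_diff` is just a name picked for this example.

```python
import difflib
import json


def normalized_json_diff(text_a, text_b):
    """Pretty-print both JSON documents the same way, then line-diff them."""
    def normalize(text):
        return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()

    return list(
        difflib.unified_diff(normalize(text_a), normalize(text_b), "a", "b", lineterm="")
    )


compact = '{"name":"difft","tags":["diff","ast"]}'
spread = '{\n  "tags": ["diff", "ast"],\n  "name": "difft"\n}'
changed = '{"name": "difft", "tags": ["diff", "tree"]}'

# Formatting and key order alone produce no diff at all.
print(normalized_json_diff(compact, spread))  # []

# A real change shows up as an ordinary line-based diff.
print("\n".join(normalized_json_diff(compact, changed)))
```

This is still a line diff of normalized text, so it has no notion of structure beyond whatever the pretty-printer puts on separate lines.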

https://github.com/andreyvit/json-diff works really well for JSON diffing in my experience.

It's more simplistic than difftastic though: it considers `1` and `[1]` to have nothing in common.

mark_and_sweep · 4 years ago

JSON is supported.

HTML and XML are missing, too.

You're right. I missed JSON.

Sadly YAML, TOML and the others I mentioned are not there (yet?)

d0gsg0w00f · 4 years ago

This is kind of like the problem of programmatically analyzing AWS IAM roles and policies to understand impact of changes. Very difficult to do in JSON format but worth tons of money to CISOs if it can be solved.

alxmrs · 4 years ago

Similarly, I would love it if Pandoc’s AST were supported. Or, if this could be extended to compare any documents taking formatting into account, or document-to-document conversions.

paxys · 4 years ago

This isn't going to add anything to existing diff tools for JSON or YAML though. Those formats barely have any syntax highlighting or complex structures.

linsomniac · 4 years ago

I would love a great XML diff tool, and after seeing the demo of this I was sad to see XML not in there. Would pay for.

tomatowurst · 4 years ago

Same, I don't know how many times I do a diff and wish there was a smarter solution that could take formatting and whitespace into account. This is it. Wish git diff would incorporate this, would be a real treat.

dools · 4 years ago

Funny side note: I had a flat mate once who was on a working holiday from Japan.

He was in love with and endlessly curious about English slang, it’s basically all we talked about.

I remember explaining to him why my uni friends and I referred to things as being “craptastic”, starting with American marketing’s love affair with the portmanteau.

He got it pretty quickly and enjoyed using it in conversation.

The saying that was harder for him to understand was “fuck all”. He always wanted fuck to be the verb, rather than using “fuck all” as the adjective, so he would say things like “I fuck all my money last night at the pub”.

benreesman · 4 years ago

I know native English-speakers who would say that s/pub/bar/.

Profanity is just delightful in general, and non-native English speakers come up with some of the best profane idioms in English.

I wonder if it’s the same in other languages?

roeles · 4 years ago

Perhaps this book would have helped.

https://www.amazon.com/gp/aw/d/486256139X

db48x · 4 years ago

This is written by the same guy who wrote Helpful, an enhancement package for the Emacs Help buffer. I highly recommend checking out Helpful if you haven’t seen it. https://github.com/Wilfred/helpful

CodeIsTheEnd · 4 years ago

EDIT: Wilfred IS the original author [3]; my apologies.

Not to discredit Wilfred (it looks like he's taken over the project as the maintainer), but, based on the historical contributions [1], it looks like it was originally developed by Max Brunsfeld, who also created Tree-sitter. [2]

[1]: https://github.com/Wilfred/difftastic/graphs/contributors

[2]: https://github.com/tree-sitter/tree-sitter

[3]: https://github.com/Wilfred/difftastic/commit/958033924a2dea7...

arxanas · 4 years ago

I think the contributor graph is misleading, and that he's using git-subtree to vendor tree-sitter, which makes it look like others have contributed more to the project.

maw · 4 years ago

He wrote https://github.com/Wilfred/deadgrep too. It's awesome and I don't know how I lived without it for so long.

disgruntledphd2 · 4 years ago

Helpful is (pun fully intended) so very, very helpful.

Honestly, I cannot imagine going back to the standard emacs help.

Agreed. It’s so good it feels like it should have been that way all along. For example, when you view the help for a function Emacs has always given you a link to the source code where that function is defined. Helpful shows you the source code right in the Help buffer, and shows you a list of callers, and gives you buttons that enable tracing or debugging for the function.

Once I discovered Helpful, all of those things seemed so obviously useful that I can’t understand why nobody else thought to put them there, including myself.

buu700 · 4 years ago

For everyone wondering, it looks like this will work with git diff: https://difftastic.wilfred.me.uk/git.html

Starcrunch · 4 years ago

Exactly what I was looking for. Thanks!

pvg · 4 years ago

A previous discussion from 8 months ago, with some comments by the author and authors of other diff tools:

https://news.ycombinator.com/item?id=27768861

yboris · 4 years ago

My favorite dev tool is diff2html - a CLI that opens up your browser with a rich diff. Pro tip: alias `diff` to the command so you can launch it quickly ;)

https://diff2html.xyz/