I really like the idea of focusing on producing patches for human consumption. I studied the problem of merging AST-level patches during my PhD (https://github.com/VictorCMiraldo/hdiff) and can confirm: not simple! :)
So I looked at the paper and it seems interesting. Basic idea: instead of the operations to consider being "insert", "delete" and "copy", one adds "reorder", "contract subtree" and "duplicate" (although I didn't quite get the subtlety of copy vs duplicate on a short skim); and even though extra ops increase the search space, they actually let you search more effectively. I can buy that argument.
The practical problem, though, is that the Haskell compiler is limited/buggy, so you couldn't implement this for C, and you settled on a small language like Lua. If you _do_ extend this to other languages (perhaps port your implementation from Haskell to something else?), please post it on HN and elsewhere!
Copy just copies once. The need for duplicate is clear if you're trying to diff something like `t = [a]` and `u = [a, a]`. You could copy `a`, but you'd have to decide whether to copy it to the first or second position; the second one would be classified as an "insertion" by any ins/del/cpy algorithm. If you instead opt NOT to make that choice, you can say: pick the source `a` and duplicate it instead.
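To make the ambiguity concrete, here is a minimal Python sketch that enumerates every cheapest ins/del/cpy script between two lists. It is only an illustration: the function name `minimal_scripts` and the unit cost for insertions and deletions are my assumptions, not anything taken from the paper or from hdiff.

```python
from functools import lru_cache

def minimal_scripts(src, dst):
    """Return (cost, scripts): every cheapest ins/del/cpy script from src to dst.

    Hypothetical helper for illustration; cpy costs 0, ins and del cost 1.
    """
    src, dst = tuple(src), tuple(dst)

    @lru_cache(maxsize=None)
    def go(i, j):
        # Cheapest scripts turning src[i:] into dst[j:].
        if i == len(src) and j == len(dst):
            return 0, ((),)
        options = []
        if i < len(src) and j < len(dst) and src[i] == dst[j]:
            cost, scripts = go(i + 1, j + 1)
            options.append((cost, [(("cpy", src[i]),) + s for s in scripts]))
        if i < len(src):
            cost, scripts = go(i + 1, j)
            options.append((cost + 1, [(("del", src[i]),) + s for s in scripts]))
        if j < len(dst):
            cost, scripts = go(i, j + 1)
            options.append((cost + 1, [(("ins", dst[j]),) + s for s in scripts]))
        best = min(cost for cost, _ in options)
        kept = tuple(s for cost, scripts in options if cost == best for s in scripts)
        return best, kept

    return go(0, 0)

cost, scripts = minimal_scripts(["a"], ["a", "a"])
for script in scripts:
    print(cost, script)
```

For `t = [a]` and `u = [a, a]` there are two equally cheap scripts, one copying into the first position and one into the second, which is exactly the arbitrary choice that a "duplicate" operation avoids.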
Early in the linked thesis there is a one-page argument about the shortcomings of traditional approaches, which technically isn't what you asked but might still answer the side of the question that deals with human usage at least:
Efficiency is not the issue at this point. My prototype diffing algorithm was linear and there have been improvements on it already (I think something called "truediff" is linear but an order of magnitude better! I could be misremembering the name, don't quote me :) ).
One thing I do find interesting (and wish were different) is that only programming languages are supported, rather than data formats as well.
For example, two JSON documents may be valid but formatted slightly differently, or a common task for me is comparing two YAML files.
Comparing config files that have a well-defined syntax and/or can be abstracted into a tree (JSON, YAML, TOML, etc.) would be absolutely lovely, even including (if possible) Markdown and its ilk.
JSON and CSS are supported today, and I'm interested in adding more structured text formats.
If a format has a tree-sitter parser, it can be added to difftastic. The TOML tree-sitter parser looks good, but there isn't a mature markdown parser for tree-sitter. There are other markdown parsers available, so in principle difftastic could support markdown that way.
The display logic might need a little tuning for prose-heavy formats like markdown though. I'm not happy with how difftastic handles block comments yet either.
I'm not sure about formats that contain more prose, such as markdown or HTML.
I think supporting XML would be something a lot of people would appreciate. That XML is difficult to diff comes up again and again... However, one would need to decide whether to compare by syntax or by meaning. The latter may be preferable, but would require the XML to be canonicalized on both sides first.
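As a rough sketch of the "compare by meaning" option, Python's standard library (3.8+) can canonicalize both sides before comparing. The helper name `same_xml_meaning` is made up for illustration, and whether stripping whitespace-only text is the right notion of "meaning" depends on the document type, so treat that as an assumption.

```python
import xml.etree.ElementTree as ET

def same_xml_meaning(xml_a: str, xml_b: str) -> bool:
    """Compare two XML documents after C14N 2.0 canonicalization.

    Hypothetical helper: canonicalization sorts attributes and normalizes
    empty-element syntax; strip_text=True also drops surrounding whitespace
    in text content, which is an assumption about what counts as "meaning".
    """
    canon_a = ET.canonicalize(xml_a, strip_text=True)
    canon_b = ET.canonicalize(xml_b, strip_text=True)
    return canon_a == canon_b

print(same_xml_meaning(
    '<cfg b="2" a="1"><item/></cfg>',
    '<cfg a="1" b="2">\n  <item></item>\n</cfg>',
))
```

This only answers "same or different"; producing a readable diff of the canonical forms is a separate step.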
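Indeed. One could just do `diff <(jq . "$fileOne") <(jq . "$fileTwo")` (bash process substitution, so that diff receives two file-like inputs rather than the JSON text as arguments) and you'll end up with a "nice enough" diff even if $fileOne and $fileTwo were very differently formatted.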
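The same normalize-then-diff idea can be sketched with only the Python standard library. The function name `normalized_json_diff` is hypothetical, and sorting keys is an extra choice on my part (plain `jq .` keeps key order), so adjust it if key order matters to you.

```python
import difflib
import json

def normalized_json_diff(text_a: str, text_b: str) -> list[str]:
    """Pretty-print both JSON documents the same way, then line-diff them.

    Hypothetical helper; key sorting is an assumption added here so that
    key order does not show up as a difference.
    """
    def pretty(text: str) -> list[str]:
        return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()

    return list(
        difflib.unified_diff(
            pretty(text_a), pretty(text_b),
            fromfile="a", tofile="b", lineterm="",
        )
    )

for line in normalized_json_diff('{"x":1,"y":2}', '{ "y": 2,\n  "x": 3 }'):
    print(line)
```

Like the jq approach, this is still a line diff of re-formatted text, not a structural diff.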
This is kind of like the problem of programmatically analyzing AWS IAM roles and policies to understand impact of changes. Very difficult to do in JSON format but worth tons of money to CISOs if it can be solved.
Similarly, I would love it if Pandoc’s AST were supported. Or, if this could be extended to compare any documents taking formatting into account, or document-to-document conversions.
This isn't going to add anything to existing diff tools for JSON or YAML though. Those formats barely have any syntax highlighting or complex structures.
Same, I don't know how many times I do a diff and wish there was a smarter solution that could take formatting and whitespace into account. This is it. Wish git diff would incorporate this, would be a real treat.
Funny side note: I had a flat mate once who was on a working holiday from Japan.
He was in love with and endlessly curious about English slang, it’s basically all we talked about.
I remember explaining to him why my uni friends and I referred to things as being “craptastic”, starting with American marketing’s love affair with the portmanteau.
He got it pretty quickly and enjoyed using it in conversation.
The saying that was harder for him to understand was “fuck all”. He always wanted fuck to be the verb, rather than using “fuck all” as the adjective, so he would say things like “I fuck all my money last night at the pub”.
This is written by the same guy who wrote Helpful, an enhancement package for the Emacs Help buffer. I highly recommend checking out Helpful if you haven’t seen it. https://github.com/Wilfred/helpful
EDIT: Wilfred IS the original author [3]; my apologies.
Not to discredit Wilfred (it looks like he's taken over the project as the maintainer), but, based on the historical contributions [1], it looks like it was originally developed by Max Brunsfeld, who also created Tree-sitter. [2]
I think the contributor graph is misleading, and that he's using git-subtree to vendor tree-sitter, which makes it look like others have contributed more to the project.
Agreed. It’s so good it feels like it should have been that way all along. For example, when you view the help for a function Emacs has always given you a link to the source code where that function is defined. Helpful shows you the source code right in the Help buffer, and shows you a list of callers, and gives you buttons that enable tracing or debugging for the function.
Once I discovered Helpful, all of those things seemed so obviously useful that I can’t understand why nobody else thought to put them there, including myself.
This looks really cool and I can't wait to try it, tho... a bit of a PITA to get running. ;) Took a while to figure out how to build, and had to install 400MB of dependencies first....
Edit: And after installing cargo, watching it fail to build, then determining I must need a newer version of cargo, so I built that from source... it fails. Apparently I need to install `rustc-mozilla` and not `rustc`. "obviously".
This is all a testament to how much I want to try this tool...
MOAR EDIT: even with rustc-mozilla cargo fails to build. running `cargo install difftastic` gives me an error about my version of cargo being too old ;.;
Ah, well, if you're willing to accept having a frankensystem with a mix of packaged and unpackaged software, sure. ;) I used to do that, back in Slackware days.
It's considered really sloppy and unmaintainable to admin a system like that. Things quickly get out of hand.
That strategy _does_ work if you isolate it to a chroot or a container, but littering /usr/local with all sorts of locally compiled upstream is just asking for future pain. Security updates, library incompatibilities, &c.
Prebuilt binaries might be nice, but I don't expect them for random projects. (and I wouldn't have used them if offered) I do think it's a reasonable expectation to be able to build software w/o essentially setting up a new userland just for that tool though. :)
Not sure about Go, but Rust still links against glibc, so I sometimes have to recompile things to make them work on my Debian systems if they're built against newer glibc.
Sure, but first I had to figure out wtf "cargo" is. :P
Also, `cargo install difftastic` AIUI pulls it from a central location, if I'm gonna poke at software for the first time, I enjoy building it myself first, so I can get my hands dirty in the source. :)
Honest question: how did you arrive at the conclusion that you needed rustc-mozilla? I would love to make sure whatever flow led you to that is made clearer for other newcomers, because that is definitely not something anyone who isn't working on Firefox should even try.
My favorite dev tool is diff2html - a CLI that opens up your browser with a rich diff. Pro tip: alias `diff` to the command so you can launch it quickly ;)
https://victorcmiraldo.github.io/data/MiraldoPhD.pdf#page=24
[1] https://difftastic.wilfred.me.uk/tricky_cases.html
The real difficult part is in how you represent AST-level changes, which will limit what your merging algorithm can do. In particular, working around moving "the same subtree" into different places is difficult. Imagine the following conflict:
([1,3], [4,2,5]) <-- q -- ([1,2,3], [4,5]) -- p --> ([1,3], [2,4,5])
Both p and q move the same thing to different places so they need a human to make a decision about what's the correct merge. Depending on your choice of "what is a change", even detecting this type of conflict will be difficult. And that's because we didn't add insertions nor deletions. Because now, say p was:
([1,2,3], [4,5]) -- p --> ([1,3], [2,5])
One could argue that we can now merge, because '4' was also deleted hence the position in which we insert '2' in the second list is irrelevant.
If we extrapolate from lists of integers to arbitrary ASTs the difficulties become even worse :)
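A toy Python check for the two list examples above, under strong assumptions: elements are unique, and a "conflict" is defined as the two results disagreeing on the relative order of the elements they both keep. The names `order_conflict` and `conflicting_lists` are made up, and this is a heuristic for illustration only, not how hdiff represents or merges changes.

```python
def order_conflict(p_list, q_list):
    """True if p and q disagree on the relative order of shared elements.

    Toy heuristic; assumes every element occurs at most once per list.
    """
    shared = set(p_list) & set(q_list)
    p_order = [x for x in p_list if x in shared]
    q_order = [x for x in q_list if x in shared]
    return p_order != q_order

def conflicting_lists(p_result, q_result):
    """Indices of the lists on which the two results cannot be reconciled."""
    return [
        index
        for index, (p_list, q_list) in enumerate(zip(p_result, q_result))
        if order_conflict(p_list, q_list)
    ]

q = ([1, 3], [4, 2, 5])
print(conflicting_lists(([1, 3], [2, 4, 5]), q))  # p only moves 2
print(conflicting_lists(([1, 3], [2, 5]), q))     # p also deletes 4
```

In the first case the second list is flagged, because p puts 2 before 4 and q puts it after. In the second case 4 is gone on p's side, so the only shared ordering left (2 before 5) agrees and nothing is flagged, matching the "one could argue we can now merge" reading.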
It's more simplistic than difftastic though: it considers `1` and `[1]` to have nothing in common.
HTML and XML are missing, too.
Sadly YAML, TOML and the others I mentioned are not there (yet?)
Profanity is just delightful in general, and non-native English speakers come up with some of the best profane idioms in English.
I wonder if it’s the same in other languages?
https://www.amazon.com/gp/aw/d/486256139X
[1]: https://github.com/Wilfred/difftastic/graphs/contributors
[2]: https://github.com/tree-sitter/tree-sitter
[3]: https://github.com/Wilfred/difftastic/commit/958033924a2dea7...
Honestly, I cannot imagine going back to the standard emacs help.
https://news.ycombinator.com/item?id=27768861
Dear author: Let us run your tool.
Agree though that some pre-built binaries would be fantastic!
Spun up an Ubuntu 18.04 instance -> git clone, git checkout 0.24.0
installed rust using curl | sh method
build fails:
https://termbin.com/29xy
removed the instance and gonna check it again 6 months later
Edit: Reinstalling Cargo worked!
EDIT: Also, the build fails. :(
error: unexpected token: `include_str`
 --> /home/loxias/.cargo/registry/src/github.com-1ecc6299db9ec823/radix-heap-0.4.2/src/lib.rs:2:10
  |
2 | #![doc = include_str!("../README.md")]
  |          ^^^^^^^^^^^

error: aborting due to previous error

error: could not compile `radix-heap`.
sad trombone
I've documented the minimum rust version required today, although I'm looking at lowering the minimum version.
https://diff2html.xyz/