Readit News logoReadit News
DarkPlayer · 10 months ago
Looking at the architecture, they will probably run into some issues. We are doing something similar with SemanticDiff [1] and also started out using tree-sitter grammars for parsing and GumTree for matching. Both choices turned out to be problematic.

Tree sitter grammars are primarily written to support syntax highlighting and often use a best effort approach to parsing. This is perfectly fine for syntax highlighting, since the worst that can happen is that a few characters are highlighted incorrectly. However, when diffing or modifying code you really want the code to be parsed according to the upstream grammar, not something that mostly resembles it. We are currently in the process of moving away from tree-sitter and instead using the parsers provided by the languages themselves where possible.

GumTree is good at returning a result quickly, but there are quite a few cases where it always returned bad matches for us, no matter how many follow-up papers with improvements we tried to implement. In the end we switched over to a dijkstra based approach that tries to minimize the cost of the mapping, which is more computationally expensive but gives much better results. Difftastic uses a similar approach as well.

[1]: https://semanticdiff.com/

wetneb · 10 months ago
Thanks for the insightful comments! You surely have a lot more experience than me there, but my impression was that producing visual diffs and merging files are tasks that put different requirements on the tree matching algorithms, and Dijkstra-style approaches felt more fitting for diffs than for merging, so that's why I went for GumTree as it seemed to be the state of the art for merging. Does SemanticDiff offer a merge driver? I could only find documentation about diffing on the website.

As to mismatches: yes, they are bound to happen in some cases. Even for line-based diffing, Git uses rather convoluted heuristics to avoid them (with the "histogram" diff algorithm), but they can't be completely ruled out there either. I hope that with enough safeguards (helper to review merges, downstream consistency checks with local fall-back to line-based diffing) they can be lived with. I'm happy to try other matching algorithms if they are more promising though (there isn't much coupling with the rest of the pipeline).

Concerning tree-sitter, I have noticed some small issues, but nothing that was a show-stopper so far. I actually like it that it's designed for syntax highlighting, because it's really helpful that the representations it gives stay faithful to the original source, to avoid introducing reformatting noise in the merging process. Parsers written for a specific language can sometimes be too zealous (stripping comments out, doing some normalizations behind your back). That's a problem in Spork (which uses Spoon, a pretty advanced Java parser). And the uniform API tree-sitter offers over all those parsers is just too good to give up, in my opinion.

DarkPlayer · 10 months ago
I don't think that different algorithms are better for merging or diffing. In both cases, the first step is to match identical nodes, and the quality of the final result depends heavily on this step. The main problem with GumTree is that it is a greedy algorithm. One incorrectly matched node can completely screw up the rest of the matches. A typical example we encountered was adding a decorator to a function in Python. When other functions with the same decorator followed, the algorithm would often map the newly added decorator to an existing decorator, causing all other decorator mappings to be "off-by-one". GumTree has a tendency to come up with more changes than there actually are.

We try to really get the diff quality nailed down before going after merges. We don't have merge functionallity in SemanticDiff yet.

The main issue we have with tree-sitter is that the grammars are often written from scratch and not based on the upstream grammar definition. Sometimes they only cover the most likely cases which can lead to parsing errors or incorrectly parsed code. When you encounter parsing errors it can be difficult to fix them, because the upstream grammar is structured completely different. To give you an example, try to compare the tree-sitter Go grammar for types [1] with the upstream grammar [2]. It is similar but the way the rules are structured is somewhat inverted.

We use separate executables for the parsers (this also helps to secure them using seccomp on Linux), and they all use the same JSON schema for their output. This allows us to write the parser executable in the most appropriate language for the target language. Building all them statically and cross-platform for our VS Code extension isn't easy though ;)

[1]: https://github.com/tree-sitter/tree-sitter-go/blob/master/gr... [2]: https://go.dev/ref/spec#Types

abathur · 10 months ago
> We are currently in the process of moving away from tree-sitter and instead using the parsers provided by the languages themselves where possible.

I imagine this means you're trying to abstract over those parsers somehow? How well is that going, and have you written about your approach?

(I wrote `resholve` to identify and rewrite references to external dependencies in bash/posixy Shell scripts to absolute paths. This is helpful in the Nix ecosystem to confirm the dependencies are known, specified, present, don't shift when run from a service with a different PATH, etc.

It builds on the mostly-bash-compatible OSH parser from the oilshell/oils-for-unix project for the same reasons you're citing.

It would be ~nice to eventually generalize out something that can handle scripts for other shell languages like fish, zsh, nushell, elvish, the ysh part of the oils-for-unix project, etc., but I suspect that'll be a diminishing-return sort of slog and haven't had any lightbulb-moments to make it feel tractable yet.

We also have some ~related needs here around identifying hardcoded or user-controlled exec...)

DarkPlayer · 10 months ago
Our parsers simply return the concrete syntax trees in a JSON format. We do not unify all the different syntax constructs into a common AST if that is what you are looking for. The languages and file formats we support are too diverse for that.

The language specific logic does not end with the parsers though. The core of SemanticDiff also contains language specific rules that are picked up by the matching and visualization steps. For example, the HTML module might add a rule that the order of attributes within a tag is irrelevant. So it all comes down to writing a generic rule system that makes it easy to add new languages.

Sesse__ · 10 months ago
An important point here is that for certain languages, using the original grammar is pretty much impossible. In particular, for C, you want to do diffing and merging on the un-preprocessed source, but the language's grammar very much assumes the source has gone through the preprocessor.

Of course, the existence of the preprocessor means there are situations where it's completely impossible to know what the correct parse is; it will necessarily be heuristic in some cases.

herrington_d · 10 months ago
Hi! ast-grep[1] author here. It is a tree-sitter based syntax tool to search tool.

I wonder how you transition from tree-sitter to other builtin parsers? Tree-sitter gave a unified interface to all languages. Using language native parsers will require significant work for various FFI if I am not wrong.

[1]: https://ast-grep.github.io/

alexpovel · 10 months ago
Not the OP, but you raise good points. Performance might also be a concern, thinking of languages like Python and its ast package (not sure that’s accessible without going through the interpreter).

For a tool I’m writing, the tree-sitter query language is a core piece of the puzzle as well. Once you only have JSON of the concrete syntax trees, you’re back to implementing a query language yourself. Not that OP needs it, but ast-grep might?

OJFord · 10 months ago
> best effort approach to parsing. This is perfectly fine for syntax highlighting, since the worst that can happen is that a few characters are highlighted incorrectly. However, when diffing or modifying code you really want the code to be parsed according to the upstream grammar, not something that mostly resembles it.

But surely you need to support code that doesn't parse correctly by the actual language's grammar anyway? 'Merge branch fix-syntax-error'

wetneb · 10 months ago
In Mergiraf, as soon as there is a parsing error in any of the revisions, it falls back on line-based merging, even though tree-sitter is generally good at isolating the error. It felt like the safest thing to do (maybe we detected the language wrong), but I'm definitely open to reconsidering…
Gibbon1 · 10 months ago
Small brained primate comment.

I've wondered if you could add annotation keywords to languages to convert them into something that could be parsed reliably with a tree sitter grammar.

I say this as someone that feels like you really want diffs that say, changed 'struct x name from y to z' instead of here's a huge list of files with ten changes each.

drawnwren · 10 months ago
This may or may not be on your radar, but crypto is desperate for a product like this. Smart contracts are often forks or rewrites (obfuscated or otherwise) of others and an easy interface for end users to be able to see changes between two forks would probably provide a lot of value.
gritzko · 10 months ago
Interesting. My students used Language Server Protocol data to make syntax-aware diffs. Very promising project. Unfortunately, everyone moved on. I am currently looking for ways to revitalize it.

https://github.com/shishyando/tokenized-myers-diff

Sparkyte · 10 months ago
Good example of when adding abstraction is more problematic than the processes themselves which is like an extra minute or two in flow.
Game_Ender · 10 months ago
The tool has an excellent architecture section [0] that goes into how it works under the hood. It stands out to me that a complex tool has an overview to this depth that allows you to grasp conceptually how it works.

0 - https://mergiraf.org/architecture.html

__MatrixMan__ · 10 months ago
That is nicely done, often hard to find, and usually it's what I'm looking for when deciding whether to use a piece of software: Show me the complexity you've encapsulated so that I can be the judge of whether the juice is worth the squeeze.

Armed with that, I can tolerate some rough edges. Without it, I'll get stuck in weird ways that your docs can't anticipate.

chrismorgan · 10 months ago
Going through the sorts of conflicts it solves, and limitations in that, I find it claiming that in some insertions, order doesn’t matter <https://mergiraf.org/conflicts.html#neighbouring-insertions-...>.

I really don’t like that. At the language level, order may not matter, but quite frequently in such cases the order does matter, insofar as almost every human would put the two things in a particular order; or where there is a particular convention active. If you automatically merge the two sides in a different order from that, doing it automatically has become harmful.

My clearest example: take Base `struct Foo; struct Bar;`, then between these two items, Left inserts `impl Foo { }`, Right inserts `struct Baz;`. To the computer, the difference doesn’t matter, but merging it as `struct Foo; struct Baz; impl Foo { } struct Bar;` is obviously bad to a human. This is the problem: it’s handling language syntax semantics, but can’t be aware of logical semantics. (Hope you can grasp what I’m trying to convey, not sure of the best words.) Left was not inserting something between Foo and Bar, it was attaching something to the end of Foo. Whereas Right was probably inserting something between Foo and Bar—but maybe even it was inserting something before Bar. You perceive that these are all different things, logically.

Another example where this will quickly go wrong: in CSS rulesets, some will sort the declarations by property name lexicographically, some by property name length (seriously, it’s frequently so pretty), some will group by different types of property… you can’t know.

soraminazuki · 10 months ago
Not only that, the order of fields in a Java class does matter despite what that link claims. It's common to use Lombok to automatically generate constructors, and "the order of the parameters match the order in which the fields appear in your class."

https://projectlombok.org/features/constructor

The first two kinds of conflicts that Mergiraf handles looks somewhat dangerous to me when handled by a computer.

https://mergiraf.org/conflicts.html

wetneb · 10 months ago
Lombok is an interesting example, but yes, just with reflection you can already get order-dependent behaviors as the docs note. I've been thinking about giving users more control over this commutativity, but it's not clear to me what it should look like. A strict mode where commutativity is disabled entirely? The ability to disable certain commutative parents?
andrewaylett · 10 months ago
Code that uses Lombok features which change classes (rather than subclassing them) might have a high degree of similarity to Java, but it's not Java.
wetneb · 10 months ago
Yes, that's definitely something that could be refined, for instance by specifying that only children of specific types can be reordered together: https://codeberg.org/mergiraf/mergiraf/issues/6
soulofmischief · 10 months ago
I sort my CSS by property type (display, size, position, etc). Takes a while to get used to it but it definitely speeds up my ability to do CSS surgery.
wonger_ · 10 months ago
Similarly, I've been wanting to make an autoformatter that reorders CSS properties into these categories: https://9elements.com/css-rule-order/ (box model, positioning, typography, etc)

I think it's the most useful CSS organization method I've found yet.

bjackman · 10 months ago
I am very excited about this tool despite agreeing with what you wrote here.

The reason for that is that most of the time when I'm resolving huge numbers merge conflicts... I don't give a shit about details like field order. I just want to get some code that's functionally correct at p<0.05 so I can figure out what performance characteristics my old feature branch would have if I resurrected it. Or I want to kick off the slow integration tests ASAP. 9/10 times it's that kind of thing.

The 1/10 times where I'm like "OK now I actually wanna merge this code, I have to look over the resolutions, I will upload them to Gerrit/GitHub and do a self-review" I am more than happy to spend 20 minutes correcting order etc. Or I'll happily just switch this tool off and do a totally manual merge.

So yeah I think it just comes down to usecase.

chrismorgan · 10 months ago
I want to add another problem here. <https://mergiraf.org/adding-a-language.html#add-commutative-...>:

> To let Mergiraf reorder using statements to fix conflicts, we actually need to specify that they can be reordered with any of their siblings (any other child of their parent in the syntax tree).

That’s too coarse-grained. No idea about C♯, but languages could impose a rule that imports must come before anything else—long ago Rust had such a rule, for example. So it might be that within your compilation_unit node, only its using_directive children are commutative, and only among themselves.

Otherwise (lapsing into Rust syntax for convenience), with Base `use A; struct X…`, Left `use A; use B; struct X…` and Right `use A; struct Y; struct X…`, you could end up with the invalid `use A; struct Y; use B; struct X…`.

tokinonagare · 10 months ago
> No idea about C♯

C#'s using aren't imports, and order indeed doesn't matter.

IshKebab · 10 months ago
Yes maybe, but these issues are true for Git's native merge algorithms too. It isn't perfect either.

As soon as you do any merge you're accepting that there might be edits you don't agree with.

andybak · 10 months ago
Could some of this be mitigated by running a prettifier post-merge?
wetneb · 10 months ago
It's definitely something I would recommend in general, but I'm not sure if it would solve this particular problem (reordering blocks is perhaps a bit bold for a prettifier).
CGamesPlay · 10 months ago
But surely Mergiraf has some opinion about the order when it doesn’t matter, right? Like structs before impls, in your example.
wetneb · 10 months ago
For now, you let it reorder every child within a given node type, which felt expressive enough to me in most cases, but I agree it would be good to refine that: https://codeberg.org/mergiraf/mergiraf/issues/6
nathell · 10 months ago
‘Why the giraffe? Two reasons. First, it can see farther due to its height; second, it has one of the biggest hearts of all land mammals. Besides, its ossicones make you believe it listens to you when you look at it.’ – My NVC teacher

Kudos for the nonviolence. :)

aeonik · 10 months ago
Fyi, giraffes are wild animals, and can be very violent and territorial.
mnsc · 10 months ago
I'm very literal minded and many metaphors irk me because they require a unspoken shared agreement of only considering the positive aspects that match and ignore the inconvenient ones. Like all sports metaphors used in work related activities. But still, don't skip nvc just because you can't accept a violent animal as a symbol for non violent communication. It has to much value for us literal minded people.
postepowanieadm · 10 months ago
Boa constrictor could be a less friendly mascot.
arunix · 10 months ago
There was such a project long time ago.

https://boa-constructor.sourceforge.net/

lucasoshiro · 10 months ago
Happy to see something being developed for merge drivers, they are a underrated Git feature that could save a lot since the standard three-way merge of file contents is not aware of the language and can create some problems. For example, if you have this valid Python code:

x = input()

if x == 'x': print('foo')

    print('bar')
If you delete the first print in a branch, delete the other print in another branch, then merge the two branches, you'll have this:

x = input()

if x == 'x':

Both branches delete a portion of the code inside the if block, leaving it only with a whitespace. In Python it is not a valid code, as they empty scopes need to be declared with pass.

I installed Mergiraf to see if it can solve this situation, but sadly, it doesn't support Python...

masklinn · 10 months ago
FWIW your examples are very unclear as they use text formatting, you need to indent lines by 4 spaces (with an empty line before and after) for a code block e.g.

if a == b: print("x")

versus

    if a == b:
        print("x")

chrismorgan · 10 months ago
The minimum required is actually only two spaces.
wetneb · 10 months ago
I tried your example but git does create a conflict in my case - but maybe I misunderstood the scenario. Python support can likely be done (I would be thrilled if someone made a PR for it), but I don't know if there is a lot of potential for solving conflicts there: imports can have side effects, function arguments are complicated with the mixture of positional and keyword arguments, decorators are effectful… it seems to me that there is a lot of sensitivity to order in many places.
erik_seaberg · 10 months ago
Syntax-aware tools always have issues when a team extends the base language to fit their problem. Rust has macros. People started using "go generate" for stuff like early generics. Does Mergiraf take EBNF or plugins or does a team fork it to explain their syntax?
wetneb · 10 months ago
Yeah at the moment it just supports whatever the tree-sitter parser accepts, period. A bring-your-own-grammar version could be interesting, I don't see why it couldn't work. Do you have any Rust crates to recommend, to do parsing according to a grammar supplied by the user at run time? It's likely to be slower, but maybe not prohibitively so…

Another approach would be for the tool to accept doing structured merging even if there are error nodes in the parsed tree. If those error span the parts of the file where the extended language is used, then the tool could still help with merging the other parts, treating the errors as atomic blocks. I'd be a bit reluctant to do that, because there could be errors for all sorts of other reasons.

papashell · 10 months ago
Since tree sitter parsers output a c library, you could dynamically load it.

The rust bindings themselves are a thin ffi wrapper.

If you wanted to make it a little smoother than needing to compile the tree sitter syntax you could compile/bundle grammars up with wasm so its sandboxed and cross platform

Edit: found this vscode extension that dynamically loads syntaxes compiled to wasm. You should be able to do the same thing in rust: https://github.com/selfint/vscode-tree-sitter

leonheld · 10 months ago
I'll certainly give it a try. Another tool I've been using (with varied degree of success) to enhance my git life is https://github.com/tummychow/git-absorb. If both of these worked flawlessly or maybe even officially incorporated in git, I'd be very happy.
ksynwa · 10 months ago
This sounds like what jujutsu's workflow is like by default

Deleted Comment

leonheld · 10 months ago
I've heard of jujutsu, but I'm kinda over learning new tooling, specially experimental one. If it's not a drop-in replacement that makes my life better (like these easy "git plugins"), I'm basically not using.
dochtman · 10 months ago
git absorb is great, use it all the time.
IshKebab · 10 months ago
This sounds great. To be honest though none of the merge tools really give me enough information to resolve all conflicts easily.

The best I've got to is zdiff3 in VSCode (not using their fancy merge view which I don't understand at all). But it's missing:

1. Blame for the merge base.

2. Detection of the commit that introduced the first conflict.

3. Most annoyingly, no way to show diffs between the "current" and "incoming". IIRC it has buttons to compare both of those to the merge base, but not to each other. That often leaves me visually scanning the text to manually find differences like a neanderthal. Sometimes it's annoying enough that I copy & paste current/incoming into files and then diff those but that's a right pain.

secondcoming · 10 months ago
Have you tried p4merge? It's usually one of the first tools I install.
IshKebab · 10 months ago
No. Does it do any of the things I want?