> Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.
I now have a bunch of layers of text/markdown: system prompts, AGENTS.md, SKILL.md, plus user tweaks or full-out replacements to these in every repo or subproject.
Then we want to do things like update the "root" system prompt and have that applied everywhere.
There are analogies in git, CMS templating systems, and software package interfaces and versioning. Doing it all with plain text doesn't feel right to me.
Any other approaches to this problem? Or are Beagle, ASTs, and CRDTs really onto something here?
The linked page looks like a subsystem of some specific library, I am not sure if it is intended for general use.
If it were intended as a general replacement for general-purpose version control systems, I am not sure how storing an AST is better than storing the original plain-text files, since the transformation from text to AST might be lossy. I might want to store files with no AST (e.g. plain-text files), files with multiple ASTs (e.g. polyglots), multiple files with the same AST (e.g. files to test different code layouts), or files with a broken AST (e.g. data files used as test cases). These use cases would be trivially supported by storing the original file as-is, whereas storing any processed form of the file would require extra work.
(Author) There is a fall-back general-text codec: tokens, no AST (e.g. for Markdown). If that fails (non-UTF-8 content), there is a general-blob final-fallback codec (the git mode).
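That fallback cascade can be sketched in a few lines. Here `parse_ast` is a stand-in stub, not Beagle's real parser, and the codec names are illustrative only:

```python
def parse_ast(text: str):
    # Stand-in for a real language parser (e.g. tree-sitter).
    # Returns a tree, or None when no grammar matches.
    return None  # pretend no grammar matched

def choose_codec(data: bytes):
    # Cascade described above: AST codec -> general-text codec -> general-blob codec.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ("blob", data)        # final fallback: opaque bytes, "the git mode"
    tree = parse_ast(text)
    if tree is not None:
        return ("ast", tree)
    return ("tokens", text.split())  # general text: tokens, no AST (e.g. Markdown)

assert choose_codec(b"\xff\xfe")[0] == "blob"          # not valid UTF-8
assert choose_codec(b"# Some Markdown")[0] == "tokens"
```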
The way it builds the AST is lossless. Additionally, it stamps IDs on the nodes, so merges do not get confused by renames, formatting changes, and similar things. There is value in preserving structure this way that repeated parsing cannot provide. In big-O terms, working with such an AST and a stack of its patches is not much different from the stacks of binary diffs git uses.
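A minimal sketch of the stable-ID idea (a hypothetical `Node` type, not Beagle's data model): a rename edits a node's text but leaves its identity alone, so a merge can match nodes by ID instead of by content.

```python
import itertools
from dataclasses import dataclass, field

_fresh = itertools.count()

@dataclass
class Node:
    kind: str
    text: str
    # Identity is stamped once at creation and never changes afterwards.
    id: int = field(default_factory=lambda: next(_fresh))

fn = Node("function_name", "f")
before = fn.id

fn.text = "g"  # a rename edit: content changes, identity does not

assert fn.id == before  # merge tooling can match the node by ID,
assert fn.text == "g"   # so a rename is not mistaken for a delete + add
```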
If I have k independent changesets, I have ~k^2 unplanned pairwise interactions and 2^k unplanned change combinations. With a bunch of changesets that I have not yet fully evaluated, especially in relation to one another, I would like k-way merges and repeat merges to be seamless, non-intrusive, and deterministic. git's merges are not.
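The growth rates are easy to check concretely; for k = 5 changesets:

```python
from itertools import combinations

k = 5
changesets = [f"cs{i}" for i in range(k)]

# Unordered pairs that can interact: k*(k-1)/2, i.e. ~k^2 growth.
pairs = list(combinations(changesets, 2))
assert len(pairs) == k * (k - 1) // 2 == 10

# Combinations you might want to merge and ship together: 2^k subsets.
n_subsets = sum(1 for r in range(k + 1) for _ in combinations(changesets, r))
assert n_subsets == 2 ** k == 32
```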
AST of what? Will it read my Clojure code's forms as such? What if my source file has a paren-balancing error? I feel I'm thinking about this at the wrong level/angle.
This sounds good in theory, but it means Beagle needs to understand how to parse every language and keep up with how they evolve. That sounds like a ton of work, and a regression could be a disaster. It'll be interesting to see how this progresses, though.
IMO this really isn’t a huge problem for this project specifically, since that part is outsourced to tree-sitter which has a lot of effort behind it to begin with.
I think this project is incredibly cool as a line of research/thought, but my general experience with providing human interfaces through abstractions over source code suggests that most people in general, and programmers especially, are better at reasoning in the source-code space. Of course, Beagle can generate into the source-code space at each user interaction point, but at that point, why not do the opposite, which is what we already do with language servers and AST-driven (semantic) merge and diff tools?
It's also just one more facet. The problem already exists for anything else that we already have, like formatters, linters, syntax highlighters, language servers... And it's also not an exclusive choice. If you want to use a dumb editor, there's nothing preventing that. All of the machinery to go back and forth to text exists. Not really a huge departure.
mmm. interesting and fun concept, but it seems to me like the text is actually the right layer for storing and expressing changes since that is what gets read, changed and reasoned about. why does it make more sense to use asts here?
are these asts fully normalized or do (x) and ((x)) produce different trees, yet still express the same thing?
why change what is being stored and tracked when the language aware metadata for each change can be generated after the fact (or alongside the changes)? (adding transform layers between what appears and what gets stored/tracked seems like it could get confusing?)
For one, it eliminates a class of merge conflict that arises strictly from text formatting.
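Whether `(x)` and `((x))` normalize to the same tree depends on the grammar; Python's stdlib parser, used here only as a stand-in for whatever grammar Beagle uses, discards both whitespace and grouping parentheses at parse time:

```python
import ast

def shape(src: str) -> str:
    # ast.dump renders the parse tree; by default it omits source positions,
    # so formatting differences cannot show up in the result.
    return ast.dump(ast.parse(src))

# Whitespace-only differences produce identical trees -> no conflict possible:
assert shape("x=1") == shape("x = 1") == shape("x  =  1")

# This parser also normalizes redundant grouping parentheses away:
assert shape("y = (x)") == shape("y = ((x))") == shape("y = x")
```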
I always liked the idea of storing code in abstraction, especially editors that supported edit-time formatting. I enjoy working on other people's code, but I don't think anybody likes the tedium of complying with style guides, especially ones enforced at the SCM level, which adds friction to creating even local, temporary revisions. This kind of thing would obviate that. That's why I also appreciate strict and deterministic formatters like rustfmt. Unison goes a little further, which is neat, but I think they're struggling to get adoption because of that, even though I'm pretty sure they've got some tooling for working outside the whole ecosystem. Decoupled tools like these are probably a good way to go.
I was messing around with a file-less paradigm that would present a source tree in arbitrary ways, like showing just individual functions, so you have the things you're working on co-located rather than switching between files. Kind of like the old VB IDE.
I remember someone mentioning a system that operated with ASTs like this in the 70s or 80s. One of the affordances was that the source base did not require a linter: everyone reading the code could have it formatted the way they liked, and it would all still work with other people's code.
Related, I’d love an editor that’d let me view/edit identifier names in snake_case and save them as camelCase on disk. If anyone knows of such a thing - please let me know!
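Not aware of an editor that ships this, but the projection itself is a few lines. The subtlety is that the mapping is not always a clean round trip (runs of capitals, e.g. `renderHTML`, are lossy), which is exactly where a stored identifier layer would have to make a choice:

```python
import re

def camel_to_snake(name: str) -> str:
    # Insert "_" before any capital that follows a lowercase letter or digit.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

def snake_to_camel(name: str) -> str:
    head, *rest = name.split("_")
    return head + "".join(w.capitalize() for w in rest)

assert camel_to_snake("parseHttpResponse") == "parse_http_response"
assert snake_to_camel("parse_http_response") == "parseHttpResponse"

# Round-trips cleanly only when there are no runs of capitals:
assert snake_to_camel(camel_to_snake("renderHtml")) == "renderHtml"
assert snake_to_camel(camel_to_snake("renderHTML")) != "renderHTML"  # lossy
```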
100% agree. I think AST-driven tooling is very valuable (most big companies have internal tools akin to each operation Beagle provides, and Linux has Coccinelle/spatch, for example), but it's still just easier implemented as a layer on top of source code than as the fundamental source of truth.
There are some clever things that can be done with merge/split using CRDTs as the stored transformation, but they're hard to reason about compared to just semantic merge tools, and don't outweigh the cognitive overhead IMO.
Having worked for many years with programming systems that were natively expressed as trees (often just operation trees and object graphs, discarding the notion of syntax completely), I can say this layer is incredibly difficult for humans to reason about, especially when it comes to diffs. Usually, in the end, you have to build a system that can produce and act upon text-based diffs anyway.
I think there's some notion of these kinds of revision-management tools being useful for an LLM, but again, at that point you might as well run them on the side (just perform the source -> AST transformation at each commit) rather than use them as the core storage.
One nice thing about serializing/transmitting AST changes is that it makes it much easier to compose and transform change sets.
The text-based-diff method works fine if everyone is working off one head, but when you're trying to compose a release from a lot of branches, it's usually a huge mess. Text-based diffs also make maintaining forks harder.
Git is going to become a big bottleneck as agents get better.
what do you actually gain over enforced formatting?
First, you should not be composing releases at the end from conflicting branches; you should be integrating branches, testing each one in sequence, and then cutting releases. If there are changes to the base of a given branch, that branch has to be updated and re-tested. Simple as that. Storing changes as normalized trees rather than normalized text doesn't really buy you anything, except maybe slightly smarter automatic merge-conflict resolution, and even then the result needs to be analyzed and tested.
Having a VCS that stores changes as refactorings, combined with an editor that reports the refactorings directly to the VCS without plain-text files as an intermediate format, would avoid losing information along the way.
The downside is tight coupling between VCS and editor. It will be difficult to convince developers to use anything other than their favourite editor when they want to use your VCS.
I wonder if you can solve it the language-server way, so that each editor that supports refactoring through language-server would support the VCS.
It makes a lot of sense for math-focused LLMs to work with higher-order symbols (or context-dependent chunking) rather than tokens. The same is probably true for software.
> Any other approaches to this problem?

https://www.unison-lang.org/docs/the-big-idea

> The linked page looks like a subsystem of some specific library, I am not sure if it is intended for general use.

The project is experimental at this point.

> ...why not do the opposite thing, which is what we already do with language servers and AST driven (semantic) merge and diff tools?

Would you say these are commonly in use, and if so, what are some "mainstream" examples? In my experience, most people just use git's built-in diff/merge...

> ...it's still just easier implemented as a layer on top of source code than the fundamental source of truth.

Easier, but much less valuable.

From "Large Language Models for Mathematicians (2023)" (2025) https://news.ycombinator.com/item?id=42899805 :

> It makes sense for LLMs to work with testable code for symbolic mathematics: CAS (Computer Algebra System) code instead of LaTeX, which only roughly corresponds.

> Are LLMs training on the AST parses of the symbolic expressions, or token co-occurrence? What about training on the relations between code and tests?

There are already token co-occurrence relations between test functions and the functions under test that they call. What additional information would it be useful to parse, extract, and graph-rewrite onto source code before training, embedding lookup, and agent reasoning?
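As a baseline, that call relation is already cheap to extract from an AST; a sketch with Python's stdlib `ast` module (toy source, assuming the usual `test_*` naming convention):

```python
import ast

SRC = """
def add(a, b):
    return a + b

def test_add():
    assert add(2, 2) == 4
"""

tree = ast.parse(SRC)
defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}

# Map each test_* function to the project functions it calls,
# straight from the parse tree.
links = {}
for fn in tree.body:
    if isinstance(fn, ast.FunctionDef) and fn.name.startswith("test_"):
        calls = {c.func.id for c in ast.walk(fn)
                 if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)}
        links[fn.name] = sorted(calls & defined)

assert links == {"test_add": ["add"]}
```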