hallh · a month ago
We've tackled this problem slightly differently where I work. We have AI agents contribute in a large legacy codebase, and without proper guidance, the agents quickly get lost or reimplement existing functionality.

To help the agents understand the codebase, we indexed our code into a graph database using an AST, allowing the agent to easily find linked pages, features, databases, tests, etc from any one point in the code, which helped it produce much more accurate plans with less human intervention and guidance. This is combined with semantic search, where we've indexed the code based on our application's terminology, so when an agent is asked to investigate a task or bug for a specific feature, it'll find the place in the code that implements that feature, and can navigate the graph of dependencies from there to get the big picture.

We provide these tools to the coding agents via MCP and it has worked really well for us. Devs and QAs can find the blast radius of bugs and critical changes very quickly, and the first draft quality of AI generated plans requires much less feedback and corrections for larger changes.

In our case, I doubt that a general purpose AST would work as well. It might be better than a simple grep, especially for indirect dependencies or relationships. But IMO, it'd be far more interesting to start looking at application frameworks or even programming languages that provide this direct traversability out of the box. I remember when reading about Wasp[0] that I thought it would be interesting to see it go this way, and provide tooling specifically for AI agents.

[0] https://wasp.sh/
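A minimal sketch of the indexing idea using only Python's stdlib `ast` module, with a plain dict standing in for the graph database (the `CallGraphIndexer` class and the sample code are illustrative inventions, not the commenter's actual system):

```python
import ast
from collections import defaultdict

class CallGraphIndexer(ast.NodeVisitor):
    """Index which functions call which names, so a tool (or agent)
    can traverse from any symbol to its dependents."""
    def __init__(self):
        self.edges = defaultdict(set)   # caller -> {callee names}
        self._current = "<module>"

    def visit_FunctionDef(self, node):
        prev, self._current = self._current, node.name
        self.generic_visit(node)
        self._current = prev

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.edges[self._current].add(node.func.id)
        self.generic_visit(node)

source = """
def load_user(uid):
    return query_db("users", uid)

def handler(req):
    return load_user(req.uid)
"""
indexer = CallGraphIndexer()
indexer.visit(ast.parse(source))

# reverse the edges: who is in the blast radius of query_db?
callers = {f for f, callees in indexer.edges.items() if "query_db" in callees}
print(callers)  # {'load_user'}
```

A real system would persist these edges in a graph database and combine them with a semantic index over application terminology, but the traversal primitive is the same.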

namibj · a month ago
Who'd have thought advanced semantic navigation and search as e.g. in the IDEA (Jetbrains) family of IDEs with framework awareness helps not just humans?

Also note it's "structural search (and replace)" that lets you:

- essentially regex against the semantically annotated AST, which gives you things like matching on function calls that are passed an (otherwise arbitrary) object implementing some particular interface (whether that interface is nominally typed as in Java, or structurally typed as in TypeScript);
- or, fancier, database queries with a join condition equality-matching a string type, invoked from inside a loop, with another database query outside and in front of that very loop.
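The "query inside a loop" pattern (the classic N+1 smell) can be approximated with a plain AST walk, a toy version of what structural search does with full semantic annotation (the `fetch` function name is just an assumption for the example):

```python
import ast

source = """
rows = fetch("SELECT * FROM orders")
for order in rows:
    item = fetch("SELECT * FROM items WHERE id = ?", order.item_id)
"""

tree = ast.parse(source)
matches = []
for loop in ast.walk(tree):
    if isinstance(loop, (ast.For, ast.While)):
        # any fetch() call nested anywhere under a loop node
        for node in ast.walk(loop):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == "fetch"):
                matches.append(node.lineno)
print(matches)  # [4]: the inner query runs inside the loop
```

A structural search engine does the same thing, except the match conditions are typed (e.g. "callee returns a DB result set") rather than name-based.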

Personally, I hated it when, due to a poorly debated user request, they turned off auto-indenting to the full file column for multi-line inline SQL in PHP: not only did it force a massive commit that messes up `git blame`, it also just annoyingly throws away the outer PHP context's indentation level, pushing the SQL waaay too far left.

andout_ · a month ago
This is close to what we're doing with [Encore](https://encore.cloud). The framework parses your application code through static analysis at compile time to build a full graph of services, APIs, databases, queues, cron jobs, and their dependencies. It uses that graph to provision infrastructure, generate architecture diagrams, API docs, and wire up observability automatically.

The interesting side effect is that AI tools get this traversability for free. When business logic and infrastructure declarations live in the same code, an AI agent doesn't need a separate graph database or MCP tool to understand what a service depends on or what infrastructure it needs. It's all in the type signatures. The agent generates standard TypeScript or Go, and the framework handles everything from there to running in production.

Our users see this work really well with AI agents as the agent can scaffold a complete service with databases and pub/sub, and it's deployable immediately because the framework already understands what the code needs.

SOLAR_FIELDS · a month ago
I’m currently experimenting with something similar on a smaller scale using continue.dev’s code indexing implementation to expose a context mcp server for both semantic and code search. Tricky part is of course context management.
panstromek · a month ago
I think I agree (though I think about this maybe one level higher). I wrote about this a while ago in https://yoyo-code.com/programming-breakthroughs-we-need/#edi... .

One interesting thing I got in the replies is the Unison language (content-addressed functions; a function is defined by its AST). I also recommend checking out the Dion language demo (an experimental project which stores the program as an AST).

In general I think there's a missing piece between text and storage. Structural editing is likely a dead end (writing text seems superior), but text as a storage format is just fundamentally problematic.

I think we need a good bridge that allows editing via text but storage like a structured database (I'd go as far as to say a relational database, maybe). This would unlock a lot of IDE-like features for simple programmatic usage, or allow manipulating language semantics in some interesting ways, but the challenge is of course how to keep the mapping to the textual input in shape.

fuhsnn · a month ago
Structural diff tools like difftastic[1] are a good middle ground and still underexplored IMO.

[1] https://github.com/Wilfred/difftastic

panstromek · a month ago
IntelliJ diffs are also really good; they're somewhat semi-structural, I'd say. Not going as far as difftastic, it seems (but I haven't used that one).
thesz · a month ago

  > Dion language demo (experimental project which stores program as AST).
Michael Franz [1] invented slim binaries [2] for the Oberon System. Slim binaries were program (or module) ASTs compressed with some kind of LZ-family algorithm. At the time they were much smaller than Java's JAR files, despite JAR being a ZIP archive.

[1] https://en.wikipedia.org/wiki/Michael_Franz#Research

[2] https://en.wikipedia.org/wiki/Oberon_(operating_system)#Plug...

I believe that this storage format is still in use in Oberon circles.

Yes, I am that old, I even correctly remembered Franz's last name. I thought then he was and still think he is a genius. ;)

panstromek · a month ago
Interesting. It looks to me like this was more about the portability of the resulting binary, IIUC.

Dion project was more about user interface to the programming language and unifying tools to use AST (or Typed AST?) as a source of truth instead of text and what that unlocks.

Dion demo is here: https://vimeo.com/485177664

flowerbreeze · a month ago
I'm quite sure I've read your article before, and I've thought about this one a lot. Not so much from a Git perspective, but about the textual representation still being the "golden source" for what the program is when interpreted or compiled.

Of course, text is so universal and allows for so many ways of editing that it's hard to give up. On the other hand, while text is great for input, it comes with overhead and core issues (most are already in the article, but I'm writing them down anyway):

  1. Substitutions, such as renaming a symbol, where ensuring the correctness of the operation pretty much requires parsing the text into a graph representation first, or else giving up the guarantee of correctness and performing a plain text search/replace.
  2. Alternative representations requiring full and correct re-parsing, such as:
  - an overview of flow across functions
  - viewing graph-based data structures, of which there tend to be many in a larger application
  - the imports graph, and so on...
  3. Querying structurally equivalent patterns when they have multiple equivalent textual representations; search in general is somewhat limited.
  4. Merging changes and diffs come with fewer guarantees than merging graphs or trees.
  5. Correctness checks, such as detecting cyclic imports and ensuring the validity of the program itself, are all build-time, unless the IDE effectively maintains a duplicate program graph, continuously parsed from the changes, that is not equivalent to the eventual execution model.
  6. Execution and build speed is also a permanent overhead as applications grow when using text as the source. Yes, parsing methods are quite fast these days and the hardware is far better, but having a correct program graph is always faster than parsing, creating and verifying a new one.
I think input as text is a must-have to start with no matter what, but what if the parsing step were performed immediately on stop symbols rather than later, and merged into the program graph immediately rather than during a separate build step?

Or what if it was like a "staging" step? E.g., write a separate function that gets parsed into the program model immediately, then try executing it, and then merge it into the main program graph later, where all the necessary checks can run to ensure the main program graph remains valid. I think it'd be more difficult to learn, but having these operations, and the program graph as a database, would give so much when it comes to editing, verifying and maintaining more complex programs.
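Point 1 above is easy to demonstrate: a plain text replace renames too much, while an AST-level rewrite touches only the real symbol. A minimal sketch with Python's stdlib (renaming `count` is just an illustrative case):

```python
import ast

source = 'count = 1\nprint("count:", count)\n'

# naive text replace clobbers the string literal too
naive = source.replace("count", "total")

# AST-level rename touches only Name nodes, not string contents
class Rename(ast.NodeTransformer):
    def visit_Name(self, node):
        if node.id == "count":
            node.id = "total"
        return node

tree = Rename().visit(ast.parse(source))
structural = ast.unparse(tree)
print(naive)       # the "count:" label was rewritten too
print(structural)  # only the variable was renamed
```

The IDE gets this right only because it quietly maintains exactly the parsed graph the comment describes.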

panstromek · a month ago
> what if the parsing step was performed immediately on stop symbols rather than later and merged with the program graph immediately rather than during a separate build step?

I think this is the way to go, kinda like on GitHub: you write markdown in the comments, but that text is only used for input. After that it's merged into the system, all code-like constructs (links, references, images) are resolved, and from then on you interact with the higher-level concept (a rendered comment with links and images).

For programming languages, Unison does this: you write one function at a time in something like a REPL, and functions are saved in a content-addressed database.

> Or what if it was like "staging" step?

Yes, and I guess it'd have to go even deeper. The system should be able to represent a broken program (in an edited state), so conceptually it has to be something like a structured database for code that separates the user input from the stored semantic representation and the final program.

IDEs like IntelliJ already build a program model like this and incrementally update it as you edit; they just have to work very hard to do it, and the model is imperfect.

There are a million issues to solve with this, though. It's a hard problem.
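The Unison idea mentioned above can be sketched in a few lines: hash a canonicalized AST and use the digest as the function's identity, so formatting changes don't produce a "new" function. A toy version, not how Unison actually normalizes (Unison also canonicalizes names and references):

```python
import ast, hashlib

def content_address(source: str) -> str:
    """Address a function by the SHA-256 of its canonicalized AST dump."""
    tree = ast.parse(source)
    canonical = ast.dump(tree, annotate_fields=False)  # drops all formatting
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

a = content_address("def f(x):\n    return x + 1\n")
b = content_address("def f( x ):  return x+1\n")  # same code, different text
print(a == b)  # whitespace and layout don't change the address
```

With addresses like this as primary keys, "storage like a structured database" falls out naturally: the text is just one rendering of the stored node.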

zelphirkalt · a month ago
Why would structural editing be a dead end? It has nothing to do with the storage format. At least the meaning of the term I am familiar with is about how you navigate and manipulate semantic units of code instead of manipulating its characters: for example, pressing shortcut keys to invert the nesting of AST nodes, wrap an expression inside another, or change the order of expressions, all at the press of a button or key combo. I think you might be referring to something else, or a different definition of the term.
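Those operations are mechanical once you have the tree. For instance, "wrap an expression in another" is a one-node rewrite; a sketch using Python's stdlib, where the wrapper name `log` is just an example:

```python
import ast

class WrapReturns(ast.NodeTransformer):
    """Structural edit: wrap every returned expression in log(...)."""
    def visit_Return(self, node):
        if node.value is not None:
            node.value = ast.Call(
                func=ast.Name(id="log", ctx=ast.Load()),
                args=[node.value], keywords=[])
        return node

tree = ast.parse("def f(x):\n    return x + 1\n")
new = ast.fix_missing_locations(WrapReturns().visit(tree))
print(ast.unparse(new))  # the return value is now wrapped in log(...)
```

IDE refactorings like "surround with" are essentially this transform plus an inverse mapping back to the user's text.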
panstromek · a month ago
I'm referring to UI interfaces that only allow structural editing and usually store only the structural shape of the program (e.g. no whitespace or indentation). I think at this point nobody uses them for programming; they're pretty frustrating to use because they don't allow edits that break the semantic text structure too much.

I guess the most used one is the styles editor in Chrome dev tools, and that one is only really useful for small tweaks; even just adding new properties is already a pretty frustrating experience.

[edit] Otherwise I agree that structural editing à la IDE shortcuts is useful; I use that a lot.

conartist6 · a month ago
Come to the BABLR side. We have cookies!

In all seriousness this is being done. By me.

I would say structural editing is not a dead end, because, as you mention, projects like Unison and Smalltalk show us that storing structures is compatible with having syntax.

The real problem is that we need a common way of storing parse tree structures, so that we can build a semantic editor that works on the syntax of many programming languages.

panstromek · a month ago
I think neither Unison nor Smalltalk use structural editing, though.

[edit] at the level of the code inside a function, at least.

zokier · a month ago
> but storage format as text is just fundamentally problematic.

Why? The AST needs to be stored as bytes on disk anyway; what is problematic about having those bytes be human-readable text?

PunchyHamster · a month ago
No, we don't.

And you can build nearly any VCS of your dreams while still using Git as the storage backend, as it is a database of linked snapshots plus metadata. Bonus benefit: it will work with existing tooling.

The whole article is "I don't know how git works, let's make something from scratch"

conartist6 · a month ago
A syntax tree node does not fit into a git object: too many children. This doesn't mean we shouldn't keep everything that's great about git in a next-gen solution, but it does mean that we'll have to build some new tools to experiment with features like semantic patching and merging.

Also I checked the author out and can confirm that they know how git works in detail.

ongy · a month ago
Why do you think it has too many children? If we are talking direct descendants, I have seen way larger directories in (git-managed) file systems than I've ever seen in an AST.

I don't think there's a limit in git. The structure might be a bit deep for git, and thus some things might be unoptimized, but the shape is the same.

Tree.
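For reference, git's objects are already content-addressed, which is why the shapes line up: a blob's id is just the SHA-1 of a short header plus the bytes, and a tree is a list of (mode, name, id) entries pointing at blobs or further trees. Reproducing a blob id without git:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Reproduce `git hash-object`: sha1 of 'blob <len>\\0' + data."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# matches `echo hello | git hash-object --stdin`
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Storing an AST node per tree entry is therefore representable in git's model; the open question is whether it performs well at AST depths and fan-outs.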

PunchyHamster · a month ago
But you can make a parallel branch with the files parsed into ASTs (to keep the most expensive path cached).

Then you can use an alternative diff (which is pluggable, of course) to compare those ASTs, and quickly too.

Hell, you could generate those ASTs on the server, keeping a normal git client compatible, just unable to use this feature.
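Git's diff machinery is indeed pluggable per path via attributes; a sketch of routing files through an external AST differ (the `astdiff` command name is an invented placeholder):

```shell
# .gitattributes: route Python files through a custom diff driver
echo '*.py diff=astdiff' >> .gitattributes

# point the driver at a hypothetical AST-diff tool; git invokes it as:
#   <cmd> path old-file old-hex old-mode new-file new-hex new-mode
git config diff.astdiff.command 'astdiff'
```

Anyone without the driver configured just falls back to the normal textual diff, which is exactly the graceful-degradation property described above.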

ASalazarMX · a month ago
For anyone confused by this reply, the original title was

"Git is a file system. We need a database for the code"

Which begs the sequitur: "A database is just files in the file system. We need a database for the database"

nylonstrung · a month ago
Trustfall seems really promising for querying files as if they were a DB.

https://github.com/obi1kenobi/trustfall

charcircuit · a month ago
>I definitely reject the "git compatible" approach

If your version control system is not compatible with GitHub, it will be dead on arrival. The value of allowing people to gradually adopt a new solution cannot be overstated. There is also value in being compatible with existing git integrations or scripts in projects' build systems.

quadrifoliate · a month ago
Based on reading this, I don't see anything that would prevent tracking a repo managed by this database with Git (and therefore GitHub) in addition to the database. I think the "compatible" bit means more that you'd have to think in terms of Git concepts everywhere.

Curious what the author thinks though, looks like it's posted by them.

gritzko · a month ago
Technically, exporting changes either way is not a challenge. It only becomes difficult if we have multiple gateways for some reason.

One way to do it is to use the new system for the messy part and git/GitHub for "publication".

conartist6 · a month ago
A system as described could be forwards-compatible with git without being backwards-compatible with git. In other words, let people migrate easily, but don't force the new system to have all the same flaws as the old one.
ongy · a month ago
What issues do you see in git's data model to abandon it as wire format for syncing?
wavemode · a month ago
I don't think Git/GitHub is really that big of a lock-in in practice for most projects.

IMO Git is not an unassailable juggernaut. I think if a new SCM came along with a frontend like GitHub and a VSCode plugin, that alone would be enough for many users to adopt it (barring users who are heavy customers of GitHub Actions). It's just that nobody has decided to do this, since there's no money in it and most people are fine with Git.

charcircuit · a month ago
The wall of getting buy-in for switching your company over to a new SCM is much higher if you need to move everything to a new system and switch all of your developers onto new software at the same time. And if things don't work out, you have to do the same expensive process in reverse. On the other hand, with git compatibility you can start with a small group of developers trying out the tool to see if it is actually beneficial, and work your way up, spreading it through the company. If the new thing turns out not to be good, it is trivial to go back to your old tools, since you did not have to do expensive migrations.
ivanjermakov · a month ago
Doesn't seem like they want a VCS in the traditional sense anyway. More like a collaborative undo history with an emphasis on CRDTs?
yencabulator · 23 days ago
This comes across as a shallow understanding of git, with the author very eager to bash it instead of understanding it.

> On top of that, git is not syntax-aware, so false conflicts are pretty common.

If you have a syntax-aware merge tool, you can tell git to use it. Git does not bundle such for all the languages in the world, or force the user into a specific language (as the author seems to intend to do).

> Fundamentally, git merges are an act of will, they are not deterministic.

They are deterministic, though? It seems the author is confused by the fact that humans can add edits on top to resolve conflicts.

> [Part II] With remote branches different from local branches

They're both just refs.

> staging and stash different from commits, plus the worktree

Jj unifies all these while being "just git" underneath (for most users).

> GET //branch2 switching the branch;

GET with side effects?

gfody · a month ago
I've had this idea too, and I think about it every time I'm on a PR with lots of whitespace/non-functional noise: how nice it would be if source code weren't just text and I could be looking at a cleaner, higher-level diff instead. I think you have to go higher than the AST though; it should at least be language-aware.
gritzko · a month ago
(Author) In my current codebase, I preserve the whitespace nodes. Whitespace changes do not affect the other nodes, though. My first attempt to recover whitespace algorithmically didn't exactly fail; it's more that I was unable to verify it was OK enough. We clang-format or gofmt the entire thing anyway, and whitespace changes are mostly noise, but I have not found a 100% reliable approach yet.
gfody · a month ago
I think about, e.g., the "using" section at the top of a .cs file, where order doesn't matter and it's common for folks to use the "Remove and Sort Usings" feature in VS. If that were modeled as a set, then diffs would consist only of added/removed items, and a re-ordering wouldn't even be representable. And then there's every other manner of refactor that noises up a PR: renaming stuff, moving code around, etc. In my fantasies, some perfect high-level model would separate everything that matters from everything that doesn't, and when viewing PRs or change history we could tick "ignore superficial changes" to cut through all the noise when looking for something specific.

..to my mind such a thing could only be language-specific, and the model for C# is probably something similar to Roslyn's interior (it keeps "trivia" nodes separate but still models the using section as a list for some reason). Having it all in a queryable database would be glorious for change analysis.
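The set idea from the first paragraph is trivially checkable: diff the using block as a set and a pure reorder produces an empty diff (the C# lines here are just example input):

```python
def using_diff(old: str, new: str):
    """Diff a C# using-block as a set: reorders vanish, real changes remain."""
    old_set = {line.strip() for line in old.splitlines() if line.strip()}
    new_set = {line.strip() for line in new.splitlines() if line.strip()}
    return new_set - old_set, old_set - new_set   # (added, removed)

before = "using System;\nusing System.Linq;\n"
reordered = "using System.Linq;\nusing System;\n"
added, removed = using_diff(before, reordered)
print(added, removed)  # both empty: a pure reorder is not a change
```

Modeling the section as a set in the storage layer would make this the only possible diff, rather than a post-hoc cleanup.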

zelphirkalt · a month ago
Some languages are unfortunately whitespace-sensitive, so a generic VCS cannot discard whitespace at all. But maybe the diffing tools themselves could be made language-aware and hide non-meaningful changes.
em-bee · a month ago
hiding non-meaningful changes is not enough. when a block in python changes indentation, i want to see that the block is otherwise unchanged. so indentation changes simply need to be marked differently. if a tool can do that then it will also work with code where indentation is optional, allowing me to cleanly indent code without messing up the diff.

i saw a diff tool that marked only the characters that changed. that would work here.
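A toy version of that marking: compare lines with their indentation stripped, and where the stripped lines match but the raw lines differ, report "indent-only". Python's stdlib `difflib` gets most of the way there:

```python
import difflib

def classify_diff(old_lines, new_lines):
    """Tag lines whose only change is leading whitespace as 'indent-only'."""
    sm = difflib.SequenceMatcher(
        a=[l.lstrip() for l in old_lines],
        b=[l.lstrip() for l in new_lines])
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            # content matched after stripping; check the raw lines
            for i, j in zip(range(i1, i2), range(j1, j2)):
                kind = "indent-only" if old_lines[i] != new_lines[j] else "same"
                out.append((kind, new_lines[j]))
        else:
            out.extend(("changed", l) for l in new_lines[j1:j2])
    return out

old = ["if x:", "do()", "done()"]
new = ["if x:", "    do()", "done()"]
result = classify_diff(old, new)
print(result)  # the do() line is flagged indent-only, not changed
```

A real tool would also need to pair up lines across insertions/deletions, but this shows the marking is cheap once the comparison ignores leading whitespace.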

procaryote · a month ago
You can build a mergetool (https://git-scm.com/docs/git-mergetool)
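Registering one is just configuration; a sketch where the `semanticmerge` command is hypothetical, and `$BASE`/`$LOCAL`/`$REMOTE`/`$MERGED` are placeholders that git substitutes when the tool runs:

```shell
# declare a custom merge tool and make it the default
git config merge.tool semanticmerge
git config mergetool.semanticmerge.cmd \
  'semanticmerge "$BASE" "$LOCAL" "$REMOTE" -o "$MERGED"'
```

After that, `git mergetool` invokes it for any conflicted file.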
WorldMaker · a month ago
In my investigations ages ago [1], I felt the trick was to go lower than an AST. ASTs by nature generally have to be language-aware, and they vary so much from language to language that trying to diff them generically is rough. I didn't solve the language-aware part, but I did have some really good luck using tokenizers intended for syntax highlighting. Because they are intended for syntax highlighting, they are fast, efficient, and generally work well with "malformed"/in-progress work (which is what you want for source control, where saving in-progress steps can be important/useful/necessary).

It still needs to be language-aware to know which token grammar to use, but syntax highlighting as a field has a relatively well defined shared vocabulary of output token types, which lends to some flexibility in changing the language on the fly with somewhat minimal shifts (particularly things like JS to TS where the base grammars share a lot of tokens).

I didn't do much more with it than generate simple character-based diffs that seemed like improvements of comparative line-based diffs, but I got interesting results in my experiments and beat some simple benchmarks in comparing to other character-based diff tools of the time.

(That experiment was done in the context of darcs exploring character-based diffs as a way to improve its CRDT-like source control. I still don't think darcs has the proposed character-based patch type. In theory, I could update the experiment and attempt to use it as a git mergetool, but I don't know if it provides as many benefits as a git mergetool than it might in a patch theory VCS like darcs or pijul.)

[1] https://github.com/WorldMaker/tokdiff
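The tokenizer approach can be approximated with Python's stdlib `tokenize` plus `difflib` (a real syntax-highlighting grammar would be far more forgiving of broken input, as the comment notes):

```python
import io, tokenize, difflib

def tokens(source: str):
    """Flatten source into (type, text) pairs, ignoring layout and comments."""
    skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT,
            tokenize.COMMENT, tokenize.ENCODING, tokenize.ENDMARKER}
    return [(t.type, t.string)
            for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.type not in skip]

old = "x = compute(a,b)\n"
new = "x = compute(a, b)  # same tokens, new layout\n"
sm = difflib.SequenceMatcher(a=tokens(old), b=tokens(new))
print(sm.ratio())  # 1.0: spacing and comment changes disappear from the diff
```

Diffing token streams instead of lines is essentially what tokdiff did, with syntax-highlighting grammars supplying the token types across languages.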

procaryote · a month ago
One fundamental, deal-breaking problem with structure-aware version control is that your VCS now needs to know all the versions of all the languages you're writing in. It gets non-trivial fast.
conartist6 · a month ago
It does! So an extensible parser definition is another key piece of the technological puzzle, along with common formats for parse trees.