This approach works by storing the actual SQLite binary files in Git and then using a custom "diff" configuration to dump each file as SQL and compare the result.
It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.
I built my own tooling to solve this problem: https://datasette.io/tools/sqlite-diffable - which outputs a “diffable” copy of the data in a SQLite database, precisely so you can store it in Git and look at the differences later.
I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-diffable).
I don't think there will be any huge difference between a text dump and binary formats here; Git doesn't really care about the underlying data when it stores and packs it.
Maybe a text format that is sorted before being saved would compress a lot better though, both with zlib compression and Git's delta compression. You can't really sort a binary file and put it back together!
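A quick sketch of that idea with the sqlite3 CLI (file and table names are made up): dumping rows with an explicit ORDER BY makes the text deterministic, so between revisions only the lines that actually changed differ:

```shell
# Row order inside the binary file can shift after a VACUUM or rebuild,
# but a dump sorted by primary key only changes where the data changed.
rm -f sorted-demo.db
sqlite3 sorted-demo.db "CREATE TABLE t(id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO t VALUES (2,'b'),(1,'a'),(3,'c');"
sqlite3 sorted-demo.db "SELECT id, name FROM t ORDER BY id;" > t.txt
cat t.txt
```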
I'm wondering more about whether or not this SQLite DB is actually a "source file", that is, something that is not generated by something else, which is normally the base material for both make and Git.
Obviously a SQLite DB cannot be created ex nihilo, so it is not a source file. Isn't it a bit dangerous to store the database file in the repo, and not the means to generate it? If both are included in the repo, it then seems redundant (reproducible-builds concerns, maybe?).
Seeing that many people are interested in this topic, I wonder in which cases they would nevertheless want to track that file in Git.
> I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-diffable).
I'm curious, what is the reason you chose not to use pg_dump, but instead opted to convert to SQLite and then dump the DB using sqlite-diffable?
On a project I'm working on, I'd like to dump our Postgres schema into individual files for each object (i.e., one file per table, function, stored proc, etc.), but I haven't spent enough time to see whether pg_dump can actually do that. We're just outputting files by object type for now (one file each for tables, functions, and stored procs). It looks like sqlite-diffable does something similar to what we're looking for (separate files per object).
Mainly it's because all of my other projects are SQLite and this felt like a good opportunity to try out some of my own tooling against the main PostgreSQL database I still use for an active project.
pg_dump can do that if you go with the directory-format output. According to the docs, it will "create a directory with one file for each table and large object being dumped, plus a so-called Table of Contents file describing the dumped objects in a machine-readable format that pg_restore can read."
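For reference, the directory-format invocation would look like this ("mydb" and "backup_dir" are placeholders, and it needs a live server, so treat it as an untested sketch):

```shell
# -Fd selects the directory format: one compressed data file per table
# plus a machine-readable toc.dat table of contents
pg_dump -Fd --file=backup_dir mydb

# restore later with pg_restore, which reads that directory:
pg_restore --dbname=mydb backup_dir
```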
> It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.
Why is that? I would have expected git's delta compression to work well on sqlite files, since small table changes usually cause only a small number of page changes.
I would think so too. As a quick test I added chinook.db to a fresh repository, committed it, added a row to a table, committed again, and then did a git gc.
The resulting pack file, with two revisions of the database, is 329k. The raw database is 864k and gzips to 328k.
So it does look like git's delta encoding works with sqlite's blocks.
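That experiment is easy to reproduce (assumes git and the sqlite3 CLI are installed; exact sizes will differ from the chinook numbers above):

```shell
set -e
rm -rf delta-demo && git init -q delta-demo && cd delta-demo
# build a database with a few thousand rows
sqlite3 test.db "CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT);
WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i+1 FROM n WHERE i < 5000)
INSERT INTO t SELECT i, hex(randomblob(20)) FROM n;"
git add test.db && git -c user.name=t -c user.email=t@t commit -qm v1
# a small change: one extra row
sqlite3 test.db "INSERT INTO t(v) VALUES (hex(randomblob(20)));"
git add test.db && git -c user.name=t -c user.email=t@t commit -qm v2
git gc --quiet
# one pack now holds both revisions, delta-compressed:
du -h .git/objects/pack/*.pack
```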
Four years ago I wrote an internal tool to serialize records of our hierarchical database to a bunch of hierarchical JSON files, which are then managed in a normal git repo. Each record is serialized to JSON; "root" entities in the hierarchy are serialized so as to contain all their child entities. This is done as the user works: the database is updated, and then the correct JSON file is read, updated, and written back. The current SHA is always updated in the database, so there's a way to recognize when users worked on the repo with git outside our app; we can detect that and automatically update the database based on the diff. There's also a sync in the opposite direction.
This was/is a very specific solution to our very specific set of problems. So not applicable to the general problem of "versioning a database".
In fact users now complain that this duality of database <-> repo is confusing and we should ditch the database. Our users are unfortunately not very technical and sometimes require a lot of hand holding.
In theory they're correct but in practice this would mean ditching our ORM and rewriting one from scratch that will work on top of the json files. While keeping the same API outwards. Which is potentially a lot of work. Maybe that will happen, we haven't decided yet.
From time to time I search for a database engine for small databases that stores the data as diffable text, and just provides up-to-date indexing and SQL on top of that to make operations convenient and fast.
I didn't find any. I keep encountering BerkeleyDB, but I don't know how good it is for that purpose.
For the SQL part, but maybe a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (No longer maintained, but has links to lots of other projects)
HSQLDB does that, but 1. ew, Java, and 2. there are reasons people don't do that.
The major problem with text formats is that it means either you have to rewrite the whole file on every change ... or else you have to append arbitrarily and be able to survive that (if you're committing to git you probably are capable of dealing with the storage, at least).
If the former is feasible for you (because incremental updates are rare), just use your favorite database's load/dump around every session, and do some fsync testing (or just keep full backups).
If the latter compromise is acceptable there might be something you can do with streaming logical replication on some DBs, but again there's a reason people generally don't.
Have you considered Fossil (link below)? It is an SCM tool built on top of SQLite.
I have not used it myself, but I would assume it stores file revisions as diffs, and since it's all just SQLite you should be able to run arbitrary queries on it.
But the key disadvantage is that getting started with the repository now requires more effort: you must have sqlite3 installed locally, and you must apply certain config values manually ("remember to run this helper script first" is still a manual step).
(This is for security reasons—you don’t want commands like clone and diff to be able to execute arbitrary code without your knowledge. Hence the attributes/config split.)
Otherwise, you’ll find that your .sqlite file is SQL instead of a SQLite database, and if you try running stuff while it’s in that condition you’ll get “file is not a database” errors from SQLite.
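Concretely, the per-clone setup looks something like this (the helper script and file names are illustrative, not from the article):

```shell
set -e
rm -rf diff-demo && git init -q diff-demo && cd diff-demo
# tiny textconv helper; git invokes it with the file path as $1
printf '#!/bin/sh\nexec sqlite3 "$1" .dump\n' > sqlite3-textconv
chmod +x sqlite3-textconv
echo 'data.sqlite3 diff=sqlite3' > .gitattributes        # shared, committed
git config diff.sqlite3.textconv ./sqlite3-textconv      # local, per clone
sqlite3 data.sqlite3 "CREATE TABLE t(x); INSERT INTO t VALUES ('old');"
git add -A && git -c user.name=t -c user.email=t@t commit -qm v1
sqlite3 data.sqlite3 "UPDATE t SET x = 'new';"
git diff        # prints SQL-level changes instead of "binary files differ"
```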
You are right, there is a bootstrap cost and potential security implications. In general, also, .gitattributes is pretty evil, but:
1. don't store binary files in Git
2. don't run programs that keep a git-tracked file open for too long, otherwise clowntown. For example, you run git pull while GnuCash keeps the file open, and then you lose all your updates when you save (also a problem with the code below, but fixable).
The correct solution to this is running the "smudge" and "clean" operations as a backup and restore operation, outside the repo, when starting and closing GnuCash, in this case:
You have to be careful with this, as using .dump as your filter will not necessarily get you an equivalent database on restore. For example, .dump will not include database metadata such as application_id¹ or user_version² (perhaps other things too, but these are the ones that have bitten me before).
The custom diff driver in the OP will not reflect changes to those values either obviously, but at least it won't lose them in the actual database file stored in the git repository.
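This is easy to check with the sqlite3 CLI:

```shell
rm -f meta.db meta.sql
sqlite3 meta.db "PRAGMA user_version = 7; CREATE TABLE t(x);"
sqlite3 meta.db .dump > meta.sql
# the value lives in the database header but never appears in the dump:
sqlite3 meta.db "PRAGMA user_version;"                  # prints 7
grep -q user_version meta.sql || echo "not in the dump"
```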
Is there no capability in SQLite to dump these things as well? Seems like a pretty big thing missing if that's the case, a thing I would expect SQLite to support pretty well.
I think the bias towards text files in Git is only a reflection of Git's defaults being set up for text files, such as the diff tool and the file-type detection. If you add a custom file type and configure Git not to treat it as text, including git attributes that prevent it from rewriting things like newline characters, then Git works just as well.
Is there a web interface like GitHub that supports this? I wanted to use it with GitHub Enterprise a few years ago and found that it meant I couldn't see diffs properly in GitHub.
Maintaining a big, binary blob will make the git history grow in proportion to the size of the binary blob, not the size of the logical/abstract diff, regardless of how comprehensible the diff looks.
To minimize the disk space to only the changes, you want to store the changes and not the complete binary before/after images. By filtering, you reduce the stored object to the difference.
You won't be storing the sqlite file in git, so you won't be storing indexes or un-vacuumed data in git. You also won't be storing a database with some of its rows locked, or anything like that.
Git will be able to compress the text better, and compress between versions better.
Merging and resolving conflicts would have a chance of working.
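A sketch of what the clean/smudge pair for such a filter could look like (the helper names are mine; because sqlite3 needs a seekable file rather than a pipe, both sides stage stdin through a temp file):

```shell
# clean: binary database on stdin -> SQL text on stdout
cat > clean-sqlite3 <<'EOF'
#!/bin/sh
t=$(mktemp); cat > "$t"; sqlite3 "$t" .dump; rm -f "$t"
EOF
# smudge: SQL text on stdin -> binary database on stdout
cat > smudge-sqlite3 <<'EOF'
#!/bin/sh
t=$(mktemp); sqlite3 "$t"; cat "$t"; rm -f "$t"
EOF
chmod +x clean-sqlite3 smudge-sqlite3
# wiring inside a repo would then be:
#   echo '*.sqlite3 filter=sqlite3' >> .gitattributes
#   git config filter.sqlite3.clean  "$PWD/clean-sqlite3"
#   git config filter.sqlite3.smudge "$PWD/smudge-sqlite3"
```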
When you hijack someone else’s thread with something unrelated, I find it impossible to trust you enough to get excited about what you did. Smiling faces aside, this is kind of disrespectful to the author.
Whereas with a show HN, I would focus on you and do my best to find something positive to say.
I felt it was related (database + git is quite a narrow scope), and not concurrent to the post (I did not comment on the DoltgreSQL post for instance). But, I admit I was looking for some personal gratification. Sorry it bothered you.
I used to do this trick a couple of years ago to diff the credentials file in Rails codebases, which is encrypted, to keep track of changes to passwords and tokens.
Worth keeping in mind that the text representation is virtual, and thus the command is run on every operation like a git-diff or git-show. Therefore, if it takes a while to convert to text, it will also slow down Git and other tools depending on it.
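If the conversion is slow, Git can cache textconv output per blob so repeated diffs reuse it; `cachetextconv` is a standard gitattributes feature (the driver name here matches the sqlite3 examples above):

```shell
rm -rf cache-demo && git init -q cache-demo && cd cache-demo
# cache each blob's textconv result (stored via git notes) for reuse
git config diff.sqlite3.cachetextconv true
```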
We do the exact same thing to keep track of some credentials. We use sops[1] and AWS KMS to separate credentials by sensitivity, then use the git differ to view the diffs between the encrypted secrets.
Definitely not best practice security-wise, but it works well
This is an excellent and very clever trick; however, what it does is store the SQL database in binary format and use the dump only for the diff.
Would it not be simpler to only store the text dump each time? (The dump can be used trivially to rebuild the DB should the need arise, and it's human-readable.)
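For what it's worth, the rebuild really is trivial (sqlite3 CLI, illustrative file names):

```shell
rm -f orig.db dump.sql rebuilt.db
sqlite3 orig.db "CREATE TABLE t(x); INSERT INTO t VALUES (1);"
sqlite3 orig.db .dump > dump.sql     # the text artifact you'd commit
rm orig.db
sqlite3 rebuilt.db < dump.sql        # rebuild on demand
sqlite3 rebuilt.db "SELECT x FROM t;"
```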
So maybe a pre-commit or pre-staging hook would be even better for this use case?
It kinda depends on performance, a large database may take a while to rebuild depending on indices and the like. But then, the same can be said for the output of this dump command (and diffing the result). It's probably manageable for a gnucash database (1000 - 100000 record order of magnitude if I were to guess), but for anything larger the dump command will take ages.
That takes two files as input; don't you have to check out, copy, etc. manually?
Whereas (IIUC) in the article git runs a text dump on both dbs automatically during git diff, which is an insane UX improvement. (Given that you’re like me and didn’t know about this).
It's rather annoying that part of the config is stored in your local checked out copy of the repository (.git/config) and part of it is in a configuration file that is checked into version control (.gitattributes).
That's because part of it is for you personally (the bit that dictates exactly which program gets run to render the diff), and part of it is for everybody (the bit that specifies the diff-related file type).
Sure, but unless I'm missing something, adding `sqlite3` into my `.gitattributes` is useless without manually doing extra work to specify the config. I get that it would be nice to be able to override this locally, but having no way to specify a default way to make this work feels broken.
That’s not a usual workaround. That’s a crippled technique that doesn’t scale beyond single-user, single-branch usage, because .git/config includes things like the remotes and branches.
Here’s an example diff from the backup repo mentioned above: https://github.com/simonw/simonwillisonblog-backup/commit/72...
> storing binary files like that in Git isn't as space efficient as using a plain text format

Why? Because git stores deltas? At first glance I'd expect a binary file to be radically more compact than the plain text equivalent.
I wrote much more about the JSON-serialization tooling described above here: https://news.ycombinator.com/item?id=25005993
https://www2.fossil-scm.org/home/doc/trunk/www/index.wiki
The correct solution to this is running the "smudge" and "clean" operations as a backup and restore operation, outside the repo, when starting and closing GnuCash, in this case:
```
#!/bin/bash
# Rebuild the working database from the SQL dump tracked in git
rm -f "$1.sqlite3"
sqlite3 "$1.sqlite3" < "$1.sql"
gnucash "$1.sqlite3"
# Dump the (possibly modified) database back to diffable SQL
sqlite3 "$1.sqlite3" .dump > "$1.sql"
# optionally:
# git add "$1.sql"
# git commit -m "Update $(date) $1.sql"
```
¹ https://sqlite.org/pragma.html#pragma_application_id
² https://sqlite.org/pragma.html#pragma_user_version
It's better to store text in git IMHO.
https://git-scm.com/docs/gitattributes
https://app.data-drift.io/41231518/samox/local-datadrift-rep...
Not sure how it could be plugged into this, but sharing anyway :D
[1] https://github.com/getsops/sops
I don't have experience with truly huge DBs in the gigabytes or terabytes, but up to, say, 250 MB the dump feels very fast.