This is the future of databases, but nobody seems to realize it yet.
One of the biggest problems with databases (particularly SQL ones) is that they're a giant pile of mutable state. The whole idea of "migrations" exists because it is impossible to "just" revert an arbitrary change to a database, or to diff and merge changes automatically. You need some kind of intelligent tool or framework to generate DDL, DML, and DCL; they have to be applied in order; something has to check whether they've already been applied; and so on. And of course you can't roll back a change once it's been applied, unless you create even more program logic to figure out how to do that. It's all a big hack.
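To make the complaint concrete: this is the kind of bookkeeping every migration framework ends up reinventing (a generic sketch; the table and version names are illustrative):

```sql
-- Ledger table that migration frameworks use to track what has run
CREATE TABLE schema_migrations (
    version    VARCHAR(255) PRIMARY KEY,
    applied_at TIMESTAMP NOT NULL
);

-- Each migration checks this table before running...
SELECT COUNT(*) FROM schema_migrations WHERE version = '20220415_add_age_column';

-- ...applies its DDL if absent, then records itself
ALTER TABLE people ADD COLUMN age INT;
INSERT INTO schema_migrations VALUES ('20220415_add_age_column', NOW());

-- There is no general way to undo this once it has run; a "down"
-- migration is more hand-written program logic, not a true revert.
```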
By treating a database as version-controlled, you can treat any operation as immutable. Make any change you want and don't worry about conflicts. You can always just go back to the last working version, revert a specific change, merge in one or more changes from different working databases. Make a thousand changes a day, and when one breaks, revert it. No snapshotting and slowly restoring the whole database due to a non-reversible change. Somebody dropped the main table in prod? Just revert the drop. Need to make a change to the prod database but the staging database is different? Branch the prod database, make a change, test it, merge back into prod.
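For what it's worth, Dolt's SQL dialect can express that workflow today. A sketch using its documented stored procedures (branch, table, and commit names are made up):

```sql
-- Branch prod, leaving it untouched while we work
CALL DOLT_CHECKOUT('-b', 'raise-widget-prices');

-- Make and commit a change on the branch
UPDATE products SET price = price * 1.1 WHERE category = 'widgets';
CALL DOLT_COMMIT('-a', '-m', 'Raise widget prices 10%');

-- Test it, then merge back into the main branch
CALL DOLT_CHECKOUT('main');
CALL DOLT_MERGE('raise-widget-prices');

-- If the change turns out to be bad, revert it by commit hash
CALL DOLT_REVERT('bad-commit-hash');
```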
The effect is going to be as radical as the popularization of containers. Whether you like them or not, they are revolutionizing an industry and are a productivity force multiplier.
This is how relational databases have commonly worked since at least the 1990s; it is called multi-version concurrency control (MVCC). Welcome to the future: it is called PostgreSQL. There are at least two reasons no sensible database designer would allow users to operate a database this way, even though databases are technically capable of it.
First, keeping every version of every piece of data forever is an excellent way to consume non-intuitively vast amounts of storage, even if your data model is tiny. Every time databases have offered this feature, it has immediately caused a rash of "out of storage" errors that force the user to manually and permanently delete large numbers of old versions. This is extremely user-unfriendly, so the feature is almost immediately removed in subsequent versions, because the pain it causes far outweighs the benefits even when it is used carefully. In typical MVCC systems, old versions are aggressively garbage collected automatically to limit out-of-storage errors.
Second, finding or reconstructing an arbitrary number of old versions of data is unavoidably expensive. Much of the architectural difference between MVCC implementations comes from trying to manage the rather severe performance tradeoffs of maintaining multiple versions of data and navigating to the version you need, with the understanding that all of these versions live on storage and rarely in a single place. There is no optimal way to do it, and keeping version chains short is critical for good performance.
There is very deep literature around MVCC-style databases. The challenges of generalizing and maximally exploiting MVCC as a user feature, while keeping performance from degrading to the point of unusability, are thoroughly documented.
MVCC is not version control, and time travel / historical querying is not version control.
Dolt's unique functionality isn't time travel, although it has that. It's version control, i.e. branch and merge, push and pull, fork and clone. A bunch of database products give you some of this for schema migrations, but Dolt is the only one that does it for table data as well.
> This is the future of databases, but nobody seems to realize it yet
It's a pipe dream, not the future.
Either your database is too big or has too much throughput for this to work, or it's small enough that migrations just don't matter.
And it's not like you wouldn't need migrations with a versioned schema, as otherwise a rollback would mean data loss.
Consider a CMS, one of the most common forms of database-backed applications. What if you could give your customer a "dev" branch of all their data to make their changes on and test out new content, which you could then merge back to prod after somebody reviews it in a standard PR workflow?
This is the workflow one of our earliest customers built. They run network configuration software, and they use Dolt to implement a PR workflow for all changes their customers make. More details here:
https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
You're thinking in terms of the present, but I'm saying it's the future. At present it doesn't make sense, because nobody has yet made a database which does version control on very big datasets with a lot of throughput. But when somebody does make it, it will be completely obvious that this was something we always needed.
It's all just software. There is essentially no limit to what we can make software do as long as the hardware supports it. And there's no hardware limit I know of that says version-controlled databases can't work. We just need to figure out how they will work, and then make 'em (or try to make 'em and in the process figure it out).
> And it's not like you wouldn't need migrations with a versioned schema, as otherwise a rollback would mean data loss.
When you roll back a code change, you don't lose code, as it's still in history. If you need to revert but keep some code, you branch the code, copy the needed code into some other part of the app, revert the old change, merge everything. If on merge there is a conflict (let's presume losing data is a conflict), it can prompt you to issue a set of commands to resolve the conflict before merge. You could do all of that in a branch, test it, merge into prod, and on merge it could perform the same operations. The database does all the heavy lifting and the user just uses a console the way they use Git today.
It's probably going to be necessary to lock the version of the software and the version of the database together, such that both are changed/reverted at the same time. But because this is version control, we could actually serve multiple versions of the same database at the same time. You could have the database present two different versions of itself, with the same data COW-overlaid for each version, and two different versions of an application. You could then blue/green deploy both the application and the database, each modifying only its own version. If you need to revert, you can diff and merge changes from one version to the other.
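Dolt gestures at this already: a client can address a branch directly, so blue and green deployments can each point at their own branch of the same database. A sketch (branch and table names made up):

```sql
-- Blue deployment reads the current branch
USE `mydb/main`;
SELECT COUNT(*) FROM orders;

-- Green deployment reads its own branch of the same data
USE `mydb/v2`;
SELECT COUNT(*) FROM orders;

-- To cut over (or revert), diff the branches and merge one into the other
USE `mydb/main`;
SELECT * FROM dolt_diff('main', 'v2', 'orders');
CALL DOLT_MERGE('v2');
```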
> I think the problem is relational datasets like that don't fit well into that model.
In reality, it's very possible. Look at Datomic. And while I agree that storage is a concern for high-throughput systems, the general trend seems to be toward streaming data and retention policies anyway.
With big data, such a model is even more promising. One of the big problems is that people keep copying data sets, which does not scale. Just syncing newer versions is much more efficient; look at delta.io.
In a similar vein, I am trying to work on a tool for a volunteer group, and one of the problems I'm trying to figure out how to model is not just tracking when the data changed, but why it changed.
Sometimes when you're first entering data you just get it wrong. You said the wrench was in storeroom A1 or the customer lives on 3rd St or the tree was five feet from the sidewalk. If the state of the asset changes due to human action, that's a new thing. But if you opened the wrong list taking inventory, or you missed a keystroke for 33rd St or you just eyeballed the coordinates, then that row was simply wrong, and the correction should be back-dated for most things.
But if I emptied out A1 because we aren't renting that space anymore, the customer moves or the tree got replanted, then it was here and now it's over there. Which might be important for figuring out things like overstock, taxes or success rates.
Similarly, if a migration introduces derived data, then the migrated data is assumed/inferred, whereas data we took from a user is confirmed, which might introduce subtle differences in how best to relate to the user. A mediocre business could easily ignore the distinction, but a quality establishment's customers might be ill-pleased by, say, a request to re-confirm information they already provided.
Lots of databases offer time travel / historical querying, including Datomic, MySQL, Postgres, etc. (plugins required in some cases).
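For example, system-versioned (temporal) tables in SQL Server or MariaDB make historical reads a standard SQL:2011 query (table and column names are illustrative):

```sql
-- Read the table as it existed at a point in time
SELECT *
FROM employees
FOR SYSTEM_TIME AS OF '2021-06-01 00:00:00'
WHERE department = 'eng';
```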
Dolt's unique functionality isn't time travel, although it has that. It's version control, i.e. branch and merge, push and pull, fork and clone. A bunch of database products give you some of this for schema migrations, but Dolt is the only one that does it for table data as well.
What are the tradeoffs here? When wouldn't I want to use this?
It's slower. This is `sysbench` Dolt vs MySQL:
https://docs.dolthub.com/sql-reference/benchmarks/latency
We've dedicated this year to performance with a storage engine rewrite. We'll have some performance wins coming in the back half of the year. We think we can get under 2X MySQL.
It also requires more disk. Each change averages at least 4KB on disk, so you might need more/bigger hard drives.
In the modern world of SaaS and connectivity, you use multiple "databases" and the like (queues, identity providers) anyway, and the world has gotten connected enough that you don't need a local DB most of the time.
You do realize that's how MVCC works, right? Turn off the GC process that collects old versions on a production DB and see what happens. Reverting changes out of order is not possible in many cases.
CALL DOLT_REVERT('bad-commit-hash')
Works just like git revert: creates an inverse of the changes in the given commit and applies it as a patch.
One issue with immutable data stores for businesses is compliance with things like GDPR. You need some mechanism for scrubbing data and its version history, effectively making it non-immutable.
More discussion of the immutable DB product space:
https://www.dolthub.com/blog/2022-03-21-immutable-database/
I wish we could use this at work. We're trying to predict time-series stuff, and there's a lot of infrastructure complexity dedicated to ensuring that when we're training on data from years ago, we're not using data that would have been in the future at that point (future data leaking into the past).
Using Dolt, as far as I understand it, we could simply set the DB to a point in the past where the 'future' data wasn't available. Very cool
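As I read Dolt's docs, that is just an `AS OF` query, which accepts a commit hash, branch, or timestamp expression. A sketch (table name and hash are made up):

```sql
-- Query the table as of the training cutoff, so no 'future'
-- rows can leak into the training set
SELECT * FROM measurements AS OF TIMESTAMP('2019-01-01');

-- Or pin to an exact commit for reproducible training runs
SELECT * FROM measurements AS OF 'abcdef0123456789';  -- hypothetical hash
```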
For a lot of types of data, auditing requires history.
Have a look at MSSQL's history tables, for example: https://docs.microsoft.com/en-us/sql/relational-databases/ta...
Basically my research project[1] I'm working on in my spare time is all about versioning and efficiently storing small-sized revisions of the data, as well as allowing sophisticated time travel queries for audits and analysis.
Of course all secondary user-defined, typed indexes are also versioned.
Basically the technical idea is to map a huge tree of index tries (with revisions as indexed leaf pages at the top level, and a document index as well as secondary indexes on the second level) to an append-only file. To reduce write amplification and the size of each snapshot, data pages are first compressed and then versioned through a sliding snapshot algorithm. Thus, Sirix does not simply do a copy-on-write per page. Instead it writes the nodes that changed in the current revision, plus the nodes that fall out of the sliding window (which is why it needs a fast random-read drive).
[1] https://github.com/sirixdb/sirix
Have you looked at dvc (www.dvc.org)? It takes a little bit to figure out how you want to handle the backing store (usually S3), but then it's very straightforward. You could do a similar pattern: have a data repository and simply move the git HEAD to the desired spot, and dvc automatically adds/removes the data files based on what's in the commit. You can even version binaries without blowing up your .git tree.
DVC is great for tracking locally-stored data and artifacts generated in the course of a research project, and for sharing those artifacts across a team of collaborators (and/or future users).
However DVC is fundamentally limited because you can only have dependencies and outputs that are files on the filesystem. Theoretically they could start supporting pluggable non-file-but-file-like artifacts, but for now it's just a feature request and I don't know if it's on their roadmap at all.
This is fine, of course, but it kind of sucks when your data is "big"-ish and you can't or don't want to keep it on your local machine, e.g. generating intermediate datasets that live in some kind of "scratch" workspace within your data lake/warehouse. You can use DBT for that in some cases, but that's not really what DBT is for, and then you have two incompatible workflow graphs within your project and a whole other set of CLI touch points and program semantics to learn.
The universal solution is something like Airflow, but it's way too verbose for use during a research project, and running it is way too complicated. It's an industrial-strength data engineering tool, not a research workflow-and-artifact-tracking tool.
I think my ideal tool would be "DVC, but pluggable/extensible with an Airflow-like API."
I'm looking into DVC right now, and I feel like the code history (in git) and the data history are too intertwined. If you move the git HEAD back, then you get the old data back, but you also get the old code back. I wish there was a way to move the two "heads" independently. Or is there?
Edit: I can always revert the contents of the .dvc folder to a previous commit, but I wonder if there's a more natural way of doing it.
I've (partially / POC) implemented time travel in a SQLite database; the TL;DR is that whenever you create a table, you add a second, identical or nearly-identical table with a `_history` suffix; the history table has a valid from and valid to. Then you add a trigger on the primary table that, on update or on delete, makes a copy of the old values into the history table, setting the 'valid_to' column to the current timestamp.
The reason I used a separate table is so that you don't have to compromise or complicate the primary table's constraints, indices and foreign keys; the history table doesn't really need those because it's not responsible for data integrity.
Anyway, once you have that, you can run queries with a `where $timestamp is between start_date and end_date` condition, which will also allow you to join many tables at a certain point in time. To also be able to get the latest version, you can use a union query (iirc).
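A minimal sketch of that scheme in SQLite (names are mine; it also leans on SQLite's default of non-recursive triggers, so the bookkeeping UPDATE doesn't re-fire the trigger):

```sql
-- Primary table keeps full constraints, plus a bookkeeping timestamp
CREATE TABLE item (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      REAL,
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- History table: same shape plus a validity window; no constraints,
-- since it isn't responsible for data integrity
CREATE TABLE item_history (
    id         INTEGER,
    name       TEXT,
    price      REAL,
    valid_from TEXT NOT NULL,
    valid_to   TEXT NOT NULL
);

-- On update or delete, copy the old values into history
CREATE TRIGGER item_au AFTER UPDATE ON item
BEGIN
    INSERT INTO item_history
    VALUES (OLD.id, OLD.name, OLD.price, OLD.updated_at, datetime('now'));
    UPDATE item SET updated_at = datetime('now') WHERE id = NEW.id;
END;

CREATE TRIGGER item_ad AFTER DELETE ON item
BEGIN
    INSERT INTO item_history
    VALUES (OLD.id, OLD.name, OLD.price, OLD.updated_at, datetime('now'));
END;

-- Rows as of time :ts -- history rows whose window covers :ts,
-- unioned with current rows that already existed at :ts
SELECT id, name, price FROM item_history
WHERE valid_from <= :ts AND :ts < valid_to
UNION ALL
SELECT id, name, price FROM item
WHERE updated_at <= :ts;
```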
I'm sure there are a lot of caveats there though. What I should do is take some time over the weekend to work on that POC and publish it for posterity / blog post fuel.
- Dolt is Git for Data: a SQL database that you can fork, clone, branch, merge (2021) - 177 comments https://news.ycombinator.com/item?id=26370572
- Dolt is Git for data (2020) - 191 comments https://news.ycombinator.com/item?id=22731928
I have been interested in this space, but have failed to understand how these versioning solutions for data work in the context of environments. Some aspects of time travel line up better with data modelling approaches (such as bitemporal DBs, XTDB, etc.), others with git-like use cases (e.g. schema evolution, backfilling), and some with combinations of the two. The challenge is that, with data, I don't see how you'd want to have all environments in the same place/repo, and there may be additional considerations coupled with the directionality of moves, such as anonymisation when moving from prod to non-prod, or backfilling when moving from non-prod to prod. Keen to read more about other people's experiences in this space and how they might be combining different solutions.
These tools aren't really meant for developers. They are meant for researchers, analysts, and other "offline" users and managers of data sources. Data science research workflows generally don't need the same "dev/test/prod" kind of environment setup.
I won't speak for other data versioning products, but Dolt is definitely for developers. Our customers are software engineers writing database applications that need version control features.
I've only used Dolt once, but it was very helpful when I was working in an ML class on a group project. Previously we would commit gigantic CSVs to git, which sucked. Putting it in Dolt made a lot of the data exploration and sharing easier, and separated our data from the code.
What kind of merge actually happens, e.g. how does it deal with conflicts? Does it merge at the SQL command level or at the changed tuple level?
If I have
```
name    job  age
remram  dev  32
```
and I concurrently do
```sql
UPDATE table SET age = age + 2 WHERE job = 'dev';
UPDATE table SET job = 'mgr' WHERE age = 32;
```
Do I get a conflict? Do I get something consistent with the two transactions serialized, e.g. (dev, 34) or (mgr, 32)? Can I get something that no ordering of the transactions could have given me, like (mgr, 34)?
Merge is cell-wise. If the two heads being merged edited disjoint cells, there's no conflict; they merge together. If they touched the same cell, or if one branch deleted a row that another edited, then you get a conflict.
Merge conflicts are stored in a special system table so you can decide what to do with them, kind of analogous to conflict markers in a source file. More details here:
https://docs.dolthub.com/sql-reference/version-control/merge...
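Concretely, a conflicted merge looks something like this (a sketch; the `dolt_conflicts` tables and resolve procedure are per the docs, the table name is made up):

```sql
-- A merge where both branches touched the same cell
CALL DOLT_MERGE('other-branch');

-- Which tables have conflicts, and how many
SELECT * FROM dolt_conflicts;

-- Inspect the conflicting rows for one table: base, ours, theirs
SELECT * FROM dolt_conflicts_mytable;

-- Resolve (here: keep our side) and commit the merge
CALL DOLT_CONFLICTS_RESOLVE('--ours', 'mytable');
CALL DOLT_COMMIT('-a', '-m', 'Merge other-branch, conflicts resolved');
```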
The situation you're talking about with two transactions isn't a merge operation, it's just normal database transaction isolation level stuff. Dolt supports REPEATABLE_READ right now, with others coming in future releases. So in the example above, whichever transaction committed last would fail and get rolled back (assuming they touched the same rows).
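Spelled out against the rows above (a sketch of the optimistic behavior described; `people` stands in for the table):

```sql
-- Session A
START TRANSACTION;
UPDATE people SET age = age + 2 WHERE job = 'dev';

-- Session B, concurrently
START TRANSACTION;
UPDATE people SET job = 'mgr' WHERE age = 32;

-- Session A commits first and wins
COMMIT;

-- Session B commits second; it touched the same row, so its
-- commit fails and the transaction is rolled back
COMMIT;
```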
> The situation you're talking about with two transactions isn't a merge operation, it's just normal database transaction isolation level stuff
I mean, arguably. It's not like there is a standard definition for "merge operation" on data. Even Git tries to do more than line-level changes, taking context into account, and turning one line change into two line changes if merging across a file copy for example.
Dolt markets itself as "a version controlled SQL database", so I think it is perfectly reasonable to consider the standard that already exists for concurrent changes to a SQL database, and that's transaction isolation.
I guess anything more complex than this would be pretty unwieldy though, with probably little benefit. I am struggling to come up with a good example of the kind of anomaly I imagine.
We are a fork of Noms (https://github.com/attic-labs/noms). Lots of incremental changes, and we're in the process of a major storage engine overhaul (what we use Noms for) for performance as we speak.
Given how many databases already implement MVCC, some form of versioning is already baked into DB systems. It seems obvious in hindsight that this could be reified into its own feature instead of a book-keeping detail. Cool to see a project actually tackle that challenge.
You can define a "virtual" table (schema, how to retrieve rows/columns) and then a MySQL client can connect and execute arbitrary queries on your table (which could just be an API or other source)