If anyone from the dolt team is reading this, I'd like to make an enquiry:
At bugout.dev, we have an ongoing crawl of public GitHub. We just created a dataset of code snippets crawled from popular GitHub repositories, listed by language, license, GitHub repo, and commit hash, and are looking to release it publicly and keep it up-to-date with our GitHub crawl.
The dataset for a single crawl comes in at about 60GB. We uploaded the data to Kaggle because we thought it would be a good place for people to work with the data. Unfortunately, the Kaggle notebook experience is not tailored to such large datasets. Our dataset is in a SQLite database. It takes a long time for the dataset to load into Kaggle notebooks, and I don't think they are provisioned with SSDs as queries take a long time. Our best workaround to this is to partition into 3 datasets on Kaggle - train, eval, and development, but it will be a pain to manage this for every update, especially as we enrich the dataset with results of static analysis, etc.
I'd like to explore hosting the public dataset on Dolthub. If this sounds interesting to you, please reach out to me - email is in my HN profile.
Yeah, that sized database is likely to be a challenge unless the computer system it's running on has scads of memory.
One of my projects (DBHub.io) is putting effort towards working through the problem with larger SQLite databases (~10GB), mainly by using bare metal hosts with lots of memory, e.g. 64GB, 128GB, etc.
Putting the same data into PostgreSQL, or even MySQL, would likely be much more memory-efficient. :)
You have other options too. If I have time, I can try to reduce the size with a columnar format that is designed for this use case (repeated values, static dataset).
That would be really great. Let me know if there's any way we can help. Maybe if we released a small version of the dataset for testing/benchmarking and then I could take care of running the final processing on the full dataset?
What a lot of webmasters want is to test the site locally, then merge it back. A lot of people turned to Jekyll or Hugo for the very reason that the site can be checked into git, and git is reliable. A static website can’t get hacked, and anyone who has been burnt by a WordPress security failure knows they’d prefer a static site.
And even more: people would like to pass the new website from the designer to the customer to managers. WordPress might not have needed to develop its approval workflows (permission schemes, draft/preview/publish) if it had had a forkable database.
At the root of this lies the problem that content, configuration, and code are stored in one blob (a single DB), never clearly bounded nor contained. Config is spread over tables. User-generated content (from comments to orders) is mixed up with editorial content: often in the same tables, sometimes even in the same column. Drupal suffers the same.
What is needed is a clear line between configuration (and logic), editorial content, and user-generated content. Those three have very distinct lifecycles. As long as they are treated as one, those lifecycles will conflict. No matter how good your database merging is: the root is a missing piece of architecture.
Very well summarised. I can see clearly you too have tried to apply normal software lifecycle principles and come out the other end with a rather dismal impression of WordPress.
Sure, but WordPress is still running PHP files for every page load and back-end query. Dolt would help offload some of the code’s complexity, but that would still leave a significant attack surface.
In other words, by itself Dolt couldn’t solve the problem of WordPress not being run mostly from static assets on a server (plus an API).
The parent post here is speaking to the content stored in the database, not the templates on the filesystem. Dolt enables one to merge, push and pull data, just as one would with files and git.
thanks for the mention of TerminusDB (https://github.com/terminusdb/terminusdb) -> we are the graph cousins of Dolt. Great to see so much energy in the version control database world!
We are currently focused on data mesh use cases. Rather than trying to be GitHub for Data, we're trying to be YOUR GitHub for Data. Get all that good git lineage, pull, push, clone etc. and have data producers in your org own their data.
We see lots of organizations with big 'shadow data' problems and data being centrally managed rather than curated by domain experts.
Dolt might be good, but never underestimate the power of Type 2 Slowly Changing Dimension tables [1]. For example, if you had an SSIS package that took CSVs and imported them into a database, and one day you noticed it accidentally rounded a value incorrectly, you could fix the data and retain traceability of the data which was there originally.
I definitely agree, just tossing in the superset concept that Dolt and Type 2 SCD involve - temporal databases [1].
I think the idea of a "diff" applied to datasets is quite awesome, but even then, we kind of do that with databases today with data comparison tools - it's just most of them are not time aware, rather they are used to compare data between two instances of the data in different databases, not at two points in time in the same database.
The commit log in Dolt is edit history. (When did someone import or edit the data? Who made the change?) It's not about when things happened.
To keep track of when things happened, you would still need date columns to handle it. But at least you don't need to handle two-dimensional history for auditing purposes. So, in your example, I think the "effective" date columns wouldn't be needed.
They have ways to query how the dataset appeared at some time in the past. However, with both data changes and schema changes being mixed together in a single commit log, I could see this being troublesome.
I suppose writing code to convert old edit history to the new schema would still be possible, similar to how git allows you to create a new branch by rewriting an existing one.
However, I'm struggling to figure out a real-world use case for this. I'd love if anyone here can enlighten me.
I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.
So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.
Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.
So I feel like I'm missing something. What is this actually intended for?
I wish the site explained why they built it -- if it was just "because we can" or if projects or teams actually had the need for git for data?
The application backing use case is best suited for when you have parts of your database that get updated periodically and need human review. So you have a production database that you serve to your customers. Then you have a branch / fork of that (dev) that your development team adds batches of products to. Once a week you do a data release: submit a PR from dev -> prod, have somebody review all the new copy, and merge it once you're happy. If there's a big mistake, just back it out again. We have several paying customers building products around this workflow.
As for lots of people collaborating on data together, we have started a data bounties program where we pay volunteers to assemble large datasets. Two have completed so far, and a third is in progress. For the first one, we paid $25k to assemble precinct-level voting data for the 2016 and 2020 presidential elections. For the second, we paid $10k to get procedure prices for US hospitals. You can read about them here:
What's cool is that novices can make a really good income from data entry as a side gig, and it's two orders of magnitude cheaper than hiring a firm to build data sets for you.
You're right that the site is kind of vague about what dolt is "for." It's a really general, very multi-purpose tool that we think will get used a lot of places. Here's a blog we wrote a while back about some of the use cases we envision.
This won't work for the usual database use cases. It's meant for interactive work with data, the same way you work with code.
Who needs that?
Data scientists working with large datasets. You want to be able to update data without redownloading everything.
Also make your local changes (some data cleaning) and propose your updates upstream same way you would with git.
Having many people working interactively with data is common here.
One of the companies I work with provides a set of data distributed to their partners on a daily basis.
Once it grew larger, downloading everything daily became an issue, so incremental pulls would be desirable.
I have large data model that I need to deploy to production and update once in a while.
For code, network usage is kept to minimum because we have git. For data, options are limited.
As with git, it is something that once you have, you will find a lot of usecases that make life easier and open many new doors.
Here's one thing I'd have used it for: Video game assets.
Say you have a tabletop game engine for designing starships. Different settings have different lists of parts. Some settings are run by a game's DM, some are collaborative efforts. I ended up saving the lists of parts in huge JSON files and dumping those into git. However, for much the same reason that data science is often done in a REPL or notebook type interface, it turned out that by far the most efficient way for people to iterate on assets was to boot up the game, fiddle with the parts in-engine until things looked right, then replicate their changes back into the JSON. With this, we could just save the asset database directly.
The same reasoning should hold for effectively any dataset which a) can be factored into encapsulated parts b) isn't natively linear c) needs multiple developers. Game assets are one example, as I described above. Other datasets that hold: ML training/testing sets, dictionaries, spreadsheets, catalogs, datasets for bio papers.
I might look at it for work. Compliance requires us to keep track of who made what change when, and who it was approved by in case regulators need it.
Right now, this often means making an MR on Git with your snippet of SQL, getting it approved, and then manually executing it. This would let us bring the apply stage in, as well as avoid "that SQL query didn't do exactly what I expected" issues.
It's possible to do it within the SQL engine, but then I have to maintain that, which I would prefer not to do, as well as deal with the performance implications.
I see a use-case for public resource data sets. For example, right of way, land use variance, permits, and land records. These are fundamentally public data sets that are often maintained by entities with limited budgets, and the potential (even incentive) for fraud is substantial. Also, there is significant value in being able to access the data set for analytical purposes such as real estate analysis.
I am not associated with Dolt, but I really like the idea of Dolt personally. I do see use cases, but not without challenges.
One of the main use cases you can see them targeting, and that I think makes a ton of sense, is providing tools for collecting, maintaining and publishing reference data sets using crowd sourcing.
For example, they are doing this with hospital charge codes (a.k.a. chargemaster data). Hospitals in the US are required to publish this data for transparency. However, I have never seen a single aggregated national (or international) data set of all these charges. In fact, such a data set could be worth a lot of money to a lot of organizations for many reasons. I used to work in health insurance, gathering data from all kinds of sources (government rules/regs, etc.) and it was a lot of hard work: scraping, structuring, maintaining, etc.
This reference data can be used for analytics, to power table-driven business logic, machine learning - to help identify cost inequalities, efficiencies, maybe even illicit price gouging, etc. There are so many reference data sets that have similar characteristics... and "data marketplaces" in a way are targeted at making "private" reference data sets available for sale - so then where is the "open" data marketplace? Well, here you go.. Dolt.
I have often realized that the more ways we can make things collaborative, the better off we will be.
Data is one of those things where, coming up with common, public reference datasets is difficult and there are lots of different perspectives ("branches"), sometimes your data set is missing something and it would be cool if someone could propose it ("pull request"), sometimes you want to compare the old and new version of a dataset ("diff") to see what is different.
One difficult thing about Dolt is, it will only be successful if people are actually willing to work together to cook up and maintain common data sets collaboratively, or if those doing so have an incentive to manage an "open data" project on Dolt as benevolent maintainers, for example. But, I could say then it has the same challenges as "open source" in general, so therefore it is not really that different.
Dolt could even be used as a foundation for a master data management registry - in the sense of you could pop it in as a "communal data fountain" if you will where anybody in your org, or on the Web, etc. could contribute - and you can have benevolent maintainers look over the general quality of the data. Dolt would be missing the data quality/fuzzy matching aspect that master data tools offer, but this is a start for sure.
For example, I work in a giant corporation right now. Each department prefers to maintain its own domain data and in some cases duplicates common data in that domain. Imagine using Dolt to make it possible for all these different domains to collaborate on a single copy of common data in a "Dolt" data set - now people can share data on a single copy and use pull requests, etc. to have an orderly debate on what that common data schema and data set should look like.
I think it's an idea that is very timely.
P.S. Dolt maintainers, if you read this and want to talk, I'm game! Awesome work :)
Your final example is very like the data mesh ideas coming out of thought works and elsewhere (https://martinfowler.com/articles/data-mesh-principles.html). Data products being owned by data producers and common data models ONLY when required. It is as much organizational model as technology, but I don't really think this maps to SQL tables. You probably want to look at a versioned knowledge graph of some kind. Downside is no SQL, upside is flexibility and speed. (disclaimer - I work over at TerminusDB, the graph version of Dolt)
agree 100% this looks awesome for reference datasets.
not so sure how well it would work for live data sources that update with time as it could encourage people to apply more ad-hoc edits as opposed to getting their version controlled jobs to work 100%, but who knows, maybe that would be a net win in some cases?
Or more simply put, how about table-driven logic in general? It doesn't have to be as complex as machine learning. There are more use cases than just machine learning, IMHO.
It might not be exactly what you are looking for, but git-bug[1] encodes data into regular git objects, with merges and conflict resolution. I'm mentioning this because the hard part is providing an ordering of events. Once you have that, you can store and recreate whatever state you want.
The branch[2] I'm almost done with removes the purely linear constraint and allows full DAGs (that is, concurrent editing) while still providing a good ordering.
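One way to get such an ordering can be sketched roughly like this (a toy illustration, not git-bug's actual algorithm): sort operations by Lamport clock so parents always precede children, and break ties between concurrent operations with a content hash, so every replica derives the same sequence and therefore the same final state.

```python
import hashlib

def order_ops(ops):
    """ops: {op_id: (payload, [parent_ids])} -> deterministic total order.
    Parents come first (Lamport clock); concurrent ops are tie-broken
    by a hash of their payload, so the order is replica-independent."""
    clock = {}
    def lamport(op):
        if op not in clock:
            parents = ops[op][1]
            clock[op] = 1 + max((lamport(p) for p in parents), default=0)
        return clock[op]
    def tiebreak(op):
        return hashlib.sha256(ops[op][0].encode()).hexdigest()
    return sorted(ops, key=lambda op: (lamport(op), tiebreak(op)))

# Two concurrent edits ("b" and "c") on top of "a":
ops = {"a": ("set x=1", []),
       "b": ("set x=2", ["a"]),
       "c": ("set y=3", ["a"])}
seq = order_ops(ops)
```

Every replica holding the same DAG computes the same `seq`, which is what lets state be recreated by replaying operations in that order.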
It’s a bit slower, but smart use of partial/shallow clones can address performance degradation on large repositories over time. You just need to take care of the transformation between “physical” trees/blobs and “logical” objects in your dataset (which may not have 1:1 mapping, as having physical layer more granular reduces likelihood of merge conflicts).
I’m also following Pijul, which seems very promising in regards to versioning data—I believe they might introduce primitives allowing to operate on changes in actual data structures rather than between lines in files, like with Git.
Add to that a sound theory of patches, and that's a definite win over Git (or Dolt for that matter, which seems to be the same old Git but for SQL).
Lack of locking for binary files, overhead > 1gb and all the shenanigans you need to do for proxy servers. There's better solutions out there but they aren't free.
I find a balance between this using git on JSON files. And I build the JSON files into a database (1 file per record, 1 directory per table, subdirectories for indexes). The whole thing is pretty beautiful, and it's functioning well for a user-account, access management database I'm running in production. I like that I can go back and do:
`git diff -p` to see the users who have signed up recently, for example.
The advantages of this approach are using existing unix tooling for text files, solid versioning, easy inspect-ability, and leveraging the filesystem B-Tree indexing as a fast index structure (rather than having to write my b-trees). Another advantage is hardware-linked scaling. For example, if I use regular hard disks, it's slower. But if I use SSDs it's faster. And i should also be possible to mount the DB as a RAM disk and make it super fast.
The disadvantages are that the database side still only supports a couple of operations (like exact, multikey searches, lookup by ID, and so on) rather than a rich query language. I'm OK with that for now, and I'm also thinking of using skiplists in future to get nice ordering property for the keys in an index so I can easily iterate and page over those.
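A minimal sketch of that one-file-per-record layout (names illustrative; no index subdirectories shown). Writing JSON with sorted keys and indentation keeps the files stable, which is what makes `git diff -p` output readable:

```python
import json
import os
import tempfile

class FileTable:
    """One directory per table, one JSON file per record, keyed by ID."""
    def __init__(self, root, table):
        self.dir = os.path.join(root, table)
        os.makedirs(self.dir, exist_ok=True)

    def put(self, record_id, record):
        path = os.path.join(self.dir, f"{record_id}.json")
        with open(path, "w") as f:
            # sorted keys + indentation -> stable, diff-friendly files
            json.dump(record, f, indent=2, sort_keys=True)

    def get(self, record_id):
        with open(os.path.join(self.dir, f"{record_id}.json")) as f:
            return json.load(f)

root = tempfile.mkdtemp()      # in practice, a git working tree
users = FileTable(root, "users")
users.put("alice", {"name": "Alice", "role": "admin"})
```

The filesystem's own directory lookup serves as the primary-key index, which is the "leveraging the filesystem B-Tree" point above.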
Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one. Complexity is a liability. Mercurial has a nicer interface. And now I see the real simplicity of non-distributed source control systems. I have never actually needed to work in a distributed manner, just client-server. I have never sent a patch to another dev to patch into their local repo or whatnot. All this complexity seems like a solution chasing after a problem - at least for most developers. What works for Linux isn't necessary for most teams.
> Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one.
Cars are too complicated. They are barely usable for daily tasks. Look at how many people have to Google for basic things like changing a fan belt, or fixing cylinder head gasket.
You can fill in almost anything here. Most tools are complicated. Yet you don’t need to know their ins and outs for them to be useful to you.
Dolt boasts its likeness to Git as a feature. Does this mean it’ll also be barely usable for daily tasks? Is it possible for a project to faithfully reproduce the entirety of Git's command interface and be less complicated than Git, and not suffer from the same shortcomings?
I personally think Git isn’t that bad, once it’s understood. It could be counter-intuitive sometimes though (as an example, for the longest time I used Git without realizing it stores a snapshot of each file and diffs/deltas are only computed when required). Just trying to be pragmatic and not expecting a tool like Git to be entirely free of leaky abstractions.
We used git to store and manage data sets for a machine learning project involving chewing detection with audio data used in training. It was cumbersome and the huge datasets caused some problems with git (e.g., searches of our code base got really slow because the data was being searched also until we moved the data to a different repo). Something easier to use that could manage large datasets would be useful.
I wonder if Dolt could be used to create a clone of Apple's Time Machine. Seems like the basics are there.
What people usually miss about these things is that normal version control benefits hugely from content addressing and normal forms.
The salient aspect of relational data is that it's cyclic. This makes content addressing unable to provide normal forms on its own (unless someone figures out how to Merkle cyclic graphs!), but the normal form can still be made other ways.
The first part is easy enough: store rows in some canonical order.
The second part is more interesting: making the choice of surrogate keys not matter (quotienting it away). Sorting table rows containing surrogate keys depending on the sorting of table rows makes for some interesting bags of constraints, for which there may be more than one fixed point.
Example:
CREATE TABLE Foo (
a uuid PRIMARY KEY,
b text,
best_friend uuid REFERENCES Foo(a)
);
DB 0:
0 Alice 0
1 reclusive Alice, best friends with herself. Just fine.
0 Alice 1
1 Alice 1
2 reclusive Alices, both best friends with the second one. The Alices are the same up to primary keys, but while primary keys are to be quotiented out, primary key equality isn't, so this is valid. And we have an asymmetry by which to sort.
0 Alice 1
1 Alice 0
2 reclusive Alices, each best friends with the other. The Alices are completely isomorphic, and one notion of normal forms would say this is exactly the same as DB 0: as if this is reclusive Alice in a fun house of mirrors.
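A toy sketch of quotienting away the surrogate keys (not any shipping system's algorithm): color each row by its non-key attributes only, then repeatedly re-color it by the colors of the rows it points to and is pointed at by, in the style of Weisfeiler-Leman refinement. DB 0 and the fun-house DB come out with identical per-row colors (only the row count distinguishes them), while the asymmetric DB is separable:

```python
import hashlib

def h(*parts):
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def refine(rows, rounds=4):
    """rows: {pk: (label, friend_pk)}. Returns the sorted per-row colors,
    computed without ever hashing the surrogate keys themselves."""
    color = {pk: h(label) for pk, (label, _) in rows.items()}
    for _ in range(rounds):
        preds = {pk: [] for pk in rows}
        for pk, (_, friend) in rows.items():
            preds[friend].append(pk)
        color = {pk: h(color[pk],                             # own color
                       color[rows[pk][1]],                    # successor
                       *sorted(color[p] for p in preds[pk]))  # predecessors
                 for pk in rows}
    return sorted(color.values())

db0 = {0: ("Alice", 0)}                       # Alice, self-befriended
db1 = {0: ("Alice", 1), 1: ("Alice", 1)}      # both befriend the second
db2 = {0: ("Alice", 1), 1: ("Alice", 0)}      # mutual best friends
```

`db0` and `db2` are bisimilar, so the fingerprint assigns them the same colors; only counting rows tells them apart, which is exactly the "fun house of mirrors" subtlety.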
All this is resolvable, but it's subtle. And there's no avoiding complexity. E.g. if one wants to cross reference two human data entries who each assigned their own surrogate IDs, this type of analysis must be done. Likewise when merging forks of a database.
I'd love to be wrong, but I don't think any of the folks doing "git for data" are planning their technology with this level of mathematical rigor.
A lot of your comment went over my head, but I have modeled relational data in a way conducive to being stored in a Merkle tree. The trick being that every entity in the system ended up having two IDs. A hash ID, identifying this specific version of the object, and an entity ID (probably a UUID or OID), which remained constant as new versions were added. In a situation where people can have friends that are also people, they are friends with the person long-term, not just a specific version of the friend, so in that case you'd use the entity IDs. Though you might also include a reference to the specific version of the person at the point in time at which they became friends, in which case they would necessarily reference an older version of that person. If you friend yourself, you're actually friending that person a moment ago.
A list of all current entities, by hash, is stored higher up the tree. Whether it's better that objects themselves store their entity ID or if that's a separate data structure mapping entity to hash IDs depends on the situation.
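A rough sketch of that two-ID scheme (all names illustrative): version hashes are computed over the entity IDs of friends, never over their hash IDs, so self-reference and mutual friendship can't create a hash-of-itself cycle.

```python
import hashlib
import json
import uuid

def hash_id(entity_id, version, fields, friend_entity_ids):
    """Content hash naming one immutable version of an entity.
    Cross-references use stable entity IDs, breaking any cycles."""
    doc = {"entity": entity_id,
           "version": version,
           "fields": fields,
           "friends": sorted(friend_entity_ids)}
    return hashlib.sha256(
        json.dumps(doc, sort_keys=True).encode()).hexdigest()

alice = str(uuid.uuid4())          # entity ID: constant across versions
v1 = hash_id(alice, 1, {"name": "Alice"}, [])
# Alice befriends herself: the reference is her entity ID, so the new
# version's hash is well-defined (it effectively points at the Alice
# of a moment ago, as described above).
v2 = hash_id(alice, 2, {"name": "Alice"}, [alice])
```

The hash is deterministic (sorted keys, sorted friend list), so two replicas storing the same version derive the same hash ID for it.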
On second reading I guess your comment was actually about how to come up with content-based IDs for objects. I guess my point was that in the real world you don't usually need to do that, because if object identity besides its content is important you can just give it an arbitrary ID. How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?
> ...The trick being that every entity in the system ended up having two IDs...
I think we agree that this is a partial solution. Adding a temporal dimension and referencing immutable single-versions only can break cycles by making them unconstructable in the first place. But once an object refers to foreign entity IDs, hash IDs become "polluted" with surrogate values.
> How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?
I think it would come up with forks that have some identical edits, especially if those edits are in different orders. In that case, surrogate keygen state would get out of sync (whether it's counters or UUID state). Either we pessimize merges, or we need some way to recover.
I think allowing for "identical Alices" is probably necessary in practice, but an interface should have a warning of some sort about this. (Maybe ask the Alices more questions until you can differentiate them? Get prepared in case one of the Alices comes back and wants a new ID card and you don't want to easily enable fraud.) Likewise, when merging, those warnings should be brought to the fore, along with a menu of resolutions at the extremes for the user to decide between.
> The salient aspect of relational data is that it's cyclic
This is an odd claim. Most relational data is not cyclic, and it's easy enough to come up with a scheme to handle cyclic data in a consistent fashion.
Conflicting changes (two changes to the same 'cell' of a database table) are a much more likely issue to hit and will need handling in much the same way merge conflicts are currently handled, so there are already situations in which manual effort will be needed.
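A cell-level three-way merge for a single row can be sketched like so (illustrative, not Dolt's actual merge code); it mirrors how line-level merges work in git, with a conflict only when both sides changed the same cell differently:

```python
def merge_row(base, ours, theirs):
    """Three-way merge of one row at cell granularity.
    Rows are dicts keyed by column name; returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for col in base:
        b, o, t = base[col], ours[col], theirs[col]
        if o == t:
            merged[col] = o          # both sides agree (or neither changed)
        elif o == b:
            merged[col] = t          # only theirs changed this cell
        elif t == b:
            merged[col] = o          # only ours changed this cell
        else:
            merged[col] = None       # both changed it differently:
            conflicts.append(col)    # needs manual resolution
    return merged, conflicts
```

Non-overlapping edits to different cells of the same row merge cleanly; only a genuine double edit of one cell surfaces as a conflict.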
We have content addressable objects at Splitgraph. [0] And here’s an example of a “point in time” query across two versions of an image on Splitgraph. [1]
I’m on my phone and don’t have a ton of time to respond right now, but I’d recommend reading our docs. We’re working on a lot of what you mention.
(Also, we’re hiring for backend and frontend roles. See my comment history for more.)
Consider the whole database - the whole set of facts across all relations - as the state in the tree. Each transaction a serialized delta that produces a new node in the tree, a new HEAD. That's closer to what's being gotten at, as I see it.
Transaction logs are already not that different from patch sets, and merge conflicts have an isomorphism with replication inconsistencies.
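As a toy illustration of that view (not any particular system's format): treat the database as a dict of facts, and each transaction as a delta whose hash, chained to its parent, names the new HEAD.

```python
import hashlib
import json

def commit(parent_hash, state, delta):
    """state: dict of facts; delta maps key -> new value, or None to
    delete a fact. Returns (new_head_hash, new_state)."""
    new_state = dict(state)
    for key, value in delta.items():
        if value is None:
            new_state.pop(key, None)   # tombstone: remove the fact
        else:
            new_state[key] = value
    # HEAD is determined by the whole chain: parent hash + this delta.
    head = hashlib.sha256(json.dumps(
        [parent_hash, sorted(delta.items())]).encode()).hexdigest()
    return head, new_state

h1, s1 = commit("root", {}, {"x": "1"})
h2, s2 = commit(h1, s1, {"y": "2"})
h3, s3 = commit(h2, s2, {"x": None})   # a delete, as a delta
```

Replaying the same deltas from the same root always reproduces the same HEAD hashes, which is what makes the log-as-patch-set analogy work.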
> Consider the whole database - the whole set of facts across all relations - as the state in the tree.
I tried to demonstrate that this is easier said than done. Deciding the equality/redundancy of facts is very subtle. At some point it might even be a guess whether your two clerks each met the same Alice when entering records in their DB forks or not.
Transactions are just patches, I completely agree. And patches are just partial functions. But deciding what the action is of a partial function, or whether the input is in the domain requires a notion of equality. (Partial functions are encoded very nicely with incomplete pattern matching; in this case the question is whether Alice matches the pattern.)
Basically, I know where you are coming from, and I want it to work too, but you cannot just wave away the math and the issues it points out.
Dolt is Git for data - https://news.ycombinator.com/item?id=22731928 - March 2020 (191 comments)
Git for Data – A TerminusDB Technical Paper [pdf] - https://news.ycombinator.com/item?id=22045801 - Jan 2020 (5 comments)
Ask HN: Would you use a “git for data”? - https://news.ycombinator.com/item?id=11537934 - April 2016 (10 comments)
E.g., SSIS package writes row of data: https://imgur.com/DClXAi5
Then a few months later (on 2020-08-15) we identify that trans_value was imported incorrectly so we update it: https://imgur.com/wdQJWm4
Then whenever we SELECT from the table we always ensure we are extracting "today's" version of the data:
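With illustrative column names (valid_from/valid_to standing in for the effective-date columns in the screenshots), the Type 2 SCD pattern looks roughly like this in sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE trans_scd (
    trans_id    INTEGER,
    trans_value REAL,
    valid_from  TEXT,
    valid_to    TEXT        -- NULL marks the current version
);
-- the original (incorrectly rounded) import
INSERT INTO trans_scd VALUES (42, 10.0, '2020-03-01', NULL);
""")

# 2020-08-15: the rounding bug is found. Close the old version and
# insert a corrected one; the bad value stays in place for traceability.
con.execute("UPDATE trans_scd SET valid_to = '2020-08-15' "
            "WHERE trans_id = 42 AND valid_to IS NULL")
con.execute("INSERT INTO trans_scd VALUES (42, 10.37, '2020-08-15', NULL)")

# "Today's" view only sees rows whose validity window is still open.
current = con.execute(
    "SELECT trans_value FROM trans_scd WHERE valid_to IS NULL").fetchall()
```

Auditors query the full table; applications filter on the open validity window, so the fix and the original mistake both remain visible in history.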
[1] https://en.wikipedia.org/wiki/Slowly_changing_dimension
[1] https://en.wikipedia.org/wiki/Temporal_database
Dolt actually makes branch and merge possible, totally different beast.
A Postgres-equivalent solution is shown in this [0] rather complex but great tutorial.
[0]: https://clarkdave.net/2015/02/historical-records-with-postgr...
However, I'm struggling to figure out a real-world use case for this. I'd love if anyone here can enlighten me.
I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.
So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.
Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.
So I feel like I'm missing something. What is this actually intended for?
I wish the site explained why they built it -- if it was just "because we can" or if projects or teams actually had the need for git for data?
As for lots of people collaborating on data together, we have started a data bounties program where we pay volunteers to assemble large datasets. Two have completed so far, and a third is in progress. For the first one, we paid $25k to assemble precinct-level voting data for the 2016 and 2020 presidential elections. For the second, we paid $10k to get procedure prices for US hospitals. You can read about them here:
https://www.dolthub.com/blog/2021-02-15-election-bounty-revi...
https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/
What's cool is that novices can make a really good income from data entry as a side gig, and it's two orders of magnitude cheaper than hiring a firm to build data sets for you.
You're right that the site is kind of vague about what dolt is "for." It's a really general, very multi-purpose tool that we think will get used a lot of places. Here's a blog we wrote a while back about some of the use cases we envision.
https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/
Deleted Comment
Data scientists working with large datasets. You want to be able to update data without redownloading everything. Also, make your local changes (some data cleaning) and propose your updates upstream the same way you would with git. Having many people working interactively with data is common here.
One of the companies I work with provides a set of data distributed to their partners on a daily basis. Once it grew larger, downloading everything daily became an issue, so something like this would be desirable.
I have a large data model that I need to deploy to production and update once in a while. For code, network usage is kept to a minimum because we have git. For data, the options are limited.
As with git, it is something that once you have, you will find a lot of usecases that make life easier and open many new doors.
Say you have a tabletop game engine for designing starships. Different settings have different lists of parts. Some settings are run by a game's DM, some are collaborative efforts. I ended up saving the lists of parts in huge JSON files and dumping those into git. However, for much the same reason that data science is often done in a REPL or notebook type interface, it turned out that by far the most efficient way for people to iterate on assets was to boot up the game, fiddle with the parts in-engine until things looked right, then replicate their changes back into the JSON. With this, we could just save the asset database directly.
The same reasoning should hold for effectively any dataset which a) can be factored into encapsulated parts b) isn't natively linear c) needs multiple developers. Game assets are one example, as I described above. Other datasets that hold: ML training/testing sets, dictionaries, spreadsheets, catalogs, datasets for bio papers.
Right now, this often means making an MR on Git with your snippet of SQL, getting it approved, and then manually executing it. This would let us bring the apply stage in, as well as avoid "that SQL query didn't do exactly what I expected" issues.
It's possible to do it within the SQL engine, but then I have to maintain that, which I would prefer not to do, as well as deal with the performance implications.
https://discord.com/invite/RFwfYpu
One of the main use cases you can see them targeting, and that I think makes a ton of sense, is providing tools for collecting, maintaining and publishing reference data sets using crowd sourcing.
For example, they are doing this with hospital charge codes (a.k.a. chargemaster data). Hospitals in the US are required to publish this data for transparency. However, I have never seen a single aggregated national (or international) data set of all these charges. In fact, such a data set could be worth a lot of money to a lot of organizations for many reasons. I used to work in health insurance, gathering data from all kinds of sources (government rules/regs, etc.), and it was a lot of hard work: scraping, structuring, maintaining, etc.
This reference data can be used for analytics, to power table-driven business logic, machine learning - to help identify cost inequalities, efficiencies, maybe even illicit price gouging, etc. There are so many reference data sets that have similar characteristics... and "data marketplaces" in a way are targeted at making "private" reference data sets available for sale - so then where is the "open" data marketplace? Well, here you go.. Dolt.
I have often realized that the more ways we can make things collaborative, the better off we will be.
Data is one of those things where coming up with common, public reference datasets is difficult and there are lots of different perspectives ("branches"). Sometimes your data set is missing something and it would be cool if someone could propose it ("pull request"); sometimes you want to compare the old and new versions of a dataset ("diff") to see what is different.
One difficult thing about Dolt is that it will only be successful if people are actually willing to work together to cook up and maintain common data sets collaboratively, or if those doing so have an incentive to manage an "open data" project on Dolt as benevolent maintainers, for example. But then it has the same challenges as "open source" in general, so it is not really that different.
Dolt could even be used as a foundation for a master data management registry - in the sense of you could pop it in as a "communal data fountain" if you will where anybody in your org, or on the Web, etc. could contribute - and you can have benevolent maintainers look over the general quality of the data. Dolt would be missing the data quality/fuzzy matching aspect that master data tools offer, but this is a start for sure.
For example, I work in a giant corporation right now. Each department prefers to maintain its own domain data and in some cases duplicates common data in that domain. Imagine using Dolt to make it possible for all these different domains to collaborate on a single copy of common data in a "Dolt" data set - now people can share data on a single copy and use pull requests, etc. to have an orderly debate on what that common data schema and data set should look like.
I think it's an idea that is very timely.
P.S. Dolt maintainers, if you read this and want to talk, I'm game! Awesome work :)
https://discord.com/invite/RFwfYpu
Not so sure how well it would work for live data sources that update over time, as it could encourage people to apply more ad-hoc edits as opposed to getting their version-controlled jobs to work 100%. But who knows, maybe that would be a net win in some cases?
This branch [2], which I'm almost done with, removes the purely linear branch constraint and allows full DAGs (that is, concurrent editing) while still providing a good ordering.
[1]: https://github.com/MichaelMure/git-bug [2]: https://github.com/MichaelMure/git-bug/pull/532
It’s a bit slower, but smart use of partial/shallow clones can address performance degradation on large repositories over time. You just need to take care of the transformation between “physical” trees/blobs and “logical” objects in your dataset (which may not have 1:1 mapping, as having physical layer more granular reduces likelihood of merge conflicts).
I’m also following Pijul, which seems very promising with regard to versioning data. I believe they might introduce primitives allowing one to operate on changes in actual data structures, rather than between lines in files as with Git.
Add to that a sound theory of patches, and that’s a definite win over Git (or Dolt for that matter, which seems to be the same old Git, but for SQL).
Lack of locking for binary files, overhead above 1 GB, and all the shenanigans you need to do for proxy servers. There are better solutions out there, but they aren't free.
`git diff -p` to see the users who have signed up recently, for example.
You can get the code, over at: https://github.com/i5ik/sirdb
The advantages of this approach are using existing Unix tooling for text files, solid versioning, easy inspectability, and leveraging the filesystem's B-tree indexing as a fast index structure (rather than having to write my own B-trees). Another advantage is hardware-linked scaling: with regular hard disks it's slower, with SSDs it's faster, and it should also be possible to mount the DB as a RAM disk and make it super fast.
The disadvantages are that the database side still only supports a couple of operations (like exact multi-key searches, lookup by ID, and so on) rather than a rich query language. I'm OK with that for now, and I'm also thinking of using skip lists in the future to get a nice ordering property for the keys in an index, so I can easily iterate and page over them.
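A minimal sketch of the file-per-record idea: one JSON file per record, keyed by ID, with the filesystem's own directory tree doing the indexing. The names here (Store, put, get, the two-character sharding) are my own illustration, not sirdb's actual API:

```python
import json
import os
import tempfile

class Store:
    """Toy sirdb-style store: one pretty-printed JSON file per record."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, record_id):
        # Shard by the first two characters of the ID so directories
        # stay small and filesystem lookups stay fast.
        return os.path.join(self.root, record_id[:2], record_id + ".json")

    def put(self, record_id, record):
        path = self._path(record_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            # Stable formatting keeps `git diff` output readable.
            json.dump(record, f, indent=2, sort_keys=True)

    def get(self, record_id):
        with open(self._path(record_id)) as f:
            return json.load(f)

db = Store(tempfile.mkdtemp())
db.put("ab42", {"name": "Alice", "signed_up": "2021-03-17"})
print(db.get("ab42")["name"])  # Alice
```

Because every record is a small, stably formatted text file, committing the store directory to git gives you history and diffs for free.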
Cars are too complicated. They are barely usable for daily tasks. Look at how many people have to Google for basic things like changing a fan belt, or fixing cylinder head gasket.
You can fill in almost anything here. Most tools are complicated, yet you don't need to know their ins and outs for them to be useful to you.
I personally think Git isn’t that bad, once it’s understood. It could be counter-intuitive sometimes though (as an example, for the longest time I used Git without realizing it stores a snapshot of each file and diffs/deltas are only computed when required). Just trying to be pragmatic and not expecting a tool like Git to be entirely free of leaky abstractions.
I wonder if Dolt could be used to create a clone of Apple's Time Machine. Seems like the basics are there.
The salient aspect of relational data is that it's cyclic. This makes content addressing unable to provide normal forms on its own (unless someone figures out how to Merkle cyclic graphs!), but the normal form can still be made other ways.
The first part is easy enough: store rows in some order.
The second part is more interesting: making the choice of surrogate keys not matter (quotienting it away). Sorting table rows containing surrogate keys depending on the sorting of table rows makes for some interesting bags of constraints, for which there may be more than one fixed point.
Example:
DB 0: one reclusive Alice, best friends with herself. Just fine.
DB 1: two reclusive Alices, both best friends with the second one. The Alices are the same up to primary keys, but while primary keys are to be quotiented out, primary key equality isn't, so this is valid. And we have an asymmetry by which to sort.
DB 2: two reclusive Alices, each best friends with the other. The Alices are completely isomorphic, and one notion of normal forms would say this is exactly the same as DB 0: as if this is reclusive Alice in a fun house of mirrors.
All this is resolvable, but it's subtle. And there's no avoiding the complexity. E.g. if one wants to cross-reference two human data entries who each assigned their own surrogate IDs, this type of analysis must be done. Likewise when merging forks of a database.
I'd love to be wrong, but I don't think any of the folks doing "git for data" are planning their technology with this level of mathematical rigor.
A list of all current entities, by hash, is stored higher up the tree. Whether it's better that objects themselves store their entity ID or if that's a separate data structure mapping entity to hash IDs depends on the situation.
On second reading I guess your comment was actually about how to come up with content-based IDs for objects. I guess my point was that in the real world you don't usually need to do that, because if object identity besides its content is important you can just give it an arbitrary ID. How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?
I think we agree that this is a partial solution. Adding a temporal dimension and referencing immutable single-versions only can break cycles by making them unconstructable in the first place. But once an object refers to foreign entity IDs, hash IDs become "polluted" with surrogate values.
> How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?
I think it would come up with forks that have some identical edits, especially if those edits are in different orders. In that case, surrogate keygen state would get out of sync (whether it's counters or UUID state). Either we pessimize merges, or we need some way to recover.
I think allowing for "identical Alices" is probably necessary in practice, but an interface should have a warning of some sort about this. (Maybe ask the Alices more questions until you can differentiate them? Be prepared in case one of the Alices comes back and wants a new ID card, and you don't want to easily enable fraud.) Likewise, when merging, those warnings should be brought to the fore, along with a menu of resolutions at the extremes for the user to decide between.
This is an odd claim. Most relational data is not cyclic, and it's easy enough to come up with a scheme to handle cyclic data in a consistent fashion.
Conflicting changes (two changes to the same 'cell' of a database table) are a much more likely issue to hit and will need handling in much the same way merge conflicts are currently handled, so there are already situations in which manual effort will be needed.
I’m on my phone and don’t have a ton of time to respond right now, but I’d recommend reading our docs. We’re working on a lot of what you mention.
(Also, we’re hiring for backend and frontend roles. See my comment history for more.)
[0] https://www.splitgraph.com/docs/concepts/objects
[1] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...
Transaction logs are already not that different from patch sets, and merge conflicts have an isomorphism with replication inconsistencies.
I tried to demonstrate that this is easier said than done. Deciding the equality/redundancy of facts is very subtle. At some point it might even be a guess whether your two clerks each met the same Alice when entering her into their DB forks or not.
Transactions are just patches, I completely agree. And patches are just partial functions. But deciding what the action of a partial function is, or whether the input is in its domain, requires a notion of equality. (Partial functions are encoded very nicely with incomplete pattern matching; in this case the question is whether Alice matches the pattern.)
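As a sketch of that "patch = partial function" view (apply_patch and the dict shapes here are illustrative, not any real Dolt or Pijul API): a patch carries the row it expects to find, and applying it requires deciding whether the database's row equals that expectation, which is exactly the "is this the same Alice?" question.

```python
def apply_patch(db, patch):
    """Apply a patch as a partial function: defined only when the
    current row matches the patch's 'before' precondition."""
    key, before, after = patch["key"], patch["before"], patch["after"]
    if db.get(key) != before:
        # Input outside the patch's domain: the pattern doesn't match,
        # so this is a conflict rather than a silent overwrite.
        raise ValueError("conflict: row does not match patch precondition")
    return {**db, key: after}

db = {"alice": {"name": "Alice", "friend": "alice"}}
patch = {"key": "alice",
         "before": {"name": "Alice", "friend": "alice"},
         "after": {"name": "Alice", "friend": "bob"}}

print(apply_patch(db, patch)["alice"]["friend"])  # bob
```

The notion of equality is doing all the work in that `!=` check: if two forks generated different surrogate keys for "the same" Alice, the precondition fails even though, morally, the patch should apply.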
Basically, I know where you are coming from, and I want it to work too, but you cannot just wave away the math and the issues it points out.