So I assume this means CockroachDB currently (probably) meets its promised consistency levels. How does it compare to the usual default settings of PostgreSQL, for example? I think you get SERIALIZABLE, so the behaviour should be very reasonable.
I understand the txn/s numbers are from a pathological scenario, but will these transactions block the whole DB from moving faster, or do unrelated transactions still get normal throughput?
I'm just glad we get linearizable single keys; I had to switch to HBase because it was the only choice at the time. Cassandra did not even give me read-your-own-writes. HBase also had issues while testing on the same host: delete+write led to deletes being reordered after the writes because they had the same timestamp.
I'm a bit disappointed in HN on this post: everyone only talks about the name, but 90% probably did not even read the article.
Aaanyway, here's my 2 cents: https://cr.yp.to/cdb.html is CDB for me :P - so better choose another acronym!
> How does it compare to the usual default settings of PostgreSQL, for example? I think you get SERIALIZABLE, so the behaviour should be very reasonable.
Not clear if you're referring to PostgreSQL or CockroachDB, but for the record, the PostgreSQL default transaction isolation level is READ COMMITTED, not SERIALIZABLE: https://www.postgresql.org/docs/current/static/sql-set-trans...
(Cockroach Labs CTO) CockroachDB provides two transaction isolation levels: SERIALIZABLE, which is our default and the highest of the four standard SQL isolation levels, and SNAPSHOT, which is a slightly weaker mode similar to (but not quite the same as) REPEATABLE READ. Unlike most databases which default to lower isolation modes like REPEATABLE READ or READ COMMITTED, we default to our highest setting because we don't think you should have to think about which anomalies are allowed by different modes, and we don't want to trade consistency for performance unless the application opts in.
All transaction interactions are localized to particular keys, so other transactions can proceed normally while there is contention in an unrelated part of the database.
(as for the acronym, we prefer CRDB instead of CDB)
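As an aside on the defaults point above: since CockroachDB speaks the PostgreSQL wire protocol, the "opt in" difference is easy to see from a client. Here's a minimal sketch against stock PostgreSQL using node's `pg` client, showing that you have to ask for SERIALIZABLE explicitly there; the connection settings and the `accounts` table are placeholders, and the SHOW statement is PostgreSQL-specific.

```typescript
import { Pool } from "pg"; // CockroachDB speaks the same wire protocol

// Placeholder connection settings for a local PostgreSQL instance.
const pool = new Pool({ host: "localhost", port: 5432, user: "app", database: "bank" });

async function main() {
  // Stock PostgreSQL defaults to READ COMMITTED, so serializability is opt-in:
  console.log((await pool.query("SHOW default_transaction_isolation")).rows[0]);

  const client = await pool.connect();
  try {
    // Explicitly opt in to the strongest isolation level for this transaction.
    await client.query("BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE");
    await client.query("UPDATE accounts SET balance = balance - 10 WHERE id = $1", [1]);
    await client.query("UPDATE accounts SET balance = balance + 10 WHERE id = $1", [2]);
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
  await pool.end();
}

main().catch(console.error);
```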
I don't believe CockroachDB provides SERIALIZABLE consistency for the entire keyspace, but rather for subsets of the keyspace that reside in the same Raft cluster.
In practice this means no cross-key serializability without including a serialization token in transactions.
I may be mistaken, however; that is my understanding from reading the CockroachDB docs and Aphyr's post.
Cockroach Labs employee here. We provide serializable consistency for the whole keyspace. It's just that it's easier when you stay in one Raft ensemble as less can go wrong.
"For instance, on a cluster of five m3.large nodes, an even mixture of processes performing single-row inserts and selects over a few hundred rows pushed ~40 inserts and ~20 reads per second (steady-state)."
Jepsen is essentially a worst-case stress test for a consistent database: all the transactions conflict with each other. CockroachDB's optimistic concurrency control performs worse with this kind of workload than the pessimistic (lock-based) concurrency control seen in most non-distributed databases. But this high-contention scenario is not the norm for most databases. Most transactions don't conflict with each other and can proceed without waiting. Without contention, CockroachDB can serve thousands of reads or writes per second per node (and depending on access patterns, can scale linearly with the number of nodes).
And of course, we're continually working on performance (it's our main focus as we work towards 1.0), and things will get better in both the high- and low-contention scenarios. Just today we're about to land a major improvement for many high-contention tests, although it didn't speed up the Jepsen tests as much as we had hoped.
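To make the optimistic-concurrency point concrete: under contention the losing transaction is aborted with a retryable error instead of waiting on a lock, so clients are expected to retry. Below is a rough client-side retry sketch; the `pg` wiring, connection settings, `counters` table, and backoff values are my assumptions for illustration, not code from the Jepsen tests. SQLSTATE 40001 is the standard serialization-failure code that both PostgreSQL and CockroachDB report for such conflicts.

```typescript
import { Pool, PoolClient } from "pg";

// Assumed local single-node setup (CockroachDB listens on 26257 by default).
const pool = new Pool({ host: "localhost", port: 26257, user: "root", database: "jepsen" });

// Retry a transaction body when it loses a conflict (SQLSTATE 40001).
async function withRetry<T>(body: (c: PoolClient) => Promise<T>, attempts = 5): Promise<T> {
  for (let i = 0; ; i++) {
    const client = await pool.connect();
    try {
      await client.query("BEGIN");
      const result = await body(client);
      await client.query("COMMIT");
      return result;
    } catch (err: any) {
      await client.query("ROLLBACK").catch(() => {});
      if (err.code !== "40001" || i >= attempts - 1) throw err; // only retry conflicts
      await new Promise((r) => setTimeout(r, 10 * 2 ** i)); // simple exponential backoff
    } finally {
      client.release();
    }
  }
}

// Example: contended counter updates; concurrent callers will occasionally retry.
const bump = (id: number) =>
  withRetry((c) => c.query("UPDATE counters SET n = n + 1 WHERE id = $1", [id]));
```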
I've been testing with node.js's pg client (not even using the 'native' option) and on one node I get >1,000 single inserts/sec. I'm seeing this scale linearly, too. NB I'm testing on a pretty overloaded MacBook with only 8 GB RAM. On a decent-sized cluster with production-grade hardware and network, performance is really unlikely to be a bottleneck. Also, it'll significantly improve during this year, one imagines.
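For anyone who wants to reproduce this kind of number, here is a minimal sketch of that sort of insert benchmark with the same `pg` client. The table name, connection settings, and concurrency are made up, and it only gives a rough steady-state figure.

```typescript
import { Pool } from "pg";

// Assumed local single-node CockroachDB; adjust host/port/user as needed.
const pool = new Pool({ host: "localhost", port: 26257, user: "root", database: "test", max: 20 });

// Fire `concurrency` independent insert loops for `seconds` and report inserts/sec.
async function benchmark(seconds = 10, concurrency = 20): Promise<number> {
  await pool.query("CREATE TABLE IF NOT EXISTS bench (id SERIAL PRIMARY KEY, v INT)");
  const deadline = Date.now() + seconds * 1000;
  let count = 0;

  const worker = async () => {
    while (Date.now() < deadline) {
      await pool.query("INSERT INTO bench (v) VALUES ($1)", [Math.floor(Math.random() * 1000)]);
      count++;
    }
  };

  await Promise.all(Array.from({ length: concurrency }, worker));
  return count / seconds;
}

benchmark()
  .then((rate) => console.log(`${rate.toFixed(0)} inserts/sec`))
  .catch(console.error)
  .finally(() => pool.end());
```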
a.) this is a beta product where the team has been focusing on completeness and correctness before performance
b.) I was testing Cockroach from August through October--Cockroach Labs has put a good deal of work into performance in the last three months, but that work hasn't been merged into mainline yet. I expect things should be a good deal faster by 1.0.
c.) Jepsen does some intentionally pathological things to databases. For instance, we choose extremely small Cockroach shard sizes to force constant rebalancing of the keyspace under inserts. Don't treat this as representative of a production workload.
The clock skew limit controls the maximum latency on reads--so yeah, if you're doing reads that contend with writes, improving clock resolution improves latency. Not entiiirely sure about the throughput impact though.
Even if they manage to multiply this by 100 in the final release, it's still way weaker than a regular SQL DB. I hope they have a selling point other than performance.
The test is testing the worst-case scenario where everything needs locking. CockroachDB uses an optimistic "locking" approach, which makes this case bad. But if your use case is strictly linearizable high-throughput reads, good luck finding anything that is not a single-machine database.
Like Spanner, Cockroach’s correctness depends on the strength of its clocks... Unlike Spanner, CockroachDB users are likely deploying on commodity hardware or the cloud, without GPS and atomic clocks for reference.
I'm curious, do any of the cloud providers offer instances with high-precision clocks? Maybe this is something one of the smaller cloud providers might want to offer to differentiate themselves - instances with GPS clocks.
The Spanner announcement mentioned offering direct access to the TrueTime API sometime in the not-too-distant future.
Edit: Quote from Brewer's post: "Next steps: For a significantly deeper dive into the details, see the white paper also released today. It covers Spanner, consistency and availability in depth (including new data). It also looks at the role played by Google’s TrueTime system, which provides a globally synchronized clock. We intend to release TrueTime for direct use by Cloud customers in the future."
Bingo. I've been hoping Google (or, heck, any colo provider) would offer this for a few years. Semi-synchronous clocks enable significant performance optimizations!
> I'm curious, do any of the cloud providers offer instances with high-precision clocks?
Essentially that's what Google's doing by selling Spanner, but you don't get access to the clocks directly. I'd guess that as they get more experience selling Spanner, and they bring the price of the hardware down, they'll rent those directly too.
> I'm curious, do any of the cloud providers offer instances with high-precision clocks?
That's a great question.
Could be a little tricky with AWS, for example. It's typical to run a database in a VPC (virtual private cloud), since that layer of the stack doesn't need to be exposed to the Internet. Unfortunately, that means the servers in the VPC can't get to the *.amazon.pool.ntp.org servers.
Skimming around, I don't see any of the cloud providers, VPS providers, or dedicated hosting companies offering anything special, like a local stratum 1 time source for each region/location.
In addition to being extremely precise, TrueTime is explicit about error in its measurements.
In other words, a call to TrueTime returns a lower and upper bound for the current time, which is important for systems like Spanner or CockroachDB when operations must be linearized.
In effect, Spanner sleeps on each operation for a duration of [latest_upper_bound - current_lower_bound] to ensure that operations are strictly ordered.
Your suggestion might improve typical clock synchronization, but still leaves defining the critical offset threshold to the operator, and importantly leaves little choice for software like CockroachDB other than to crash when it detects that the invariant is violated.
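A toy sketch of the commit-wait idea described above, just to show the shape of the rule (don't acknowledge a timestamp until it is guaranteed to be in the past on every node). The `TimeInterval` interface, `nowInterval` helper, and epsilon value are hypothetical stand-ins for a TrueTime-style API, not anything Spanner or CockroachDB actually exposes.

```typescript
// Hypothetical bounded-uncertainty clock: the true time is known to lie in [earliest, latest].
interface TimeInterval {
  earliest: number; // ms since epoch
  latest: number;
}

const EPSILON_MS = 7; // assumed clock uncertainty bound (TrueTime-like)

function nowInterval(): TimeInterval {
  const t = Date.now();
  return { earliest: t - EPSILON_MS, latest: t + EPSILON_MS };
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Commit wait: pick the commit timestamp at the top of the uncertainty window,
// then wait until that timestamp is definitely in the past before acknowledging.
async function commitWait(): Promise<number> {
  const commitTs = nowInterval().latest;
  while (nowInterval().earliest <= commitTs) {
    await sleep(1);
  }
  return commitTs; // any transaction started after this point sees a strictly larger timestamp
}
```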
Wow. Probably fundraising and getting out in front of Google's new consumer Spanner service. I just started experimenting w/ CockroachDB and it sounds/seems great. They def are campaigning hard, landing a Jepsen test and a lot of posts/articles lately.
I am pulling for them over G on principle. I infer they know people at Stripe, where the Rethink team & Aphyr are, so hopefully they can learn from them and build an awesome product. G has lost a lot of battles on UX, docs, attitude, and just general product. They have won a lot of battles on sheer size & resources. It'd be nice to have this stick around.
Hey man, didn't realize you left Stripe but did sort of notice Jepsen getting more serious. Jepsen has phenomenal reach:
Interesting and hilarious coverage of technical complexity which assumes just enough knowledge to engage a keen novice while seemingly engaging seasoned experts in the field.
Thanks for that. Your entertainment firm is a bit out of my wheelhouse, but it's such a crowded space and you end up getting really tied down. Maybe get acquihired for an Iditarod team, but I reckon this is more scalable.
CDB has been a really cool DB for the time I've been playing with it (just a few months, I think).
That said, I'd really appreciate it if they'd launch a hosted DBaaS. By that I mean something more in line with DynamoDB or DocumentDB rather than, say, MySQL on AWS.
I know it's on their timeline, but DBs generally get selected at the start of a project, and unless absolutely necessary, companies don't want to change DBs.
CockroachDB employee here. DBaaS is definitely on our roadmap. There are a lot of moving pieces to building out DBaaS, so our focus first is on honing our core product. In the meantime, we are working on making it as easy as possible to deploy CockroachDB across different cloud vendors and with orchestration tools like Kubernetes.
> Should the clock offset between two nodes grow too large, transactions will no longer be consistent and all bets are off.
A concern I have is that there could be data corruption in this context, no? It sounds like on a per-cluster basis, i.e. the same Raft cluster, there is a strict serializability guarantee. But without linearizability system-wide, you could end up with data corruption on multi-keyspace writes.
You're correct--large clock anomalies (or VM pauses, message delays, etc) can cause serializability anomalies. I don't exactly know what will happen, because our tests explicitly focused on staying within clock skew bounds.
To speculate: if clocks do go out of bounds, I don't... think... you can violate single-key serializability, because that goes through a single Raft cluster, but you can definitely see multi-key consistency anomalies. I think you could also get stale single-key reads, because the read path IIRC relies on a time-based lease. Might be other interesting issues if Cockroach is shuffling data from one Raft ensemble to another--I think the range info itself might have a time-based lease that could go stale as well. Perhaps a Cockroach Labs engineer could comment here?
Note that the time window for these anomalies is limited to a few seconds--nodes will kill themselves when they detect clock skew.
(Cockroach Labs CTO) The major issue with clock offsets is stale reads due to the time-based read lease. Pretty much everything else is independent of the clocks. Stale reads are still enough to get you in trouble (some of the jepsen tests will show serializability violations if you disable the clock-offset suicide and mess with the clocks enough), but it's tricky to actually get bad data written into the database, as opposed to returning a stale-but-consistent view of the data.