I've been working hard to upskill on the consistency and distributed systems side of things. General recommendations:
- Designing Data Intensive Applications. Great overview of... basically everything, and every chapter has dozens of references. Can't recommend it enough.
- Read papers. I've had lots of a-ha moments going to wikipedia and looking up the oldest paper on a topic (wtf was in the water in Massachusetts in the 70s..). Yes they're challenging, no they're not impossible if you have a compsci undergrad equivalent level of knowledge.
- Try and build toy systems. Built out some small and trivial implementations of CRDTs here https://lewiscampbell.tech/sync.html, mainly by reading the papers. They're subtle but they're not rocket science - mere mortals can do this if they apply themselves! (A minimal sketch of the idea is below.)
- Follow cool people in the field. Tigerbeetle stands out to me despite sitting at the opposite end of the consistency/availability corner where I've made my nest. They really are poring over applied dist sys papers and implementing them. I joke that Joran is a dangerous man to listen to because his talks can send you down rabbit-holes and you begin to think maybe he isn't insane for writing his own storage layer..
- Did I mention read papers? Seriously, the research of the smartest people on planet earth is on the internet, available for your consumption, for free. Take a moment to reflect on how incredible that is. Anyone anywhere on planet earth can git gud if they apply themselves.
Any advice how to approach this?
https://twitter.com/DominikTornow
https://twitter.com/jorandirkgreef
https://twitter.com/JungleSilicon
You can also follow me. Not saying I'm cool but I do re-tweet cool people:
https://twitter.com/LewisCTech
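To make the toy-CRDT bullet concrete, here's a minimal grow-only counter (G-Counter) in Python - my own illustration of the idea, not the code from the linked page:

    # G-Counter CRDT: one slot per replica, merge = element-wise max.
    class GCounter:
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = {}  # replica id -> count

        def increment(self, n=1):
            self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Merge is commutative, associative, and idempotent, so replicas
            # converge regardless of how often or in what order they sync.
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), c)

    a, b = GCounter("a"), GCounter("b")
    a.increment(); b.increment(); b.increment()
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 3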
>- Did I mention read papers? Seriously, the research of the smartest people on planet earth is on the internet, available for your consumption, for free. Take a moment to reflect on how incredible that is. Anyone anywhere on planet earth can git gud if they apply themselves.
There is a flood of papers out there with unrepeatable processes. Where can you find quality papers to read?
This is a great point. It's true that there's a wealth of good information out there. But there's so much bad information that we now struggle with a signal vs noise problem. If you don't have enough context and knowledge yet to make the distinction, it's very easy to be led on a wild goose chase. Having access to an expert in the field who can mentor and direct you is invaluable.
Can you explain what you mean by an unrepeatable process? This isn't a physical science; you don't need your own reactor or chemistry lab or anything to repeat what they've done.
Designing Data-Intensive Applications is frequently recommended, but it's 6 years old now. I know that doesn't sound like a lot, but the state of distributed compute is a lot different than it was 6 years ago. Do you feel like it holds up well?
A year or two ago I read Ralph Kimball's seminal The Data Warehouse Toolkit. While I could see why it's still often recommended, it was showing its age in many ways (though it is a fair bit older than DDIA). It felt like a mix of best practices and dated advice, but it was hard to tell for certain which was which.
A lot of what DDIA covers is pretty fundamental stuff. I expect it will age fairly well.
It's not really a book about 'best practices', despite the name. It's more like an encyclopaedia, covering every approach out there, putting them in context, linking to copious reference papers, and talking about their properties on a very conceptual and practical level. It's not really like 'hey use this database vendor!!!'.
DDIA partly goes into implementation details (for example, transaction implementations in popular databases), but the reasoning behind it is always rather fundamental. I think it's going to age well unless some breakthrough happens – but even then only that particular section would be affected.
Other than DDIA, do you have a heuristic for finding the older papers? I often look for the seminal works, but it's hard when you don't know who the key people are in a new field. Do you just look for the highest number of citations on Google Scholar, or something else?
Hmm. The techniques used today were invented in the 1970s-80s.
For example in the world of engines, Heywood is basically god: (https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=John...) has 29k citations!
Yeah just go to wikipedia and find out who coined the term :) I love the original version vector paper, it's from 1983 and kind of funny in parts:
"Network partitioning can completely destroy mutual consistency in the worst case, and this fact has led to a certain amount of restrictiveness, vagueness, and even nervousness in past discussions, of how it may be handled"
But as a general starting point, all roads seem to lead to Lamport 78 (Time, Clocks). If you have a specific area of interest I or others might be able to point you in the right direction.
https://lamport.azurewebsites.net/pubs/time-clocks.pdf
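If it helps, the core mechanism of Time, Clocks fits in a few lines. A toy Python sketch (my own rendering, not the paper's notation):

    # Lamport logical clock: tick on local events; on receive, jump to
    # max(local, received) + 1 so causally related events stay ordered.
    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):  # local event (also used to stamp outgoing messages)
            self.time += 1
            return self.time

        def receive(self, msg_time):  # merge in a received timestamp
            self.time = max(self.time, msg_time) + 1
            return self.time

    a, b = LamportClock(), LamportClock()
    t = a.tick()   # a sends a message stamped 1
    b.receive(t)   # b's clock jumps to 2: the receive is ordered after the send
    assert b.time > t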
We had hardware for time sync at Google 15 years ago actually, that's not a new thing. We had time sync hardware (via GPS) in the early Twitter datacenters as well, until it became clear that was impossible to support without getting roof access. =)
Accurate clocks in data centers predate Google. Telecom needed accurate clocks for their time-sliced fiber infrastructure. Same for cellular infrastructure.
...but for distributed databases specifically, you can use a different algorithm like Calvin[0], or Fauna's variation of it[1], that does not require external atomic clocks… but the CS point and the wealth of info in research papers (in distributed systems stuff) are solid
...but there is a lot of noise in those software papers, too - you are often disappointed by the fine print, so it helps to have good curators/thought-leaders [2] - we all should share names ;)
enjoying the discussion though - very timely if you ask me.
-L, author of [1] below.
[0] - The original Calvin paper - https://cs.yale.edu/homes/thomson/publications/calvin-sigmod...
[1] - How Fauna implements a variation of Calvin - https://fauna.com/blog/inside-faunas-distributed-transaction...
[2] - A great article about Calvin by Mohammad Roohitavaf - https://www.mydistributed.systems/2020/08/calvin.html?m=1#:~....
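For the intuition behind [0], as I understand it: agree on a global order for transactions first, then have every replica execute that log deterministically - no synchronized clocks needed for replicas to agree. A toy Python illustration (mine, not from the paper):

    # Replicas that deterministically apply the same agreed-upon log
    # reach identical state without any clock synchronization.
    def apply_log(log):
        state = {}
        for key, delta in log:  # each "transaction" here is just (key, increment)
            state[key] = state.get(key, 0) + delta
        return state

    log = [("x", 1), ("y", 2), ("x", 3)]  # order fixed by the sequencing layer
    replica1, replica2 = apply_log(log), apply_log(log)
    assert replica1 == replica2 == {"x": 4, "y": 2}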
I hear this from tech people, but HFT people are happily humming along with highly synchronized clocks (MiFID II requires clocks to be synchronized to within 100µs). I wouldn't say it's "easy", but apparently if you need it then you do it, and it's not that bad.
ChatGPT tells me that the time difference between the top of Mount Everest and sea level is in the nanosecond range - so completely dwarfed by network latency, and maybe doesn't matter?
Pure speculation on my behalf though.
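For what it's worth, a back-of-the-envelope check with the weak-field approximation (my own numbers, so treat this as a sanity check rather than gospel) agrees - the drift is tens of nanoseconds per day:

    # Gravitational time dilation between Everest's summit and sea level,
    # weak-field approximation: fractional rate difference ~ g*h/c^2.
    g = 9.81            # m/s^2
    h = 8849            # m, height of Everest
    c = 299_792_458     # m/s

    fractional = g * h / c**2       # ~9.7e-13
    per_day = fractional * 86_400   # seconds gained per day at the summit
    print(f"{per_day * 1e9:.0f} ns/day")  # ~83 ns/day: nanoseconds, as claimed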
I’ve built pretty much my entire career around this problem and it still feels evergreen. If you want a meaningful and rich career as a software engineer, distributed storage is a great area to focus on.
I'm a principal engineer in a FAANG-adjacent ecomm domain. I've been doing this for 11 years and I'm absolutely sick of it. There are no novel problems, just quarterly goals that compete with and dominate engineering timelines.
How did you get started, and what would you recommend for pivoting into this space?
I was hoping that the blog post would actually spell out examples of problems. Is it just me or have there been a lot of shorter blog posts on HN lately that are really no more than an introduction section rather than an actual full article?
If you are interested in the performance aspects of databases, I would recommend watching this great talk [0] from Alexey, a ClickHouse developer, where he talks about various aspects, like designing systems by first understanding the hardware's capabilities and the problem landscape.
[0]: https://www.youtube.com/watch?v=ZOZQCQEtrz8
I worked at a specialty database software vendor for almost 4 years, albeit on ML connectors. I recall some of the hardest challenges being figuring out each cloud vendor's poorly documented and rapidly changing/breaking marketplace launch mechanisms (usually built atop k8s using their own flavor: EKS, AKS, GKE, etc.).
There are so many interesting problems to solve. I just wish there were libraries or solutions that solved a lot of them for the least cost, so that I could build on some good foundations.
RocksDB is an example of that.
I am playing around with SIMD, multithreaded queues and barriers. (Not on the same problem)
I haven't read the DDIA book.
I used Michael Nielsen's consistent hashing code for distributing SQL database rows between shards (a rough sketch of the idea follows this list).
I have an eventually consistent protocol that is not linearizable.
I am currently investigating how to efficiently schedule system events, such as TCP ready-for-reading (EPOLLIN) or ready-for-writing (EPOLLOUT), rather than data events (see the second sketch below).
I want super flexible scheduling styles of control flow. I'm looking at barriers right now.
I am thinking about how to respond to events with low latency and across threads.
I'm playing with some coroutines in assembly by Marce Coll, and looking at algebraic effects.
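Since consistent hashing came up above, here's a rough Python sketch of the idea (not Nielsen's actual code - the ring layout and virtual-node count are my own simplifications):

    import bisect, hashlib

    # Consistent hashing ring: each shard gets many virtual points on a ring;
    # a key belongs to the first shard point at or after its hash, so adding
    # or removing a shard only remaps the keys near its points.
    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, shards, vnodes=100):
            self.points = sorted((h(f"{s}#{i}"), s)
                                 for s in shards for i in range(vnodes))
            self.hashes = [p for p, _ in self.points]

        def shard_for(self, key):
            i = bisect.bisect(self.hashes, h(key)) % len(self.points)
            return self.points[i][1]

    ring = Ring(["shard-a", "shard-b", "shard-c"])
    print(ring.shard_for("user:42"))  # deterministic shard placement for a row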
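And for the readiness-event scheduling point, a minimal sketch using Python's selectors module (epoll under the hood on Linux); the echo-server framing is just an assumed example:

    import selectors, socket

    # Readiness-based scheduling: register interest in EPOLLIN/EPOLLOUT-style
    # events and dispatch a callback when the kernel says the fd is ready,
    # instead of blocking on reads or polling for data.
    sel = selectors.DefaultSelector()

    def accept(server):
        conn, _ = server.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, echo)

    def echo(conn):
        data = conn.recv(1024)
        if data:
            conn.sendall(data)
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(("127.0.0.1", 9000))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:
        for key, _ in sel.select():  # blocks until some fd is ready
            key.data(key.fileobj)    # run the callback registered for it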
> Another example is figuring out the right tradeoffs between using local SSD disks and block-storage services (AWS EBS and others).
Local disks on AWS are not appropriate for long-term storage, because when an instance is stopped or fails, the data is lost. AWS also doesn't offer huge amounts of local storage.
Yeah, that is part of the tradeoff. Using an ephemeral SSD (for a database) means the database needs another means of making the data durable (replication, storing data in S3, etc.).
There are AWS instance types (I3en) with large and very fast SSDs (many times higher IOPS than EBS).
"Network partitioning can completely destroy mutual consistency in the worst case, and this fact has led to a certain amount of restrictiveness, vagueness, and even nervousness in past discussions, of how it may be handled"