2. It can be configured to store snapshots, and Raft logs other than the latest log, in S3. It cannot run as a stateless Kubernetes pod, because the latest log has to live on the local filesystem.
Although I see you can make a multi-region setup with multiple independent Kubernetes clusters and store logs in tmpfs (which is not 100% wrong from a theoretical standpoint), it is too risky to be practical.
3. Only the snapshots and the previous logs can be stored on S3, so PUT requests happen only on log rotation.
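To make that concrete, here is a rough sketch of the kind of Keeper disk configuration this implies. The endpoint and credentials are placeholders, and the exact setting names are from memory, so verify them against the ClickHouse Keeper docs before relying on this:

    <clickhouse>
      <storage_configuration>
        <disks>
          <!-- S3 disk for snapshots and rotated (non-latest) Raft logs -->
          <keeper_s3>
            <type>s3_plain</type>
            <endpoint>https://my-bucket.s3.amazonaws.com/keeper/</endpoint>
            <access_key_id>YOUR_ACCESS_KEY</access_key_id>
            <secret_access_key>YOUR_SECRET_KEY</secret_access_key>
          </keeper_s3>
          <!-- Local disk for the latest log, which must stay on the filesystem -->
          <keeper_local>
            <type>local</type>
            <path>/var/lib/clickhouse/coordination/</path>
          </keeper_local>
        </disks>
      </storage_configuration>
      <keeper_server>
        <log_storage_disk>keeper_s3</log_storage_disk>
        <latest_log_storage_disk>keeper_local</latest_log_storage_disk>
        <snapshot_storage_disk>keeper_s3</snapshot_storage_disk>
      </keeper_server>
    </clickhouse>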
If anyone has any questions, I'll do my best to get them answered.
(Disclaimer: I work at ClickHouse)
1. Yeah, we mention at the end of the post that the P99 produce latency is ~400ms. 2. MSK still charges you for networking to produce into the cluster and consume out of it if follower fetch is not properly configured. Also, you still have to more or less manage a Kafka cluster (hot spotting, partition rebalancing, etc.). In practice we think WarpStream will be much cheaper to use than MSK for almost all use cases, and significantly easier to manage.
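(For context on the follower fetch point above: that's KIP-392 rack-aware fetching, which lets consumers read from a replica in their own AZ instead of crossing AZs to the leader; produces still have to go to the leader. A minimal sketch of the relevant settings, with placeholder AZ names:

    # broker side (each broker advertises the AZ it runs in)
    broker.rack=us-east-1a
    replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

    # consumer side (must match the consumer's own AZ)
    client.rack=us-east-1a

Each consumer needs its own client.rack value matching its AZ for this to kick in.)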
If you have any questions about WarpStream, my co-founder (richieartoul) and I will be here to answer them.
2. If the '5-10x cheaper' is mostly due to cross-AZ savings, isn't that offered by AWS MSK too?
This article hits most of the nails exactly on the head, at least from the perspective of large corporate environments; I doubt any of these apply to small startups (and if they do to yours, something has probably gone very wrong). Nuclide handles these issues in fairly interesting ways:
* FB built their own incremental, caching compilers for several languages that Nuclide is aware of, so that typechecking and rebuilds are effectively instantaneous even across massive codebases.
* Nuclide integrates with FB's massive cross-repo search service, BigGrep, so that you can find references to anything.
* Nuclide's version control UI is phenomenal: it is aware of multiple repositories, knows which ones you have checked out, and integrates with their in-house version of Phabricator (a code repository/review tool similar to GitHub). I literally never learned to use Mercurial while I was there. I just used Nuclide.
However, there's one major difference between Nuclide and the vision that this article lays out: remote vs local code checkouts. Nuclide ran locally, but it used remote code checkouts, and your code ran on the remote, not on your local machine. I think the reasons and benefits were compelling:
* Massive repositories take up massive amounts of disk space.
* Massive repositories take massive amounts of time to download to your laptop, and if you're working in a large corporate environment, they also take massive amounts of time to download recent changes since there are many engineers shipping commits.
* If your remote machine gets hosed, you can spin up a new one quickly; in FB's case, it took seconds to get one of the "OnDemand" servers. If your local machine gets hosed... a trip to IT is not going to be as easy.
* If you run into issues, tools engineers can SSH directly into your instance and see what the problem is, fix it, and also ship preemptive fixes to everyone else. That would feel quite privacy-invasive for someone's local machine.
* Many employees enjoy being able to take their laptop with them so that they can work remotely e.g. to extend a trip to another country without taking more PTO. Laptops aren't great at running large distributed systems.
* Even some individual services (ignoring the rest of the constellation of services) are impractical to run on constrained devices like laptops, because they either operate on "big data" or require large amounts of RAM.
* Remote servers can integrate securely with other remote services more conveniently than laptops or desktops can.
* Having the entire codebase stored locally on a laptop or desktop is a relative security risk: if the code lives on a centrally managed server and a laptop gets stolen, you revoke the key. If it's on the laptop's disk, well, hope the thief didn't get the password? Or in FB's case, hope the thief isn't an APT/the NSA and hasn't backdoored the laptop's full-disk encryption? And it's not just laptop theft: a disgruntled employee could intentionally leak the code -- or in Google and Anthony Levandowski's case, sell it to a competitor through a shell company. If everything is stored centrally and local checkouts are extremely uncommon, you have a lot more room to automate defenses against that scenario.
Overall I'm a big fan of running your code on remote devservers rather than on local machines once you get to scale.