radicality · 4 years ago
I feel there is some slight misinformation in this article that might confuse people, give them the wrong impression of 'Local SSDs' vs. network-based ones and of the NVMe protocol, and make it seem like "local SSD == NVMe".

NVMe - 'Non-Volatile Memory Express'. This is a _protocol_ that storage devices like SSDs use to communicate with the outside world. It is orthogonal to whether the disk is attached to your motherboard via PCIe or lives somewhere else in the datacenter behind a different transport layer. For example, local NVMe SSDs use PCIe transport, but you can also have NVMe over TCP, NVMe over Fibre Channel, or more generally NVMe-over-Fabrics. Many of these can deliver latencies significantly lower than a millisecond.

As a concrete example, AWS `io2 Block Express` has latencies of ~0.25-0.5ms, though I'm not sure which technology it's using.

NVMe over Fabrics: https://www.techtarget.com/searchstorage/definition/NVMe-ove...

Many interesting presentations on the official NVMe website: https://nvmexpress.org/education/documents-and-videos/presen...

AWS: https://aws.amazon.com/blogs/storage/achieve-higher-database...

jon-wood · 4 years ago
Thanks for the clarification, I consider myself fairly well informed about server hardware and had no idea that there were transport methods for NVMe other than PCI Express.
PeterCorless · 4 years ago
There were a slew of new NVMe specs that came out last year, such as the separation of storage and transport. So now NVMe has the following transport protocols:

• PCIe Transport specification

• Fibre Channel Transport specification (NVMe-oF)

• RDMA Transport specification

• TCP Transport specification

Separation of storage from transport is a huge game changer. I'm really hoping NVMe over Fibre Channel takes off, but I'd expect people to see it first in on-prem deployments before it shows up at the cloud-hosting hyperscalers.

More on the 80,000-foot view of what's going on in the NVMe world is covered in the link below. There are also tech specs you can read over if you're interested in exactly how it works.

https://nvmexpress.org/nvm-express-announces-the-rearchitect...

[EDIT: Also, the GCP persistent disks are not NVMe-oF as far as we know. They seem to be iSCSI based off Colossus/D. see: https://news.ycombinator.com/item?id=21732387]

tyingq · 4 years ago
It seems like expected confusion. The protocol is well suited for a wide/fast path, so people associate it with something local that uses a lot of PCIe lanes, which is much faster/wider/lower-latency than your NIC.
shrubble · 4 years ago
But why then would the article mention that they had problems with reliability? That seems ... odd; what have you experienced in your work with these (assuming you have)?
themoonisachees · 4 years ago
These disks aren't typically meant for database server usage. They're fine for consumers and even things like webservers, but the amount of data Discord moves around daily means they are bound to make some disks fail pretty quickly.

The normal course of action here is usually a RAID array over all your NVMe disks, but since Google just migrates your VM to a machine that has good disks, doing that is useless.

Really this whole article is "we are going to keep using google cloud despite their storage options being unfit for our purpose and here's how".

esjeon · 4 years ago
This reminds me of mirroring an SSD array to an HDD! I believe this is something college kids experiment with, since many motherboards come with two NVMe slots. NVMe RAID 0 is too dangerous and RAID 1 is too expensive, but by pouring in a few hundred more bucks, one can gain marginal safety while enjoying the blazing-fast RAID 0.

The magic here is `--write-behind`/`--write-mostly`[1] in `mdadm`. I mean, that's the only method I can think of here. This is an old dark magic that prevents (though not entirely) reads from hitting a specific drive.

[1]: https://raid.wiki.kernel.org/index.php/Write-mostly
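A minimal sketch of what those flags look like in practice (device names are hypothetical, and note that `--write-behind` requires a write-intent bitmap):

```shell
# Sketch: mirror a fast NVMe device with a slow HDD. --write-mostly
# steers reads away from the HDD; --write-behind lets writes to it
# lag behind instead of stalling the array on every sync.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=4096 \
      /dev/nvme0n1 --write-mostly /dev/sda
```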

TBH, in general, I don't think it's a good option for databases. The slow drive does cause issues, for it is literally slow. The whole setup slows down when the write-behind queue is full, reading from the write-mostly device can be extremely slow thanks to all the pending writes, and the slow drive will wear out quicker under the sustained high load (though that last one shouldn't apply in this specific case).

So you mostly don't want a single HDD as your mirror. For proper mirroring, you need another array of HDDs, which ends up being a NAS with its own redundancy. That's a huge pain in the butt, but it's necessary for industrial-grade reliability, and it also lets you make your "slow drive" faster in the future.

In this specific case, it's pretty well played. They get 4 billion messages per day, which is roughly 46k per second. Assuming each write requires updating at least one 4 KiB block, the setup needs to sustain at least 185 MB/s, which is clearly beyond a single HDD for random writes. Google Persistent Disk is a kind of NAS anyway, so that aligns perfectly with the paragraph above.
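The back-of-envelope math checks out (assuming, as above, one 4 KiB block write per message):

```shell
# 4 billion messages/day, one 4 KiB block write each
msgs_per_sec=$(( 4000000000 / 86400 ))
bytes_per_sec=$(( msgs_per_sec * 4096 ))
echo "${msgs_per_sec} msgs/s, $(( bytes_per_sec / 1000000 )) MB/s"
# prints: 46296 msgs/s, 189 MB/s
```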

jcynix · 4 years ago
Spinning rust isn't all that bad. I get 250 MB/s while synchronously (i.e. fsync/fdatasync) writing 384 GB to an Ultrastar DC HC550 (inside a WD Elements 18TB enclosure) connected via USB 3.
bombcar · 4 years ago
Spinning rust can get well above 185MB/s on a single disk if you're sequential.
londons_explore · 4 years ago
And all random write workloads can be turned into sequential write workloads, either with clever design, or with something like logfs.
manv1 · 4 years ago
To me, this sounds like a GCP failure. The provider is supposed to provide you with options that you need. If you're going to build it yourself why bother with GCP?

It would be a fun exercise to reimplement Discord in AWS...or with FoundationDB.

In any case before revving the hardware I'd want to know how ScyllaDB actually is supposed to perform. I mean, their marketing drivel says this right on the main page: "Provides near-millisecond average latency and predictably low-single-digit P99 response times."

So why are they fucking with disks if ScyllaDB is so good? I mean, back in the day optimizing drive performance was like step 1. Inside tracks are faster, and make sure you don't saturate your SAS drives/Fiber Channel links by mistake. It's fun to do, but you could always get better performance by getting the software to not do dumb stuff. Seriously.

nemothekid · 4 years ago
>So why are they fucking with disks if ScyllaDB is so good?

The database layer isn't magic. The database can't give you low-single-digit P99 response times, if a single I/O request can stall for almost 2ms.

That said, I don't think AWS would fare any better here as the infrastructure issue is the same. Networked EBS drives on AWS are not going to be magically faster than networked PD drives on GCP. The bottleneck is the same, the length of the cable between the two hosts.

paulfurtado · 4 years ago
At a huge price, EBS can finally get you near-local-nvme performance. If you use an io2 drive attached to a sufficiently sized r5b instance (and I think a few other instance types), you can achieve 260,000 IOPS and 7,500 MB/s throughput.

But up until the last year or two, you couldn't get anywhere near that with EBS and I'm sure as hardware advances, EBS will once again lag and you'll need to come up with similar solutions to remedy this.

Also, I guess AWS would fight them a little less here: the lack of live migrations at least means that a local failed disk is a failed disk and you can keep using the others.

zxcvbn4038 · 4 years ago
I have this same conversation about AWS autoscaling all too frequently. It is a cost control mechanism, not a physics cheat. If you suddenly throw a tidal wave of traffic at a server, that traffic is going to queue and/or drop until there are more servers. If you saturate the network before the CPU (which is easy to do with nginx), or your event loop is too slow to accept the connections so they are dropped without being processed (easy to do in Node.js), then you might not scale at all.
kmod · 4 years ago
I don't think it's the cable length: 0.5ms is 150km in fiber or about 100km in copper. Cable length is important in HFT, where you are measuring fractions of microseconds.

It's really quite amazing to me that HFT reduces RPC latency by about three orders of magnitude, I feel like there are lessons from there that are not being transferred to the tech world.

paulfurtado · 4 years ago
Both GCP and AWS provide super fast and cheap, but fallible local storage. If running an HA database, the solution is to mitigate disk failures by clustering at the database level. I've never operated scylladb before, but it claims to support high-availability and various consistency levels so the normal way to deal with this problem is to use 3 servers with fast local storage and replace an entire server when the disk fails.
kodah · 4 years ago
A public cloud exists to serve majority usecases, not to provide you with all the options. High speed, low latency network block io is probably not a common need at this point in time.
pojzon · 4 years ago
It's a very common issue. Decreasing transaction execution times for any type of request is a main goal for any data processing platform.
gooeyblob · 4 years ago
I'm curious as to why with a system like Scylla (that I assume shares the same replication properties as Cassandra which my experience is based off of here) you can't just use the local SSDs and absorb the disk failures. If you space things out across AZs you wouldn't expect to lose quorum and can rebuild dead servers without issue. Is this to run things with a replication factor of 1 or something?

I've done this in past roles on AWS with their i3, etc. family with local attached storage and didn't use EBS.

thekozmo · 4 years ago
This is indeed what we (ScyllaDB) do, pretty much everywhere. It works great for 95% of our users. Discord wanted to add a level of guarantee since they observed a too high level of local disk failures.
gooeyblob · 4 years ago
Yikes! Wonder what's up with GCP in that regard.
legulere · 4 years ago
Basically they need to solve two problems (local SSDs not having all the needed features, and persistent disks having too-high latency) by building:

> essentially a write-through cache, with GCP's Local SSDs as the cache and Persistent Disks as the storage layer.

ahepp · 4 years ago
I found it worth noting that the cache is primarily implemented through Linux's built-in software RAID system, md: SSDs in RAID 0 (stripe), persistent disk in RAID 1 (mirror).
simcop2387 · 4 years ago
Not just a mirror, but a specific mirror setup where md does its best to read from only one of the mirror devices, unless it starts reporting read errors. This way they don't lose data, errors are handled gracefully, and they can still snapshot the high-latency disks for replication/migration tasks. The write-mostly feature wasn't something I was aware existed, but it sounds absolutely perfect for a lot of use cases.
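Something like the following `mdadm` sequence would implement that layering (device names are hypothetical; this is a sketch of the idea, not necessarily Discord's exact configuration):

```shell
# 1. Stripe the local NVMe SSDs into one fast device (RAID 0).
mdadm --create /dev/md0 --level=0 --raid-devices=2 \
      /dev/nvme0n1 /dev/nvme1n1

# 2. Mirror that stripe with the network persistent disk (RAID 1).
#    --write-mostly keeps reads on the NVMe stripe; --write-behind
#    (with a bitmap) keeps slow PD writes from stalling the array.
mdadm --create /dev/md1 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=1024 \
      /dev/md0 --write-mostly /dev/sdb
```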


shrubble · 4 years ago
Key sentence and a half:

'Discord runs most of its hardware in Google Cloud and they provide ready access to “Local SSDs” — NVMe based instance storage, which do have incredibly fast latency profiles. Unfortunately, in our testing, we ran into enough reliability issues'

why_only_15 · 4 years ago
Don't all physical SSDs have reliability issues? There's a good reason we replicate data across devices.
shrubble · 4 years ago
Well, NVMe drives shouldn't fail at a high rate, but if the local storage options aren't good enough, then yes, they have to build something different.
winrid · 4 years ago
An easy way I usually solve this with most MongoDB deployments is to have a couple of data-serving nodes per shard with local NVMe drives, plus a hidden secondary that just uses an EBS volume with point-in-time backups.

Your write load is already split up per-shard in this scenario, so you can horizontally scale out or increase IOPS of the EBS volumes to scale. And you can recover from the hidden secondaries if needed.

And no fancy code written! :)
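In MongoDB terms this is just replica-set configuration; a sketch (hostname is made up, and hidden members must carry `priority: 0`):

```shell
# Add an EBS-backed hidden secondary: it replicates all data but
# never serves application reads and can never become primary.
mongosh --eval 'rs.add({ host: "backup-1.internal:27017", priority: 0, hidden: true })'
```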

winrid · 4 years ago
(by the way, feel free to shoot me an email if you need help with Mongo performance stuff!)
t0mas88 · 4 years ago
Clever trick. Having dealt with very similar things using Cassandra, I'm curious how this setup will react to the failure of a local NVMe disk.

They say that GCP will kill the whole node, which is probably a good thing if you can be sure it does that quickly and consistently.

If it doesn't (or not fast enough) you'll have a slow node amongst faster ones, creating a big hotspot in your database. Cassandra doesn't work very well if that happens and in early versions I remember some cascading effects when a few nodes had slowdowns.

mikesun · 4 years ago
That's a good observation. We've spent a lot of time on our control plane, which handles the various RAID 1 failure modes; e.g., when a RAID 1 degrades due to a failed local SSD, we force-stop the node so that it doesn't continue to operate as a slow node. Wait for part 2! :)
TheGuyWhoCodes · 4 years ago
That's always a risk when using local drives: you need to rebuild when a node dies. But I guess they can over-provision to cover a single node failure in the cluster until the cache is warmed up.

Edit: Just wanted to add that because they are using Persistent Disks as the source of truth, and depending on the network bandwidth, it might not be that big of a problem to restore a node to a working state if reads use a quorum and RF >= 3.

Restoring a node from zero in case of disk failure will always be bad.

They could also have another caching layer on top of the cluster to further mitigate the latency issue until the node gets back to health and finishes all the hinted handoffs.

jhgg · 4 years ago
We have two ways of re-building a node under this setup.

We can either rebuild the node by simply wiping its disks and letting it stream data in from other replicas, or rebuild by simply re-syncing the pd-ssd to the NVMe.

Node failure is a regular occurrence; it isn't a "bad" thing, and it's something we intend to fully automate. Nodes should be able to fail and recover without anyone noticing.