nostrademons · 9 years ago
While I was at Google, someone asked one of the very early Googlers (I think it was Craig Silverstein, but it may've been Jeff Dean) what was the biggest mistake in their Google career, and they said "Not using ECC memory on early servers." If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.

It saved a few bucks in a time period when Google's hardware costs were rising rapidly, but the knock-on effects on system design cost much more than that in lost engineer time. Data integrity is one engineering constraint that should be pushed as low down in the stack as is reasonably possible, because as you move up the stack, the potential causes of corrupted data multiply exponentially.

sytelus · 9 years ago
Google has done extensive studies[1]. There is roughly a 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about. However, if you are in a data center with 100K machines, each with 8 DIMMs, you are looking at about 6K machines experiencing RAM errors each day. Now if data is being replicated, then these errors can propagate corrupted data in unpredictable, unexplainable ways even when there are no bugs in your code! For example, you might find your logs containing bad line items which get aggregated into a report showing bizarre numbers because 0x1 turned into 0x10000001. You can imagine that debugging this happening every day would be a huge nightmare, and developers would eventually end up inserting lots of asserts for data consistency all over the place. So ECC becomes important if you have a large-scale distributed system.

1: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

indolering · 9 years ago
That data set covers 2006-2009, and the RAM consisted of 1-4GB DDR2 DIMMs running at 400-800 MT/s. Back when 4GB was considered a beefy desktop, consumers could get away with a few bit-flips during the lifetime of the machine. Now my phone has that much RAM, and a beefy desktop has 16-32 GB of RAM running at 3000 MT/s.

It's time we start trading off the generous speed and capacity gains for some error correction.

pseudalopex · 9 years ago
That's a 3% per DIMM per year chance of at least one error. Most memory faults are persistent and cause errors until the DIMM is replaced. Also, the error rate was only that low for the smallest DDR2 DIMMs.
sitkack · 9 years ago
I have hit soft errors on every desktop machine that used ECC. Either I have bad luck, ECC causes the errors, or some third thing. I think ECC should be mandated for anything except toys and video players.
colanderman · 9 years ago
> There is roughly 3% chance of error in RAM per DIMM per year. […] with 100K machines each with 8 DIMM, you are looking at about 6K machines experiencing RAM errors each day.

Can you work out the math? I don't follow it. 3%×100K×8÷365=66 per day by my reasoning…
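(Spelling out that arithmetic, under the paper's assumed 3%-per-DIMM-per-year rate:)

```python
# Expected DIMMs seeing an error on a given day, assuming a uniform
# 3% per-DIMM-per-year rate across 100K machines x 8 DIMMs each.
rate_per_dimm_per_year = 0.03
dimms = 100_000 * 8
per_day = rate_per_dimm_per_year * dimms / 365
print(round(per_day))  # ~66 per day, nowhere near 6K
```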

loeg · 9 years ago
> There is roughly 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about.

How do you make that leap?

amelius · 9 years ago
This makes me wonder how banks deal with this issue.
lomnakkus · 9 years ago
> If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.

Details of this would be very interesting, but obviously I understand if you cannot provide such details due to NDAs, etc.

I mean, I can imagine a few mitigations (pervasive checksumming, etc), but ultimately there's very little you can actually do reliably if your memory is lying to you[1]. I can imagine that probabilistic programming would be an option, but it's hardly "mainstream" nor particularly performant :)

I'm also somewhat dismayed at the price premium that Intel are charging for basic ECC support. This is a case where AMD really is a no-brainer for commodity servers unless you're looking for single-CPU performance.

[1] Incidentally also true of humans.

paulsutter · 9 years ago
You need ECC /and/ pervasive checksumming. There are too many stages of processing where errors can occur, for example in disk controllers or on the network. The TCP checksum is a bit of a joke at 16 bits (a random corruption slips past it about 1 time in 65,536), and even the Ethernet CRC can fail; you need end-to-end checksums.

http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
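A minimal sketch of the 16-bit ones'-complement checksum TCP uses (in the style of RFC 1071) makes one blind spot concrete: the sum is unchanged if 16-bit words are reordered, so that whole class of corruptions passes undetected.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum over 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# Swapping two 16-bit words corrupts the data but not the checksum:
good = bytes([0xDE, 0xAD, 0xBE, 0xEF])
bad = bytes([0xBE, 0xEF, 0xDE, 0xAD])
assert internet_checksum(good) == internet_checksum(bad)
```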

derefr · 9 years ago
> ultimately there's very little you can actually do reliably if your memory is lying to you

1. Implement everything in terms of retry-able jobs; ensure that jobs fail when they hit checksum errors.

2. if you've got a bytecode-executing VM, extend it to compare its modules to stored checksums, just before it returns from them; and to throw an exception instead of returning if it finds a problem. (This is a lot like Microsoft's stack-integrity protection, but for notionally "read-only" sections rather than read-write sections.)

3. Treat all such checksum failures as a reason to immediately halt the hardware and schedule it for RAM replacement. Ensure that your job-system handles crashed nodes by rescheduling their jobs to other nodes. If possible, also undo the completion of any recently-completed jobs that ran on that node.

4. Run regular "memtest monkey" jobs on all nodes that attempt to trigger checksum failures. To get this to work well, either:

4a. ensure that jobs die often enough, and are scheduled onto nodes in random-enough orders, that no job ever "pins" a section of physical memory indefinitely;

4b. or, alternately, write your own kernel memory-page allocation strategy, to map physical memory pages at random instead of linearly. (Your TLBs will be very full!)

Mind you, steps 3 and 4 only matter to catch persistent bit-errors (i.e. failing RAM); one-time cosmic-ray errors can only really be caught by steps 1 and 2, and even then, only if they happen to affect memory that ends up checksummed.
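A toy sketch of step 1, with hypothetical names (`submit` and `run_job` are made up for illustration): jobs carry a checksum from the producer and fail, retryably, on a mismatch.

```python
import hashlib

class ChecksumError(Exception):
    """Persistent verification failure: halt the node, reschedule its jobs."""

def submit(payload: bytes):
    """Producer side: pair the payload with its checksum at creation time."""
    return payload, hashlib.sha256(payload).hexdigest()

def run_job(work, payload: bytes, expected: str, retries: int = 3):
    """Consumer side: verify input before doing the work. In a real cluster
    each retry would re-fetch the payload, likely into different physical
    pages or onto a different node entirely."""
    for _ in range(retries):
        if hashlib.sha256(payload).hexdigest() == expected:
            return work(payload)
    raise ChecksumError("payload failed verification on every attempt")

payload, digest = submit(b"log line 12345")
result = run_job(lambda p: p.upper(), payload, digest)  # b'LOG LINE 12345'
```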

mentat · 9 years ago
Pervasive checksumming is going to cost a lot of CPU and touch a lot of memory. The data could be right, the checksum wrong as well. ECC double bit errors are recognized and you can handle them how you'd like, including killing the affected process.
btian · 9 years ago
It was indeed Craig.
woliveirajr · 9 years ago
Given that cosmic radiation is one source of memory errors, shouldn't just better computer cases reduce memory errors?

Basically a tin-foil (or lead-foil) hat for my computer?


olavgg · 9 years ago
Can people here please stop posting that ZFS needs ECC memory? Every filesystem, whether FAT, NTFS, or EXT4, runs more safely with ECC memory. ZFS is actually one of the few that can still be safer if you don't run with ECC memory. Source: Matthew Ahrens himself: https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=...
kurlberg · 9 years ago
In an old discussion regarding ECC/ZFS (in particular, whether hitting bad RAM while scrubbing could corrupt more and more data), user XorNot kindly took a look at the ZFS source and wrote:

"In fact I'm looking at the RAID-Z code right now. This scenario would be literally impossible because the code keeps everything read from the disk in memory in separate buffers - i.e. reconstructed data and bad data do not occupy or reuse the same memory space, and are concurrently allocated. The parity data is itself checksummed, as ZFS assumes it might be reading bad parity by default."

His full comment can be found here:

https://news.ycombinator.com/item?id=8294434

lomnakkus · 9 years ago
Indeed. It's true that the data may be corrupted before hitting any disk[1], but once it has hit the disks (>1), it's extremely unlikely that you'll ever hit a similar bit error where it'll mistakenly choose the wrong disk block to recover from.

The main point of e.g. ZFS or Btrfs checksumming is that a) at least it isn't getting worse, and b) I can tell if it's getting worse.

[1] ... but if the bits were not generated by the machine that is actually saving them to disk, how do you know they weren't corrupted along the way? The number of people who religiously check PGP signatures/SHA256sums or whatever is minuscule.

derefr · 9 years ago
> The number of people who religiously check PGP signatures/SHA256sums or whatever is minuscule.

• If you transfer things around using BitTorrent, it'll ensure you always end up with a file that hashes correctly to the sum it originally had when the .torrent file was constructed.

• Many archive formats (zip, rar, and 7z, at least) contain checksums, and archival utilities validate those checksums during extraction, refusing to extract broken files. "Self-extracting archive" executables that use these formats inherit this property.

• Some common disk-image formats (dmg, wim) embed a checksum that checks the whole disk-image during mount, and will refuse to mount a bad one. (I believe you can then try to "repair" the disk image with your OS's disk-repair utility, if you have no other copies.)

• Web pages increasingly use Sub-Resource Integrity attributes on things like .css and .js files, protecting them (though not the page itself) from errors.

• ISO files don't embed checks, but all the common package formats (Windows .cab and .msi; Linux .deb and .rpm; macOS .pkg) on installer ISOs embed their own checksums and often signatures.

• git repos are 'protected' insofar as you won't be able to sync mis-hashed objects from a remote, so they won't spread.

Really, looking over all that, it's only 1. plain binary executables, and 2. "media files" (images, audio, video)—and only when retrieved over a "dumb" protocol, rather than a pre-baked-manifest protocol like BitTorrent or zsync—that are "risky" and in need of explicit checksum comparison.
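For those two risky categories, a manual check is only a few lines; a sketch that mirrors what the `sha256sum` CLI does:

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into memory,
    producing the same hex digest the `sha256sum` tool prints."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the published digest before trusting the file.
```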

ruleabidinguser · 9 years ago
What are you doing where you're actually checking checksums periodically and detecting when things get worse? That seems like a lot of work to set up.
tjoff · 9 years ago
No. ZFS is in much greater need for ECC than most other filesystems.

1. ZFS doesn't come with any disk repair tools, and the ones that exist are not nearly as capable as those for other filesystems (the ZFS motto is that it is too costly to repair filesystems, just recover from tape instead (here we can sense the intended audience of ZFS)). If the wrong bit gets flipped, your entire pool might be gone (you can of course spend months of your spare time debugging it yourself if you want to). This is not the case (to the same extent) for FAT, NTFS or EXT.

2. The more memory you use, the more likely you are to get hit. I'd argue that ZFS is quite a resource-heavy filesystem and is thus more likely to attract bit flips. This is similar to using an encrypted filesystem on an overclocked CPU. There is nothing inherently more risky about encrypting your filesystem with an overclocked CPU, but overclocking increases the risk of miscalculations, and enabling encryption increases the CPU usage when accessing the filesystem by several orders of magnitude. So, in practice, you quickly notice how filesystem data on encrypted drives gets corrupted, but not on regular drives, on an ever so slightly too overclocked machine.

So, if you care about your filesystem, then yes - saying that ZFS needs ECC is quite sensible. (if you care about your data you should have backups regardless)

cyberpunk · 9 years ago
Well; I don't think 'lack of repair tools' for ZFS is the reason that ECC is the suggested good practice; but I can sort of agree that recovering from a badly farked pool isn't fun having been down that rabbithole...

Regardless of ZFS, this is why we architect our storage to cope with such potential borkage (which has, incidentally, only happened to me _ONCE_ in ~8 years of ZFS in prod, had nothing to do with silent md corruption in RAM, and had everything to do with a nasty HW SAS HBA bug) -- if losing a single storage node/pool causes you problems then you're "doing it wrong" (sorry), and it makes no difference whether you're using XFS, ZFS, BTRFS or whatever else...

So... I'm not sure about the ECC stuff; to me the ECC question hardly matters, for the simple reason that any significant deployment is using ECC anyway. Even deploying a cheapo 10k pair of JBODs and a tiny 1U head to run NFS or something, you'll be unlikely to even have the option of non-ECC from whoever you're buying the kit from (Dell? HP? bla), right?

Yes, it might work without it. It might work better somehow with it.. What does it matter when even the cheap gear comes with it anyway?

I've built a reasonable slew of ZFS backed storage (well, a 10 PB prod or so anyway, nothing compared to what some of the folks who comment here have done) and besides some hardware compat issues if you're building storage this is currently your best option to back your objstore/dfs.

ZFS as the storage backend for your DC/Cloud? Good pick. ZFS as the 'local' FS's in your VMs? I wouldn't bother, unless you need some features (it works pretty well with docker, as it goes, but I prefer to run apps on plain ext backed by zvols instead of 'zfs-on-zfs')....

koffiezet · 9 years ago
ZFS is not the cheap option, it was never intended to be - so why bother skimping on ECC? If you're worried about ECC prices - ZFS is probably not for you.

It's not that you need ECC for ZFS, but when you're at a point where you're willing to throw money at a storage system where ZFS makes sense, the extra cost of ECC is minuscule. The most expensive hardware requirement of ZFS is that you need disks of the same size anyway, which means you're not just throwing a random amount of disks together, and if you want to expand, you need to add another full zpool, or replace disks one by one.

On my home NAS, the difference was about 120 EUR for 32GB (80 EUR/DIMM vs 50 EUR/DIMM), on a grand total of over 2500 EUR. One of the reasons for choosing ZFS was storage reliability, and then skimping out on ECC is imho a bit silly.

GuB-42 · 9 years ago
You can have disks of differing sizes with ZFS, though you are making things difficult for yourself so your point still stands.

However the cost of ECC is not negligible because you need your CPU and motherboard to support it.

The total budget for my NAS was 1000 EUR in 2013, 500 for the 5 3TB disks, 350 for the motherboard+CPU+8GB ECC RAM, and 150 for the case, PSU, system SSD, and accessories. In reality I salvaged 2 3TB disks, lowering the cost to 800. By using non-ECC I could have used a cheaper motherboard and CPU, in addition to cheaper RAM. In fact I would probably have used hardware from an older desktop PC. It would have been a 15-20% saving, or 45% if I take reuse into account. Not negligible.

My previous NAS, running linux soft-RAID, entirely made of salvaged parts except for some of the disks had a few corruption problems. One of them caused by a defective disk. ZFS would have caught it, so even on cheap systems, ZFS has its use.

I also had defective DRAM, rebuilds not going smoothly, etc... That system caused me too many scares, so I decided that the next system would be cheap but not too cheap as to endanger my sanity. I also got a proper backup solution.

gbrown_ · 9 years ago
I'm so glad to see this comment high up.
X86BSD · 9 years ago
It's not that it NEEDS it; it's that if you DON'T use it, you are introducing potential data errors into an otherwise checksummed data path, which would completely negate the rest of the path.
spullara · 9 years ago
I reproduced this by bit-squatting cloudfront.net after reading about it. So many memory errors!

http://dinaburg.org/bitsquatting.html

Loved the variety as well. Sometimes, though, on requests that came to me the Host header was correct!
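The mechanism is easy to reproduce: enumerate every domain one flipped bit away from the target and keep the ones that are still plausible hostnames (a sketch, not the article's actual code):

```python
def one_bit_neighbors(domain: str) -> set:
    """Domains reachable by flipping a single bit in one ASCII byte,
    filtered to characters that are still legal in a hostname."""
    legal = set("abcdefghijklmnopqrstuvwxyz0123456789-.")
    found = set()
    for i, ch in enumerate(domain):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in legal and flipped != ch:
                found.add(domain[:i] + flipped + domain[i + 1:])
    return found

# A single flipped bit turns 'c' (0x63) into 'a' (0x61), among many others:
assert "aloudfront.net" in one_bit_neighbors("cloudfront.net")
```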

codinghorror · 9 years ago
Wait so when someone typoes cnn.com as con.com, that is ipso facto a memory error? I guess I could see that if the characters are radically far apart on the keyboard? But doesn't a simpler explanation like "one person out of billions with Internet access typed the wrong thing" seem a lot more likely?
duskwuff · 9 years ago
Domains that are only used as CDNs, like cloudfront.net, are almost never typed into an address bar. Errors in those domain names are more frequently the result of a bit-flip.
NamTaf · 9 years ago
This Defcon 21 presentation from Robert Stucke did something similar with google's domains, plus other stuff. A great watch if you've got a spare 40 minutes!

https://www.youtube.com/watch?v=yQqWzHKDnTI

SolarNet · 9 years ago
Some macs do use ECC memory (specifically some of the most popular varieties, like the Mac Pro) which is probably why you saw lower numbers on bit squatted domains.
esun · 9 years ago
Fascinating article. Did you ever find a reason for the different ios results?
spullara · 9 years ago
Based on source IPs I would say that cheaper RAM = more error prone RAM.
veidr · 9 years ago
Yes. Everybody reading this should use ECC RAM, and non-ECC RAM should be called "error-propagating RAM".

Random bit flips aren't cool, and they happen regularly. Most computers that have ECC RAM can report whether errors happen. I see them at least once a year or so. For instance, here are 2 ECC-correctable memory errors that occurred just last month.

Cosmic rays? Fukushima phantom? Who knows. You'll never know why they happen (unless it's like a bad RAM module and they happen a lot), but if you don't rock ECC you will never know they happened at all. You'll be left guessing when, years later, some encrypted file can no longer decrypt, and all the backups show the same corruption...

[1]: https://www.dropbox.com/s/zndvy3nkv1jipri/2017-03-20%20FUCK%...

[2]: https://www.dropbox.com/s/6yeoedc7ajzq4u9/2017-03-20%20FUCK%...

jandrese · 9 years ago
I remember the one time I bought ECC memory, for a PII-400. It was only 512MB or so I think, but in the 12 years that server ran I saw a grand total of 1 corrected error in the logs. Given how much of a premium that ECC memory was it felt like a waste.
blackflame7000 · 9 years ago
Nice file names, gonna have to start naming my bug reports similarly
ReligiousFlames · 9 years ago
An old article from DJB worth perusal: http://cr.yp.to/hardware/ecc.html

It's also worth noting that not all ECC (SECDED) is created equal: ChipKill™ and similar schemes might not survive physical damage (a short across the data bus is likely), but recovering from a single malfunctioning chip producing or experiencing a higher hard-error rate is possible.

Also, it'd be really cool if some shop à la Backblaze blogged about large-scale monitoring of soft and hard RAM errors across chip/module models (plus motherboards & CPUs). Without collecting and revealing years of data from real use, the conversation devolves into opinion and conjecture.

Finally, not all use cases can benefit from ECC (e.g. Angry Birds), but there are some obvious/non-obvious ones that can (e.g. DNS bitsquatting enabled by non-ECC routers, or processing bank transactions).

ReligiousFlames · 9 years ago
PS: Random-crazy thought... it's curious, with the reduction of costs via Moore's law improvements, that there aren't yet formally-verified, zero-knowledge systems which can end-to-end prove they performed computation/real-world side effects and/or continue to safely store data. Why blindly trust anyone or any company with data that can be seized, lost or misused when distributed computation, communication and storage can be E2E with only limited participants knowing operations / plaintext? Perhaps: homomorphic encryption, a blockchain-like ledger or proof-of-work, and periodic, authenticated hash challenge queries. Mix in relaying and other idle phony traffic to make triangulation more difficult. I think in order to assure sufficient distributed system resources are made available, μpayments à la AWS, but just covering costs, would make it possible to have a persistent, anonymous computation and storage collective that would survive outages, FBI raids, single nodes going offline, etc.
indolering · 9 years ago
Storage, [yes](https://storj.io/). Computation ... sure if you don't mind the server viewing the contents of your computation and can verify the results. Sadly, fully homomorphic systems incur waaaay too much overhead so you are constrained in what you can do (i.e. specialized DBs, zkSNARKs, etc).

Then, of course, there is the problem of network latency and bandwidth costs vs just keeping it all on one datacenter.

nickpsecurity · 9 years ago
It wasn't necessary because we've already had systems whose hardware and/or software reliability reached decades between events of unplanned downtime.

https://lobste.rs/s/jea4ms/paranoid_programming_techniques_f...

http://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf

http://h71000.www7.hp.com/openvms/whitepapers/high_avail.htm...

http://www.enterprisefeatures.com/why-are-iseries-system-i-a...

Now I'm not including all the anonymous, zero-knowledge stuff since the market won't buy that. All kinds of costs come with it that they don't want. Besides, most consumers and enterprises love products with lots of surveillance built in. ;)

mjevans · 9 years ago
A better question is why /shouldn't/ you use ECC memory?

Generally the answer is any context where you legitimately do NOT care about your data at all but still care about costs. This predominantly devolves into consumption-only gaming systems.

In all other cases everyone would be better served (in the long run) by buying ECC RAM.

Viper007Bond · 9 years ago
My main issue is that it isn't just a choice between ECC memory and not, but I'd also need a different motherboard and processor, right?
blackflame7000 · 9 years ago
A common network topology is to have a load balancer distribute load to a number of cheap HTTP servers which internally connect to a centralized and powerful database server. In this case only the database server really needs ECC RAM. The system is designed to be fault-tolerant to any individual HTTP server node, so the increased cost vs. the problem it solves doesn't make sense.

I guess you could argue that a random bit flip could somehow make an HTTP server vulnerable and able to compromise the network, but that risk is impossibly small. If we take IBM's estimate that bit flips occur at an approximate rate of 3.7 × 10^-9 errors per byte per month and multiply it by the number of bytes in the system, you can see that the odds of randomly corrupting a byte in memory that triggers a vulnerability are too small.
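Rough arithmetic with the cited rate (taking 3.7 × 10^-9 errors per byte per month at face value, and assuming a hypothetical 8 GB web node with a 1 KB security-critical region; both sizes are made-up for illustration):

```python
rate = 3.7e-9       # errors per byte per month (cited figure)
ram = 8 * 2**30     # assumed 8 GB HTTP node
critical = 1024     # assumed security-critical bytes

errors_per_month = rate * ram                      # ~32 flips/month in RAM
critical_hits = errors_per_month * critical / ram  # ~3.8e-6 per month
```

The point isn't that flips are rare (at ~32 a month on this node, they aren't); it's that the chance of one landing on an exploitable byte is tiny.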

fulafel · 9 years ago
What about memory-error corrupted application data (or application logic) where the corruption occurred on load balancers or web application servers? There's more to data integrity than security holes.
ori_b · 9 years ago
> In this case only the database server really needs ECC ram.

That's only true if the database is read only. Otherwise, you will still insert corrupt data into it.

lucb1e · 9 years ago
This article is gold in so many ways. It contains interesting bits of information on ECC, company history that I didn't know (Sun's and Google's namely), filesystem reliability (I never knew!), the physics of RAM (50 electrons per capacitor)...

It's a must read, even if only to get you thinking about some of these things.

VA3FXP · 9 years ago
Depends on what you are doing.

* ZFS storage servers: Hell yes
* High-value data in my DB: Hell yes
* Email server: Nope
* Super cool gaming rig: Nope
* Cluster: Hell yes

General office workstation: maybe.

I don't have the budget for 20 redundant copies. I do have the budget for slightly more expensive RAM. Especially on my ZFS storage arrays.

ECC memory is like insurance: you hope you never need it. One real downside I have found is finding out _when_ that memory correction has saved your ass. RAID arrays can alert you when a disk is dead. SMART mostly tells you when disks are failing. But I haven't found a reliable tool to notify me when I am getting ECC errors/corrections.
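On Linux the EDAC driver does expose correctable/uncorrectable counters in sysfs, so a small poller can alert on changes (the paths are the standard EDAC layout; whether they exist depends on the platform and driver):

```python
import glob
import os

def edac_counts() -> dict:
    """Read per-memory-controller error counters from the Linux EDAC
    sysfs tree; returns an empty dict on machines without EDAC support."""
    counts = {}
    for path in glob.glob("/sys/devices/system/edac/mc/mc*/[cu]e_count"):
        mc = os.path.basename(os.path.dirname(path))  # e.g. 'mc0'
        kind = os.path.basename(path)                 # ce_count or ue_count
        with open(path) as f:
            counts[f"{mc}/{kind}"] = int(f.read().strip())
    return counts

# Poll from cron/monitoring and alert whenever any counter increases.
```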

mikeash · 9 years ago
I agree on gaming, but e-mail often contains important information that I wouldn't want to suffer from random corruption.
muro · 9 years ago
I don't understand why anyone would run their own email server. Cloud offerings work so well and are cheap.
toast0 · 9 years ago
You should be getting MCA (machine check architecture?) notifications in syslog/dmesg if there are ECC correctable errors, and an MCE (machine check exception) on the console for uncorrectable error, based on my experience with SuperMicro xeon servers running FreeBSD. A lot of our servers see a few correctable errors once in a while, and it doesn't affect the usability of the system; but sometimes the number of correctable errors is very high and the system is very sluggish.
VA3FXP · 9 years ago
Thanks!
rocqua · 9 years ago
There is a hidden cost of ECC with regards to the chipset. None of the cheap chipsets support it, so on any home build, it's going to be expensive.
eightysixfour · 9 years ago
Fortunately the new AMD Ryzen processors support ECC in the memory controller, unfortunately none of the boards seem to be testing/certifying it yet and the UEFI on a lot of the boards is a mess right now.

Hopefully more consumer boards support/certify it since it is already there on the memory controller.

binarycrusader · 9 years ago
Not true with Ryzen, as long as you find unregistered ECC acceptable.

Somewhat not true with Intel, as some of the lower end Xeons now support it.

dom0 · 9 years ago
No chipset has handled ECC for quite a while (the memory controller lives on the CPU these days): flipping some configuration bits depending on the chipset used is purely an Intel money-extraction engine (Intel ME technology®©™).

Server / workstation class boards normally all do support ECC, though, so no real issue in practice.

dogma1138 · 9 years ago
There are cheap chipsets that support it: the entry-level workstation ones from Intel. You can get a motherboard with ECC support for $65; you don't need to go to X99.

I have a few home storage servers running on the low end Pentiums with ECC support on these.

mafro · 9 years ago
I just last week bought an ASRock C236 WSI [1] for £170. 8GB of ECC RAM was £80. Granted that five years ago I needed to pay more than twice that amount - so I skipped the ECC :p

[1] http://asrockrack.com/general/productdetail.asp?Model=C236%2...

xorblurb · 9 years ago
If you use your computer in a way that makes you think about the potential benefit of ECC, the price you are likely to target for a rig that fits your needs is very probably high enough to get some nice ECC...
greenshackle2 · 9 years ago
Some older AMD desktop chips support ECC.

I built a home NAS from an old board and Phenom II 545 CPU I had lying around, fortuitously they happen to support ECC. DDR2 unregistered ECC ram was a bit of a pain to find though.

awqrre · 9 years ago
Not only cost, it can also be harder to find a motherboard with ECC support and with all the components/inputs/outputs that you would want in a desktop computer.