Filesystem error handling

One trope I frequently encounter as a FreeBSD professional is "Linux is used by so many people, it doesn't have vast and sweeping bugs anymore": the many eyes fallacy. In reality, you get the bystander effect: everyone wants RedHat or IBM to do all the hard work while they reap all the rewards.

I find my field of systems pretty interesting: there's so much to do and not a lot of people rushing in to do it as the heavy hitters retire or move to entrepreneurship etc. It's probably much cheaper to harden a *BSD or Illumos system in these domains, and much easier to get the results integrated into mainline. The financial rewards and fame are much greater for doing it on Linux, though.

Bromskloss · 8 years ago

> the many eyes fallacy

What do you mean by this? Do more eyes not find and point out more problems? Is the effect offset by some counteracting mechanism?

wrs · 8 years ago

The many eyes idea postulates many eyes looking at the code, not many eyes finding bugs. (Proprietary software has as many or more users finding bugs as open source!) In reality, the number of developers actually reading code to find bugs may not be significantly larger for complex open source like Linux, particularly in esoteric areas like filesystems or device drivers.

kev009 · 8 years ago

https://en.wikipedia.org/wiki/Linus%27s_Law

The results of "many eyes" seem to be basically moot (i.e. certainly not a 10x magnitude and more likely well below 2x) in the face of the cultural mores of a project. For example, the Linux kernel treats userland ABI compatibility quite seriously and there haven't been many faults there. OTOH security and KPI stability are not taken seriously, and are in constant flux. I can dig up papers on the former if needed, but it is the most glaring failure of "many eyes".

What's particularly noteworthy from my perspective is the bandwagon effect toward one kernel is causing the thing that is supposed to be the ultimate best that keeps getting better (Linux kernel) to lose focus, investment (new talent and money), maybe even quality due to decreased competition. Systems software is starting to look a lot more like Electrical Engineering as a field than the greater software ecosystem. Mostly done by large, central organizations. That makes me sad.

acdha · 8 years ago

One big problem is how many users actually translate into extra eyes contributing to the codebase. Linux has a huge number of users, but how many actually do more than find a workaround for a bug, much less contribute a patch back?

It definitely happens but in my experience a lot of people think that’s too much work and assume someone else will do it even if they don’t. It’s been far more common for someone to try hoping it doesn’t happen again, disabling features blindly, etc.

aw1621107 · 8 years ago

I think it's more along the lines of "Eh, I probably don't need to look, since someone else is surely doing that".

ericflo · 8 years ago

There's a psychological principle called Diffusion of Responsibility that I think applies here. But this is my rusty memory from undergrad around a decade ago.

mannykannot · 8 years ago

The fallacy is in the assumption that there actually are many eyes taking a critical look - e.g. at the level demonstrated here.

When it comes to security, there may be more eyes, in which case the fallacy is to assume that the preponderance of them are well-intentioned.

Here we have empirical evidence of problems being overlooked, yet you would prefer to put faith in a comforting aphorism?

> It’s a common conception that SSDs are less likely to return bad data than rotational disks, but when Google studied this across their drives, they found:

>> The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work [1] on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.

Unfortunately "return bad data" is slightly ambiguous. IIRC the failures reported are ones where the drive itself claims to have found invalid blocks through its own error detection algorithms. The rate at which drives "return bad data" is likely only to be discovered by a well-controlled test that writes specific data, or data patterns, to specific locations, then checks whether that data was preserved. This could tell us about data that was corrupted and escaped detection. Google's aggregate disk usage stats are a dump of the devices and their service activity under application-specific load, not the well-controlled test required to detect the "returned bad data" symptom.

It's possible that you could scour application logs and try to attribute application failures to corrupt data from a disk, but it would be difficult to isolate.
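
A minimal sketch of the kind of controlled test described above, in Python: write block-specific patterns, then read them back and flag any block that returns without an I/O error but with the wrong content. The block size, the SHA-256-derived pattern, and the function names are assumptions made for illustration. A real test would also have to bypass the page cache (for example with direct I/O) so that reads actually reach the device.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size for this sketch


def pattern_for(block_index):
    """Deterministic, block-specific data so altered or misplaced blocks are detectable."""
    seed = hashlib.sha256(block_index.to_bytes(8, "little")).digest()  # 32 bytes
    return seed * (BLOCK_SIZE // len(seed))


def write_patterns(path, num_blocks):
    """Write one known pattern per block and flush it out of the page cache."""
    with open(path, "wb") as f:
        for i in range(num_blocks):
            f.write(pattern_for(i))
        f.flush()
        os.fsync(f.fileno())


def find_silent_corruption(path, num_blocks):
    """Return indices of blocks that read back successfully but with wrong content."""
    bad = []
    with open(path, "rb") as f:
        for i in range(num_blocks):
            if f.read(BLOCK_SIZE) != pattern_for(i):
                bad.append(i)
    return bad
```

Any index returned by `find_silent_corruption` is a block the storage stack handed back as good data even though it differs from what was written, which is the case the drive-reported error counts cannot show.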

kabdib · 8 years ago

We see a lot of SSD errors that turn out to be problems with RAID controllers. The lost drive rate dropped quite a bit when we started using OS-level RAID functionality rather than whatever the heck is in the firmware of Promise / LSI / Adaptec / etc. cards, and we were able to return a bunch of drives to service after we started using those cards in a simple pass through mode.

Raise your hand if you've lost data that was RAIDed eight ways to Sunday because of a RAID controller failure that turned out to be unrecoverable because of crappy decisions in the firmware, or rebuild times equal to the number of days it took to order, rack and configure a competitor's product.

Sinjo · 8 years ago

Or any of the layers they run above the block device could be designed to detect it, or even correct it.

wyldfire · 8 years ago

Some of them do precisely that. btrfs is designed to correct small errors introduced by the layers below it (this concept is tested in the article).
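
To illustrate the general idea of a layer above the block device detecting and correcting corruption, here is a toy Python sketch: keep a checksum computed at write time plus a redundant copy, and on read return a copy that still matches the checksum while repairing the one that does not. This is not how btrfs is implemented; the in-memory `dict` layout, the function names, and the use of `zlib.crc32` are all assumptions for the sake of the example.

```python
import zlib


def store(data):
    """Keep two copies of a block plus a checksum computed at write time."""
    return {"copies": [bytearray(data), bytearray(data)], "crc": zlib.crc32(data)}


def read_with_repair(record):
    """Return the first copy that matches the checksum and repair the others from it."""
    good = None
    for copy in record["copies"]:
        if zlib.crc32(bytes(copy)) == record["crc"]:
            good = bytes(copy)
            break
    if good is None:
        # Detection still works here, but there is nothing left to correct from.
        raise IOError("all copies failed checksum verification")
    for i, copy in enumerate(record["copies"]):
        if bytes(copy) != good:
            record["copies"][i] = bytearray(good)
    return good
```

The checksum alone only gives detection; correction needs the second copy, which is why redundancy and checksumming tend to be discussed together.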

trishume · 8 years ago

This is correct. I checked the claim against the paper, and the paper does seem to be talking about errors that were caught by the SSD itself: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...

It could have been that the paper was about errors detected by checksums at the level of Google's distributed storage engine, but that's not the case.

This means that Apple could in fact be using entirely standard SSDs while being correct in their claim that they don't need checksums. Their SSDs may sometimes bubble "uncorrectable errors" up to them, but like the article says, redundancy doesn't help you avoid those on SSDs.

_kp6z · 8 years ago

magnat · 8 years ago

> This could happen if there’s disk corruption, a transient read failure, or a transient write failure caused bad data to be written.

Wouldn't transient I/O error get corrected by block driver itself by repeating the request a few times? Does a filesystem driver assume any error reported by storage device is permanent?

> The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test.

Are there any tests on how filesystems deal with metadata corruption or I/O error? Apart from NTFS "self-healing" introduced in Windows 7/2008, do modern filesystem drivers attempt to correct broken metadata completely on the kernel side, or just give up to avoid making things worse?

binarycrusader · 8 years ago

I believe apfs (apple's new filesystem), btrfs, and zfs all checksum metadata and will attempt to correct it.

tscs37 · 8 years ago

I recommend the OP article since it specifically mentions apfs.

asveikau · 8 years ago

> Wouldn't transient I/O error get corrected by block driver itself by repeating the request a few times?

I am surprised at the seeming assumption here that errors are always transient and you shouldn't worry because the retry will fix it. Sure that probably won't hurt, but it won't address the case where retries fail.
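
A rough Python sketch of the point being made: a bounded retry loop can paper over a transient error, but the caller still has to handle the case where every attempt fails. This does not reflect what any particular kernel or block driver does; the function name, the retry count, the delay, and the injectable `pread` parameter are assumptions for illustration.

```python
import errno
import os
import time


def read_with_retries(fd, length, offset, attempts=3, delay=0.01, pread=os.pread):
    """Retry reads that fail with EIO, then surface the error if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return pread(fd, length, offset)
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise  # not an I/O error, so retrying is not expected to help
            last_error = exc
            time.sleep(delay)
    # Retries exhausted: the error may be permanent and must be handled by the caller.
    raise last_error
```

The final `raise` is the part that matters for the discussion: code that assumes the retry always succeeds has no path for it.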

the8472 · 8 years ago

Depends on the OS. According to this comment on the ZoL GitHub [0], Solaris passes through errors from the hardware while Linux retries.

[0] https://github.com/zfsonlinux/zfs/issues/1256#issuecomment-1...

snvzz · 8 years ago

While neat, it's limited to Linux. It'd be nice to see how the BSDs behave, on their filesystems (such as HAMMER2). ZFS was also notably absent.

X86BSD · 8 years ago

As usual, another example of the Linux echo chamber. It's just one constant inward gaze with that crowd. They learn nothing from anyone, only trial and error.

I don't know that I'd agree. As much as llvm+clang inspired gcc to improve, ZFS seems like a big part of the inspiration for btrfs. I'm not sure that it hits the whole ZFS feature set, but it's probably a step in that direction.

pjmlp · 8 years ago

Including the fragmentation experience of the UNIX wars glory days.

blattimwind · 8 years ago

Something to keep in mind is that by default writes go through the page cache. By the time I/O is initiated for a write(2), the application process can have exited.

simcop2387 · 8 years ago

This is why calling fsync(fd) before closing the file and exiting is a good idea if you need that kind of error to be handled. You should get it as a return of fsync if it happens after the write.
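
A small Python sketch of that pattern, assuming a POSIX-like system: write the data, call `os.fsync` on the descriptor, and only then close it, so that a write-back error has a chance to be reported to the process. The function name and file mode are illustrative choices, not taken from the article.

```python
import os


def write_durably(path, data):
    """Write data and fsync before closing so write-back errors can reach the caller."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a short write, so loop
            view = view[written:]
        os.fsync(fd)  # raises OSError if flushing to the device fails
    finally:
        os.close(fd)
```

If `os.fsync` raises, the caller knows the data may not be on disk; without the call, the process could exit before the I/O is even attempted.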

O_[D]SYNC is better than a separate call to fsync, since it is not supposed to suffer from the race condition inherent to fsync. Arguably pedantic.
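
For comparison, a sketch of the open-flag approach in Python, again assuming a POSIX system. With `O_DSYNC` each `write` is meant to return only after the data has been handed to the device, so an error would show up on the write itself rather than on a later `fsync`. The fallback to `O_SYNC` where `O_DSYNC` is not exposed, and the function name, are assumptions of this example.

```python
import os

# Prefer O_DSYNC; fall back to O_SYNC if this platform's os module lacks it (assumption).
SYNC_FLAG = getattr(os, "O_DSYNC", os.O_SYNC)


def write_synchronously(path, data):
    """Open with a synchronous-write flag so each write reports its own I/O errors."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | SYNC_FLAG
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # returns after the data is written through
            view = view[written:]
    finally:
        os.close(fd)
```

The trade-off is that every write pays the synchronous cost, whereas a single `fsync` at the end lets earlier writes be batched.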

amelius · 8 years ago

Does the article do this?

mjw1007 · 8 years ago

I can't see anywhere in the article where it says which version of linux this was tested on (beyond "2017"), which is a shame.

Is this the state of affairs after the improvements described at https://lwn.net/Articles/724307/ have gone in?

mhei · 8 years ago

second to last paragraph:

> All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines.

That wasn't there half an hour ago :-).

That LWN article was from June 2017 and 4.10 was released in February, so I think that means these tests predate any of the changes it discusses.

Upvoter33 · 8 years ago

Pretty cool to see an update on the old analysis, and nice to hear that file systems have improved!
