Working with Files Is Hard (2019)

continuational · a year ago

> Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.

> When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LBDM, etc., know more about filesystems than the vast majority programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why the file API so hard to use that even experts make mistakes?

Retr0id · a year ago

> why the file API so hard to use that even experts make mistakes?

I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.

huntaub · a year ago

I take a different view on this. IMO the tricks that existing file systems play to get more performance (specifically around ordering and atomicity) make it extra hard for developers to reason about. Obviously, you can't do anything about fsync dropping error codes, but some of these failure modes just aren't possible over file systems like NFS due to protocol semantics.

IgorPartola · a year ago

Not only that, but the POSIX file API also assumes that NFS is a thing but NFS breaks half the important guarantees of a file system. I don’t know if it’s a baby and bath water situation, but NFS just seems like a whole bunch of problems. It’s like having eval in a programming language.

__loam · a year ago

POSIX is also so old and essential that it's hard to imagine an alternative.

timewizard · a year ago

> POSIX fs APIs and associated semantics

Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.

dkarl · a year ago

> why the file API so hard to use that even experts make mistakes?

Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.

(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)

ncruces · a year ago

POSIX file locking is clearly modeled around whatever was simplest to implement, although it makes no sense at all.

trinix912 · a year ago

> Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.

What about the Windows API? Windows is a pretty successful OS with a less leaky FS abstraction. I know it's a totally different deal than POSIX (files can't be devices etc), the FS function calls require a seemingly absurd number of arguments, but it does seem safer and clearer what's going to happen.

thfuran · a year ago

Why does that seem more likely than file system API simply not having been a major factor in the success of failure of OSes?

kccqzy · a year ago

By the way, LMDB's main developer Howard Chu responded to the paper. He said,

> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.

So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.

This is analogous to the case where someone complains some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.

Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes

> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.

yuboyt · a year ago

This assumption was wrong for Intel Optane memory. Power loss could cut the data stream anywhere in the middle. (Note: the DIMM nonvolatile memory version)

senderista · a year ago

This is called “Atomic Write Unit Power Failure” (AWUPF).

Joker_vD · a year ago

> the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.

Yeah, sounds about right about quite a lot of the C programmers except for the "they commit to checking this in the future" part. I've responses like "well, don't upgrade your compiler; I'm gonna put 'Clang >= 9.0 is unsupported' in the README as a fix".

eviks · a year ago

> why the file API so hard to use that even experts make mistakes?

Because it was poorly designed, and there is a high resistance to change, so those design mistakes from decades ago continue to bite

liontwist · a year ago

Something this misses is that all programs make assumptions for example - “my process is the only one writing this file because it created it”

Evaluating correctness without that consideration is too high of a bar.

Safety and correctness cannot be “impossible to misuse”

nickelpro · a year ago

And yet all of these systems basically work for day-to-day operations, and fail only under obscure error conditions.

It is totally acceptable for applications to say "I do not support X conditions". Swap out the file half way through a read? Sorry don't support that. Remove power to the storage devise in the middle of a sync operation? Sorry don't support that.

For vital applications, for example databases, this is a known problem and risks of the API are accounted for. Other applications don't have nearly that level of risk associated with them. My music tagging app doesn't need to be resistant to the SSD being struck by lightning.

It is perfectly acceptable to design APIs for 95% of use cases and leave extremely difficult leaks to be solved by the small number of practitioners that really need to solve those leaks.

belter · a year ago

"PostgreSQL vs. fsync - How is it possible that PostgreSQL used fsync incorrectly for 20 years" - https://youtu.be/1VWIGBQLtxo

praptak · a year ago

Ext4 actually special-handles the rename trick so that it works even if it should not:

"If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and [basically save your ass]"[0]

[0]https://docs.kernel.org/admin-guide/ext4.html

Retr0id · a year ago

> they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug.

This is why whenever I need to persist any kind of state to disk, SQLite is the first tool I reach for. Filesystem APIs are scary, but SQLite is well-behaved.

Of course, it doesn't always make sense to do that, like the dropbox use case.

nodamage · a year ago

Before becoming too overconfident in SQLite note that Rebello et al. (https://ramalagappan.github.io/pdfs/papers/cuttlefs.pdf) tested SQLite (along with Redis, LMDB, LevelDB, and PostgreSQL) using a proxy file system to simulate fsync errors and found that none of them handled all failure conditions safely.

In practice I believe I've seen SQLite databases corrupted due to what I suspect are two main causes:

1. The device powering off during the middle of a write, and

2. The device running out of space during the middle of a write.

justin66 · a year ago

I remembered Howard Chu commenting on that paper...

https://lists.openldap.org/hyperkitty/list/openldap-devel@op...

I'm pretty sure that's not where I originally saw his comments. I remember his criticisms being a little more pointed. Although I guess "This is a bunch of academic speculation, with a total absence of real world modeling to validate the failure scenarios they presented" is pretty pointed.

ablob · a year ago

I believe it is impossible to prevent dataloss if the device powers off during a write. The point about corruption still stands and appears to be used correctly from what I skimmed in the paper. Nice reference.

ziddoap · a year ago

>SQLite is the first tool I reach for.

Hopefully in whichever particular mode is referenced!

Retr0id · a year ago

WAL mode, yes!

eatonphil · a year ago

Do you turn on SQLite checksumming or how do you feel comfortable that data on disk stays keeps integrity?

edgarvaldes · a year ago

As per HN headlines, files are hard, git is hard, regex is hard, time zones are hard, money as data type is hard, hiring is hard, people is hard.

I wonder what is easy.

paulddraper · a year ago

Complaining :)

D-Coder · a year ago

Selection error. The stuff that always works doesn't get posted here.

ssivark · a year ago

To reuse another HN headline, all this is probably because no one really cares X-)

gavinhoward · a year ago

I wonder if, in the Pillai paper, I wonder if they tested the SQLite Rollback option with the default synchronous [1] (`NORMAL`, I believe) or with `EXTRA`. I'm thinking that it was probably the default.

I kinda think, and I could be wrong, that SQLite rollback would not have any vulnerabilities with `synchronous=EXTRA` (and `fullfsync=F_FULLFSYNC` on macOS [2]).

[1]: https://www.sqlite.org/pragma.html#pragma_synchronous

[2]: https://www.sqlite.org/pragma.html#pragma_fullfsync

wruza · a year ago

No mention on ntfs and windows keywords in the article, for those interested.

pjdesno · a year ago

Although the conference this was presented at is platform-agnostic, the author is an expert on Linux, and the motivation for the talk is Linux-specific. (Dropbox dropping support for non-ext4 file systems)

The post supports its points with extensive references to prior research - research which hasn't been done in the Microsoft environment. For various reasons (NDAs, etc.) it's likely that no such research will ever be published, either. Basically it's impossible to write a post this detailed about safety issues in Microsoft file systems unless you work there. If you did, it would still take you a year or two of full-time work to do the background stuff, and when you finished, marketing and/or legal wouldn't let you actually tell anyone about it.

wmf · a year ago

Universities can get Windows source code under NDA and do research on it but nobody really cares about such work.

yahayahya · a year ago

Is that because the windows APIs are better? Or because businesses build their embedded systems/servers with Windows?

p_ing · a year ago

Certainly depends on which APIs you ultimately use as a developer, right? If it is .NET, they're super simple, and you can get IOCP for "free" and non-blocking async I/O is quite easy to implement.

I can't say the Win32 File API is "pretty", but it's also an abstraction, like the .NET File Class is. And if you touch the NT API, you're naughty.

On Linux and macOS you use the same API, just the backends are different if you want async (epoll [blocking async] on Linux, kqueue on macOS).

pjc50 · a year ago

The windows APIs are certainly slower. Apart from IOCP I don't think they're that much different? Oh, and mandatory locking on executable images which are loaded, which has .. advantages and disadvantages (it's why Windows keeps demanding restarts)

wruza · a year ago

I doubt that, was just curious how it might compare in the article.

ryao · a year ago

> On Linux ZFS, it appears that there's a code path designed to do the right thing, but CPU usage spikes and the system may hang or become unusable.

ZFS fsync will not fail, although it could end up waiting forever when a pool faults due to hardware failures:

https://papers.freebsd.org/2024/asiabsdcon/norris_openzfs-fs...

ein0p · a year ago

ZFS on Linux unfortunately has a long standing bug which makes it unusable under load: https://github.com/openzfs/zfs/issues/9130. 5.5 years old, nobody knows the root cause. Symptoms: under load (such as what one or two large concurrent rsyncs may generate over a fast network - that's how I encountered it) the pool begins to crap out and shows integrity errors and in some cases loses data (for some users - it never lost data for me). So if you do any high rate copies you _must_ hash-compare source and destination. This needs to be done after all the writes are completed to the zpool, because concurrent high rate reads seem to exacerbate the issue. Once the data is at rest, things seem to be fine. Low levels of load are also fine.

Deleted Comment

ryao · a year ago

There are actually several distinct issues being reported there. I replied responding to everyone who posted backtraces and a few who did not:

https://github.com/openzfs/zfs/issues/9130#issuecomment-2614...

That said, there are many others who stress ZFS on a regular basis and ZFS handles the stress fine. I do not doubt that there are bugs in the code, but I feel like there are other things at play in that report. Messages saying that the txg_sync thread has hung for 120 seconds typically indicate that disk IO is running slowly due to reasons external to ZFS (and sometimes, reasons internal to ZFS, such as data deduplication).

I will try to help everyone in that issue. Thanks for bringing that to my attention. I have been less active over the past few years, so I was not aware of that mega issue.

einpoklum · a year ago

The article wrap up with this salient point:

> In conclusion, computers don't work (but I guess you already know this...

paulddraper · a year ago

They work.

Just not all the time.