> More broadly, I don't think a single definition of 'durable' (as in ACID D) for transactions is particularly useful.
> Much more useful is to ask "what kinds of failures could cause committed transactions to be lost?"
All these articles that treat durability as a single yes-or-no property should come with a warning. Writing to disk is not durable under certain circumstances. fsync is not durable under certain circumstances. Two-phase commit is not durable under certain circumstances. Even multi-data-center commit is not durable against the Death Star.
I used SQLite in an embedded life-safety system for years. It stored settings and history events on an industrial MicroSD card. Failures happened for the same common reasons they happen in every other product:
1) MicroSD card started producing bad sectors.
2) Power failure during write.
3) A counterfeit MicroSD card being used.
The settings section of the database was written to so rarely that it could be recovered. It was the journal of events kept for analysis that tended to be lost. Most of the time only a page or two was corrupted, and the database worked just fine until detailed reports were generated.
Settings were also mirrored to a backup location. Unless the client wanted to pay more for history backup, the history was only retained locally.
The documentation seems pretty clear to me - it describes specifically what each option controls and the implications of using it. Setting aside the debate over whether the default behavior should be described as durable or not, this post's author seems to understand exactly what each option actually does.
Perhaps what's unclear is when to select which option?
I wrote the first article, and I thought the documentation was clear, but then I saw a comment by Hipp that confused me:
> If you switch to WAL mode, the default behavior is that transactions are durable across application crashes (or SIGKILL or similar) but are not necessarily durable across OS crashes or power failures. Transactions are atomic across OS crashes and power failures. But if you commit a transaction in WAL mode and take a power loss shortly thereafter, the transaction might be rolled back after power is restored.
Also, even if I've understood the docs correctly, a number of people in this thread and elsewhere have come to a different interpretation. I think that's much less likely to happen with the PostgreSQL docs.
And I'm sure you can understand why I began to doubt my own interpretation of the docs when SQLite's creator posted a comment saying the exact opposite of what I thought the docs said!
Fair! Fwiw, I enjoyed the post - hopefully my comment wasn't harsh.
I think it does come down to the definition of durable. The default in SQLite is that the data for a transaction will be written all the way to disk (with an fsync) and the rollback journal will be deleted (but without an fsync).
In this model, there is a chance of losing only the last transaction, and only if the whole system crashes (like a power failure) after deleting the journal but before your file system makes that deletion durable. If your application crashes, you're still fine.
That's significantly more durable than an asynchronous commit in Postgres, though. In an asynchronous commit, the transaction will complete before any data is written to disk at all - the data is only in memory, and many transactions can queue up in memory during the wal_writer_delay. All of those will definitely be lost if Postgres stops uncleanly for any reason (for example a crash, oomkill, or power failure).
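For reference, the SQLite defaults being discussed can be checked from application code. A minimal sketch with Python's stdlib sqlite3 module (the file name is illustrative; note that some wrappers and distributions compile in different defaults, so the printed values are what the stock library reports):

```python
import sqlite3

# On a fresh on-disk database, SQLite's stock defaults apply:
# journal_mode is "delete" and synchronous is 2 (FULL).
conn = sqlite3.connect("example.db")
print(conn.execute("PRAGMA journal_mode").fetchone()[0])  # delete
print(conn.execute("PRAGMA synchronous").fetchone()[0])   # 2
conn.close()
```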
> By default, SQLite is not durable, because the default value of journal_mode is DELETE, and the default value of synchronous is FULL, which doesn't provide durability in DELETE mode.
From the documentation, it seems like synchronous being FULL does provide durability of the database in DELETE mode, as FULL means it calls fsync after the transaction is completed. I think you may be confusing durability of the journal file with durability of the database. I don't think WAL can ever really have a durable transaction; it is essentially based on leaving a transaction open until it gets "check-pointed" or actually committed to the database file.
> I don't think WAL can ever really have a durable transaction; it is essentially based on leaving a transaction open until it gets "check-pointed" or actually committed to the database file.
In general: WAL means you write the transaction to the WAL, fsync (in sqlite, this depends upon the sync mode), and then report to the application that it's done. The transaction is then durable: even if the database crashes, the contents of the WAL will be applied to the database file.
Checkpointing later just lets you throw away that part of the WAL and not have to replay as much to the database file.
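That sequence can be sketched with Python's stdlib sqlite3 module (the file name is illustrative; the pragma names are SQLite's own):

```python
import sqlite3

conn = sqlite3.connect("wal_example.db")
conn.execute("PRAGMA journal_mode=WAL")   # commits are appended to the -wal file
conn.execute("PRAGMA synchronous=FULL")   # fsync the WAL as part of each commit
conn.execute("CREATE TABLE IF NOT EXISTS t(x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()                             # durable once the WAL is synced
# Checkpointing folds the WAL into the main database file, so less of the
# log needs replaying on the next open.
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()
```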
My understanding of DELETE mode is that the transaction is not committed until the rollback journal file is deleted - if the rollback journal is present when sqlite opens a database, it applies the rollback journal to undo the changes that the transaction made. See https://www.sqlite.org/atomiccommit.html#1_deleting_the_roll...
If the directory containing the rollback journal is not fsynced after the journal file is deleted, then the journal file might rematerialize after a power failure, causing sqlite to roll back a committed transaction. And fsyncing the directory doesn't seem to happen unless you set synchronous to EXTRA, per the docs cited in the blog post.
> If the directory containing the rollback journal is not fsynced after the journal file is deleted, then the journal file might rematerialize after a power failure, causing sqlite to roll back a committed transaction. And fsyncing the directory doesn't seem to happen unless you set synchronous to EXTRA, per the docs cited in the blog post.
I think this is the part that is confusing.
The fsyncing of the directory is supposed to be done by the filesystem/OS itself, not the application.
From man fsync,
As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)).
So from sqlite's perspective on DELETE it is either: before the fsync call, and not committed, or after the fsync call, and committed (or partially written somehow and needing rollback.)
Unfortunately it seems like this has traditionally been broken on many systems, requiring workarounds, like SYNCHRONOUS = EXTRA.
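What EXTRA adds is, roughly, a sync of the containing directory after the journal is unlinked. The equivalent from application code looks something like this (a POSIX-only sketch; it assumes the platform allows opening a directory read-only, which Windows does not):

```python
import os

def fsync_dir(dirpath):
    # Flush the directory entry itself, so a just-deleted (or just-created)
    # file name cannot reappear (or vanish) after a power failure.
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```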
I don't follow. How would fsyncing the rollback journal affect the durability of the actual database? Do you actually think that the database would reapply an already committed journal whose ID in the header already indicates that the transaction was committed, when the database is already consistent? I really think you should re-review the definition of durability of a database, especially before saying the creator of SQLite is incorrect about its implementation.
> I don't think WAL can ever really have a durable transaction; it is essentially based on leaving a transaction open until it gets "check-pointed" or actually committed to the database file
why not? if you use synchronous=FULL, then WAL does provide durable transactions, no?
Durability also requires the file system implementation and the disk to do the right thing on fsync, which, if I recall past discussions correctly, isn’t a given.
There are some older fsync() bugs (as famously explored by Postgres developers: https://wiki.postgresql.org/wiki/Fsync_Errors ) but I'm not aware of any modern mainstream kernel where this is broken. If I'm wrong, please tell me!
An application that really wants confidence in a write—to the extent that the underlying device and drivers allow—should use O_DIRECT. Or maybe there is a modern equivalent with io-uring. But that is not easy engineering :)
O_DIRECT in no way absolves you from needing to call fsync, because fsync ALSO signals the storage device to flush its write cache if it holds anything, which is important for durability.
What OP is referring to is that some drives ignore that signal for performance reasons and there’s nothing SW can do to solve that part.
SQLite is an incredible piece of software, and its commitment to backward compatibility is deeply admirable. But that same promise has also become a limitation.
v3.0 was first released in 2004—over 20 years ago—and the industry has changed dramatically since then.
I can’t help but wish for a “v4.0” release: one that deliberately breaks backward compatibility and outdated defaults, in order to offer a cleaner, more modern foundation.
Note: I'm not asking for new functionality per se, just a version of SQLite whose defaults reflect how it should be used and deployed in 2025.
The focus of that experiment was to find out if LSM-based storage[^1] would prove to be faster. It turned out that LSM, for SQLite's workloads, did not provide enough benefit to justify the upheaval.
fsync is rarely truly durable in the sense the article describes. It does help with loss ordering in a lot of cases, and flushes often do some work, but true durability - the idea that after fsync there will not be any rollback or tail loss unless there's catastrophic failure - sorry, nope.

Everyone in the chain ends up being the hotspot on the hot path, then they break it because y'all are addicted to spamming syncs, then they move it to a new API, then slowly the syncs come back on the new API, then everyone moves again, layer by layer. A somewhat common example in recent years is NVMe vendors who implement NVMe flush as just pushing the write cache down, but won't always also finalize and flush in-flight or scheduled FTL operations, due to the insane worst-case latency costs involved. Wasn't Apple also caught doing the same in recent years, in part because their FTL shared memory, bus, and privilege with the higher exception levels?

There are also rumors that enterprise drives are better here, but ignoring flushes is even a saleable product in those environments: https://www.atpinc.com/technology/ssd-flush-cache-technology - and some other vendors just ship arbitrary firmware patches to compete (that is: you could buy "enterprise grade" hardware second hand and be entirely unaware of the actual command behavior).
SQLite (with WAL) doesn't do `fsync` on each commit under default settings - https://news.ycombinator.com/item?id=45005071 - Aug 2025 (90 comments)
with a relevant comment by the creator of SQLite here: https://news.ycombinator.com/item?id=45014296
(via https://news.ycombinator.com/item?id=45068594 - thanks int_19h!)
> If you switch to WAL mode, the default behavior is that transactions are durable across application crashes (or SIGKILL or similar) but are not necessarily durable across OS crashes or power failures. Transactions are atomic across OS crashes and power failures. But if you commit a transaction in WAL mode and take a power loss shortly thereafter, the transaction might be rolled back after power is restored.
https://news.ycombinator.com/item?id=45014296
The documentation is in contradiction with this.
io_uring in no way changes the rules here.
But I'm not sure I'd necessarily think of O_DIRECT as a way of improving "confidence in a write". It's a way to get a specific behavior.
https://transactional.blog/how-to-learn/disk-io
[^1]: https://en.wikipedia.org/wiki/Log-structured_merge-tree