100k TPS over a billion rows: the unreasonable effectiveness of SQLite

The only caveat being this assumes all your data can fit on a single machine, and all your processing can fit on one machine. You can get a a u-24tb1.112xlarge with 448 vcores, 24TB RAM for 255/hour and attach 64TB of EBS -- that's a lot of runway.

tuhgdetzhh · 3 months ago

Or rent a bare-metal machine from hetzner with 2-3x performance per core and 90% less costs[1].

[1] Various HN posts regarding Hetzner vs AWS in terms of costs and perf.

tetha · 3 months ago

In my experience, a decently managed database scales very hard.

3x EX44 running Patroni + PostgreSQL would give you 64GB of working memory, at least 512 GB NVMe of dataset (configurable with more for a one-time fee) at HA + 1 maintenance node. Practically speaking, that would have carried the first 5 - 10 years of production at the company I work at with ease, for 120 Euros hardware cost/month + a decent sysadmin.

I also know quite a few companies who toss 3-4x 20k - 30k at DELL every few years to get a database cluster on-prem so that database performance ceases to be a problem (unless the application has bad queries).

cortesoft · 3 months ago

There are no Hetzner servers that have 24TBs of RAM

ethanwillis · 3 months ago

This might be true in terms of direct monetary costs.

I want to like Hetzner but the bureaucratic paper process of interacting with them and continuing to interact with them is just... awful.

Not that the other clouds don't also have their own insane bureaucracies so I guess it's a wash.

I'm just saying, I want a provider that leaves me alone and lets me just throw money at them to do so.

Otherwise, I think I'd rather simply deploy my own oversized server in a colo even with the insanely overpriced hardware prices currently.

edit: And shortly after writing this comment I see: "Microsoft won't let me pay a $24 bill, blocking thousands in Azure spending" https://news.ycombinator.com/item?id=46124930

closeparen · 3 months ago

It’s too bad that there does not seem to be a comparable provider with datacenters in North America.

Deleted Comment

adityaathalye · 3 months ago

Heh, the documentation calls out the limits. Maximum (theoretical) DB size is 281TB: https://sqlite.org/limits.html

> This particular upper bound is untested since the developers do not have access to hardware capable of reaching this limit.

> However, tests do verify that SQLite behaves correctly and sanely when a database reaches the maximum file size of the underlying filesystem (which is usually much less than the maximum theoretical database size) and when a database is unable to grow due to disk space exhaustion.

DenisM · 3 months ago

Scale-up solves a lot of problems for stable workloads. But elasticity is poor, so you either live with overprovisinoed capacity (multiples, not percentages) or fail under spiky load which often time is the most valuable moment (viral traffic, Black Friday, etc).

No one has solved this problem. Scale out is typically more elastic, at least for reads.

kragen · 3 months ago

That's a good point, but when one laptop can do 102545 transactions per second, overprovisioned capacity is kind of a more reasonable thing to use than back when you needed an Amdahl mainframe to hit 100 transactions per second.

CuriouslyC · 3 months ago

I love hetzner for internal resources because they're not spikey. For external stuff I like to do co-processing, you can load balance to cloudflare/aws/gcp services like containers/Run/App Runner/etc.

masterj · 3 months ago

I suspect that for a large number of orgs accepting over-provisioning would be significantly cheaper than the headcount required for a more sophisticated approach while allowing faster movement due to lower overall complexity

Deleted Comment

kiitos · 3 months ago

and that your application doesn't need to be resilient to host or network faults

embedding-shape · 3 months ago

> The only caveat being this assumes all your data can fit on a single machine

Does my data fit in RAM? https://yourdatafitsinram.net/

Not sure using EC2/AWS/Amazon is a good example here, if you're squeezing for large single-node performance you most certainly go for dedicated servers, or at least avoid vCPUs like a plague.

jandrese · 3 months ago

That site is a bit questionable. I entered "64TB" as the answer and it was very happy to show me a bunch of servers that maxed out at 6 or 8TB. Even the one server that listed 64TB of RAM might be questionable since it's not leaving room for the OS or your applications. That said 64 TB is a gargantuan amount of data, so I'm not too worked up over it not fitting in RAM. Lord help you if you have a power outage and have to reload the data from disk.

ErroneousBosh · 3 months ago

Getting on for ten year's worth of forum posts on https://rangerovers.pub/ comes to about 32MB of SQL dump.

So yeah, easily.

paulddraper · 3 months ago

How does 25 TiB fit in RAM when the max machine has 24 TB?

cryptonector · 3 months ago

That's true of PG as well.

> Hopefully, this post helps illustrate the unreasonable effectiveness of SQLite as well as the challenges you can run in with Amdahl's law and network databases like postgres.

No, it does not. This article first says that normally you would run an application and the database on separate servers and then starts measuring the performance of a locally embedded database. If you have to keep the initial requirement for your software, then SQLite is completely out of equation. If you can change the requirement, then you can achieve similar performance by tuning the local PGSQL instance -- and then it also becomes a valuation of features and not just raw throughput. I'm not saying SQLite is not an option either, but this article seems confusing in that it compares two different problems/solutions.

sethev · 3 months ago

Right - but SQLite handily beats the case where postgres is on the same box as well. And it's completely reasonable to test technology in the configuration in which it would actually run.

As an industry, we seem to have settled on patterns that actually are quite inefficient. There's no problem that requires the solution of doing things inefficiently just because someone said databases should run on a different host.

stickfigure · 3 months ago

If you're going to run on more than one piece of hardware, something is going to be remote to your single writer database.

As an industry, we've generally decided against "one big box", for reasons that aren't necessarily performance related.

Deleted Comment

Dead Comment

BugsJustFindMe · 3 months ago

> If you have to keep the initial requirement for your software, then SQLite is completely out of equation.

No it isn't? You can run a thin sqlite wrapping process on another server just fine. Ultimately all any DB service is, PostgreSQL included, is a request handler and a storage handler. SQLite is just a storage handler, but you can easily put it behind a request handler too.

Putting access to sqlite behind a serial request queue used to be the standard way of implementing multi-threaded writes. That's only spitting distance away from also putting it behind TCP.

gunnarmorling · 3 months ago

You could do that, but you'd run into exactly the same bottleneck the author describes with a remote Postgres instance. The workload exposes high contention on hot rows. If transactions are kept open for several milliseconds due to this being a remote network call between client and DB server, throughput will be equally limited also when using SQLite.

Barathkanna · 3 months ago

Exactly. People forget that “SQLite can’t do X” often really means “SQLite doesn’t ship with X built in.” If you wrap it with a lightweight request handler or a queue, you essentially recreate the same pattern every other DB uses. The fact that PostgreSQL bundles its own coordinator doesn’t make SQLite fundamentally incapable. It just means you choose whether you want that layer integrated or external.

chasil · 3 months ago

As long as WAL mode is not enabled, connections over NFS/SMB or other file sharing protocols will work.

I'm not saying that this is a good idea, and it could fail in a spectacular manner, but it can be done. DML over this is just asking for trouble.

formerly_proven · 3 months ago

Well that's just dqlite/rqlite.

andersmurphy · 3 months ago

Paradoxically, raw throughput matters a lot more if you are going to scale on a single box. SQLite is 10x PG on a single box in this example. Considering databases tend to be the bottle neck that can take you an order of magnitude further. PG on the same server will also be slower the more complex the transaction as unix sockets are still going to be considerably slower than a function call.

The other thing to point out is in this article is that the PG network example CANNOT scale horizontally due to the power law. You can throw a super cluster at the problem and still fundamentally do around 1000 TPS.

emseetech · 3 months ago

Also important is just how fast cheap hardware has gotten which means vertical scaling is extremely effective. People could get a lot farther with sqlite in wal mode on a single box with an nvme drive than they imagine. Feels like our intuition has not caught up with the material reality of current hardware.

And now that there are solid streaming backup systems, the only real issue is redundancy not scaling.

barfoure · 3 months ago

> Paradoxically, raw throughput matters a lot more if you are going to scale on a single box.

There’s absolutely nothing paradoxical about any of this.

embedding-shape · 3 months ago

> If you have to keep the initial requirement for your software, then SQLite is completely out of equation.

It'd be a very short article if so, don't you think? Full article would be something like: "Normally you'd have a remote connection to the database, and since we're supposed to test SQLite's performance, and SQLite is embedded, it doesn't compare. Fin"

stonemetal12 · 3 months ago

The table of data at the end of the article has 7 lines, only one has data for both DBs. What was the point of setting up the comparison if there is no comparison made?

fulafel · 3 months ago

What is says first is: "SQLite is for phones and mobile apps (and the occasional airliner)! For web servers use a proper database like Postgres!"

Though I'd say it's for a broader set of applications than that (embedded apps, desktop apps, low-concurrency server apps etc).

Phones and mobile apps installations of course outnumber web app deployments, and it doesn't say what you paraphrased about servers.

rich_sasha · 3 months ago

I concur that sqlite is quite amazing. That said, I was a heavy user and have grown some skepticism as well:

- it is not that hard to lock the db. Usually killing the process that caused the deadlock solves the issue - but you need to identify it / monitor for it. And yes, it happens with WAL too

- but when it does happen, it is quite scary. Simply, anything that touches your DB suddenly stops working - can't read, can't write.

- in some cases, WAL does not checkpoint. This leads to drastic growth in the size of the WAL file, and down the line in catastrophic slowdown of queries - things that take 10ms suddenly take 10 seconds. In my particular case, no tweaking of SQLite params fixed it. I had to monitor for it, and periodically force WAL file to be rolled into the main DB.

- all of this gets harder on Windows, where eg.you cannot just 'lsof' a file.

- the performance stats change somewhat for the worse in the cloud on drives that look local but actually aren't. Of course that is not sqlite's fault, but the blazing fast performance doesn't apply to all commonly encountered environments that look like real local drives.

I'm not dissing SQLite, I use it despite these shortcomings. Equally, I'm happy to reach for something like Postgres, which, well, hasn't burned me yet.

EDIT I should add that despite all this I never managed to corrupt the DB, or break any of the SQL promises - never messed up indices, never saw broken ACID compliance etc. And that's a massive endorsement, on reflection.

These are some really good points.

- WAL checkpointing is very important (litestream handles this well). As you said not checkpointing can cause massive query slow down.

- SQLITE_LOCK and SQLITE_BUSY can be avoided by ensuring your application only has a single write connection ideally behind an MPSC queue. After WAL this is probably one of the biggest SQLite quality of life improvements.

- 100% avoid cloud drives in this context you ideally want attached NVME.

- Postgres is great and there's nothing wrong with using it!

zbentley · 3 months ago

> a single write connection ideally behind an MPSC queue

That’s a pretty tall order. What if I want read-after-write consistency for code that issues a write? Did you mean some kind of a fair mutex around writes instead (in which case, how is this different from what SQLite already does?)? What if writes are coming from multiple uncoordinated processes? Do I then need to bring in a daemon or IPC?

asa400 · 3 months ago

From your experience, would you call these behaviors bugs, or are they more known issues that result from SQLites specific implementation quirks? What kinds of workloads were you throwing at it when these types of issues happened? Asking as someone who really enjoys and respects SQLite but hasn't encountered these specific behaviors before.

I was pushing SQLite quite hard. My DB was at peak 25GB or so. Occasional queries of O(1e6) rows while simultaneously inserting etc. Many readers and a few writers too. Id expect some degradation, sure, but Id say it wasn't very graceful.

I think, however, I was well within the parameters that SQLite maximalists would describe as within th envelope of heavy but fine usage. YMMV.

I found a very small number of people online with the exact same issues. Enough to know I'm not hallucinating, but not enough to find good support for this :/ but, TLDR, forcing WAL truncation regularly fixed it all. But I had to do it from an external process on a heartbeat, etc etc

saberience · 3 months ago

I was in love with sqlite too, until it just started getting randomly corrupted/locked and I kept having to restore it, and I never worked out why it was happening.

I appreciate its "simplicity" but ultimately I hated not knowing why it occasionally just shit the bed and ended up in an unrecoverable state. I also didn't like having to roll my own recovery system for it. Now I just use Postgres for all my hobby projects and it "just works" and I've never had it lock-up or corrupt itself...

Your mileage may vary, but sqlite definitely isn't as stable as it makes it seem.

Scaevolus · 3 months ago

WAL2 might help with the checkpointing problem: https://sqlite.org/src/doc/wal2/doc/wal2.md

Iirc it was somewhat experimental when I last used SQLite like this. Is is not considered mainstream? Thanks for the top tip.

weitendorf · 3 months ago

I've been working on a hybrid protobuf ORM/generic CRUD server based on sqlite

(code at https://github.com/accretional/collector - forgive the documentation. I'm working on a container-based agent project and also trialling using agents heavily to write the individual features. It's working pretty well but the agents have been very zealous at documenting things lol).

This is my first real project using sqlite and we've hit some similarly cool benchmarks:

* 5-15ms downtime to backup a live sqlite db with a realistic amount of data for a crud db

* Capable of properly queueing hundreds of read/write operations when temporarily unavailable due to a backup

* e2e latency of basically 1ms for CRUD operations, including proto SerDe

* WAL lets us do continuous, streaming, chunked backups!

Previously I'd only worked with Postgres and Spanner. I absolutely love sqlite so far - would still use Spanner for some tasks with an infinite budget but once we get Collector to implement partitions I don't think I would ever use Postgres again.

bombela · 3 months ago

> * 5-15ms downtime to backup a live sqlite db with a realistic amount of data for a crud db

Did you consider using a filesystem with atomic snapshots? For example sqlite with WAL on BTRFS. As far as I can tell, this should have a decent mechanical sympathy.

edit: I didn't really explain myself. This is for zero downtime backups. Snapshot, backup at your own pace, delete the snapshot.

hedora · 3 months ago

If it’s at 5-15ms of downtime already, you’re in the space where the “zero” downtime FS might actually cause more downtime. In addition to pauses while the snapshot is taken, you’d need to carefully measure things like performance degradation while the snapshot exists (incurring COW costs) and while it’s being GCed in the background.

Also, the last time I checked the Linux scheduling quanta was about 10ms, so it’s not clear backups are going to even be the maximum duration downtime while the system is healthy.

matthewaveryusa · 3 months ago

diath · 3 months ago

jaketoronto · 3 months ago

Are you limiting your # of connections to postgres to 8? Is this unnecessarily throttling your throughput? This seems like quite the bottleneck... connection pools are good when your app is overwhelming your db.. but in this case, you really should be trying to put more load on Postgres... I'm concerned that this whole experiment is tainted by this choke point. I would love to see this tested again with a much larger connection pool. (Also, might be nice for you to share what the CPU and thread usage on the Postgres side was during this). (Side note: I'm a big fan of sqlite.. but I've run into lots of performance bottlenecks caused by arbitrarily low connection pools and the like)

To further explain:

You mention setting the conn pool to 8 to match your # of cores. That would be fine if you didn't have any sleeps inside of your txns... But the moment you added the sleeps inside the txns, your limit of 8 kills through throughput... because no other thread can access the DB once 8 of them grab connections and start the 20ms of total sleep. Imagine instead if you had 64 connections... you would 8x your throughput... What if you were to go even higher? At some point you might start overloading the DB... at that point, you could consider tuning the db to accept more connections... or... maybe you've truly reached the DB's peak performance limit.

I just don't think that 8 connections represents that limit... you need to do everything you can to open up your client config until you reach PG's limitations.

A larger pool actually makes the number worse because it adds more contention. I tested it with 64 and all the results were worse. The last example which was 348 TPS drops to 164 TPS!

Tangentially I also highly recommend this article on pool sizing.

https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-...

Added an epilogue that covers larger connection pools and the results.

ch2026 · 3 months ago

Yeah I don’t buy this article. Something is off. I run vastly more TPS on a very barebones MySQL install over the network.

Edit: Keep downvoting. I hate this fucking site.

tomconnors · 3 months ago

Cool stuff as usual, Anders. One of the nice things about running a networked DB is that it makes redeploying the application a bit simpler. You can spin up a new EC2 instance or whatever and once it's online kill the old one. That gets 0 or close to 0 downtime. If the DB is on the same instance, replacing it requires loading up the DB onto the new instance, which seems more error prone than just restarting the app on the original instance, but in my experience that typically incurs downtime or some complicated handoff logic. Have you had to deal with anything like that running sqlite in prd?

Oh there are a bunch of considerations.

You're going to want persistent storage on your server, not ephemeral. You'll also want NVME. A lot of the time you're going to end up on bare metal running a single server anyway.

You're going to have down time for migrations unless you're very clever with your schema and/or replicas.

Litestream for me at least is what makes SQLite viable for a web app as prior to that there wasn't a good replication story.

With litestream it's much easier to have a backup on standby. That being said where I have used it in production some amount of downtime has been acceptable so mileage may vary.

benjiro · 3 months ago

> Litestream for me at least is what makes SQLite viable for a web app as prior to that there wasn't a good replication story.

Does Sqlite now not have a build in rsync for replicas?

Searches, yep ... https://sqlite.org/rsync.html

> You're going to have down time for migrations unless you're very clever with your schema and/or replicas.

probably worth stating these kinds of design considerations/assumptions up-front

i'm sure lots of applications are fine with "downtime for [database] migrations" but lots more are definitely not, especially those interested in synthetic metrics like TPS

cwillu · 3 months ago

Sqlite supports multi-process access, so as long as you've configured to permit that, you can do the same “start new process before retiring the old one” dance.

tomasol · 3 months ago

Author is setting PRAGMA synchronous="normal", meaning fsync is not issued as part of every write tx, but eventually. In order to make the comparison fair it should be set to "full".

PRAGMA synchronous="normal" is fine if you are in WAL mode. The database cannot be corrupted by power loss unlike in journal mode.

> The synchronous=NORMAL setting provides the best balance between performance and safety for most applications running in WAL mode. You lose durability across power lose with synchronous NORMAL in WAL mode, but that is not important for most applications. Transactions are still atomic, consistent, and isolated, which are the most important characteristics in most use cases.

fsync is the most expensive operation during a write. NORMAL mode means you don't care whether last ~100 ms of transactions before a process crash / VM restart are going to be persisted or not. My suggestion is either to use synchronous="full" or disable `synchronous_commit` on Postgres to avoid comparing apples to oranges.

Edit: Also, the example indicates financial transactions. Can you explain why you need serializability but not durability?

100%. TFA was NOT comparing apples to apples. Now that it's been updated the numbers do not look quite so fantastic. The version of PG used matters, too, since the latest adds async I/O support that greatly improves performance.

SQLit went from 102545 -> 100405

westurner · 3 months ago

That's a helpful TPS Report.

TIL `SAVEPOINT` can occur in a BEGIN ... END SQLite transaction, and that works with optimizing batch size on a particular node with a given load.

Is there a solution for SQLite WAL corruption?

From https://news.ycombinator.com/item?id=45133444 :

> "PSA: SQLite WAL checksums fail silently and may lose data" https://news.ycombinator.com/item?id=44672902

> sqlite-parquet-vtable, [...]

matharmin · 3 months ago

As mentioned in those threads, there is no SQLite WAL corruption if you have a working disk & file system. If you don't, then all bets are off - SQLite doesn't protect you against that, and most other databases won't either. And nested transactions (SAVEPOINT) won't have have any impact on this - all it does in this form is reduce the number of transactions you have.

> working disk & file system

And a working ECC or non-ECC RAM bus, and [...].

How bad is recovery from WAL checksum / journal corruption [in SQLite] [with batching at 100k TPS]?

And should WAL checksums be used for distributed replication "bolted onto" SQLite?

>> (How) Should merkle hashes be added to sqlite for consistency? How would merkle hashes in sqlite differ from WAL checksums?

SQLite would probably still be faster over the network with proper Merkleization