I built a rules engine around SQLite queries for the last product I was working on. This article was the reason.
You can run an unbelievable number of SELECT statements per unit time against a SQLite database. I think of it like a direct method invocation on members of some List&lt;T&gt; within the application itself.
Developers who still protest SQLite are really sleeping on the latency advantages. 2-3 orders of magnitude are really hard to fight against. This opens up entirely new use cases that hosted solutions cannot consider.
Databases that are built into the application by default are the future. Our computers are definitely big enough. There are many options for log replication (sync & async), as well as the obvious hypervisor snapshot approach (WAL can make this much less dangerous).
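The "method invocation" intuition is easy to sanity-check. Here's a rough sketch (table size and query count are arbitrary, not from the article) timing point SELECTs against an in-memory database:

```python
import sqlite3
import time

# In-memory database; a file-backed one with a warm page cache behaves similarly.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(10_000)],
)

# Time many small indexed lookups -- the pattern the article calls cheap.
n_queries = 100_000
start = time.perf_counter()
for i in range(n_queries):
    con.execute("SELECT name FROM items WHERE id = ?", (i % 10_000,)).fetchone()
elapsed = time.perf_counter() - start

print(f"{n_queries / elapsed:,.0f} point SELECTs/second")
```

On typical hardware this reports hundreds of thousands of lookups per second, even with Python driver overhead; from C the number is higher still.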
> Databases that are built into the application by default are the future. Our computers are definitely big enough.
Interesting point. It's easy to forget that, in the past, local storage was small and expensive enough that a separate database machine was necessary.
It wasn't just storage. In the 90s, most servers were single core. Databases already had network interfaces so they could be shared by non-web applications running on desktops, and so were the natural place to split a website over multiple machines. This turned into the received wisdom of always putting your database on a separate machine, even though processing power and storage latency have improved so much since the 486 and Pentium days that network overhead now dominates response times.
I'm still not understanding this push toward using SQLite as a production backend database. It's great for what it is: a tiny embeddable client-side application database, like an address book on your phone. But even the developers themselves have steadfastly refused to let it expand beyond that scope. For instance, they won't add native types for useful things like dates/times or UUIDs, because that would bloat the code and the size of the embedded object. So you're stuck with "everything is a string". Referential integrity can be enabled, but even those constraint options are very limited.
Not sure why people are still trying to shoe-horn it into a role that it's not meant to be in, and not even really supported to be.
Maybe it's not for you, but the "everything is a string" thing is just the default. SQLite has had the STRICT table option since 2021, which people really should be using if possible: https://www.sqlite.org/stricttables.html
This brings strict types that people expect from the other server-based databases.
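A minimal sketch of the difference, assuming a SQLite build of 3.37 or newer (the version STRICT landed in):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Ordinary table: type names are only affinities, so a non-numeric string
# goes into an INTEGER column unchanged (stored with typeof 'text').
con.execute("CREATE TABLE loose (n INTEGER)")
con.execute("INSERT INTO loose VALUES ('not a number')")
print(con.execute("SELECT typeof(n) FROM loose").fetchone())  # ('text',)

# STRICT table: the same insert is rejected as a constraint violation.
if sqlite3.sqlite_version_info >= (3, 37, 0):
    con.execute("CREATE TABLE strict_t (n INTEGER) STRICT")
    try:
        con.execute("INSERT INTO strict_t VALUES ('not a number')")
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
```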
Just to clear up the error in the parent post: SQLite has native blobs, floats, and integers, not just strings. It doesn't have a bunch of other types for things like dates and JSON - you just represent those things using the native types of integer, float, string or blob. But it is not limited to only strings. This has been true for 20 years.
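So a date, for instance, is just stored as an ISO-8601 string or a Unix epoch integer, and SQLite's built-in date functions understand both. A quick sketch (table and values made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (name TEXT, iso TEXT, epoch INTEGER)")
con.execute(
    "INSERT INTO events VALUES (?, ?, ?)",
    ("launch", "2024-03-01 12:00:00", 1709294400),  # same instant, two encodings
)

# date() parses the ISO string; datetime(..., 'unixepoch') parses the integer.
row = con.execute(
    "SELECT date(iso), datetime(epoch, 'unixepoch') FROM events"
).fetchone()
print(row)  # ('2024-03-01', '2024-03-01 12:00:00')
```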
It's absolutely supported for that use case if you can get away with a single-writer, multi-reader architecture, which IMHO most medium-sized applications can. [1]
> It's great for what it is, a tiny embeddable client-side application database. Like an address book on your phone
Size is no real concern: if the user of a client-side application has many gigabytes of data, a SQLite database is still well suited to the role. There's no shoehorning, it just works.
> Databases that are built into the application by default are the future.
No they're not, because for web servers, you have many (tens? hundreds? thousands?) of web servers that all need to talk to a single database. And you sure don't want to replicate and sync a gigantic database across each web server -- that would be a disaster.
While for local apps on your phone or computer, usage of SQLite is already widespread -- it's not the future, it's here. And cloud-connected apps that work offline already do some sort of sync for that.
Most people don't have tens or thousands of webservers and could be running with a more efficient in-process database that syncs to another networked read-only replica for a very long time. I'm very surprised that MySQL or Postgres still don't have an in-process mode, because it's such an obvious win, and at a technical coding level it should (naively) be very easy to switch from network calls to direct calls.
> for web servers, you have many (tens? hundreds? thousands?) of web servers that all need to talk to a single database
Maybe if you're in the top 1000 or so largest websites.
Back when the Alexa 10k was a thing, $work was on it - and we were serving that level of traffic with a Rails app running on 30 CPU cores. It would fit _easily_ onto a single machine.
We have a giant Postgres DB, but instead of having every machine connect directly to it, we have a job that creates a smaller SQLite cache of relevant data and that’s pushed out to the machines who then reload that on the fly.
All this depends on your data being somewhat shardable of course.
> Databases that are built into the application by default are the future. Our computers are definitely big enough.
That assumes local applications themselves are the future, and that assumption has grown ever weaker and weaker with everyone and their dog going cloud-only (or starting as a SaaS in the first place) to grab all the sweet sweet recurring subscription revenue.
> Databases that are built into the application by default are the future.
I'm not sold on client side, but a lot of great work is being done on putting the db and application on the same server, between SQLite replication and other approaches like SpacetimeDB. I'm interested to see where it goes.
SQLite underpins nearly every on-disk storage mechanism at Apple and Google at the OS level. You may not be sold, but systems programmers elsewhere are.
I often find myself thinking about what a database like SQLite might look like if it had a native API instead of a query language. I guess it wouldn't be so different from a dataframe API, but with persistence, locking, and relational features, and I guess an mmap under the hood.
> Databases that are built into the application by default are the future.
That’s a bit overzealous. SQLite is great, but it’s not a replacement for a hosted database. Not all data can live on the client. Use SQLite when you have the right use case for it (offline desktop app, etc.)
If you have a database server and an application server, then move your application onto the same machine as your database. That's basically the use case of SQLite server-side. You can easily fit 250 TB in a rack (or 64 TB on an EC2 instance). That's a lot for non-blob storage: about 750 KB per US citizen.
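The back-of-envelope arithmetic holds up (the population figure is an assumed ~335 million):

```python
# Rough check: rack-scale storage divided across the US population.
rack_bytes = 250e12          # 250 TB
us_population = 335e6        # assumed, roughly the 2023 estimate
per_person_kb = rack_bytes / us_population / 1e3
print(f"{per_person_kb:.0f} KB per person")  # ~746 KB
```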
Wouldn't a hosted database really only "make sense" if you don't want sharding? It's a lot easier to shard at the front end before it makes it to any business logic than to do that and also shard the database. It's when you don't want to deal with database sharding at all that you'd want all application instances to hit the same hosted database.
An interesting result of this is that SQLite is a really good database for implementing GraphQL.
The biggest problem with GraphQL is how easy it becomes to accidentally trigger N+1 queries. As this article explains, if you're using SQLite you don't need to worry about pages that accidentally run 100s of queries, provided those queries are fast - and it's not too hard to build a GraphQL API where all of the edge resolutions are indexed lookups with a limit to the number of rows they'll consider.
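A rough sketch of that shape (the schema, resolver name, and row cap are all made up for illustration; this isn't any particular GraphQL library): each edge resolution is one indexed lookup with a hard cap on rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES authors(id),
        title TEXT
    );
    CREATE INDEX posts_by_author ON posts(author_id);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO posts VALUES (1, 1, 'first'), (2, 1, 'second'), (3, 2, 'third');
""")

MAX_EDGE_ROWS = 100  # hard limit on each edge resolution

def resolve_posts(author_id):
    """One indexed lookup per author -- the 'n' in n+1, but each is cheap."""
    return con.execute(
        "SELECT title FROM posts WHERE author_id = ? LIMIT ?",
        (author_id, MAX_EDGE_ROWS),
    ).fetchall()

# Resolving all authors plus their posts issues 1 + n small queries.
result = {
    name: [title for (title,) in resolve_posts(author_id)]
    for author_id, name in con.execute("SELECT id, name FROM authors")
}
print(result)  # {'ada': ['first', 'second'], 'grace': ['third']}
```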
It's true that RPC latency doesn't exist, and that opens up other possibilities that are worth considering.
But if you do a ton of small queries instead of one big one, you could be depriving the database of the opportunity to choose an efficient execution plan.
For example, if you do a join by querying one table for a bunch of IDs and then looking up each key with an individual select, you're forcing the database into doing a nested loop join. Maybe a merge join or hash join would have been faster.
Or maybe not. Sometimes the way you write the queries corresponds to what the database would have done anyway. Just not necessarily, so it's something to keep in mind.
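Concretely, the two shapes look something like this (toy schema, made up for illustration). Both return the same rows, but only the second lets the planner choose the join strategy:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

# Shape 1: the application does the join -- a hand-written nested loop.
ids = [cid for (cid,) in con.execute("SELECT customer_id FROM orders")]
looped = [
    con.execute("SELECT name FROM customers WHERE id = ?", (cid,)).fetchone()[0]
    for cid in ids
]

# Shape 2: one query; the planner is free to pick nested-loop or another plan.
joined = [
    name
    for (name,) in con.execute(
        "SELECT c.name FROM orders o JOIN customers c ON c.id = o.customer_id"
    )
]

print(sorted(looped) == sorted(joined))  # True: same rows either way
```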
I swear "n+1" only popped up in the past 5 or so years with this meaning. I learned it over a decade ago as "1+n".
Also "n+1" looks more like a reference to concurrency/synchronization throughput - like, n concurrent threads or requests where the results are collected and used together. I was really confused the first few times I saw "n+1" because of that.
I have thought of that as 'n' being the manageable threshold and the (uncontrolled) '+1' being the overflow that creates the problems, typically in terms of additional layers or iterations. But I like your point, and perhaps 'n+1' and '1+n' mean different problem shapes.
I could see that making sense in some contexts, like the "straw that breaks the camel's back". However that's not what's usually being referred to in this database query problem.
Usually it's doing one query that returns n results, then doing one more query for each result. Therefore, you end up having done 1+n queries. If you'd used a join you could potentially have done only 1 query.
It's interesting to consider what this would look like if you put Postgres, rather than SQLite, in process. With PGlite we can actually look at it.
I'm hesitant to post a link to benchmarks after the last 24 hours on Twitter... but I have some basic micro-benchmarks comparing WASM SQLite to PGlite (WASM Postgres that runs in-process): https://pglite.dev/benchmarks#round-trip-time-benchmarks
It's very much in the same ballpark. Postgres is a heavier database and is understandably slower, but not by much. There is a lot of nuance to these benchmarks though, as the underlying VFSs are a little different from each other, and PGlite uses a WAL whereas the SQLite here is in its rollback journal mode (I believe this is why PGlite is faster for some inserts/updates).
But essentially I think Postgres, when in-process, would be able to perform similarly to SQLite with many small queries, embracing n+1. Having said that, I think the other comments here about query planning are important to consider: if you can minimise your queries, you minimise the scans of indexes and tables, which is surely better.
What this does show, particularly when you look at the "native" comparison at the end, is that removing the network (or at least a local socket) from the stack brings the two closer together.
The article doesn't explain why this is a bad idea for database servers hosted on a remote machine. The first reason is obvious... Network connections take memory and processing power on both client and server. Each additional query causes more resource usage. It is unnecessary overhead which is why things like multiple active result sets were created for SQL Server.
The network round trip time can also add up if you run into resource constraints doing this.
On a remote database, you also have to contend with multiple users and so complicated locking techniques can come into play depending upon the complexity of all database activity.
Many databases have options to return multiple result sets from one connection which helps control the overhead caused by this usage pattern.
EDIT: This also brings back horrible memories where developers would do this in a db client server architecture. Then they would often not close the DB connections when done. So you could have thousands of active connections basically doing nothing. Luckily, this problem was solved with better database connection handling.
These days there are other tricks you can use to turn several SQL queries into a single round-trip too, with things like JSON aggregates.
Here's an example PostgreSQL query that returns 10 rows from one table and 20 rows from another table in a single network round-trip, using JSON serialization to return the different shaped rows in one go: https://simonwillison.net/dashboard/union-json-demo/
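The same general trick works in SQLite too, assuming a build with the JSON functions compiled in (they're on by default in modern builds): two differently-shaped result sets come back as a single row of JSON. A sketch with a made-up schema:

```python
import sqlite3
import json

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO tags VALUES (1, 'sqlite'), (2, 'json');
    INSERT INTO users VALUES (1, 'a@example.com');
""")

# One statement, one "round trip": each branch aggregates a table into JSON.
(payload,) = con.execute("""
    SELECT json_object(
        'tags',  (SELECT json_group_array(json_object('id', id, 'name', name))
                  FROM tags),
        'users', (SELECT json_group_array(json_object('id', id, 'email', email))
                  FROM users)
    )
""").fetchone()

data = json.loads(payload)
print(data["tags"], data["users"])
```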
The problem is that it only works with Postgres, not MySQL or SQLite or pretty much anything else (at least not as conveniently), and the bigger problem is that the queries become more complex.
Yes, you get some performance improvements, but I think that comes at the price of security isolation. Think about it, your application probably requires tons of libraries and evolves really quickly compared to the database itself. Additionally, having Admin/root permissions on the server hosting the DB is generally a much bigger deal than granting such permissions on an application server that talks to that DB.
If none of this makes sense, don't worry, that just means you don't work in enterprise...
Use a single SqliteConnection instance for all access - most builds of the SQLite provider serialize threads internally. Opening multiple connections will incur unnecessary filesystem operations.
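The equivalent pattern with Python's stdlib driver looks something like this sketch: you opt out of the per-thread check and serialize access with your own lock, since whether the underlying library serializes internally depends on how it was built.

```python
import sqlite3
import threading

# One shared connection; an explicit lock serializes access rather than
# relying on the build-time threading mode of the SQLite library.
con = sqlite3.connect(":memory:", check_same_thread=False)
con.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
con.execute("INSERT INTO counters VALUES ('hits', 0)")
lock = threading.Lock()

def bump():
    for _ in range(1000):
        with lock:  # all threads share the one connection safely
            con.execute(
                "UPDATE counters SET value = value + 1 WHERE name = 'hits'"
            )

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(con.execute("SELECT value FROM counters").fetchone())  # (4000,)
```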
> How do they manage a durable data store? Since SQLite doesn't have a replication mechanism, how do people handle losing transactions?
I've never run one of my SQLite-embedded services on a machine that wasn't virtualized and also known as "production". The simplest recovery strategy is to snapshot the VM. There are other libraries that augment SQLite and provide [a]sync replication/clustering/etc.
#1. They are pretty much my default settings now. But how do you serialize concurrent writes with multiple processes? Say you run your API service in a separate stack from your webapp service: you can't really do that then. I know some folks just make sure the transactions are so fast that collisions will be rare. But why take the chance?
#2. A snapshot is a backup strategy. AFAIK, from reading some Reddit comments by Ben (the author of Litestream), it's not a replication strategy.
I've used SQLite in ETL pipelines very successfully, as well as for read-only caching. I just can't figure out how people use it server-side without a whole bunch of hacks to deal with its limitations.
One thing I don’t see mentioned yet is that efficiency isn’t the only reason multiple queries can be problematic. There’s also consistency. This might not matter for many SQLite use cases, or in general for blocking single-reader calls. But if your database might be handling concurrent writes that could occur between any two reads, reducing the overhead to zero still might not be worth the tradeoff.
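When that consistency does matter, the fix is cheap: wrap the related reads in one read transaction, and in WAL mode they all see a single snapshot even while writes commit in between. A sketch (schema and values made up):

```python
import sqlite3
import tempfile
import os

db = os.path.join(tempfile.mkdtemp(), "snap.db")

writer = sqlite3.connect(db, isolation_level=None)  # autocommit mode
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE balance (acct TEXT PRIMARY KEY, amount INTEGER)")
writer.execute("INSERT INTO balance VALUES ('a', 100), ('b', 100)")

reader = sqlite3.connect(db, isolation_level=None)

# Start a read transaction: the first SELECT pins a snapshot of the database.
reader.execute("BEGIN")
first = reader.execute("SELECT amount FROM balance WHERE acct='a'").fetchone()

# A write commits between the reader's two queries...
writer.execute("UPDATE balance SET amount = 0 WHERE acct='a'")

# ...but inside the transaction the reader still sees the pinned snapshot.
second = reader.execute("SELECT amount FROM balance WHERE acct='a'").fetchone()
reader.execute("COMMIT")

print(first, second)  # (100,) (100,)
```

After the COMMIT, the reader's next query sees the updated value; only the reads grouped inside the transaction share a snapshot.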
[1] https://www.sqlite.org/whentouse.html (see Server Side Database)
When you say "client" are you referring to the end user's machine, or the server hosting the application they are talking to?
I personally haven't used it.
What I am curious about is if "Many Small Queries Are Efficient in SQLite" when using the various hosted flavors of SQLite.
Knowing this, I wonder if there will be another wave of object databases at some point.
It still depends a lot. If your system is high-load enough, you may want sharding, for example.
I had a lot of fun building a Datasette GraphQL plugin a while ago: https://datasette.io/plugins/datasette-graphql
As a pedant, I've been referring to it as a "1+n" problem, but haven't managed to make it catch on yet!
It looks like this page in the SQLite docs has had the "n+1" terminology for as long as it's been on the internet archive (2016): https://web.archive.org/web/20161112021608/https://www.sqlit...
Related trick: https://til.simonwillison.net/sqlite/related-rows-single-que...
1. How are they managing concurrent writes? I know about locks or queues, but if my stack needs multiple servers, then what?
2. How do they manage a durable data store? Since SQLite doesn't have a replication mechanism, how do people handle losing transactions?
See:
https://www.sqlite.org/pragma.html#pragma_synchronous
https://www.sqlite.org/pragma.html#pragma_journal_mode
https://www.sqlite.org/threadsafe.html
https://rqlite.io/docs/faq/
https://litestream.io
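For reference, the pragmas linked above usually combine into a few lines of setup (the values here are common choices, not universal advice):

```python
import sqlite3
import tempfile
import os

db = os.path.join(tempfile.mkdtemp(), "svc.db")
con = sqlite3.connect(db)

# WAL lets readers run concurrently with a single writer; NORMAL syncs less
# often than FULL but still survives application crashes; the busy timeout
# makes writers wait rather than error when the write lock is held.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
con.execute("PRAGMA synchronous=NORMAL")
con.execute("PRAGMA busy_timeout=5000")

print(mode)  # wal
```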