One alternative way I’ve been testing recently: load the entire CSV file into a single text column (one row) in Postgres, then use a view to parse it into lines and values, and cache the result with a materialized view. 16 MB / 500k rows (Formula 1 lap times) takes only a few seconds to load and map into materialized-view records.
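Roughly what that looks like, in case it helps anyone — the table/column names and the three-field lap-time layout are made up here, and a real version needs header-row and quoting handling that I'm skipping:

-- the whole file ends up as one text value (loaded by the app, or e.g. pg_read_file())
create table raw_csv (doc text);

-- plain view that splits the blob into lines, and the lines into fields
create view laptimes_parsed as
select split_part(line, ',', 1)      as race_id,
       split_part(line, ',', 2)      as driver_id,
       split_part(line, ',', 3)::int as lap_ms
from raw_csv,
     regexp_split_to_table(raw_csv.doc, E'\n') as line
where line <> '';

-- cache the parsed rows
create materialized view laptimes as select * from laptimes_parsed;
-- refresh materialized view laptimes;   -- re-run after reloading the file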
The improvements from using a transaction are familiar to me and make intuitive sense. But can someone explain why using a prepared statement results in a roughly 5x improvement? Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
>Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
If you're inserting one million rows, even 5 microseconds of parse and planning time per query is five extra seconds on a job that could be done in half a second.
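To make that concrete, here is the SQL-level version of the idea using PostgreSQL's PREPARE/EXECUTE (SQLite does the same thing through its C API — sqlite3_prepare_v2 once, then bind/step/reset per row — rather than through SQL; the table here is just for illustration):

-- parse + plan happen once
prepare ins (integer, integer) as insert into laps (lap, ms) values ($1, $2);

-- each execution only binds values and reuses the already-planned statement
execute ins(1, 92345);
execute ins(2, 91872);

deallocate ins;

When the statement itself is trivial, that fixed per-row parse/plan overhead is a large fraction of the total work, which is why the speedup looks so big.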
One big caveat to this sort of test: with DBs you are very often trading throughput for latency.
Meaning it's certainly possible to get "the absolute fastest inserts possible", but at the same time you are impacting both other readers of that table AND writers across the DB.
This also gets messier when you are talking about multiple instances.
With SQLite in its default rollback-journal mode, nobody else can even read the database while it's being written [0], so your comment would be better directed at a conventional database server.
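For reference, changing that behavior is one statement — WAL mode ([0] below) lets readers keep going while a writer is active, with the trade-offs described in that doc:

pragma journal_mode=WAL;   -- persistent setting; readers no longer block on the writer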
I didn't have OP's patience to push it to 1B, though.
https://github.com/eatonphil/databases-intuition
[0] https://www.sqlite.org/wal.html
create table testInt(num Integer);

-- generate 1..1,000,000 with a recursive CTE and insert in a single statement
WITH n(n) AS (
  SELECT 1
  UNION ALL
  SELECT n+1 FROM n WHERE n < 1000000
)
insert into testInt select n from n;
On my machine (Mac mini, 3 GHz dual-core Intel Core i7, 16 GB 1600 MHz DDR3) this takes:

Run Time: real 0.362 user 0.357236 sys 0.004530