simonw · 3 years ago
I shared some notes on this on my blog, because I'm guessing a lot of people aren't quite invested enough to read through the whole paper: https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/
airstrike · 3 years ago
Thank you for this. Big fan of your blog and all your contributions to Django
thunderbong · 3 years ago
That's a very comprehensive review. Thank you.
badgerdb · 3 years ago
Indeed, an excellent summary.
sph · 3 years ago
I waited for a tl;dr but this is even better. Much appreciated.
polyrand · 3 years ago
Regarding hash joins, the SQLite documentation explains the absence of real hash tables [0]:

  SQLite constructs a transient index instead of a hash table in this instance 
  because it already has a robust and high performance B-Tree implementation at 
  hand, whereas a hash-table would need to be added. Adding a separate hash table 
  implementation to handle this one case would increase the size of the library 
  (which is designed for use on low-memory embedded devices) for minimal 
  performance gain.
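
You can see that transient index with EXPLAIN QUERY PLAN (made-up tables; when neither side of the join has a usable index, the plan may report an AUTOMATIC COVERING INDEX built on the fly):

  CREATE TABLE orders(order_id INTEGER PRIMARY KEY, customer_email TEXT, total REAL);
  CREATE TABLE customers(customer_email TEXT, name TEXT);  -- no index on the join column

  -- With no usable index, SQLite may build a transient (automatic) index
  -- for the join rather than a hash table.
  EXPLAIN QUERY PLAN
  SELECT c.name, SUM(o.total)
  FROM orders AS o
  JOIN customers AS c ON o.customer_email = c.customer_email
  GROUP BY c.name;
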
It's already linked in the paper, but for convenience, here's the code used in the paper [1]

The paper mentions implementing Bloom filters for analytical queries and explains how they're used. I wonder if this is related to the query planner enhancements that landed in SQLite 3.38.0 [2]:

  Use a Bloom filter to speed up large analytic queries.

[0]: https://www.sqlite.org/optoverview.html#hash_joins

[1]: https://github.com/UWHustle/sqlite-past-present-future

[2]: https://www.sqlite.org/releaselog/3_38_0.html

kpgaffney · 3 years ago
That's correct, the optimizations from this paper became available in SQLite version 3.38.0.

As we were writing the paper, we did consider implementing hash joins in SQLite. However, we ultimately went with the Bloom filter methods because they resulted in large performance gains for minimal added complexity (2 virtual instructions, a simple data structure, and a small change to the query planner). Hash joins may indeed provide some additional performance gains, but the question (as noted above) is whether they are worth the added complexity.
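
Concretely, the queries that benefit look roughly like the (made-up) star-style join below: the selective filters on the small dimension tables are turned into Bloom filters, and the scan of the large fact table probes those filters first, discarding most rows before any B-tree lookups for the join.

  -- Illustrative only; the table names are invented.
  SELECT d.year, s.region, SUM(f.amount) AS revenue
  FROM sales AS f
  JOIN dates  AS d ON f.date_id  = d.date_id
  JOIN stores AS s ON f.store_id = s.store_id
  WHERE d.year = 2021
    AND s.region = 'EU'
  GROUP BY d.year, s.region;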

gorjusborg · 3 years ago
I came for SQLite, got sold DuckDB.
manimino · 3 years ago
TFA appears to be about adapting SQLite for OLAP workloads. I do not understand the rationale. Why try to adapt a row-based storage system for OLAP? Why not just use a column store?
Comevius · 3 years ago
SQLite is significantly better than DuckDB at OLTP and at being a blob store, and it doesn't want to sacrifice those advantages, or compatibility, if OLAP performance can be improved independently. In my experience, for diverse workloads it is more practical to start with a row-based structure and incrementally transform it into a column-based one. Indeed, the paper suggests an approach along those lines that trades space for improved OLAP performance.
didgetmaster · 3 years ago
It is certainly possible to have a single system that can effectively process high volumes of OLTP traffic while at the same time performing OLAP operations. While there are systems that are designed to do one or the other type of operation well, very few are able to do both. https://www.youtube.com/watch?v=F6-O9v4mrCc
satyrnein · 3 years ago
It seems like one idea in there is to store it both ways automatically (the HE variant)! That might be better than manually and continually copying between your row store and your column store.
badgerdb · 3 years ago
Great discussion here. As one of the co-authors of the paper, here is some additional information.

If you need both transactions and OLAP in the same system, the prevalent way to deliver high performance on this (HTAP) workload is to make two copies of the data. This is what we did in the SQLite3/HE work (paper: https://www.cidrdb.org/cidr2022/papers/p56-prammer.pdf; talk: https://www.youtube.com/watch?v=c9bQyzm6JRU). That was quite clunky. The two-copy approach not only wastes storage but also complicates the code, and it would be very hard to maintain over time (we did not want to fork the SQLite code -- that is not nice).

So, we approached it in a different way and started to look at how we could get higher performance on OLAP queries while working as closely as possible within SQLite's native query processing and storage framework.

We went through a large number of options (many of them taken from the mechanisms we developed in an earlier Quickstep project: https://pages.cs.wisc.edu/~jignesh/publ/Quickstep.pdf) and concluded that the Bloom filter method (inspired by a more general technique called Look-ahead Information Passing: https://www.vldb.org/pvldb/vol10/p889-zhu.pdf) gave us the biggest bang for the buck.

There is a lot of room for improvement here, and getting high OLAP and transaction performance in a single-copy database system is IMO a holy grail that many in the community are working on.

BTW - the SQLite team, namely Dr. Hipp (that is a cool name), Lawrence and Dan are amazing to work with. As an academic, I very much enjoyed how deeply academic they are in their thinking. No surprise that they have built an amazing data platform (I call it a data platform as it is much more than a database system, as it has many hooks for extensibility).

spaniard89277 · 3 years ago
I've been learning SQL recently, with PostgreSQL and MySQL, in an online bootcamp here in Spain. So far it's been very comprehensive. We've touched on indexing and partitioning, using EXPLAIN ANALYZE to optimize performance, and I've implemented these strategies successfully on an active forum I own.
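
The kind of thing we practiced, roughly (simplified, with made-up forum tables):

  -- Check the plan, add a matching index, then check the plan again.
  EXPLAIN ANALYZE
  SELECT * FROM posts
  WHERE author_id = 42
  ORDER BY created_at DESC
  LIMIT 20;

  CREATE INDEX idx_posts_author_created ON posts (author_id, created_at DESC);

  EXPLAIN ANALYZE
  SELECT * FROM posts
  WHERE author_id = 42
  ORDER BY created_at DESC
  LIMIT 20;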

The SQL course gets almost no love from the students, but so far it has been the most useful and interesting one to me.

I was able to create some complex views (couldn't understand how to make materialized views in MySQL), but they were still very slow.
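
(I've since learned MySQL has no native materialized views; the usual workaround seems to be a plain summary table refreshed on a schedule, something like this sketch with made-up names -- the event scheduler has to be enabled for it to run.)

  CREATE TABLE thread_stats (
    thread_id BIGINT PRIMARY KEY,
    replies   INT,
    last_post DATETIME
  );

  -- Refresh hourly; REPLACE upserts by the primary key.
  CREATE EVENT refresh_thread_stats
  ON SCHEDULE EVERY 1 HOUR
  DO
    REPLACE INTO thread_stats
    SELECT thread_id, COUNT(*), MAX(created_at)
    FROM posts
    GROUP BY thread_id;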

I decided to copy most of this forum DB to DuckDB (with Knime now, until I know better), and optimization with DuckDB seems pointless. It's very, very fast. Less energy usage for my brain, and less time waiting. That's a win for me.

My current dataset is about 40 GB, so it's not huge, and sure, people here on HN would laugh at my "complex" views, but so far my concerns have shifted from optimizing queries to how to download the data I need without causing problems for the server.

dkjaudyeqooe · 3 years ago
In the real world, a relational database is the single most useful tool short of a compiler/interpreter. SQL is anachronistic but still works well, even if it's a pain.

My advice: avoid MySQL like the plague. PgSQL and SQLite are all you'll ever need and all you'll ever want.

aljgz · 3 years ago
> The SQL course gets almost no love from the students

This is a big early-career mistake. I've seen experienced developers use NoSQL in a project where SQL is clearly a great fit, then waste lots of manpower emulating things you get with SQL for free.

Of course, one's career can follow a successful path that never depends on SQL, but not learning SQL deeply is not a safe bet.

dkjaudyeqooe · 3 years ago
Most "programmers" just can't understand relational databases and SQL. It's too hard.

I've seen things you wouldn't believe. Random deadlocks in multi-billion transaction reporting systems. Atomic transactions split into multiple commits in banking applications. Copying rows between tables instead of setting a flag on a row. All because highly paid programmers are scared of RDBs.
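
The fix for that second one is literally textbook (a sketch):

  -- A transfer is ONE transaction: both updates commit together or not at all.
  BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
  COMMIT;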

twh270 · 3 years ago
I've got some developer PTSD from a previous project where the solution architect decided to use CosmosDB for the entire domain model that was very relational and very transactional, all because "NoSQL is easy to learn and allows rapid development".

Yeah it is, until you're trying to manually create and maintain relations between documents in different schemas owned by different microservices.

spaniard89277 · 3 years ago
I've read this again and again on this forum and in other dev communities, so I didn't hesitate. I can't say I love SQL, but it's not that bad, and databases look interesting to me.
wodenokoto · 3 years ago
Don't sell yourself short. I'm sure only a minority here knows what a complex view is.
spaniard89277 · 3 years ago
I have no real-world experience; I've seen things on Stack Overflow that I can hardly understand.
js8 · 3 years ago
Why can't SQLite have two different table storage engines for different tables, one row-oriented and the other column-oriented?
manigandham · 3 years ago
The same reasoning in the article applies: it's a lot of added complexity that isn't related to its core use as a general purpose in-process SQL database.

Usually OLAP at these scales is fast enough with SQLite, or you can use DuckDB if you need a portable format, before graduating to a full-on distributed OLAP system.

ryanworl · 3 years ago
Storage layout is not the primary issue here because IO throughput on commodity hardware has increased significantly in the last 10 years.

DuckDB is significantly faster than SQLite because it has a vectorized execution engine which is more CPU efficient and uses algorithms which are better suited for analytical queries. If you implemented a scan operator for DuckDB to read SQLite files, it would still have better performance.

1egg0myegg0 · 3 years ago
We have one of those! :-)

And yes it is fast!

https://github.com/duckdblabs/sqlite_scanner
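
From memory, usage is roughly the following (check the repo's README for the current syntax):

  INSTALL sqlite_scanner;
  LOAD sqlite_scanner;

  -- Query a single SQLite table directly from DuckDB...
  SELECT count(*) FROM sqlite_scan('forum.db', 'posts');

  -- ...or attach the whole file so all of its tables are visible.
  CALL sqlite_attach('forum.db');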

mwish · 3 years ago
I'm confused: in Figure 3, it seems the queries on the Raspberry Pi have higher latency than the same queries on the cloud server. Did I miss something?