apignotti · 3 years ago
A point of great pain for me in the use of HTTP range headers is that they are absolutely incompatible with HTTP compression.

This boils down to an ambiguity found in the early days of HTTP: when you ask for a byte range with compression enabled, are you referring to a range in the uncompressed or compressed stream?

The only reasonable answer, in my opinion: a range of bytes in the uncompressed version, which is then compressed on-the-fly to save bandwidth.

I suspect the controversial point was that, originally, HTTP gzip encoding was done by pre-compressing static files. On-the-fly encoding was probably too computationally expensive at the time.

Anyway, the result of this choice made decades ago is that it currently prevents effective use of ranged HTTP requests. As an example, it is fairly likely that chunks of the SQLite database could be significantly compressed, but both HTTP servers and CDNs refuse to compress "206 Partial Content" replies.

Interestingly, another use case of HTTP byte ranges is video streaming, which is not affected by this problem since video is _already_ compressed and there is no redundancy left to remove.
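
To make the compression point above concrete, here is a minimal sketch (the URL is a placeholder for any statically hosted SQLite file): even when the client advertises gzip support, the ranged reply typically comes back as 206 Partial Content with no Content-Encoding, i.e. uncompressed.

```python
# Sketch only: illustrates the typical server/CDN behaviour described above.
import urllib.request

req = urllib.request.Request(
    "https://example.com/data.sqlite3",      # placeholder URL
    headers={
        "Range": "bytes=0-4095",             # ask for the first 4 KiB page
        "Accept-Encoding": "gzip",           # compression is offered by the client...
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)                            # 206
    print(resp.headers.get("Content-Encoding"))   # ...but usually None: the range
                                                  # is served uncompressed
    print(len(resp.read()))                       # 4096 raw bytes
```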

blueflow · 3 years ago
> This boils down to an ambiguity found in the early days of HTTP

The RFC is clear on this: a different Transfer-Encoding is only for the transfer and does not affect the identity of the resource, while a different Content-Encoding does affect the identity of the resource.

It's not clear who did it first, but either browsers or web servers started using the Content-Encoding header (which means "this file is always compressed", like a tar.gz, and the user agent is not meant to uncompress it) with the meaning of Transfer-Encoding (which means the file itself is not compressed; compression is applied only for the sake of the transfer, and the user agent needs to uncompress it first). This is a violation of the HTTP spec.
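
For illustration, here is a minimal, hypothetical Python server that uses Content-Encoding the way the spec intends: the stored entity is the gzipped bytes, so the ETag (and any byte ranges) refer to those bytes. Per-hop compression would instead be signalled via Transfer-Encoding and leave the entity and its ETag untouched.

```python
# Sketch, not taken from the spec or the linked bug reports: a server whose
# entity *is* the gzipped bytes, which is what Content-Encoding is meant for.
import gzip
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = gzip.compress(b"hello world\n" * 1000)   # the stored, pre-compressed entity

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Encoding", "gzip")   # the entity itself is gzip
        # The ETag identifies the gzipped bytes; byte ranges would apply to them too.
        self.send_header("ETag", '"%s"' % hashlib.sha1(PAYLOAD).hexdigest())
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```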

This fuckup resulted in a lot of confusion and now requires an annoying number of workarounds:

- ETag re-generation [1]

- Ranges and Compression are not usable together

- Browsers' "Resume Download" feature removed

- RFC-compliant behavior being a bug [2]

- Some corporate proxies are still unpacking tarballs on the fly, breaking checksum verification

Contrast this with email, which has the encodings correctly sorted out, while using the same RFC 822 encoding technique as HTTP.

[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=39727

[2] https://serverfault.com/questions/915171/apache-server-cause...

tl;dr: the HTTP standard is not being implemented correctly

matharmin · 3 years ago
I think another issue with compressing partial content is that those compressed responses cannot be cached efficiently - the data would have to be compressed on the fly for every range requested. And while compression is not as computationally expensive as it used to be, it does still add overhead that could cost more than the bandwidth overhead of sending the data uncompressed.

There should be a workaround for this using a custom pre-compression scheme, instead of relying on HTTP for the compression. Blocks within the file will have to be compressed separately, and you'll need some kind of index mapping uncompressed block offsets to compressed block offsets (see the sketch below).

Unfortunately there doesn't seem to be a common off-the-shelf compression format that does this, and it means that you can't just use a standard SQLite file anymore. But it's definitely possible.
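
As a rough sketch of that scheme (hypothetical file names, plain per-block zlib rather than any standard container): compress the database in fixed-size blocks, record where each compressed block starts, and a reader can then turn an uncompressed offset into a single small HTTP range request plus a local decompress.

```python
# Sketch of a custom block-compression scheme with an offset index.
import json
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KiB of uncompressed data per block

def pack(src_path, dst_path, index_path):
    offsets = []   # compressed byte offset where each block starts
    pos = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(BLOCK_SIZE)
            if not block:
                break
            compressed = zlib.compress(block, 9)
            offsets.append(pos)
            dst.write(compressed)
            pos += len(compressed)
    offsets.append(pos)   # sentinel: end of the last block
    with open(index_path, "w") as f:
        json.dump({"block_size": BLOCK_SIZE, "offsets": offsets}, f)

def read_block(dst_path, index, block_no):
    start = index["offsets"][block_no]
    end = index["offsets"][block_no + 1]
    # Over HTTP this would be a request with "Range: bytes={start}-{end - 1}".
    with open(dst_path, "rb") as f:
        f.seek(start)
        return zlib.decompress(f.read(end - start))
```

The index itself is small and could be fetched once up front (or embedded in the page).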

matharmin · 3 years ago
Update: This is an example of a compression format that allows random access to 64 KB chunks, and is compatible with gzip: https://manpages.ubuntu.com/manpages/kinetic/en/man1/dictzip...
apignotti · 3 years ago
I only partially agree with the caching argument. It applies to any ranged request, not just compressed ones, and there are plenty of ways for origin servers to describe what and how to cache content. The origin may explicitly allow caching only when the access patterns are expected to repeat.
punnerud · 3 years ago
Why do you need HTTP compression when all of SQLite is compressed? Would it give any value?

I would think HTTP/2 is of greater value, to save TCP round-trip time.

nbevans · 3 years ago
SQLite isn't compressed. There is a commercial offering from the creators of SQLite that adds compression, but it's rarely seen in the wild to be honest.
rjmunro · 3 years ago
SQLite isn't compressed by default, but there are extensions like https://phiresky.github.io/blog/2022/sqlite-zstd/ that offer compression. It would be interesting to see how well that extension works with this project.
apignotti · 3 years ago
I do not think SQLite compresses data by default. Can you point to any documentation suggesting otherwise?
CharlesW · 3 years ago
"GitHub Pages" is presumably in the title as a hook, but in TFA the author notes that this will work for any statically-hosted website.
punnerud · 3 years ago
Not any statically-hosted website; it has to support byte range requests (the Range header).

This is normally used to pause/resume video, resume file downloads, etc.
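
A quick way to check (a sketch; the URL is a placeholder) is a HEAD request looking for "Accept-Ranges: bytes":

```python
# Sketch: check whether a static host advertises byte-range support.
import urllib.request

req = urllib.request.Request("https://example.github.io/db.sqlite3", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Accept-Ranges"))   # "bytes" means range requests
                                               # (and pause/resume) should work
```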

CharlesW · 3 years ago
Good note. Byte-range requests are supported by default for static content on all web servers, so to your point, this won't work if you've explicitly turned it off.
svnpenn · 3 years ago
However, if you check TFA, the author notes that GitHub is strictly required, so I don't know what to think.
langsoul-com · 3 years ago
Didn't the author mention it uses WASM and Emscripten? Both of which can be compiled and run in browsers.

I think it's possible everywhere, but you'd just need to do more than copy-paste the author's work.

chrisaycock · 3 years ago
Prior discussion (May 2, 2021):

https://news.ycombinator.com/item?id=27016630

AceJohnny2 · 3 years ago
It's funny, I was wondering whether this replicated Simon Willison's work on Datasette [1], but I guess I got things confused because Simon himself was the first to comment on that older post about how clever this was.

[1] https://simonwillison.net/2018/Oct/4/datasette-ideas/

crazysim · 3 years ago
There's a web browser version of Datasette called datasette-lite, which runs on Python ported to WASM with Pyodide and can load SQLite databases.

I grafted the enhanced lazyFile implementation from Emscripten, and then from this implementation, into datasette-lite relatively recently as a curious test. I threw in an 18 GB CSV from CA's unclaimed property records here

https://www.sco.ca.gov/upd_download_property_records.html

into an FTS5 SQLite database, which came out to about 28 GB after processing (rough sketch of that conversion step at the end of this comment).

POC, non-merging Log/Draft PR for the hack:

https://github.com/simonw/datasette-lite/pull/49

You can run queries through datasette-lite if you URL-hack into it and go straight to the query dialog. Browsing is kind of a dud at the moment, since Datasette runs a count(*), which downloads everything.

Example: https://datasette-lite-lab.mindflakes.com/index.html?url=htt...

Still, not bad for a $0.42/mo hostable, cached, CDN'd read-only database. It's on Cloudflare R2, so there are no bandwidth costs.
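
For anyone curious about the CSV-to-FTS5 step mentioned above, the conversion looks roughly like this (a generic sketch with placeholder file and column names, not the actual processing script):

```python
# Sketch: load a CSV into a SQLite FTS5 full-text index.
# File and column names are placeholders, not the real unclaimed-property schema.
import csv
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE VIRTUAL TABLE records USING fts5(owner_name, property_desc)")

with open("records.csv", newline="") as f:
    reader = csv.DictReader(f)
    conn.executemany(
        "INSERT INTO records (owner_name, property_desc) VALUES (?, ?)",
        ((row["OWNER_NAME"], row["PROPERTY_DESC"]) for row in reader),
    )

conn.commit()
conn.close()

# Full-text queries then look like:
#   SELECT * FROM records WHERE records MATCH 'smith';
```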

langsoul-com · 3 years ago
The author mentioned that running a web server risks it going down because they forgot to pay or something, and then uses "free" services instead.

Unlike paying, which you can easily set to auto-pay, free services can go up and down whenever they wish. In fact, it's probably in their terms of service. The more generous the free service, the more likely it is to get cut down and become unusable later, or just more expensive than the alternatives. Like the Heroku free tier.

Paying money means there's an actual incentive and a legal contract for the company to provide the service you paid for.

LukeLambert · 3 years ago
Static sites benefit from being low-maintenance and highly portable. You can easily switch from generous free services (Cloudflare Pages, Netlify, Vercel, Firebase, etc.) to highly-available paid services (Amazon S3, Cloudflare R2, Backblaze B2, Google Cloud Storage, etc.). Keeping a VM running and updated is more work (and sometimes higher cost).
gaudat · 3 years ago
(Read only)
mr_toad · 3 years ago
In theory it should be possible to write back to static storage, but you'd have to deal with authentication and file management, caching writes, consistency/atomicity, etc. It'd be a lot of work.
pfundstein · 3 years ago
Perhaps you could submit updates to the database as diffs and commit them via normal git channels
pfundstein · 3 years ago
Which is still useful for querying a large database without having to download the entire thing.
altilunium · 3 years ago
Should be included in the title then.
floodle · 3 years ago
Seafowl (https://www.splitgraph.com/blog/seafowl), which was only launched a couple of days ago, is a database that can be hosted statically
mildbyte · 3 years ago
Hey, I co-built Seafowl, thanks for the plug!

To clarify, Seafowl itself can't be hosted statically (it's a Rust server-side application), but it works well for statically hosted pages. It's basically designed to run analytical SQL queries over HTTP, with caching by a CDN/Varnish for SQL query results. The Web page user downloads just the query result rather than required fragments of the database (which, if you're running aggregation queries, might have to scan through a large part of it).

DishyDev · 3 years ago
This is a cool project, and really well presented.

I've looked into doing something similar in the past, trying to query a large dataset through ranged lookups on a static file server. I ran into the issue of some hosts not providing support for ranged lookups, or the CDNs I was using having behaviour like only fetching from the origin data source in 8 MB blocks, meaning there was quite a lot of latency when doing lots of small reads across a massive file.

It would have been interesting to find out a bit more about these topics, and see a bit more on performance.

maxpert · 3 years ago
This can potentially become very interesting: if all these pages are pushed to the edge via a CDN, then browsers can essentially have GraphQL built on top of SQLite on the client side.