No file system attributes or metadata on records which also means no (xattrs/fattrs) being written or updated, no checks to see if it's a physical file or a pipe/symlink, no permission checks, no block size alignment mismatches, single open command.
Makes sense when you consider you're throwing out functionality and disregarding general purpose design.
If you use a fuse mapping to SQLite, mount that directory and access it, you'd probably be very similar performance (perhaps even slower) and storage use as you'd need to add additional columns in the table to track these attributes.
I have no doubt that you could create a custom tuned file system on a dedicated mount with attributes disabled, minimized file table and correct/optimized block size and get very near to this perf.
Let's not forget the simplicity of being able to use shell commands (like rsync) to browse and manipulate those files without running the application or an SQL client to debug.
Makes sense for developers to use SQLite for this use case though for an appliance-type application or for packaged static assets (this is already commonplace in game development - a cab file is essentially the same concept)
For example, Ceph uses RocksDB as their metadata DB (and it's recommend to put it) directly on a block device, with the WAL on yet another separate raw device
(smaller block size, 100,000 inode file table entries (tuned to the number of blobs), no journal, no checksumming, no extended file attributes, use smaller integer file offset IDs, 32 bit padded vs 64 bit)
Then mount it and run the same test.
You could go even further and tune fopen BUFSIZE to be no greater than 12,000 bytes. You can even create this mount on a file inside your existing mount... which is essentially akin to having an sqlite file without needing a client library to read/write to it.
Anyway - if the purpose is to speed up reads and save disk space on small blob files, there is little need to ditch the file system and it's many many upsides.
Let's also note the 4x speed increase on windows 10, once again underlining just how slow windows filesystem calls are, when compared to direct access, and other (kernel, filesystem) combinations.
NTFS is really horrible handling many small files. When compiling/watching node modules (easily 10-100k files), we've seen a 10x size difference internally (same hardware, just different OSes). At some point that meant a compile time difference of 10-30 sec vs 6-10 min. Not fun.
Their "windows dev drive" setup addresses this. I haven't tested it myself but I saw a couple of inexpertly executed tests showing significant performance gains. I honestly have no idea if my compile times are quicker.
With the right code, NTFS is not much slower than ext4, for example. Nearly 3% for tar.gz and more than 30% for a heavily multi-threaded use like Pack.
I did some research in a database research lab, and we had a lot of colleagues working on OS research. It was always interesting to compare the constraints and assumptions across the two systems. I remember one of the big differences was the scale of individual records we expected to be working with, which in turn affected how memory and disk was managed. Most relational databases are very much optimized for small individual records and eventual consistency, which allows them to cache a lot more in memory. On the other hand, performance often drops sharply with the size of your rows.
This is precisely why I'm considering appending to a sqlite DB in WAL2 mode instead of plain text log files. Almost no performance penalty for writes but huge advantages for reading/analysis. No more Grafana needed.
Careful, some people will be along any second pointing out your approach limits your ability to use "grep" and "cat" on your log after recovering it to your pdp-11 running in your basement. Also something about the "Unix philosophy" :p
Seriously though, I think this is a great idea, and would be interested in how easy it is to write sqlite output adaptors for the various logging libraries out there.
> Careful, some people will be along any second pointing out your approach limits your ability to use "grep" and "cat" on your log after recovering it
I wish Splunk and friends would have an interface like that. Sure it does basic grep, and it is a much more powerful language, but sometimes you just needed some command line magic to find what you wanted.
I've been doing this for years. I keep SQLite log databases at varying grains depending on the solution. One for global logs, one per user/session/workflow, etc. I've also done things like in-memory SQLite trace databases that only get written to disk if an exception occurs.
SQLite doesn't look like a good fit for large logs - nothing can beat liner write at least on HDD and with plain text logs you will have it (though linear write of compressed data even better but rare software supports it out of the box). With SQLite I would expect more write requests for the same stream of logs (may be not much more). Reading analysis will be faster than using grep over plain text log only if you'll create indices which add write cost (and space overhead).
ClikcHouse works really well when you need to store and analyze large logs but compare to SQLite it would require to maintain a server(s). There is DuckDB which is embedded like SQLite and it could be a better than SQLite fit for logs but I have no experience with DuckDB.
I recently had the idea to record every note coming out of my digital piano in real-time. That way if I come up with a good idea when noodling around I don’t have to hope I can remember it later.
I was debating what storage layer to use and decided to try SQLite because of its speed claims — essentially a single table where each row is a MIDI event from the piano (note on, note off, control pedal, velocity, timestamp). No transactions, just raw inserts on every possible event. It so far has worked beautifully: it’s performant AND I can do fun analysis later on, e.g. to see what keys I hit more than others or what my average note velocity is.
If you play ten note chords — one for each finger — in quick succession, that can rack up a lot of inserts in short time period (say, medium-worst case, 100Hz, for playing a chord like that five times per second, counting both “on” and “off” events).
It’s also worth taking into consideration damper pedal velocity changes. When you go from “off” (velocity 0) to fully “on” and depressed (velocity 127), a lot of intermediate values will get fired off at high frequency.
Ultimately though you are right; it’s not enough frequency of information to overload SQLite (or a file system), probably by several orders of magnitude.
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database
When something built on top of the filesystem is "faster" than the filesystem, it just means "when you use the filesystem in a less-than-optimal manner, it will be slower than an app that uses it in a sophisticated manner." An interesting point, but perhaps obvious...
To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so serialization kept breaking as my dependencies upgraded.
As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
Also, if you want to interact with other systems/services, then you need files. sqlite can't be read over NFS (e.g. AWS EFS) and by design it has no server for requests. so i found myself caching files to disk for export/import.
SQLite has some settings for handling parallel requests from multiple services, but when I experimented with them I always wound up with a locked db due to competing requests.
For one reason or another, you will end up with hybrid (blob/file) ways of persisting data.
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
This is a silly argument, there's no reason to recreate the full hierarchy. If you have something like this:
CREATE TABLE files (path TEXT UNIQUE COLLATE NOCASE);
Then you can do this:
SELECT path FROM files WHERE path LIKE "./some/path/%";
This gets you everything in that path and everything in the subpaths (if you just want from the single folder, you can always just add a `directory` column). I benchmarked it using hyperfine on the Linux kernel source tree and a random deep folder: `/bin/ls` took ~1.5 milliseconds, the SQLite query took ~3.0 milliseconds (this is on a M1 MacBook Pro).
The reason it's fast is because the table has a UNIQUE index, and LIKE uses it if you turn off case-sensitivity. No need to faff about with hierarchies.
EDIT: btw, I am using SQLite for this purpose in a production application, couldn't be happier with it.
> To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so things kept breaking as my dependencies upgraded.
I'm confused what you mean by this. Files also only contain bytes, so that serialization/deserialization has to happen anyway?
Also note that SQLite does have an incremental blob I/O API (sqlite3_blob_xxx), so unlike most other RDBMS there is no need to read/write blobs as a contiguous piece of memory - handling large blobs is more reasonable than in those. Though the blob API is still separate from normal querying.
Do you have or know of a clear example of how to do this? I have to ask because I spent half of yesterday trying to make it work. The blob_open command wouldn't work until I set a default value on the blob column and then the blob_write command wouldn't work because you can't resize a blob. It was very weird but I'm pretty confident it's because I'm missing something stupid.
MS SQL Server: READTEXT, WRITETEXT, substring, UPDATE.WRITE
Oracle: DBMS_LOB.READ, DBMS_LOB.WRITE
PG: Large Objects
Most of my experience is with SQL server and it can stream large objects incrementally through a web app to browser without loading the whole thing into memory at 100's Mbytes/sec on normal hardware.
> so i found myself caching files to disk for export/import
Could use a named pipe.
I’m reminded of what I often do at the shell with psub in fish. psub -f creates and returns the path to a fifo/named pipe in $TMPDIR and writes stdin to that; you’ve got a path but aren’t writing to the filesystem.
e.g. you want to feed some output to something that takes file paths as arguments. We want to compare cmd1 | grep foo and cmd2 | grep foo. We pipe each to psub in command substitutions:
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path in sqlite and work with it in Python. E.g. `os.listdir(my_folder)`
This makes total sense and it is also "frowned upon" by people who take a too purist view of databases
(Until it comes a time to backup, or extract files, or grow a hard drive etc and then you figure out how you shot yourself in the foot)
> As my app matured, I found that I often wanted hierarchical folder-like functionality.
In the process of prototyping some "remote" collaborating file systems, I always wonder whether it is a good idea maintaining a flat map from path concatenated with "/" like an S3 to the file content, in term of efficiency or elegancy.
Makes sense when you consider you're throwing out functionality and disregarding general purpose design.
If you use a fuse mapping to SQLite, mount that directory and access it, you'd probably be very similar performance (perhaps even slower) and storage use as you'd need to add additional columns in the table to track these attributes.
I have no doubt that you could create a custom tuned file system on a dedicated mount with attributes disabled, minimized file table and correct/optimized block size and get very near to this perf.
Let's not forget the simplicity of being able to use shell commands (like rsync) to browse and manipulate those files without running the application or an SQL client to debug.
Makes sense for developers to use SQLite for this use case though for an appliance-type application or for packaged static assets (this is already commonplace in game development - a cab file is essentially the same concept)
For example, Ceph uses RocksDB as their metadata DB (and it's recommend to put it) directly on a block device, with the WAL on yet another separate raw device
https://docs.ceph.com/en/latest/rados/configuration/bluestor...
mke2fs -t ext4 -b 1024 -N 100000 -O ^has_journal,^uninit_bg,^ext_attr,^huge_file,^64bit [/dev/sdx]
(smaller block size, 100,000 inode file table entries (tuned to the number of blobs), no journal, no checksumming, no extended file attributes, use smaller integer file offset IDs, 32 bit padded vs 64 bit)
Then mount it and run the same test.
You could go even further and tune fopen BUFSIZE to be no greater than 12,000 bytes. You can even create this mount on a file inside your existing mount... which is essentially akin to having an sqlite file without needing a client library to read/write to it.
Anyway - if the purpose is to speed up reads and save disk space on small blob files, there is little need to ditch the file system and it's many many upsides.
Related ongoing discussion, if someone cares to test this:
https://news.ycombinator.com/item?id=41085856
https://unixdigest.com/articles/battle-testing-php-fopen-sql...
Has been on HN lately..
Let's also note the 4x speed increase on windows 10, once again underlining just how slow windows filesystem calls are, when compared to direct access, and other (kernel, filesystem) combinations.
Deleted Comment
https://forum.lazarus.freepascal.org/index.php/topic,66281.m...
Deleted Comment
Seriously though, I think this is a great idea, and would be interested in how easy it is to write sqlite output adaptors for the various logging libraries out there.
And they won’t be wrong.
I wish Splunk and friends would have an interface like that. Sure it does basic grep, and it is a much more powerful language, but sometimes you just needed some command line magic to find what you wanted.
https://git.sr.ht/~martijnbraam/logbookd
Although I'm not sure it uses WAL2 mode, but that should be a trivial change.
https://lnav.org/
One could flip it around and store logs in a multimedia container, but then you won't have nice indices like with sqlite, just the one big time index
ClikcHouse works really well when you need to store and analyze large logs but compare to SQLite it would require to maintain a server(s). There is DuckDB which is embedded like SQLite and it could be a better than SQLite fit for logs but I have no experience with DuckDB.
I am not sure which one would be better for logs, I would need to play around with it. But i am not sure if SQLite wouldn’t be a better fit.
I was debating what storage layer to use and decided to try SQLite because of its speed claims — essentially a single table where each row is a MIDI event from the piano (note on, note off, control pedal, velocity, timestamp). No transactions, just raw inserts on every possible event. It so far has worked beautifully: it’s performant AND I can do fun analysis later on, e.g. to see what keys I hit more than others or what my average note velocity is.
It’s also worth taking into consideration damper pedal velocity changes. When you go from “off” (velocity 0) to fully “on” and depressed (velocity 127), a lot of intermediate values will get fired off at high frequency.
Ultimately though you are right; it’s not enough frequency of information to overload SQLite (or a file system), probably by several orders of magnitude.
I wonder how io_uring compares.
https://mesos.apache.org/documentation/latest/recordio/
Deleted Comment
I've used SQLite blob fields for storing files extensively.
Note that there is a 2GB blob maximum: https://www.sqlite.org/limits.html
To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so serialization kept breaking as my dependencies upgraded.
As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
Also, if you want to interact with other systems/services, then you need files. sqlite can't be read over NFS (e.g. AWS EFS) and by design it has no server for requests. so i found myself caching files to disk for export/import.
SQLite has some settings for handling parallel requests from multiple services, but when I experimented with them I always wound up with a locked db due to competing requests.
For one reason or another, you will end up with hybrid (blob/file) ways of persisting data.
This is a silly argument, there's no reason to recreate the full hierarchy. If you have something like this:
Then you can do this: This gets you everything in that path and everything in the subpaths (if you just want from the single folder, you can always just add a `directory` column). I benchmarked it using hyperfine on the Linux kernel source tree and a random deep folder: `/bin/ls` took ~1.5 milliseconds, the SQLite query took ~3.0 milliseconds (this is on a M1 MacBook Pro).The reason it's fast is because the table has a UNIQUE index, and LIKE uses it if you turn off case-sensitivity. No need to faff about with hierarchies.
EDIT: btw, I am using SQLite for this purpose in a production application, couldn't be happier with it.
https://www.postgresql.org/docs/current/ltree.html
I'm confused what you mean by this. Files also only contain bytes, so that serialization/deserialization has to happen anyway?
For example, `pd.read_parquet` accepts "file-like objects" as its first argument: https://pandas.pydata.org/docs/reference/api/pandas.read_par...
However, this is not the case for fringe tools
Also note that SQLite does have an incremental blob I/O API (sqlite3_blob_xxx), so unlike most other RDBMS there is no need to read/write blobs as a contiguous piece of memory - handling large blobs is more reasonable than in those. Though the blob API is still separate from normal querying.
MS SQL Server: READTEXT, WRITETEXT, substring, UPDATE.WRITE
Oracle: DBMS_LOB.READ, DBMS_LOB.WRITE
PG: Large Objects
Most of my experience is with SQL server and it can stream large objects incrementally through a web app to browser without loading the whole thing into memory at 100's Mbytes/sec on normal hardware.
Could use a named pipe.
I’m reminded of what I often do at the shell with psub in fish. psub -f creates and returns the path to a fifo/named pipe in $TMPDIR and writes stdin to that; you’ve got a path but aren’t writing to the filesystem.
e.g. you want to feed some output to something that takes file paths as arguments. We want to compare cmd1 | grep foo and cmd2 | grep foo. We pipe each to psub in command substitutions:
which expands to something like As long as the tool doesn’t seek around the file. (caveats are numerous enough that without -f, psub uses regular files.)ksh and bash too have this as <(…) and >(…) under Process Substitution.
An example from ksh(1) man page:
diff <(cmd1 | grep foo) <(cmd2 | grep foo)
This makes total sense and it is also "frowned upon" by people who take a too purist view of databases
(Until it comes a time to backup, or extract files, or grow a hard drive etc and then you figure out how you shot yourself in the foot)
In the process of prototyping some "remote" collaborating file systems, I always wonder whether it is a good idea maintaining a flat map from path concatenated with "/" like an S3 to the file content, in term of efficiency or elegancy.
(1) Slim table "items"
- id / parent_id / kind (0/1 file folder) integer
- name text
- Maybe metadata.
(2) Separate table "content"
- id integer
- data blob
There you have file-system-like structure and fast access times (don't mix content in the first table)
Or, if you wish for deduplication or compression, add item_content (3)
https://medium.com/@rishabhdevmanu/from-trees-to-tables-stor...