I always think discussions like this should start with the following: A database with no indexes is slow. Finding a particular row will require a linear search. Adding an index on a column means that you're optimizing finding rows by the values of that column. Hence, an index is really a mapping from a particular column's value to the position of that row in the db (very likely an 8-byte integer that is the offset of the row in question within the file).
This all means we can implement indexes as b-trees where the keys are the values of a particular column and the value is the file offset of the row with that value. You could envision a simple db format where indexes and the main row file are stored in separate files. In such a database you could drop an index simply by deleting the index's file (or add one by creating it). The main row file has all of the data, so indexes can be recreated if necessary (at some expense, of course).
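For illustration, a minimal sketch of that layout in Python. The sorted list plus bisect stands in for a real B-tree, and the one-JSON-row-per-line file format is purely an assumption for the example:

```python
import bisect
import json

def build_index(row_file_path, column):
    """Scan the row file once, collecting (column value, byte offset) pairs sorted by value."""
    entries = []
    with open(row_file_path, "rb") as f:
        offset = 0
        for line in f:                       # one JSON row per line, for simplicity
            row = json.loads(line)
            entries.append((row[column], offset))
            offset += len(line)
    entries.sort()
    return entries

def lookup(row_file_path, index, value):
    """Return rows whose indexed column equals `value`, without scanning the whole file."""
    i = bisect.bisect_left(index, (value,))  # first index entry with this value
    rows = []
    with open(row_file_path, "rb") as f:
        while i < len(index) and index[i][0] == value:
            f.seek(index[i][1])              # jump straight to the row's offset
            rows.append(json.loads(f.readline()))
            i += 1
    return rows
```

Dropping the index is just deleting `entries` (or its file on disk); rebuilding it is another full scan.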
> A database with no indexes is slow. Finding a particular row will require a linear search.
The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern. "Index = fast" is a painfully pernicious untrue meme. It's absolutely true for application tables with queries only touching a few rows. On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
I've seen devs shotgun indexes at tables to fix performance (done it myself too) but the real test of index understanding is when that doesn't work.
> On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
> "Index = fast" is a painfully pernicious untrue meme.
I believe that a lack of internalization of that meme (regardless of how true it is) can be a cause of real trouble.
I was working in a team where the Java devs simply didn't bother to put indexes on tables because they were small (around 100 rows or so). When I (a JS dev) pestered them long enough to finally do it, the whole app suddenly got super snappy, and they were very thankful, as it happened just as degrading performance was causing a lot of gloom.
Depends on the data, and the index. A covering index would almost certainly be faster. But you said analytics, which implies large results from huge datasets, so it’s unlikely to fit into memory in the first place.
> The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern
We have a winner. But when looking at SQL tables/Views/Stored Procedures, the data is also stored in order in memory; in effect you have a master database and files on disk, with sorted databases and files in memory for faster access.
No it's not… if all you do is write to it. In fact, it's the fastest possible database for such a case.
Indexes are pure redundancy - they duplicate data that is already in the base table, and they must be maintained during writes.
But they can make reads so much faster, if the data access pattern can utilize them. The key is to identify access patterns which justify the price of the index.
In some cases they can make things worse. It's worth remembering that the query optimizer looks not only at indexes, but also at statistics and estimated operation costs. If your statistics are out of date and your criteria are not specific enough (e.g. they match 80% of rows), then an index is going to slow the query down: it needs to traverse the index to get the row IDs, fetch all the blocks containing them, read those, and filter out irrelevant rows. A plain full table scan is probably going to be faster (thanks to linear reads).
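A rough back-of-envelope of that trade-off, with all numbers invented for illustration:

```python
# Low selectivity: the index path degenerates into one random block fetch per
# matching row (worst case, ignoring caching and rows sharing a block), while
# a full scan reads each block exactly once, sequentially.
rows = 10_000_000
rows_per_block = 100
selectivity = 0.80                              # criteria match 80% of rows

full_scan_reads = rows / rows_per_block         # every block once, sequential I/O
index_path_reads = rows * selectivity           # worst case: one block fetch per matching row

print(f"full scan:  {full_scan_reads:,.0f} sequential block reads")      # 100,000
print(f"index path: {index_path_reads:,.0f} mostly random block reads")  # 8,000,000
```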
I've implemented a high-performance btree this way in the past, where each table and each index were separate files (with append-only writes for concurrency). It worked pretty well and wasn't hard to get right, but it had some downsides (in particular, the kernel seemed to struggle with all the paging).
Note that the real primitive is "find nearest [with hint]", which has to be the #1 thing I miss in Python.
For B-trees it's only going to be a significant win if the chance of intersection is small, less than about `1/(nodes_per_block)`. For binary trees it's a much bigger win since that becomes 1/2 and binary trees are horrible on cache.
Hmm, can you efficiently intersect an arbitrary number of B-trees using the same idea as a heap-merge, but with a max heap instead of a min heap? You'd still have to iterate over all of them on a match, but as long as two input trees don't have an intersection, you don't have to look at the others at all ... or does that become equivalent to just doing them in series?
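For what it's worth, here is a sketch of that "find nearest [with hint]" style intersection on plain sorted Python lists (standing in for ordered index scans). It handles any number of inputs and leaps everyone forward as soon as one input disagrees:

```python
import bisect

def intersect_sorted(lists):
    """Leapfrog-style intersection of several sorted lists of unique keys."""
    if not lists or not lists[0]:
        return []
    pos = [0] * len(lists)                 # per-list cursor, i.e. the "hint"
    out = []
    candidate = lists[0][0]
    while True:
        matched = True
        for i, lst in enumerate(lists):
            # find the nearest key >= candidate, starting from this list's hint
            pos[i] = bisect.bisect_left(lst, candidate, lo=pos[i])
            if pos[i] == len(lst):
                return out                 # one input exhausted: no more matches possible
            if lst[pos[i]] != candidate:
                candidate = lst[pos[i]]    # leap everyone forward to this larger key
                matched = False
                break
        if matched:
            out.append(candidate)
            pos[0] += 1
            if pos[0] == len(lists[0]):
                return out
            candidate = lists[0][pos[0]]

print(intersect_sorted([[1, 3, 5, 7, 9], [3, 4, 5, 9, 12], [2, 3, 5, 9, 10]]))  # [3, 5, 9]
```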
A nuance that is important here is that not all accesses are equal within the context of disk reads. B-trees are designed to minimize block reads, not memory operations.
I guess there are worst case scenarios with evenly spaced intersections spread out exactly one per block, but in terms of block reads it fundamentally doesn't matter how you intersect two such trees, you'll still have to read all blocks, and that is orders of magnitude slower than comparing the values within.
I think the tree structure can be considered cached in real world scenarios; not really relevant to the performance you'll get.
In an SSD, a write operation can only be done when the page is already erased. However, the unit of read/write operations is a page, while the unit of erase operations is a block. That means that for a disk write, a naive implementation would need to read the whole block, erase the block, then write the updated data back to the block, which is unacceptable. Furthermore, blocks should wear out uniformly; otherwise, the SSD would lose capacity.
To tackle these problems, SSDs introduce a Flash Translation Layer (FTL), which builds the illusion of a random-access device. To achieve this, the FTL employs an approach very similar to LSM trees: writes always go to new, already-erased pages, while garbage collection (GC) reclaims outdated data in the background. The FTL needs to keep a map from the user's logical addresses to physical addresses on the SSD, both in memory and persistently.
So to answer your question: why are sequential writes faster than random writes on SSDs? Because the address map is smaller, since new data is consecutive in larger chunks; garbage collection is simpler, and only metadata needs to be updated. Erasing a block is required anyway.
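A toy model of that logical-to-physical mapping, hugely simplified and purely illustrative (real FTLs are far more elaborate), but it shows why scattered overwrites leave every block half-live while sequential overwrites free whole blocks at once:

```python
PAGES_PER_BLOCK = 4

class ToyFTL:
    def __init__(self, blocks):
        self.l2p = {}                                  # logical page -> (block, page)
        self.live = {b: set() for b in range(blocks)}  # live physical pages per block
        self.free = [(b, p) for b in range(blocks) for p in range(PAGES_PER_BLOCK)]

    def write(self, logical_page):
        old = self.l2p.get(logical_page)
        if old is not None:                            # overwrite: old copy becomes garbage
            self.live[old[0]].discard(old[1])
        block, page = self.free.pop(0)                 # always write to a fresh erased page
        self.l2p[logical_page] = (block, page)
        self.live[block].add(page)

    def gc_cost(self, block):
        """Live pages that must be copied out before this block can be erased."""
        return len(self.live[block])

seq, rnd = ToyFTL(blocks=8), ToyFTL(blocks=8)
for ftl in (seq, rnd):
    for lp in range(16):
        ftl.write(lp)                                  # initial fill: blocks 0-3 all live

for lp in range(8):                                    # sequential update of pages 0..7
    seq.write(lp)
for lp in range(0, 16, 2):                             # scattered update of every other page
    rnd.write(lp)

print([seq.gc_cost(b) for b in range(4)])  # [0, 0, 4, 4] - two blocks are pure garbage, erase for free
print([rnd.gc_cost(b) for b in range(4)])  # [2, 2, 2, 2] - every block needs live pages copied first
```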
In practice the difference between random and sequential reads is much smaller, with the caveat that all SSD reads are on some level block operations, so locally sequential access is what makes the big difference.
Though with mmap the OS may do speculative async readahead, depending on madvise.
Like sibling said, physical pages are typically much bigger than logical pages in SSDs. Also drives do prediction and sequential access is easy to predict.
Not just SSDs: it's faster in RAM too, for the same reasons (page-based access to caches). Basically you should use a B-tree whenever you need the functionality of an STL map.
I'd assume that at some level of IO, blocks/pages/whole buffers are being read, as opposed to reading bytes one by one. So sequential access takes advantage of this, I suppose.
> It was invented over 40 years ago, yet it is still employed by the majority of modern databases.
I wonder how true this is for the top commercial engines (Oracle, MS, IBM, etc.), whose internals are closed source and proprietary. Even a decade ago my experience performance-testing Exadata implied some exotic magic at work, i.e. lookups were way faster than the expected O(log n). More recently, while testing SQL Server's ability to join hundreds of tables together, the performance was _way_ in excess of what I expected. I can't imagine these things have internals all that similar to, say, the B+Tree inside MySQL.
A lot of this comes down to query planners being really good at finding clever ways of doing scans and intersections of indexes, the tables themselves having indexes with a bunch of specialized representations, and the query execution doing very intelligent data traversal with partitioning or even multi-threading.
If you sit down and think carefully about your data you can often make even a simple bare-bones B-tree perform fantastically for a query, well in excess of what you'd get out of mysql or sqlite (which are already pretty fast).
I think the most important takeaway is that the old-school RDBMS products are probably more than enough for whatever you are trying to accomplish. Query planners in these are indistinguishable from magic, as anything forged in the fires of a million production environments for a few decades should be.
I've been playing around with an idea that involves putting sql server at the heart of a game engine, and it is turning into one of the biggest rabbit holes I've ever explored. I thought latency/jitter would be more of a problem but it simply isn't.
On disk, SQL Server uses only b-trees, unless using the new ColumnStore format.
In memory during a query it can use temporary indexes of other types, primarily hash tables and bitmaps.
Its performance on ad-hoc complex queries is about as good as it gets, few if any other RDBMS can beat its performance, but under the hood it’s still mostly just doing joins on b-trees!
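For context, a temporary hash "index" during a join amounts to roughly this; an illustrative sketch only, not SQL Server's actual implementation:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two row sets on an equality condition via an in-memory hash table."""
    table = {}
    for row in build_rows:                         # build phase: hash the (smaller) input
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                         # probe phase: one lookup per row
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}]
orders = [{"order_id": 10, "user_id": 2}, {"order_id": 11, "user_id": 1}]
print(list(hash_join(users, orders, "user_id", "user_id")))
```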
> ..under the hood it’s still mostly just doing joins on b-trees!
I could see the on-disk format needing to be simple and stable, but once the data's buffered, who knows what structures and algorithms these proprietary engines are using? You would need to have done some reverse engineering, or had hands-on details from the inside, which presumably comes with legal consequences for leaking them.
Amidst all these great discussions, I would like to point out that this article really helped me get my head around a B-tree and why it's a great optimization on top of the binary search tree. Thanks to the author!
Very poorly, is my understanding. There are various sequential UUID-like schemes that are more sortable because they are prefixed with bits of physical time. Off the top of my head, ULIDs and also UUID v7.
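A sketch of the underlying idea, not a spec-compliant ULID or UUID v7, just the time-prefix trick: keys generated around the same time sort next to each other, so inserts land in the same hot part of the tree instead of in random leaves.

```python
import os
import time

def time_prefixed_key():
    """48-bit millisecond timestamp followed by 80 random bits, as a hex string."""
    millis = int(time.time() * 1000)
    return f"{millis:012x}" + os.urandom(10).hex()

keys = [time_prefixed_key() for _ in range(3)]
print(sorted(keys))   # near-identical prefixes: new keys cluster in the same part of the key space
```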
Probably poorly by default, but you could use a hash of the uuid as a key (to try and more evenly spread the entropy) or key it off a suffix instead of a prefix since iirc that's where most of the entropy lives.
In practice if you want good performance and scalability it's important to select keys well.
> I've seen devs shotgun indexes at tables to fix performance (done it myself too) but the real test of index understanding is when that doesn't work.
That's when you bring a BRIN to the table. :-)
> /dev/null will suffice.
And if you ever need to read anything even once, a database without indexes is slow
then …
> the kernel seemed to struggle …
What was the struggle? If it's performance, doesn't that contradict your earlier statement?
Genuinely interested in what the issue was, not trying to be a pedant
They already envisioned it in DBF and CDX (dBASE/FoxPro, where the table data and its compound indexes live in separate files).
(Here discussed in terms of skip lists, but they are similar enough that the distinction doesn't matter)
For spinning disks, this is obvious. But why is sequential access faster on SSDs?
IOW, what makes it faster to access sequentially on an SSD?
However, uuid4 keys create a lot of work during updates, and tend to result in relatively low-fill trees, so querying is less efficient than it could be.
They also don’t really benefit from btrees as range or prefix queries are extremely rare.