I always think discussions like this should start with the following: A database with no indexes is slow. Finding a particular row will require a linear search. Adding an index on a column means that you're optimizing finding rows by the values of that column. Hence, an index is really a mapping from a particular column's value to the position of that row in the db (very likely an 8-byte integer that is the offset of the row in question within the file).
This all means we can implement indexes as b-trees where the keys are the values of a particular column and the value is the file offset of the row with that value. You could envision a simple db format where indexes and the main row file are stored in separate files. In such a database you could drop an index simply by deleting the index's file (or add one by creating it). The main row file has all of the data, so indexes can be recreated if necessary (at some expense, of course).
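For illustration, a minimal sketch of that layout in Python. The sorted list plus bisect stands in for a real B-tree, and the one-JSON-row-per-line file format is purely an assumption for the example:

```python
import bisect
import json

def build_index(row_file_path, column):
    """Scan the row file once, collecting (column value, byte offset) pairs sorted by value."""
    entries = []
    with open(row_file_path, "rb") as f:
        offset = 0
        for line in f:                       # one JSON row per line, for simplicity
            row = json.loads(line)
            entries.append((row[column], offset))
            offset += len(line)
    entries.sort()
    return entries

def lookup(row_file_path, index, value):
    """Return rows whose indexed column equals `value`, without scanning the whole file."""
    i = bisect.bisect_left(index, (value,))  # first index entry with this value
    rows = []
    with open(row_file_path, "rb") as f:
        while i < len(index) and index[i][0] == value:
            f.seek(index[i][1])              # jump straight to the row's offset
            rows.append(json.loads(f.readline()))
            i += 1
    return rows
```

Dropping the index is just deleting `entries` (or its file on disk); rebuilding it is another full scan.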
> A database with no indexes is slow. Finding a particular row will require a linear search.
The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern. "Index = fast" is a painfully pernicious untrue meme. It's absolutely true for application tables with queries only touching a few rows. On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
I've seen devs shotgun indexes at tables to fix performance (done it myself too) but the real test of index understanding is when that doesn't work.
> On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
> "Index = fast" is a painfully pernicious untrue meme.
I believe that a lack of internalization of that meme (regardless of how true it is) can be a cause of real trouble.
I was working in a team where the Java devs simply didn't bother to put indexes on tables because they were small (around 100 rows or so). When I (a JS dev) pestered them long enough to finally do it, the whole app suddenly got super snappy, and they were very thankful, as it happened just as degrading performance was causing a lot of gloom.
Depends on the data, and the index. A covering index would almost certainly be faster. But you said analytics, which implies large results from huge datasets, so it’s unlikely to fit into memory in the first place.
> The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern
We have a winner. But when looking at SQL tables/Views/Stored Procedures, the data is also stored in order in memory; in effect you have a master database and files on disk, with sorted databases and files in memory for faster access.
No it's not… if all you do is write to it. In fact, it's the fastest possible database for such a case.
Indexes are pure redundancy - they duplicate data that is already in the base table, and they must be maintained during writes.
But they can make reads so much faster, if the data access pattern can utilize them. The key is to identify access patterns which justify the price of the index.
In some cases they can make things worse. It's worth remembering that the query optimizer looks not only at indexes, but also at statistics and estimated operation costs. If your statistics are out of date and your criteria are not specific enough (e.g. they match 80% of rows), then an index is going to slow the query down: it needs to traverse the index to get the row IDs, fetch all the blocks containing them, read those, and filter out irrelevant rows. A plain full table scan is probably going to be faster (thanks to linear reads).
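A rough back-of-envelope of that trade-off, with all numbers invented for illustration:

```python
# Low selectivity: the index path degenerates into one random block fetch per
# matching row (worst case, ignoring caching and rows sharing a block), while
# a full scan reads each block exactly once, sequentially.
rows = 10_000_000
rows_per_block = 100
selectivity = 0.80                              # criteria match 80% of rows

full_scan_reads = rows / rows_per_block         # every block once, sequential I/O
index_path_reads = rows * selectivity           # worst case: one block fetch per matching row

print(f"full scan:  {full_scan_reads:,.0f} sequential block reads")      # 100,000
print(f"index path: {index_path_reads:,.0f} mostly random block reads")  # 8,000,000
```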
I've implemented a high-performance btree this way in the past, where each table and each index were separate files (with append-only writes for concurrency). It worked pretty well and wasn't hard to get right, but it had some downsides (in particular, the kernel seemed to struggle with all the paging).
Note that the real primitive is "find nearest [with hint]", which has to be the #1 thing I miss in Python.
For B-trees it's only going to be a significant win if the chance of intersection is small, less than about `1/(nodes_per_block)`. For binary trees it's a much bigger win since that becomes 1/2 and binary trees are horrible on cache.
Hmm, can you efficiently intersect an arbitrary number of B-trees using the same idea as a heap-merge, but with a max heap instead of a min heap? You'd still have to iterate over all of them on a match, but as long as two input trees don't have an intersection, you don't have to look at the others at all ... or does that become equivalent to just doing them in series?
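For what it's worth, here is a sketch of that "find nearest [with hint]" style intersection on plain sorted Python lists (standing in for ordered index scans). It handles any number of inputs and leaps everyone forward as soon as one input disagrees:

```python
import bisect

def intersect_sorted(lists):
    """Leapfrog-style intersection of several sorted lists of unique keys."""
    if not lists or not lists[0]:
        return []
    pos = [0] * len(lists)                 # per-list cursor, i.e. the "hint"
    out = []
    candidate = lists[0][0]
    while True:
        matched = True
        for i, lst in enumerate(lists):
            # find the nearest key >= candidate, starting from this list's hint
            pos[i] = bisect.bisect_left(lst, candidate, lo=pos[i])
            if pos[i] == len(lst):
                return out                 # one input exhausted: no more matches possible
            if lst[pos[i]] != candidate:
                candidate = lst[pos[i]]    # leap everyone forward to this larger key
                matched = False
                break
        if matched:
            out.append(candidate)
            pos[0] += 1
            if pos[0] == len(lists[0]):
                return out
            candidate = lists[0][pos[0]]

print(intersect_sorted([[1, 3, 5, 7, 9], [3, 4, 5, 9, 12], [2, 3, 5, 9, 10]]))  # [3, 5, 9]
```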
A nuance that is important here is that not all accesses are equal within the context of disk reads. B-trees are designed to minimize block reads, not memory operations.
I guess there are worst case scenarios with evenly spaced intersections spread out exactly one per block, but in terms of block reads it fundamentally doesn't matter how you intersect two such trees, you'll still have to read all blocks, and that is orders of magnitude slower than comparing the values within.
I think the tree structure can be considered cached in real world scenarios; not really relevant to the performance you'll get.
In an SSD, a write operation can only be done when the page is already erased. However, the unit of read/write operations is a page, while the unit of erase operations is a block. That means that for a disk write, a naive implementation would need to read the whole block, erase the block, then write the updated data back to the block, which is unacceptable. Furthermore, blocks should wear out uniformly; otherwise, the SSD would lose capacity.
To tackle these problems, SSDs introduce a Flash Translation Layer (FTL), which builds the illusion of a random-access device. To achieve this, the FTL employs an approach very similar to LSM trees: writes always go to new, already-erased pages, while garbage collection (GC) reclaims outdated data in the background. The FTL needs to keep a map from the user's logical addresses to physical addresses on the SSD, both in memory and persistently.
So to answer your question: why are sequential writes faster than random writes on SSDs? Because the address map is smaller, since new data is consecutive in larger chunks; garbage collection is simpler, and only metadata needs to be updated. Erasing a block is required anyway.
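A toy model of that logical-to-physical mapping, hugely simplified and purely illustrative (real FTLs are far more elaborate), but it shows why scattered overwrites leave every block half-live while sequential overwrites free whole blocks at once:

```python
PAGES_PER_BLOCK = 4

class ToyFTL:
    def __init__(self, blocks):
        self.l2p = {}                                  # logical page -> (block, page)
        self.live = {b: set() for b in range(blocks)}  # live physical pages per block
        self.free = [(b, p) for b in range(blocks) for p in range(PAGES_PER_BLOCK)]

    def write(self, logical_page):
        old = self.l2p.get(logical_page)
        if old is not None:                            # overwrite: old copy becomes garbage
            self.live[old[0]].discard(old[1])
        block, page = self.free.pop(0)                 # always write to a fresh erased page
        self.l2p[logical_page] = (block, page)
        self.live[block].add(page)

    def gc_cost(self, block):
        """Live pages that must be copied out before this block can be erased."""
        return len(self.live[block])

seq, rnd = ToyFTL(blocks=8), ToyFTL(blocks=8)
for ftl in (seq, rnd):
    for lp in range(16):
        ftl.write(lp)                                  # initial fill: blocks 0-3 all live

for lp in range(8):                                    # sequential update of pages 0..7
    seq.write(lp)
for lp in range(0, 16, 2):                             # scattered update of every other page
    rnd.write(lp)

print([seq.gc_cost(b) for b in range(4)])  # [0, 0, 4, 4] - two blocks are pure garbage, erase for free
print([rnd.gc_cost(b) for b in range(4)])  # [2, 2, 2, 2] - every block needs live pages copied first
```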
In practice the difference between random and sequential reads is much smaller, with the caveat that all SSD reads are on some level block operations, so locally sequential access is what makes the big difference.
Though with mmap the OS may do speculative async readahead, depending on madvise.
Like sibling said, physical pages are typically much bigger than logical pages in SSDs. Also drives do prediction and sequential access is easy to predict.
Not just SSDs: it's faster in RAM too, for the same reasons (page-based access to caches). Basically you should use a B-tree whenever you need the functionality of an STL map.
I'd assume that at some level of IO, blocks/pages/whole buffers are being read, as opposed to reading bytes one by one. So sequential access takes advantage of this, I suppose.
> It was invented over 40 years ago, yet it is still employed by the majority of modern databases.
I wonder how true this is for the top commercial engines (Oracle, MS, IBM, etc.), whose internals are closed source and proprietary. Even a decade ago my experience performance-testing Exadata implied some exotic magic at work, i.e. lookups were way faster than the expected O(log n). More recently, while testing SQL Server's ability to join hundreds of tables together, the performance was _way_ in excess of what I expected. I can't imagine these things have internals all that similar to, say, the B+Tree inside MySQL.
A lot of this comes down to query planners being really good at finding clever ways of doing scans and intersections of indexes, the tables themselves having indexes with a bunch of specialized representations, and the query execution doing very intelligent data traversal with partitioning or even multi-threading.
If you sit down and think carefully about your data you can often make even a simple bare-bones B-tree perform fantastically for a query, well in excess of what you'd get out of mysql or sqlite (which are already pretty fast).
I think the most important takeaway is that the old-school RDBMS products are probably more than enough for whatever you are trying to accomplish. Query planners in these are indistinguishable from magic, as anything forged in the fires of a million production environments for a few decades should be.
I've been playing around with an idea that involves putting sql server at the heart of a game engine, and it is turning into one of the biggest rabbit holes I've ever explored. I thought latency/jitter would be more of a problem but it simply isn't.
On disk, SQL Server uses only b-trees, unless using the new ColumnStore format.
In memory during a query it can use temporary indexes of other types, primarily hash tables and bitmaps.
Its performance on ad-hoc complex queries is about as good as it gets, few if any other RDBMS can beat its performance, but under the hood it’s still mostly just doing joins on b-trees!
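For context, a temporary hash "index" during a join amounts to roughly this; an illustrative sketch only, not SQL Server's actual implementation:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two row sets on an equality condition via an in-memory hash table."""
    table = {}
    for row in build_rows:                         # build phase: hash the (smaller) input
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                         # probe phase: one lookup per row
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}]
orders = [{"order_id": 10, "user_id": 2}, {"order_id": 11, "user_id": 1}]
print(list(hash_join(users, orders, "user_id", "user_id")))
```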
> ..under the hood it’s still mostly just doing joins on b-trees!
I could see the on-disk format needing to be simple and stable, but once the data's buffered, who knows what structures and algorithms these proprietary engines are using? You would need to have done some reverse engineering, or had hands-on details from the inside, which presumably comes with legal consequences for leaking them.
Amidst all these great discussions, I would like to point out that this article really helped me get my head around a B-tree and why it's a great optimization on top of the binary search tree. Thanks to the author!
Very poorly, is my understanding. There are various sequential UUID-like schemes that are more sortable because they are prefixed with bits of physical time. Off the top of my head, ULIDs and also UUID v7.
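A sketch of the underlying idea, not a spec-compliant ULID or UUID v7, just the time-prefix trick: keys generated around the same time sort next to each other, so inserts land in the same hot part of the tree instead of in random leaves.

```python
import os
import time

def time_prefixed_key():
    """48-bit millisecond timestamp followed by 80 random bits, as a hex string."""
    millis = int(time.time() * 1000)
    return f"{millis:012x}" + os.urandom(10).hex()

keys = [time_prefixed_key() for _ in range(3)]
print(sorted(keys))   # near-identical prefixes: new keys cluster in the same part of the key space
```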
Probably poorly by default, but you could use a hash of the uuid as a key (to try and more evenly spread the entropy) or key it off a suffix instead of a prefix since iirc that's where most of the entropy lives.
In practice if you want good performance and scalability it's important to select keys well.
> I've seen devs shotgun indexes at tables to fix performance (done it myself too) but the real test of index understanding is when that doesn't work.
That's when you bring a BRIN to the table. :-)
> /dev/null will suffice.
And if you ever need to read anything even once, a database without indexes is slow
then …
> the kernel seemed to struggle …
What was the struggle? If it's performance, doesn't that contradict your earlier statement?
Genuinely interested in what the issue was, not trying to be a pedant
They already envisioned it in DBF and CDX (dBASE/FoxPro, where the table data and its compound indexes live in separate files).
(Here discussed in terms of skip lists, but they are similar enough that the distinction doesn't matter)
For spinning disks, this is obvious. But why is sequential access faster on SSDs?
IOW, what makes it faster to access sequentially on an SSD?
However, uuid4 keys create a lot of work during updates, and tend to result in relatively low-fill trees, so querying is less efficient than it could be.
They also don’t really benefit from btrees as range or prefix queries are extremely rare.