hyc_symas · 3 years ago
This is a pretty old argument and IMO it's far out of date/obsolete.

Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user level process has accurate information about the total state of the machine, only the kernel (or hypervisor) does and it's an exercise in futility to try to manage paging etc at the user level.

As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.

Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.

gavinray · 3 years ago
Out of curiosity, how many databases have you written?

This is co-authored by Pavlo and Viktor Leis, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those three, they're going to need some monumental evidence of their credibility.

Additionally, what you link here:

  > ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Is discussing "Main Memory" databases. These databases do no I/O beyond potential initial reads, because all of the data fits in memory!

These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMS.

All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT "pread()"s versus mmap'ed IO.

LAC-Tech · 3 years ago
This is a classic appeal to authority. Let's play the argument, not the man.

(My understanding is that the GP wrote LMDB, works on OpenLDAP, and was a maintainer of BerkeleyDB for a number of years. But even if he'd only written 'hello, world!', I'm much more interested in the specific arguments.)

ilyt · 3 years ago
Out of curiosity, do you have anything actually useful to add, or are you just throwing out appeals to authority because you don't?
Mikhail_Edoshin · 3 years ago
Even though the data resides mostly in memory, they still have to write transactions to disk to preserve them, don't they?
crabbone · 3 years ago
> your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

There's nothing special about kernel programmers. In fact, if I had to compare, I'd say storage people are the more experienced / knowledgeable ones. They work in a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are largely in the same group as storage people (often literally the same people).

As for the "single process" argument... well, if you run a database inside an OS, then, obviously, that will never happen, as the OS has its own processes to run. But if you ignore that -- no DBA worth their salt would put a database in an environment where it has to share resources with applications. People who do that are probably Web developers who don't have high expectations of their database anyway and would have no idea how to configure / tune it for high performance. So it doesn't matter how they run it; they aren't the target audience -- they are light years behind what's possible to achieve with their resources.

This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons: mmap doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or the database's own data.

hyc_symas · 3 years ago
> There's nothing special about kernel programmers.

Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.

> no DBA worth their salt would put database in the environment where it has to share resources with applications.

Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases -- it's all about "edge computing". There are far more DBs in use in the world than there are DBAs managing them.

> mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.

Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
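The split being described -- a read-only map for readers, plain syscalls for writers -- can be sketched in a few lines of C. This is an illustrative sketch, not LMDB's actual code; the path, page size, and function name are mine. It relies on Linux's unified page cache, where data written with pwrite() is immediately visible through a MAP_SHARED mapping of the same file.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the file read-only; all modifications go through pwrite().
 * PROT_READ means a stray pointer write in the application faults
 * instead of silently corrupting the database file. */
int readonly_map_write_via_pwrite(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) < 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Writer path: a plain positional write syscall... */
    const char payload[] = "hello";
    int ok = 0;
    if (pwrite(fd, payload, sizeof payload, 0) == (ssize_t)sizeof payload)
        /* ...whose effect is visible through the read-only mapping,
         * thanks to the unified page cache. */
        ok = memcmp(map, payload, sizeof payload) == 0;

    munmap(map, 4096);
    close(fd);
    return ok ? 0 : -1;
}
```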

sakras · 3 years ago
Can you comment on what the paper gets wrong? It says that scalability with mmap is poor due to page table contention and others. How does LMDB manage to scale well with mmap? Is page table contention just not an issue in practice?
tadfisher · 3 years ago
Maybe someone should pull LMDB's mmap/paging system into a usable library. I'd love to use the k/v store part of course, but I keep hitting the default key size limitation and would prefer not to link statically.
hyc_symas · 3 years ago
It wouldn't be much use without the B+tree as well; it's the B+tree's cache friendliness that allows applications to run so efficiently without the OS knowing any specifics of the app's usage patterns.
ori_b · 3 years ago
Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of lmdb have relatively small data sets.

hyc_symas · 3 years ago
> Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

Where did you look? This is a sample using DB 5x and 50x larger than RAM http://www.lmdb.tech/bench/hyperdex/

There are plenty of other larger-than-RAM benchmarks there.

jerrygenser · 3 years ago
> Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true I think.

danappelxx · 3 years ago
Who is deploying databases in containers?
orbz · 3 years ago
A disturbingly large number of deployments I’ve seen using Kubernetes or docker compose have databases deployed as such.
crabbone · 3 years ago
Nobody who matters.

Those who do that don't know what they are doing (even if they outnumber the other side a hundred to one, they "don't count" because they aren't aiming for good performance anyway).

Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container because of the convenience of assembling all dependencies in a single "package"; however, they would never run the database on the same node as applications -- that's insanity.

But even the idea of deploying a database alongside something like the kubelet service is cringe... That service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.

Obviously, this says nothing about the number of processes that will run on the database node. At a minimum, you'd want to run some stuff for monitoring, besides all the system services... but I don't think GP meant "one process" literally. That's neither realistic nor necessary.

morelisp · 3 years ago
I'm running prod databases in containers so the server infra team doesn't have to know anything about how that specific database works or how to upgrade it, they just need to know how to issue generic container start/stop commands if they want to do some maintenance.

(But just in containers, not in Kubernetes. I'm not crazy.)

didip · 3 years ago
My group and a bunch of my peer groups.

And we are running them at the scale that most people can’t even imagine.

huahaiy · 3 years ago
Embedded DB
jandrewrogers · 3 years ago
Another interesting limitation of mmap() is that real-world storage volumes can exceed the virtual address space a CPU can address. A 64-bit CPU may have 64-bit pointers but typically cannot address anywhere close to 64 bits of memory, virtually or physically. A normal buffer pool does not have this limitation. You can get EC2 instances on AWS with more direct-attached storage than addressable virtual address space on the local microarchitecture.
glandium · 3 years ago
To put concrete numbers on it: x86-64 is limited to 48 bits for virtual addresses, which is "only" 256 TiB (281 TB).
hyc_symas · 3 years ago
All of that is true, but I don't think it's a realistic concern. You're going to be sharding your data across multiple nodes before it gets that large. Nobody wants to sit around backing up or restoring a monolithic 256 TiB database.
Svetlitski · 3 years ago
Starting with Ice Lake there’s support for 5-level paging, which increases this to 128 PiB. Can’t say that I’ve ever seen this used in the wild though.
stevefan1999 · 3 years ago
Intel has now extended the page tables to 5 levels, making this number less valid. Granted, 5-level paging creates more TLB pressure and longer memory access times due to the extra level of page walk.
pjdesno · 3 years ago
Not just databases - we ran into the same issues when we needed a high-performance caching HTTP reverse proxy for a research project. We were just going to drop in Varnish, which is mmap-based, but performance sucked and we had to write our own.

Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good as or even better than explicit I/O back then -- a lot of the issues discussed in this paper (TLB shootdown overhead, single flush thread) get much worse as the core count increases.

Sesse__ · 3 years ago
Varnish's design wasn't very fast even for 2006-era hardware. It _was_ fast compared to Squid, though (which was the only real competitor at the time), and most importantly, much more flexible for the origin server case. But it came from a culture of "the FreeBSD kernel is so awesome that the best thing userspace can do is to offload as many decisions as humanly possible to the kernel", which caused, well, suboptimal performance.

AFAIK the persistent backend was dropped pretty early on (eventually replaced with a more traditional read()/write()-based one as part of Varnish Plus), and the general recommendation became just to use malloc and hope you didn't swap.

tayo42 · 3 years ago
Varnish has a file system backed cache that depends on the page cache to keep it fast.

What did you do differently in your custom one that made it faster than Varnish?

pjdesno · 3 years ago
Simple multithreaded read/write. On a 20-core 40-thread machine with a couple of fast NVMe drives it was way faster.
wood_spirit · 3 years ago
Old timers will recall when using mmap was a prominently promoted selling point for the “no sql” dbms.
ren_engineer · 3 years ago
seems like all databases are moving towards the middle. Postgres has JSON support; MongoDB has transactions and also a columnar extension for OLAP-type data. NoSQL seems almost meaningless as a term now. Feels like a move towards a winner-take-all multi-modal database that can work with most types of data fairly well. Postgres with all of its specialized extensions seems like it will be the most popular choice. The convenience of not having to manage multiple databases is hard to beat unless performance is exponentially better; Postgres with these extensions can probably be "good enough" for a lot of companies.

reminds me of how industries typically start out dominated by vertically integrated companies, move to specialized horizontal companies, then generally move back to vertical integration due to efficiency. The car industry started this way with Ford, went away from it, and now Tesla is doing it again. Lots of other examples in other industries.

TheGeminon · 3 years ago
The pendulum swing is common in any system, and is a really effective mechanism for evaluation.

You almost always want somewhere in the middle, but it’s often much easier to move back after a large jump in one direction than to push towards the middle.

nemo44x · 3 years ago
For documents it made access fast since there’s no joins, etc. that require paging from all over. The problem ended up being updates and compaction issues.
wood_spirit · 3 years ago
My memory is that the problem was ACID. The document stores didn’t promise to be reliable because apparently that didn’t scale.

And there was a very well known cartoon video discussion about it with “web scale” and “just write to dev null” and other classics that became memes :)

dang · 3 years ago
Related:

Are You Sure You Want to Use MMAP in Your Database Management System? [pdf] - https://news.ycombinator.com/item?id=31504052 - May 2022 (43 comments)

Are you sure you want to use MMAP in your database management system? [pdf] - https://news.ycombinator.com/item?id=29936104 - Jan 2022 (127 comments)

dist1ll · 3 years ago
Many general-purpose OS abstractions start leaking when you're working on systems-like software.

You notice it when web servers do kernel bypass for zero-copy, low-latency networking, or database engines throw away the kernel's page cache to implement their own file buffers.

kentonv · 3 years ago
Yes. I think mmap() is misunderstood as being an advanced tool for systems hackers, but it's actually the opposite: it's a tool to make application code simpler by leaving the systems stuff to the kernel.

With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program but the benefit is you don't have to think about these logistics.

But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.
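A minimal sketch of the simplicity being described (the function name is mine, error handling is trimmed): map a file once, then read it like an ordinary byte array, leaving readahead, caching, and eviction entirely to the kernel.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire (non-empty) file for reading. No read() loop, no
 * buffer sizing, no application cache: the kernel pages data in on
 * demand and evicts it under memory pressure. */
char *map_whole_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);               /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```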

arter4 · 3 years ago
Web servers doing kernel bypass for zero-copy networking? Do you have a specific example in mind? I'm curious.
dist1ll · 3 years ago
The most common example is DPDK [1]. It's a framework for building bespoke networking stacks that are usable from userspace, without involving the kernel.

You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the seastar framework [2]. Also, I recently stumbled upon a paper for efficient RPC networks in data centers [3].

If you want to learn more, the p99 conference has tons of speakers talking about some interesting challenges in that space.

[1] https://www.dpdk.org/

[2] https://github.com/scylladb/seastar

[3] https://github.com/erpc-io/eRPC

kentonv · 3 years ago
Probably the most common example is sendfile() for writing file contents out to a socket without reading them into userspace:

https://man7.org/linux/man-pages/man2/sendfile.2.html
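For reference, a minimal sendfile() copy loop looks something like this (the function name is mine; on Linux since 2.6.33 the destination can be a regular file as well as a socket):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Copy `count` bytes from in_fd (starting at offset 0) to out_fd
 * without the data ever passing through a userspace buffer. */
ssize_t copy_fd_contents(int out_fd, int in_fd, size_t count)
{
    off_t off = 0;               /* sendfile advances this for us */
    size_t total = 0;
    while (total < count) {
        ssize_t n = sendfile(out_fd, in_fd, &off, count - total);
        if (n < 0)
            return -1;           /* real code would retry on EINTR */
        if (n == 0)
            break;               /* source exhausted */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```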

kwohlfahrt · 3 years ago
It sounds like a lot of the performance issues are TLB-related. Am I right in thinking huge-pages would help here? If so, it's a bit unfortunate they didn't test this in the paper.

Edit: Hm, it might not be possible to mmap files with huge-pages. This LWN article[1] from 5 years ago talks about the work that would be required, but I haven't seen any follow-ups.

[1]: https://lwn.net/Articles/718102/

hyc_symas · 3 years ago
Huge pages aren't pageable, so they wouldn't be particularly advantageous for a mmap'd DB anyway; you'd have to do traditional I/O & buffer management for everything.
ori_b · 3 years ago
No, huge pages wouldn't help. They would change when the TLB gets flushed, but the flushes would still be there.
Dwedit · 3 years ago
Memory-Mapped Files = access violations when a disk read fails. If you're not prepared to handle those, don't use memory-mapped files. (Access violation exceptions are the same thing that happens when you attempt to read a null pointer)

Then there's the part with writes being delayed. Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.

kentonv · 3 years ago
> Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.

This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.
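In sketch form, the two flush paths mirror each other (function names are mine; until the flush call returns, the data in both cases sits only in the page cache):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* write() path: pwrite() lands in the page cache; fsync() returns
 * only once the data (and metadata) reach the device. */
int durable_write(int fd, const void *buf, size_t len)
{
    if (pwrite(fd, buf, len, 0) != (ssize_t)len)
        return -1;
    return fsync(fd);
}

/* mmap path: stores to a MAP_SHARED mapping also land in the page
 * cache; msync(MS_SYNC) waits for writeback, mirroring fsync(). */
int durable_flush(void *map, size_t len)
{
    return msync(map, len, MS_SYNC);
}
```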

crabbone · 3 years ago
> This is not specific to mmap -- regular old write() calls have the same behavior.

This is not true. It depends on how the file was opened. You may request O_DIRECT | O_SYNC when opening, and then writes are acknowledged only when they have actually been written. This is obviously a lot slower than writing to cache, but it's the way for "simple" user-space applications to implement their own cache.

In the world of today, you are very rarely writing to something that's not network-attached, and depending on your appliance, the meaning of an acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.
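A sketch of that open mode (the fallback is my addition: some filesystems, notably tmpfs, reject O_DIRECT, and O_DIRECT imposes alignment requirements on buffers, offsets, and lengths):

```c
#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Open so that write() returns only once data is on stable storage.
 * O_DIRECT additionally bypasses the page cache, which is what a
 * database implementing its own buffer pool wants. */
int open_synchronous(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0)              /* filesystem may not support O_DIRECT */
        fd = open(path, O_RDWR | O_CREAT | O_SYNC, 0644);
    return fd;
}
```

With O_DIRECT in effect, writes must use block-aligned buffers, sizes, and offsets, e.g. a buffer from posix_memalign() with 4096-byte alignment.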

tsimionescu · 3 years ago
It's fun to remember that fsync() on Linux, on ext4 at least, offers no real guarantee that the data was successfully written to disk. This happens when write errors from background buffered writes are handled internally by the kernel, which cleans up the error state (marks dirty pages clean, etc.). Since the kernel can't know if a later call to fsync() will ever happen, it can't just keep the error around. So, when the call does happen, it will not return any error code. I don't know for sure, but msync() may well have the same behavior.

Here is an LWN article discussing the whole problem as the Postgres team found out about it.

https://lwn.net/Articles/752063/

afr0ck · 3 years ago
Linux throws a SIGBUS. A process should anticipate such I/O failures by implementing a SIGBUS handler, especially a database server.

For the second part of your comment, on Linux systems, there is the msync() system call that can be used to flush the page cache on demand.

crabbone · 3 years ago
> msync() system call that can be used to flush the page cache on demand.

for everyone, not just the file you mapped to memory. I.e. the guarantee is that your file will be written, but there's no way to do that w/o affecting others. This is not such a hot idea in an environment where multiple threads / processes are doing I/O.

wmf · 3 years ago
I wonder how many apps don't handle errors from read() anyway.
sidewndr46 · 3 years ago
does that get delivered as SIGSEGV to the process or something else?
afr0ck · 3 years ago
On Linux, it's a SIGBUS.