ChuckMcM · 4 years ago
The UAI feature is nice! Back when I was noodling on creative ways to use 64-bit addresses, the idea of a 16-bit 'space id' was bandied about. A space ID in the upper 16 bits would indicate the address space for a particular thing, be it a process, the kernel, a shared library, etc. And a new opcode to trampoline from one space to another for things like system calls or library calls could provide a gateway for doing secure code analysis (basically no "side" entrances into a space), and it would be a much more robust pointer-testing mechanism because any read or write with a pointer "outside" the space would be immediately flagged as an error. The use of segments on the 286 was sort of like this.

Ideally you'd have a fine-grained page map so that you could avoid copies from one space to another (instead of copy_in()/copy_out() you would do map_in()/map_out()) to get access to physical memory, while preserving the semantic that when something was mapped out of your space you couldn't affect it. (Zero-copy network stacks have been known to have "a process left holding a pointer to a buffer modified it when it wasn't 'owned'" kinds of bugs.)

Could be a fun project for a RISC-V design. Not sure how many people are experimenting with various MMU systems but it is at least possible to do these kinds of things in nominally "real" hardware (it has always been possible to simulate of course).

erwincoumans · 4 years ago
In some cases (especially user-land code), using array indexing can be better than using pointers. It allows re-using data between GPU and CPU, using fewer bits for indices, and it simplifies serialization. FlatBuffers is a very nice example: https://google.github.io/flatbuffers/
rayiner · 4 years ago
The issue addressed in the article with pointer tagging applies whether you use array indexing or pointers. Both are converted to pointer operations under the hood, which on X86 are some form of ADDR = BASE + SCALE*INDEX + OFFSET. Such an operation is still considered a pointer dereference (of BASE) from the view of the hardware, and will trap if tags are stored in the unused upper bits of the pointer.

Depending on the size of the array elements, you won't even be able to use the built-in indexed addressing mode, because the scale factor can only be 1, 2, 4, or 8 bytes. For that reason, LLVM doesn't even have array indexing. Loads/stores can only be done against pointers, and pointers to items in arrays must be expressly computed using a get-element-pointer instruction: https://llvm.org/docs/GetElementPtr.html.

erwincoumans · 4 years ago
Thanks for the insightful reply. I didn't realize array indexing gets converted to pointers in the CPU. I expected the array indexing assembly instructions to be plumbed all the way down without converting to pointers until the very end.

LLVM not supporting array indexing means all those instructions go unused, on both x86 and arm? How about gcc or other compilers?

uvdn7 · 4 years ago
And not to mention better CPU cacheline hits!
KMag · 4 years ago
I used to sit next to a guy who previously professionally wrote compilers for both DEC Alpha and Itanium (http://robert.muth.org/), who mentioned that programs often ran faster on Alpha (which only supported 64-bit addressing) when modified to use 32-bit array indexes instead of pointers, due to reduced memory bandwidth/cache usage. Of course, one had to first determine if one needs to use any arrays larger than 2^32 elements.
wongarsu · 4 years ago
Instead of dealing with all the complexities of giving user processes control over the CPU's pointer masking (context switching, processes now using bit 63, which marks kernel addresses), why doesn't the kernel just turn the feature on system-wide when available, reserve a couple of bits for kernel usage (say, bit 63 to mark kernel addresses), and provide a syscall that simply informs processes which bits they can use for pointer tagging, if any?

Are there any compatibility concerns that make it necessary to keep faulting on addresses outside the valid address range?

zozbot234 · 4 years ago
> Are there any compatibility concerns that make it necessary to keep faulting on addresses outside the valid address range?

As a sibling comment points out, forward compatibility is the whole reason why this "faulting on non-canonical addresses" was introduced. We used to have systems with 32-bit addresses and "upper byte ignore", leaving 24 bits of actual address space.

Applications that took advantage of that "ignore" feature by issuing non-normalized load or store operations broke on "32-bit clean" versions of the same architectures, even when limited to the same 24-bit address space. If they had stuck to issuing loads and stores with canonicalized addresses and treated "tagging" as a pure memory-saving thing, they could've been supported.

irdc · 4 years ago
Note that this was a problem on the classic Mac OS (https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme..., upper bits of pointers were used as flags) and early ARM processors (https://en.wikipedia.org/wiki/26-bit_computing#Early_ARM_pro..., combined program status register and program counter resulting in an effectively 26-bit system). Neither of these systems had an MMU, thus causing the breakage you mention. On a modern system with an MMU, one could instead reduce the size of a process' address space in exchange for tag bits.
ajross · 4 years ago
Because that changes behavior. A userspace process that currently expects to fault on access to pointers with high bits set would suddenly start touching different memory. I don't have specific examples, but there have been implementations of this kind of thing in the past that rely on traps like that to detect equivalent pointer tags or alternate memory spaces.
temac · 4 years ago
I'm intrigued by:

> Turning on UAI would allow user space to create pointer values that look like kernel addresses, but which would actually be valid user-space pointers. Those pointers can, of course, be passed into the kernel via system calls where, in the absence of due care, they might be interpreted as kernel-space addresses. The consequences of such confusion would not be good, and the possibility of it happening is relatively high.

Userspace can already forge "pointers" with whatever it wants in their bits. If the idea is to allow passing tagged pointers to user-space memory into the kernel and from there back to user space, maybe just don't allow that?

I actually don't see why the kernel should accept tagged pointers to user space at all. And it seems this should already be checked everywhere; otherwise userspace could already make the kernel access unintended kernel memory. I don't see how it would change anything if some of those pointers became dereferenceable from standard userspace under some configuration.

temac · 4 years ago
Ok so I read the mail from Andy and understand the real problem better: UAI is not context switched and is to be enabled system wide. I don't know what AMD has been smoking.
saagarjha · 4 years ago
Pretty sure this is how it works in ARM as well, except that TBI can be configured per-exception level so it can be turned off in the kernel.
kvakvs · 4 years ago
Virtual machines use pointer tagging in the least significant bits: since most data is 4- or 8-byte aligned, you can assume that zeroing those bits will always give you a safe, correct pointer value. And the tag bits can instruct the code that the value is not a pointer but contains more data in other bits, or that the value is a structure on the heap with special treatment.
khuey · 4 years ago
Depends on the virtual machine. LuaJIT and SpiderMonkey (and probably others that I don't know about) use NaN-boxing, which ends up storing the metadata in the higher bits.
pjmlp · 4 years ago
Intel's history has quite a few failed attempts at memory tagging (iAPX 432, i960, MPX); maybe it again needs an AMD push to make it work.
adrian_b · 4 years ago
There are several cases where Intel botched the definition of some instructions, AMD later had to redefine them correctly, and Intel eventually adopted the corrected versions into its ISA.

Nevertheless, in this case AMD is wrong, without doubt.

Intel published its version of "Linear Address Masking" a year ago; it might become available in Sapphire Rapids.

There is nothing to criticize about Intel LAM. It keeps the highest address bit with the same meaning as today, to distinguish supervisor (kernel) pointers from user pointers.

Address masking can be enabled or disabled separately for supervisor pointers and user pointers. It has two variants, depending on whether you choose 48-bit virtual addresses (4-level page tables) or 57-bit virtual addresses (5-level page tables).

AMD has now published a different, incompatible method. They likely conceived it a few years ago, when the design of Zen 4 started, which is why it is incompatible with Intel's.

The fact that it is incompatible with Intel would not have been a big deal, except that the AMD method is wrong, because it also masks the MSB, breaking all the kernel code that tests pointers to determine whether they are user pointers.

Like everyone else, I cannot understand why AMD did not specify this feature correctly, as Intel did. It certainly is neither rocket science nor brain surgery.

pjmlp · 4 years ago
Oh, really bad then. So only SPARC and ARM will rescue us.
hansendc · 4 years ago
By an "AMD push" do you mean Intel should post a superior implementation a year before AMD does? ;)

https://lore.kernel.org/lkml/20210205151631.43511-1-kirill.s...

BTW, MPX was clearly not the right thing. Nobody really wanted it. The world would be a better place if these address-bit-ignoring things (ARM TBI, AMD UAI, Intel LAM) had been implemented years ago instead of all the effort spent on MPX.

Believe me, I know. I put a lot of blood, sweat and tears into MPX for Linux.

Disclaimer: If you didn't figure it out by now, I work on Linux at Intel.

saagarjha · 4 years ago
You're misunderstanding what this tag is used for: it's to accelerate virtual machines, rather than for memory safety.
hansendc · 4 years ago
Actually, its primary design goal is to make address sanitizers faster. Right now, all the code that touches a sanitizer-tagged address must be recompiled to understand how to place and remove the tag. These address-bit-ignore approaches can (ideally) allow you to just modify the memory allocator to hand out tagged addresses. Those addresses can then be passed around to code that doesn't even know it's handling a tagged address. It doesn't need to be modified. You don't need to recompile the world. Even when the sanitizer is on, you also don't need to be constantly stripping tags out of pointers before dereferencing them.
netfl0 · 4 years ago
https://d3fend.mitre.org/technique/d3f:PointerAuthentication

We are building a KB of these sorts of things.

pjmlp · 4 years ago
I failed to find any reference to SPARC ADI, the longest-stable solution to this problem:

https://docs.oracle.com/cd/E53394_01/html/E54815/gqajs.html

https://www.kernel.org/doc/html/latest/sparc/adi.html

netfl0 · 4 years ago
Thank you, it sounds like you have a lot of experience in this domain, if you’d like to contribute we’d welcome more of your perspective.

https://github.com/d3fend/d3fend-ontology

Otherwise we’ll get this reference added.

londons_explore · 4 years ago
Years ago, someone said "32-bit addresses! That's huge! Let's use the top few bits for other stuff, like gate A20 and select lines of hardware".

That came to bite people in the form of the "3GB hole", the "PCI hole", etc. And those hacks were painful for lots of people.

I feel that by reusing address bits for other purposes in a non-flexible way like this, we're just repeating the mistakes of history. After all, there are 44 zettabytes of data in the world (in 2020), and addressing that is already beyond what a 64 bit number can do!

And one day we'll have that much storage in your hand: your phone has about 2^80 atoms in it, so storing 2^64 bytes in there is totally physically possible.

scottlamb · 4 years ago
> I feel that by reusing address bits for other purposes in a non-flexible way like this, we're just repeating the mistakes of history.

Those old schemes were permanent machine-wide assumptions on physical addresses. They were essentially saying "32 - N bits is enough for anyone" (on this hardware design).

This is on virtual addresses, and the kernel developers want (and it sounds like the Intel feature allows) this to be configured per-process. I think it's totally reasonable for some process to say at runtime "56 bits is enough for me". In fact, it's common for Java code to run with "Compressed Oops" (32-bit pointers to 8-byte-aligned addresses, for a total of 32 GiB of addressable heap). This happens automatically when the configured heap size is sufficiently small.

> After all, there are 44 zettabytes of data in the world (in 2020), and addressing that is already beyond what a 64 bit number can do!

I don't think it makes sense for one process to be able to mmap all the data in the world. The major page fault latency would be nasty!

> And one day, we'll have that much storage in your hand - your phone has about 2^80 atoms in it, so storing 2^64 bytes in there is totally physically possible.

That day is pretty far off I think, but even when it happens I'm not sure it makes sense for all virtual pointers to be >=64-bit. A couple reasons it might not:

* Most programs (depending somewhat on programming language) use a lot of their memory for pointers, so it seems wasteful.

* I assume this hypothetical future device will still have slower and faster bytes to access. It may still make more sense to have different APIs for the fast stuff and the slow stuff. mmap's limitation of stalling the thread while something is paged in is a real problem today, and I don't know if that would get any better. Likewise error handling.

colejohnson66 · 4 years ago
The whole reason x86-64 required “canonical”[a] addresses in the first place was to prevent people using them for this purpose. Sure, this proposal allows applications to know how many bits they can work with, but what happens when an application developer writes `assert(freeBits >= 7)` when only 6 are available in the future?

[a] a “canonical” address is one where bits 63 down through the most-significant implemented virtual-address bit are identical. So on a processor supporting 48-bit virtual addresses, bits 63:47 must be identical (either all clear or all set). Attempting reads or writes with non-canonical addresses raises a #GP fault.

zozbot234 · 4 years ago
> Sure, this proposal allows applications to know how many bits they can work with

That is entirely orthogonal to this "upper bits ignore" feature. Ideally, a process would be able to set any reasonable number of upper bits as "reserved for tagging", and the system allocator and kernel would then simply not require it to work with user virtual addresses that involve those upper bits. But upper bits can't simply be "ignored" if an app is to be forward-compatible; they still need to be canonicalized whenever external code is called.

saagarjha · 4 years ago
Then the application does not work on a future processor, that is correct. The person who wrote it, who likely is a JIT engineer, will get a bug report asking for this to be fixed, although most likely they're already aware of the upcoming hardware change and have prepared a different tagging scheme already.
hyperman1 · 4 years ago
The A20 gate story was something different though, it abused the numeric overflow of 16 bit addresses, not some unused bits in the middle of an address.

An x86 address had a 16-bit segment and a 16-bit offset, with the linear address calculated as 16*segment + offset, truncated to 20 bits. Segment 0xF000 was in ROM and 0x0000 was in RAM, with the low RAM addresses used for various BIOS functionality.

If you use a segment like 0xFF00, then offsets 0 to 0x0FFF correspond to linear addresses 0xFF000 to 0xFFFFF, in ROM. Offsets 0x1000 to 0xFFFF correspond to 0x0000 to 0xEFFF, in RAM. This trick meant you could set the segment register only once and then use the offset to both read tables in ROM and use variables in RAM. Of course, contemporary BIOSes and other software used this trick to shave off a few instructions.

You know what happens next. The 80286 had 24 address lines, not 20, so addresses just above 0xFFFFF = 1 MB became valid instead of wrapping around. In the name of backward compatibility, someone found an unused pin on the keyboard controller and attached an AND gate to it and to A20. By talking to the keyboard controller, you could choose between 16 MB of RAM and the old wraparound behavior. The keyboard controller was of course dog slow, and you had to set and reset the A20 gate constantly to switch between not triggering BIOS/DOS/TSR bugs and having usable upper RAM. AFAIK this hack is still there today in every x86-based PC, even if the gate stays enabled all the time.

ajross · 4 years ago
> The A20 gate story was something different though, it abused the numeric overflow of 16 bit addresses, not some unused bits in the middle of an address.

It wasn't even an abuse. It was a genuine attempt by the board manufacturer (IBM) to address a real backwards compatibility problem with a new CPU part. Obviously as memories grew far beyond 1M and real mode became a part of history (well, and of bootstrap code) it was a giant wart. But it solved a real problem.

pjmlp · 4 years ago
Hopefully by then the languages that are the reason pointer tagging exists in the first place won't matter as much as they still do today.
cmrdporcupine · 4 years ago
Given its longevity and installed base, I see no reasonable possibility of C/C++ going away or being significantly curtailed in the next half century.

C is 50 years old this year. And it's everywhere.

It's more likely we'll be (or my kids will be) involved in harm reduction around these systems, rather than outright replacement.

karatinversion · 4 years ago
Pointers are a core part of the CPU memory model on both ARM and x64, so they aren't going away anytime soon; there's no reason compilers, JITs or runtimes couldn't use pointer tagging, just the same as C programmers.
rwmj · 4 years ago
I don't think C is going away any time soon, and even if it did, C reflects a common model of how hardware works that is shared by plenty of other higher level languages. Also GCs use pointer tagging.
saagarjha · 4 years ago
You mean most languages that run on virtual machines today, right? Java, JavaScript, C#, Lua, …