Yet in practice the stack on x64 is separate from the heap, GPU memory in general is separate from system memory, and even inside the GPU you have at least a three-level segmented memory hierarchy.
Even in C-land, you're technically not supposed to fabricate arbitrary heap pointers; you're supposed to offset something you got from malloc(). The Standard says (if memory serves) that only pointers from the start of a malloc()'d block to one-past-the-end have defined behavior (when it comes to the heap).
Of course, there are probably lots of in-practice exceptions when it comes to embedded, kernel code, mmap() shenanigans, etc.
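To make that rule concrete, a minimal sketch (hypothetical buffer; the comments mark what the Standard does and does not allow):
    #include <stdlib.h>

    int main(void) {
        char *buf = malloc(16);
        if (!buf) return 1;

        char *mid = buf + 8;    /* fine: points inside the allocation */
        char *end = buf + 16;   /* fine: one past the end; may be compared, not dereferenced */
        /* char *bad = buf + 17;            undefined behavior: even forming this pointer */
        /* char *made_up = (char *)0x1234;  not derived from malloc(); implementation-defined at best */

        (void)mid; (void)end;
        free(buf);
        return 0;
    }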
So what? In a bitmask, such as a hardware register or database Bloom filter, two adjacent bits may be unrelated, and their relative position entirely coincidental and insignificant.
That doesn't mean we need a memory model in which bits are separate objects, in some sense.
Real segmentation would have solved the problem described in the article. With virtual-memory segments like those on the 80386 (and mainframes before that), you can physically relocate a segment while adjusting its descriptor, so that the addressing doesn't change.
The problem was mainly caused by having no MMU, so moving objects around to save space required adjusting pointers. Today, a copying garbage collector does the same thing: it rewrites all the links among the moved objects. You'd have similar hacks on Apple Macintoshes, with their MC68K processors and flat space, like doubly indirected objects managed by "handles" and whatnot.
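A sketch of that double indirection in C (names hypothetical): the program holds only the handle, so the "memory manager" can move the block as long as it patches the single master pointer.
    #include <stdlib.h>
    #include <string.h>

    typedef void **Handle;   /* pointer to a master pointer the allocator may rewrite */

    static Handle new_handle(size_t size) {
        Handle h = malloc(sizeof *h);    /* error handling omitted in this sketch */
        *h = malloc(size);
        return h;
    }

    /* "Compaction": relocate the block; code holding the Handle never notices. */
    static void compact_move(Handle h, size_t size) {
        void *fresh = malloc(size);
        memcpy(fresh, *h, size);
        free(*h);
        *h = fresh;                      /* only the master pointer needs patching */
    }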
It's fun to see that old article again. (I wrote the FixDS program Raymond mentions at the end.)
One pain point in 16-bit Windows programming was that you had to do two special things for any callback function in your application: EXPORT the function in your .DEF file, and call MakeProcInstance() on your callback function to get an "instance thunk".
FixDS made a simple patch to the compiled function prologs that rendered all of this unnecessary. You could just use your callback function directly, like we expect today.
I wrote this piece frustrated by what looks to me like an entire semiconductor industry exploring only a single computer storage organization, even though recent inventions like flash practically beg for innovation.
For instance, few people realize that the Flash Adaptation Layer in SSD devices means we literally run two filesystems on top of each other, because nobody has seriously tried to get rid of the "a disk is an array of individually rewritable sectors" model, despite it being untrue both for modern disks and, in particular, for flash-based storage.
Similarly, the "flat physical/flat virtual" MMU model is a relic from the days of the IBM 360 and the VAX-11/780, and it is utterly inefficient and unsuitable for what we do in userland these days.
As Robert has shown with CHERI, there is plenty of space to innovate without breaking existing code.
And yes, C can be object-oriented; all you have to do is keep your hands off the primitives which are necessary to access hardware directly.
Architecturally, GPUs are a superoptimized distraction, like the vector units on Cray and Convex computers were 40-50 years ago, but those too can function in a non-flat address space.
But even C in OO-mode, and C++, Go, Rust, PHP and for that matter Lisp and Smalltalk, would benefit from an HW/MMU architecture which focused on delivering fast object service, rather than flat address-spaces which software must then convert into objects.
But to innovate, we must first realize that we are currently stuck in a box, and dare to look outside it.
Interesting, but neither (esp. the latter) was known as a racehorse.
I'd prefer to keep the hardware simple and fast and push the complexity into the software, and prove stuff.
> would benefit from an HW/MMU architecture which focused on delivering fast object service, rather than flat address-spaces which software must then convert into objects.
That conversion may not be cheap (edit: badly phrased, the object-mapping process and hardware may not be cheaper, i.e. faster, than the mapping the MMU does for conventional memory). Can you explain how it would be done such that it is faster than the current mapping on the common/hot/optimistic path, and how it would not be worse than it is now on the rare/cold/pessimistic path? And how would it behave on average, between those two extremes?
And why objects everywhere would be better all-round?
Probably if you haven't heard of CHERI and can't be bothered to Google it when one of the most respected systems architects around tells you it's worth looking at, you aren't going to put in the effort needed to read the CHERI tech reports so you can have an informed opinion on the performance cost of putting this kind of protection into hardware. And if the only historical "OO" processors you can think of are Rekursiv and iAPX432, and not the Dorado or the Symbolics line or the B5000/B5500/ClearPath line, it sounds like you have a lot of reading to do to get to having an informed opinion.
The iAPX 432 arguably wasn't bad because of its OOP orientation, but because of a number of issues in how Intel went about executing the idea. To the point that if they had mismanaged a "normal" CPU this way, they would have bungled it similarly.
That way innovation can be much faster, because applications can generally move quicker than kernels.
Btw, I'm not a fan of object orientation; and I don't think our hardware design should be infected by that fad. But I think your criticism of the badly fitting abstraction of flat address spaces is still valid. I am just not sure that 'fast object service' is necessarily the remedy.
I'm not a fan of any architectural radicalism, and tend to think that there are things best done in hardware, in the kernel, and in libraries and applications :-)
That is not to say that the boundaries should be cast in stone; they should obviously be flexible enough that you do not need a complete multi-user management system in a single-service jail or container, nor a full-blown journaled COW storage manager on a small embedded system.
In other words: I am firmly for the "Software Tools" paradigm.
I am very sympathetic to this argument overall and trace the hardware industry's failure back to the spread of C, UNIX, and worse-is-better.
With Wasm I see an opportunity to virtualize away the old world of C and linear address spaces. While we designed it to be low level and sandboxed to get C/C++ and Rust on board, I and others have always had in mind a future world where Wasm has managed (GC'd) data, first-class typed functions, and more. Those features should support a wide variety of source languages.
Wasm should become the new, final ISA. It should become like Unicode; the final, most widely-supported[1] format for executable code. When all languages and programs can run easily on Wasm, then hardware can easily be swapped out.
[1] Sure, Unicode has flaws, and it doesn't support everything equally well. But it had the property that basically everyone dropped everything else in favor of it, because it gained all the momentum.
A lot of this is due to the hardware architecture itself. The software abstractions dictated/limited by the HW itself cause many of the risks!
If you designed BOTH HW and SW up to and including the OS, you _might_ have a chance to control the risks better. But by the very separation of duties and roles, papered over by abstraction itself, you create problems. ALL abstractions throw away information and eventually those abstractions bite you in the ass.
This was the case with digital logic once HW speeds rose to a critical level: suddenly the reality that digital is merely an abstraction on top of analog, and that lumped-model analog is itself an abstraction, started to show, and the digital abstraction failed along with it.
We definitely can have, and have had, the same failure with the von Neumann architecture. There's NOTHING magical about it that immunizes it against model-abstraction failure, and it can create "intrinsic failures" that can never be fixed, thanks to Gödel's incompleteness theorem.
It’s a big thread (by virtue of being a cool piece), so maybe someone said this already, but isn’t there kind of a middle ground where we let existing software continue to run but acknowledge a changing hardware landscape by really cranking up (default) page sizes?
Userland allocators already work pretty hard to hit in the TLB [1], but huge-page tuning and whatnot is, to your point, only hitting the sweet spot on modern gear via effort/luck.
Channeling my inner Linux, defaulting to larger pages means a drastic loss in memory efficiency when mmaping lots of small files as happens when someone compiles the Linux kernel. If you've got a 2kb file and 4kb pages then half the memory you allocate when the file is paged in is wasted. For larger pages that goes way up.
A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.
I don't think both layers are "filesystem" in the same sense. We don't need the lower filesystem to provide permissions, ownerships, time stamps, directories, symbolic links and such.
Re: linear
A machine address is a word of bits. A word of bits always has a linear interpretation as a binary number. For instance if we have a 16 bit segment ID, and a 32 bit offset, then we can still pretend that it's a 48 bit linear address. We can compare two pointers for inequality, for instance: p0 < p1, as 48 bit words. That space may be sparsely populated, but that doesn't make it nonlinear; the spaces you are calling linear can also be sparsely populated and have numeric conventions about regions.
You say physical memories are linear, but they are also sparsely populated in the same way: such and such a range is a ROM, such and such a range is certain memory mapped registers, DRAM is over there. Generally speaking, hardware treats some part of an address as an ID that it recognizes, and then the bits below that as an offset. When there is a bus read or write, if the ID part matches that hardware device, then it selects itself for that I/O operation, and uses the offset bits to provide access to the right thing. So physical memory is arguably nonlinear; it is like a subclassed IP address space.
Physical addresses can have bits which are not used for addressing; e.g. a certain range of memory might be available as uncached if you enable a certain upper bit in the address. That looks like linear, but with aliasing between distant regions. Linear virtual memory is sparsely populated; there are mapped pages and unmapped pages. Pages can alias: the same object can be mapped in multiple places, so a write here can be read there.
If you want to split an address into an object ID and offset, you have to gamble about how many bits you need for each one. One application has hundreds of millions of tiny objects: it wants a big object ID part of the address, and a small offset. Another one has a small number of huge objects: it doesn't care for large object ID, but wants big offsets. Either you make that configurable (slow gets even slower), or else waste bits on making both spaces larger at the same time, perhaps ending up with wasteful 128 bit pointers on a system where 64 is more than enough.
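As a rough sketch of that gamble, assume a 64-bit address split at a fixed, compile-time boundary (the widths here are invented purely for illustration):
    #include <stdint.h>
    #include <stdio.h>

    /* The split is chosen once; every program pays for it whether it fits or not. */
    #define OFFSET_BITS 32u                                /* assumption: at most 4 GiB per object */
    #define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

    static uint64_t object_id(uint64_t addr) { return addr >> OFFSET_BITS; }
    static uint64_t offset_of(uint64_t addr) { return addr & OFFSET_MASK; }

    int main(void) {
        uint64_t addr = (UINT64_C(42) << OFFSET_BITS) | 0x1000;   /* object 42, offset 0x1000 */
        printf("id=%llu offset=0x%llx\n",
               (unsigned long long)object_id(addr),
               (unsigned long long)offset_of(addr));
        /* An app with hundreds of millions of tiny objects wants more ID bits; one with a few
           huge objects wants more offset bits. The constant above cannot please both without
           growing the pointer, e.g. to 128 bits. */
        return 0;
    }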
All interpretations of the bits of an address above and beyond "pure binary number" add complication and overhead.
The hardware (e.g. DMA bus master) isn't going to understand objects; it will always want a simple address.
Re: C, C++, Go, Rust, PHP, Lisp, Smalltalk
No two implementations of these can agree with each other on what exactly is an object. Implementations will just carry their existing respective portable object representations into the non-linear model. Architectures without flat memory, like the JVM and WebAssembly, tend to only cause pain for Lisp and its ilk.
A Lisp is not going to want some weird object model from the operating system; it will want to do things like packing cons cells tightly together into one larger heap object. That heap object could be a segment. We had those; they are also from the 1960's. Operating systems started ignoring them, opting for just the demand-paging support of the VM hardware.
> A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.
I think you could argue that certain file system designs are better suited to different storage hardware. Maybe it's appropriate that the kernel runs different code depending on the underlying storage type?
We already have JEDEC standards for flash devices to report block layout, because it's important in microcontrollers where there is no adaptation layer. We could have an SSD/M.2 standard that reported that information, and then architecturally the kernel FS stack would probably split into a couple of layers: the 'top' that provides the filesystem features you're used to in something like ZFS, and the bottom storage layer that has a couple of different implementations for 'linear' and 'blocks'.
No, that is actually not what the Flash Adaptation Layer does.
The FAL's primary job is to make the flash array look like a disk, despite the fact that individual "sectors" are not individually rewritable.
To do this, the FAL implements what is for all practical purposes a filesystem, where the files are all a single sector long and where the filename is the sector number exposed by the "disk".
In other words: Now we have two filesystems on top of each other, one lying to the other, which does a lot of unnecessary work, because it is being lied to.
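A toy illustration of that inner "filesystem" (everything here is simplified and hypothetical): a map from the logical sector numbers the "disk" exposes to physical flash pages, where a rewrite goes to a fresh page and the old one is left for garbage collection.
    #include <stdint.h>

    #define NUM_SECTORS  1024u        /* logical sectors exposed by the "disk" */
    #define NUM_PAGES    2048u        /* physical flash pages (overprovisioned) */
    #define NO_PAGE      UINT32_MAX

    static uint32_t l2p[NUM_SECTORS];       /* logical-to-physical map: the "directory";
                                               an init pass would set every entry to NO_PAGE */
    static uint8_t  page_state[NUM_PAGES];  /* 0 = erased/free, 1 = live, 2 = stale */

    /* A rewrite never touches the old page: it claims a free page, repoints the map,
       and leaves the old page for the garbage collector to erase in bulk later. */
    static void ftl_write_sector(uint32_t sector, uint32_t free_page) {
        uint32_t old = l2p[sector];
        if (old != NO_PAGE)
            page_state[old] = 2;            /* stale, reclaimable */
        page_state[free_page] = 1;          /* live */
        l2p[sector] = free_page;
    }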
The Mill's memory model is one of its most interesting features IMO [1] and solves some of the same problems, but by going the other way.
On the Mill the whole processor bank uses a global virtual address space. TLB and mapping to physical memory happen at the memory controller. Everything above the memory controller is in the same virtual address space, including the L1-L3+ caches. This solves a lot of problems, for example: if you go out to main memory you're already paying ~300 cycles of latency, so having a large silicon area / data structure for translation is no longer a 1-cycle latency problem. Writes to main memory are flushed down the same memory hierarchy that reads come from and succeed as soon as they hit L1. Since all cache lines are in the same virtual address space you don't have to track and synchronize reads and writes across translation zones within the cache hierarchy. When you request an unallocated page you get the whole pre-zeroed page back instantly, since it doesn't need to be mapped to physical pages until writes are flushed out of L3. This means it's possible for a page to be allocated, written to, read, and deallocated without ever touching physical memory; the whole workload is served purely within the cache hierarchy.
Protection is a separate system (the "PLB") and can be much smaller and more streamlined since it's not trying to do two jobs at once. The PLB allows a process to give fine-grained temporary access to a portion of its memory to another process: RW, RO, WO, byte-addressed ranges, for one call or longer, etc. Processes get allocated available address space on start; they can't just assume they own the whole address space or start at some specific address (you should be using ASLR anyway, so this should have no effect on well-formed programs, though there is a legacy fallback).
The Mill model is kind of cool, but today, many peripherals (including GPUs and NICs) have the ability to dump bytes straight into L3 cache. This improves latency in a lot of tasks, including the server-side ones that the Mill is designed for. This is possible due to the fact that MMUs are above the L3 cache.
Honestly, I'm happy waiting for 4k pages to die and be replaced by huge pages. Page tables were added to the x86 architecture in 1985, when 1MB of memory was a ton of memory to have. Having 256 pages worth of memory in your computer was weird and exotic. Fast forward to today, and the average user has several GB of memory - mainstream computers can be expanded to over 128 GB today - and we still mainly use 4k pages. That is the problem here. If we could swap to 2M pages in most applications, we would be able to reduce page table sizes by a factor of 512, and they would still be a lot larger than page tables when virtual memory was invented. And we wouldn't waste much memory!
But no, 4k pages for backwards compatibility. 4k pages forever. And while we're at it, let's add features to Linux (like TCP zero copy) that rely on having 4k pages.
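For what it's worth, an application can already opt in per mapping on Linux; a minimal sketch (Linux-specific, and the MADV_HUGEPAGE hint only matters if transparent huge pages are enabled in 'madvise' or 'always' mode):
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 21;   /* 2 MiB: one huge page instead of 512 4 kB pages */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise");    /* non-fatal: the kernel may lack THP support */

        ((char *)p)[0] = 1;       /* touch the page so it actually gets faulted in */
        munmap(p, len);
        return 0;
    }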
Once on an embedded system, we found 30% of our processing time was spent handling TLB misses.
We also didn't have access to the code that set up the page tables. Somehow I got the linker to emit a function hook which I used to defrag the page tables after they were written but before the MMU was enabled.
I remember thinking, "This is not what I learned about in school."
This article compares CHERI to an 80's computer, the Rational R1000 (which I'm glad to know of). It's worth noting that CHERI's main idea was explored in the 70's by the CAP computer[1]. CAP and CHERI are both projects of the University of Cambridge's Computer Lab. It's fairly clear that CAP inspired CHERI.
The original machines like that were the Burroughs 5000 (1961), and the Burroughs 5500 (1964), which was quite successful. Memory was allocated by the OS in variable length chunks. Addresses were not plain numbers; they were more like Unix paths, as in /program/function/variable/arrayindex.
That model works, but is not compatible with C and UNIX.
For a micro, see the iAPX 432. It was considered over-complicated for its time, but if CHERI catches on, perhaps it was simply far ahead of its time.
I'm a little confused about how the object base is looked up in these systems, whether they're sparse or dense, whether they have any size or total-object-count limitations, and whether that ends up having the same limitations on total count as the page tables that required the current multi-level approach.
Surely you could consider a page table as effectively implementing a fixed-size "object cache"? It is just a lookup for an offset into physical memory, after all, with the "object ID" just being the masked first part of the address. And if the objects are variable-sized, is it possible to end up with physical address fragmentation as objects of different sizes are allocated and freed?
The claim of single-cycle lookups today would require an on-chip, fixed-size (and small!) fast SRAM, as there's a pretty hard limit on the amount of memory you can read in a single clock cycle, no matter how fancy or simple the logic behind the lookup. If we call this area the "TLB", haven't we got back to page tables again?
And for the size of the SRAM holding the TLB/object-cache entries: increasing the amount of data stored in each entry means you have fewer entries in total too. A current x86_64 CPU supports 2^48 bytes of physical address space, reduced to 36 bits if you know it's 4k aligned, and 2^57 of virtual address space as the tag, again reduced to 45 bits if we know it's 4k aligned. That means storing the tag and physical address needs a total of 81 bits of SRAM. A 64-bit object ID plus a 64-bit physical address plus a 64-bit size is 192 bits, over 2x that, so you could pack 2x the number of TLB entries into the same SRAM block. To match the capabilities of the example above, 57 bits of physical address (which cannot be reduced, as arbitrary sizes mean it's not aligned), plus a similarly reduced 48-bit object ID and size, still adds up to 153 bits, only slightly less than 2x. I'm sure people could argue that reducing the capabilities here has merit; I don't know how many objects, or of what maximum size, such a system would need. And that's the "worst case" of 4k pages for the page-table system too.
I can't see how this idea could be implemented without extreme limitations - look at the TLB size of modern processors and that's the maximum number of objects you could have while meeting the claims of speed and simplicity. There may be some advantage in making them flexible in terms of size, rather than fixed-size, but then you run into the same fragmentation issues, and need to keep that size somewhere in the extremely-tight TLB memory.
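To make the comparison concrete, here is a rough sketch, with an entirely hypothetical entry layout, of what a single "object cache" entry and its lookup might look like; a page-table TLB entry is the degenerate case where every object has the same fixed size:
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical object-cache entry: roughly 192 bits before any packing tricks. */
    struct obj_entry {
        uint64_t object_id;   /* tag */
        uint64_t phys_base;   /* arbitrary byte alignment, so few bits can be shaved */
        uint64_t size;        /* needed so an offset can be bounds-checked */
    };

    /* Translate (id, offset) -> physical address, or fail on a miss or bounds violation.
       A conventional TLB does the same job with the size fixed at 4096. */
    static bool obj_translate(const struct obj_entry *e, uint64_t id,
                              uint64_t off, uint64_t *phys) {
        if (e->object_id != id || off >= e->size)
            return false;
        *phys = e->phys_base + off;
        return true;
    }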
So I commented on this a bit elsewhere, but the whole object business is irrelevant to how the address translation hardware in this machine actually works. While the subfields of the address are exploited to optimize the hash function used, the hardware is otherwise agnostic about what the upper bits of the address mean. The TLB is just huge relative to the amount of memory the machine had, such that there's one entry for each physical page in the system, and it deals with collisions in the TLB by evicting pages to disk.
> Surely you could consider a page table as effectively implementing a fixed-size "object cache"? It is just a lookup for an offset into physical memory, after all, with the "object ID" just being the masked first part of the address. And if the objects are variable-sized, is it possible to end up with physical address fragmentation as objects of different sizes are allocated and freed?
Because that's only a base, not a limit. The right pointer arithmetic can spill over to any other object base's memory.
Maybe? If you just round up each "object" to 4k then you can implement this using the current PTE on x86_64, but this removes the (supposed) advantage of only requiring a single PTE for each object (or "object cache" lookup entry or whatever you want to call it) in the cases when an object spans multiple page-sizes of data.
Having arbitrarily sized objects will likely be possible in hardware: it's just an extra size being stored in the PTE if you can mask out the object ID from the address (in the example in the original post it's a whole 64-bit object ID, allowing a full 64 bits of offset within each object, but totaling a HUGE 128-bit effective address).
But arbitrary sizes feel like they push the issues that many current userspace allocators have to deal with today into the hardware/microcode, namely packing to cope with fragmentation and similar (only instead of virtual address space they'll have to deal with physical address space). The solutions to this today are certainly non-trivial and can still fail in many ways, so they are far from solved, let alone solved in a way simple enough to be implemented that close to the hardware.
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
Well, GPU code is certainly not object-oriented, and I hope it never becomes that. SIMD code won't be able to jump between objects like typical CPU-oriented OOP does (unless all objects within a warp/workgroup jump to the same function pointers?)
GPU code is common in video games. DirectX needs to lay out its memory very specifically as you write out the triangles and other vertex/pixel data for the GPU to later process. This memory layout is then memcpy'd across PCIe using the linear address space mechanism, and GPUs are now coherent with this space (thanks to Shared Virtual Memory).
So today, thanks to shared virtual memory and advanced atomics, we can have atomic compare-and-swap coordinate CPU and GPU code operating over the same data (and copies of that data can be cached in CPU-ram or GPU-VRAM and transferred over automatically with PCIe memory barriers and whatnot).
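As a CPU-side illustration (conceptual only; the GPU side would be the matching atomic in CUDA/OpenCL on the same shared allocation), the coordination is just an ordinary compare-and-swap on memory both devices can see:
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical flag living in memory visible to both CPU and GPU,
       e.g. a shared-virtual-memory allocation. 0 = free, 1 = CPU owns, 2 = GPU owns. */
    static _Atomic uint32_t lock_word;

    static bool cpu_try_acquire(void) {
        uint32_t expected = 0;
        return atomic_compare_exchange_strong(&lock_word, &expected, 1);
    }

    static void cpu_release(void) {
        atomic_store(&lock_word, 0);   /* GPU-side code can now CAS 0 -> 2 */
    }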
----------
Similarly, shared linear address spaces operate over RDMA (remote direct memory access), a protocol family typically carried over InfiniBand or Ethernet (RoCE). This means that your linear memory space is mmap'd on your CPU, but actually refers to someone else's RAM across the network; the mapping turns these supposedly "inefficient pointer traversals" into network packets that share RAM between CPUs.
Ultimately, when you start dealing with high-speed data sharing between "external" compute units (i.e. a GPU, or an Ethernet-connected far-away CPU), rather than "just" a NUMA node or other nearby CPU, the linear address space seems ideal.
--------
Even the most basic laptop, or even Cell Phone, these days, is a distributed system consisting of a CPU + GPU. Apple chips even have a DSP and a few other elements. Passing data between all of these things makes sense in a distributed linear address space (albeit really wonky with PCIe, mmaps, base address pointers and all sorts of complications... but they are figured out, and it does work every day)
I/O devices working directly in memory are only going to become more common. 100 Gbps network connections exist in supercomputer labs, and 10 Gbps Ethernet is around the corner for consumers. NVMe drives are pushing I/O to bandwidths that would make DDR2 RAM blush. GPUs are growing more complicated and are rumored to start turning into distributed chiplets soon. USB 3.0 and beyond are high-speed links that drop data directly into linear address spaces (or so I've been told). Etc. etc.
Thing is that I think TFA author isn't speaking of "object-oriented" in the typical sense. I believe it's more about how the VM paging system and cache and etc. is really working with abstract segment "objects" that are blobs of memory but presented to the process as a single linear address space. I don't think the author means OO as in "Smalltalk" or "Java" [methods + objects + inheritance] but more "object based" in a more generic sense of "there's blobs of stuff with abstract handles pointing to them" that then gets painted to the process as if it's a single linear sequence of bytes when in the actual machine... it's not.
Since "pretty much everything today is object-oriented" is such an important point in the article, I feel the author does his readers and himself a disservice by not defining what he means by object-oriented, especially if he uses it in a different sense than everybody else.
GPU stuff has been "object oriented" for a long time. The ability to just arbitrarily address data as bytes from shaders is relatively new and modern and still constrained.
If you want to talk to a GPU via vulkan or d3d12 you're going to navigate a maze of descriptor sets, pipeline objects, buffers of various types, samplers, etc. These are all Objects with distinct types and data and the GPU hardware will interact with them in specific ways, even if it's also a general purpose computing device that will run some shader code you throw at it. When I was writing bare-metal graphics code for the PS4 there was still a lot of Object Orientation there too even if there weren't vtables or a garbage collector involved.
Seeing RAM as a collection of linear addresses on GPUs (especially as shared virtual memory, pinned memory, or other such GPU/CPU virtual memory sharing features) is a feature from 2010-era, be it from OpenCL, DirectCompute, or CUDA.
DirectCompute just sees the data as a collection of bytes.
> Well, GPU code is certainly not object-oriented, and I hope it never becomes that.
Jonathan Blow has been experimenting a lot with arrays of structs versus structs of arrays, particularly in making them both first class instead of assuming one and ignoring the other. This is essentially the column- versus row-oriented data question answered with a question: why not both?
I think if you had a first class language that supported both, that you wouldn't mind so much if GPU code became object oriented.
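A tiny illustration of the two layouts in plain C (field names invented); a language with first-class support for both would let you flip between them without rewriting the access code:
    #include <stddef.h>

    #define N 1024

    /* Array of structs: good when you touch all fields of one entity together. */
    struct particle { float x, y, z, mass; };
    static struct particle aos[N];

    /* Struct of arrays: good for SIMD/GPU passes that stream one field. */
    static struct { float x[N], y[N], z[N], mass[N]; } soa;

    static void scale_mass_aos(float k) {
        for (size_t i = 0; i < N; i++)
            aos[i].mass *= k;     /* strided access: 16-byte stride */
    }

    static void scale_mass_soa(float k) {
        for (size_t i = 0; i < N; i++)
            soa.mass[i] *= k;     /* contiguous access: vectorizes trivially */
    }

    int main(void) { scale_mass_aos(2.0f); scale_mass_soa(2.0f); return 0; }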
I haven’t heard much about JAI for a few years, is this idea catching on?
I don’t know why not both; is it maybe because one of the two layouts will almost always have worse performance for any given access pattern? I can see uses for having both for the non-critical-path parts, though maybe once you’re indexing it in the awkward but perf-friendly way, you’ve already done the hard work and there’s little point to providing the other style of indexing? (Isn’t the best perf always found using whatever layout & indexing is the most difficult… there must be some eponymous law for this, right?) I guess it is possible to have multiple critical paths each favoring a different layout. Not sure how often I’ve ever seen that in practice; I would guess it’s probably rather rare.
In this case GPUs are actually the perfect argument in favor of the article. GPUs only speak in objects, never linear addresses. You allocate a type of object and that allocation just now is that type. eg, a texture is always a texture, never a vertex buffer. Which is a different object altogether. And you never work with addresses. You can point objects at other objects. You can point at an offset from an object. But you can never do arbitrary pointer manipulation, and you never have any idea if the address space is linear or not.
By making pointers the same on CPU and GPU (by allowing their 64-bit address spaces to be identical, sharing the same memory regions and using PCIe to keep those regions in sync), you can perform high-speed CPU/GPU communication of even linked data structures (linked lists, trees, graphs).
GPUs use many linked data structures: octrees accelerate bounds testing, BVH trees help raytracing, etc. The GPU addresses must be linear because they're synchronized with the CPU's memory space.
I think he's advocating fitting a high-level language like Ada into the kernel or the CPU(?), with one global address space and no separate per-process address spaces for memory protection, relying instead on the high-level language to provide memory protection. That's where his bizarre hyping of an "object oriented" address space comes from.
It has been done. See the Singularity project from Microsoft Research, which used C# as the language to provide memory protection: no processes, all programs running in the same global memory space, and all of them running in ring 0. It was a fun research project, but never really made it out. There were other research projects like it.
Also his (object, offset) addressing space is essentially a segmented memory model. The object id is the segment id. I bet the object id is in a linear space.
It's been done way earlier than MSFT's Singularity OS: see e.g. Per Brinch Hansen's book "The Architecture of Concurrent Programs" (from 1977), where he writes an OS in Pascal with exactly this approach to process separation.
Heck, I'd argue everyone who writes their own toy OS should probably start with this approach and then retrofit memory protection later, because it simplifies things tremendously and you get something actually usable much earlier (context switching between address spaces is not conceptually hard, but it is fiddly, especially on x86).
Memory protection is nice to have because then you're likely to get crashes when you screw up pointer stuff. And crashes are easier to debug than writes to the wrong place.
Of course, you can do identity memory mapping and skip virtual vs physical questions (for x86, memory protection requires using virtual memory, because the protection information is in the page tables, and iiuc, amd64 also requires using virtual memory, but it's still a useful simplification to do identity mapping so the virtual address is the physical address). My toy OS only runs one process, so there was never going to be more than one memory map anyway.
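A sketch of what identity mapping looks like in a toy x86-64 kernel, assuming 2 MiB pages and a single page directory covering the first 1 GiB (the flag values are the standard present/writable/page-size bits; everything else here is simplified):
    #include <stdint.h>

    #define PTE_PRESENT  0x001ULL
    #define PTE_WRITABLE 0x002ULL
    #define PTE_HUGE     0x080ULL   /* PS bit: this PD entry maps a 2 MiB page directly */

    /* One page directory: 512 entries x 2 MiB = 1 GiB, identity mapped. */
    static uint64_t page_dir[512] __attribute__((aligned(4096)));

    static void identity_map_first_gib(void) {
        for (uint64_t i = 0; i < 512; i++) {
            /* virtual address == physical address == i * 2 MiB */
            page_dir[i] = (i << 21) | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE;
        }
        /* The PDPT/PML4 levels above this, and loading CR3, are omitted here. */
    }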
> It has been done. See the Singularity project from Microsoft Research, which used C# as the language to provide memory protection: no processes, all programs running in the same global memory space, and all of them running in ring 0. It was a fun research project, but never really made it out.
The problem is basically the choice of C#. If today people used Wasm, with no process distinctions, the same address space, and everything in ring 0, it would already be a much more successful project.
Simple: we don't want some low level kernel memory management dictating what constitutes an "object".
Not everything is object-oriented, e.g. large arrays and memory-mapped files, including executables and libraries.
Linear memory sucks, but every other organization sucks more.
Segmented has been done; the benefit-to-clunk ratio was negligible.
[0] https://news.ycombinator.com/item?id=9807777
( one reference to thunks involving segmented memory: https://devblogs.microsoft.com/oldnewthing/20080207-00/?p=23... )
Other OO processors: Rekursiv (https://en.wikipedia.org/wiki/Rekursiv) and that famously slooow intel iAPX 432 (https://en.wikipedia.org/wiki/Intel_iAPX_432)
Are you familiar with exokernels? They were an attempt to remove abstractions from kernel land and leave that to applications.
See https://www.classes.cs.uchicago.edu/archive/2019/winter/3310...
[1] https://engineering.fb.com/2011/01/03/core-data/scalable-mem...
CHERI has shown that this kind of fundamental architectural improvement can happen with very little impact on running code.
About the R1000:
We have pretty comprehensive user-side documentation of the R1000, but very, very little from the system/vendor side, so lots of things we simply do not know yet.
We have digitized everything we have here:
https://datamuseum.dk/wiki/Bits:Keyword/RATIONAL_1000
And our top-level wiki-page for the project is here:
https://datamuseum.dk/wiki/Rational/R1000s400
All the doc we have about the hardware type/object stuff and instruction set is in these course-slides:
https://datamuseum.dk/bits/30000916
If you are into data-archaeology and lack for challenges, we have several good outstanding questions for research. For instance the layout of the object-store/filesystem.
If you want to see an R1000 running, and experience the world's first truly semantic IDE, come to Datamuseum.dk just outside Copenhagen, because we have the only (approximately 1.83) running R1000 computers.
(We also just started fundraising for a new permanent building for our huge collection; we are at €113K of our €3M goal. See the top right corner of the homepage, or email us.)
We know of only four surviving computers: we have two, one is privately owned in NZ, and IBM donated one to CHM. The rest have been shredded because of classified mil-spec workloads.
If you are local to/affiliated with CHM, and are interested/allowed, we would love to know more about their machine and, if possible, help get that one running too.
PS: Here is a piece of Ada source code:
http://datamuseum.dk/aa/r1k_backup/13/1329b5ea7.html
And this may be what it compiles into:
http://datamuseum.dk/aa/r1k_backup/85/85b414c73.html
[1]: My previous comment: https://news.ycombinator.com/item?id=27952660
If you have any interest at all in the design of CPUs and instruction sets, I recommend viewing the entire series of video lectures by Ivan Godard:
https://www.youtube.com/playlist?list=PLFls3Q5bBInj_FfNLrV7g...
Warning: Viewing these is likely to make you at least a little bit upset with the current state-of-the-art architectures.
[1] https://en.wikipedia.org/wiki/CAP_computer
That's usually the case. For hardware, at least
For software, it usually was done before by Lisp in the '70s.
Also interesting to me was the idea in Darek Mihocka's blog NO EXECUTE! about using emulation to implement advanced processor features rather than forcing everything through a one size fits all hardware policy. Of course emulation will always have a performance hit, but the concept is interesting.
http://emulators.com/nx_toc.htm
Doesn't that imply the minimum-sized object requires 4K physical ram?
Is that a problem?
https://www.gnu.org/software/libc/manual/html_node/Obstacks....
“An obstack is a pool of memory containing a stack of objects. You can create any number of separate obstacks, and then allocate objects in specified obstacks. Within each obstack, the last object allocated must always be the first one freed, but distinct obstacks are independent of each other.
Aside from this one constraint of order of freeing, obstacks are totally general: an obstack can contain any number of objects of any size. They are implemented with macros, so allocation is usually very fast as long as the objects are usually small. And the only space overhead per object is the padding needed to start each object on a suitable boundary.”
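A minimal usage sketch of that glibc API (the chunk-allocation macros must be defined before the obstack macros are used):
    #include <obstack.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* obstack gets its underlying chunks from these two macros. */
    #define obstack_chunk_alloc malloc
    #define obstack_chunk_free  free

    int main(void) {
        struct obstack ob;
        obstack_init(&ob);

        /* Many small allocations packed tightly into the same chunks. */
        char *a = obstack_alloc(&ob, 16);
        char *b = obstack_copy0(&ob, "hello", 5);   /* copies 5 bytes and adds a NUL */
        printf("%p %s\n", (void *)a, b);

        obstack_free(&ob, a);      /* frees a and everything allocated after it */
        obstack_free(&ob, NULL);   /* releases the whole obstack */
        return 0;
    }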
Shared Virtual Memory is literally "GPU sees X region the same as the CPU sees it". And is implemented on all desktop GPUs today: https://www.intel.com/content/www/us/en/developer/articles/t...