exDM69 · 10 years ago
I did something along the lines of his suggestion using OpenGL sparse textures (D3D would call them tiled resources) and persistently/coherently mapped buffers, with disk I/O done through memory-mapped files. It's a rather crude proof of concept for on-demand loading of large textures (I used a 16k x 8k satellite image). I didn't properly handle "page faults", but I had some of the mechanisms implemented (the demo outputs yellow pixels where a fault occurs instead of servicing it).

To make it work fully end-to-end, it would look something like this:

    1. Shader samples from a sparse texture, and detects that the requested page is non-resident.
    1b. Fall back to lower mip map level.
    2. Shader uses atomic-or to write to a "page fault" bitmap (one bit per page; see the sketch after this list)
    3. The bitmap is transferred to the CPU
    4. For each set bit, start async copy from disk to DMA buffer (i.e. a pixel buffer object in GL)
    5. When disk i/o is complete, start texture upload from buffer to a "page pool" texture
    6. When texture upload is complete, re-map the texture page from "page pool" to the actual texture
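
For illustration, here is a minimal GLSL sketch of steps 1–2 (plus the 1b fallback), assuming ARB_sparse_texture2 for the residency query and an r32ui image as the fault bitmap. The uniform names, page size and the fixed fallback LOD are made up for the example, not taken from the demo:

    #version 450
    #extension GL_ARB_sparse_texture2 : require

    layout(binding = 0) uniform sampler2D uTex;           // the sparse ("tiled") texture
    layout(binding = 0, r32ui) uniform uimage2D uFaults;  // page fault bitmap, one bit per page
    uniform vec2 uPageSizeTexels;   // e.g. 128x128 texels for a 64 KiB RGBA8 page

    in vec2 vUV;
    out vec4 fragColor;

    void main() {
        vec4 texel;
        // 1. Sample and get a residency code back instead of undefined data
        int code = sparseTextureARB(uTex, vUV, texel);
        if (!sparseTexelsResidentARB(code)) {
            // 1b. Fall back to a coarser mip level that is kept resident
            texel = textureLod(uTex, vUV, 4.0);

            // 2. Atomically set the bit for the missing page so the CPU can stream it in
            ivec2 page = ivec2(vUV * vec2(textureSize(uTex, 0)) / uPageSizeTexels);
            imageAtomicOr(uFaults, ivec2(page.x >> 5, page.y), 1u << (page.x & 31));
        }
        fragColor = texel;
    }
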
Now this approach works alright, but there are a number of issues that make it impractical for the time being. Off the top of my head:

    1. Sparse textures are only supported on Nvidia and AMD hardware. Not Intel, ARM or IMG.
    2. Requires Vulkan or D3D12 for step #6 (the demo doesn't do this, so there may be pipeline stalls)
    3. One or two frames of latency that could only be avoided if this were done in the kernel-mode driver.
    4. Poor fit for existing KMD architecture (which has its own concept of residency)
    5. Detecting page faults is easy. Detecting which pages can be dropped is hard.
Here's the source code of my demo. It's not pretty because it was a one-off demo project for very specific hardware (Android + OpenGL 4.5, which means Nvidia Shield hardware with Maxwell GPUs). The technique is portable, though.

https://github.com/rikusalminen/sparsedemo/blob/master/jni/g...

(In the code above, all the interesting bits are the functions named xfer_*)

Based on the experience from writing this demo, I have to agree with Carmack here. File-backed textures would make a lot of sense for a lot of use cases.

yassim · 10 years ago
Thanks for the example code.
Retr0spectrum · 10 years ago

    Splash screens and loading bars vanish. Everything is just THERE.
I'm not sure I agree with this. It might be more convenient to have a filesystem-like interface, but at the end of the day everything still has to be loaded into the (rather limited) GPU memory at some point.

Most CPU applications can handle RAM swapping from disk, but I really doubt that big games could maintain 60 fps if even a few assets needed reloading.

If a frame is 16ms, and the best consumer SSDs are around 2GB/s, you can only load 33MB of assets in one frame in a best-case scenario.

exDM69 · 10 years ago
> If a frame is 16ms, and the best consumer SSDs are around 2GB/s, you can only load 33MB of assets in one frame in a best-case scenario.

33 MB of assets per frame is more than enough for a lot of practical tasks. Modern texture compression achieves 2 bits per pixel (e.g. ASTC 8x8) with very little perceivable loss of quality once texture mapping, lighting, effects, etc. are applied (ASTC 12x12 gets down to 0.89 bits per pixel, but with noticeable loss of quality). At this rate, 33 MB/frame is a 16k x 8k texture's worth of data, every frame. Vastly more than there are pixels on screen.

And this figure does not take the block cache into account at all. If the data has been used recently (within several minutes, depending on RAM size), a copy of it might still be around in RAM, which would make the load nearly instantaneous and allow almost a gigabyte of data per frame (assuming all memory bandwidth is put to this use).

As for the framerate issues, it would be imperative that this technique is implemented in a stall-free way (which is possible with Vulkan/D3D12) so that a steady 60 fps is maintained. All textures should have at least a few mip map levels resident at all times to fall back on. A single "standard" GPU sparse page (64k) is 512x512 texels at ASTC 8x8, which is quite a large texture already.
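
To make the arithmetic above concrete, here is a tiny C sketch (the helper name is mine, just for illustration). ASTC 8x8 stores one 16-byte block per 8x8 texels, i.e. 2 bits per texel:

    #include <stddef.h>

    /* Bytes for one mip level compressed as ASTC 8x8:
     * one 128-bit block per 8x8 texel tile = 2 bits per texel. */
    size_t astc_8x8_bytes(size_t width, size_t height)
    {
        return ((width + 7) / 8) * ((height + 7) / 8) * 16;
    }

    /* astc_8x8_bytes(16384, 8192) == 33,554,432 bytes -- the 33 MB/frame figure
       astc_8x8_bytes(  512,  512) ==     65,536 bytes -- one 64 KiB sparse page */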

In other words: 33 MB per frame is a lot more than what we get today.

dogma1138 · 10 years ago
The GPU doesn't have DMA to the disk or RAM; it has to go through the CPU to access them, which causes quite a delay.

Even if you read from RAM during normal operation, you'll get a very low framerate unless you are using pretty damn good latency masking. A good example is texture streaming, where you hold low-res textures in GPU memory and load the higher-res ones from RAM or disk; even on very high-end systems it causes a lot of texture pop-ins, which people find rather annoying.

hartror · 10 years ago
I suspect that if anyone has thought this through, it would be John Carmack. His game design skills can be called into question, but his 3D engine skills are second to none[1].

[1] As an individual. I suspect the reason we haven't seen as much from him since Quake 3 is that his skills don't scale to a team.

joakleaf · 10 years ago
> I suspect the reason we haven't seen as much from him since Quake 3

Not quite accurate. Since Quake 3 he made unified lighting and shadowing (for Doom 3) and MegaTextures (for Enemy Territory: Quake Wars and Rage).

I can't really think of any other major paradigm shifts in game graphics engines over the last 10-15 years.

There are lots of little improvements everywhere, but they seem more evolutionary than the revolutionary steps we saw in the 90s (texture mapping, raycasting, BSPs, PVS, lightmaps, sorted edge rasterization, bilinear filtering, Bézier surfaces, shaders, unified lighting/shadowing, beam trees, etc.).

ralfd · 10 years ago
> I suspect the reason we haven't seen as much from him since Quake 3 is that his skills don't scale to a team.

He did other stuff, like rocket design and all that VR/Oculus/Gear VR work.

hvidgaard · 10 years ago
Unless I'm understanding things the wrong way, he isn't saying it should be the only way, just the standard way. The majority of applications are not performance-sensitive, and mmap will be simpler and probably save resources (by leaving untouched assets on disk).

If profiling shows you're limited by page faults, then preloading by simply touching the memory will have it ready. It simplifies things, which is almost always a good thing.

taspeotis · 10 years ago
> You can only load 33MB of assets in one frame in a best-case scenario.

Back the GPU mapped resources with RAM. I have 16GB of it. That's plenty for one level's worth of assets.

GPU <=> RAM <=> SSD.

masklinn · 10 years ago
Wouldn't that work completely naturally with unified memory? You mmap your assets and refer to them from GPU code; the OS would logically load the mapped data into RAM, then push a subset of it to the GPU.

Deleted Comment

johncolanduoni · 10 years ago
I suspect we'll see more of this as 16+GB of RAM becomes more standard, at least for gamers. In the past it just wasn't a case that was worth optimizing for.
izacus · 10 years ago
Isn't that exactly how it's done right now?
vardump · 10 years ago
> If a frame is 16ms, and the best consumer SSDs are around 2GB/s, you can only load 33MB of assets in one frame in a best-case scenario.

I think page faulting can work a lot better for graphics assets (i.e. textures) than for general-purpose computing, because the system has special knowledge about the data and can avoid processing pauses by (temporarily) using lower-fidelity mipmap levels.

1) Not all of those assets are present in the current frame. 33 MB might cover the whole scene anyway.

2) A texture-aware page fault mechanism can transfer lower-detail mipmap levels first. A mipmap two detail levels lower is just 1/16th of the data, but it's still going to be good enough for those few tens of milliseconds until a better LoD can be loaded.

3) So in 3 frames (~50 ms) you have 100 MB. This probably covers the current scene pretty well.

So everything will just appear to be there. Your eyes won't have time to focus in the time it takes to load all the full-quality assets for the current scene, even if your SSD can only read 500 MB/s.

Remember, with page fault mechanism you don't need to load all assets, just those that are actually visible in the current scene. So initially there's less data to transfer.
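
A quick sketch of the size argument in points 2 and 3, assuming base-level data that halves in width and height per mip level (the helper name is mine, just for illustration):

    #include <stddef.h>

    /* Each mip level is half the width and half the height of the one above,
     * so its data shrinks 4x per level; two levels down is 1/16th the size. */
    size_t mip_level_bytes(size_t base_bytes, unsigned levels_down)
    {
        return base_bytes >> (2 * levels_down);
    }

    /* mip_level_bytes(33 << 20, 2) is roughly 2 MB -- easy to stream in well
       within a 16 ms frame, even on a modest SSD. */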

creshal · 10 years ago
Aren't textures loaded into memory compressed anyway? Even 33MB is a lot of data when compressed properly.
zamalek · 10 years ago
> can handle RAM swapping from disk

Slight caveat: Windows 10 rarely pages out to disk.[1] I'm not sure if it's possible to ask it to treat your mmaps in this way. Regardless, implementing the synchronization required to pull this off would be a nightmare - especially in Vulkan/DX12. The OS would also need some form of API where you are notified that an mmapped page is faulting. Except for the very top (AAA) studios it would likely be a completely unapproachable API. Still, it would be fascinating to see what the very best could do with it.

> You can only load 33MB of assets in a one frame in a best-case scenario.

Carmack indicates that "you could still manually pre-touch media to guarantee residence". Meaning that we're back to loading screens (hopefully shorter ones, though).

[1]: https://channel9.msdn.com/Blogs/Seth-Juarez/Memory-Compressi...

CyberDildonics · 10 years ago
At 2560x1600 with 8-bit color channels, a full uncompressed frame is only 12 MB.

2560 * 1600 * 3 = 12,288,000 bytes.

vardump · 10 years ago
That's assuming a lot.

1) Hardware doesn't usually support 3-byte pixels. It's pretty safe to assume at least 32 bpp, i.e. 4 bytes per pixel.

2) Hardware might only support power-of-two line pitches (the pitch is the number of bytes between rows of screen pixels). Horizontal resolution would still be 2560; the remaining pixels would just be hidden. That way the hardware can always rely on bit shifting when computing display addresses, and possibly other tricks.

3) Frame buffer might not even be in row-major format in the first place.

So 2560 * 1600 at 32 bpp might just as well be 4096 * 1600 * 4 = 25 MB. Or something else.

Deleted Comment

amelius · 10 years ago
This is only a concern for "videophiles", who insist that 60 fps is always guaranteed.
astrange · 10 years ago
Which is tricky, actually, because real video can come in 24fps or 59.94fps (60/1.001). Your TV can handle this but your computer monitor can't, so you lose frames all the time.
saynsedit · 10 years ago
I was facepalming too while reading this. I'd expect Carmack to understand the ugly truth behind mmap: it's not free. It's more than not free: demand paging actually adds significant cost, and TLB flushes aren't performance-friendly.

Linux people realized this and implemented MAP_POPULATE, but once you're doing that you might as well just eagerly populate the normal way.
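
For reference, a minimal Linux sketch of the two options being contrasted here (error handling trimmed; the wrapper name is made up):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map an asset file read-only. With `populate` set, MAP_POPULATE asks the
     * kernel to read the whole file in up front (eager population); without it,
     * the first touch of each page may stall on disk I/O (lazy demand paging). */
    static void *map_assets(const char *path, size_t *len, int populate)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return NULL; }
        *len = (size_t)st.st_size;

        int flags = MAP_PRIVATE | (populate ? MAP_POPULATE : 0);
        void *p = mmap(NULL, *len, PROT_READ, flags, fd, 0);
        close(fd);  /* the mapping stays valid after the fd is closed */
        return p == MAP_FAILED ? NULL : p;
    }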

mmap() can be a convenient interface when you aren't latency-sensitive, but otherwise it's not appropriate.

vardump · 10 years ago
You're right that mmap is not free.

It's just a lot better than the current status quo: loading more or less the full scene's assets into GPU RAM regardless of whether they're actually visible in the current frame or not.

Even if some texture is visible, it might be that only a tiny fraction of some particular mipmap level is actually needed.

TLB flush cost is pretty insignificant here. A bit like accidentally crushing your finger with a sledgehammer and complaining the hammer was a bit cold. Besides, you'd be talking about TLB flush cost on the GPU. Maybe GPUs can just hide all of that latency with a high number of hardware threads, just like they've hidden RAM latency for over 10 years.

Retr0spectrum · 10 years ago
Mirror for those who can't/won't use Facebook:

I have been advocating this for many years, but the case gets stronger all the time. Once more unto the breach.

GPUs should be able to have buffer and texture resources directly backed by memory mapped files. Everyone has functional faulting in the GPUs now, right? We just need extensions and OS work.

On startup, applications would read-only mmap their entire asset file and issue a bunch of glBufferMappedDataEXT() / glTexMappedImage2DEXT() or Vulkan equivalent extension calls. Ten seconds of resource loading and creation becomes ten milliseconds.

Splash screens and loading bars vanish. Everything is just THERE.

You could switch through a dozen rich media applications with a gig of resources each, and come back to the first one without finding that it had been terminated to clear space for the others – read only memory mapped files are easy for the OS to purge and reload without input from the applications. This is Metaverse plumbing.

Not that many people give a damn, but asset loading code is a scary attack surface from a security standpoint, and resource management has always been a rich source of bugs.

It will save power. Hopefully these are the magic words. Lots of data gets loaded and never used, and many applications get terminated unnecessarily to clear up GPU memory, forcing them to be reloaded from scratch.

There are many schemes for avoiding the hard stop of a page fault by using a lower detail version of a texture and so on, but it always gets complicated and requires shader changes. I’m suggesting a complete hard stop and wait. GPU designers usually throw up their hands at this point and stop considering it, but this is a big system level win, even if it winds up making some frames run slower on the GPU.

You can actually handle quite a few page faults to an SSD while still holding 60 fps, and you could still manually pre-touch media to guarantee residence, but I suspect it largely won’t be necessary. There might also be little tweaks to be done, like boosting the GPU clock frequency for the remainder of the frame after a page fault, or maybe even the following frame for non-VR applications that triple buffer.

I imagine an initial implementation of GPU faulting to SSD would be an ugly multi-process communication mess with lots of inefficiency, but the lower limits set by the hardware are pretty exciting, and some storage technologies are evolving in directions that can have extremely low block read latencies.

Unity and Unreal could take advantage of this almost completely under the hood, making it a broadly usable feature. Asset metadata would be out of line, so the mapped data could be loaded conventionally if necessary on unsupported hardware.

A common objection is that there are lots of different tiling / swizzling layouts for uncompressed texture formats, but this could be restricted to just ASTC textures if necessary. I’m a little hesitant to suggest it, but drivers could also reformat texture data after a page fault to optimize a layout, as long as it can be done at something close to the read speed. Specifying a generously large texture tile size / page fault size would give a lot of freedom. Mip map layout is certainly an issue, but we can work it out.

There may be scheduling challenges for high priority tasks like Async Time Warp if a single unit of work can create dozens of page faults. It might be necessary to abort and later re-run a tile / bin that has suffered many page faults if a high priority job needs to run Right Now.

Come on, let's make this happen! Who is going to be the leader? I would love it to happen in the Samsung/Qualcomm Android space so Gear VR could immediately benefit, but it would probably be easiest for Apple to do it, and I would be just fine with that if everyone else chased them in a panic.

amelius · 10 years ago
It seems to load fine in an incognito window.
nowprovision · 10 years ago
Thanks. I hope this is a one-off; if faceache becomes the article platform for techie material I may just change industries.
matthewmacleod · 10 years ago
I mean, the guy does work at Facebook. It seems churlish to complain that he uses his company's product to share material.
unixhero · 10 years ago
"This is a Facebook +Premium article. In order to access this content, you must sign in with a Facebook +Premium account[?]. [?] Facebook +Premium accounts are Facebook accounts where you have also confirmed your identify with your national passport [level1] and have your yearly retina scan access enabled and updated [level2]. As an option you may wish to access the Facebook extra features and free access to all services by enabling [level3] on your account by enabling location services and allowing us to store your GPS position throughout your day. A level 3 account will have access to location based and time based offers, content and Facebook friendship features that will simplify and improve your life. Level 4 access is not yet ready for us to offer you, but it will involve a small chip which you implant into your arm. This will let you get the full and unfettered Facebook experience without the need for any cellphone or other device!"
idunno246 · 10 years ago
I imagine it's only because he works at Facebook.
theandrewbailey · 10 years ago
Loads fine and is completely readable on Firefox + NoScript.
angch · 10 years ago
I'm not quite sure mmap is such a good idea if you're trying to have more low-level control over performance. It's weird that Carmack is advocating this, because you can't really guarantee the latency of grabbing any resource if you incur a fault and need to fetch it from disk.

See also the comments from https://news.ycombinator.com/item?id=8704911

HelloNurse · 10 years ago
He notes that reasonable hardware should have the performance margin to load a reasonable number of pages from a SSD without dropping a frame, which seems a very good plan. Looking forward to actual tests, of course.

Considering that prefetching schemes allow the programmer to spread asset loading evenly over many frames, and cheap rendering approximations can be used in troublesome frames, there should also be enough low-level control.

lazyjones · 10 years ago
> reasonable hardware should have the performance margin to load a reasonable number of pages from a SSD without dropping a frame

My disks are usually encrypted, though, and sometimes I can choose faster or slower encryption methods (thus affecting throughput when loading). I don't see how this can work reliably without forcing the user to reserve specific disk areas just for GPU assets.

robert_tweed · 10 years ago
Just guessing, but from my reading it sounds like the aim is to maintain generally good frame rates and not worry too much about dropped frames due to page faults, since those will be rare. Presumably the idea is to rely on ATW so that when frames do get dropped, it's imperceptible.
CyberDildonics · 10 years ago
It doesn't mean that everything has to be done like this, only that it would be an available feature. Even then you could touch memory to make sure it is available.
aphextron · 10 years ago
Am I alone in finding this an odd thing to be reading on Facebook?
tomlong · 10 years ago
Facebook is the new .plan!

Seriously though, the company he works for is owned by Facebook. This may be a factor.

Loque · 10 years ago
Not when you remember he works for them?
wiredfool · 10 years ago
"""... it would probably be easiest for Apple to do it, and I would be just fine with that if everyone else chased them in a panic. """
pawadu · 10 years ago
It seems to me that when Apple does something, it quickly becomes "accepted" by consumers (even if it is a technical thing and not consumer-facing). This is not always a bad thing for competitors.
wiredfool · 10 years ago
I think it's a combination of Apple owning enough of the stack to make it happen, and occasionally Apple's secrecy catching the rest of the industry flat-footed (see the 64-bit ARM transition).
jmount · 10 years ago
It does sound like memory mapped assets would be a great feature. One thing to read (not really an objection, just commentary that remains relevant) is "On the Design of Display Processors", Myer & Sutherland, Communications of the ACM, 1968, also called the wheel of reincarnation: http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland...
jokoon · 10 years ago
I recently upgraded from an Athlon II with 1.5 MB of L2 cache to a Core i5 with 6 MB of L3, and surprisingly, game loading is still just as slow. I guess that copying asset files into RAM doesn't result in a speedup?

So if I understand the problem right, it's because copying data to the GPU goes through the PCI Express bus, and is done "piece by piece" instead of in larger batches? A little like grouping draw calls? It's funny how that problem can be seen everywhere in hardware, where multiplying queries makes latencies snowball.

monocasa · 10 years ago
I think it has more to do with the fact that the GPU's memory accesses aren't cache coherent with the CPU, so a larger L2 doesn't really bring much to the table.
amscanne · 10 years ago
I think you want a different word here.

Generally DMA to/from the GPU is cache coherent (either via DMA snooping for cache invalidation, or software managing the regions used for DMA, e.g. marking the relevant PTEs as no-cache).

So accesses are _coherent_, but the cache is simply irrelevant (or even more costly, if it's using snooping).