A decade and a half ago there was PacketShader, which used usermode networking plus GPUs to do packet routing. It was a thrilling time. I thought this kind of effort to integrate off-the-shelf hardware and open-source software-defined networking (SDN) was only going to build and amplify. We do have some great SDN, but it remains a fairly niche world that stays largely behind the scenes. https://shader.kaist.edu/packetshader/index.html
I wish someone would take up this effort again. It'd be awesome to see VPP or someone target offload to GPUs again. It feels like there's a ton of optimization we could do today based around PCIe P2P, where the network card could DMA direct to the GPU and back out without having to transit main memory or the CPU at all; lower latency & very efficient. It's a long leap & a long hope, but I very much dream that CXL eventually brings us closer to that "disaggregated rack" model, where a less host-based fabric starts disrupting architecture and creating more deeply connected systems.
That said, just dropping an FPGA right down on the NIC is probably/definitely a smarter move. Seems like a bunch of hyperscalers do this. Unclear how much traction Marvell/Nvidia get from BlueField being on their boxes, but it's there. Actually using the FPGA is hard, of course. Xilinx/AMD have a track record of kicking out open source projects that seem interesting but don't seem to have any follow-through. Nanotube, an XDP offload engine, seemed brilliant, like a sure win. https://github.com/Xilinx/nanotube and https://github.com/Xilinx/open-nic .
I've been thinking about it exactly this way for a long time. Actually, once we're able to push computation down into our disk drives, I wouldn't be surprised if these "Disk Shaders" end up being written in eBPF.
Totally, and a sibling comment confirms it already exists! I hope the 'shader' name sticks too! I find the idea of a shader has a very appropriate shape for a tiny-program-embedded-in-a-specific-context, so it seems perfect from a hacker POV!
I have a VFX background (Houdini now, RSL shaders etc. earlier), plus some OpenCL dabbling and demoscene lurking, and based on that I think I prefer 'shader' to 'kernel', which is what OpenCL calls them.. but that conflicts somewhat with the name of, like, 'the OS kernel'.
This analogy works well when trying to describe how eBPF is used for network applications. The eBPF programs are "packet shaders": like pixel shaders, they are executed for every packet independently and can modify attributes and/or payload according to a certain algorithm.
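To make the analogy concrete, here's a rough sketch of what one of these "packet shaders" looks like in restricted C; the UDP port is just a placeholder and the filter logic is a toy, but the per-packet shape is the real thing:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int packet_shader(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Runs once per packet, right after the NIC driver receives it. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;                 /* truncated: let the kernel handle it */
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;
        if (ip->protocol != IPPROTO_UDP)
            return XDP_PASS;

        /* Assume no IP options, for brevity. */
        struct udphdr *udp = (void *)(ip + 1);
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;

        /* "Shade" the packet: drop UDP that isn't aimed at our game port. */
        if (udp->dest != bpf_htons(27015))
            return XDP_DROP;

        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";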
The name “shader” screwed me up for so long. But once I better understood what they really are I think they’re incredibly powerful. “Kernel shader” is amazing.
I've actually written a high-performance metaverse client, one that can usefully pull half a gigabit per second and more from the Internet. So I get to see this happening. I'm looking at a highly detailed area right now, from the air, and traffic peaked around 200Mb/s. This XDP thing seems to address the wrong problem.
Actual traffic for a metaverse is mostly bulk content download. Highly interactive traffic over UDP is maybe 1MB/second, including voice. You're mostly sending positions and orientations for moving objects.
Latency matters for that, but an extra few hundred microseconds won't hurt. The rest is large file transfers. Those may be from totally different servers than the ones that talk interactive UDP. There's probably a CDN involved, and you're talking to caches. Latency doesn't matter that much, but big-block bandwidth does.
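For a rough sense of why the interactive channel stays that small, here's a hypothetical per-object update; the sizes and rates are just illustrative, not from any particular engine:

    #include <stdint.h>

    /* One (hypothetical) moving-object update: id + position + orientation. */
    struct object_update {
        uint32_t object_id;
        float    pos[3];   /* 12 bytes */
        float    rot[4];   /* 16 bytes, quaternion */
    };                     /* 32 bytes, unquantized */

    /* Back of the envelope: 200 moving objects in view, 20 updates/second each:
     *   200 * 32 bytes * 20 Hz = 128 KB/s
     * so even with protocol overhead and voice you stay around 1 MB/s. */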
Practical problems include data caps. If you go driving around a big metaverse, you can easily pull 200GB/hour from the asset servers. Don't try this on "AT&T Unlimited Extra® EL". Check your data plan.
The last thing you want is game-specific code in the kernel. That creates a whole new attack surface.
I don't know how well this solves any game programmer's problem, but the attack-surface concern --- modulo the kfunc trick --- doesn't seem real: eBPF programs are ruthlessly verified, and most valid, safe C programs aren't accepted (because the verifier can't prove every loop in them is bounded and every memory access is bounded). It's kind of an unlikely place to expect a vulnerability, just because the programming model is so simplistic.
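A toy fragment showing the kind of thing the verifier insists on: a constant loop bound and an explicit bounds check before every access (on pre-5.3 kernels you'd also need #pragma unroll). The decision at the end is arbitrary, it's only there to illustrate the constraints:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int sum_first_64(struct xdp_md *ctx)
    {
        __u8 *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        __u32 sum = 0;

        for (int i = 0; i < 64; i++) {               /* provably bounded trip count */
            if ((void *)(data + i + 1) > data_end)   /* provably bounded access */
                break;
            sum += data[i];
        }
        return (sum & 1) ? XDP_DROP : XDP_PASS;      /* arbitrary toy decision */
    }

    char LICENSE[] SEC("license") = "GPL";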
> Actual traffic for a metaverse is mostly bulk content download. Highly interactive traffic over UDP is maybe 1MB/second, including voice.
Typical bandwidth for multiplayer FPS games (Counter-Strike, Apex Legends) is around 512kbps-1mbit per second down per client, and this is old information; newer games almost certainly use more.
It's easy to see a higher-fidelity gaming experience taking 10mbit-100mbit of traffic from server to client: just increase the size and fidelity of the world. Next, increase player counts and you can easily fill 10gbit/sec for a future FPS/MMO hybrid.
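The back-of-envelope is straightforward, whichever axis you scale (player counts and per-client rates here are purely illustrative):

    1,000 players/instance *  10 Mbit/s each = 10 Gbit/s server egress
      100 players/instance * 100 Mbit/s each = 10 Gbit/s server egress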
There are only so many pixels on the screen, as Epic points out. The need for bandwidth is finite.
It will be interesting to see if Epic makes a streamed version of the Matrix Awakens demo. You can download that and build it. It's about 1TB after decompression. If they can make that work with their asset streaming system, that will settle the question of what you really need from the network.
You'll hit game engine CPU usage limits way before getting anywhere near Gbit/s in outbound game traffic on a game server, and the cost of network I/O is going to be negligible.
XDP is useful for applications that are network I/O bound. Gaming is not one of those.
I'd venture a guess that there are still no online games that use more than one mbps for interactive traffic, and given that fully remote game streaming only uses tens of mbps, I don't see any justification for complexities like asset streaming.
That's really not true for the type of game the OP is talking about. Think Second Life et al, where most of the content is dynamically streamed and rendered in real time
> Why? Because otherwise, the overhead of processing each packet in the kernel and passing it down to user space and back up to the kernel and out to the NIC limits the throughput you can achieve. We're talking 10gbps and above here.
_Throughput_ is not problematic at all for the Linux network stack, even at 100gbps. What is problematic is >10gbps line rate. In other words, unless you're receiving 10gbps of unshaped UDP datagrams with no payloads at line rate, the problem is nonexistent. Considering internet traffic is 99% fat TCP packets, this sentence is completely absurd.
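(For reference, the packet-rate arithmetic behind "line rate" on 10GbE, using standard Ethernet framing overheads of 20 bytes of preamble plus inter-frame gap per frame:)

    10 Gbit/s / ((64   + 20) * 8 bits) ~= 14.88 M packets/s  (minimum-size frames)
    10 Gbit/s / ((1518 + 20) * 8 bits) ~=  0.81 M packets/s  (full 1500-byte payloads)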
> With other kernel bypass technologies like DPDK you needed to install a second NIC to run your program or basically implement (or license) an entire TCP/IP network stack to make sure that everything works correctly under the hood
That is just wrong on so many levels.
First, DPDK allows reinjecting packets into the Linux network stack. That is called queue splitting, is done by the NIC, and can be trivially achieved using e.g. the bifurcated driver.
Second, there are plenty of performant network stacks available out there, especially considering high-end NICs implement 80% of the performance-sensitive parts of the stack on chip.
Last, kernel bypassing is done on _trusted private networks_; you would have to be crazy or damn well know what you're doing to bypass on publicly addressable networks, otherwise you will have a bad reality check. There are decades of security checks and countermeasures baked into the Linux network stack that a game would be irresponsible to ask its players to skip.
I'm not even mentioning the ridiculous latency gains to be achieved here. Wire-tapping a packet from "NIC in" to a userspace buffer should be in the ballpark of 3us. If you think you can do better and that this latency is too much for your application, you're either daydreaming or not working in the video game industry.
> _Throughput_ is not problematic at all for the Linux network stack, even at 100gbps. What is problematic is >10gbps line rate. In other words, unless you're receiving 10gbps unshaped UDP small datagrams at line rate, the problem is nonexistent. Considering internet traffic is 99% fat TCP packets, this sentence is completely absurd.
But games are not 99% fat TCP packets.
Games are typically networked with small UDP datagrams sent at high rates carrying the most recent state or inputs, with custom protocols built on top of UDP to avoid TCP head-of-line blocking. Packet send rates per client can often exceed 60Hz, especially when games tie client packet send rate to the display frequency, e.g. the Valve and Apex Legends network models.
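One common shape for such a protocol header, as a sketch (field sizes are hypothetical; the point is sequence numbers plus an ack bitfield instead of TCP's in-order delivery):

    #include <stdint.h>

    /* Hypothetical header for an unreliable, sequenced game protocol over UDP. */
    struct game_packet_header {
        uint16_t sequence;   /* sequence number of this packet                */
        uint16_t ack;        /* latest sequence number received from the peer */
        uint32_t ack_bits;   /* acks for the 32 packets preceding 'ack'       */
    };
    /* Payload: quantized inputs or a delta-compressed snapshot. A lost packet
     * is simply superseded by the next one, so nothing stalls behind it
     * (no head-of-line blocking). At 60 packets/s per client, 1,000 clients
     * means on the order of 60,000 small datagrams per second each way. */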
Now imagine you have many thousands of players and you can see that the problem does indeed exist. If not for current games, then for future games and metaverse applications, when we start to scale up player counts from the typical 16, 32 or 64 players per server instance and try to merge FPS techniques with the scale of MMOs, which is something I'm actively working on.
XDP/eBPF is a useful set of technologies for people who develop multiplayer games with custom UDP-based protocols. You'll see a lot more usage of it moving forward, as player counts increase for multiplayer games and other metaverse-like experiences.
Best wishes
- Glenn
A modern server-class machine can push "100Gbps" through the entire Linux stack just fine, TCP or UDP, with standard packet sizes (e.g. 1500 bytes). We do this where I work. Yes, a long time ago 1Gbps was hard and you needed jumbo frames, then 10Gbps was hard, then 100Gbps was hard. Where we are right now, you don't need kernel bypass unless you're running multiple 100Gbps NICs - that's the kind of application where I've seen DPDK used in the wild.
EDIT: Might be some benefits in terms of latency through the stack and resource usage on the machine... But I don't think 10Gbps is where the pain is for these either.
> First, DPDK allows reinjecting packets into the Linux network stack. That is called queue splitting, is done by the NIC, and can be trivially achieved using e.g. the bifurcated driver.
My apologies, the last time I looked at DPDK was in 2016, and I don't believe this was true back then. Either way, it seems that XDP/eBPF is much easier to use. YMMV!
> Packet send rates per client can often exceed 60Hz
You do realize the most random Python program, written by the most random programmer, running on the most random computer, could easily read 10k TCP packets per second per core?
I feel like I'm reasonably well read into low-level TCP/IP security (also: eBPF) and I'm not sure what you mean by "decades of security checks and countermeasures baked into the Linux network stack" that an XDP kernel bypass would skip. Can you say more?
> I'm reasonably well read into low-level TCP/IP security
> I'm not sure what you mean by "decades of security checks and countermeasures baked into the Linux network stack" that an XDP kernel bypass would skip. Can you say more?
If you bypass the kernel stack on a publicly addressable network, then it is your responsibility to implement and calibrate backlogging, handshake recycling, SYN cookies, etc.
Are you under the impression that he was talking about game clients doing this? That would be absurd, since no gamers (within epsilon of zero) use Linux, and you'd need a completely different bag of tricks on Windows.
> In other words, unless you're receiving 10gbps unshaped UDP datagrams with no payloads at line rate, the problem is nonexistent. Considering internet traffic is 99% fat TCP packets, this sentence is completely absurd.
Uh, "the internet" traffic shape is a terrible model for low-latency multiplayer games traffic shape. Surely you don't think that 99% of the packets being exchanged during a session of Counter-Strike 2 are fat TCP packets, right?
Surely not, but I don't expect video game datagram sizes and rates to be anywhere near line rate. To put things in perspective, we are talking ~1M datagrams per second on a 1gbps link, and ~10M per second on a 10gbps link.
Hell, I don't even think a commodity gaming computer has enough cores to process line-rate datagrams on a 1gbps link.
> Within 10 years, everybody is going to have 10gbps internet. How will this change how games are made? How are we going to use all this bandwidth? 5v5 shooters just aren't going to cut it anymore.
I have no doubt that more bandwidth will change how games are made, but 5v5 shooters (and pretty much all existing multiplayer styles) are here to stay for a lot longer than that, in some form or another.
Meanwhile in the rest of the world it is ALSO not true.
That opening blanket statement turned me off so hard I couldn't get past that paragraph to read the rest of the article.
I am struggling to justify 10Gb LAN in my house, between purchase costs and energy requirements (some 10Gb arrangements seem to be crazy inefficient). And I like "more, more, faster, faster" in my tech.
How does this scale (and cost) across even a significant section of human society?!? According to the article's prediction things must get crazy soon.
You don't need to create your own kernel modules to define custom packet processing logic. You can map an RX/TX ring into userspace memory by invoking setsockopt on your XDP socket with the XDP_RX_RING, XDP_TX_RING and XDP_UMEM_REG options. Your XDP BPF program can then choose to redirect an incoming packet to the RX ring (see BPF_MAP_TYPE_XSKMAP).
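Roughly, the AF_XDP socket setup looks like this; error handling and populating the fill/completion rings are omitted, and the ring sizes, interface name and queue id are placeholders:

    #include <linux/if_xdp.h>
    #include <net/if.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define FRAME_SIZE 2048
    #define NUM_FRAMES 4096

    int main(void)
    {
        int fd = socket(AF_XDP, SOCK_RAW, 0);

        /* Register the UMEM: the memory area packets are placed into. */
        void *umem = aligned_alloc(getpagesize(), NUM_FRAMES * FRAME_SIZE);
        struct xdp_umem_reg mr = {
            .addr = (uint64_t)(uintptr_t)umem,
            .len = NUM_FRAMES * FRAME_SIZE,
            .chunk_size = FRAME_SIZE,
            .headroom = 0,
        };
        setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));

        /* Size the rings; the kernel allocates them, we mmap them. */
        int ring_size = 2048;
        setsockopt(fd, SOL_XDP, XDP_RX_RING, &ring_size, sizeof(ring_size));
        setsockopt(fd, SOL_XDP, XDP_TX_RING, &ring_size, sizeof(ring_size));
        setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &ring_size, sizeof(ring_size));
        setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_size, sizeof(ring_size));

        /* Ask where each ring lives, then map the RX ring into userspace. */
        struct xdp_mmap_offsets off;
        socklen_t optlen = sizeof(off);
        getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
        void *rx_ring = mmap(NULL, off.rx.desc + ring_size * sizeof(struct xdp_desc),
                             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                             fd, XDP_PGOFF_RX_RING);

        /* Bind the socket to one RX queue of one interface. */
        struct sockaddr_xdp sxdp = {
            .sxdp_family = AF_XDP,
            .sxdp_ifindex = if_nametoindex("eth0"),  /* placeholder interface */
            .sxdp_queue_id = 0,
        };
        bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

        /* From here, an XDP program that bpf_redirect_map()s into a
         * BPF_MAP_TYPE_XSKMAP entry for queue 0 steers packets into rx_ring. */
        (void)rx_ring;
        return 0;
    }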
It looks like they have some demo code doing something like that. https://docs.nvidia.com/doca/archive/doca-v2.2.1/gpu-packet-...
What kind of workloads do you think would benefit from GPU processing?
God save us from the egress BW costs though :)
He's fully focused on the backend.
Uh, "the internet" traffic shape is a terrible model for low-latency multiplayer games traffic shape. Surely you don't think that 99% of the packets being exchanged during a session of Counter-Strike 2 are fat TCP packets, right?
Hell, I don't even think a commodity gaming computer have enough cores to process line rate datagrams on a 1gbps link.
I have no doubt that more bandwidth will change how games are made, but 5v5 shooters (and pretty much all existing multiplayer styles) are here to stay for a lot longer than that, in some form or another.
Deleted Comment
LOL, no way in America is that going to be true.
So these mini kernel programs are written in a subset of C?
Also event-based protocols with deterministic physics.
Last but not least, you need to use a language that can atomically share memory between threads: C (with arrays of 64-byte structs) or Java.
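For the C flavour, a hedged sketch of what that tends to look like: one entity per 64-byte cache line (so concurrent writers don't false-share) published to reader threads with a release store. The names and layout are just illustrative:

    #include <stdatomic.h>
    #include <stdalign.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* One entity per 64-byte cache line; the alignas also rounds the struct
     * size up to 64 bytes on typical ABIs. */
    struct entity_slot {
        alignas(64) _Atomic uint32_t ready;
        uint32_t entity_id;
        float    pos[3];
        float    vel[3];
    };

    static void publish(struct entity_slot *s, uint32_t id,
                        const float pos[3], const float vel[3])
    {
        s->entity_id = id;
        for (int i = 0; i < 3; i++) { s->pos[i] = pos[i]; s->vel[i] = vel[i]; }
        /* Release: everything above is visible to a reader that sees ready == 1. */
        atomic_store_explicit(&s->ready, 1, memory_order_release);
    }

    static bool try_read(struct entity_slot *s, float out_pos[3])
    {
        /* Acquire: pairs with the release store in publish(). */
        if (!atomic_load_explicit(&s->ready, memory_order_acquire))
            return false;
        for (int i = 0; i < 3; i++) out_pos[i] = s->pos[i];
        return true;
    }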