Almondsetat · 5 months ago
VC-2 is an intra-only, wavelet-based, ultra-low-latency codec developed by the BBC years ago for exactly this purpose. It is royalty free, and currently the only implementations are in ffmpeg and in the official BBC repository, both CPU-based. I am planning to make a CUDA-accelerated version for my master's thesis, since the Vulkan implementations made at GSoC last year still suck quite a bit. I would suggest people look into this codec.
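
To give a flavor of what it does under the hood: one of the wavelet filters VC-2 supports is the integer LeGall (5,3), which is just two cheap lifting passes over the samples. A minimal 1-D sketch (the real codec applies it separably in 2-D, with its own edge-extension and quantisation rules):

    // One forward level of the LeGall (5,3) integer lifting transform,
    // one of the wavelet filters VC-2 supports. In-place, interleaved
    // output: even indices = low-pass, odd indices = high-pass.
    #include <cstdio>
    #include <vector>

    void legall53_forward(std::vector<int>& x) {
        const size_t n = x.size();  // assume even length for brevity
        // Predict: odd samples become high-pass coefficients.
        for (size_t i = 1; i < n; i += 2) {
            int l = x[i - 1];
            int r = (i + 1 < n) ? x[i + 1] : x[i - 1];  // mirror at edge
            x[i] -= (l + r) >> 1;
        }
        // Update: even samples become low-pass coefficients.
        for (size_t i = 0; i < n; i += 2) {
            int l = (i > 0) ? x[i - 1] : x[i + 1];      // mirror at edge
            int r = (i + 1 < n) ? x[i + 1] : x[i - 1];
            x[i] += (l + r + 2) >> 2;
        }
    }

    int main() {
        std::vector<int> row = {10, 12, 14, 20, 30, 28, 26, 24};
        legall53_forward(row);
        for (int v : row) std::printf("%d ", v);  // low/high interleaved
        std::printf("\n");
    }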
_kb · 5 months ago
Definitely a neat codec! You can get COTS hardware en/decoders that use it via https://atlona.com/omnistream-av-over-ip/.
averne_ · 5 months ago
Do you mind going into some detail as to why they suck? Not a dig, just genuinely curious.
Almondsetat · 5 months ago
95% GPU usage but only 2x faster than the reference SIMD encoder/decoder.
actionfromafar · 5 months ago
What I wonder is: how do you get the video frames to be compressed from the video card into the encoder?

The only frame capture APIs I know take the image from the GPU to CPU RAM; then you can put it back into the GPU for encoding.

Are there APIs which can sidestep the "load to CPU RAM" part?

Or is it implied that a game streaming codec has to be implemented with custom GPU drivers?
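
One route that looks like it sidesteps the copy is GPU interop: CUDA can register a GL texture and map it for device-side reads, so an encoder kernel never touches system RAM. A rough sketch of the mapping step, assuming an OpenGL renderer and NVIDIA hardware (error handling omitted):

    // Sketch: register the GL render target with CUDA so encode
    // kernels can read it in place, with no copy to CPU RAM.
    #include <GL/gl.h>
    #include <cuda_gl_interop.h>

    cudaArray_t map_frame_for_encoder(GLuint tex,
                                      cudaGraphicsResource** res) {
        if (*res == nullptr) {
            // One-time registration of the finished-frame texture.
            cudaGraphicsGLRegisterImage(res, tex, GL_TEXTURE_2D,
                                        cudaGraphicsRegisterFlagsReadOnly);
        }
        cudaGraphicsMapResources(1, res, 0);  // per frame
        cudaArray_t frame = nullptr;
        cudaGraphicsSubResourceGetMappedArray(&frame, *res, 0, 0);
        return frame;  // bind as a texture object for the encode kernels
    }
    // ...cudaGraphicsUnmapResources(1, res, 0) once encoding is done.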

oplav · 5 months ago
In your experience, how does VC-2 compare to JPEG XS from a quality perspective? The JPEG XS resources I’ve seen say JPEG XS has higher visual quality, but curious what it’s like in practice.
Almondsetat · 5 months ago
JPEG-XS is an almost direct successor to VC-2. They use the same techniques, and if you read JPEG-XS's whitepaper, it explicitly cites VC-2 as an inspiration and a target to surpass. JPEG-XS is an improvement, there is no doubt about that, but unfortunately they decided to patent it for all uses. In both cases, the publicly available software implementations are very few and CPU-based; the ones that aren't are implemented in hardware inside business AV solutions.
sippeangelo · 5 months ago
I know next to nothing about video encoding, but I feel like there should be so much low-hanging fruit when it comes to videogame streaming if the encoder just cooperated with the game engine even slightly. Things like motion prediction would be free, for example, since most rendering engines already keep a dedicated motion-vector buffer for their own rendering. But there's probably some nasty patent hampering innovation there, so might as well forget it!
torginus · 5 months ago
'Motion vectors' in H.264 are a weird bit twiddling/image compression hack and have nothing to do with actual motion vectors.

- In a 3d game, a motion vector is the difference between the position of an object in 3d space from the previous to the current frame

- In H.264, the 'motion vector' basically says: copy this rectangular chunk of pixels from some point in some arbitrary previous frame, then encode the difference between the reference pixels and the copy with JPEG-like techniques (DCT et al.)

This block copying is why H.264 video devolves into a mess of squares once the bandwidth craps out.
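
In toy form, the entire mechanism is this (assuming the MV stays inside the frame; real encoders clamp and pad at the borders):

    // Toy block-copy motion compensation: predict a block by copying
    // pixels from the previous frame at a motion-vector offset, then
    // keep only the residual for DCT-style coding.
    #include <cstdint>

    void predict_block(const uint8_t* prev, const uint8_t* cur,
                       int16_t* residual, int stride,
                       int bx, int by, int mvx, int mvy, int bs) {
        for (int y = 0; y < bs; ++y) {
            for (int x = 0; x < bs; ++x) {
                // "Copy this chunk from over there in the previous frame."
                uint8_t pred = prev[(by + y + mvy) * stride + (bx + x + mvx)];
                // The residual is what gets transformed and quantised.
                // When the bitrate craps out, it's the residual that gets
                // thrown away, leaving the bare copied blocks you see as
                // macroblocking.
                residual[y * bs + x] =
                    (int16_t)cur[(by + y) * stride + (bx + x)] - pred;
            }
        }
    }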

pornel · 5 months ago
Motion vectors in video codecs are an equivalent of a 2D projection of 3D motion vectors.

In typical video encoding, motion compensation of course isn't derived from real 3D motion vectors; it's merely a heuristic based on optical flow and a bag of tricks. But in principle, the actual game's motion vectors could be used to guide the video's motion compensation. This is especially true when we're talking about a custom codec, and not reusing the H.264 bitstream format.

Referencing previous frames doesn't add latency, and limiting motion to just displacement of the previous frame would be computationally relatively simple. You'd need some keyframes or gradual refresh to avoid "datamoshing" look persisting on packet loss.

However, the challenge is in encoding the motion precisely enough to make it useful. If it's not aligned with sub-pixel precision it may make textures blurrier and make movement look wobbly almost like PS1 games. It's hard to fix that by encoding the diff, because the diff ends up having high frequencies that don't survive compression. Motion compensation also should be encoded with sharp boundaries between objects, as otherwise it causes shimmering around edges.
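
A sketch of what feeding the game's real motion in could look like: project the surface point's previous and current world positions with the matching view-projection matrices, and keep the result in floats so the sub-pixel precision survives (minimal hand-rolled math types; any vector library works):

    // Derive a screen-space motion vector from the game's own data.
    // Engines already compute exactly this per pixel for TAA/motion blur.
    struct Vec4 { float x, y, z, w; };
    struct Mat4 { float m[4][4]; };  // row-major

    static Vec4 mul(const Mat4& M, const Vec4& v) {
        return { M.m[0][0]*v.x + M.m[0][1]*v.y + M.m[0][2]*v.z + M.m[0][3]*v.w,
                 M.m[1][0]*v.x + M.m[1][1]*v.y + M.m[1][2]*v.z + M.m[1][3]*v.w,
                 M.m[2][0]*v.x + M.m[2][1]*v.y + M.m[2][2]*v.z + M.m[2][3]*v.w,
                 M.m[3][0]*v.x + M.m[3][1]*v.y + M.m[3][2]*v.z + M.m[3][3]*v.w };
    }

    struct Vec2 { float x, y; };

    // Where did this surface point move, in pixels? Kept as float:
    // sub-pixel precision matters. (Ignores the y-flip between NDC
    // and pixel conventions for brevity.)
    Vec2 screen_motion_vector(const Mat4& prev_vp, const Mat4& cur_vp,
                              const Vec4& prev_world, const Vec4& cur_world,
                              float width, float height) {
        Vec4 a = mul(prev_vp, prev_world), b = mul(cur_vp, cur_world);
        float ax = (a.x / a.w * 0.5f + 0.5f) * width;
        float ay = (a.y / a.w * 0.5f + 0.5f) * height;
        float bx = (b.x / b.w * 0.5f + 0.5f) * width;
        float by = (b.y / b.w * 0.5f + 0.5f) * height;
        return { bx - ax, by - ay };
    }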

robterrell · 5 months ago
Isn't the use of the H.264 motion vector to save bits when there is a camera pan? A pan is a case where every pixel in the frame will change, but maybe doesn't have to be re-encoded.
ChadNauseam · 5 months ago
I think you're right. Suppose the connection to the game streaming service adds two frames of latency, and the player is playing an FPS. One thing game engines could do is provide the game UI and the "3D world view" as separate framebuffers. Then, when the player moves the mouse on the client, the software could instantly translate the 3D world view for the next two frames that come from the server but predate the mouse movement.

VR games already do something like this, so that when a game runs at below the maximum FPS of the VR headset, it can still respond to your head movements. It's not perfect because there's no parallax and it can't show anything for the region that was previously outside of your field of view, but it still makes a huge difference. (Of course, it's more important for VR because without doing this, any lag spike in a game would instantly induce motion sickness in the player. And if they wanted to, parallax could be faked using a depth map)
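
In sketch form, the rotation-only part is tiny. Assuming a small-angle approximation and a whole-image shift (real implementations warp per pixel):

    // Toy late-reprojection: convert the not-yet-acknowledged mouse
    // delta into a whole-image pixel shift for the world-view layer.
    // No parallax; edges need in-painting, as noted above.
    struct Shift { int dx, dy; };

    Shift reproject_shift(float mouse_dx, float mouse_dy,
                          float sensitivity,         // radians per count
                          float fov_x, float fov_y,  // radians
                          int width, int height) {
        float yaw   = mouse_dx * sensitivity;
        float pitch = mouse_dy * sensitivity;
        // Small angles: t radians of turn pans roughly t/fov of the
        // screen in the opposite direction.
        return { (int)(-yaw   / fov_x * width),
                 (int)(-pitch / fov_y * height) };
    }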

rowanG077 · 5 months ago
You can do parallax if you use the depth buffer.
WantonQuantum · 5 months ago
A simple thing to start with would be akin to Sensor Assisted Video Encoding where phone accelerometers and digital compasses are used to give hints to video encoding: https://ieeexplore.ieee.org/document/5711656

Also, for 2d games: a simple sideways-scrolling game could give very accurate motion vectors for the background and for large, linearly moving foreground objects.

I'm surprised at the number of people disagreeing with your idea here. I think HN has a lot of "if I can't see how it can be done then it can't be done" people.

Edit: Also, any 2d graphical overlays like HUDs, maps, scores, subtitles, menus, etc. could be sent as compressed 2d data, which could enable better compression for that data - for example, much sharper pixel-perfect encoding for simple shapes.

derf_ · 5 months ago
> I think HN has a lot of "if I can't see how it can be done then it can't be done" people.

No, HN has, "This has been thought of a thousand times before and it's not actually that good of an idea," people.

The motion search in a video encoder is highly optimized. Take your side-scroller as an example. If several of your neighboring blocks have the same MV, that is the first candidate your search is going to check, and if the match is good, you will not check any others. The check itself has specialized CPU instructions to accelerate it. If the bulk of the screen really has the same motion, the entire search will take a tiny fraction of the encoding time, even in a low-latency, real-time scenario. Even if you reduce that to zero, you will barely notice.
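
A scalar toy version of that first check (the real thing uses SIMD, e.g. SSE's psadbw, so a 16x16 SAD is a handful of instructions):

    #include <cstdint>
    #include <cstdlib>

    // Sum of absolute differences for one 16x16 block.
    int sad16x16(const uint8_t* a, const uint8_t* b, int stride) {
        int sad = 0;
        for (int y = 0; y < 16; ++y)
            for (int x = 0; x < 16; ++x)
                sad += std::abs(a[y * stride + x] - b[y * stride + x]);
        return sad;
    }

    // Try the neighbours' MV first; if it's good enough, stop searching.
    bool neighbour_mv_good(const uint8_t* cur, const uint8_t* ref,
                           int stride, int bx, int by,
                           int mvx, int mvy, int early_exit_sad) {
        const uint8_t* c = cur + by * stride + bx;
        const uint8_t* r = ref + (by + mvy) * stride + (bx + mvx);
        return sad16x16(c, r, stride) < early_exit_sad;
    }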

On the other end of the spectrum, consider a modern 3D engine. There will be many things not describable by block-based motion of the underlying geometry: shadows, occlusions, reflections, clouds or transparency, shader effects, water or atmospheric effects, etc. Even if you could track the "real" motion through all of that, the best MV to use for compression does not need to match the real motion (the real motion might be very expensive to code, while something "close enough" could be much cheaper, to give just one reason), and it might come from any number of frames (not necessarily the most recent), etc. So you still need to do a search, and it's not obvious the real motion is much better as a starting point than the heuristics an encoder already uses, where they even differ.

All of that said, some encoder APIs do allow providing motion hints [0], you will find research papers and theses on the topic, and of course, patents. That the technique is not more widespread is not because no one ever tried to make it work.

[0] https://docs.nvidia.com/video-technologies/video-codec-sdk/1... as the first random result of a simple search.

mikepurvis · 5 months ago
I’ve wondered about this as well; most clients should be capable of still doing a bit of compositing. For example, you could send billboard renders of background objects at lower fidelity/frequency than foreground characters, update HUD objects with priority and with codecs that prioritize clarity, etc.

It was always shocking to me that Stadia was literally making their own games in house, and somehow the end result was still just a streamed video, with the latency gains supposed to come from edge-deployed GPUs and a wifi-connected controller.

Then again, maybe they tried some of this stuff and the gains weren't worth it relative to battle-tested video codecs.

toast0 · 5 months ago
For 2d sprite games, OMG yes, you could provide some very accurate motion vectors to the encoder. For 3d rendered games, I'm not so sure. The rendering engine has (or could have) motion vectors for the 3d objects, but you'd have to translate them into the 2d world the encoder works in; I don't know if it's reasonable to do that ... or if it would help the encoder enough to justify it.
sudosysgen · 5 months ago
Schemes like DLSS already do provide 2D motion vectors, it's not necessarily a crazy ask.
markisus · 5 months ago
The ultimate compression is to send just the user inputs and reconstitute the game state on the other end.
w-ll · 5 months ago
The issue is the "reconstitute the game state on the other end" part, at least when it comes to how I travel.

I haven't in a while, but I used to use https://parsec.app/ on a cheap Intel Air to do my STO dailies on vacation. It sends inputs, but gets back a compressed stream. I'm curious if there's any open-source take on something similar.

Zardoz84 · 5 months ago
Good old DooM demo recordings are essentially this.
cma · 5 months ago
> Things like motion prediction would be free since most rendering engines already have a dedicated buffer just for that for its own rendering, for example.

Doesn't work for translucency and shader animation. The latter can be made to work if the shader can also calculate motion vectors.

WithinReason · 5 months ago
Instead of motion vectors you probably want to send RGBD (+depth) so the client can compute its own motion vectors based on input, depth, and camera parameters. You get instant response to user input this way, but you need to in-paint disocclusions somehow.
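
Roughly, per pixel (minimal hand-rolled math types; `depth_ndc` being the depth-buffer value remapped to NDC is an assumption about the engine's conventions):

    // Sketch of RGBD reprojection: pixel + depth -> world point under
    // the old camera -> pixel under the new camera. Where no old pixel
    // lands, you have a disocclusion hole to in-paint.
    struct Vec4 { float x, y, z, w; };
    struct Mat4 { float m[4][4]; };  // row-major

    static Vec4 mul(const Mat4& M, const Vec4& v) {
        return { M.m[0][0]*v.x + M.m[0][1]*v.y + M.m[0][2]*v.z + M.m[0][3]*v.w,
                 M.m[1][0]*v.x + M.m[1][1]*v.y + M.m[1][2]*v.z + M.m[1][3]*v.w,
                 M.m[2][0]*v.x + M.m[2][1]*v.y + M.m[2][2]*v.z + M.m[2][3]*v.w,
                 M.m[3][0]*v.x + M.m[3][1]*v.y + M.m[3][2]*v.z + M.m[3][3]*v.w };
    }

    struct Vec2 { float x, y; };

    Vec2 reproject_pixel(float px, float py, float depth_ndc,
                         const Mat4& inv_prev_vp,  // inverse old view-proj
                         const Mat4& cur_vp,       // new view-proj
                         float width, float height) {
        // Reconstruct the world position seen through this pixel.
        Vec4 ndc = { 2*px/width - 1, 2*py/height - 1, depth_ndc, 1 };
        Vec4 h = mul(inv_prev_vp, ndc);
        Vec4 world = { h.x/h.w, h.y/h.w, h.z/h.w, 1 };
        // Project it with the client's *current* camera instead.
        Vec4 c = mul(cur_vp, world);
        return { (c.x/c.w * 0.5f + 0.5f) * width,
                 (c.y/c.w * 0.5f + 0.5f) * height };
    }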
dmos62 · 5 months ago
Could you say more? My first thought is that CPUs and GPUs have much higher bandwidths and lower latencies than ethernet, so just piping some of that workload to a streaming client wouldn't be feasible. Am I wrong?

IshKebab · 5 months ago
I don't think games normally have a motion vector buffer. I guess they could render one relatively easily, but that's a bit of a chicken-and-egg problem.
garaetjjte · 5 months ago
They do; one reason is post-processing effects like motion blur, another is antialiasing like TAA or DLSS upscaling.
ACCount36 · 5 months ago
Exposing motion vectors is a prerequisite for a lot of AI frame-gen tech. What if you could tap into that?
tomaskafka · 5 months ago
Also, all major GPUs now have machine-learning-based next-frame prediction; it's hard to imagine this wouldn't be useful.
keyringlight · 5 months ago
Plus there's the question of whether further benefits are available to the FSR/DLSS/XeSS-type upscalers from knowing more about the scene. I'm reminded a bit of variable rate shading: if the renderer analyses the scene for where detail levels will reward spending performance, it can assign blocks (e.g. 1x2, 4x2 pixels, etc.) to be shaded once instead of per-pixel, to concentrate effort there. It's not exactly the same thing as the upscalers, but it seems a better foundation for a better output image compared to bluntly dropping the whole rendered resolution by a percentage. However, that's assuming traditional rendering before any ML gets involved, which I think has proven its case in the past 7 years.

I think the other side to this is the difference between deeper integration of the engine with the scaler/frame generation, which would seem to involve a lot of low-level tuning (probably per-title), and having a generic solution that uplifts as many titles as possible, even if some "perfect is the enemy of good" is left on the table.

d--b · 5 months ago
The point of streaming games though is to offload the hard computation to the server.

I mean, you could also ship the textures ahead of time so that the compressor could check whether something looks like a distorted texture. You could send the geometry of what's being rendered, which would give a lot of info to the decompressor. You could send the HUD separately. And so on.

But here you want something that's high level and works with any game engine, any hardware. The main issue being latency rather than bandwidth, you really don't want to add calculation cycles.

_kb · 5 months ago
This is a really nice walkthrough of matching trade-offs to acceptable distortions for a known signal type. Even if you’re selecting rather than designing a codec, it’s a great process to follow.

For those interested in the ultra-low-latency space (where you’re willing to trade a bit of bandwidth to gain quality and minimise latency), VSF have a pretty good wrap-up of other common options and what they each optimise for: https://static.vsf.tv/download/technical_recommendations/VSF...

keketi · 5 months ago
Have an LLM transcribe what is happening in the game into a few sentences per frame, transfer the text over the network, and have another LLM reconstruct the frame from the text. It won't be fast, and it's going to be lossy, but the compression ratio is insane and it's got all the right buzzwords.
jameshart · 5 months ago
Frame 1:

You are standing in an open field west of a white house, with a boarded front door. There is a small mailbox here.

Eduard · 5 months ago
(user input: mouse delta: (-20, -8))

Frame 2:

A few blades of grass sway gently in the breeze. The camera begins to drift slightly, as if under player control — a faint ambient sound begins: wind and birds.

taneq · 5 months ago
Ah, this explains why there are clowns under the bed and creepy children staring at me from the forest.
Y_Y · 5 months ago
kill jester
cyclotron3k · 5 months ago
Send the descriptions via the blockchain so there's an immutable record
poglet · 5 months ago
Maybe one day we'll even reach the point where the game can run locally on the end user's machine.
foota · 5 months ago
You've got my attention
raphman · 5 months ago
Very cool - that's nearly exactly what I need for a research project.

FWIW, there's also the non-free JPEG-XS standard [1] which also claims very low latency [2] and might be a safer choice for commercial projects, given that there is a patent pool around it.

[1] https://www.jpegxs.com/

[2] https://ds.jpeg.org/whitepapers/jpeg-xs-whitepaper.pdf

jamesfmilne · 5 months ago
JPEG-XS is great for low latency, but it uses more bandwidth. We're using it for low-latency image streaming for film/TV post production:

https://www.filmlight.ltd.uk/store/press_releases/filmlight-...

We currently use the IntoPIX CUDA encoder/decoder implementation, and SRT for the low-level transport.

You can definitely achieve end-to-end latencies <16ms over decent networks.

We have customers deploying their machines in data centres and using them in their post-production facilities in the centre of town, usually over a 10GbE link. But I've had others using 1GbE links between countries, running at higher compression ratios.

indolering · 5 months ago
A patent pool doesn't make you safer: it's just a patent troll charging you to cross the bridge. They are not offering insurance against more patent trolls blackmailing you after you cross the bridge.
raphman · 5 months ago
While I am personally opposed to software patents, I'd argue that the JPEG XS patent holders [1] are not 'patent trolls' in any meaningful sense of the word.

While I have no personal experience on that topic, I'd assume that a codec with a patent pool is a safer bet for a commercial project. Key aspects being protected by patents makes it less likely that some random patent troll or competitor extorts you with some nonsense patent. Also, using JPEG XS instead of, say, pyrowave ensures that you won't be extorted by the JPEG XS patent holders.

One may call this a protection racket - but under the current system, it may make economic sense to pay for a license instead of risking expensive lawsuits.

[1] https://www.jpegxspool.com/

Thaxll · 5 months ago
The creator of VLC is working on something similar, very cutting edge.

https://streaminglearningcenter.com/codecs/an-interview-with...

Ultra low latency for streaming.

https://www.youtube.com/watch?v=0RvosCplkCc

torginus · 5 months ago
Having worked in the space, I'd have to say hardware encoders and H.264 are pretty dang good - NVENC works with very little latency (if you tell it to, and disable the features that increase it, such as multiple-frame prediction and B-frames).

The two things that increase latency are more advanced processing algorithms, which give the encoder more work to do, and schemes that require waiting for multiple frames. If you disable those, the encoder can pretty much start working on your frame the nanosecond the GPU stops rendering to it, and have it encoded in <10ms.
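
For reference, the relevant knobs look roughly like this, using names from NVIDIA's Video Codec SDK headers (treat the exact fields as approximate and check your SDK version):

    // Low-latency NVENC session setup, sketched with names from the
    // NVIDIA Video Codec SDK.
    NV_ENC_INITIALIZE_PARAMS init = { NV_ENC_INITIALIZE_PARAMS_VER };
    init.encodeGUID = NV_ENC_CODEC_H264_GUID;
    init.tuningInfo = NV_ENC_TUNING_INFO_ULTRA_LOW_LATENCY;
    init.enableEncodeAsync = 0;           // synchronous: no output queueing

    NV_ENC_CONFIG cfg = { NV_ENC_CONFIG_VER };
    cfg.frameIntervalP = 1;               // IPPP...: no B-frames to wait for
    cfg.gopLength = NVENC_INFINITE_GOPLENGTH;  // P-only after the first IDR
    cfg.rcParams.lookaheadDepth = 0;      // no multi-frame lookahead
    init.encodeConfig = &cfg;
    // ...then NvEncInitializeEncoder(session, &init);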

Wowfunhappy · 5 months ago
> have it encoded in <10ms.

For context, OP achieved 0.13 ms with his codec.

dishsoap · 5 months ago
10ms is quite long in this context.
RobRivera · 5 months ago
>10 ms

Do not shame this dojo.

latchkey · 5 months ago
Sadly appears to be unavailable.
Cadwhisker · 5 months ago
This CODEC uses the same base algorithm as HTJ2K (High-Throughput JPEG 2000).

If the author is reading this, it would be very interesting to read about the differences between this method and HTJ2K.

Fidelix · 5 months ago
Unbelievable... Good job mate.

Can't wait until one day this gets into Moonlight or something like it.

cpeth · 5 months ago
Exactly what I was thinking. Wish I had the time and expertise to take a crack at adding support for this codec myself. Streaming Clair Obscur over my LAN via Sunshine / Moonlight is exactly my use case, and the latency could definitely be better.