Perceptually lossless (talking head) video compression at 22kbit/s

This is very impressive, but “perceptually lossless” isn’t a thing and doesn’t make sense. It means “lossy”.

Bjartr · a year ago

It may sound like marketing wank, but it does a appear to be an established term of art in academia as far back as 1997 [1]

It just means that a person can't readily distinguish between the compressed image and the uncompressed image. Usually because it takes some aspect(s) of the human visual system into account.

[1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C22&q=per...

tatersolid · a year ago

I read “perceptually lossless” to be equivalent to “transparent”, a more common phrase used in the audio/video codec world. It’s the bitrate/quality at which some large fraction of human viewers can’t distinguish a losslessly-encoded sample and the lossy-encoded sample, for some large fraction of content (constants vary in research papers).

As an example, crf=18 in libx264 is considered “perceptually lossless” for most video content.

Ladsko · a year ago

Can you propose a better term for the concept then? Perceiving something as lossless is a real world metric that has a proper use case. "Perceptually lossless" does not try to imply that it is not lossy.

ComplexSystems · a year ago

The term for this is "transparency." A codec is "transparent" if people can't tell the difference between the original and the compressed version.

high_byte · a year ago

why not? if you change one pixel by one pixel brightness unit it is perceptually the same.

for the record, I found liveportrait to be well within the uncanny valley. it looks great for ai generated avatars, but the difference is very perceptually noticeable on familiar faces. still it's great.

LegionMammal978 · a year ago

For one, it doesn't obey the transitive property like a truly lossless process should: unless it settles into a fixed point, a perceptually lossless copy of a copy of a copy, etc., will eventually become perceptually different. E.g., screenshot-of-screenshot chains, each of which visually resembles the previous one, but which altogether make the original content unreadable.

codeflo · a year ago

GP is correct, that’s the definition of “lossy”. We don’t need to invent ever new marketing buzzwords for well-established technical concepts.

Brian_K_White · a year ago

It means what it already says for itself, and does not need correcting into incorrectness.

"no perceived loss" is a perfectly internally consistent and sensible concept and is actually orthogonal to whether it's actually lossless or lossy.

For instance an actually lossless block of data could be perceptually lossy if displayed the wrong way.

In fact, even actual lossless data is always actually lossy, and only ever "perceptually lossless", and there is no such thing as actually lossless, because anything digital is always only a lossy approximation of anything analog. There is loss both at the ADC and at the DAC stage.

If you want to criticize a term for being nonsense misleading dishonest bullshit, then I guess "lossless" is that term, since it never existed and never can exist.

unshavedyak · a year ago

Similar to your points, i also expect `perceptually lossless` to be a valid term in the future with respect to AI. Ie i can imagine a compression which destroys detail, but on the opposite end it uses "AI" to reconstruct detail. Of course though, the AI is hallucinating the detail, so objectively it is lossy but perceptibly it is lossless because you cannot know which detail is incorrect if the ML is doing a good job.

In that scenario it certainly would not be `transparent` ie visually without any lossy artifacts. But your perception of it would look lossless.

The future is going to be weird.

rowanG077 · a year ago

Why don't you think it's a thing? A trivial example is audio. A ton of audio speakers can produce frequencies people cannot hear. If you have an unprocessed audio recording from a high end microphone one of the first compressions things you can do is clip of imperceptible frequencies. A form of compression.

ranger_danger · a year ago

As there are several patents, published studies, IEEE papers and thousands of google results for the term, I think it's safe to say that many people do not agree with your interpretation of the term.

"As a rule, strong feelings about issues do not emerge from deep understanding." -Sloman and Fernbach

lifthrasiir · a year ago

It is definitely a thing given a good perceptual metric. The metric even doesn't have to be very accurate if the distortion is highly bounded, like only altering the lowermost bit. It is unfortunate that most commonly used distortion metrics like PSNR are not really that, though.

rini17 · a year ago

But that's mathematically impossible, to restore signal from extremely low bitrate stream with any highly bounded distortion. Perhaps only if you have highly restricted set of posible input, which online meetings aren't.

_ZeD_ · a year ago

also are .mp3, yet they are hardly discernible from the originals

bityard · a year ago

Ability to tell MP3 from the original source was always dependent on encoder quality, bitrate, and the source material. In the mid 2000's, I tried to encode all of my music as MP3. Most of it sounded just fine because pop/rock/alt/etc are busy and "noisy" by design. But some songs (particularly with few instruments, high dynamic range, and female vocals) were just awful no matter how high I cranked the bitrate. And I'm not even an "audiophile," whatever that means these days.

No doubt encoders and the codecs themselves have improved vastly since then. It would be interesting to see if I could tell the difference in a double-blind test today.

Dwedit · a year ago

Lossy audio formats suddenly become very discernible once you subtract the left channel from the right channel. Try that with Lossless audio vs MP3, Vorbis, Opus, AAC, etc. You're listening to only the errors at that point.

rini17 · a year ago

not at 22kbit :)

rob74 · a year ago

Yeah, all lossy compression could be called "perceptually lossless" if the perception is bad enough...

bux93 · a year ago

A family member of mine didn't see the point of 1080p. Turned out they needed cataract surgery and got fancy replacement lenses in their eyes. After that, they saw the point.

Dylan16807 · a year ago

Needing to define "perception" is a much weaker criticism than "isn't a thing and doesn't make sense".

It's easy enough to specify an average person looking very closely, or a 99th percentile person, or something like that, and show the statistics backing it up.

k__ · a year ago

Is this the real-time discussion all over again?

> But one overlooked use case of the technology is (talking head) video compression.

> On a spectrum of model architectures, it achieves higher compression efficiency at the cost of model complexity. Indeed, the full LivePortrait model has 130m parameters compared to DCVC’s 20 million. While that’s tiny compared to LLMs, it currently requires an Nvidia RTX 4090 to run it in real time (in addition to parameters, a large culprit is using expensive warping operations). That means deploying to edge runtimes such as Apple Neural Engine is still quite a ways ahead.

It’s very cool that this is possible, but the compression use case is indeed .. a bit far fetched. A insanely large model requiring the most expensive consumer GPU to run on both ends and at the same time being limited in bandwidth so much (22kbps) is a _very_ limited scenario.

gambiting · a year ago

One cool use would be communication in space - where it's feasible that both sides would have access to high-end compute units but have a very limited bandwidth between each other.

bliteben · a year ago

Wonder if its better than a single color channel hologram though

bityard · a year ago

Bandwidth is not the limitation in space comms, latency is.

JamesLeonis · a year ago

Increasingly mobile networks are like this. There are all kinds of bandwidth issues, especially when customers are subject to metered pricing for data.

loa_in_ · a year ago

Staying in contact with someone for hours on metered mobile internet connection comes to mind. Low bandwidth translates to low total data volume over time. If I could be video chatting on one of those free internet SIM cards that's a breakthrough.

omh · a year ago

One use case might be if you have limited bandwidth, perhaps only a voice call, and want to join a video conference. I could imagine dialling in to a conference with a virtual face as an improvement over no video at all.

Dead Comment

jl6 · a year ago

130m parameters isn’t insanely large, even for smartphone memory. The high GPU usage is a barrier at the moment, but I wouldn’t put it past Apple to have 4090-level GPU performance in an iPhone before 2030.

loudmax · a year ago

The trade-off may not be worth it today, but the processing power we can expect in the coming years will make this accessible to ordinary consumers. When your laptop or phone or AR headset has the processing power to run these models, it will make more efficient use of limited bandwidth, even if more bandwidth is available. I don't think available bandwidth will scale at the same rate as processing power, but even if it does, the picture be that much more realistic.

gwd · a year ago

This reminds me of a scene in "A Fire Upon the Deep" (1992) where they're on a video call with someone on another spaceship; but something seems a bit "off". Then someone notices that the actual bitrate they're getting from the other vessel is tiny -- far lower than they should be getting given the conditions -- and so most of what they're seeing on their own screens isn't actual video feed, but their local computer's reconstruction.

Rebelgecko · a year ago

Was that the same book that had the concept of (paraphrasing using modern terminology) doing interstellar communications by sending back and forth LLMs trained on the people who wanted to talk, prompted to try and get a good business deal or whatever?

alex-robbins · a year ago

That happened in Redemption Ark by Alastair Reynolds (2002), though of course the idea may also have been used before or since.

DoneWithAllThat · a year ago

This idea was also used in The Algebraist.

miohtama · a year ago

And also it was a deep fake.

BTW This is the best sci-fi book ever.

Retric · a year ago

Might be better if you like space opera style really soft science fiction. I really didn’t enjoy it.

jf · a year ago

I beg to differ. A Deepness In The Sky is the best sci-fi book ever.

_kb · a year ago

At least for audio, that dystopia is already shipping in end-user product: https://blog.webex.com/collaboration/hybrid-work/next-level-...

janandonly · a year ago

I came here to reply just this exactly and found a fellow geek beat me to it. Indeed a brilliant book.

zbobet2012 · a year ago

These sorts of models pop here quite a bit, and they ignore fundamental facts of video codecs (video specific lossy compression technologies).

Traditional codecs have always focused on trade offs among encode complexity, decode complexity, and latency. Where complexity = compute. If every target device ran a 4090 at full power, we could go far below 22kbps with a traditional codec techniques for content like this. 22kbps isn't particularly impressive given these compute constraints.

This is my field, and trust me we (MPEG committees, AOM) look at "AI" based models, including GANs constantly. They don't yet look promising compared to traditional methods.

Oh and benchmarking against a video compression standard that's over twenty years old isn't doing a lot either for the plausibility of these methods.

skandium · a year ago

This is my field as well, although I come from the neural network angle.

Learned video codecs definitely do look promising: Microsoft's DCVC-FM (https://github.com/microsoft/DCVC) beats H.267 in BD-rate. Another benefit of the learned approach is being able to run on soon commodity NPUs, without special hardware accommodation requirements.

In the CLIC challenge, hybrid codecs (traditional + learned components) are so far the best, so that has been a letdown for pure end to end learned codecs, agree. But something like H.267 is currently not cheap to run either.

Winning in bd rate though isn't hard. You need to win in bd rate and have a hardware implementable, power efficient, cheap decoder.

Agreed hybrid presents real opportunity.

AzzyHN · a year ago

Did you mean H.266? Or is there some secret H.267 that hasn't been agreed upon yet

smokel · a year ago

Why so sour? This particular article doesn't seem to ignore a lot, it even references the Nvidia work that inspired it, as well as a recent benchmark.

Someone was just having fun here, it's not as if they present it as a general codec.

Really? Would there be a way to replicate this with currently available encoders? I'd like to try it

LeoPanthera · a year ago

Vecr · a year ago

Fire Upon the Deep had more or less this. Story important, so I won't say more. That series in general had absolutely brutal bandwidth limitations.

MayeulC · a year ago

I like how the saddle in the background moves with the reconstructed head; it probably works better with uncluttered backgrounds.

This is interesting tech, and the considerations in the introduction are particularly noteworthy. I never considered the possibility of animating 2D avatars with no 3D pipeline at all.

antiquark · a year ago

Not quite lossless... look at the bicycle seat behind him. When he tilts his head, the seat moves with his hair.

manmal · a year ago

His gaze also doesn’t quite match.

hinkley · a year ago

Why is nobody noticing the eyes?? This is important!

I feel like I’m taking crazy pills.

metaphor · a year ago

Very noticeable jitter in bicycle front tire too.

red0point · a year ago

The second example shown is not perceptually lossless, unless you’re so far on the spectrum you won’t make eye contact even with a picture of a person. The reconstructed head doesn’t look in the same direction as the original.

However is does raise an interesting property in that if you are on the spectrum or have ADHD, you only need one headshot of yourself staring directly at the camera and then the capture software can stop you from looking at your taskbar or off into space.

DCH3416 · a year ago

> unless you’re so far on the spectrum you won’t make eye contact even with a picture of a person.

I don't know. I think you'd be surprised.

That's already kind of an issue with vloggers. Often they're looking just left or right of the camera at a monitor or something.