Smells like rationalization to me.
Traditional codecs have always focused on trade-offs among encode complexity, decode complexity, and latency, where complexity means compute. If every target device ran a 4090 at full power, we could go far below 22kbps with traditional codec techniques for content like this. 22kbps isn't particularly impressive given that level of compute.
This is my field, and trust me we (MPEG committees, AOM) look at "AI" based models, including GANs constantly. They don't yet look promising compared to traditional methods.
Oh, and benchmarking against a video compression standard that's over twenty years old doesn't do much for the plausibility of these methods either.
Learned video codecs definitely do look promising: Microsoft's DCVC-FM (https://github.com/microsoft/DCVC) beats H.267 in BD-rate. Another benefit of the learned approach is being able to run on soon-to-be-commodity NPUs, with no special hardware accommodation required.
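For anyone not familiar with BD-rate: it's the average bitrate difference between two codecs over a shared quality range, computed from their rate-distortion curves (the standard Bjøntegaard calculation). A minimal sketch in Python, with made-up RD points purely for illustration:

    import numpy as np

    def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
        # Average bitrate change (%) of the test codec vs. the reference
        # over the overlapping PSNR range. Negative = test needs fewer bits.
        lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)

        # Fit cubic polynomials: log-rate as a function of PSNR.
        p_ref = np.polyfit(psnrs_ref, lr_ref, 3)
        p_test = np.polyfit(psnrs_test, lr_test, 3)

        # Integrate both fits over the PSNR interval they both cover.
        lo = max(min(psnrs_ref), min(psnrs_test))
        hi = min(max(psnrs_ref), max(psnrs_test))
        int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
        int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)

        # Average log-rate difference -> percentage bitrate change.
        avg_diff = (int_test - int_ref) / (hi - lo)
        return (10 ** avg_diff - 1) * 100

    # Hypothetical (kbps, dB) points, not real codec measurements:
    print(bd_rate([100, 200, 400, 800], [32.0, 35.1, 37.8, 40.2],
                  [ 90, 180, 350, 700], [32.2, 35.3, 38.0, 40.4]))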
In the CLIC challenge, hybrid codecs (traditional + learned components) have been the best so far, so that has been a letdown for pure end-to-end learned codecs, agreed. But something like H.267 isn't cheap to run either.
This is intuitive; as the competition organisers put it, compression is prediction.
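"Compression is prediction" can be made concrete: under arithmetic coding, a symbol the model predicts with probability p costs about -log2(p) bits, so a sharper predictor (learned or hand-designed) directly shrinks the bitstream. A toy sketch with illustrative, made-up probabilities:

    import math

    def ideal_code_length(probs_of_actual_symbols):
        # Total bits an ideal arithmetic coder would spend, given the
        # probability the model assigned to each symbol that actually occurred.
        return sum(-math.log2(p) for p in probs_of_actual_symbols)

    # Same 4-symbol message under two models:
    weak_model   = [0.25, 0.25, 0.25, 0.25]   # uniform guess -> 8 bits
    strong_model = [0.90, 0.80, 0.95, 0.85]   # confident, correct predictions -> ~0.8 bits
    print(ideal_code_length(weak_model), ideal_code_length(strong_model))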
To see what a lossy generator hallucinating patterns means in practice, I recommend comparing HiFiC output against the originals here: https://hific.github.io/