The first two demo videos are interesting examples of using StyleCLIP's "global directions" to guide an image toward a "smiling face," as described in that paper, combined with smooth interpolation: https://github.com/orpatashnik/StyleCLIP
I ran a few chaotic experiments with StyleCLIP a few months ago that would work very well with smooth interpolation: https://minimaxir.com/2021/04/styleclip/
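For reference, smooth interpolation along a CLIP-derived edit direction is only a few lines. Here is a minimal sketch, assuming a pretrained StyleGAN-style generator `G` (taking a W-space code) and a precomputed StyleCLIP "smiling face" global direction `d_smile` (both placeholder names, not the actual repo API):

    import numpy as np

    # Hypothetical handles: G(w) renders an image from a W-space code w,
    # d_smile is a StyleCLIP "smiling face" global direction in the same space.
    def frames_along_direction(G, w, d_smile, max_strength=8.0, n_frames=60):
        frames = []
        for t in np.linspace(0.0, 1.0, n_frames):
            s = max_strength * (3 * t**2 - 2 * t**3)  # smoothstep easing for gentle motion
            frames.append(G(w + s * d_smile))
        return frames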
The previous approaches learned screen-space textures for different features and a feature mask to compose them.
Now it seems to actually learn the topology lines of the human face [0], as 3D artists would learn them [1] when they study anatomy. It also uses quad grids and even places the edge loops and poles in similar places.

[0] https://nvlabs-fi-cdn.nvidia.com/_web/alias-free-gan/img/ali...

[1] https://i.pinimg.com/originals/6b/9a/0c/6b9a0c2d108b2be75bf7...
There are some interesting 2D cues our eyes use to perceive 3D. If something is on the ground, half is above the horizon and half is below. Parallax is a 2D phenomenon.
After StyleGAN2 came out, I couldn't imagine what improvements could be made over it. This work is truly impressive.
The comparisons are illuminating: StyleGAN2's mapping of texture to specific pixel locations looks very similar to poorly implemented video-game textures. Perhaps future GAN improvements could come from tricks used in non-AI graphics development.
If ReLU-introduced high frequency components are indeed the culprit, won't using "softened" ReLU (without discontinuity in the derivative at 0) everywhere solve the problem, too?
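One way to poke at that question numerically is to apply both nonlinearities to a single band-limited sinusoid and look at how much energy shows up above the input frequency. This is just an illustrative NumPy experiment, not the paper's analysis; a smooth nonlinearity like softplus has no kink at 0 and its harmonics decay faster, but it still creates content above the input band, which is (as I understand it) why the paper wraps the nonlinearity in upsampling and low-pass filtering instead of only softening it:

    import numpy as np

    fs, f0 = 1024, 3                      # number of samples, input harmonic index
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * f0 * t)        # band-limited input: a single sinusoid

    outputs = {
        "relu": np.maximum(x, 0.0),
        "softplus": np.log1p(np.exp(x)),  # "softened" ReLU, smooth everywhere
    }
    for name, y in outputs.items():
        spectrum = np.abs(np.fft.rfft(y)) / fs
        # Energy at frequencies above the input tone = content the nonlinearity introduced.
        print(name, "energy above f0:", np.sum(spectrum[f0 + 1:] ** 2))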
I wonder if you could make the noise inputs work again by using the same process as for the latent code - generate the noise in the frequency domain, and apply the same shift and careful downsampling. If you apply the same shift to the noise as to the latent code, then maybe the whole thing will still be equivariant? In other words, it seems like the problem with the per-pixel noise inputs is that they stay stationary while the latent is shifted, so just shift them also!
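Spelling that idea out in 1-D (this is just the commenter's suggestion sketched, not anything from the paper): synthesize the noise in the frequency domain and translate it with a linear phase ramp, per the Fourier shift theorem, so it moves continuously along with everything else:

    import numpy as np

    def shifted_bandlimited_noise(n, cutoff, shift, seed=0):
        """Band-limited noise on n samples, translated by `shift` samples (can be fractional)."""
        rng = np.random.default_rng(seed)
        freqs = np.fft.rfftfreq(n)                        # cycles per sample
        spectrum = rng.normal(size=freqs.shape) + 1j * rng.normal(size=freqs.shape)
        spectrum[freqs > cutoff] = 0.0                    # keep it band-limited
        spectrum *= np.exp(-2j * np.pi * freqs * shift)   # shift theorem: delay by `shift`
        return np.fft.irfft(spectrum, n)

    a = shifted_bandlimited_noise(256, 0.25, 0.0)   # same underlying noise...
    b = shifted_bandlimited_noise(256, 0.25, 0.5)   # ...translated by half a sample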
I wonder if there are learnings from this that could be transposed into the 1-D domain for audio; as far as I know, aliasing is a frequent challenge when using deep learning methods for audio (e.g. simulating non-linear circuits for guitar amps).
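In audio the standard non-learned workaround is to oversample around the nonlinearity: upsample, apply the waveshaper, low-pass, and decimate, so the harmonics it creates have headroom above the original Nyquist instead of folding back down. A rough sketch, not tied to any particular amp-modeling method:

    import numpy as np
    from scipy.signal import resample_poly

    def antialiased_tanh(x, oversample=4):
        """Apply a tanh waveshaper at a higher sample rate to reduce aliasing."""
        up = resample_poly(x, oversample, 1)          # upsample (with anti-imaging filter)
        shaped = np.tanh(3.0 * up)                    # nonlinearity at the higher rate
        return resample_poly(shaped, 1, oversample)   # low-pass and decimate back

    fs = 48_000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 5_000 * t)              # 5 kHz test tone
    naive = np.tanh(3.0 * tone)                       # harmonics above Nyquist fold back as aliasing
    cleaner = antialiased_tanh(tone)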
You can see what they're saying about fixed-in-place features with the beards in the first video, but StyleGAN gets the teeth symmetry right, whereas this work seems to have trouble with it. Why don't the teeth in the StyleGAN result slide around like the beard does?
That's likely the GANSpace/SeFa part of the manipulation.
> In a further test we created two example cinemagraphs that mimic small-scale head movement and facial animation in FFHQ. The geometric head motion was generated as a random latent space walk along hand-picked directions from GANSpace [24] and SeFa [50]. The changes in expression were realized by applying the “global directions” method of StyleCLIP [45], using the prompts “angry face”, “laughing face”, “kissing face”, “sad face”, “singing face”, and “surprised face”. The differences between StyleGAN2 and Alias-Free GAN are again very prominent, with the former displaying jarring sticking of facial hair and skin texture, even under subtle movements.
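In other words, the cinemagraph recipe amounts to a small oscillation of the latent along a few fixed directions. A hand-wavy sketch, where `G`, the base code `w0`, and the GANSpace/SeFa directions `dirs` are assumed placeholders rather than real API:

    import numpy as np

    def cinemagraph_frames(G, w0, dirs, n_frames=120, amplitude=0.5):
        """Mimic small-scale head motion by oscillating the latent along hand-picked directions."""
        frames = []
        for i in range(n_frames):
            w = w0.copy()
            for k, d in enumerate(dirs):
                # Each direction gets its own phase offset so the motions don't sync up.
                w = w + amplitude * np.sin(2 * np.pi * i / n_frames + k) * d
            frames.append(G(w))
        return frames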
That's starting to be high enough quality that you could consider using it for some Hollywood-grade special effects. That beach morph is pretty impressive. Faces are perhaps not quite there yet, because we are so biologically hyper-focused on them, but you could make one heck of a drug-trip scene or a Doctor Strange-esque sequence with much less effort using some of these techniques, and that effort may well drop into YouTuber-video territory in the near future.
First, that's not the same technique and it's not being used for the same purpose.
Second, Hollywood doesn't care about that problem. They will take the best application of the technique, and they don't care if they have to apply a few manual touchups on the result. As long as there is one way of using the system to do the sort of thing they showed in the sample, it won't matter to them that they can't embed a full video game into the neural network itself. They only care about the happy path of the tech.
Someone's probably already starting the company now to use this in special effects, or putting someone on research in an existing company.
Still has the telltale sign of mismatched ears and/or earrings. This seems like the most reliable way to recognize them. Well, that and the nondescript background.
I wonder what dataset you could even use to tell a GAN about human internals. 3D renders of a skull with various layers removed?