Btw, I did this in pixel space for simplicity, cool animations, and compute costs. Would be really interesting to do this as an LDM (though of course you can't really do the LAB color space thing, unless you maybe train an AE specifically for that color space).
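For anyone curious what the LAB setup looks like in pixel space, it's roughly this (a minimal sketch using skimage; the normalization constants are just one reasonable choice, not anything canonical):

    import numpy as np
    from skimage import color

    def rgb_to_normalized_lab(img_rgb):
        # img_rgb: HxWx3 float array in [0, 1]
        lab = color.rgb2lab(img_rgb)             # L in [0, 100], a/b roughly in [-128, 127]
        lab[..., 0] = lab[..., 0] / 50.0 - 1.0   # L -> [-1, 1]
        lab[..., 1:] = lab[..., 1:] / 110.0      # a, b -> approximately [-1, 1]
        return lab

    def normalized_lab_to_rgb(lab_norm):
        lab = lab_norm.copy()
        lab[..., 0] = (lab[..., 0] + 1.0) * 50.0
        lab[..., 1:] = lab[..., 1:] * 110.0
        return np.clip(color.lab2rgb(lab), 0.0, 1.0)

The appeal for colorization is that the greyscale input is (approximately) the L channel, so the model only has to generate the two chroma channels.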
I was really interested in how color was represented in latent space and ran some experiments with VQGAN-CLIP. You can actually do a (not great) colorization of an image by encoding it w/ VQGAN and using a prompt like "a colorful image of a woman".
Would be fun to experiment with if anyone wants to try; would love to see any results if someone builds on it.
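If anyone wants a starting point, the loop is roughly the standard VQGAN-CLIP setup, seeded from the encoded greyscale photo. A sketch, assuming OpenAI's CLIP package; `vqgan` and `grey_img` are hypothetical stand-ins (a wrapper around a taming-transformers model, and the input photo), not real APIs:

    import torch
    import torch.nn.functional as F
    import clip  # OpenAI CLIP: github.com/openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)
    CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
    CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

    # grey_img: the B&W photo as a 1x3xHxW tensor in [0, 1] (hypothetical input).
    # vqgan: hypothetical wrapper with encode() -> pre-quantization latents,
    # decode() -> 1x3xHxW image in [0, 1].
    z = vqgan.encode(grey_img).detach().requires_grad_(True)  # start from the greyscale image's latents
    opt = torch.optim.Adam([z], lr=0.05)
    target = clip_model.encode_text(clip.tokenize(["a colorful image of a woman"]).to(device)).detach()

    for step in range(200):
        img = vqgan.decode(z)
        patch = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)
        loss = -F.cosine_similarity(clip_model.encode_image((patch - CLIP_MEAN) / CLIP_STD), target).mean()
        # Optionally anchor luminance so it recolors rather than repaints:
        # loss = loss + F.l1_loss(img.mean(dim=1), grey_img.mean(dim=1))
        opt.zero_grad()
        loss.backward()
        opt.step()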
A slight nitpick: wouldn't doing diffusion in the latent space be cheaper?
Depends. Given the low res, the 3x64x64 pixel-space image is smaller than the latents you would get from encoding a higher-res image with models like VQGAN or the Stable Diffusion VAE at their native resolutions.
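(For scale: 3x64x64 is 12,288 values, while the Stable Diffusion VAE maps a 512x512x3 image to a 4x64x64 latent, i.e. 16,384 values.)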
It's easier to get a sense of what's going wrong with a pixel space model though. With latent space, there's always the question of how color is represented in latent space / how entangled it is with other structure / semantics.
Starting in pixel space removed a lot of variables from the equation, but latent diffusion is the obvious next step.
Took a lot of failed experiments; the model would keep converging to greyscale / sepia images. I think one of the ways I fixed it was by adding a greyscale encoder to the arch and using its output embedding as additional conditioning. Can't remember if I only added it to the UNet input or injected it during various stages of the UNet down pass.
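The conditioning idea looks something like this (a sketch of the simplest variant, concatenation at the UNet input; the layer sizes here are made up, not the actual architecture):

    import torch
    import torch.nn as nn

    class GreyscaleEncoder(nn.Module):
        """Maps the 1-channel greyscale input to a feature map used as extra conditioning."""
        def __init__(self, out_channels=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.SiLU(),
                nn.Conv2d(32, out_channels, 3, padding=1), nn.SiLU(),
            )

        def forward(self, grey):
            return self.net(grey)  # B x out_channels x H x W

    grey = torch.randn(1, 1, 64, 64)      # stand-in for the greyscale input
    noisy_ab = torch.randn(1, 2, 64, 64)  # stand-in for the noisy chroma channels being denoised
    cond = GreyscaleEncoder()(grey)
    unet_in = torch.cat([noisy_ab, cond], dim=1)  # Bx(2+64)xHxW, fed to the UNet's first conv

The other variant mentioned above would downsample `cond` and inject it at each stage of the down pass instead of only at the input.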
I’m not a fan of B&W colorization. Often the colors are wrong: either outright color errors (like the choices for clothing or cars) or a failure to take into account lighting conditions (late-in-the-day shadows but midday brightness).
Then there is the issue of B&W movies. Using this kind of tech might not give pleasing results as the colors used for sets and outfits were chosen to work well for film contrast and not for story accuracy. That “blue” dress might really be green. (Please, just leave B&W movies the way they are.)
I think keeping the art as it was produced is important, but there is also a good history of modifying art to produce new art. In the digital age, we aren’t losing the original art, so it seems even stranger to be against modification of the “original.”
However, just applying a simple filter (or single transform without effort) definitely feels derivative to me.
Maybe you're used to looking at B&W stuff and effortlessly figuring out what the scene is depicting, but for me at least it's very hard. Adding a little color makes it much easier. In that regard, it doesn't matter to me if the colors are wrong.
(Perhaps it just takes some getting used to. Back when I read a black and white comic for the first time (as a child), I had a hard time figuring out things at first but got used to it at some point.)
I think the point being made is that movies were made for the B&W end result, not just shooting color with B&W film.
For instance, fake blood in B&W was often produced with black liquid. Colorizing it correctly just doesn't make sense. Or a green or blue dress can be chosen because of the way it looks on film, not because it's supposed to BE a green or blue dress.
I don’t see why it matters if the blue dress was really green. The result is either an enhanced experience or not; if it is, then minor inaccuracies don’t seem relevant.
If there's a source that a blue dress was green, then that could be taken into consideration for recoloring, but as you said, it's to enhance the experience, not to be 100% accurate.
Quite often, colorized pics and movies have people wear blue-ish clothing, which is fairly unbelievable. It's a gimmick that produces a not-quite-right effect in service of a goal it isn't suited for. Because what is it that colorizations try to achieve? To make people think "Oh, so that's how it looked back then"? Then there shouldn't be errors in the image. And if it makes the pictures more relatable, or whatever hand-wavy arguments are being thrown around, then non-colorized pictures will become even less relatable, in effect alienating people from recent history (if you believe such arguments).
I'd like to make one exception, though, for They Shall Not Grow Old. That was impressive.
I think colorization with some effort put in can be pretty decent. E.g. I prefer the 2007 colorization of It's a Wonderful Life to the original. It's never perfect but I don't think that's a prerequisite to being better. Some will always disagree though.
Just about every completely automated colorized video tends to be pretty bad, though. Particularly the YouTube "8k colorized interpolated" kind of low-effort channels that just pump them out without caring whether the result is actually any good.
Yeah, it's cool tech, but I really don't appreciate how it is just straight-up deceitful and spreads misinformation. A lot of hues are underdetermined, and the result is more or less arbitrary in a historical context. If one were to research and fine-tune the model such that ambiguous shades are historically accurate, I would be less annoyed. Compare this with Sergey Prokudin-Gorsky's photos of the Russian Empire or autochromes of Paris in 1910, which are actual windows into a lost world.
*for works of fiction these issues vanish, but for any historical or documentary photographs/films, I really hate that I am being lied to.
One of the nice features of the somewhat old Deoldify colorizer is support for any resolution. It actually does better than Photoshop's colorization: https://blog.maxg.io/colorizing-infrared-images-with-photosh...
Edit: technically, I suppose, the way Deoldify works is by rendering the color at a low resolution and then applying it to the higher-resolution image using OpenCV. I think the same sub-sampling approach could work here...
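That sub-sampling trick is cheap to test. A sketch of the general idea (not Deoldify's exact pipeline): since chroma is mostly low-frequency, you can predict color at a small size, upsample only the ab channels, and re-attach them to the full-resolution luminance. This assumes 8-bit OpenCV images and treats greyscale intensity as the L channel, which is only approximately right:

    import cv2
    import numpy as np

    def apply_lowres_color(orig_grey, colorized_small):
        # orig_grey: full-res uint8 greyscale (HxW)
        # colorized_small: low-res uint8 BGR output of the colorizer
        h, w = orig_grey.shape[:2]
        small_lab = cv2.cvtColor(colorized_small, cv2.COLOR_BGR2LAB)
        ab = cv2.resize(small_lab[:, :, 1:], (w, h), interpolation=cv2.INTER_LINEAR)
        lab = np.dstack([orig_grey, ab])  # full-res L, upsampled a/b
        return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)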
I have to wonder whether it works well with anything else.
Technically yes, the encoder and UNet are convolutional and support arbitrary input sizes, but the model was trained at 64x64px because of compute limitations. You could probably resume the training from a 64x64 resolution checkpoint and train at a higher resolution.
But like most diffusion models, they don't generalize very well to resolutions outside of their training data.
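("Convolutional, so arbitrary sizes work" just means nothing in the weights pins down the spatial size; a toy illustration:)

    import torch
    import torch.nn as nn

    # The same convolutional weights run at any spatial size;
    # only the training distribution makes 64x64 special.
    net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(),
                        nn.Conv2d(16, 3, 3, padding=1))
    print(net(torch.randn(1, 3, 64, 64)).shape)    # torch.Size([1, 3, 64, 64])
    print(net(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 3, 128, 128])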
Is there anything that exists right now with diffusion models to improve poor VHS coloring? The color information does exist, so I would not want to replace a red shirt with a blue shirt, for example; it's just not very accurate.
The last time I checked, “the source is public domain” is not a valid defense against the pro-DRM parts of that law.