There's definitely value in providing this functionality for photographs taken in the present.
But I think the real value -- and this is definitely in Google's favor -- is providing this functionality for photos you have taken in the past.
I have probably 30K+ photos in Google Photos that capture moments from the past 15 years. There are quite a lot of them where I've taken multiple shots of the same scene in quick succession, and it would be fairly straightforward for Google to detect such groupings and apply the technique to produce synthesized pictures that are better than the originals. It already does something similar for photo collages and "best in a series of rapid shots." They surface without my having to do anything.
The tiled/stacked approach others mention is good, and probably the best option. You could also try an uncompressed format (even just uncompressed .png) or something simple like RLE, then 7-Zip the files together, since 7z is the only archive format I'm aware of that does inter-file (as opposed to intra-file) compression, via its solid mode.
Unfortunately, lossless video compression won't help here, since most codecs compress frames individually when operating losslessly.
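To make the inter-file redundancy point concrete, here's a minimal synthetic sketch using Python's stdlib `zlib` as a stand-in for a solid archive (a real solid 7z archive applies the same idea with a much larger window; the frame data here is made up for illustration):

```python
import random
import zlib

# Two synthetic "burst shots": 4 KiB of pseudo-random bytes (a stand-in
# for already-noisy sensor data), where the second frame differs from
# the first only in a small patch of perturbed pixels.
rng = random.Random(0)
frame_a = bytes(rng.randrange(256) for _ in range(4096))
frame_b = bytearray(frame_a)
for i in range(0, 256, 7):               # perturb a few leading bytes
    frame_b[i] = (frame_b[i] + 40) % 256
frame_b = bytes(frame_b)

# Intra-file only: each frame compressed on its own.
separate = len(zlib.compress(frame_a, 9)) + len(zlib.compress(frame_b, 9))

# Inter-file: both frames in one stream, so the second can back-reference
# the first -- the effect a solid archive exploits across files.
together = len(zlib.compress(frame_a + frame_b, 9))

print(separate, together)
```

The random frames are individually incompressible, so `separate` comes out near the raw size, while the combined stream shrinks to roughly half of that because the second frame deduplicates against the first.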
Philosophically, yes. But some photo-editing techniques rely on data that is not backfillable and must be recorded at capture time. And even in cases where there is no functional impediment to applying it against historical photos, sometimes there is product gatekeeping to contend with.
> ...fairly straightforward for Google to detect such groupings and apply the technique to produce synthesized pictures that are better than the originals.
Wouldn't an operation like this require some kind of fine-tuning? Or do diffusion models have a way of using images as context, the way one would provide context to an LLM?
I think simpler algorithms (e.g. image histograms) can get you a long way. Regardless of the mechanism, Google Photos already has the capability to detect similar images, which is used to generate animated gifs.
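A minimal, stdlib-only sketch of the histogram idea (the pixel data is synthetic and the 16-bin resolution is an arbitrary choice): near-duplicate shots land in nearly the same bins, so a simple histogram intersection separates them from an unrelated scene.

```python
from collections import Counter

def histogram(pixels, bins=16):
    """Normalized histogram of 8-bit grayscale pixel values."""
    counts = Counter(p * bins // 256 for p in pixels)
    n = len(pixels)
    return [counts.get(b, 0) / n for b in range(bins)]

def intersection(h1, h2):
    """Histogram intersection: 1.0 = identical distributions, 0.0 = disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Synthetic stand-ins for photos: a scene, a near-duplicate "second shot"
# with slight exposure drift, and an unrelated darker scene.
shot1 = [(i * 37) % 200 + 20 for i in range(4096)]
shot2 = [min(255, p + 5) for p in shot1]      # same scene, +5 exposure
other = [(i * 13) % 60 for i in range(4096)]  # different, darker scene

same = intersection(histogram(shot1), histogram(shot2))
diff = intersection(histogram(shot1), histogram(other))
print(round(same, 3), round(diff, 3))
```

Thresholding the intersection score is then enough to propose candidate groupings; a production system would obviously combine this with timestamps and stronger features.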
When you think about it, the only thing that's weird about this hypothetical conversation is the context of it being about (purported) photographs.
We expect images that look like photographs — at least when taken by amateurs — to be the result of a documentary process, rather than an artistic one. They might be slightly filtered or airbrushed, but they won't be put together from whole cloth.
But amateur photography is actually the outlier, in the history of "capturing memories"!
If you imagine yourself before the invention of photography, describing your vacation to an illustrator you're commissioning to create some woodblock-print artwork for a set of Christmas cards you're having made up, the conversation you've laid out here is exactly how things would go. They'd ask you to recount what you saw, do a sketch, and then you'd give feedback and iterate together with them to get a final visual down that reflects things the way you remember them, rather than the way they were, per se.
This is an interesting point. Usually people claim technology goes inexorably forward, yet here we are, merrily destroying trust in the most objective method we have to record the past!
FB AI, make a series of posts about me climbing Mount Everest, meeting the Dalai Lama, curing cancer, bringing peace to Ukraine, changing my name to Melon Tusk, announcing a run for president, and adopting a dog named Molly
But see, that's the sort of thing that would give it away.
You've got to shoot for something just attainable enough to sound credible, while still being at the "enviable" end of the spectrum.
"FB AI, make a series of pictures of my first 3 months at Goldman Sachs in 2021. Include me shaking hands with the VP of software as I receive a productivity award for making them $1m in a week. Include a group photo of me and 12 other people (all C execs and my VP must be there). Crosspost all to LinkedIn, with notifications muted."
"Ok done"
"ChatGPT, take my existing CV and replace entries from 2021 onwards with a job as Head of Performance Monitoring at Goldman Sachs, reporting to the VP of software. Include several projects with direct CEO and CFO involvement. Crosspost changes to LinkedIn."

"Ok done"

... and now I can go job-hunting.
This actually feels like it could be an incredibly valuable post-production tool in film and TV, once they get it working consistently across multiple frames.
Not only for more flexibility in "uncropping" after shooting (there was a tree/wall in the way), but this could basically be the holy grail solution for converting 4:3 to widescreen without cutting off content on the top and bottom.
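Back-of-the-envelope numbers on why outpainting is attractive here (the 1440x1080 master resolution is just illustrative): cropping a 4:3 frame to 16:9 discards a quarter of the original, while extending the sides means a quarter of the new frame is synthetic.

```python
from fractions import Fraction

# Converting a 4:3 master (e.g. 1440x1080) to 16:9.
w, h = 1440, 1080
target = Fraction(16, 9)

# Option A: crop top/bottom to reach 16:9 -- content is lost.
crop_h = int(w / target)        # new height after cropping
lost = 1 - crop_h / h           # fraction of the frame discarded

# Option B: outpaint the sides to reach 16:9 -- content is generated.
new_w = int(h * target)         # new width after extending
generated = 1 - w / new_w       # fraction of the new frame that is synthetic

print(crop_h, new_w)
print(f"crop discards {lost:.0%}, outpaint synthesizes {generated:.0%}")
```

Either way a quarter of the picture is at stake, which is why the shot-by-shot judgment discussed below matters so much.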
I already use Photoshop Generative Fill for uncropping videos, but it only works for fixed-camera shots. Photoshop just added a feature where you can drag the video file in and do the uncrop in one step.
The problem I'm solving is converting videos from widescreen to vertical and sometimes you need some extra height.
Mind if I ask why you'd need to do that? That's a huge amount of the frame being generated artificially, especially if you're talking cinema-aspect-ratio widescreen.
I can see it working great for some stuff, but with more artistic work, wouldn't you ultimately face the issue that the framing might not be very good if you're just artificially extending it?
It definitely needs to be applied judiciously on a shot-by-shot basis.
There have been quite a few 4:3-to-widescreen conversions that were done using the original film that was actually shot in widescreen and cropped for TV.
Sometimes, the wider shot makes perfect sense. Sometimes, they keep the original cropped one but cut off top/bottom. Sometimes it's a combination of the two. It all depends on what's being framed -- two people in a car usually benefits from cropping (nobody needs the bottom third of the frame occupied by the car's hood), while a close-up on someone's face usually benefits from extending the sides (otherwise it's an uncomfortable mega-close up that cuts off their mouth).
But having the flexibility to extend horizontally opens up artistic possibilities.
Also getting everyone smiling with their eyes open at the same time. Phone cameras could record a group photo for five or ten seconds and use the best expression from different times for each person.
Or you take a single picture of a group in front of a monument, but cut it off. As I understand it you could find pictures of the monument online, run the model, and have a picture with the group and the entire monument.
Google can probably even do this automatically -- I would not be surprised if I get suggestions to fix images with cut-off buildings in Google Photos in the future! Would be so cool.
You don’t even need to take the photo: with enough images of each family member and of a tourist destination, you can just automatically construct a photo of everyone together at the location, saving the cost and carbon footprint of actually getting everyone together.
And then why demand "photos" of family excursions at all, when it is just an AI imagining how things probably were happening at the time, or would have happened? We should just stick to our own imperfect memory.
I have been working on a holographic camera, but the ultra-cheap pinhole cameras I chose for the array have two issues: the exposure can't be controlled and the lenses are poorly aligned. I can calibrate away most lens aberrations with OpenCV, but some of the outliers have so much cropping that I am discarding 75% of my good pixels to get a coherent result. I was considering using NeRFs to reproject the ideal camera angles, but COLMAP is not very tolerant of brightness fluctuations and NeRF training is relatively slow (considering my goal is video). This would be a nice solution to my problem, because I have a comprehensive set of angles to pull context from.
So is the weather just hallucinated then? We're just making up memories and calling them real? And advertising this blatantly, calling rainy days sunny and sunny days rainy? My god, I hate this so much.
Not even a discussion about if this might be harmful or what the risks are or anything, just plain old "THIS FAKE MOMENT WAS REAL AND YOU'LL BELIEVE IT"?!
I really have a hard time with this. Wow I'm upset, more than I expected. The tech is fine yeah but the marketing is just deeply upsetting.
Seems like the real utility of this technique will be as a way to vastly improve temporal stability in a variety of generative video techniques. For example, if you are trying to use a video as the base for a new generative video: take the first frame of your video and run it through Stable Diffusion with the ControlNet of your choice. Then take that initial image and run it through this process to produce a new base model, and use that to generate your second frame. Now feed that second frame back into the model and rinse and repeat, always using the past few frames to inform the latest.
an interesting use case for this once the compute is there is to basically allow for ai powered digital zoom-out. it could work by instructing the user to take several pictures around the target, and then you take regular pictures of your subject.
then, as you like, you can do an "ai zoom out" to get zoomed out pictures, no longer constrained by your lens or distance.
I imagine this will be included relatively soon, just like how panoramas were once a niche thing that became much easier with some good ui/ux. pretty much any modern phone can do them without having to struggle with lining up photos and whatnot.
one thing that does greatly concern me about the demo/site is that they use "authentic" and "recover" as terms. the result here is not authentic, nor has anything been "recovered." it's an illusion at best. I personally don't like how they portray the new image as if the lens had framed it in the original picture. it's not, as they show themselves near the end with the text sign. seriously irresponsible framing (pun intended) for what's otherwise very cool tech.
> I have probably 30K+ photos in Google Photos that capture moments from the past 15 years.
They do take up a lot of space, and just today I asked in photo.stackexchange for backup compression techniques that can exploit inter-image similarities: https://photo.stackexchange.com/questions/132609/backup-comp...
Facebook: Great. I'd be happy to. Any more detail you'd like to add?
Me: Make us look attractive. Show that we're having a great time. Also, we went to see the Chatham Lighthouse.
Facebook: OK, done!
...
Facebook: You've received 48 likes. Your mother would like to know if you had any salt water taffy.
Me: Yes, and please create a picture of my oldest daughter having trouble chewing it.
Facebook: Done.
Facebook: I'd be happy to. Are there any more details you'd like to include?
me: Please show how he didn't understand me at first, but then he looks at me and starts crying with love and regret.
Facebook: Done. Your relationship with your father must have been deeply fulfilling.
...that, and other thoughts I have while baked.
> The problem I'm solving is converting videos from widescreen to vertical and sometimes you need some extra height.
You’re a monster.
So then you just feed RealFill the 20 pictures you took and your uncle is magically painted in.
A box that takes your gps location, weather, etc and autogenerates a photo from your PoV.
This has always been the case, you just don't remember it, and the (human) hallucinated details are usually just not important enough to care about.
https://www.reddit.com/r/StableDiffusion/comments/16uqqrh/ho...
Give that a couple generations. “You were at location X and didn’t take a pic. We generated you some selfies, choose one that you like.”
I don't know if that's more or less creepy than the AI stuff...