Text-to-image models feel inefficient to me. I wonder if it would be possible, and better, to do it in separate steps: text to scene graph, scene graph to semantically segmented image, segmented image to final image. That way each step could be trained separately and be modular, and the image would be easier to edit instead of being completely replaced by the output of a new prompt. It should then be much easier to generate stuff like "object x next to object y, with the text foo on it", and the art style or level of realism would depend on the final rendering model, which would be separate from prompt adherence.
Kind of like those video2video (or img2img on each frame I guess) models where they enhance the image outputs from video games:
https://www.theverge.com/2021/5/12/22432945/intel-gta-v-real...
https://www.reddit.com/r/aivideo/comments/1fx6zdr/gta_iv_wit...
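To make the idea concrete, here's a toy, self-contained sketch of the kind of pipeline I mean. Nothing in it is a real model; every stage is a stand-in that could be swapped for a trained network:

    # Toy sketch of the staged idea -- each stage is a placeholder you could
    # swap for a trained model (text -> scene graph -> segmentation -> image).
    import numpy as np

    PALETTE = {"sky": (135, 206, 235), "grass": (60, 179, 113),
               "person": (222, 184, 135), "hat": (255, 215, 0)}

    def text_to_scene_graph(prompt: str) -> list[dict]:
        # Stand-in for a text->scene-graph model: hard-coded layout for the demo.
        return [
            {"label": "sky", "box": (0, 0, 128, 64)},
            {"label": "grass", "box": (0, 64, 128, 128)},
            {"label": "person", "box": (48, 40, 80, 120)},
        ]

    def scene_graph_to_segmentation(graph: list[dict], size=(128, 128)) -> np.ndarray:
        # Stand-in for a layout model: rasterize boxes into a label map.
        seg = np.full(size, "background", dtype=object)
        for node in graph:
            x0, y0, x1, y1 = node["box"]
            seg[y0:y1, x0:x1] = node["label"]
        return seg

    def segmentation_to_image(seg: np.ndarray) -> np.ndarray:
        # Stand-in for the final rendering model: flat colors per region.
        img = np.zeros((*seg.shape, 3), dtype=np.uint8)
        for label, color in PALETTE.items():
            img[seg == label] = color
        return img

    graph = text_to_scene_graph("a person standing on grass")
    image = segmentation_to_image(scene_graph_to_segmentation(graph))

    # The editing story: add a hat by touching only the scene graph and
    # re-rendering -- the rest of the image is untouched by construction.
    graph.append({"label": "hat", "box": (48, 32, 80, 44)})
    edited = segmentation_to_image(scene_graph_to_segmentation(graph))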
In general, it has been shown time and time again that this approach fails for neural network based models.
If you can train a neural network that goes from a to b and a network that goes from b to c, you can usually replace that combination with a simpler network that goes from a to c directly.
This makes sense, as there might be information in a that we lose by a conversion to b. A single neural network will ensure that all relevant information from a that we need to generate c will be passed to the upper layers.
Yes, this is true: you do lose some information between the layers, and this increased expressibility is the big benefit of using ML instead of classic feature engineering. However, I think the gain would be worth it for some use cases. You could, for instance, take an existing image, run it through a semantic segmentation model, and then edit the underlying image description. You could add a yellow hat to a person without regenerating any other part of the image, edit existing text, change a person's pose, probably more easily convert images to 3D, etc.
It's probably not a viable idea, I just wish for more composable modules that let us understand the models' representations better and change certain aspects of them, instead of these massive black boxes that mix all these tasks into one.
I would also like to add that the text2image models already have multiple interfaces between different parts. There's the text encoder, the latent-to-pixel-space VAE decoder, controlnets, and sometimes there's a separate img2img style transfer at the end. Transformers already process images patchwise, but why do those patches have to be uniform square patches instead of semantically coherent areas?
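For what it's worth, the controlnet interface already gets close to "segmentation map as an intermediate representation". A minimal sketch with the diffusers library, where the checkpoint names are just examples of publicly available segmentation-conditioned models, not anything specific to this thread:

    # Sketch: render a final image from a semantic segmentation map using a
    # segmentation-conditioned ControlNet (checkpoint names are examples).
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # The scene layout lives in the segmentation map: edit the map (e.g. paint
    # a new region), keep the prompt, and re-render.
    seg_map = Image.open("segmentation_map.png")
    result = pipe(
        "a person wearing a yellow hat, photorealistic",
        image=seg_map,
        num_inference_steps=30,
    ).images[0]
    result.save("rendered.png")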
Isn't this essentially the approach to image recognition etc. that failed for ages until we brute-forced it with bigger and deeper matrices?
It seems sensible to extract features and reason about things the way a human would, but it turns out it's easier to scale pattern matching done purely by the computer.
https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
A problem I can think of with image recognition is that any crude categorization of the image, which is millions of pixels, will make it less accurate.
With image generation, on the other hand, which starts from a handful of words, we can first do some text processing into categories, such as objects vs people, color vs brightness, environment vs main object, etc.
You could imagine doing it with two specialized NNs, but then you have to put together a huge labeled dataset of scene graphs. The problem, fundamentally, is that any "manual" feature engineering is not going to be supervised and fitted on a huge corpus the way the self-learned features are.
I am hoping that AI art tends towards a modular approach, where generating a character, setting, style, and camera movement each happens in its own step. It doesn’t make sense to describe everything at once and hope you like what you get.
Definitely, that would make much more sense given how content is produced by people. Adjust the technology to how people want to use it instead of forcing artists to become prompt engineers and settle for something close enough to what they want.
At the very least, image generators should output layers. I think the style component is already possible with the img2img models.
That's essentially what diffusion does, except it doesn't have clear boundaries between "scene graph" and "full image". It starts out noisy and adds more detail gradually.
That's true, the inefficiency is from using pixel-to-pixel attention at each stage. In the beginning, low resolution would be enough; even at the end, high resolution is only needed in a pixel's neighborhood.
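To put rough numbers on that (illustrative grid sizes, just back-of-the-envelope arithmetic): full self-attention over n tokens costs on the order of n^2 pairs, so every doubling of the grid side multiplies the cost by about 16.

    # Illustrative only: cost of full self-attention over square token grids.
    for side in (32, 64, 128):
        n = side * side  # number of tokens/patches
        print(f"{side}x{side} grid: {n} tokens, {n * n:,} attention pairs")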
I guess the inefficiency is obvious to many; it's just a matter of time until something like this comes out. And yeah, as others said, you might lose info going a-to-b that's needed for b-to-c, but you gain more in predictability/customization.
Non-commercial is not open-source, because if the original copyright holder stops maintaining it, nobody else can continue (or has to work like a slave for free). Open-source is about what happens if the original author stops working on it. Open-source gives everyone the license to continue developing it, which obviously means also the ability to get paid. Don't call it open-source if this aspect is missing.
Only the FLUX.1 [schnell] is open-source (Apache2), FLUX.1 [dev] is non-commercial.
There is OpenFLUX.1 which is a fine tune of the FLUX.1-schnell model that has had the distillation trained out of it. OpenFLUX.1 is licensed Apache 2.0. https://huggingface.co/ostris/OpenFLUX.1/
<https://opensource.org/osd>
Certain usages may be covered by trademark protection, as an "OSI Approved License":
<https://opensource.org/trademark-guidelines>
It's based on the Debian Free Software Guidelines (DFSG), which were adopted by the Debian Project to determine what software does, and does not, qualify to be incorporated into the core distribution. (There is a non-free section, it is not considered part of the core distribution.)
<https://www.debian.org/social_contract#guidelines>
Both definitions owe much to the Free Software Foundation's "Free Software" definition and the four freedoms protected by the GNU GPL:
- the freedom to use the software for any purpose,
- the freedom to change the software to suit your needs,
- the freedom to share the software with your friends and neighbors, and
- the freedom to share the changes you make.
<https://www.gnu.org/licenses/quick-guide-gplv3>
<https://www.gnu.org/philosophy/free-sw.html>
> Doesn’t open source mean the source is viewable/inspectable?
According to the OSI definition, you also need a right to modify the source and/or distribute patches.
> I don’t know any closed source apps that let you view the source.
A lot of them do, especially in the open-core space. The model is called source-available.
If you're selling to enterprises and not gamers, that model makes sense. What stops large enterprises from pirating software is their own lawyers, not DRM.
This is why you can put a lot of strange provisions into enterprise software licenses, even if you have little to no way to enforce these provisions on a purely technical level.
Open source usually means that you are able to modify and redistribute the software in question freely. However, between open and closed there is another class: source-available software. From its Wikipedia page:
> Any software is source-available in the broad sense as long as its source code is distributed along with it, even if the user has no legal rights to use, share, modify or even compile it.
As I said above: Open-source is about what happens if the original author stops working on it. Having the code viewable/inspectable is a side effect of that - can't sustain a project if all you have are blobs. Famously, Richard Stallman started GNU because he wanted to fix a printer: "Particular incidents that motivated this include a case where an annoying printer couldn't be fixed because the source code was withheld from users." https://en.wikipedia.org/wiki/History_of_free_and_open-sourc...
My favorite thing to do with Flux is create images with a white background for my substack[1] because the text following is amazing and I can communicate something visually through the artwork as well.
[1] https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_...
That example you gave is a good reason why artists get pissed off, IMO. The model is clearly aping some artist's specific style, and that artist is now missing out on paid work as a result.
Not sure I have an opinion on that, technology marches on etc, but it is interesting.
I understand your point, but in 0% of all cases would I hire an artist to create imagery for my personal blog. Therefore, I would think that market doesn't exist.
I don't care about artists' opinions on the rest of us using AI tools instead of paying them, because I couldn't and wouldn't have paid them anyway, so there's no demand in the first place.
All I wanna know is the prompt that was used to generate the art. Speaking of which, I wanna know how to create cartoony images like that, OP.
"A hand-drawing of a scientific middle-aged man in front of a white background.
The man is wearing jeans and a t-shirt.
He is thinking a bubble stating "What's in a ReAct JSON prompt?"
In the style of European comic book artists of the 1970s and 1980s."
Finding the right seed and model configuration is the more difficult part.
Flux is the leading contender for locally hosted generative systems in terms of prompt adherence, but the omnipresent shallow depth of field is irritatingly hard to get rid of.
They almost certainly did DPO it, so that would have an effect. It was also probably just trained more on professional photography than cell phone pics.
I’ve found it odd how there’s a segment of the population that hates a shallow depth of field now, as they’re so used to their phone pictures. I got in an argument on Reddit (sigh) with someone who insisted that the somewhat shallow depth of field that SDXL liked to do by default was “fake.”
As in, he was only ever exposed to it through portrait mode and the like on phones and didn’t comprehend that larger sensors simply looked like that. The images he was posting that looked “fake” to him looked to be about a 50mm lens at f/4 on a full frame camera at a normal portrait distance, so nothing super shallow either.
I just cancelled my Midjourney subscription, it feels like it's fallen too far behind for the stuff I'd like to do. Spent a lot of time considering using Replicate as well as Ideogram.
I have been questioning the value beyond novelty as well recently. I’m curious if you replaced it with another tool or simply don’t derive value from those things?
Does someone know what FLUX 1.1 has been trained on?
I generated almost a hundred images on the pro model using "camera filename + simple word" two-word prompts, and they all look like photos from someone's phone. Like, unless it has text, I would not even stop to consider any of these images AI. They sometimes look cropped. A lot of food pictures, messy tables and apartments, etc.
Did they scrape public Facebook posts? Snapchat? Vkontakte? Buy private images from OneDrive/Dropbox? If I put a female name as the second word, it almost always triggers the NSFW filter. So I assume the images in the training set are quite private.
See for yourself (autoplay music warning):
people: https://vm.tiktok.com/ZGdeXEhMg/
food and stuff: https://vm.tiktok.com/ZGdeXEBDK/
signs: https://vm.tiktok.com/ZGdeXoAgy/
[edit]
Looking at these images makes me uneasy, like I am looking at someone's private photos. There is not enough "guidance" in a prompt like "IMG00012.JPG forbid" to account for these images, so it must all come from the training data.
I do not believe FLUX 1.1 pro has radically different training set than these previous open models, even if it is more prone to such generation.
It feels really off, so, again, is there any info on training data used for these models?
It's not just flux, you can do the same with other models including Stable Diffusion.
These two reddit threads [1][2] explore this convention a bit.
[1]: https://www.reddit.com/r/StableDiffusion/comments/1fxkt3p/co...
[2]: https://www.reddit.com/r/StableDiffusion/comments/1fxdm1n/i_...
DSC_0001-9999.JPG - Nikon Default
DSCF0001-9999.JPG - Fujifilm Default
IMG_0001-9999.JPG - Generic Image
P0001-9999.JPG - Panasonic Default
CIMG0001-9999.JPG - Casio Default
PICT0001-9999.JPG - Sony Default
Photo_0001-9999.JPG - Android Photo
VID_0001-9999.mp4 - Generic Video
Edit: Also created a version for 3D Software Filenames (all of them tested, only a few had some effects)
Autodesk Filmbox (FBX): my_model0001-9999.fbx
Stereolithography (STL): Model0001-9999.stl
3ds Max: 3ds_Scene0001-9999.max
Cinema 4D: Project0001-9999.c4d
Maya (ASCII): Animation0001-9999.ma
SketchUp: SketchUp0001-9999.skp
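If anyone wants to try these systematically, here's a small sketch that builds "camera default filename + simple word" prompts like the ones listed above and sends them to a hosted FLUX endpoint. The word list is mine and the Replicate model slug is an assumption, so double-check it:

    # Sketch: build "camera default filename + one word" prompts and send them
    # to a hosted FLUX endpoint via the Replicate client.
    import random
    import replicate

    PREFIXES = ["DSC_", "DSCF", "IMG_", "P", "CIMG", "PICT", "Photo_"]
    WORDS = ["kitchen", "birthday", "garden", "lunch", "dog"]

    def casual_photo_prompt() -> str:
        return f"{random.choice(PREFIXES)}{random.randint(1, 9999):04d}.JPG {random.choice(WORDS)}"

    prompt = casual_photo_prompt()
    output = replicate.run(
        "black-forest-labs/flux-1.1-pro",  # assumed model slug; verify before use
        input={"prompt": prompt},
    )
    print(prompt, output)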
I highly doubt it’s a product of the raw training dataset because I had the opposite problem. The token for “background” introduced intense blur on the whole image almost regardless of how it was used in the prompt, which is interesting because their prompt interpretation is much better.
It seems likely that they did heavy calibration of text as well as a lot of tuning efforts to make the model prefer images that are “flux-y”.
Whatever process they’re following, they’ve inadvertently made the model overly sensitive to certain terms to the point at which their mere inclusion is stronger than a Lora.
The photos you’re showing aren’t especially noteworthy in the scheme of things. It doesn’t take a lot of effort to “escape” the basic image formatting and get something hyper realistic. Personally I don’t think they’re trying to hide the hyper realism so much as trying to default to imagery that people want.
They point to their comparison page to claim similar quality. First off, it's very clear that there is much less detail, but worse, look at the example "Three-quarters front view of a yellow 2017 Corvette coming around a curve in a mountain road and looking over a green valley on a cloudy day."
The original model shows the FRONT of the Corvette; the speed version shows the BACK. It's a completely different picture. This is not similar but strikingly different.
https://flux-quality-comparison.vercel.app/
I want a picture of frozen cyan peach fuzz.
Prompt: frozen cyan peach fuzz, with default settings on a first generation SD model.
People _seriously_ do not understand how good these tools have been for nearly two years already.
compositing.
do this with ai today where each layer you want has just the artifact on top of a green background.
layer them in the order you want, then chroma key them out like you’re a 70s public broadcasting station producing reading rainbow.
the ai workflow becomes a singular, recursive step until your disney frame is complete. animate each layer over time and you have a film.
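that workflow is basically scriptable already; a rough sketch of the keying and compositing part (nothing ai-specific, just numpy and Pillow):

    # Rough sketch of the chroma-key compositing idea: key out a green
    # background per layer, then stack the layers back-to-front.
    import numpy as np
    from PIL import Image

    def key_out_green(path: str, threshold: int = 60) -> Image.Image:
        """Make strongly green pixels transparent; return an RGBA layer."""
        rgba = np.array(Image.open(path).convert("RGBA"))
        r = rgba[..., 0].astype(int)
        g = rgba[..., 1].astype(int)
        b = rgba[..., 2].astype(int)
        rgba[(g - np.maximum(r, b)) > threshold, 3] = 0
        return Image.fromarray(rgba)

    def composite(layer_paths: list[str]) -> Image.Image:
        """First path is the background; later paths are keyed and stacked on top."""
        base = Image.open(layer_paths[0]).convert("RGBA")
        for path in layer_paths[1:]:
            base.alpha_composite(key_out_green(path).resize(base.size))
        return base

    # composite(["background.png", "character.png", "prop.png"]).save("frame.png")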
https://pollinations.ai/p/a_donkey_holding_a_sign_with_flux_...
https://pollinations.ai/p/a_donkey_holding_a_sign_with_flux_...
https://pollinations.ai/p/Minimalist%20and%20conceptual%20ar...
It's incredible how fast it is. We generate 8000 images every 30 minutes for our users using only three L40S GPUs. Disclaimer: I'm behind Pollinations
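Taking those numbers at face value, the throughput works out to roughly this (just arithmetic on the figures stated above):

    # Just arithmetic on the stated numbers: 8000 images / 30 min on 3 GPUs.
    images, minutes, gpus = 8000, 30, 3
    per_second = images / (minutes * 60)
    print(f"{per_second:.1f} images/s overall, "
          f"{per_second / gpus:.2f} images/s per GPU, "
          f"{gpus * minutes * 60 / images:.2f} GPU-seconds per image")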
Crazy that not even a year has passed since Emad's downfall and a local, open-source, and superior model drops,
which just shows how little moat these companies have; they're just lighting cash on fire, which we benefit from.
> which just shows how little moat these companies have
Flux was developed by the same people that made Stable Diffusion.
Of all the models exhibiting this behaviour, has anyone published what the training data sources are? Like an honest list, not the PR boilerplate.
https://i.postimg.cc/vT6SV7pq/replicate-prediction-6ap8z1jv5...
https://i.postimg.cc/vZzMTM71/replicate-prediction-7r4b4p6sj...
https://i.postimg.cc/rs6wM5LJ/replicate-prediction-d8s4c93v5...
I DEMAND TO KNOW HOW RUN LOCAL SAAR
Don't know why all the critical comments about Flux are being downvoted or flagged; sure is weird.