The transformer layers perform self-attention between all pairs of patches, allowing the model to build a rich understanding of the relationships between regions of an image. Attention also spans the tokens of the conditioning prompt, which is why you can say “put a red cube over there” and the model can actually do it.
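As a rough sketch of the mechanism (my own toy code, not the actual model), single-head self-attention over patch embeddings looks like this; the shapes and projection matrices are illustrative:

```python
import numpy as np

def self_attention(patches, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of patch embeddings.

    patches: (n_patches, d) array; Wq/Wk/Wv: (d, d) projection matrices.
    Every patch attends to every other patch, so relationships between
    arbitrary image regions can be modeled in a single layer.
    """
    Q, K, V = patches @ Wq, patches @ Wk, patches @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over patches
    return weights @ V                                 # each patch: weighted mix of all patches

# To condition on a prompt, the prompt's token embeddings can simply be
# concatenated to `patches`, so attention also spans prompt tokens.
rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(8, d))   # 8 patch embeddings
out = self_attention(x, rng.normal(size=(d, d)),
                     rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape)  # (8, 16)
```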
I suspect that the smaller model versions will still do a great job of generating imagery but may not follow the prompt as closely; that’s just a hunch, though.
It’s a recipe for burnout, but I did get a lot done. It’s a good protocol for very tight, important deadlines, though it took its toll.
I'll be releasing more information soon, but since this article's content seems aligned with my mission (build the Young Lady's Illustrated Primer, or something better), I wanted to give it a mention.
If you'd like to beta test, or collaborate, please send me a note directly, explaining what skills you most want to learn -- luke (at) lukebechtel.com
Psychotherapy is effective, but only marginally better than placebo. The differences in efficacy between various modes of psychotherapy are statistically insignificant and undoubtedly clinically insignificant. People receiving psychoanalysis improve at basically the same rate as people receiving CBT or IPT or ACT or a raft of other interventions that loosely resemble psychotherapy. In the most basic sense, it doesn't matter whether Freud was a genius or a fraud; entertaining his theories, even if only to criticise them, is a distraction at best.
The bottom line is that people tend to feel a bit better when they talk about their problems with someone who is attentive, supportive and non-judgemental. That's a valuable insight, but it's inherently limited and it's never going to yield the kind of treatments that we want and need.
1. SSMs are a type of recurrent model that can scale linearly with sequence length, making them more efficient than Transformers. However, prior SSMs struggled with discrete sequence modeling tasks like language.
2. This paper augments SSMs with a "selection mechanism" that allows the model dynamics to depend on the input, giving it the ability to selectively remember or forget information. This makes SSMs effective on tasks requiring discrete reasoning.
3. They design an efficient parallel scan algorithm to implement selective SSMs on GPUs. Despite the recurrence, this achieves up to 5x higher throughput than Transformers in benchmarks.
4. They simplify prior SSM architectures into a new model called Mamba. On language modeling, Mamba matches or exceeds Transformers of 2-3x its size, while retaining linear scaling. It also achieves state-of-the-art results on audio, genomics, and synthetic tasks requiring long-term reasoning.
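To make the "selection mechanism" concrete, here is a toy single-channel selective SSM as a sequential reference (my own sketch; the paper uses a parallel scan and a richer parameterization, and all shapes here are illustrative):

```python
import numpy as np

def selective_ssm(x, A, wB, wC, w_dt):
    """Toy single-channel selective SSM (sequential reference version).

    x:      (L,) input sequence
    A:      (n,) diagonal state transition (negative entries for stability)
    wB, wC: (n,) vectors; B and C become input-dependent via x_t
    w_dt:   scalar controlling the input-dependent step size

    Unlike a classic SSM with fixed B, C, and step size, here they are
    functions of the input -- the "selection" that lets the model decide,
    per token, what to write into or read out of its state.
    """
    h = np.zeros_like(A)
    ys = []
    for xt in x:
        dt = np.log1p(np.exp(w_dt * xt))      # softplus: per-token step size
        B = wB * xt                            # input-dependent input matrix
        C = wC * xt                            # input-dependent output matrix
        h = np.exp(dt * A) * h + dt * B * xt   # discretized state update
        ys.append(float(C @ h))                # readout
    return np.array(ys)

# A large dt writes x_t strongly into the state; dt near 0 leaves the state
# almost untouched -- i.e. the model can selectively remember or forget.
rng = np.random.default_rng(0)
n = 4
y = selective_ssm(rng.normal(size=16), -np.abs(rng.normal(size=n)),
                  rng.normal(size=n), rng.normal(size=n), 0.5)
print(y.shape)  # (16,)
```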
This work makes SSMs truly competitive with Transformers through selectivity and efficient engineering. Mamba matches or beats Transformers on major modalities while being substantially more efficient in computation, memory, and scaling to long sequences. If replicated, it's arguably the first linear-time architecture with Transformer-quality performance!!
Can't wait to see the code!
If you look at how a single step of the DDIM sampler depends on the target timestep, it is actually just a linear function. That is obviously quite inflexible if we want a model that can jump to any target timestep in a single step. So we add the target timestep as an extra argument to the neural network and train it with a moment-matching objective.
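To see the linearity concretely, here is a deterministic DDIM step in my own notation (a sketch under the usual variance-preserving schedule, not the authors' code):

```python
import numpy as np

def ddim_step(x_t, t, s, eps_model, alpha, sigma):
    """One deterministic DDIM step from timestep t to target timestep s.

    alpha(t), sigma(t): the noise schedule; eps_model(x, t) predicts noise.
    Note the update is linear in x_t and the predicted noise: the target s
    enters only through the fixed coefficients alpha(s) and sigma(s).
    """
    eps = eps_model(x_t, t)
    x0_hat = (x_t - sigma(t) * eps) / alpha(t)   # predicted clean sample
    return alpha(s) * x0_hat + sigma(s) * eps    # re-noise to level s
```

The proposal is then to let the network also see the target, e.g. `eps_model(x, t, s)`, and train that with a moment-matching objective, so a single large step is no longer constrained to this linear form.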
In general, I feel that analyzing a method's inference-time properties before training it can be helpful not only for diffusion models but also for LLMs, including the various recent diffusion LLMs. That prompted me to write a position paper in the hope that others develop cool new ideas (https://arxiv.org/abs/2503.07154).
Also, regarding linearity: why is it inflexible? A simple linear interpolation seems quite convenient for reconstruction. Besides, even in DDIM, the direction towards the final target changes at each step as the images become less noisy. In standard diffusion models, and even in flow matching, the denoising update is always the predicted original data plus the direction from the current timestep to the target timestep t'. To be clear, it is intuitive that such models are inferior for few-step generation, since they don't optimise for test-time efficiency (the quality-vs-compute tradeoff), but it's unclear what inflexibility exists beyond this limitation.
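To spell out the update being discussed (my own sketch in flow-matching notation, where $x_t = (1 - t)\,x_0 + t\,\varepsilon$, not taken from the original exchange):

```latex
x_{t'} \;=\; \hat{x}_0(x_t, t) \;+\; \frac{t'}{t}\,\bigl(x_t - \hat{x}_0(x_t, t)\bigr)
```

Each step moves along the straight line between the predicted data $\hat{x}_0$ and the current sample $x_t$; the target $t'$ only rescales the coefficient, while the prediction itself is unchanged.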
Clearly there's no expected benefit in quality if all timesteps are used in denoising?