I have no idea how it actually works (at Google), but I wouldn't be surprised if it were just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created that Frankenstein just by post-training.
The big wow moment about that is that it sort of implies that most of the useful knowledge is in the FFN, and attention itself is not that unique/important.
BTW: It could also be interesting to try using already-trained attention and see how long the FFN itself takes in the GPT-2 speed-training run (it would be against the rules but still very interesting IMHO - definitely something I'd like to read a paper about)
https://github.com/KellerJordan/modded-nanogpt
Also, I read yesterday that at some point the embeddings across all of the models become (very) comparable/similar, and a simple converter can be trained between them. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attention.
Ever notice that attention is (with the highest respect to the original researchers) "just" inputting the entire past of the network into a reverse-MoE neural network? (meaning the expert is selecting parts of the input instead of parts of the neural network to execute)
In a way everyone knew this would work. Nobody did it because it's so inefficient that even R and Python users thought it would be ridiculously slow (or simply couldn't run it enough to train to a reasonable extent).
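For readers who want the "selecting parts of the input" analogy made concrete, here is a minimal scaled dot-product self-attention in plain NumPy (a toy sketch, not any particular framework's implementation); the softmax row is the soft selection over the input positions being described:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Each position scores every position of the input (the 'entire past'),
    softly selects the relevant parts, and returns their weighted mix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq) relevance scores
    weights = softmax(scores, axis=-1)        # soft selection over input positions
    return weights @ V

# toy example: 4 tokens, 8-dim embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```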
Attention is just a completely arbitrary way to split the network so that learning can be parallelized.
What contributed more towards success in my opinion are "shortcut connections" through layers which enable more influence on early layers during learning.
> What contributed more towards success in my opinion are "shortcut connections" through layers which enable more influence on early layers during learning.
For those who don't know, that is the idea behind ResNet (He et al., Deep Residual Learning for Image Recognition, https://arxiv.org/abs/1512.03385), one of the most influential papers in deep learning of all time.
Residual connections make it possible to train networks that are arbitrarily deep. Before ResNet, networks that were too deep were essentially not trainable due to vanishing or exploding gradients.
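To make the "shortcut" concrete, here is a minimal residual block in PyTorch (a sketch of the general pattern, not the exact block from the ResNet paper); the identity path in `x + self.f(x)` is what gives gradients a direct route back to early layers:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the shortcut (identity) path lets gradients flow straight
    to earlier layers, which is what makes very deep stacks trainable."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)   # the shortcut connection

# A deep stack of these remains trainable; drop the "x +" and it quickly stops being so.
net = nn.Sequential(*[ResidualBlock(64) for _ in range(50)])
print(net(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```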
> Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained
The relative unimportance of the exact SDPA attention in use in modern transformers is already known: https://arxiv.org/abs/2111.11418
The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc.
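That is the observation from the linked paper (MetaFormer/PoolFormer). A rough sketch of the idea, not the paper's exact layer: keep the norm/residual/FFN skeleton and swap attention for plain average pooling as the token mixer.

```python
import torch
import torch.nn as nn

class PoolMixerBlock(nn.Module):
    """Transformer-style block where token mixing is average pooling over
    neighbouring tokens instead of attention; norm, residuals and FFN unchanged."""
    def __init__(self, dim: int, pool_size: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mix = nn.AvgPool1d(pool_size, stride=1, padding=pool_size // 2,
                                count_include_pad=False)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                                   # x: (batch, seq, dim)
        h = self.norm1(x).transpose(1, 2)                   # pool over the sequence axis
        x = x + self.mix(h).transpose(1, 2)                 # residual around the token mixer
        x = x + self.ffn(self.norm2(x))                     # residual around the FFN
        return x

print(PoolMixerBlock(64)(torch.randn(2, 16, 64)).shape)     # torch.Size([2, 16, 64])
```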
I still feel like the best uses of models we've seen to date are for brand-new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving on large preexisting content over which someone has repeatedly iterated.
Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.
Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.
The trick to this is you've got to talk to them and share this information in the same way. I can give an example. These days my main workflow is as follows: if I have some big feature/refactor/whatever I'm going to work on I'll just start talking to o3 about it essentially as if it was a coworker and (somewhat painstakingly) paste in relevant source files it needs for context. We'll have a high-level discussion about what it is we're trying to build and how it relates to the existing code until I get the sense o3 has a clear and nuanced understanding (these discussions tend to sharpen my own understanding as well). Then, I'll ask o3 to generate an implementation plan that describes what needs to happen across the codebase in order for whatever it is to be realized. I'll then take that and hand it off to Codex, which might spend 10min executing shell commands to read source, edit files, test, etc. and then I've got a PR ready, which sometimes takes a bit more manual editing, and other times is perfectly ready to merge.
What you're saying is true RE them needing rich context, too—but this isn't a fundamental limitation, it's just an aspect of what it takes to work with them effectively. There's definitely a learning curve but once you've got it down it's not only very powerful but, for me anyway, a more enjoyable headspace to occupy than lots of lower level manual editing.
I would suggest trying the Continue.dev VSCode plugin for selective context injection. The plugin is Apache 2.0 licensed, and you can hook it up to any LLM API including local.
It has most of the same features as GitHub Copilot, but a few extra features I find essential. It can scrape documentation sites for individual libraries, which means you can do stuff like `@pandas @terminal @codebase Help me fix this error`.
For greenfield projects I will usually start out in a web-based chat interface, but the second I need to go back and forth between IDE and the web I switch over to the Continue.dev plugin.
Interesting approach, I'm definitely going to steal your wording for "generate an implementation plan that...".
I do something similar but entirely within Cursor:
1. create a `docs/feature_name_spec.md`, use voice-to-text to brain dump what I am trying to do
2. open up the AI chat panel in "Ask" mode while referencing that spec file, ask (paste) a boilerplate snippet like: "1) Ask clarifying questions about intent, domain, restrictions, ambiguity or missing details 2) Briefly identify any missing documents, data, or background information that would help you complete the task thoroughly"
3. move that list of questions into the spec doc and answer them there, attach the files it asked for and just rerun the above request (optionally, switching to a different model, like gemini-2.5-pro -> o3, for different perspective)
4. ask it to make an execution plan; at that point I have a fully spec'd-out feature and documented business logic, and I either use Edit mode on each step or Agent mode
That's for more complex features touching many files or refactors, but I essentially do a simplified version of that within the same chat by editing my original chat prompt until I'm confident I explained myself well
This is absolutely the best way to do it. However, it's also infeasible with the number-of-queries-based quotas most front-ends have. And of course, running models like o3 and 4-opus through the API is basically always way more expensive. Hence the desire for one-shotting stuff.
I find myself using a similar workflow with Aider. I'll use chat mode to plan, adjust context, enable edits, and let it go. I'll give it a broad objective and tell it to ask me questions until the requirements are clear, then a planning summary. Flipping the script is especially helpful when I'm unsure what I actually want.
I do the same thing, though sometimes I take one extra step to elaborate on the first implementation plan ‘in minute detail such that a weaker model could successfully implement it’, with deep research selected.
I am not certain that I agree with this. If there are alternative ways of solving a problem that were not taken, then these should be documented in comments. A mantra I try to tell myself and my colleagues is: if information exists in your brain and nowhere else, then write it down _somewhere_. If I tried 5 different libraries before settling on one, then I write in comments which libraries I tried but didn't work and why. If I used a particular tool to debug a race condition, then I put a link to a wiki page on how to use it in the comments. If we have one particular colleague who is an expert in some area, then I write their name in a comment. Basically, anything that is going to save future developers' time should be written down.
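For example, the kind of comment block I mean (an entirely made-up illustration; the libraries, tools, and team named here are placeholders for whatever applies in your codebase):

```python
# HTTP client: we settled on httpx after trying alternatives.
#   - requests: no async support, which the crawler needs
#   - aiohttp: worked, but its timeout semantics caused flaky retries in CI
# The connection-pool race was debugged with py-spy; see the wiki page
# "Debugging hung workers" for how to capture a dump.
# The payments team owns the retry-budget logic; talk to them before changing it.
import httpx

client = httpx.Client(timeout=10.0)
```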
That's not been my experience so far. LLMs are good at mimicking existing code; they don't usually bring in new things when not asked. Sometimes I have to go out of my way to point to other bits of code in the project to copy from, because the model hasn't ingested enough of the codebase.
That said, a negative prompt like we have in stable diffusion would still be very cool.
I'm in the camp of 'no good for existing'. I try to get ~1000-line files refactored to use different libraries, design paradigms, etc., and it usually outputs garbage - pulling db logic into the UI, grabbing unrelated api/function calls, or entirely corrupting the output.
I'm sure there is a way to correctly use this tool, so I'm feeling like I'm "just holding it wrong".
They could read the whole git history and have all issue-tracker tickets in the context, and maybe even recordings from meetings. It remains to be seen, though, whether such a large context will yield usable results.
I find most meetings I'm in nowadays are mostly noise; there's no clear "signal" that "this is the outcome", which I think is what an AI should be able to filter out.
Of course, it'd be even better if people communicated more clearly and succinctly.
A human working on an existing codebase does not have any special signal about what is _not_ in a codebase. Instead, a (good) human engineer can look at how a problem is handled and consider why it might have been done that way vs other options, then make an educated decision about whether that alternative would be an improvement. To me this seems like yet another piece of evidence that these models are not doing any "reasoning" or problem-solving.
If you make models fast enough, you can onboard that expert developer instantly and let them reason their way to a solution, especially when also giving them access to RAG.
Over time, I think models will add more memory and institutional-knowledge capture rather than starting from a blank slate each time.
I thought of that as I wrote my comment, but I think the infrastructure and glue to make that possible in a consistent, fast and scalable way is still a few years out.
But plenty of companies have already been doing this for a decade or more:
Having an old, shitty code base and not retaining the people who built it.
I have done that too, despite the creator sitting only 100 km away. The code was shit as hell: tons of copy-and-paste, different logic in different endpoints for logging in.
Finally it's worth it to have ADRs and similar things.
An LLM could easily use its own knowledge to create a list of things to check inside the code base, generate a fact sheet, and use best practices and similar knowledge to extend it.
Just because one query might not be able to do so doesn't mean there are no ways around it
I've been thinking about adding an agent to our Codex/Jules-like platform which goes through the git history for the main files being changed, extracts the Jira ticket IDs, and looks through them for additional context, along with analyzing the changes to other files in those commits.
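A rough sketch of that extraction step (assuming Jira-style keys such as PROJ-1234 appear in commit messages; the function name and file path are made up):

```python
import re
import subprocess

JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")   # matches keys like PROJ-1234

def jira_ids_for_file(path: str, max_commits: int = 200) -> list[str]:
    """Collect Jira ticket IDs mentioned in commits that touched `path`."""
    log = subprocess.run(
        ["git", "log", f"-n{max_commits}", "--pretty=%H %s %b", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted(set(JIRA_KEY.findall(log)))

# The IDs can then be fed to the issue tracker's API to pull ticket text as extra context.
print(jira_ids_for_file("src/billing/invoice.py"))
```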
...which is why top LLM providers' web apps like ChatGPT, Claude.ai, Gemini try to nudge you to connect with Google Drive, and where appropriate, GitHub Repos. They also allow the user/dev to provide feedback to revise the results.
All the training and interaction data will help make them formidable.
Is anyone else totally blown away by this? I feel like it's easily the biggest announcement out of I/O, though it's been overshadowed by Veo 3 etc.
Diffusion models for code generation are a big deal. If they are using transformers this would likely fall into the DiT bucket (diffusion transformers). I had previously worked on use cases that leveraged U-Net diffusion several years ago and there was quite a bit of interest in hybrid models. I expect to see further leaps in the diffusion space in the near future.
Can someone help with the intuition here? My understanding from vision transformers is you start with noise and use a series of hierarchical models to iteratively refine the noise into the target. Each layer is trained to produce images at an increasing resolution, and by layering them you skip the problem of sparse gradients at the beginning to get from “noise” to “noise that kinda looks like a face”.
How does this work for coding? It would require you to be able to hierarchically structure the emitted artifacts. Maybe this sort of works; low-granularity concepts like "use Django for this problem", then "I need these endpoints", then "emit the code". But AIUI diffusion doesn't have a mechanism for backtracking, so you can't feed signals from the detailed layers back to the "higher abstraction" layers at the top if you need to change an aspect of the design in response to a low-level problem.
Whereas with transformers, you go through the whole model for each token and can therefore deploy all your smarts and logic at each step of the problem (if needed), including backtracking on key design decisions.
I’m sure my mental model has some big gaps, would appreciate any insights.
Despite the name, diffusion LMs have little to do with image diffusion and are much closer to BERT and good old masked language modeling. Recall how BERT is trained:
1. Take a full sentence ("the cat sat on the mat")
2. Replace 15% of tokens with a [MASK] token ("the cat [MASK] on [MASK] mat")
3. Make the Transformer predict tokens at masked positions. It does it in parallel, via a single inference step.
Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop here. Let's train a model to recover texts with 30%, 50%, 90%, 100% of masked tokens.
Once you've trained that, in order to generate something from scratch, you start by feeding the model all [MASK]s. It will generate mostly gibberish, but you can take some tokens (let's say 10%) at random positions and assume that these tokens are generated ("final"). Next, you run another iteration of inference, this time with the input having 90% masks and 10% "final" tokens. Again, you mark 10% of the new tokens as final. Continue, and in 10 steps you'll have generated a whole sequence. This is the core idea behind diffusion language models.
Of course, there are some optimizations in the real world. If you need to generate a really long text (over 200 tokens), you'd better split it in chunks and fully generate the first chunk in parallel before moving to the next one. This semi-autoregressive generation is what Block Diffusion does.
You can be smart about exactly how you pick the tokens you consider generated, and what percentage. At earlier stages, when it's mostly noise, you can take more; at final stages you can do more iterations and take fewer tokens.
All in all, diffusion LMs are still iterative, but the number of steps is much lower than in autoregressive models. A nice thing is that you can choose how many steps you are going to take, trading quality for speed.
In the extreme, you can even generate just one leftmost masked token with a diffusion LM, effectively turning it into a traditional causal language model.
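The procedure described above, as a schematic sketch (`model` is a stand-in for any bidirectional Transformer returning per-position logits; a real implementation adds block-wise generation and a smarter unmasking schedule). Here tokens are finalized by confidence rather than purely at random, one of the tweaks mentioned above:

```python
import torch

@torch.no_grad()
def diffusion_generate(model, mask_id: int, length: int = 64, steps: int = 10):
    """Iterative unmasking: start from all [MASK], predict every masked position
    in parallel at each step, then lock in ('finalize') the most confident ones."""
    tokens = torch.full((1, length), mask_id)              # everything starts as [MASK]
    finalized = torch.zeros(1, length, dtype=torch.bool)

    for step in range(steps):
        logits = model(tokens)                             # one parallel forward pass
        conf, pred = logits.softmax(-1).max(-1)            # best guess + confidence per position
        conf[finalized] = -1.0                             # never re-pick finalized positions

        # spread the remaining positions evenly over the remaining steps
        n_new = (length - int(finalized.sum())) // (steps - step)
        pick = conf.topk(n_new, dim=-1).indices[0]         # most confident masked positions

        tokens[0, pick] = pred[0, pick]                    # write those tokens in
        finalized[0, pick] = True
    return tokens
```

Block Diffusion's semi-autoregressive variant, mentioned above, essentially runs this loop on one chunk at a time, conditioning on the chunks already generated.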
You could downscale text the same way you downscale images, by averaging token embeddings instead of pixel values. But you don't have to. AFAIK vision transformers don't suffer from sparse gradients that need a resolution hierarchy to overcome, downscaling is just a performance optimization, because processing an image at full resolution is expensive.
Because it’s simple to understand the power and difference in capability of Veo 3.
Understanding important steps forward in text completion requires understanding the value of what we have already and potential implications. Many people are not yet convinced LLMs are valuable for coding at all.
> Diffusion models for code generation are a big deal.
This is my intuition as well, as there are a lot of low-hanging fruits that a model like this could tackle in coding:
- you should be able to have a workflow where you constrain the generation with a function definition and its expected output, and "generate" the tokens in between. Kind of like constrained generation, but with the model being able to attend to tokens both ways (see the sketch after this list).
- you should also be able to use a 2 step workflow like first writing a high level description of the function layout (think "write the chapters for an article on x" from LLMs) and then ping-pong between the actual implementations ("and now write chapter x"), using larger and larger context, using proxies like linters, code compilation, AST derived info, etc. for signals of "completion". Lots of things to be tried here indeed.
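A sketch of how those two ideas could combine for code: keep a function signature and a test fixed on either side of a masked region, have the model propose bodies, and use compilation plus the test as the "completion" signal. `sample_infill` is a made-up stand-in for whatever infilling API such a model would expose:

```python
def accepts(candidate_body: str) -> bool:
    """Completion signal: the filled-in body must compile and pass the fixed test."""
    src = "def slugify(title: str) -> str:\n" + candidate_body
    try:
        scope: dict = {}
        exec(compile(src, "<candidate>", "exec"), scope)   # does it even compile/run?
        return scope["slugify"]("Hello, World!") == "hello-world"
    except Exception:
        return False

# Hypothetical loop: the diffusion model attends to both the fixed prefix
# (the signature) and the fixed suffix (the assert) while filling the middle.
# for body in sample_infill(prefix="def slugify(title: str) -> str:\n",
#                           suffix='assert slugify("Hello, World!") == "hello-world"',
#                           n_samples=8):
#     if accepts(body):
#         break
```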
That's kind of hard though, right? If we have a rule that only B can follow A, and the token at position 5 changes to an A, you will have a cascade of constraints to follow.
In principle one would imagine that models of this type would have an advantage -- you can use information from both the left and the right, etc. -- and in practice I've found LLaDA to be impressive considering its size and my assumption that it had small training resources. But they are behind in perplexity, and I think this is unavoidable. They also become rather fixed early, so I don't fully believe the hope of being able to really correct text deeply. They will of course be able to correct their partially completed texts to some degree, especially when it's just a word or two that are wrong, but I believe the wrong words basically need to get masked simultaneously, so on the order of 1/masking_probability^2 steps for two wrong words, 1/masking_probability^3 for three, and so on.
Despite this I've been happy with the practical results I've seen during my experimentation.
I think the lede is being buried. This is a great and fast InstructGPT. This is absolutely going to be used in spell checks, codemods, and code editors.
The instant-edits feature can surgically perform text edits fast, without all the extra fluff or unsolicited enhancements.
I copied some shadertoys, asked it to rename all the variables to be more descriptive, and pasted the result back to see it still working. I'm impressed.
No. Spell check frequently still gets things wrong if the word is spelled correctly and the sentence is grammatically correct but the wrong word was used.
It might sound unbelievable but if you write in multiple languages and mix languages in the same message or sentence, often spell check doesn't work properly. Which is only normal.
I regularly send messages in 4 different languages (living in a bilingual city + frequent use of English and lots of Spanish friends). Sometimes even using 3 languages in one sentence.
WhatsApp kind of improved it now in that you can "activate" two languages at the same time. Apart from that, I'm not sure there's much else that can be done.
It's not even that much of an edge case. Brussels is one of the most international cities in the world: street names exist in 2 languages, and a lot of slang and expressions get borrowed from other languages.
AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.
A claim I believe (or want to), but can you point to any papers about this? I haven't seen any papers at all, or demos, showing a revision step in text diffusion. I'd really like to use one though.
I have been wondering about the use of diffusion techniques for text generation, it is nice to see Google release a model that, seemingly, validates some thoughts I had.
Most folks I have seen experimenting with AI are either using a paid service or running high-grade hardware (even if consumer-level). The best I have in my current repertoire is a 5700XT, and I am not able to upgrade from that yet. The limitation, though, has at least given me some more significant insight into the shortcomings of current models.
Model sizes have gotten quite large and coherence seems to mostly have scaled with the density of a model, leaving the smaller models useful for only smaller tasks. Context size is also extremely important from my experiments with long-running dialogues and agent sessions, but a smaller GPU simply cannot fit a decent model and enough context at the same time. I do wonder if diffusion techniques will allow for a rebalancing of this density-to-coherence connection, letting smaller models produce chunks of coherent text even if limited by context. From my viewpoint it seems it will. Mixed tool call + response outputs also have the potential to be better.
Speed is another problem I, and everyone else, have had with modern LLMs. The nature of cycling around the input with a new additional output each time is time consuming. On an older GPU with no AI-specific hardware it is an eternity! Being able to at least track 0-100% progress would be an improvement over the current situation, where one must simply wait for the LLM to decide to stop (or hit the max number of inference tokens). I am hopeful that, even on lower-end GPUs, a diffusion model will perform slightly better.
This does raise several questions, though. If we are processing noise, where does the noise come from? Is there a good source of noise for LLMs/text specifically? Is the entire block sized beforehand, or is it possible to have variable-length responses?
I am so excited about diffusion language models. They may be the piece we need to make our voice-to-code game mechanic be as smooth as we envision it.
Cerebras and Groq are amazing, but the fact that they use custom hardware really limits the ability to finetune or scale. The other route would be an MoE that has barely 0.5b parameters active, but that would be a major undertaking that we can't prioritize at the moment.
---
If anyone at Google/Deepmind reads this, please give us API access.
We are building generative sandbox games. First title is a monster trainer where you get to actually command your creature in realtime, here is an early prototype: https://youtu.be/BOwpLyj2Yqw
This is super interesting and obviously someone would have tried diffusion for text. But I will ask the obvious question… how does it know how many words or even tokens to fill in, before it knows what the words will be? It would hamstring itself a lot of the time, can it edit the words later and create more space or is it kind of stuck with the token positioning as it would be with parts of an image? It seems very strange. Usually, words are composed in order like AR models do it, because they are using a recursive grammar, and this is especially true of computer languages. This is a bit like mad libs but madder libs. My question is, how could this possibly give better results than AR, it would need to perfectly converge on something with the right grammar context and the semantic meaning, while perfectly predicting early on the amount of tokens that would appear between words. Seems like there is some major impedance mismatch.
(The RWKV post-training write-up: https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...)
The attention/FFN discussion above was from here: https://news.ycombinator.com/item?id=44054425
Man, I've been writing software for money for decades now, but this fundamental truth never occurred to me, at least not consciously and with such clarity.
So, thank you!
I wonder if git history would be enough to cover this. At the very least, it has the alternatives that were tried and the code that was removed.
Until we give them access to all Jira tickets instead of just one so they know what's missing.
This is because it can edit and doesn’t suffer from early token bias.
> Gemini Diffusion’s external benchmark performance is comparable to much larger models, whilst also being faster.
That doesn't necessarily mean that they scale as well as autoregressive models.
# d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
https://dllm-reasoning.github.io/
Could you please clarify that?