I have no idea how it actually works (at Google), but I wouldn't be surprised if it were just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created that Frankenstein just by post-training.
The big wow moment about that is that it sort of implies that most of the useful knowledge is in the FFN, and attention itself is not that unique/important.
BTW: It could also be interesting to try using already-trained attention and see how long the FFN itself takes in the GPT-2 speed-training run (it would be against the rules but still very interesting IMHO - definitely something I'd like to read a paper about)
https://github.com/KellerJordan/modded-nanogpt
Also, I read yesterday that at some point the embeddings across all of the models become (very) comparable/similar, and a simple converter can be trained between them. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attention.
Ever notice that attention is (with the highest respect to the original researchers) "just" inputting the entire past of the network into a reverse-MoE neural network? (meaning the expert is selecting parts of the input instead of parts of the neural network to execute)
In a way everyone knew this would work. Nobody did it because it's so inefficient that even R and Python users thought it would be ridiculously slow (or simply couldn't run it enough to train to a reasonable extent).
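For readers who want the "selecting parts of the input" analogy made concrete, here is a minimal scaled dot-product self-attention in plain NumPy (a toy sketch, not any particular framework's implementation); the softmax row is the soft selection over the input positions being described:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Each position scores every position of the input (the 'entire past'),
    softly selects the relevant parts, and returns their weighted mix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq) relevance scores
    weights = softmax(scores, axis=-1)        # soft selection over input positions
    return weights @ V

# toy example: 4 tokens, 8-dim embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```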
Attention is just a completely arbitrary way to split the network so that learning can be parallelized.
What contributed more towards success in my opinion are "shortcut connections" through layers which enable more influence on early layers during learning.
> What contributed more towards success in my opinion are "shortcut connections" through layers which enable more influence on early layers during learning.
For those who don't know, that is the idea behind ResNet (He et al., Deep Residual Learning for Image Recognition, https://arxiv.org/abs/1512.03385), one of the most influential papers in deep learning of all time.
Residual connections make it possible to train networks that are arbitrarily deep. Before ResNet, networks that were too deep were essentially not trainable due to vanishing or exploding gradients.
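To make the "shortcut" concrete, here is a minimal residual block in PyTorch (a sketch of the general pattern, not the exact block from the ResNet paper); the identity path in `x + self.f(x)` is what gives gradients a direct route back to early layers:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the shortcut (identity) path lets gradients flow straight
    to earlier layers, which is what makes very deep stacks trainable."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)   # the shortcut connection

# A deep stack of these remains trainable; drop the "x +" and it quickly stops being so.
net = nn.Sequential(*[ResidualBlock(64) for _ in range(50)])
print(net(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```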
> Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained
The relative unimportance of the exact SDPA attention in use in modern transformers is already known: https://arxiv.org/abs/2111.11418
The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc.
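That is the observation from the linked paper (MetaFormer/PoolFormer). A rough sketch of the idea, not the paper's exact layer: keep the norm/residual/FFN skeleton and swap attention for plain average pooling as the token mixer.

```python
import torch
import torch.nn as nn

class PoolMixerBlock(nn.Module):
    """Transformer-style block where token mixing is average pooling over
    neighbouring tokens instead of attention; norm, residuals and FFN unchanged."""
    def __init__(self, dim: int, pool_size: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mix = nn.AvgPool1d(pool_size, stride=1, padding=pool_size // 2,
                                count_include_pad=False)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                                   # x: (batch, seq, dim)
        h = self.norm1(x).transpose(1, 2)                   # pool over the sequence axis
        x = x + self.mix(h).transpose(1, 2)                 # residual around the token mixer
        x = x + self.ffn(self.norm2(x))                     # residual around the FFN
        return x

print(PoolMixerBlock(64)(torch.randn(2, 16, 64)).shape)     # torch.Size([2, 16, 64])
```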
I still feel like the best uses of models we've seen to date are for brand-new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving on large preexisting content over which someone has repeatedly iterated.
Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.
Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.
The trick to this is you've got to talk to them and share this information in the same way. I can give an example. These days my main workflow is as follows: if I have some big feature/refactor/whatever I'm going to work on I'll just start talking to o3 about it essentially as if it was a coworker and (somewhat painstakingly) paste in relevant source files it needs for context. We'll have a high-level discussion about what it is we're trying to build and how it relates to the existing code until I get the sense o3 has a clear and nuanced understanding (these discussions tend to sharpen my own understanding as well). Then, I'll ask o3 to generate an implementation plan that describes what needs to happen across the codebase in order for whatever it is to be realized. I'll then take that and hand it off to Codex, which might spend 10min executing shell commands to read source, edit files, test, etc. and then I've got a PR ready, which sometimes takes a bit more manual editing, and other times is perfectly ready to merge.
What you're saying is true RE them needing rich context, too—but this isn't a fundamental limitation, it's just an aspect of what it takes to work with them effectively. There's definitely a learning curve but once you've got it down it's not only very powerful but, for me anyway, a more enjoyable headspace to occupy than lots of lower level manual editing.
I would suggest trying the Continue.dev VSCode plugin for selective context injection. The plugin is Apache 2.0 licensed, and you can hook it up to any LLM API including local.
It has most of the same features as GitHub Copilot, but a few extra features I find essential. It can scrape documentation sites for individual libraries, which means you can do stuff like `@pandas @terminal @codebase Help me fix this error`.
For greenfield projects I will usually start out in a web-based chat interface, but the second I need to go back and forth between IDE and the web I switch over to the Continue.dev plugin.
Interesting approach, I'm definitely going to steal your wording for "generate an implementation plan that...".
I do something similar but entirely within Cursor:
1. create a `docs/feature_name_spec.md`, use voice-to-text to brain dump what I am trying to do
2. open up the AI chat panel in "Ask" mode while referencing that spec file, ask (paste) a boilerplate snippet like: "1) Ask clarifying questions about intent, domain, restrictions, ambiguity or missing details 2) Briefly identify any missing documents, data, or background information that would help you complete the task thoroughly"
3. move that list of questions into the spec doc and answer them there, attach the files it asked for and just rerun the above request (optionally, switching to a different model, like gemini-2.5-pro -> o3, for different perspective)
4. ask it to make an execution plan; at that point I have a fully spec'd-out feature and documented business logic, and I either use Edit mode on each step or Agent mode
That's for more complex features touching many files or refactors, but I essentially do a simplified version of that within the same chat by editing my original chat prompt until I'm confident I explained myself well
This is absolutely the best way to do it. However, it's also infeasible with the number-of-queries-based quotas most front-ends have. And of course, running models like o3 and 4-opus through the API is basically always way more expensive. Hence the desire for one-shotting stuff.
I find myself using a similar workflow with Aider. I'll use chat mode to plan, adjust context, enable edits, and let it go. I'll give it a broad objective and tell it to ask me questions until the requirements are clear, then a planning summary. Flipping the script is especially helpful when I'm unsure what I actually want.
I do the same thing, though sometimes I take one extra step to elaborate on the first implementation plan ‘in minute detail such that a weaker model could successfully implement it’, with deep research selected.
I am not certain that I agree with this. If there are alternative ways of solving a problem that were not taken, then these should be documented in comments. A mantra I try to tell myself and my colleagues is: if information exists in your brain and nowhere else, then write it down _somewhere_. If I tried 5 different libraries before settling on one, then I write in comments which libraries I tried but didn't work and why. If I used a particular tool to debug a race condition, then I put a link to a wiki page on how to use it in the comments. If we have one particular colleague who is an expert in some area, then I write their name in a comment. Basically, anything that is going to save future developers' time should be written down.
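For example, the kind of comment block I mean (an entirely made-up illustration; the libraries, tools, and team named here are placeholders for whatever applies in your codebase):

```python
# HTTP client: we settled on httpx after trying alternatives.
#   - requests: no async support, which the crawler needs
#   - aiohttp: worked, but its timeout semantics caused flaky retries in CI
# The connection-pool race was debugged with py-spy; see the wiki page
# "Debugging hung workers" for how to capture a dump.
# The payments team owns the retry-budget logic; talk to them before changing it.
import httpx

client = httpx.Client(timeout=10.0)
```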
That's not been my experience so far. LLMs are good at mimicking existing code; they don't usually bring in new things when not asked. Sometimes I have to go out of my way to point to other bits of code in the project to copy from, because the model hasn't ingested enough of the codebase.
That said, a negative prompt like we have in stable diffusion would still be very cool.
I'm in the camp of 'no good for existing'. I try to get ~1000-line files refactored to use different libraries, design paradigms, etc., and it usually outputs garbage - pulling db logic into the UI, grabbing unrelated api/function calls, or entirely corrupting the output.
I'm sure there is a way to correctly use this tool, so I'm feeling like I'm "just holding it wrong".
They could read the whole git history and have all issue-tracker tickets in the context, and maybe even recordings from meetings. It remains to be seen, though, whether such a large context will yield usable results.
I find most meetings I'm in nowadays are mostly noise; there's no clear "signal" that "this is the outcome", which I think is what an AI should be able to filter out.
Of course, it'd be even better if people communicated more clearly and succinctly.
A human working on an existing codebase does not have any special signal about what is _not_ in a codebase. Instead, a (good) human engineer can look at how a problem is handled and consider why it might have been done that way vs other options, then make an educated decision about whether that alternative would be an improvement. To me this seems like yet another piece of evidence that these models are not doing any "reasoning" or problem-solving.
If you make models fast enough, you can onboard that expert developer instantly and let them reason their way to a solution, especially when also giving them access to RAG.
Over time, I think models will add more memory and institutional-knowledge capture rather than starting from a blank slate each time.
I thought of that as I wrote my comment, but I think the infrastructure and glue to make that possible in a consistent, fast and scalable way is still a few years out.
But plenty of companies have already been doing this for a decade or more:
Having an old, shitty code base and not retaining the people who built it.
I have done that too, despite the creator sitting only 100 km away. The code was shit as hell: tons of copy-and-paste, different logic in different endpoints for logging in.
Finally it's worth it to have ADRs and similar things.
An LLM could easily use its own knowledge to create a list of things to check inside the code base, generate a fact sheet, and use best practices and similar knowledge to extend it.
Just because one query might not be able to do so doesn't mean there are no ways around it
I've been thinking about adding an agent to our Codex/Jules-like platform which goes through the git history for the main files being changed, extracts the Jira ticket IDs, and looks through them for additional context, along with analyzing the changes to other files in those commits.
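A rough sketch of that extraction step (assuming Jira-style keys such as PROJ-1234 appear in commit messages; the function name and file path are made up):

```python
import re
import subprocess

JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")   # matches keys like PROJ-1234

def jira_ids_for_file(path: str, max_commits: int = 200) -> list[str]:
    """Collect Jira ticket IDs mentioned in commits that touched `path`."""
    log = subprocess.run(
        ["git", "log", f"-n{max_commits}", "--pretty=%H %s %b", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted(set(JIRA_KEY.findall(log)))

# The IDs can then be fed to the issue tracker's API to pull ticket text as extra context.
print(jira_ids_for_file("src/billing/invoice.py"))
```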
...which is why top LLM providers' web apps like ChatGPT, Claude.ai, Gemini try to nudge you to connect with Google Drive, and where appropriate, GitHub Repos. They also allow the user/dev to provide feedback to revise the results.
All the training and interaction data will help make them formidable.
Is anyone else totally blown away by this? I feel like it's easily the biggest announcement out of I/O, though it's been overshadowed by Veo 3 etc.
Diffusion models for code generation are a big deal. If they are using transformers this would likely fall into the DiT bucket (diffusion transformers). I had previously worked on use cases that leveraged U-Net diffusion several years ago and there was quite a bit of interest in hybrid models. I expect to see further leaps in the diffusion space in the near future.
Can someone help with the intuition here? My understanding from vision transformers is you start with noise and use a series of hierarchical models to iteratively refine the noise into the target. Each layer is trained to produce images at an increasing resolution, and by layering them you skip the problem of sparse gradients at the beginning to get from “noise” to “noise that kinda looks like a face”.
How does this work for coding? It would require you to be able to hierarchically structure the emitted artifacts. Maybe this sort of works; low-granularity concepts like "use Django for this problem", then "I need these endpoints", then "emit the code". But AIUI diffusion doesn't have a mechanism for backtracking, so you can't feed signals from the detailed layers back to the "higher abstraction" layers at the top if you need to change an aspect of the design in response to a low-level problem.
Whereas with transformers, you go through the whole model for each token and can therefore deploy all your smarts and logic at each step of the problem (if needed), including backtracking on key design decisions.
I’m sure my mental model has some big gaps, would appreciate any insights.
Despite the name, diffusion LMs have little to do with image diffusion and are much closer to BERT and good old masked language modeling. Recall how BERT is trained:
1. Take a full sentence ("the cat sat on the mat")
2. Replace 15% of tokens with a [MASK] token ("the cat [MASK] on [MASK] mat")
3. Make the Transformer predict tokens at masked positions. It does it in parallel, via a single inference step.
Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop here. Let's train a model to recover texts with 30%, 50%, 90%, 100% of masked tokens.
Once you've trained that, in order to generate something from scratch, you start by feeding the model all [MASK]s. It will generate mostly gibberish, but you can take some tokens (let's say 10%) at random positions and assume that these tokens are generated ("final"). Next, you run another iteration of inference, this time with the input having 90% masks and 10% "final" tokens. Again, you mark 10% of the new tokens as final. Continue, and in 10 steps you'll have generated a whole sequence. This is the core idea behind diffusion language models.
Of course, there are some optimizations in the real world. If you need to generate a really long text (over 200 tokens), you'd better split it in chunks and fully generate the first chunk in parallel before moving to the next one. This semi-autoregressive generation is what Block Diffusion does.
You can be smart about exactly how you pick the tokens you consider generated, and what percentage. At earlier stages, when it's mostly noise, you can take more; at final stages you can do more iterations and take fewer tokens.
All in all, diffusion LMs are still iterative, but the number of steps is much lower than in autoregressive models. A nice thing is that you can choose how many steps you are going to take, trading quality for speed.
In the extreme, you can even generate just one leftmost masked token with a diffusion LM, effectively turning it into a traditional causal language model.
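The procedure described above, as a schematic sketch (`model` is a stand-in for any bidirectional Transformer returning per-position logits; a real implementation adds block-wise generation and a smarter unmasking schedule). Here tokens are finalized by confidence rather than purely at random, one of the tweaks mentioned above:

```python
import torch

@torch.no_grad()
def diffusion_generate(model, mask_id: int, length: int = 64, steps: int = 10):
    """Iterative unmasking: start from all [MASK], predict every masked position
    in parallel at each step, then lock in ('finalize') the most confident ones."""
    tokens = torch.full((1, length), mask_id)              # everything starts as [MASK]
    finalized = torch.zeros(1, length, dtype=torch.bool)

    for step in range(steps):
        logits = model(tokens)                             # one parallel forward pass
        conf, pred = logits.softmax(-1).max(-1)            # best guess + confidence per position
        conf[finalized] = -1.0                             # never re-pick finalized positions

        # spread the remaining positions evenly over the remaining steps
        n_new = (length - int(finalized.sum())) // (steps - step)
        pick = conf.topk(n_new, dim=-1).indices[0]         # most confident masked positions

        tokens[0, pick] = pred[0, pick]                    # write those tokens in
        finalized[0, pick] = True
    return tokens
```

Block Diffusion's semi-autoregressive variant, mentioned above, essentially runs this loop on one chunk at a time, conditioning on the chunks already generated.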
You could downscale text the same way you downscale images, by averaging token embeddings instead of pixel values. But you don't have to. AFAIK vision transformers don't suffer from sparse gradients that need a resolution hierarchy to overcome, downscaling is just a performance optimization, because processing an image at full resolution is expensive.
Because it’s simple to understand the power and difference in capability of Veo 3.
Understanding important steps forward in text completion requires understanding the value of what we have already and potential implications. Many people are not yet convinced LLMs are valuable for coding at all.
> Diffusion models for code generation are a big deal.
This is my intuition as well, as there are a lot of low-hanging fruits that a model like this could tackle in coding:
- you should be able to have a workflow where you constrain the generation with a function definition and its expected output, and "generate" the tokens in between. Kind of like constrained generation, but with the model being able to attend to tokens both ways (see the sketch after this list).
- you should also be able to use a 2 step workflow like first writing a high level description of the function layout (think "write the chapters for an article on x" from LLMs) and then ping-pong between the actual implementations ("and now write chapter x"), using larger and larger context, using proxies like linters, code compilation, AST derived info, etc. for signals of "completion". Lots of things to be tried here indeed.
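A sketch of how those two ideas could combine for code: keep a function signature and a test fixed on either side of a masked region, have the model propose bodies, and use compilation plus the test as the "completion" signal. `sample_infill` is a made-up stand-in for whatever infilling API such a model would expose:

```python
def accepts(candidate_body: str) -> bool:
    """Completion signal: the filled-in body must compile and pass the fixed test."""
    src = "def slugify(title: str) -> str:\n" + candidate_body
    try:
        scope: dict = {}
        exec(compile(src, "<candidate>", "exec"), scope)   # does it even compile/run?
        return scope["slugify"]("Hello, World!") == "hello-world"
    except Exception:
        return False

# Hypothetical loop: the diffusion model attends to both the fixed prefix
# (the signature) and the fixed suffix (the assert) while filling the middle.
# for body in sample_infill(prefix="def slugify(title: str) -> str:\n",
#                           suffix='assert slugify("Hello, World!") == "hello-world"',
#                           n_samples=8):
#     if accepts(body):
#         break
```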
That's kind of hard though, right? If we have a rule that only B can follow A, and the token at position 5 changes to an A, you will have a cascade of constraints to follow.
In principle one would imagine that models of this type would have an advantage -- you can use information from both the left and the right, etc. -- and in practice I've found LLaDA to be impressive considering its size and my assumption that it had small training resources. But they are behind in perplexity, and I think this is unavoidable. They also become rather fixed early, so I don't fully believe the hope of being able to really correct text deeply. They will of course be able to correct their partially completed texts to some degree, especially when it's just a word or two that are wrong, but I believe the wrong words basically need to get masked simultaneously, so on the order of 1/masking_probability^2 steps for two wrong words, 1/masking_probability^3 for three, and so on.
Despite this I've been happy with the practical results I've seen during my experimentation.
I think the lede is being buried. This is a great and fast InstructGPT. This is absolutely going to be used in spell checks, codemods, and code editors.
The instant-edits feature can surgically perform text edits fast, without all the extra fluff or unsolicited enhancements.
I copied some shadertoys, asked it to rename all the variables to be more descriptive, and pasted the result back to see it still working. I'm impressed.
No. Spell check frequently still gets things wrong if the word is spelled correctly and the sentence is grammatically correct but the wrong word was used.
It might sound unbelievable but if you write in multiple languages and mix languages in the same message or sentence, often spell check doesn't work properly. Which is only normal.
I regularly send messages in 4 different languages (living in a bilingual city + frequent use of English and lots of Spanish friends). Sometimes even using 3 languages in one sentence.
WhatsApp kind of improved it now in that you can "activate" two languages at the same time. Apart from that, I'm not sure there's much else that can be done.
It's not even that much of an edge case. Brussels is one of the most international cities in the world: street names exist in 2 languages, and a lot of slang and expressions get borrowed from other languages.
AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.
A claim I believe (or want to), but can you point to any papers about this? I haven't seen any papers at all, or demos, showing a revision step in text diffusion. I'd really like to use one though.
I have been wondering about the use of diffusion techniques for text generation, it is nice to see Google release a model that, seemingly, validates some thoughts I had.
Most folks I have seen experimenting with AI are either using a paid service or running high-grade hardware (even if consumer-level). The best I have in my current repertoire is a 5700XT, and I am not able to upgrade from that yet. The limitation, though, has at least given me some more significant insight into the shortcomings of current models.
Model sizes have gotten quite large and coherence seems to mostly have scaled with the density of a model, leaving the smaller models useful for only smaller tasks. Context size is also extremely important from my experiments with long-running dialogues and agent sessions, but a smaller GPU simply cannot fit a decent model and enough context at the same time. I do wonder if diffusion techniques will allow for a rebalancing of this density-to-coherence connection, letting smaller models produce chunks of coherent text even if limited by context. From my viewpoint it seems it will. Mixed tool call + response outputs also have the potential to be better.
Speed is another problem I, and everyone else, have had with modern LLMs. The nature of cycling around the input with a new additional output each time is time consuming. On an older GPU with no AI-specific hardware it is an eternity! Being able to at least track 0-100% progress would be an improvement over the current situation, where one must simply wait for the LLM to decide to stop (or hit the max number of inference tokens). I am hopeful that, even on lower-end GPUs, a diffusion model will perform slightly better.
This does raise several questions, though. If we are processing noise, where does the noise come from? Is there a good source of noise for LLMs/text specifically? Is the entire block sized beforehand, or is it possible to have variable-length responses?
I am so excited about diffusion language models. They may be the piece we need to make our voice-to-code game mechanic be as smooth as we envision it.
Cerebras and Groq are amazing, but the fact that they use custom hardware really limits the ability to finetune or scale. The other route would be an MoE that has barely 0.5b parameters active, but that would be a major undertaking that we can't prioritize at the moment.
---
If anyone at Google/Deepmind reads this, please give us API access.
We are building generative sandbox games. First title is a monster trainer where you get to actually command your creature in realtime, here is an early prototype: https://youtu.be/BOwpLyj2Yqw
This is super interesting and obviously someone would have tried diffusion for text. But I will ask the obvious question… how does it know how many words or even tokens to fill in, before it knows what the words will be? It would hamstring itself a lot of the time, can it edit the words later and create more space or is it kind of stuck with the token positioning as it would be with parts of an image? It seems very strange. Usually, words are composed in order like AR models do it, because they are using a recursive grammar, and this is especially true of computer languages. This is a bit like mad libs but madder libs. My question is, how could this possibly give better results than AR, it would need to perfectly converge on something with the right grammar context and the semantic meaning, while perfectly predicting early on the amount of tokens that would appear between words. Seems like there is some major impedance mismatch.
(The RWKV post-training write-up: https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...)
The attention/FFN discussion above was from here: https://news.ycombinator.com/item?id=44054425
Man, I've been writing software for money for decades now, but this fundamental truth never occurred to me, at least not consciously and with such clarity.
So, thank you!
I wonder if git history would be enough to cover this. At the very least, it has the alternatives that were tried and the code that was removed.
Until we give them access to all Jira tickets instead of just one so they know what's missing.
This is because it can edit and doesn’t suffer from early token bias.
> Gemini Diffusion’s external benchmark performance is comparable to much larger models, whilst also being faster.
That doesn't necessarily mean that they scale as well as autoregressive models.
# d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
https://dllm-reasoning.github.io/
Could you please clarify that?