nathan-barry commented on Show HN: Tiny Diffusion – A character-level text diffusion model from scratch   github.com/nathan-barry/t... · Posted by u/nathan-barry
gdiamos · a month ago
One year later and there is still no inference engine for diffusion LLMs

Students looking for a project to break into AI - please!

nathan-barry · a month ago
Actually, NVIDIA made one earlier this year; check out their Fast-dLLM paper.
nathan-barry commented on Show HN: Tiny Diffusion – A character-level text diffusion model from scratch   github.com/nathan-barry/t... · Posted by u/nathan-barry
embedding-shape · a month ago
Fun project: easy to understand, with nice-looking results. Everything one could ask for! I played around with it locally, made some low-hanging-fruit optimizations without adding much complexity, and was going to send over a PR. But then I noticed there is no license attached to the project. What are your plans regarding licensing?
nathan-barry · a month ago
Hey, I’ll add the MIT license later today!
nathan-barry commented on Show HN: Tiny Diffusion – A character-level text diffusion model from scratch   github.com/nathan-barry/t... · Posted by u/nathan-barry
yugretcx · a month ago
Why do these text diffusion demos always look like the number of allowed tokens is fixed for a specific unfilled region?

Is this the case?

I.e., if the region only has four tokens (here, characters) but the model calculates that the best word is “forget”, does it just abandon the best fit or truncate it to fit?

Are there text diffusion models with lax infill directives?

nathan-barry · a month ago
Yes, this is the case. During training, the model gets a sequence of text (e.g., 512 tokens long) with a percentage of the tokens masked out (replaced with a special <MASK> token). It learns to unmask those tokens and reconstruct the original text.
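
For concreteness, the training-time corruption looks roughly like this (a minimal PyTorch sketch of the general idea, not the repo's actual code; MASK_ID and the 30% mask fraction are made up for illustration):

    import torch
    import torch.nn.functional as F

    MASK_ID = 0  # hypothetical id for the special <MASK> token

    def corrupt(tokens, mask_frac):
        # pick a random subset of positions and replace them with <MASK>
        is_masked = torch.rand(tokens.shape) < mask_frac
        corrupted = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)
        return corrupted, is_masked

    # toy batch: one 512-token sequence with 30% of positions masked
    seq = torch.randint(1, 100, (1, 512))
    corrupted, is_masked = corrupt(seq, mask_frac=0.3)

    # the loss is cross-entropy on the masked positions only:
    #   logits = model(corrupted)            # (1, 512, vocab_size)
    #   loss = F.cross_entropy(logits[is_masked], seq[is_masked])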

In the case you mentioned, if we had 4 <MASK> tokens in a row, decoding just predicts what those 4 tokens should be; the length of the region is fixed.
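
To make the constraint concrete (a toy character-level example, not from the repo):

    # a character-level prompt with a fixed-width masked region
    prompt = list("I will never ") + ["<MASK>"] * 4 + list(" you.")
    # exactly 4 characters can go in the gap: "miss" or "help" fit,
    # but a 6-character candidate like "forget" cannot be produced
    # here, no matter how probable the model finds it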

In practice this does not seem to be a significant problem, as there are usually multiple ways to express an idea in varying lengths. Also, confidence-aware parallel decoding helps: by committing to the highest-confidence tokens first, a well-trained model generally avoids painting itself into that corner.
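
Roughly, one step of that confidence-aware scheme looks like this (a sketch of the general technique, assuming batch size 1; decode_step and k are my names, not the project's):

    import torch

    @torch.no_grad()
    def decode_step(model, tokens, is_masked, k=8):
        # commit only the k masked positions the model is most confident
        # about; everything else stays <MASK> for a later step
        logits = model(tokens)                      # (1, T, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)              # per-position confidence
        conf = conf.masked_fill(~is_masked, -1.0)   # skip already-filled slots
        k = min(k, int(is_masked.sum()))
        top = conf[0].topk(k).indices               # most confident masked slots
        tokens, is_masked = tokens.clone(), is_masked.clone()
        tokens[0, top] = pred[0, top]
        is_masked[0, top] = False
        return tokens, is_masked

    # repeat until nothing is masked:
    #   while is_masked.any():
    #       tokens, is_masked = decode_step(model, tokens, is_masked)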
