nathan-barry commented on Show HN: Tiny Diffusion – A character-level text diffusion model from scratch   github.com/nathan-barry/t... · Posted by u/nathan-barry
gdiamos · a month ago
One year later and there is still no inference engine for diffusion LLMs

Students looking for a project to break into AI - please!

nathan-barry · a month ago
Actually, NVIDIA made one earlier this year; check out their Fast-dLLM paper.
nathan-barry commented on Show HN: Tiny Diffusion – A character-level text diffusion model from scratch   github.com/nathan-barry/t... · Posted by u/nathan-barry
embedding-shape · a month ago
Fun project: easy to understand, with nice-looking results. Everything one could ask for! I played around with it locally, made some low-hanging-fruit optimizations without adding much complexity, and was going to send over a PR. But then I noticed there is no license attached to the project. What are your plans regarding licensing?
nathan-barry · a month ago
Hey, I’ll add the MIT license later today!
nathan-barry commented on Show HN: Tiny Diffusion – A character-level text diffusion model from scratch   github.com/nathan-barry/t... · Posted by u/nathan-barry
yugretcx · a month ago
Why do these text diffusion demos always look like the number of allowed tokens is fixed for a specific unfilled region?

Is this the case?

I.e., if the region only has four tokens (here, characters) but the model calculates that the best word is “forget”, does it just abandon the best fit or truncate it to fit?

Are there text diffusion models with lax infill directives?

nathan-barry · a month ago
Yes, this is the case. During training, the model gets a sequence of text (e.g., 512 tokens long) with a percentage of the tokens masked out (replaced with a special <MASK> token). It learns to unmask those tokens and reconstruct the original text.
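
For concreteness, the training-time corruption looks roughly like this (a minimal PyTorch sketch of the general idea, not the repo's actual code; MASK_ID and the 30% mask fraction are made up for illustration):

    import torch
    import torch.nn.functional as F

    MASK_ID = 0  # hypothetical id for the special <MASK> token

    def corrupt(tokens, mask_frac):
        # pick a random subset of positions and replace them with <MASK>
        is_masked = torch.rand(tokens.shape) < mask_frac
        corrupted = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)
        return corrupted, is_masked

    # toy batch: one 512-token sequence with 30% of positions masked
    seq = torch.randint(1, 100, (1, 512))
    corrupted, is_masked = corrupt(seq, mask_frac=0.3)

    # the loss is cross-entropy on the masked positions only:
    #   logits = model(corrupted)            # (1, 512, vocab_size)
    #   loss = F.cross_entropy(logits[is_masked], seq[is_masked])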

In the case you mentioned, if we had 4 <MASK> tokens in a row, decoding just predicts what those 4 tokens should be; the length of the region is fixed.
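
To make the constraint concrete (a toy character-level example, not from the repo):

    # a character-level prompt with a fixed-width masked region
    prompt = list("I will never ") + ["<MASK>"] * 4 + list(" you.")
    # exactly 4 characters can go in the gap: "miss" or "help" fit,
    # but a 6-character candidate like "forget" cannot be produced
    # here, no matter how probable the model finds it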

In practice this does not seem to be a significant problem, as there are usually multiple ways to express an idea in varying lengths. Also, confidence-aware parallel decoding helps: by committing to the highest-confidence tokens first, a well-trained model generally avoids painting itself into that corner.
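
Roughly, one step of that confidence-aware scheme looks like this (a sketch of the general technique, assuming batch size 1; decode_step and k are my names, not the project's):

    import torch

    @torch.no_grad()
    def decode_step(model, tokens, is_masked, k=8):
        # commit only the k masked positions the model is most confident
        # about; everything else stays <MASK> for a later step
        logits = model(tokens)                      # (1, T, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)              # per-position confidence
        conf = conf.masked_fill(~is_masked, -1.0)   # skip already-filled slots
        k = min(k, int(is_masked.sum()))
        top = conf[0].topk(k).indices               # most confident masked slots
        tokens, is_masked = tokens.clone(), is_masked.clone()
        tokens[0, top] = pred[0, top]
        is_masked[0, top] = False
        return tokens, is_masked

    # repeat until nothing is masked:
    #   while is_masked.any():
    #       tokens, is_masked = decode_step(model, tokens, is_masked)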
