Is this the case?
I.e., if the region only has four tokens (here, characters) but the model calculates that the best word is “forget”, does it just abandon the best fit, or truncate it to fit?
Are there text diffusion models with lax infill directives?
In the case you mentioned, if we had 4 <MASK> tokens in a row, all decoding does is predict what those 4 tokens should be. There is one prediction per masked position, so a longer word is never proposed and then truncated.
Generally, this does not seem to be a significant problem, as there are usually multiple ways to express an idea in varying lengths. Also, confidence-aware parallel decoding usually sidesteps the scenario you mentioned: with a well-trained model, committing the highest-confidence tokens first means the remaining positions get filled in to match what is already fixed.
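To make that concrete, here is a rough sketch of confidence-first decoding over a fixed number of masked slots. Everything in it is made up for illustration: `WORDS` and `toy_predict` are a hypothetical stand-in for a trained model, `decode` and `per_step` are my own names, and tokens are single characters as in the question. It is not how any particular diffusion model is implemented.

```python
from collections import Counter

MASK = "_"
# Hypothetical toy "knowledge": a stand-in for what a trained model has learned.
WORDS = ["lose", "lost", "loss", "miss"]

def toy_predict(tokens):
    """Return {position: (char, confidence)} for every masked position."""
    # Only words of exactly the region's length that agree with committed chars.
    fits = [w for w in WORDS
            if len(w) == len(tokens)
            and all(t == MASK or t == c for t, c in zip(tokens, w))]
    preds = {}
    for i, t in enumerate(tokens):
        if t == MASK and fits:
            char, count = Counter(w[i] for w in fits).most_common(1)[0]
            preds[i] = (char, count / len(fits))
    return preds

def decode(tokens, per_step=1):
    """Commit the `per_step` most confident positions, then re-predict."""
    tokens = list(tokens)
    while MASK in tokens:
        preds = toy_predict(tokens)
        if not preds:
            break  # nothing fits; leave the remaining masks in place
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            tokens[i] = preds[i][0]
    return "".join(tokens)

result = decode("____")
```

The region stays four slots long the whole way through, so a six-character word like “forget” is never a candidate. Raising `per_step` commits more positions in parallel per step, at the cost of conditioning on fewer already-fixed tokens.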
Students looking for a project to break into AI - please!