The authors randomly permute (i.e., shuffle) input tokens in training and add two positional encodings to each token: one with the token's position and another with the position of the token to be predicted. Otherwise, the model is a standard autoregressive GPT. The consequences of this seemingly "simple" modification are significant (a toy sketch of the input construction follows the list):
* The authors can prompt the trained model with part of a sequence and then decode the missing tokens, all at once, in parallel, regardless of order -- i.e., the model can in-fill in parallel.
* The authors can compute conditional probability densities for every missing token in a sequence, again in parallel, i.e., densities for all missing tokens at once.
* The authors propose a rejection-sampling method for generating in-fill tokens, again in parallel. Their method seems to work well in practice.
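For concreteness, here's a minimal sketch of the input construction as I understand it (toy PyTorch; the names and dimensions are mine, not the authors'):

    import torch
    import torch.nn as nn

    vocab_size, d_model, max_len = 1000, 64, 128
    tok_emb = nn.Embedding(vocab_size, d_model)
    pos_emb = nn.Embedding(max_len, d_model)  # position of the input token itself
    tgt_emb = nn.Embedding(max_len, d_model)  # position of the token to be predicted

    def build_training_input(tokens):
        # tokens: (seq_len,) LongTensor; sample a fresh permutation per sequence
        order = torch.randperm(len(tokens))
        shuffled = tokens[order]
        next_pos = torch.roll(order, shifts=-1)  # step i predicts the token at order[i + 1]
        # Feed this through a standard causal GPT; the loss target at step i is
        # tokens[order[i + 1]] (the wrapped-around last step is dropped).
        return tok_emb(shuffled) + pos_emb(order) + tgt_emb(next_pos)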
I've added this to my reading list. Thank you for sharing it on HN.
This problem formulation has been around for a while; it's kind of the holy grail of modeling. What is new compared to PixelCNN and related work is this position-embedding idea.
Give the model the tokens "happily" and "I", and add to each input token its respective position embedding and the position embedding for the token to be predicted. You can do this in parallel for all tokens to be predicted. The model has been trained so it can predict tokens in any position.
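Roughly, in code (my reading of the mechanism, not the paper's exact implementation):

    import torch
    import torch.nn as nn

    d, max_len = 64, 16
    tok_emb = nn.Embedding(100, d)
    pos_emb = nn.Embedding(max_len, d)  # a token's own position
    tgt_emb = nn.Embedding(max_len, d)  # the position being predicted

    known_tok = torch.tensor([8, 3])    # toy ids for "I" and "happily"
    known_pos = torch.tensor([0, 9])    # their positions in the sentence
    missing = torch.tensor([p for p in range(10) if p not in (0, 9)])

    x = tok_emb(known_tok) + pos_emb(known_pos)    # (2, d) prompt
    x[0] = x[0] + tgt_emb(known_pos[1])            # "I" is followed, in prompt order, by position 9
    x = x.unsqueeze(0).repeat(len(missing), 1, 1)  # one copy of the prompt per missing position
    x[:, -1] = x[:, -1] + tgt_emb(missing)         # each copy asks for a different position
    # Running the causal model on this batch yields, at the last step of each
    # row, a distribution over the token at each missing position, in parallel.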
I know this is for tokens/text, but can the same concept be applied to images using something like a diffusion model? And then upscale images arbitrarily by infilling?
Yes. See the related work section in the paper: there is a long history of models, recently like MAE and MaskGit, which predict pixels in basically arbitrary orders, and that is useful because it lets you train on subsets of each image, upscale/infill during generation, and so on. (If you know what MAE is, that might be the fastest way to summarize OP: "it's a GPT trained like a MAE".)
Hey! I'm Arnaud, first author of the paper.
https://x.com/ArnaudPannatier/status/1799055129829839166
XLNet also shuffles the data during training, but it uses a masking mechanism instead of the causal + double positional encoding. The application also differs: XLNet is not, AFAIK, focused on generation (even if it can be used for that), and the burst-sampling idea is new.
hijacking for a bit of shameless self-promotion: if you're an Obsidian user, I recently built a plugin that simplifies web pages, parses out metadata, and saves them to Obsidian as markdown files: https://github.com/inhumantsar/slurp
arXiv comes through a bit ugly atm but it's high on my to-do list. I'm leveraging the library that Firefox uses for reader mode, so most sites come through quite well. A lot of my work right now is expanding their metadata support and fixing parser issues.
I use Emergent Mind[1] to keep track of new research published on arXiv. You can bookmark articles once logged in. It's very useful for keeping track of articles, reading quick summaries, and following conversations on various social media.
[1]: https://www.emergentmind.com/papers/2404.09562
Zotero is great for organizing and annotating papers, keeping notes, and building bibliographies.
You can create libraries and sub-libraries according to topic, and also create libraries for projects or reading lists. You can file items into multiple libraries, and you can also create shared libraries, allowing your team to share annotated papers.
Finally it can archive offline copies of web pages, which makes it useful for blog articles and other online resources that might vanish.
There's a learning curve, but it's worth it if you find yourself juggling dozens or hundreds of technical papers! Enjoy!
Wow, really cool concept! I wonder if this starts to show dynamics similar to what we see in image generation models, where structure/detail emerges in one region of the image and then the surrounding areas start to resolve into place. That kind of behavior seems particularly useful for longer reasoning/logic/planning, where the big ideas might become apparent first, and then the interstitial details and text just naturally fill in…
Yep yep I know, but I was trying to suggest something diffusion-like occurring with a language model through a totally separate mechanism that does not rely on the denoising process (at least not literally).
I kept thinking about this paper today, and I really like the capabilities.
A number of things that are relatively hard for sequential LLMs are easy here. Want JSON? Fix curly-brace tokens at the beginning and end.
Want a specific token-length explanation of an answer? Write a short answer, pin it at the end, and infill the gap in front of it.
Want a higher-density answer to something? Add a density-assessment section to your generated text (a space for the LLM to score info-density) and generate, looking for a high density score.
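A toy illustration of the template for the JSON case (`infill` is a hypothetical call into such a trained model, not a real API):

    # known: position -> token id; every other position gets generated in parallel.
    seq_len = 64
    lbrace, rbrace = 5, 6                      # made-up token ids for "{" and "}"
    known = {0: lbrace, seq_len - 1: rbrace}   # pin braces at both ends
    missing = [p for p in range(seq_len) if p not in known]
    # result = infill(known, missing)          # hypothetical model call

For the fixed-length trick, you'd pin the short answer's tokens at the tail positions instead.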
I would guess there's a lot here to be experimented with. It would be nice to put an 8B-parameter model with a reasonable number of tokens (3x, based on the paper, sadly) through it.
Regular LLMs can already do this, by prefilling the start of the assistant's response.
But there is actually something even better: you can constrain the LLM's output to a specific grammar (like JSON), so it'll only be able to answer with syntactically valid JSON.
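For flavor, constrained decoding usually just masks logits before sampling. Here's a toy version with a "balanced braces" pseudo-grammar (a real setup would use an incremental parser, e.g. llama.cpp's GBNF grammars):

    import torch

    vocab = ["{", "}", "x"]  # toy vocabulary

    def allowed_ids(prefix: str) -> list[int]:
        # Toy rule: "}" and content are only allowed inside an open brace.
        depth = prefix.count("{") - prefix.count("}")
        return [0, 1, 2] if depth > 0 else [0]

    def constrained_next(logits: torch.Tensor, prefix: str) -> int:
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_ids(prefix)] = 0.0  # keep only grammar-valid tokens
        return int(torch.distributions.Categorical(logits=logits + mask).sample())

    print(vocab[constrained_next(torch.zeros(3), "")])  # always "{": nothing else is valid at the start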
Yes. And you can have a grammar parser restrict sampling to the valid tokens in the distribution. But this feels much more sophisticated to me, especially if you can mix specific token-based grammar requirements with other instructions during the token-selection phase.
I wonder if this would help especially for computer code generation, where what is output at a given step may materially depend on what would be written at a later step.
And, though it might be prohibitively slow, perhaps integrate some kind of linting or syntax checking into the rejection sampling. I.e., burst-sample N potential snippets in parallel and reject those that are syntactically invalid.
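Something like this, where `burst_sample` stands in for a hypothetical N-way parallel generation call:

    import ast

    def sample_then_lint(burst_sample, n=8):
        # burst_sample(n) is hypothetical: n infilled Python snippets, in parallel
        keep = []
        for code in burst_sample(n):
            try:
                ast.parse(code)  # cheap syntax check as the rejection criterion
                keep.append(code)
            except SyntaxError:
                pass
        return keep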
It would be nice if it could diffuse right on the AST. That would ensure each generated item passes a syntax check, without the waste of rejection sampling.
This is an interesting study. A similar permutation approach already appears in the Taylorformer paper (https://arxiv.org/pdf/2305.19141v1). The authors use a Transformer decoder for continuous processes, like time series. During training, each sequence is shuffled randomly, and each sequence element has a positional encoding. Then they train with log-likelihood on the shuffled sequence.
There, the permutation helps with predictions for interpolation, extrapolation, and irregularly sampled data. They also show it helps with 'consistency', i.e., the MSE is roughly the same regardless of the generation order.
What might this paper add to our understanding or application of these ideas?
The idea of permuting the sequence order also appears in the Transformer Neural Process paper: https://arxiv.org/pdf/2207.04179.
Is this applying the learnings from vision transformers to language transformers?
If I understand correctly, vision models split an image into tiles and append a positional encoding to each so the model can understand the relative position of each tile.
I admittedly only read the abstract - a lot of this stuff goes over my head - but it seems like this paper proposes a similar idea, but for 1D instead of 2D?
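For reference, the tiling idea in code (standard ViT-style preprocessing, not specific to this paper):

    import torch
    import torch.nn as nn

    img = torch.randn(3, 224, 224)                             # C, H, W
    p = 16                                                     # tile (patch) size
    tiles = img.unfold(1, p, p).unfold(2, p, p)                # 3 x 14 x 14 x 16 x 16
    tiles = tiles.permute(1, 2, 0, 3, 4).reshape(14 * 14, -1)  # 196 flattened tiles
    proj = nn.Linear(3 * p * p, 64)                            # linear patch embedding
    pos = nn.Embedding(196, 64)                                # one position per tile
    x = proj(tiles) + pos.weight                               # transformer input: tile + its position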
Positional encoding is standard for transformers of all stripes. They introduce a seemingly novel, redundant positional-encoding scheme. It's more difficult to train, but it seems to enable producing multiple tokens at once (i.e., you could get an answer that is N tokens long in N/x steps instead of N steps).
It's "obvious" only in hindsight.
Let's say I give it as input the sentence:
I . . . . . . . . happily.
The second word to be predicted depends on the first word.
https://news.ycombinator.com/item?id=40609689