The authors randomly permute (i.e., shuffle) input tokens in training and add two positional encodings to each token: one with the token's position and another with the position of the token to be predicted. Otherwise, the model is a standard autoregressive GPT. The consequences of this seemingly "simple" modification are significant (a toy sketch of the input construction follows the list):
* The authors can prompt the trained model with part of a sequence and then decode the missing tokens, all at once, in parallel, regardless of order -- i.e., the model can in-fill in parallel.
* The authors can compute conditional probability densities for every missing token in a sequence, again in parallel, i.e., densities for all missing tokens at once.
* The authors propose a rejection-sampling method for generating in-fill tokens, again in parallel. Their method seems to work well in practice.
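For concreteness, here's a minimal sketch of the input construction as I understand it (toy PyTorch; the names and dimensions are mine, not the authors'):

    import torch
    import torch.nn as nn

    vocab_size, d_model, max_len = 1000, 64, 128
    tok_emb = nn.Embedding(vocab_size, d_model)
    pos_emb = nn.Embedding(max_len, d_model)  # position of the input token itself
    tgt_emb = nn.Embedding(max_len, d_model)  # position of the token to be predicted

    def build_training_input(tokens):
        # tokens: (seq_len,) LongTensor; sample a fresh permutation per sequence
        order = torch.randperm(len(tokens))
        shuffled = tokens[order]
        next_pos = torch.roll(order, shifts=-1)  # step i predicts the token at order[i + 1]
        # Feed this through a standard causal GPT; the loss target at step i is
        # tokens[order[i + 1]] (the wrapped-around last step is dropped).
        return tok_emb(shuffled) + pos_emb(order) + tgt_emb(next_pos)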
I've added this to my reading list. Thank you for sharing it on HN.
This problem formulation has been around for a while; it's kind of the holy grail of modeling. What is new compared to PixelCNN and related work is this position-embedding idea.
Give the model the tokens "happily" and "I", and add to each input token its respective position embedding and the position embedding for the token to be predicted. You can do this in parallel for all tokens to be predicted. The model has been trained so it can predict tokens in any position.
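Roughly, in code (my reading of the mechanism, not the paper's exact implementation):

    import torch
    import torch.nn as nn

    d, max_len = 64, 16
    tok_emb = nn.Embedding(100, d)
    pos_emb = nn.Embedding(max_len, d)  # a token's own position
    tgt_emb = nn.Embedding(max_len, d)  # the position being predicted

    known_tok = torch.tensor([8, 3])    # toy ids for "I" and "happily"
    known_pos = torch.tensor([0, 9])    # their positions in the sentence
    missing = torch.tensor([p for p in range(10) if p not in (0, 9)])

    x = tok_emb(known_tok) + pos_emb(known_pos)    # (2, d) prompt
    x[0] = x[0] + tgt_emb(known_pos[1])            # "I" is followed, in prompt order, by position 9
    x = x.unsqueeze(0).repeat(len(missing), 1, 1)  # one copy of the prompt per missing position
    x[:, -1] = x[:, -1] + tgt_emb(missing)         # each copy asks for a different position
    # Running the causal model on this batch yields, at the last step of each
    # row, a distribution over the token at each missing position, in parallel.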
I know this is for tokens/text, but can the same concept be applied to images using something like a diffusion model? And then upscale images arbitrarily by infilling?
Yes. See the related work section in the paper: there is a long history of models, recently like MAE and MaskGit, which predict pixels in basically arbitrary orders, and that is useful because it lets you train on subsets of each image, upscale/infill during generation, and so on. (If you know what MAE is, that might be the fastest way to summarize OP: "it's a GPT trained like a MAE".)
Hey! I'm Arnaud, first author of the paper.
https://x.com/ArnaudPannatier/status/1799055129829839166
XLNet also shuffles the data during training, but it uses a masking mechanism instead of the causal + double positional encoding. The application also differs: XLNet is not, AFAIK, focused on generation (even if it can be used for that), and the burst-sampling idea is new.
hijacking for a bit of shameless self-promotion: if you're an Obsidian user, I recently built a plugin that simplifies web pages, parses out metadata, and saves them to Obsidian as markdown files: https://github.com/inhumantsar/slurp
arXiv comes through a bit ugly atm but it's high on my to-do list. I'm leveraging the library that Firefox uses for reader mode, so most sites come through quite well. A lot of my work right now is expanding their metadata support and fixing parser issues.
I use Emergent Mind[1] to keep track of new research published on arXiv. You can bookmark articles once logged in. It's very useful for keeping track of articles, reading quick summaries, and following conversations on various social media.
[1]: https://www.emergentmind.com/papers/2404.09562
Zotero is great for organizing and annotating papers, keeping notes, and building bibliographies.
You can create libraries and sub-libraries according to topic, and also create libraries for projects or reading lists. You can file items into multiple libraries, and you can also create shared libraries, allowing your team to share annotated papers.
Finally it can archive offline copies of web pages, which makes it useful for blog articles and other online resources that might vanish.
There's a learning curve, but it's worth it if you find yourself juggling dozens or hundreds of technical papers! Enjoy!
Wow, really cool concept! I wonder if this starts to show dynamics similar to what we see in image generation models, where structure/detail emerges in one region of the image and then the surrounding areas start to resolve into place. That kind of behavior seems particularly useful for longer reasoning/logic/planning, where the big ideas might become apparent first, and then the interstitial details and text just naturally fill in…
Yep yep I know, but I was trying to suggest something diffusion-like occurring with a language model through a totally separate mechanism that does not rely on the denoising process (at least not literally).
I kept thinking about this paper today, and I really like the capabilities.
A number of things that are relatively hard for sequential LLMs are easy here. Want JSON? Fix curly-brace tokens at the beginning and end.
Want a specific token-length explanation of an answer? Write a short answer, pin it at the end, and infill the gap in front of it.
Want a higher-density answer to something? Add a density-assessment section to your generated text (a space for the LLM to score info-density) and generate, looking for a high density score.
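A toy illustration of the template for the JSON case (`infill` is a hypothetical call into such a trained model, not a real API):

    # known: position -> token id; every other position gets generated in parallel.
    seq_len = 64
    lbrace, rbrace = 5, 6                      # made-up token ids for "{" and "}"
    known = {0: lbrace, seq_len - 1: rbrace}   # pin braces at both ends
    missing = [p for p in range(seq_len) if p not in known]
    # result = infill(known, missing)          # hypothetical model call

For the fixed-length trick, you'd pin the short answer's tokens at the tail positions instead.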
I would guess there's a lot here to be experimented with. It would be nice to put an 8B-parameter model with a reasonable number of tokens (3x, based on the paper, sadly) through it.
Regular LLMs can already do this, by prefilling the start of the assistant's response.
But there is actually something even better: you can constrain the LLM's output to a specific grammar (like JSON), so it'll only be able to answer with syntactically valid JSON.
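For flavor, constrained decoding usually just masks logits before sampling. Here's a toy version with a "balanced braces" pseudo-grammar (a real setup would use an incremental parser, e.g. llama.cpp's GBNF grammars):

    import torch

    vocab = ["{", "}", "x"]  # toy vocabulary

    def allowed_ids(prefix: str) -> list[int]:
        # Toy rule: "}" and content are only allowed inside an open brace.
        depth = prefix.count("{") - prefix.count("}")
        return [0, 1, 2] if depth > 0 else [0]

    def constrained_next(logits: torch.Tensor, prefix: str) -> int:
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_ids(prefix)] = 0.0  # keep only grammar-valid tokens
        return int(torch.distributions.Categorical(logits=logits + mask).sample())

    print(vocab[constrained_next(torch.zeros(3), "")])  # always "{": nothing else is valid at the start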
Yes. And you can have a grammar parser restrict sampling to the valid tokens in the distribution. But this feels much more sophisticated to me, especially if you can mix specific token-based grammar requirements with other instructions during the token-selection phase.
I wonder if this would help especially for computer code generation, where what is output at a given step may materially depend on what would be written at a later step.
And, though it might be prohibitively slow, perhaps integrate some kind of linting or syntax checking into the rejection sampling. I.e., burst-sample N potential snippets in parallel and reject those that are syntactically invalid.
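Something like this, where `burst_sample` stands in for a hypothetical N-way parallel generation call:

    import ast

    def sample_then_lint(burst_sample, n=8):
        # burst_sample(n) is hypothetical: n infilled Python snippets, in parallel
        keep = []
        for code in burst_sample(n):
            try:
                ast.parse(code)  # cheap syntax check as the rejection criterion
                keep.append(code)
            except SyntaxError:
                pass
        return keep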
It would be nice if it could diffuse right on the AST. That would ensure each generated item passes a syntax check, without the waste of rejection sampling.
This is an interesting study. A similar permutation approach already appears in the Taylorformer paper (https://arxiv.org/pdf/2305.19141v1). The authors use a Transformer decoder for continuous processes, like time series. During training, each sequence is shuffled randomly, and each sequence element has a positional encoding. Then they train with log-likelihood on the shuffled sequence.
There, the permutation helps with predictions for interpolation, extrapolation, and irregularly sampled data. They also show it helps with 'consistency', i.e., the MSE is roughly the same regardless of the generation order.
What might this paper add to our understanding or application of these ideas?
The idea of permuting the sequence order also appears in the Transformer Neural Process paper: https://arxiv.org/pdf/2207.04179.
Is this applying the learnings from vision transformers to language transformers?
If I understand correctly, vision models split an image into tiles and append a positional encoding to each so the model can understand the relative position of each tile.
I admittedly only read the abstract - a lot of this stuff goes over my head - but it seems like this paper proposes a similar idea, but for 1D instead of 2D?
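For reference, the tiling idea in code (standard ViT-style preprocessing, not specific to this paper):

    import torch
    import torch.nn as nn

    img = torch.randn(3, 224, 224)                             # C, H, W
    p = 16                                                     # tile (patch) size
    tiles = img.unfold(1, p, p).unfold(2, p, p)                # 3 x 14 x 14 x 16 x 16
    tiles = tiles.permute(1, 2, 0, 3, 4).reshape(14 * 14, -1)  # 196 flattened tiles
    proj = nn.Linear(3 * p * p, 64)                            # linear patch embedding
    pos = nn.Embedding(196, 64)                                # one position per tile
    x = proj(tiles) + pos.weight                               # transformer input: tile + its position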
Positional encoding is standard for transformers of all stripes. They introduce a seemingly novel, redundant positional-encoding scheme. It's more difficult to train, but it seems to enable producing multiple tokens at once (i.e., you could get an answer that is N tokens long in N/x steps instead of N steps).
It's "obvious" only in hindsight.
Let's say I give it as input the sentence:
I . . . . . . . . happily.
The second word to be predicted depends on the first word.
https://news.ycombinator.com/item?id=40609689