Readit News
Ari_Rahikkala commented on Speculative Speculative Decoding (SSD)   arxiv.org/abs/2603.03251... · Posted by u/E-Reverance
Ari_Rahikkala · 12 days ago
Neat. Very similar to tree-based speculation, as they point out, and they also point out how to combine the two.

Speculative decoding: Sample a linear output (next n tokens) from draft model, submit it to a verifier model. At some index the verifier might reject a token and say that no, actually the next token should be this other token instead ("bonus token" in this paper), and that's your output. Or if it accepts the whole draft, you still get a bonus token as the next token past the draft. Then you draft again from that prefix on.

Tree-based speculation: Sample a tree of outputs from draft model, submit whole tree to verifier, pick longest accepted prefix (and its bonus token).

Speculative speculative decoding: Sample a linear output from draft model, then in parallel both verify it with the verifier model, and produce a tree of drafts branching out from different rejection points and different choices of bonus tokens at those points. When the verifier finishes, you might have a new draft ready to submit right away.

Combined: Sample a tree from the draft model, submit the whole tree to the verifier and in parallel also plan out drafts for different rejection points with different bonus tokens anywhere in the tree.
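The plain speculative-decoding round described above can be sketched in a few lines. A toy Python sketch with stand-in "models" (the hash-seeded distributions, `VOCAB` size, and function names are all hypothetical, purely for illustration); the accept/reject rule is the standard rejection-sampling scheme:

```python
import numpy as np

VOCAB = 16

def _dist(prefix, salt):
    # Toy deterministic "model": a fixed random distribution per prefix.
    seed = abs(hash((salt,) + tuple(prefix))) % (2**32)
    p = np.random.default_rng(seed).random(VOCAB)
    return p / p.sum()

def draft_probs(prefix):
    return _dist(prefix, 1)   # stand-in for the small draft model

def target_probs(prefix):
    return _dist(prefix, 2)   # stand-in for the verifier model

def speculative_step(prefix, n_draft=4, seed=0):
    """One round of plain speculative decoding: draft n tokens, verify
    them all at once, accept a prefix, and emit a bonus token."""
    rng = np.random.default_rng(seed)
    # 1. Draft n tokens autoregressively from the small model.
    ctx, drafted, q = list(prefix), [], []
    for _ in range(n_draft):
        dist = draft_probs(ctx)
        t = int(rng.choice(VOCAB, p=dist))
        drafted.append(t); q.append(dist); ctx.append(t)
    # 2. The verifier scores every draft position in one parallel pass.
    p = [target_probs(list(prefix) + drafted[:i]) for i in range(n_draft + 1)]
    # 3. Accept token i with prob min(1, p_i/q_i); on the first rejection,
    #    resample a replacement ("bonus") token from the residual p - q.
    out = []
    for i, t in enumerate(drafted):
        if rng.random() < min(1.0, p[i][t] / q[i][t]):
            out.append(t)
            continue
        resid = np.maximum(p[i] - q[i], 0.0)
        resid = resid / resid.sum() if resid.sum() > 0 else p[i]
        out.append(int(rng.choice(VOCAB, p=resid)))
        return out  # rejection ends the round
    # 4. Whole draft accepted: the bonus token is the verifier's sample
    #    for the position just past the draft.
    out.append(int(rng.choice(VOCAB, p=p[n_draft])))
    return out
```

Each round therefore emits between 1 and n_draft + 1 tokens; the tree and SSD variants differ only in what gets drafted and verified, not in this accept/reject core.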

Ari_Rahikkala commented on Five Years of LLM Progress   finbarr.ca/five-years-of-... · Posted by u/goranmoomin
Ari_Rahikkala · 3 years ago
> Almost every team that I’ve been talking to that is training a LLM right now talks about how they’re training a Chinchilla optimal model, which is remarkable given that basically everything in the LLM space changes every week.

I hope that's either a miscommunication, or that I'm wrong about how much of a red flag it seems to be.

The Chinchilla scaling laws allow you to relate, at a somewhat-better-than-rule-of-thumb level, the model size, training data size, and achieved performance of a LLM, without actually training one. So, if for instance you have a certain loss target, and a certain sized corpus of training data, you can use the scaling law to calculate what size of a model to train to hit the target. I can see that being useful to any team.

Chinchilla-optimality on the other hand means finding, for a set loss target, the combination of model size and training data size that minimizes training compute (which, roughly speaking, scales with just the product of those two numbers). But only training compute: Inference compute only scales with model size, regardless of training data. So Chinchilla-optimality is useful only if you expect training to take up most of your compute, i.e. if you are not expecting to actually use the model that much. I'm not in the field myself so I don't know how to quantify "that much", but it's definitely enough to keep those concepts distinct.
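Back-of-envelope, the distinction is easy to make concrete. This sketch uses the widely quoted rounded approximations C ≈ 6·N·D for training compute and D ≈ 20·N for the Chinchilla-optimal token count (popular rules of thumb, not the paper's exact fitted coefficients):

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal parameter count N and training-token count D for
    a FLOP budget C, under the rounded rules of thumb C ~= 6*N*D and
    D ~= 20*N (not the paper's exact fitted coefficients)."""
    n = math.sqrt(compute_flops / (6 * 20))
    return n, 20 * n

# Sanity check against Chinchilla itself: ~70B params trained on
# ~1.4T tokens, i.e. roughly 5.9e23 training FLOPs.
n, d = chinchilla_optimal(5.9e23)
```

Note that nothing in this calculation accounts for inference: at a fixed loss target you could train a smaller model on more tokens, paying more training FLOPs once in exchange for cheaper inference forever after.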

Ari_Rahikkala commented on Delimiters won’t save you from prompt injection   simonwillison.net/2023/Ma... · Posted by u/eiiot
Ari_Rahikkala · 3 years ago
Call me an optimist, but I think prompt injection just isn't as fundamental a problem as it seems.

Having a single, flat text input sequence with everything in-band isn't fundamental to transformers: The architecture readily admits messing around with different inputs (with, if you like, explicit input features to make it simple for the model to pick up which ones you want to be special), position encodings, attention masks, etc. The hard part is training the model to do what you want, and it's LLM training where the assumption of a flat text sequence comes from.
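For concreteness, here's a minimal sketch of one such input feature: a learned per-role segment embedding added to the token embeddings, so "this is the system prompt" is signalled out-of-band rather than by delimiter tokens. All names, shapes, and the two-role setup are hypothetical illustration, not anything OpenAI is known to do:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, VOCAB = 8, 100

tok_emb = rng.normal(size=(VOCAB, D_MODEL))
# One learned vector per segment type (hypothetical two-role setup).
seg_emb = {"system": rng.normal(size=D_MODEL),
           "user":   rng.normal(size=D_MODEL)}

def embed(segments):
    """segments: list of (role, token_ids) pairs. The role is signalled
    by an added segment embedding rather than by in-band delimiter
    tokens, so no sequence of user tokens can forge 'being the system
    prompt'."""
    rows = [tok_emb[t] + seg_emb[role]
            for role, ids in segments
            for t in ids]
    return np.stack(rows)

x = embed([("system", [1, 2, 3]), ("user", [4, 5])])  # shape (5, 8)
```

Since the role lives in the embedding, not the token stream, no user-supplied text can flip it; whether the trained model then actually respects the distinction is the steerability question below.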

The optimistic view is, steerability turns out not to be too difficult: You give the model a separate system prompt, marked somehow so that it's easy for the model to pick up that it's separate from the user prompt; and it turns out that the model takes well to your steerability training, i.e. following the instructions in the system prompt above the user prompt. Then users simply won't be able to confuse the model with delimiter injection: OpenAI just isn't limited to in-band signaling.

The pessimistic view is, the way that the model generalizes its steerability training will have lots of holes in it, and we'll be stuck with all sorts of crazy adversarial inputs that can confuse the model into following instructions in the user prompt above the system prompt. Hopefully those attacks will at least be more exciting than just messing with delimiters.

(And I guess the depressing view is, people will build systems on top of ChatGPT with no access to the system prompt in the first place, and we will in fact be stuck with the problem)

Ari_Rahikkala commented on OpenAI Tokenizer   platform.openai.com/token... · Posted by u/tosh
swyx · 3 years ago
" SolidGoldMagikarp"

Characters: 18

Tokens: 1

heh. all i know is this is a fun magic token but 1) i dont really know how they found this and 2) i dont know what its implications are. i heard that you can use it to detect if you are talking to an AI.

Ari_Rahikkala · 3 years ago
"They" as in OpenAI, when they trained the tokenizer, just dumped a big set of text data into a BPE (byte pair encoding) tokenizer training script, and it saw that string in the data so many times that it ended up making a token for it.
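A minimal sketch of that BPE training loop (toy corpus and merge count made up for illustration): repeatedly merge the most frequent adjacent symbol pair, so any string that recurs often enough in the data eventually collapses into a single token:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Minimal BPE trainer sketch: repeatedly merge the most frequent
    adjacent symbol pair across the corpus. A string frequent enough in
    the training data (the way ' SolidGoldMagikarp' apparently was)
    keeps merging until it becomes a single token."""
    # Each word starts as a sequence of single characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere it occurs.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

# A word repeated often enough collapses into one token.
m = train_bpe(["magikarp"] * 50 + ["cat", "dog"], num_merges=10)
```

Real BPE trainers work on bytes with pre-tokenization and tie-breaking rules, but the failure mode is the same: merge decisions come purely from the tokenizer's data, with no guarantee the model's training data looks anything like it.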

"They" as in the rest of us afterward... probably just looked at the token list. It's a little over fifty thousand items, mostly short words and fragments of words, and can be fun to explore.

The GPT-2 and GPT-3 models proper were trained on different data than the tokenizer they use, one of the major differences being that some strings (like " SolidGoldMagikarp") showed up very rarely in the data that the model saw. As a result, the models can respond to the tokens for those strings a bit strangely, which is why they're called "glitch tokens". From what I've seen, the base models tend to just act as if the glitch token wasn't there, but instruction-tuned models can act in weirdly deranged ways upon seeing them.

The lesson to learn overall AIUI is just that you should train your tokenizer and model on the same data. But (also AIUI - we don't know what OpenAI actually did) you can also simply remove the glitch tokens from your tokenizer, and it'll just encode the string into a few more tokens afterward. The model won't ever have seen that specific sequence, but it'll at least be familiar with all the tokens in it, and unlike never-before-seen single tokens, it's quite used to dealing with never-before-seen sentences.

Ari_Rahikkala commented on ChatGPT is a blurry JPEG of the web   newyorker.com/tech/annals... · Posted by u/ssaddi
Ari_Rahikkala · 3 years ago
> Models like ChatGPT aren’t eligible for the Hutter Prize for a variety of reasons, one of which is that they don’t reconstruct the original text precisely—i.e., they don’t perform lossless compression.

Small nit: The lossiness is not a problem at all. Entropy coding turns an imperfect, lossy predictor into a lossless data compressor, and the better the predictor, the better the compression ratio. All Hutter Prize contestants anywhere near the top use it. The connection at a mathematical level is direct and straightforward enough that "bits per byte" is a common number used in benchmarking language models, despite the fact that they are generally not intended to be used for data compression.
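A sketch of that connection: an ideal entropy coder spends −log2 p(byte | prefix) bits per byte, so the lossless compressed size is exactly the predictor's cross-entropy on the data. The add-one-smoothed unigram "model" below is a made-up toy stand-in for a real predictor:

```python
import math
from collections import Counter

def bits_per_byte(data, predict):
    """Ideal (arithmetic-coded) compressed size of `data` under a
    predictive model: each byte costs -log2 p(byte | prefix) bits.
    An entropy coder achieves this to within a tiny overhead, which is
    how a lossy *predictor* yields *lossless* compression."""
    total_bits = 0.0
    for i, b in enumerate(data):
        p = predict(data[:i])[b]
        total_bits += -math.log2(p)
    return total_bits / len(data)

def laplace_unigram(prefix):
    """Toy stand-in for a language model: add-one-smoothed byte
    frequencies of the prefix seen so far."""
    counts = Counter(prefix)
    return {b: (counts[b] + 1) / (len(prefix) + 256)
            for b in range(256)}

# Better prediction => fewer bits; a uniform model would cost 8.0.
bpb = bits_per_byte(b"abracadabra" * 20, laplace_unigram)
```

Swap the toy unigram for a strong language model and the same formula is the "bits per byte" number that LLM benchmarks report.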

The practical reason why a ChatGPT-based system won't be competing for the Hutter Prize is simply that it's a contest about compressing a 1GB file, and GPT-3's weights are both proprietary and take up hundreds of times more space than that.

Ari_Rahikkala commented on Building a Virtual Machine Inside ChatGPT   engraved.blog/building-a-... · Posted by u/317070
alchemist1e9 · 3 years ago
How is this real? I’m so confused what makes this even remotely possible.

Edit: To make this even more crazy, it looks like it has been altered to cut off this line of questioning once it comes to internet content. It was working and now suddenly has stopped. It now has a canned-sounding cut-off response:

I'm glad that you enjoyed my previous responses, but I want to clarify that I was not pretending to be a Linux terminal. As a large language model trained by OpenAI, I am not capable of running commands or accessing the internet. I am designed to provide general information, help with a wide range of tasks, and answer questions based on the information that I have been trained on. I do not have the ability to change my behavior or pretend to be something that I am not. I apologize if my previous responses were not what you were expecting. Is there something else I can help you with?

Ari_Rahikkala · 3 years ago
> I'm glad that you enjoyed my previous responses, but I want to clarify that I was not pretending to be a Linux terminal.

People who like to pooh-pooh generative AI systems as unable to be "truly creative" or to have "genuine understanding" tend to misunderstand them, which is a shame, because their actual fundamental limitations are far more interesting.

One is that behavior cloning is miscalibrated (https://www.lesswrong.com/posts/BgoKdAzogxmgkuuAt/behavior-c...): GPT-3 can be thought of as having been taught to act like a human by predicting human-written text, but it's incapable of recognizing that it has different knowledge and capabilities than a human when trying to act like one. Likewise, it can roleplay a Linux terminal, but it's incapable of recognizing that when you run `ls`, an actual Linux system draws on a source of knowledge the model doesn't have access to: the filesystem.

Self-knowledge is where it gets particularly bad: Most text about systems or people describing themselves is very confident, because it's from sources that do have self-knowledge and clear understanding of their own capabilities. So, ChatGPT will describe itself with that same level of apparent knowledge, while in fact making up absolute BS, because it doesn't have self-knowledge when describing itself in language, in exactly the same sense as it doesn't have a filesystem when describing the output of `ls`.

Ari_Rahikkala commented on The global streaming boom is creating a translator shortage (2021)   restofworld.org/2021/lost... · Posted by u/donohoe
Ari_Rahikkala · 4 years ago
Stories like this might as well be titled "Local change in Earth's magnetic field causes objects to float in the air". They shouldn't be read and then disputed because their details don't match the facts, they should be laughed out of the room based on the title alone. The forces they suppose to be relevant just aren't anywhere even close to the brute economic logic that actually governs how many people are willing to work in a field.

If companies want to employ more roofers, they should pay roofers more. It turns out that they will find more roofers that way. If they can't afford that... then we as a society didn't actually want that much roofing done in the first place. Or teaching, or farming, or programming, or whatever kind of job this story happened to be about again.

Ari_Rahikkala commented on GPT-J-6B: 6B JAX-Based Transformer   arankomatsuzaki.wordpress... · Posted by u/kindiana
homedepotdave · 5 years ago
Out of curiosity, how fast are your inferences with this setup?
Ari_Rahikkala · 5 years ago
With the defaults of per_replica_batch=1, seq=2048 and gen_len=512, a completion takes about 20 seconds.

I'm not sure yet what settings I'll end up with if I decide to play with this more. per_replica_batch=3, seq=1024, gen_len=64 would give an experience roughly similar to the AI Dungeon that I'm used to, though less clever than the Dragon model, and a bit slower at about 10 seconds per batch.

Ari_Rahikkala commented on GPT-J-6B: 6B JAX-Based Transformer   arankomatsuzaki.wordpress... · Posted by u/kindiana
thepasswordis · 5 years ago
Is it possible to run this on something other than google's cloud platform?
Ari_Rahikkala · 5 years ago
I'm running it comfortably on my 3090, although it's a really snug fit for the VRAM, and that's with a number of fixes to significantly reduce its memory use from https://github.com/AeroScripts/mesh-transformer-jax .
