FlyingLawnmower · a month ago
Sidenote, but the scholarship on distillation always makes me a bit sad. The original work, cited in the abstract of the Hinton, Vinyals, and Dean paper that everyone cites, was the model compression work of Buciluǎ, Caruana, and Niculescu-Mizil.

The distillation paper added minor parameter tweaks and had a fancier name, but the essence of the method came from Buciluǎ et al.'s model compression paper: https://dl.acm.org/doi/abs/10.1145/1150402.1150464

flukas88 · a month ago
It also makes OpenAI's moaning about companies stealing from them ring hollow, given that they scraped the whole internet for free.
tcldr · a month ago
Exactly. This is the argument I find lacking from today's discourse: AI companies are already extracting generations' worth of human intellectual data into their models. If they want to argue that this is 'fair use', then model distillation is, too. They can't have it both ways.
an0malous · a month ago
You can when the laws exist to serve the investor class instead of fairness and justice. There is a ludicrous amount of money in AI now, it has become a central initiative of the current administration and defense industry. The large AI companies will get whatever they want now.
GuB-42 · a month ago
It is complicated, and culture and legal systems will have to adapt.

But you can have it both ways. Often, the distinction between fair and unfair use is whether you are competing against the authors directly.

Take Ghibli memes, for instance. While obviously the result of training on Studio Ghibli content without permission, they don't compete against Studio Ghibli directly. Studio Ghibli doesn't draw memes, and ChatGPT doesn't make feature films or copy official artwork. I don't think Studio Ghibli lost anything to the memes; they are not in the same business. So it could be considered fair use.

Training an LLM on data from a legal research firm to make a search engine that directly competes with that firm's own search engine is not fair use, and there is legal precedent here (Thomson Reuters v. Ross Intelligence). Training your model on another model's output to compete against it would be the same kind of thing.

There is plenty of nuance, like how transformative the use is. But it is possible that extracting massive amounts of data is fair use while distillation is not. Plenty of people are at work on the question right now.

miki123211 · a month ago
OpenAI is transforming those works; DeepSeek is not.

OpenAI takes in code, books, and articles and produces a model. This model can be used for novel tasks, like paraphrasing your own writing, translating your text into a different language, or writing code according to a provided specification, even if nothing in the original corpus exactly solved your problem.

To produce this model, you need four ingredients: the data, the compute, research effort, and a lot of tedious RLHF work. While OpenAI uses the first one without compensating authors (and it has no other option here), the latter three it provides entirely on its own.

People distilling from OpenAI do not create transformative works. They take OpenAI's model and make a model of their own. Both models can do very similar things and are suitable for very similar purposes.

Distillation is just a particularly easy way of making an inexact copy of the model weights. The values of those weights will be very different, just as the values of each pixel in an illicit camera recording of a movie at a cinema are very different from those in the original version, but the net result is the same.

cma · a month ago
Not just that: o1 didn't even show its real chain of thought, yet OpenAI said DeepSeek distilled from them to make their reasoning model. Distilling what wasn't even there.
atmosx · a month ago
Funny how that works :-)

NitpickLawyer · a month ago
The article is pretty light on details, and misses (or I missed it if they mentioned it) an important distinction. There are two main types of distillation:

- completion-based methods, where you take a big model, give it some queries, and use the answers to post-train a smaller model. This is what DeepSeek did with the Qwen models: they took ~800k traces generated by R1 and ran SFT on smaller Qwen2.5 models. What the Sky team found in their experiments is that you can use as few as 1-2k traces to reach similar results. Much cheaper.

- logit/internal-representation-based methods, where you need access to the raw model, and for each query -> response pair you train the small model on the entire distribution of the teacher's logits at once (sketched below). This method suits model creators, who can take a pair of big + small models of the same architecture and "distill" the big one into the smaller one. This is likely how they train their -flash, -mini, -pico variants, and so on.

The first method can be used via API access. The second one can't. You need access to things that API providers won't give you.
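
A minimal sketch of the second, logit-based method, assuming PyTorch and teacher/student models that share a tokenizer and return HF-style `.logits`; all names are illustrative, not any lab's actual training code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-label distillation: KL divergence between temperature-softened
    # teacher and student distributions over the whole vocabulary.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * t ** 2

def train_step(student, teacher, batch, optimizer):
    # The teacher is frozen; only the student receives gradients.
    with torch.no_grad():
        teacher_logits = teacher(batch).logits  # assumes an HF-style output
    student_logits = student(batch).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The first, completion-based method needs none of this: it is plain SFT on teacher-generated text, which is why API access alone suffices.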

m12k · a month ago
From the article:

"Considering that the distillation requires access to the innards of the teacher model, it’s not possible for a third party to sneakily distill data from a closed-source model like OpenAI’s o1, as DeepSeek was thought to have done. That said, a student model could still learn quite a bit from a teacher model just through prompting the teacher with certain questions and using the answers to train its own models — an almost Socratic approach to distillation."

NitpickLawyer · a month ago
Right, my bad then; I read it in a hurry. They do mention the distinction.
dr_dshiv · a month ago
Like Phi: textbooks are all you need. You can create entirely synthetic yet high-quality training data with a strong model (the generated textbooks) and use it to train very small models like Phi.
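
A minimal sketch of that pipeline, assuming an OpenAI-compatible API; the model name, topics, and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

topics = ["binary search", "pH buffers", "supply and demand"]
corpus = []
for topic in topics:
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for whichever strong teacher you use
        messages=[{
            "role": "user",
            "content": f"Write a short, rigorous textbook section on {topic}, "
                       "with worked examples and exercises.",
        }],
    )
    corpus.append(resp.choices[0].message.content)

# `corpus` then becomes pretraining/SFT data for a much smaller model.
```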
pyman · a month ago
This is exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers.

https://malted.ai/deepseek-and-the-future-of-distillation/

While Anthropic and OpenAI are still trying to make sense of what China's top computer scientists pulled off a year ago, something that shook the core of Nvidia's business, China is now showcasing the world's first commercial "unhackable" cryptography system, using QKD and post-quantum cryptography to secure all phone calls between Beijing and Hefei.

sebau · a month ago
I wonder how a company like OpenAI can be stolen from/distilled via API without noticing, given the amount of data that is needed even for smaller models.
ben_w · a month ago
Stolen: there was some research a year or so ago showing that if you have access to the probability distribution for the next token, you can efficiently steal some layers of the model. When this work was published, OpenAI switched off direct access to those probabilities.

Distilled: Two years ago, one of the AI podcasts I was listening to (probably TWIML&AI) had someone use a big model to create a small high-quality training set for another model (as I understand it, this is what Microsoft's Phi series does, but that wasn't the example in whichever podcast I'm thinking of).

And remember, OpenAI's price for a million tokens is a rounding error for most businesses. Last year's reported revenue of USD 3.7 billion* suggests their customers collectively paid for on the order of a quadrillion tokens in and out, so even getting a trillion tokens out of them without them noticing what you're up to (so long as you paid) is very plausible.

* https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
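
A back-of-envelope check on that, with the blended price per million tokens as an assumed, purely illustrative figure:

```python
# How many tokens does USD 3.7B of revenue correspond to?
revenue_usd = 3.7e9           # reported 2024 revenue (see link above)
usd_per_million_tokens = 5.0  # assumed blended price; illustrative only
tokens = revenue_usd / usd_per_million_tokens * 1e6
print(f"{tokens:.1e} tokens")  # ~7.4e14, i.e. order of a quadrillion
# A distiller buying 1e12 tokens would be ~0.1% of that total volume.
```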

oblio · a month ago
Corporate espionage, or a distributed, concerted scraping effort. The latter would make OpenAI's user counts completely useless, but it doesn't sound impossible. If anyone could pull this off, it's some Chinese company.

pyman · a month ago
In 2024, DeepSeek's researchers used the DeepSeek-R1 model to transfer knowledge to a smaller model using distillation:

https://malted.ai/deepseek-and-the-future-of-distillation/

Honest question:

Isn't this exactly what the DeepSeek team did? And now Anthropic is repackaging it a year later, calling it "subliminal learning" and using the teacher-and-student analogy to take credit for work done by Chinese researchers.

It's like if China claimed they invented the Transformer by renaming it the “Pattern Matching architecture.”

Why is Anthropic doing this? Isn't this the same company that recently scraped 7 million books? And now they’re “transforming” research papers too?

rcxdude · a month ago
>and now Anthropic is repackaging it a year later, calling it “subliminal learning”

No. Distillation and student/teacher training are well-known techniques (much older than even the original ChatGPT), and Anthropic is not claiming to have invented them (that would be laughable to anyone familiar with the field). "Subliminal learning" is an observation by Anthropic about something surprising that can happen during the process: for sufficiently similar models, behaviour can be transferred from teacher to student even when it is not obviously present in the information passed between them, i.e. in the text output by the teacher and used to train the student. For example, the student's "favourite animal" changed even though the teacher was only producing "random" numbers for the student to try to predict.

pyman · a month ago
> something surprising that can happen during the process: for sufficiently similar models, behaviour can be transferred from teacher to student

By "behaviour" they mean data and pattern matching, right? Alan Turing figured that out in the 1940s.

LLMs aren't black boxes doing voodoo, whatever we like to tell politicians and regulators. They're just software processing massive amounts of data to find patterns and predict what comes next. It looks magical, but it's maths and stats, not magic.

This post is just selling second-hand ideas. And for those of us outside the US who spend all day reading scientific papers, sorry Anthropic, we're not buying it.

Icko_ · a month ago
Distillation and teacher-student models are definitely way older than 2024.
pyman · a month ago
My point is: OpenAI raised $40 billion and Anthropic raised $10 billion, claiming they needed the money to buy more expensive Nvidia servers to train bigger models. Then Chinese experts basically said, no, you don't. And they proved it.
funfunfunction · a month ago
There are even companies starting to offer distillation as a service https://inference.net/explore/model-training
jgalt212 · a month ago
Distillation was formerly the key to usable self-hosted models. However, the unceasing pressure to be "agentic" has made self-hosting once again untenable. Agentic tools just hoover up too many tokens.
ricardobeat · a month ago
If they use more tokens isn’t that a case in favor of self-hosting to reduce costs? Or are you saying performance is not good enough for local agents?
regularfry · a month ago
More tokens in the context means disproportionately more VRAM, to the extent that you really do need multiple GPUs if you're running an interestingly-sized model.
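
Rough numbers for the KV cache, which is the part that grows with context; the model shape below is an illustrative 70B-class guess, not any specific model:

```python
# Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
layers, kv_heads, head_dim = 80, 8, 128  # illustrative 70B-class shape (GQA)
bytes_per_elem = 2                       # fp16/bf16
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {ctx * per_token / 2**30:.1f} GiB of KV cache")
# ~2.5 GiB at 8k context, ~40 GiB at 128k, on top of the weights themselves.
```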
wizardforhire · a month ago
Obligatory [1]

My apologies for not being able to find the original tale. I'm sure the original website is still around, but this is a decent synopsis regardless.

Doesn’t look like they cover it in the article but if I remember correctly they pruned the model down to fit on 56k eprom that was able to be sold for originally $10 (also dating myself, this article claims $15)

And of course the jargon has changed with time. I guess we're saying "distilled" now; originally we said "pruned", because that's what you did: once you had your weights, you would prune the rest of the network to get the core model. I guess "distilled" works too, just less literal IMHO. And if we want to get really pedantic, networks exist in liquids too, but I digress.

[1] (apologies for the ad crap, best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...

DoctorOetker · a month ago
Pruning and distilling are two totally different things.

Pruning: discarding low-weight connections after training. This makes the network sparser but also less regular (complications for memory layout and for the compute kernels that access the sparse network weights).

Distilling: take a large pretrained model and train a smaller one from it. For example, consider a cloze task (fill in the blanked token in a sentence): compute the probabilities using the large model, then train the smaller model to reproduce the same probabilities.

Distilling is a form of fitting into a smaller, regular network of potentially totally different architecture, while pruning is a form of discarding low-weight coefficients, resulting in a sparser network.
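
A minimal sketch of the pruning side in PyTorch (a distillation loss is sketched further up the thread); the sparsity level is arbitrary:

```python
import torch

def magnitude_prune(model, sparsity=0.9):
    # Zero out the smallest-magnitude weights in each linear layer.
    # Tensor shapes are unchanged: this is exactly the irregular sparsity
    # mentioned above, and dense kernels get no speedup from it without
    # a sparse memory layout.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = max(1, int(w.numel() * sparsity))
            threshold = w.abs().flatten().kthvalue(k).values
            w[w.abs() <= threshold] = 0.0
```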

wizardforhire · a month ago
Thanks for taking the time to clarify for me.
meatmanek · a month ago
I'm surprised those things used neural networks. With a matrix of answer probabilities (trivially estimated from past players' answers), you can just choose the question that maximizes your expected information gain.
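
A sketch of that non-NN approach, assuming `P[q, o]` holds the probability that question `q` gets a "yes" for hidden object `o`, estimated from past games:

```python
import numpy as np

def best_question(P, prior):
    # P: (n_questions, n_objects) yes-probabilities; prior: (n_objects,) beliefs.
    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    H = entropy(prior)
    gains = []
    for q in range(P.shape[0]):
        p_yes = (P[q] * prior).sum()                  # marginal P(answer = yes)
        post_yes = P[q] * prior / max(p_yes, 1e-12)   # Bayes update on "yes"
        post_no = (1 - P[q]) * prior / max(1 - p_yes, 1e-12)
        expected_H = p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no)
        gains.append(H - expected_H)                  # expected information gain
    return int(np.argmax(gains))
```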
wizardforhire · a month ago
As I remember it, it was the breakout moment that made neural networks mainstream for the masses. Prior to that they were an academic/hacker oddity, relegated to works of fiction and just one of many competing approaches to functioning AI. After 20Q you could buy a handheld NN at Walmart. 20Q made it apparent to the scene that the limiting factor for more practical AI development was purely a scaling problem: complexity limited by compute power. A lot of conversations on /. and the like centered around when the threshold would be crossed. Most at the time could not have predicted, nor accepted, that Moore's law would fail, putting development back a decade.

To the credit of the naysayers at the time, Hotmail was still the primary free email service and Gmail had yet to come out. Google was buying up the dark fiber and had yet to open up its excess compute, starting the arms race for the cloud. Most still thought of GPUs only for graphics, even though their architecture and intent had been there since their inception at Thinking Machines…