logicprog · 20 days ago
Just reading the headline, I say good.

A) These models are trained by ignoring IP. It is hypocritical and absurd to then try to assert IP over them. And I am for the destruction of IP on all ends.

B) What this essentially means is that the Chinese labs are taking the work of these megacorporations and making it freely accessible to other labs and businesses to serve inference on, fine-tune, and host privately on-prem. That's clearly a good thing for competition in the market as a whole.

C) I don't see why we should have to duplicate the massive energy and infrastructure investment of building foundation models over and over forever just to preserve the IP rights of a few companies. That seems a shame. It seems better to me for everything to learn from everything else, for the whole ecosystem to get better by one-upping and building off each other; that's also why publishing research into the architecture and training of these models is so much better than what the proprietary labs do (keeping everything a secret), although tbf Anthropic's interpretability research is cool.

D) These Chinese models give 90% of the performance of frontier proprietary models at a tenth or a twentieth of the cost. That seems like a win for everyone. Not to mention that this distilling also allows them to make much smaller local models that everyone can run. This is a win for actual democratization, decentralization, and accessibility for the little guy.

spudlyo · 20 days ago
> And I am for the destruction of IP on all ends.

While I'm not unsympathetic to the plight of creatives and their need to eat, I feel the pendulum has swung so far toward the interests of copyright holders and away from the needs of the public that the bargain is no longer one I support. To the extent that AI is helping to expose the absurdity of this system, I'm all for it.

I don't think "burn it all down" is the answer, but I'd love to see the pendulum swing back our way.

paxys · 20 days ago
Because copyright laws rarely serve small independent creatives, but rather corporations like Disney that are in the business of hoarding and monetizing culture.
jsheard · 20 days ago
They're trying to kidnap what Anthropic has rightfully stolen!

Jokes and complete lack of sympathy aside, it does complicate the narrative that these small labs are always on the heels of the big labs for pennies on the dollar, if they rely on distilling the big labs' models. That means there still has to be big bucks coming from somewhere.

Imustaskforhelp · 20 days ago
I don't see Z.ai (GLM 5) in the list, though. I consider Qwen and Kimi to have a close relationship, so I can't be sure, but Qwen might be using Kimi data (I have written another comment on this in more depth).

I still prefer Kimi fwiw. It's one of the best open-source models I have witnessed, and when I tried GLM 5 it was really lacklustre for me on its launch day. But I'll have to compare the two for myself now, as I do see GLM 5 doing some good things in benchmarks, though we all know benchmarks should be trusted less.

I still think there is some hope in Chinese models even after this, i.e. they aren't completely dependent on the large proprietary models, judging by GLM 5.

I am seeing an accusation of GLM 5 doing distillation[0], but I am not seeing any hard evidence of it.

[0]: https://mtsoln.com/id/blog/wawasan-720/the-temu-fication-of-...

nashadelic · 19 days ago
Something about Bill Gates telling Steve Jobs he didn't steal from him; he stole from their rich neighbour, Xerox PARC.
ashertrockman · 20 days ago
A) The "IP" they're concerned about isn't the same IP you speak of. It's the investment in RL training / GPU hours that it takes to go from a base model to a usable frontier model.

B) I don't think the story is so clean. The distilled models often have regressions in important areas like safety and security (see, for example, NIST's evaluation of DeepSeek models). This might be why we don't see larger companies releasing their own tiny reasoning models so much. And copying isn't exactly healthy competition. Of course, I do find it useful as a researcher to experiment with small reasoning models -- but I do worry that the findings don't generalize well beyond that setting.

C) Maybe because we want lots of different perspectives on building models, lots of independent innovation. I think it's bad if every model is downstream of a couple "frontier" models. It's an issue of monoculture, like in cybersecurity more generally.

D) Is it really 90% of the performance, or are they just extremely targeted to benchmarks? I'd be cautious about running said local models for, e.g., my agent with access to the open web.

_aavaa_ · 20 days ago
> Maybe because we want lots of different perspectives on building models, lots of independent innovation.

That’s only really possible if the front runners don’t buy up all of the chips on the market.

maxglute · 19 days ago
> A) The "IP" they're concerned about isn't the same IP you speak of. It's the investment in RL training / GPU hours that it takes to go from a base model to a usable frontier model.

Investment/GPU hours are locked behind export controls, which Anthropic supports, since it keeps GPU prices low(er) sans PRC demand. Given that, why would PRC labs care about US IP laws? The high-level story is pretty clean: there's no healthy competition when US policy (supported by US labs) has been stacked to keep the PRC behind, and it's entirely reasonable from the PRC's perspective to circumvent it.

logicprog · 20 days ago
Fair points, and worth responding to for a more nuanced discussion! I hope you take these responses in that light :)

A) Well, sure, yes, the specific IP being distilled on is different from what was trained on. But I don't see why the same principles shouldn't apply to both. If companies ignore IP when training on material, then it should be okay for other companies to ignore IP when distilling on material; either IP is a thing we care about or it isn't. (I don't.)

B) I'm really not sure how seriously I take the worries about safety and security when RLing models. You can RL a model to refuse to hack something or make a bioweapon or whatever as much as you want, but ultimately, for one thing, the model won't be capable of helping a person who has no idea what they're doing do serious harm anyway. For another thing, the internet already exists for finding information on that stuff. And finally, people are always going to jailbreak the models anyway. I guess the only safety-related concern I have with models is sycophancy, and from what I've seen, there's no clear trend where closed frontier models are less sycophantic than open-source ones. In fact, quite the opposite, at least in the sense that the Kimi models are significantly less sycophantic than everyone else's.

C) This is a pretty fair point. I definitely think that having more base frontier models in the world, trained separately based on independent innovations, would be a good thing. I'm definitely in favor of having more perspectives.

But it seems to me that there is not really much chance for diversity in perspectives when it comes to training a base frontier model anyway because they're all already using the maximum amount of information available. So that set is going to be basically identical.

And as for distilling the RL behaviors and so on of the models, this distillation process is still just a part of what the Chinese labs do — they've also all got their own extensive pre-training and RL systems, and especially RL with different focuses and model personalities, and so on.

They've also got diverse architectures, and I suspect very different ones, in fact, from what's going on under the hood at the big frontier labs, considering, for instance, that we're seeing DSA and other hybrid attention systems make their way into the Chinese model mainstream, along with high variation in size, sparsity, and so on.

D) I find that for basically all the tasks I perform, the open models, especially since K2T and now K2.5, are more than sufficient, and I'd say the kind of agentic coding, research, and writing review I do is both very broad and pretty representative. So I'd say that for 90% of the tasks you would use an AI for, the difference between the large frontier models and the best open-weight models is indistinguishable, simply because both have saturated them; they're 90% equivalent even if they're not within 10% in capability on the very hardest tasks.

impulser_ · 20 days ago
It's greed: now that they have all the data and infrastructure, they're pulling up the ladder.

Why do you think not a single one of these labs has released an open-source model distilled from its own SOTA model?

They all preach that they want to provide AI to everyone; wouldn't this be the best way to do it? Use your SOTA model to produce a lesser but open-source model.

juleiie · 19 days ago
Well, if you think about AI as another Manhattan Project, this is clearly not that good. Or maybe it is? Maybe it will zero-sum into some kind of mutually assured destruction of AI superiority. Maybe anything else would be too tempting for the nation with access to superintelligence not to instantly subjugate the opponent that doesn't possess an equivalent?

This is a bit overdramatized. Let’s just think of it as strategic, tactical, and planning superalgorithms. A nation without access to them ends up somewhere between severely disadvantaged and completely defenseless. Is it in the interest of the world to preserve the existing status quo and balance of power?

Or let’s say we put China severely behind in this technology. Now they cannot defend themselves from a potential swift US nuclear strike so precise that the chance of retaliation is zero. Is this a net benefit to the world?

lofaszvanitt · 18 days ago
Free Open AI for everyone. Well, coding was mostly trained on FOSS projects.
hereme888 · 19 days ago
I would not hire someone with your sense of judgment and moral compass.
logicprog · 19 days ago
Well, it's a good thing I'm not looking for a job from you then
impulser_ · 20 days ago
Why would anyone care about this at all?

MiniMax, DeepSeek, and Moonshot are all releasing models for the public to use for free.

Anthropic, OpenAI, Google, etc. have been scraping information they had no right to scrape in order to train their models, yet when these companies pay them to scrape data, we're supposed to be worried?

Labs like Anthropic always preach that they're trying to build AI for everyone while releasing expensive, closed-source models.

The only reason AI is affordable at all is because of these Chinese AI labs.

lumost · 20 days ago
Also: how can this be prevented? The AI labs can't seriously expect that each lab will filter LLM-generated content from their training sets based on the source model. Leakage of AI behavior into public datasets is inevitable.
reactordev · 20 days ago
Turn the lens the other way around. By publicly posting that these models violate IP and anyone can run them, they are painting a specific political picture here…
NitpickLawyer · 20 days ago
> Why would anyone care about this at all?

Anthropic have been the loudest in pushing for regulatory capture, often citing "muh security" as FUD. People should care what they write on this topic, because they're not writing for us, they're writing for "the regulators". Member when the usgov placed a dude in solitary confinement because they thought he could launch nukes with a whistle? Yeah... Let's hope they don't do some cray cray stuff with open LLMs.

Anthropic make amazing coding models, kudos for that. But they should be mocked for any communication like the one linked. Boo-hoo. Deal with it, or don't, I don't care. No one will feel for you. What goes around, comes around. Etc.

bigyabai · 20 days ago
Administratively, Anthropic seems to misunderstand politics. You don't get to wear the "people's champion" and "government sweetheart" hats at the same time; when push comes to shove, you'll be forced to pick a lane. We saw it with Microsoft, we saw it with Apple and Google, and now we're seeing it with OpenAI too. You can't drive down both paths at the same time.

As a member of the target audience for Claude, their messaging just leaves me confused. Are you a renegade success, or do you need the government's help? Are you a populist juggernaut, or do you hide from competition? OpenAI, for all their myriad issues, understood this from the start and stuck to the blithely profitable federal ass-kisser route.

nashadelic · 19 days ago
Why? Imagine the frontier labs lobbying for a ban on commercial use of "Chinese AI" in America.
PlatoIsADisease · 20 days ago
Go free stuff! But... no one is running 400B models on their computers.

You are just giving them data instead. It's not like China is known to protect IP. Your data is going to be used against you, and we can't use Western laws to keep it safe.

SlavikCA · 20 days ago
So, only Americans can use data against others?

By the way, I'm running a 400B model on my computer with 72GB VRAM: Qwen3.5-397B-A17B-GGUF/UD-Q4_K_XL, getting 13 t/s. Subjectively, I feel it runs at the level of Anthropic's Claude, just slower.
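
In case anyone wants to try something similar, here's roughly what that looks like with the llama-cpp-python bindings. This is a sketch, not my exact setup: the filename and the n_gpu_layers value are illustrative, since a Q4 quant this size only partially fits in 72GB of VRAM and the rest runs on CPU.

    # Minimal sketch: serving a large GGUF quant with partial GPU offload.
    # model_path and n_gpu_layers are illustrative; tune offload to your VRAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3.5-397B-A17B-UD-Q4_K_XL.gguf",  # hypothetical local path
        n_gpu_layers=24,  # offload as many layers as fit in VRAM; rest on CPU
        n_ctx=8192,       # context window
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain MoE sparsity in one paragraph."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])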

bigyabai · 20 days ago
> we cant use western laws to keep it safe

Western laws didn't stop OpenAI from leaking PII, or Nest from getting hacked. I'll take my chances with the CCP.

selfhoster11 · 20 days ago
It doesn't take much hardware. I have run larger models.
LZ_Khan · 20 days ago
If you care about improvement of models, you would support the US labs here.

It costs hundreds of millions of dollars to train a frontier model. It's not just "scraping the web."

Distillation allows labs to replicate these results at 1/100th of the cost. This creates a prisoner's dilemma which incentivizes labs to withhold their models from the public.

ElevenLathe · 20 days ago
How much did it cost to produce all the data on the internet and every book ever published? Surely even the most conservative calculations put it at multiple years of planetary GDP. The same argument can be made to say that letting the big labs get away with pirating it will disincentivize people from publishing anything.
bigyabai · 20 days ago
This reads a bit like over-moralizing to me. US labs will continue improving their models because they have to make money in a competitive market. Chinese distillations have arguably improved the status quo, with Qwen and R1 forcing GPT-OSS to be released to the public. American businesses are competing, and American customers are getting better products because of the competitive pressure on them.

Your purported "prisoner's dilemma" hasn't happened yet to my knowledge; instead, we seem to see the opposite. The high-speed development velocity has forced US labs to release more often with less nebulous results. Supporting either side will contribute to healthier competition in the long run.

contravariant · 20 days ago
If 'we' really cared about the improvement of models all of them would be public.

Anything else just proves someone prefers making money to improving the models.

falcor84 · 20 days ago
> incentivizes labs to withhold their models from the public.

Does it really? How would they get revenue if they withhold their models? And doesn't economics generally say that if it's easier for your competitor to catch up, you have a higher incentive to maintain your lead?

falcor84 · 20 days ago
I think that the bigger conversation to be had here is about the environmental damage - if by using distillation we can really train new models at 1% of the cost in energy, it is ethically imperative that we do this.
wpm · 20 days ago
> If you care about improvement of models, you would support the US labs here.

I guess I don't care then.

hermanzegerman · 19 days ago
Tell me: how did they obtain that data?

Nobody feels sorry for big multinationals that try to skirt copyright for their own gain but then cry about it when their competition ignores it too.

You can't have your cake and eat it

YetAnotherNick · 20 days ago
> incentivizes labs to withhold their models from the public.

This is the only way they make money.

paxys · 20 days ago
It's crazy for their official account to post this when Anthropic itself is fighting multiple high-profile lawsuits over its unauthorized use of proprietary content to train its models. Did no one run this by legal?
notatoad · 20 days ago
I don't see them making any claim that unauthorized use of their proprietary content is illegal.

I read this more as a move to protect their brand and valuation. They just want us all to know that we shouldn't be too impressed by DeepSeek, because DeepSeek is training off Claude.

Also, I think this blog post should be read in the context of Anthropic execs meeting with Pete Hegseth today. This isn't legal, it's political; they're playing up the national security aspects for some political benefit.

paxys · 20 days ago
> industrial-scale distillation attacks

> fraudulent accounts

These are not terms used to describe regular users of your software.

This tweet is 100% going to show up in court (whether in the current crop of cases or future ones) as an example of Anthropic accepting that copyright infringement and unauthorized use hurts their business as an IP holder.

ndiddy · 20 days ago
Yeah, the blog post isn't saying distillation is illegal; it's saying that it should be illegal:

> These [distillation] campaigns are growing in intensity and sophistication. The window to act is narrow, and the threat extends beyond any single company or region. Addressing it will require rapid, coordinated action among industry players, policymakers, and the global AI community.

> Illicitly distilled models lack necessary safeguards, creating significant national security risks. Anthropic and other US companies build systems that prevent state and non-state actors from using AI to, for example, develop bioweapons or carry out malicious cyber activities. Models built through illicit distillation are unlikely to retain those safeguards, meaning that dangerous capabilities can proliferate with many protections stripped out entirely.

> Anthropic has consistently supported export controls to help maintain America’s lead in AI. Distillation attacks undermine those controls by allowing foreign labs, including those subject to the control of the Chinese Communist Party, to close the competitive advantage that export controls are designed to preserve through other means.

B1FF_PSUVM · 20 days ago
I'm curious about the "created over 24,000 fraudulent accounts". They didn't pay?
lostmsu · 20 days ago
They violated TOS
anonnon · 19 days ago
It shows you how sociopathically shameless they are.
cs702 · 20 days ago
It's been known for a long while that one model's outputs can serve as training data for another model to copy the original's behavior, a technique known as distillation.
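
For anyone unfamiliar, here's roughly what API-level (black-box) distillation boils down to: you never see the teacher's logits, only its text, so the student is simply fine-tuned with ordinary next-token cross-entropy on (prompt, teacher_response) pairs. A minimal sketch, with all names and shapes illustrative:

    # Minimal sketch of black-box distillation: fine-tune a student on
    # teacher-generated text with standard next-token cross-entropy.
    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_token_ids):
        # student_logits: (seq_len, vocab_size); teacher_token_ids: (seq_len,)
        return F.cross_entropy(student_logits, teacher_token_ids)

    # Toy example: 16 response tokens over a 32k-token vocabulary.
    logits = torch.randn(16, 32000, requires_grad=True)
    targets = torch.randint(0, 32000, (16,))
    loss = distill_loss(logits, targets)
    loss.backward()  # gradients flow into the student only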

What I didn't know is that the three groups mentioned "created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models." There's some irony in that, given that Anthropic and all other established AI shops have been criticized for using copyrighted materials without permission to train their own models. I wouldn't be shocked if we subsequently find out that every major AI shop has secretly engaged in distillation at some point in the past.

Still, wow, 24,000 accounts. I can't help but wonder, how many other AI shops have surreptitious accounts with other AI shops right now?

lejalv · 20 days ago
So they did pay to distill a piratical model.

More than can be said for Anthropic et al.’s leeching of a substantial proportion of human culture.

joquarky · 20 days ago
The real cultural leeches are the corporations that kept extending copyright terms to the point that a kid can never create derivative works of their favorite show.
lumost · 20 days ago
Also makes you wonder how much of the user growth could just be distillation attempts from one model vendor to another.
cs702 · 20 days ago
Yeah, those 24K accounts are likely high-volume.

16M "exchanges" / 24K accounts = 667 "exchanges" per account.

How many tokens per "exchange"? I imagine a lot, because these accounts were likely maxing out on context.

taytus · 20 days ago
This reads like AI slop.
falcor84 · 20 days ago
Interesting, and my main takeaway is that ~16 million sessions is enough to distill Claude. That's extremely doable (obviously, as it's been done repeatedly), and it looks very feasible in general.

If I think of the number of lessons and educational conversations a human needs to acquire their lifetime knowledge, I would hazard that AI-to-AI learning no longer requires many orders of magnitude more than that.

Imustaskforhelp · 20 days ago
I wonder if more companies from other countries will get interested in distillation efforts.

Because a huge downside of Chinese models is that they are Chinese models, with Tiananmen Square, Tibet, and other such issues.

Yet everyone uses them, because building such models was thought to be insanely hard, and obviously I'm not trying to downplay that: even now it's an incredible accomplishment that they've created such good open-source models and provided them at competitive rates.

Now that we know it might be easier than previously thought, would more countries, say South Korea, Japan, or India, want to enter the market as well, without the bias on certain topics that gets raised about Chinese censorship every time a new model is discussed?

It's a huge risk/reward thing. From what I can tell, inference is extremely profitable (DeepSeek was profitable on inference, fwiw), so perhaps more countries could try to create their own "DeepSeek", focusing on brand value plus open source and selling to enterprise.

Mistral is a good example of that, especially with their enterprise contracts. Speaking of Mistral, are they doing distillation too, or not?

MiSeRyDeee · 20 days ago
Kudos to them, then, for doing such a good job at distillation. Only 16 million chats (shared across multiple labs/models) needed to get mostly on-par performance at 1/10th to 1/50th of the cost. Keep on keeping up!
icedchai · 19 days ago
The output quality of open models still has a long way to go... I've experimented with many of them, through services like OpenRouter and on my own hardware.
culi · 17 days ago
Try Kimi 2.5
throwfaraway4 · 20 days ago
A company that rips off creators to build its product complains that other companies are doing the same to it.
burnt-resistor · 19 days ago
Not morally or ethically equivalent, but developed nations regularly denounce developing nations for "stealing" IP and talent, just as they themselves once did. The commonality is the hypocrisy of the pot castigating the kettle.
Alifatisk · 20 days ago
Reading the comment section in this thread gave me a good laugh; no one is buying into this.

If anything, it’s thanks to these Chinese labs that I’m able to have something like GLM-5 for $7 a quarter or Kimi K2.5 for $2 a month, while getting results close to Claude's. I am grateful. Looking forward to the new DeepSeek model.

But one thing that makes me curious is how, let's say, DeepSeek is doing this. Are they paying cheap workers to buy subscriptions and chat to gather data? Have they purchased lots of API keys and run automated scripts to feed Claude prompts and collect the output? How are they doing this?