I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.
Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.
No, the model has nothing to do with Llama. We are using our own architecture, and training from scratch. Llama also does not have open training data, and is non-compliant, in contrast to this model.
If you guys need help on GGUFs + Unsloth dynamic quants + finetuning support via Unsloth https://github.com/unslothai/unsloth on day 0 / 1, more than happy to help :)
Hey, really cool project, I’m excited to see the outcome.
Is there a blog / paper summarizing how you are doing it?
Also, which research group is currently working on it at ETH?
So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to Llama and other well-known models. It's an open secret that everyone is using everything they can get their hands on.
Imo, a lot of the magic is also dataset driven, specifically the SFT and other fine tuning / RLHF data they have. That's what has separated the models people actually use from the also-rans.
I agree with everything you say about getting the experience, the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, that the model will be useful.
When I read "from scratch", I assume they are doing pre-training, not just finetuning, do you have a different take? Do you mean it's normal Llama architecture they're using?
I'm curious about the benchmarks!
The infra does become pretty complex to get a SOTA LLM trained. People assume it's as simple as loading up the architecture and a dataset + using something like Ray. There's a lot that goes into designing the dataset, the eval pipelines, the training approach, maximizing the use of your hardware, dealing with cross-node latency, recovering from errors, etc.
But it's good to have more and more players in this space.
SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g. Llama 3.3) then it could be quite useful. Not everyone has the VRAM to run the full fat Deepseek R1.
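To put rough numbers on the VRAM point, here is a back-of-the-envelope sketch. It counts only the weights (no KV cache, activations, or runtime overhead), so real usage will be higher. The 8B and 70B sizes are the ones from the announcement; the bit widths are just common quantization levels I picked for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Sizes from the announcement; bit widths are illustrative assumptions.
for size in (8, 70):
    for bits in (16, 8, 4):
        print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.0f} GB")
```

So a 70B model needs about 140 GB for 16-bit weights and about 35 GB at 4-bit, which is why quantized 70B models are roughly the ceiling for a lot of local setups.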
I wonder if the reason for these results is that any data on the internet has already been copied to other locations by actors who ignore crawling opt-outs. So even if they respect all web crawling opt-outs, they are still effectively copying the data, because someone else ignored the opt-out and republished the content on a site that has none.
Yes this is an interesting question. In our arxiv paper [1] we did study this for news articles, and also removed duplicates of articles (decontamination). We did not observe an impact on the downstream accuracy of the LLM, in the case of news data.
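For anyone wondering what "respect the opt-out and also drop the copies" could look like mechanically, here is a minimal sketch. To be clear, this is not the pipeline from the paper: the robots.txt content, the `ExampleAIBot` user agent, and the helper names are all made up for illustration, and real decontamination would typically use fuzzy matching rather than an exact hash of normalized text.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a news site that opts out of one crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_copies(docs: list[str], opted_out_texts: list[str]) -> list[str]:
    """Remove docs whose normalized text matches an opted-out article."""
    blocked = {fingerprint(t) for t in opted_out_texts}
    return [d for d in docs if fingerprint(d) not in blocked]


article = "Breaking:  Swiss universities announce an open LLM."
mirror = "breaking: swiss universities announce an open llm."
other = "Weather report for Zurich."

print(allowed(ROBOTS_TXT, "ExampleAIBot", "https://news.example/a"))
print(drop_copies([mirror, other], [article]))
```

The first check handles the site that opted out; the second catches the same article republished elsewhere without an opt-out, which is the case the parent comment is worried about.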
Is there not yet a source where the web has already been scraped and boiled down to just the text? It would seem someone would have created such a thing to save LLM training efforts from having to reinvent the wheel.
I understand the web is a dynamic thing but still it would seem to be useful on some level.
No performance degradation on training metrics except for the end user. At the end of the day users and website owners have completely orthogonal interests. Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master.
It's also possible you just think of ETH Zurich as great and automatically associate the people and products as amazing. Could be a circular dependency here.
I took courses online from ETH Zurich before the formula was "perfected" and I'd say they were ahead of the curve in quality, concise but info-dense educational content.
That is indeed how things work. I can think of a few 'good' media-relevant examples, including e.g. that recent super-quick cart project [1], that reach beyond the more vanilla startup-spinoffs or basic media efforts.
I had no idea what ETH meant 2 years ago. I thought it was an Ethereum club in Switzerland or something. Then I kept hearing about it and noticing people wearing ETH stuff.
Obviously I don't know if it's the university or the people there, because I haven't been there, but I keep hearing about ETH Zurich in different areas and it means something.
Pretty proud to see this at the top of HN as a Swiss (and I know many are lurking here!). These two universities produce world-class founders, researchers, and engineers. Yet, we always stay in the shadow of the US. With our top-tier public infrastructure, education, and political stability (+ neutrality), we have a unique opportunity to build something exceptional in the open LLM space.
I think EPFL and ETH are generally well known internationally, but Switzerland being rather small (9M pop), it's only natural you don't hear much about it compared to other larger countries!
I think that the Allen Institute for Artificial Intelligence OLMo models are also completely open:
OLMo is fully open
Ai2 believes in the power of openness to build a future where AI is accessible to all. Open weights alone aren’t enough – true openness requires models to be trained in the open with fully open access to data, models, and code.
The open training data is a huge differentiator. Is this the first truly open dataset of this scale? Prior efforts like The Pile were valuable, but had limitations. Curious to see how reproducible the training is.
> The model will be fully open: source code and weights will be publicly available, and the training data will be transparent and reproducible
This leads me to believe that the training data won’t be made publicly available in full, but merely be “reproducible”. This might mean that they’ll provide references like a list of URLs of the pages they trained on, but not their contents.
Yeah, I suspect you're right. Still, even a list of URLs for a frontier model (assuming it does turn out to be of that level) would be welcome over the current situation.
Sure, but usually you teach something that is inherently useful, or can be applied to some sort of useful endeavor. In this case I think it's fair to ask what the collision of two bubbles really achieves, or if it's just a useful teaching model, what it can be applied to.
The model will be released in two sizes — 8 billion and 70 billion parameters [...]. The 70B version will rank among the most powerful fully open models worldwide. [...] In late summer, the LLM will be released under the Apache 2.0 License.
Source: I'm part of the training team
Good luck though, very needed project!
Great to read that!
[1] https://arxiv.org/abs/2504.06219
How are you going to serve users if web site owners decide to wall their content? You can't ignore one side of the market.
[1] https://ethz.ch/en/news-and-events/eth-news/news/2023/09/fro...
They missed an opportunity though. They should have called their machine the AIps (AI Petaflops Supercomputer).
https://allenai.org/olmo
We'll find out in September if it's true?