Also, 10M input token context is insane!
EDIT: https://huggingface.co/meta-llama/Llama-3.1-405B is BF16, so yes, it seems training in FP8 is new.
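For anyone wondering what FP8 training looks like in practice: here's a rough sketch using NVIDIA's Transformer Engine. This is not Meta's actual recipe (they haven't published the details); the layer sizes and scaling recipe are placeholder assumptions.

    # Rough illustration of FP8 training with NVIDIA's Transformer Engine.
    # NOT Meta's recipe -- just what FP8 matmuls with delayed scaling look
    # like in practice. Sizes and hyperparameters are made up.
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 fwd, E5M2 bwd
    layer = te.Linear(4096, 4096, bias=True).cuda()
    optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)                   # matmul runs in FP8 with per-tensor scaling
    loss = y.float().pow(2).mean()     # loss and optimizer state stay in higher precision
    loss.backward()
    optimizer.step()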
> We developed a new training technique which we refer to as MetaP that allows us to reliably set critical model hyper-parameters such as per-layer learning rates and initialization scales. We found that chosen hyper-parameters transfer well across different values of batch size, model width, depth, and training tokens.
This sounds interesting. Anyone have a link to the paper or other documentation on MetaP?
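I haven't found a paper either. From the description (per-layer learning rates and init scales that transfer across width, depth, and batch size), it sounds related to µP / Tensor Programs-style hyperparameter transfer. Purely as speculation, here's a minimal Python sketch of that general idea; the base width, learning rate, and scaling rules below are illustrative placeholders, not anything Meta has published.

    # Speculative sketch only: MetaP has no public paper yet. This shows the
    # general µP-style idea of width-dependent per-layer learning rates, with
    # made-up numbers (BASE_WIDTH, base_lr) -- not Meta's actual rules.
    import torch
    import torch.nn as nn

    BASE_WIDTH = 256   # width at which hyperparameters were originally tuned (assumed)
    base_lr = 3e-3     # learning rate tuned once at BASE_WIDTH (assumed)

    def width_scaled_param_groups(model, width):
        """Give hidden Linear layers an LR scaled by BASE_WIDTH/width so the
        small-model tuning keeps working as the model gets wider."""
        scale = BASE_WIDTH / width
        linears = [m for m in model if isinstance(m, nn.Linear)]
        groups = []
        for i, layer in enumerate(linears):
            hidden = 0 < i < len(linears) - 1
            if hidden:
                # shrink init variance with fan-in as well (illustrative)
                nn.init.normal_(layer.weight, std=layer.in_features ** -0.5)
            groups.append({"params": layer.parameters(),
                           "lr": base_lr * scale if hidden else base_lr})
        return groups

    width = 4096
    model = nn.Sequential(nn.Linear(128, width), nn.GELU(),
                          nn.Linear(width, width), nn.GELU(),
                          nn.Linear(width, 10))
    opt = torch.optim.AdamW(width_scaled_param_groups(model, width))

The point of this style of parametrization is that you tune hyperparameters once on a cheap, narrow model and they keep working when you scale width and depth up, which matches what the quoted paragraph claims for MetaP.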
Private companies are now getting their own nuclear power stations to power AI. We can't get new nuclear power for public use, but private, for-profit initiatives? Absolutely.
Perhaps you'd like to plug it into a toolchain that runs faster than API calls can travel over the network -- eventually your edge hardware is going to be able to infer a lot faster than the 50ms+ round trip to the cloud.
Maybe you would like to prevent the monopolists from gaining sole control of what may be the most impactful technology of the century.
Or perhaps you don't want to share your data with Microsoft & Other Evils (formerly known as "don't be evil").
You might just like to work offline. Whole towns go offline, sometimes for days, just because of bad weather. Never mind war and infrastructure crises.
Or possibly you don't like that The Cloud model has a fervent, unshakeable belief in the propaganda of its masters. Maybe that propaganda will change one day, and not in your favor. Maybe you'd like to avoid that.
There are many more reasons in the possibility space than my limited imagination allows for.
I really don't like Firefox forks, both because of the slow updates and because I genuinely do use some bleeding-edge features, but I'm tired of Mozilla.
How would I go about catching up with this aspect of his research? It's not often that I've never heard of a Turing Award winner, but this guy is completely off my radar.
Gemini 2.5 Pro got 72.9%
o3 high gets 81.3%, o4-mini high gets 68.9%