If anyone's interested, I made Colab notebooks with free GPUs for both GRPO (the algo DeepSeek used) to train a reasoning model from scratch, and also general finetuning, which the Berkeley team employed!
How long does this take on a free tier T4? This is really neat, I’d assumed this type of “playing with the guts” work was more difficult to access as a normie programmer. Looks like something I’d like to try!
You should always assume headlines are hyperbolic, and 'verb your own noun for cheap' headlines are always offering a way to make your own version of $expensive_thing for hobby prices, not to provide a copy of $expensive_thing.
If you see a headline saying 'make your own James Webb Space Telescope in a weekend', they're offering a project that leverages some tech concept from the JWST, like mirror arrays or a particular sort of sensor. They're not promising that you will be able to build a space-capable telescope the size of a semi truck.
The vocabulary used to describe the culturally prevailing leader will be used to explain similar concepts and create analogies. That's an easier tool to communicate to the masses than crafting super tailored messages for only domain experts.
It's why we keep doing this, and it's also why trademarks become generics.
"Google it", "Uber for X", "band aid", "the band sounds like Y", "the actor looks like Z", etc. etc.
This is a core part of how human language works and how we as a species communicate with one another.
Yeah, I agree. The "O1 preview" naming feels a bit misleading. It sets an expectation of broader coverage than just those specific benchmarks. It's cool to see cost reductions, but the marketing could be more transparent about the scope.
In the last few weeks we are seeing a torrent of advances, just because someone opened their architectures.
Imagine where we could go if the training datasets were also publicly available and unbounded by any copyright laws. (I'm not talking about doing anything illegal).
Perhaps copyright needs to be updated. In any case, my personal belief is that training on publicly released data, as well as on purchased media, is fair use.
It seems like the torrent was already happening and DeepSeek's part is just one example of that. They did help bring attention to those advancements, and that's led to lots more people contributing and finding more niche applications.
Isn't the general attitude these days to just break laws and bribe officials once you own the hottest startup? /s
edit: re. the /s
I was living offshore and running the most popular bitcoin casino at the time, spending a vast amount of money and energy to block any player who might be American. As a result I didn't make that much money. And I tried to calculate how much I would need to make if I wanted to break the law and hide out forever. I figured I could make $10-15M a year but that wouldn't be enough to hide. I fucked up, I guess. Because the richest man in the world made most of his first round of money facilitating gambling transactions, and he's now got his snout in every federal agency. I should have had the balls, I guess, to ask forgiveness rather than permission.
It was always like this. YouTube started out publishing mostly copyrighted content, then Google settled with the copyright owners. Google, by the way, has perfected the "art" of training its algorithms on content without approval from copyright owners.
Inference-time compute is still very underutilized in actual AI deployments. Lots of folks are working on foundation models, which require reasoning about broad problem domains. Not enough people are using the same techniques for task-specific performance improvements. You can easily distill the reasoning from larger models like R1 for your task. Often better, you can mix in custom thinking instructions for specific sub-problems, so a fine-tuned model learns a mix of task-specific reasoning and custom logic. It's not hard and easily beats prompt iteration. When you find bugs, you can fix them.
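As a concrete illustration of the distillation step, here's a rough sketch of collecting reasoning traces for your task, assuming an OpenAI-compatible endpoint serving an R1-style model. The base URL, model name, and the thinking-tag system prompt are all placeholders, not any particular vendor's API.

```python
# Sketch: collect reasoning traces from a larger model for task-specific distillation.
# The endpoint, model name, and system prompt below are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # hypothetical local server

SYSTEM = (
    "Think step by step inside <think>...</think> tags, "
    "then give the final answer."  # custom thinking instructions for a specific sub-problem
)

def collect_trace(question: str) -> dict:
    resp = client.chat.completions.create(
        model="deepseek-r1-distill",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        temperature=0.6,
    )
    # Keep the full trace; filtering/verification happens before fine-tuning.
    return {"question": question, "response": resp.choices[0].message.content}

with open("traces.jsonl", "w") as f:
    for q in ["What is 17 * 24?"]:  # replace with prompts from your own task
        f.write(json.dumps(collect_trace(q)) + "\n")
```

The resulting JSONL is then the input to a supervised fine-tune of your smaller task model.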
Thanks for linking to this. That’s a good resource!
Do you have any pointers on assembling fine-tuning data not for isolated tasks, but for a flexible range of queries in a particular problem domain? Similar to general purpose instruction-tuning, but much more focused.
For example, suppose you’re building an app that helps doctors search through research literature to aid in diagnosis, check hypotheses, etc. Of course you would want to have some domain experts and real users available to see what kind of queries they would create. But getting from that point to a well-balanced dataset that adequately represents the distribution of possible queries, instructions, writing/cognitive styles, formatting, dialog flows, etc. your app will encounter: it just seems kind of hard to know how to approach a task like that. It seems there are infinitely many dimensions you could accidentally overfit on.
General advice? Collect data, train a model, note the mistakes in the model and the mistakes in the data, and think critically about what it is that you're actually ending up teaching. Repeat many, many, many times. For some tasks, don't be surprised if it ends up taking months, or a year, or several. It took me 6 months of building a dataset by hand, by myself, to produce ~1600 'gold standard' text examples (bolstered by ~100K synthetic examples): texts plus 20 dimensions rated 1-4. But I managed to beat the frontier labs' SOTA models on this task by doing so. It also makes sense to consider all of the various "lacks" of the competing models.
It's quite difficult to see all the future decisions you will make due to future insights about future versions of the whole loop. But you will be needing to make some.
I will say one more concrete thing though: the more metadata you collect, generally, the better, but this can make it more expensive.
Also, if you ever need to update your schema... well, this is actually one reason why text data for LLMs is nice: your schema is essentially fluid in the first place, so you could, e.g., stick metadata in the text itself if at some future point you start collecting it.
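To make that concrete, here's a hypothetical JSONL record layout; the field names and dimensions are made up for illustration. The point is just that structured metadata and metadata embedded in the text itself can coexist, so a later schema change doesn't invalidate older records.

```python
# Hypothetical record layout: keep whatever metadata you have as JSON fields,
# and fall back to embedding late-arriving metadata in the text itself.
import json

record = {
    "text": "[domain: cardiology] [query_style: terse] Does drug X interact with warfarin?",
    "ratings": {"clarity": 3, "specificity": 4},  # e.g. two of the rated dimensions
    "source": "expert",                           # gold vs. synthetic
    "annotator": "a1",
}

with open("dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```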
I guess, also, it's a good thing to constantly add new benchmarks, if possible. Treat your model's capabilities as knowable, but never treat your model's capabilities as actually known.
The blog post was a little unclear, so my summary was:
- They used QwQ to generate training data (with some cleanup using GPT-4o-mini)
- The training data was then used to FT Qwen2.5-32B-Instruct (non-reasoning model)
- Result was that Sky-T1 performs slightly worse than QwQ but much better than Qwen2.5 on reasoning tasks
There are a few dismissive comments here but I actually think this is pretty interesting as it shows how you can FT a foundation model to do better at reasoning.
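For anyone who wants to see the shape of that recipe, here's a minimal sketch assuming trl's SFTTrainer. It is not the Sky-T1 team's actual training code: the hyperparameters are made up, and it glosses over data formatting (the released dataset may need reshaping into the chat format the trainer expects), so treat it as an outline.

```python
# Rough sketch of the recipe summarized above: supervised fine-tuning of a
# non-reasoning instruct model on reasoning traces generated by a stronger model.
# Hyperparameters are illustrative only.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("NovaSky-AI/Sky-T1_data_17k", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # the base (non-reasoning) model
    train_dataset=dataset,              # may require converting to the trainer's chat format
    args=SFTConfig(
        output_dir="sky-t1-repro",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```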
GRPO notebook for Llama 3.1 8B: https://colab.research.google.com/github/unslothai/notebooks...
General finetuning notebook: https://colab.research.google.com/github/unslothai/notebooks...
The Berkeley team's 17K dataset: https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k Hugging Face also released a 220K dataset: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
Also you can install Unsloth on your local machine :)
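For reference, a minimal local GRPO setup might look roughly like the sketch below, assuming Unsloth's FastLanguageModel together with TRL's GRPOTrainer. The model id, toy dataset, and reward function are placeholders for illustration, not what the notebooks ship.

```python
# Minimal local GRPO sketch with Unsloth + TRL; everything task-specific is a placeholder.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",  # assumed Unsloth model id
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Toy dataset: GRPO needs a "prompt" column; extra columns are passed to reward functions.
train_dataset = Dataset.from_list([
    {"prompt": "What is 13 + 29? Answer with just the number.", "answer": "42"},
])

def correctness_reward(completions, answer, **kwargs):
    # Reward 1.0 when the expected answer appears in the completion, else 0.0.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],
    args=GRPOConfig(output_dir="grpo-test", num_generations=4, max_completion_length=256, max_steps=50),
    train_dataset=train_dataset,
)
trainer.train()
```

On a free T4 you'd want the 4-bit load plus LoRA as above; full-precision GRPO on an 8B model won't fit.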
Kaggle has 2x Tesla T4s as well for free for 30 hours per week!
I expected some sort of way to actually get o1 preview retrained (and downloadable).
Also, calling it "o1-preview" based on just 7 benchmarks is not correct. What if someone comes up with use cases where o1-preview does better than this?
Apart from that, it's good that things are becoming cheaper.
I can only dream, I guess.
https://www.privacyworld.blog/2024/03/japans-new-draft-guide...
I imagine if copyright is a big issue for AI, Japanese startups will have an advantage.
I made a GitHub project for distilling thinking models (and custom COT inference-time fine-tuning): https://docs.getkiln.ai/docs/guide-train-a-reasoning-model
That said, for someone who's not in the game but has been curious about the details of fine-tuning, it's great to get both the dataset and the code.
Hardly a huge win.