If anyone's interested, I made Colab notebooks with free GPUs for both GRPO (the algo DeepSeek used) to train a reasoning model from scratch, and also general finetuning, which the Berkeley team employed!
How long does this take on a free tier T4? This is really neat, I’d assumed this type of “playing with the guts” work was more difficult to access as a normie programmer. Looks like something I’d like to try!
You should always assume headlines are hyperbolic, and 'verb your own noun for cheap' headlines are always offering a way to make your own version of $expensive_thing for hobby prices, not to provide a copy of $expensive_thing.
If you see a headline saying 'make your own James Webb Space Telescope in a weekend', they're offering a project that leverages some tech concept from the JWST, like mirror arrays or a particular sort of sensor. They're not promising that you will be able to build a space-capable telescope the size of a semi truck.
The vocabulary used to describe the culturally prevailing leader will be used to explain similar concepts and create analogies. That's an easier tool to communicate to the masses than crafting super tailored messages for only domain experts.
It's why we keep doing this, and it's also why trademarks become generics.
"Google it", "Uber for X", "band aid", "the band sounds like Y", "the actor looks like Z", etc. etc.
This is a core part of how human language works and how we as a species communicate with one another.
Yeah, I agree. The "O1 preview" naming feels a bit misleading. It sets an expectation of broader coverage than just those specific benchmarks. It's cool to see cost reductions, but the marketing could be more transparent about the scope.
In the last few weeks we are seeing a torrent of advances, just because someone opened their architectures.
Imagine where we could go if the training datasets were also publicly available and unbounded by any copyright laws. (I'm not talking about doing anything illegal).
Perhaps copyright needs to be updated. In any case, my personal belief is that training on publicly released data, as well as on purchased media, is fair use.
It seems like the torrent was already happening and DeepSeek's part is just one example of that. They did help bring attention to those advancements, and that's led to lots more people contributing and finding more niche applications.
Isn't the general attitude these days to just break laws and bribe officials once you own the hottest startup? /s
edit: re. the /s
I was living offshore and running the most popular bitcoin casino at the time, spending a vast amount of money and energy to block any player who might be American. As a result I didn't make that much money. And I tried to calculate how much I would need to make if I wanted to break the law and hide out forever. I figured I could make $10-15M a year but that wouldn't be enough to hide. I fucked up, I guess. Because the richest man in the world made most of his first round of money facilitating gambling transactions, and he's now got his snout in every federal agency. I should have had the balls, I guess, to ask forgiveness rather than permission.
It was always like this. YouTube started out publishing mostly copyrighted content, then Google settled with the copyright owners. Google, by the way, has perfected the "art" of training its algorithms on content without approval from copyright owners.
Inference-time compute is still very underutilized in actual AI deployments. Lots of folks are working on foundation models, which require reasoning about broad problem domains. Not enough people are using the same techniques for task-specific performance improvements. You can easily distill the reasoning from larger models like R1 for your task. Often better, you can mix in custom thinking instructions for specific sub-problems, so a fine-tuned model learns a mix of task-specific reasoning and custom logic. It's not hard and easily beats prompt iteration. When you find bugs, you can fix them.
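As a concrete illustration of the distillation step, here's a rough sketch of collecting reasoning traces for your task, assuming an OpenAI-compatible endpoint serving an R1-style model. The base URL, model name, and the thinking-tag system prompt are all placeholders, not any particular vendor's API.

```python
# Sketch: collect reasoning traces from a larger model for task-specific distillation.
# The endpoint, model name, and system prompt below are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # hypothetical local server

SYSTEM = (
    "Think step by step inside <think>...</think> tags, "
    "then give the final answer."  # custom thinking instructions for a specific sub-problem
)

def collect_trace(question: str) -> dict:
    resp = client.chat.completions.create(
        model="deepseek-r1-distill",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        temperature=0.6,
    )
    # Keep the full trace; filtering/verification happens before fine-tuning.
    return {"question": question, "response": resp.choices[0].message.content}

with open("traces.jsonl", "w") as f:
    for q in ["What is 17 * 24?"]:  # replace with prompts from your own task
        f.write(json.dumps(collect_trace(q)) + "\n")
```

The resulting JSONL is then the input to a supervised fine-tune of your smaller task model.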
Thanks for linking to this. That’s a good resource!
Do you have any pointers on assembling fine-tuning data not for isolated tasks, but for a flexible range of queries in a particular problem domain? Similar to general purpose instruction-tuning, but much more focused.
For example, suppose you’re building an app that helps doctors search through research literature to aid in diagnosis, check hypotheses, etc. Of course you would want to have some domain experts and real users available to see what kind of queries they would create. But getting from that point to a well-balanced dataset that adequately represents the distribution of possible queries, instructions, writing/cognitive styles, formatting, dialog flows, etc. your app will encounter: it just seems kind of hard to know how to approach a task like that. It seems there are infinitely many dimensions you could accidentally overfit on.
General advice? Collect data, train a model, note the mistakes in the model and the mistakes in the data, and think critically about what it is that you're actually ending up teaching. Repeat many, many, many times. For some tasks, don't be surprised if it ends up taking months, or a year, or several. It took me 6 months of building a dataset by hand, by myself, to produce ~1600 'gold standard' text examples (bolstered by ~100K synthetic examples): texts plus 20 dimensions rated 1-4. But I managed to beat the frontier labs' SOTA models on this task by doing so. It also makes sense to consider all of the various "lacks" of the competing models.
It's quite difficult to see all the future decisions you will make due to future insights about future versions of the whole loop. But you will be needing to make some.
I will say one more concrete thing though: the more metadata you collect, generally, the better, but this can make it more expensive.
Also, if you ever need to update your schema... well, this is actually one reason why text data for LLMs is nice: your schema is essentially fluid in the first place, so you could, e.g., stick metadata in the text itself if at some future point you start collecting it.
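To make that concrete, here's a hypothetical JSONL record layout; the field names and dimensions are made up for illustration. The point is just that structured metadata and metadata embedded in the text itself can coexist, so a later schema change doesn't invalidate older records.

```python
# Hypothetical record layout: keep whatever metadata you have as JSON fields,
# and fall back to embedding late-arriving metadata in the text itself.
import json

record = {
    "text": "[domain: cardiology] [query_style: terse] Does drug X interact with warfarin?",
    "ratings": {"clarity": 3, "specificity": 4},  # e.g. two of the rated dimensions
    "source": "expert",                           # gold vs. synthetic
    "annotator": "a1",
}

with open("dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```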
I guess, also, it's a good thing to constantly add new benchmarks, if possible. Treat your model's capabilities as knowable, but never treat your model's capabilities as actually known.
The blog post was a little unclear, so my summary was:
- They used QwQ to generate training data (with some cleanup using GPT-4o-mini)
- The training data was then used to FT Qwen2.5-32B-Instruct (non-reasoning model)
- Result was that Sky-T1 performs slightly worse than QwQ but much better than Qwen2.5 on reasoning tasks
There are a few dismissive comments here but I actually think this is pretty interesting as it shows how you can FT a foundation model to do better at reasoning.
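For anyone who wants to see the shape of that recipe, here's a minimal sketch assuming trl's SFTTrainer. It is not the Sky-T1 team's actual training code: the hyperparameters are made up, and it glosses over data formatting (the released dataset may need reshaping into the chat format the trainer expects), so treat it as an outline.

```python
# Rough sketch of the recipe summarized above: supervised fine-tuning of a
# non-reasoning instruct model on reasoning traces generated by a stronger model.
# Hyperparameters are illustrative only.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("NovaSky-AI/Sky-T1_data_17k", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # the base (non-reasoning) model
    train_dataset=dataset,              # may require converting to the trainer's chat format
    args=SFTConfig(
        output_dir="sky-t1-repro",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```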
GRPO notebook for Llama 3.1 8B: https://colab.research.google.com/github/unslothai/notebooks...
General finetuning notebook: https://colab.research.google.com/github/unslothai/notebooks...
The Berkeley team's 17K dataset: https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k Hugging Face also released a 220K dataset: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
Also you can install Unsloth on your local machine :)
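For reference, a minimal local GRPO setup might look roughly like the sketch below, assuming Unsloth's FastLanguageModel together with TRL's GRPOTrainer. The model id, toy dataset, and reward function are placeholders for illustration, not what the notebooks ship.

```python
# Minimal local GRPO sketch with Unsloth + TRL; everything task-specific is a placeholder.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",  # assumed Unsloth model id
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Toy dataset: GRPO needs a "prompt" column; extra columns are passed to reward functions.
train_dataset = Dataset.from_list([
    {"prompt": "What is 13 + 29? Answer with just the number.", "answer": "42"},
])

def correctness_reward(completions, answer, **kwargs):
    # Reward 1.0 when the expected answer appears in the completion, else 0.0.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],
    args=GRPOConfig(output_dir="grpo-test", num_generations=4, max_completion_length=256, max_steps=50),
    train_dataset=train_dataset,
)
trainer.train()
```

On a free T4 you'd want the 4-bit load plus LoRA as above; full-precision GRPO on an 8B model won't fit.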
Kaggle has 2x Tesla T4s as well for free for 30 hours per week!
I expected some sort of way to actually get o1 preview retrained (and downloadable).
Also, calling it "o1-preview" based on just 7 benchmarks is not correct. What if someone comes up with use cases where o1-preview does better than this?
Apart from that, it's good that things are becoming cheaper.
I can only dream, I guess.
https://www.privacyworld.blog/2024/03/japans-new-draft-guide...
I imagine if copyright is a big issue for AI, Japanese startups will have an advantage.
I made a GitHub project for distilling thinking models (and custom COT inference-time fine-tuning): https://docs.getkiln.ai/docs/guide-train-a-reasoning-model
That said, for someone who's not in the game but has been curious about the details of fine-tuning, it's great to get both the dataset and the code.
Hardly a huge win.