highfrequency · a year ago
Cool result, but worth highlighting two points:

- The model is fine-tuned from Qwen-2.5 Instruct, which already includes millions of specially filtered math examples in both pretraining and supervised fine-tuning.

- To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data. It’s not very clear to me if this is more or less impressive than getting the same result by simply fine-tuning on the 10 million initial pool, but I suppose that would make for a worse headline.

armcat · a year ago
Yes, the authors explicitly highlight those two points in the abstract as the elicitation threshold for complex reasoning: an extremely complete pre-trained foundation model, and a set of extremely high-quality post-training examples.

To your question on fine-tuning on the initial 10-million pool: intuitively, it would take a tremendous amount of fine-tuning to move the needle. The gradient signal from 817 good examples gets drowned out when they make up only 817/10,000,000 ≈ 0.008% of the data - the rest of the pool effectively enforces pretty rigid regularization.

There is now increasing interest in showing that small data plus inference-time scaling provides significant yield. A couple of recent examples:

* TinyZero: https://github.com/Jiayi-Pan/TinyZero

* s1 (Simple Test-Time Scaling): https://arxiv.org/abs/2501.19393

highfrequency · a year ago
The abstract doesn't specify that the 817 training examples were filtered down by R1 from 10 million initial questions. This helps to understand the result better: it is in large part a testament to the remarkable ability of R1 and similar models to sift through and identify/construct perfect training data for other models.

Dead Comment

amingilani · a year ago
Why is everyone so critical of using information from a previous model to make a more efficient model? There's nothing wrong with making progress using prior work. And increasing efficiency is progress.

You wouldn’t criticize someone’s kombucha because they didn’t piece their SCOBY (symbiotic culture of bacteria and yeast) together microbe by microbe.

carschno · a year ago
You are looking at it from a product perspective. From a scientific perspective, it just means the respective benchmark is meaningless, so we don't know how well such a model generalizes.
btown · a year ago
There is a valid criticism that when you rely heavily on synthetic outputs, you bring along the precursor model's biases and assumptions without fully knowing the limitations of the data set the precursor model was trained on, as well as intentional adjustments made by the designers of the precursor model to favor certain geopolitical goals.

But that's not the criticism that I'm often seeing; it's more that there's an "unfair" amount of press coverage towards new models that rely, in the critics' views, more on distillation than on "true" innovation.

It's worth noting that there are many parties with significant motivation to build public sympathy that only "true" innovation should be valued, and it is only their highly-valued investments that can uniquely execute in that space. Cutting-edge models built in caves with a box of their scraps are counter to that narrative. It's worth considering https://paulgraham.com/submarine.html in this context, and understanding whether it is truly "everyone" that is critical in this way.

Rumengol · a year ago
The issue is that they claim you don't need an extensive amount of data for efficient reasoning. That alone is a bit misleading if you need a massive model to fine-tune, and another one to piece together the small amount of data.

I've seen the textbook analogy used, but to me it's like a very knowledgeable person reading an advanced textbook to become an expert, then claiming to be better than other equally knowledgeable people because they read that manual, and that anyone can start from scratch using it.

So there's nothing wrong with making a more efficient model from an existing one; the issue is concluding that you don't need all the data that made the existing one possible in the first place. While that may be true, this is not how you prove it.

Deleted Comment

novakboskov · a year ago
I'd say the critique points out that this "information from a previous model" itself required tremendous amounts of data. With all of that data counted, did we actually see better generalization?
trott · a year ago
Another way to look at this is that there are about 12,267 bits of information in choosing 817 samples from 10,000,000.
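
Back-of-the-envelope, for anyone curious (my own sketch, not from the paper): that figure is log2 C(10,000,000, 817), computable via log-gamma without ever building the astronomically large integer:

    import math

    n, k = 10_000_000, 817
    # log2(C(n, k)) = [ln G(n+1) - ln G(k+1) - ln G(n-k+1)] / ln 2
    bits = (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)
    print(round(bits))  # -> 12267

So the selection itself carries on the order of 12 kilobits - tiny compared to the information in the worked solutions themselves.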
TOMDM · a year ago
And much more information when selecting just as many examples from quadrillions of randomly generated examples.

The information from the selection criteria isn't available to the model, just the chosen samples.

EternalFury · a year ago
Just imagine a textbook that gives you the understanding you need to score high in math competitions…and it covers fewer than 1,000 problems. That in itself is a major discovery in metacognition.
robotresearcher · a year ago
It's one more textbook, not one textbook.

I'm not knocking the work. They report large improvements using relatively little data. That's good. But let's be clear that this is further training of a good-sized LLM that has already read far, far more than any human who ever lived.

sdenton4 · a year ago
Well, there's this, which comes close: https://www.wiley.com/en-us/The+Art+and+Craft+of+Problem+Sol...

Most of the math competitions people are working on are high-school math competitions - these draw problems from a relatively small slice of mathematics, so that high-school students can reasonably know the appropriate background.

Terretta · a year ago
> To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data

The paper, and this comment, seem awfully reminiscent of creating a textbook: a curated, "maximally informative and distilled" set of cognitive examples that takes students with foundational learning to the next level of reasoning.

The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if they were LLM-generated. Put another way, most responses are not particularly reasoned, but chain of tokgen*.

Sit near someone who "talks to herself" while doing problems and it's even more evident.

---

* tokgen: Listen to conversations in a cafeteria. Many are anything but thoughtful: responses that follow the prompts with near-perfect predictability. To differentiate these responses from speech that comes after a pause to reflect, one can use the labels thought versus token generation, or tokgen.

ciphix · a year ago
After reviewing the paper and GitHub training dataset, I have the following observations:

The 800+ training samples, each containing solutions with detailed reasoning steps, were primarily generated by DeepSeek R1 and other advanced models. The reasoning processes within these training solutions are crucial: the advanced models may have encoded their reasoning into the generated samples, and given a sufficiently large base model, fine-tuning can restore those reasoning weights, effectively adding a delta from DeepSeek R1 and others.

Therefore, it's not surprising that Qwen 2.5 achieved such significant improvements with relatively little fine-tuning data.

This is merely a conjecture. Further research is needed to analyze and visualize the changes in network weights before and after fine-tuning.

GTP · a year ago
>The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if they were LLM-generated. Put another way, most responses are not particularly reasoned, but chain of tokgen*.

Sorry, but I don't get the point of your comment as a whole, and of this part in particular. Yes, most human day-to-day conversations are quite predictable, but some people are still capable of generating original thoughts from time to time. And still, how is it related to the comment you are replying to?

orbital-decay · a year ago
>In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data.

Sounds like any textbook. (and generally the process of knowledge compression over generations that made us who we are)

smallerize · a year ago
Yeah, but it's cheaper.

The context right now is that OpenAI, with first-mover advantage, cutting-edge hardware, and tens of billions of dollars of investment, is not getting better benchmark performance than Chinese-developed models trained on cut-down Nvidia GPUs and a lot less money.

rfoo · a year ago
But... they are? o3-mini is faster than DeepSeek-R1 and has comparable capability. And while I hate the "AGI achieved internally" meme, o3 is significantly better than o1. Though I do wonder how long until DeepSeek-R3 happens. They could skip R2 too, citing Cloudflare R2 :P
yishanchuan · a year ago
Sure, but only in mathematical reasoning. If future work covers mathematical logic reasoning as well, that would be ideal.
mattigames · a year ago
You are missing the point: it's about showing the importance of the preselection. Now we know that we may not need huge amounts of data for similar results in other reasoning areas - only highly curated data, sometimes curated by models themselves, but not necessarily.
hexomancer · a year ago
Here is how I make sense of it (I have no expertise in this subject, so please feel free to correct me if I am wrong). When the model is pretrained on the internet, it gains most of the skills required for mathematical reasoning. However, since its task is to predict the next-word distribution of the entire internet, it does not normally use this ability, because most text on the internet is not this type of reasoning text. Think of generative image models a few years ago, where appending "unreal engine" to a prompt would significantly improve output quality: the model was trained to reproduce the distribution of images on the internet, most of which are not particularly impressive, but since images labeled "unreal engine" were usually high-quality renders, the phrase shifted generations toward the higher-quality end of the distribution. So the model already has most of the ability; it just needs to adjust a few connections to utilize this latent skill, and it makes sense that a few training examples are enough to make that adjustment and increase mathematical reasoning skill.
cube2222 · a year ago
Kinda similar to how Anthropic was able to create Golden Gate Claude, and even to maximize/minimize features like "buggy code", by analyzing concepts in activations and manipulating them[0].

[0]: https://www.anthropic.com/news/mapping-mind-language-model

zozbot234 · a year ago
The nice thing about Golden Gate Claude is that it shows very clearly how easily LLMs can be used for advertising, even in response to arbitrary user queries. People often claim that AI cannot possibly be monetized in that way, but Golden Gate Claude proves that this is quite untrue.
user_7832 · a year ago
Thank you for the link, I wasn't aware that there were high-quality blog posts by Anthropic (or about Golden Gate Claude).
barrkel · a year ago
I'd add a little bit more to that.

Pattern identification and continuation can be applied to evaluate symbolic reasoning. You can see this in, e.g., the semantics of a functional programming language whose evaluation semantics are defined in terms of rewrite rules.

If you have a model which can convert a problem into language that's precise enough to start pattern matching to LLM-encoded generative programs that evaluate logical implications, you can get into a very interesting space. Autoregressive prediction can turn into symbolic progressive evaluation and calculation. The background LLM is still guiding choice of evaluation and goal seeking.

Reinforcing these evaluation rules seems like it should be doable without enormous corpora, as long as the base model already has enough meat on it to cleanly attach to the more precise language.
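
As a toy illustration of evaluation-as-rewriting (my own sketch, nothing from the paper): terms as nested tuples, reduced by repeatedly applying two arithmetic rewrite rules until only a value remains:

    # A term is an int or ("add" | "mul", left, right).
    def reduce_term(term):
        if isinstance(term, int):
            return term                        # already in normal form
        op, a, b = term
        a, b = reduce_term(a), reduce_term(b)  # rewrite subterms first
        if isinstance(a, int) and isinstance(b, int):
            return a + b if op == "add" else a * b  # apply the rule
        return (op, a, b)                      # stuck term, leave as-is

    print(reduce_term(("add", ("mul", 2, 3), 4)))  # -> 10

A model that has internalized rules like these can in principle carry out the same reduction token by token; the hard part is attaching natural-language problems to forms precise enough for the rules to bite.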

larodi · a year ago
The reasoning R1 demonstrates most of the time sounds to me like a 5th grader's wording - in support of what you say. But then, if you compress the knowledge needed for math reasoning, perhaps you get category theory paired with Prolog, or something along those lines that is rule-based.
cubefox · a year ago
This suggests that fine-tuning a base model (with SL or RL) generally doesn't make the model inherently smarter; only the initial self-supervised learning during pretraining does. Though it would be strange if no amount of reinforcement learning could make an LLM truly smarter.
easeout · a year ago
My guess at the upshot: some domains, like math, are general but have outsized effective vocabularies (e.g. all possible numbers), which makes them more expensive to train by the method that works for domains with regular-sized vocabularies. If you train for reasoning steps in such a domain, you can reinforce the comparatively few general terms of the vocabulary, like "add", "inverse", "solve". That keeps the arithmetic of number combinations separate from particular problems, because you're not emphasizing one-shot answers. You can train on N reasoning cases + M arithmetic cases instead of N*M whole math problems. So you have to spend more inference compute, but you get better answers for less training.

Theory aside, I would think a good application-side method is to use this general reasoning process to structure a final expression, then pass that through a traditional evaluator. The reasoning, and the training thereof, then need only go as far as symbol manipulation. This is something like Wolfram Alpha, if its NLP handed off to the evaluator much later in the process.
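
A minimal sketch of that hand-off (the expression and function names here are invented for illustration): the model reasons its way to a final symbolic expression, and a conventional evaluator does the arithmetic:

    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def evaluate(expr: str) -> float:
        # Walk the parsed tree, allowing only binary arithmetic on numbers.
        def walk(node):
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval").body)

    model_output = "(3 * 17 + 5) / 4"  # hypothetical final line of a reasoning chain
    print(evaluate(model_output))      # -> 14.0

The model never needs to be a reliable multiplier; it only needs to reliably produce the right expression.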

sega_sai · a year ago
A connected question - has there been an LLM that is a perfect calculator? I.e., you give it an expression involving standard operations (+, -, and so on) and, say, integer numbers, and it always returns a correct result. I don't remember seeing any papers on this (but I'm not an expert).
jkhdigital · a year ago
Why would you ever want an LLM that is a perfect calculator? Humans invented calculators for a reason. A good LLM should respond to arithmetic questions by executing a cheap and efficient calculator program instead of wasting cycles on it.
Scene_Cast2 · a year ago
Standard neural nets (created through regular training methods) have no guarantees about their output. So no, there hasn't been anything like that.

I do recall someone handcrafting the weights for a transformer and getting some sort of useful algorithm or computation going, so there's that.

scotty79 · a year ago
Conversely, is there an LLM that is given a calculator and taught how to use it, so it doesn't need to waste neurons doing simple arithmetic that neurons actually suck at?

Or even better, a simple programmable calculator and/or symbolic calculator.
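
Most chat products do something like this today via function calling; a bare-bones version of the pattern (the CALC[...] marker protocol is invented for illustration) might look like:

    import re

    def fill_calculator_calls(model_text: str) -> str:
        # Replace each CALC[...] marker in the model's output with the
        # computed result, so the model never does the arithmetic itself.
        def calc(m: re.Match) -> str:
            # A real system would use a proper safe evaluator;
            # eval() with empty builtins is only for brevity here.
            return str(eval(m.group(1), {"__builtins__": {}}))
        return re.sub(r"CALC\[([^\]]+)\]", calc, model_text)

    print(fill_calculator_calls("That works out to CALC[123 * 456] units."))
    # -> "That works out to 56088 units."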

igleria · a year ago
I think I've recently read two seemingly contradicting things:

1- LLMs can never generalize theorem proving

2- this paper: "This suggests that contemporary LLMs may already possess rich mathematical knowledge in their parameter space, transforming the challenge from knowledge acquisition to knowledge elicitation"

Not sure what is what anymore!

bilater · a year ago
I think the way to swallow this bitter pill is to acknowledge they can "generalize" because all human knowledge is actually a relatively "small" finite distribution that models are now big enough to pattern match on.
gmueckl · a year ago
Calling human knowledge small is hyperbole. I cannot get any LLM even close to giving accurate answers about the things I know. They simply do not know what I, a single human being, know. That's simply because I'm a subject-matter expert on somewhat niche topics. There are easily hundreds of thousands of people like me out there.

There's simply no way an LLM can even train on all of that, because each bit of true expert knowledge is necessarily, comically underrepresented in any possible training set.

Davidzheng · a year ago
And another way is that the human brain is a relatively "small" circuit that models are now big enough to model ;)
ak_111 · a year ago
The LLM can generate the correct search space for the problem, but identifying the solution within the search space is inefficient?

Another way to put this: most students who study their high-school math lecture notes already have it within them to get a gold at the olympiad (the math itself is no more advanced than their high-school material), but actually getting a high-school kid to olympiad gold is hard. It might be something similar to P vs NP.

wrsh07 · a year ago
You are going to see a lot of people (hypers and skeptics alike) tell you things that you can verify yourself. Even when you have a screenshot demonstrating the opposite of what they are claiming, they will continue to claim it.

For skeptics in particular, you can use a top-tier LLM and check: does it do the thing someone claims it doesn't do? It often will. If you look at recently submitted papers by skeptics, you will see them making claims about state-of-the-art LLMs while testing only versions from over a year ago (this has happened recently!^)

The way for you to be sure what is what is to just use the thing for yourself and decide what is true.

^ https://x.com/tylercowen/status/1881051976102035880

Deleted Comment

solomatov · a year ago
You could have rich mathematical knowledge while not being very good at proving theorems. You might also be good at solving competition mathematics problems without having rich mathematical knowledge. And it's possible to have rich mathematical knowledge and be good at proving theorems, but mostly in the field of your expertise.
sebzim4500 · a year ago
I think that "LLMs can never X" is just always false.
solomatov · a year ago
An LLM can never solve the halting problem (because nothing computationally equivalent to a Turing machine can).
theWreckluse · a year ago
"LLMs can never predict the next word"
doug_durham · a year ago
In the same way that image diffusion models showed that a convincing approximation of the entire visual world can be compressed into a 5GB model, are "reasoning patterns" similarly compressible? Are there actually only a modest number of reasoning patterns in use across all domains, such that they can be captured with relatively small training sets?
HarHarVeryFunny · a year ago
I would say there are only a smallish number of truly generic "reasoning patterns" (strategies/approaches), but applied reasoning requires not only a reasoning "pattern" but also a repertoire of valid domain-specific reasoning steps to apply under that approach, plus the combination of capabilities it takes to overcome impasses when you've exhausted your knowledge and learned reasoning steps and still haven't reached a solution.

Perhaps in a domain like math a smallish number of math-specific reasoning steps will go a long way, but math itself has many sub-domains (algebra, geometry, calculus, topology, etc.), and AFAIK the techniques of one branch are only useful in another to the extent that you can map a problem from one domain to the other.

guyomes · a year ago
I wonder if their curated set of 817 math problems is also useful as teaching material for training math students on a diverse set of problems.

Deleted Comment

Limoynada · a year ago
If the LIMO hypothesis is true - that small models have a latent capacity for efficient reasoning which can be elicited by fine-tuning on a small dataset - then we could see a huge transfer of power from huge models to small ones, and doing that recursively seems to offer unlimited power. But to feed that loop, the datasets would need a particular property: they must teach the model to adapt its reasoning to its size, verified by the model extending the depth of its reasoning chain while using a small branching factor in the exploration space - like a minimum cover for detecting deep patterns.