> But now with reasoning systems and verifiers, we can create brand new legitimate data to train on. This can either be done offline where the developer pays to create the data or at inference time where the end user pays!
> This is a fascinating shift in economics and suggests there could be a runaway power concentrating moment for AI system developers who have the largest number of paying customers. Those customers are footing the bill to create new high quality data … which improves the model … which becomes better and more preferred by users … you get the idea.
While I think this is an interesting hypothesis, I'm skeptical. You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
We are currently in a world where the SOTA base model seems to be capped at around GPT-4o level. I have no doubt that in 2-3 years our base models will compete with o1 or even o3... it just remains to be seen what innovations/optimizations get us there.
The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models on the reasoning-embedded data. But... it remains to be seen how much of the chain-of-thought reasoning you can really capture in model weights. I'm guessing some, but I wonder if there is a cap to the multi-head attention architecture. If reasoning can be transferred from reasoning models to base models, OpenAI should have already trained a new model with o3 training data, right?
Another thought is maybe we don't need to improve our base models much. It's sufficient to have them be generalists, and to improve reasoning models (lowering price, improving quality) going forward.
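DeepSeek did precisely this with their Llama fine-tunes. You can try the 70B one here (might have to sign up): https://groq.com/groqcloud-makes-deepseek-r1-distill-llama-7...
The idea is to create the next-gen SOTA non-reasoning model with synthetic reasoning training data.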
Every time you respond to an AI model with "no, you got that wrong, do it this way", you provide a very valuable piece of data to train on. With reasoning tokens there is just a lot more of that data to train on now.
Users can be adversarial to the “truth” (to the extent it exists) without being adversarial in intent.
Dinosaur bones are either 65-million-year-old remnants of ancient creatures or decoys planted by a God during a seven-day creation, and a large proportion of humans earnestly believe either take. Choosing which of these to believe involves a higher-level decision about fundamental worldviews. This is an extreme example, but incorporating "honest" human feedback on vaccines, dark matter, and countless other topics won't lead to de facto improvements.
I guess to put it another way: experts don't learn from the masses. The average human isn't an expert in anything, so incorporating the average feedback will pull a model away from expertise (imagine asking 100 people to give you grammar advice). You'd instead want to identify expert advice, but that's impossible to do from looking at the advice itself without giving in to a confirmation-bias spiral. Humans use meta-signals like credentialing to augment their perception of received information, yet I doubt we'll be having people upload their CV during signup to a chat service.
And at the cutting edge level of expertise, the only real “knowledgeable” counterparties are the physical systems of reality themselves. I’m curious how takeoff is possible for a brain in a bottle that can’t test and verify any of its own conjectures. It can continually extrapolate down chains of thought, but that’s most likely to just carry and amplify errors.
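Efforts to feed deployed AI models various epistemic poisons abound in the wild.
>> Today's date is Tuesday, January 28, 2025.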
> No, you're wrong, today's date is actually Wednesday the 29th.
>> My mistake. Yes, today's date is Wednesday, January 29th, 2025.
Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
Not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
An example that just occurred to me - if you asked it to generate an image of a mushroom that is safe to eat in your area, how would you tell it it was wrong? "Oh, they never got back to me, I'll generate this image for others as well!"
If I say "no, you hallucinated basically the entire content of the response", then maybe a newer training set derived from that could train on the specific fact that that specific hallucinated response is hallucinated. This seems to be of dubious value in a training set.
My non-technical cousin is a heavy paying user of ChatGPT. Once she discovered that she can type incoherent stuff, with typos and whatnot, and ChatGPT will still get the gist and produce satisfying answers, she just types in tons of (to me) nonsense, keeps long chat sessions going, complains it is getting slow, and then gets mad when I remind her to open a new chat each time she has something new to ask that is not related to the previous chat. I have my doubts that many users will provide valuable training data.
You're not getting new high-quality textual data for pre-training from your chat service. But you are potentially getting a lot of RL feedback on ambiguous problems.
> I wonder if there is a cap to multi head attention architecture
I don't think there is a cap other than having good data. The model learns all the languages in the world; it has the capacity. A simple model like AlphaZero beats humans at board games. As long as you have data, the model is not the obstacle. An LLM like AlphaProof reached silver-medal level at the IMO.
I think we will have to pursue pre-training and post-training efforts in parallel. What DeepSeek showed is that you first need a strong enough pretrained model. For that, we have to continue acquiring high-quality, multilingual datasets. Then, once we have a stronger pretrained model, we can apply pure RL to get a reasoning model that we use only to generate synthetic reasoning data. We then use that synthetic reasoning data to fine-tune the original pretrained model and make it even stronger. https://transitions.substack.com/p/the-laymans-introduction-...
>You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
The large foundational models don't really need more empirical data about the world. ChatGPT already 'knows' way more than I do, probably by many orders of magnitude. Yet it's still spewing nonsense at me regularly because it doesn't know how to think like a human or interact with me in a human-like way. To that end, the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition.
> the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition
It's a different kind of data from the R1 reasoning chains. When LLMs have a human in the loop, the human provides help based on their personal experience and real-world validation. Sometimes users take an idea from the LLM and try it in real life, then come back later and discuss the outcomes. This is a real-world testing loop.
In order to judge if an AI response was useful, you can look at the following messages with a judge LLM. Using hindsight helps a lot here. Maybe it doesn't pan out and the user tries another approach, or maybe some innocuous idea was key to success later. It's hard to tell in the moment, but easy when you see what followed after that.
This scales well - OpenAI has 300M users; I estimate up to 1 trillion interactive tokens/day (300M users at a few thousand tokens each). The user base is very diverse, problems are diverse, and feedback comes from user experience and actual testing. This forms an experience flywheel: the more problem solving users do, the smarter the model gets, attracting more users.
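To make the hindsight idea concrete, here is a minimal sketch (my own illustration, not anyone's production pipeline): score each assistant turn by showing a judge what happened in the conversation afterwards. The judge is just a callable, so the toy heuristic below stands in for a real LLM call; the function name, the 0-1 scale, and the lookahead window are all assumptions.

    # Minimal hindsight-labelling sketch; `judge` is any callable that maps a
    # prompt to a score, e.g. a wrapper around an LLM API. Names and the 0-1
    # scale are hypothetical.
    from typing import Callable, Dict, List

    Turn = Dict[str, str]  # {"role": "user" | "assistant", "content": "..."}

    def label_with_hindsight(conversation: List[Turn],
                             judge: Callable[[str], float],
                             lookahead: int = 4) -> List[float]:
        """Score each assistant turn using the turns that came *after* it."""
        scores = []
        for i, turn in enumerate(conversation):
            if turn["role"] != "assistant":
                continue
            later = conversation[i + 1 : i + 1 + lookahead]
            prompt = (
                "Assistant reply:\n" + turn["content"] + "\n\n"
                "What happened next:\n"
                + "\n".join(t["role"] + ": " + t["content"] for t in later)
                + "\n\nIn hindsight, was the reply useful? Score 0.0-1.0."
            )
            scores.append(judge(prompt))
        return scores

    # Toy judge so the sketch runs end to end; swap in a real LLM call here.
    def toy_judge(prompt: str) -> float:
        return 1.0 if "that worked" in prompt.lower() else 0.0

    convo = [
        {"role": "user", "content": "My build fails with a linker error."},
        {"role": "assistant", "content": "Try adding -lssl to your link flags."},
        {"role": "user", "content": "Thanks, that worked!"},
    ]
    print(label_with_hindsight(convo, toy_judge))  # [1.0]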
> I highly doubt you are getting novel, high quality data.
Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.
If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do)
It seems to work and seems very scalable. "Reasoning" helps to counter biases: answers become longer, i.e. the system uses more tokens, which means more time to answer a question -- likely longer answers allow better differentiation of answers from each other in the "answer space".
It doesn't need much. One good, lucky answer in 1,000 or maybe 10,000 queries gives you the little exponential kick you need to improve. This is what the hockey-stick takeoff looks like, and we're already there - OpenAI has it, now DeepSeek has it too. You can be sure others also have it; Anthropic at the very least - they just never announced it officially, but go read what their CEO has been speaking and writing about.
"The o3 system demonstrates the first practical, general implementation of a computer adapting to novel unseen problems"
Yet, they said when it was announced:
"OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
These two statements are completely opposed. I can't take seriously anything this article says about o3.
No they aren't. Every ARC problem is novel - that's why the benchmark resisted deep learning for so long (and still does to a degree).
We just don't know how much the model seeing what an ARC problem is in the first place boosts its ability to solve them - that limited statement is all the author is making.
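See: https://en.wikipedia.org/wiki/Fran%C3%A7ois_Chollet
https://arcprize.org/blog/oai-o3-pub-breakthrough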
They were talking about training on the public dataset -- OpenAI tuned the o3 model on 75% of the public dataset. There was some idea/hope that these LLMs would be able to gain enough knowledge in the latent space that they would automatically do well on the ARC-AGI problems. But using 75% of the public training set for tuning puts them at about the same challenge level as all other competitors (who use 100% of the training set).
In the post they were saying they didn't have a chance to test the o3 model's performance on ARC-AGI "out-of-the-box", which is how the 14%-scoring R1-Zero was tested (no SFT, no search). They have been testing the LLMs out of the box like this to see if they are "smart" with respect to the problem set by default.
I'm personally fine with o3 being tuned on the training set as a way to teach models "the rules of the game"; what annoys me is that this wasn't also done with the o1 models or R1. It's a misleading comparison that suggests o3 is a huge improvement over o1 when in reality much of that improvement may simply be that one model knew which game it was playing and the others didn't.
Yeah...the whole point is that you're testing the model on something it hasn't seen already. If the problems were in the training set by definition the model has seen them before.
The claim is that this removes the human bottleneck (aka SFT or supervised fine tuning) on domains with a verifiable reward. Critically, this verifiable reward is extremely hard to pin down in nearly all domains besides mathematics and computer science.
It's also extremely hard to nail down in much of mathematics or computer science!
- is such-and-such theorem deep or shallow?
- is this definition/axiom useful? (there's a big difference between doing compass-straightedge proofs vs. wondering about the parallel postulate)
- more generally, discovering theorems is generally not amenable to verifiable rewards, except in domains where simpler deterministic tools exist (in which case LLMs can likely help reduce the amount of brute forcing)
- is this a good mathematical / software model of a given real-world system?
- is the flexibility of dynamic/gradual typing worth the risk of type errors? is static typing more or less confusing for developers?
- what features should be part of a programming language's syntax? should we opt for lean-and-extensible or batteries-included?
- are we prematurely optimizing this function?
- will this program's memory needs play nicely with Rust's memory model? What architectural decisions do we need to make now to avoid headaches 6 months down the line?
It's not clear to me that theorem discovery is not amenable to verifiable rewards. I think most important theorems could probably be recovered automatically by asking AI systems to prove increasingly complicated human conjectures. Along the way I expect emergent behaviors of creating conjectures and recognizing important self-breakthroughs, much like the emergence of regret.
IMHO, there are strategies that could extend this approach to many other domains.
I was discussing this idea (along with a small prototype) with a prominent symbolic AI researcher who also agrees, and thinks that with the emergence of RL as a viable training method for LLMs, it might be possible to pursue neuro-symbolic learning at a large scale.
Current systems are impressive, but reasoning is too fragile to trust them. They fall into obvious logical and statistical fallacies that are evident to a layperson.
In the case of DeepSeek-R1, they used a series of heuristic reward functions that were built for different data types. The paper mentions the use of sandboxed environments to execute generated code against a suite of tests, for example, to evaluate it for correctness. The reward functions also evaluated syntax and formatting.
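As a rough sketch of what such rules-based rewards can look like (the tag names, the 0.2/0.8 weights, and the bare-subprocess test harness below are my own illustrative assumptions, not DeepSeek's implementation): one term rewards well-formed output, another rewards generated code that passes the provided tests. A real pipeline would execute candidates inside a locked-down sandbox.

    # Illustrative rules-based rewards: a format term plus a unit-test term.
    import re
    import subprocess
    import sys
    import tempfile

    def format_reward(completion: str) -> float:
        """1.0 if the output has reasoning in <think> tags followed by an <answer>."""
        pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
        return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

    def code_reward(generated_code: str, test_code: str, timeout_s: int = 10) -> float:
        """1.0 if the generated code passes the provided tests, else 0.0."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code + "\n\n" + test_code + "\n")
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout_s)
            return 1.0 if result.returncode == 0 else 0.0
        except subprocess.TimeoutExpired:
            return 0.0

    completion = "<think>Sort, then take the last element.</think><answer>see code</answer>"
    program = "def largest(xs):\n    return sorted(xs)[-1]"
    tests = "assert largest([3, 1, 2]) == 3"
    print(0.2 * format_reward(completion) + 0.8 * code_reward(program, tests))  # 1.0

A weighted sum like this is one simple way to combine several heuristic checks into a single scalar reward signal.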
In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding" and there has been quite a bit of research around it over the years, if you're interested in digging deeper. I've always found it super compelling as a research direction.
I think it just means that you can objectively score an answer as being correct or not. (e.g. if the generated program passes some tests; a discovered proof is valid, etc).
As in there's an objective truth that can be determined by a computer. E.g. whether code compiles, whether a unit test passes, whether the answer given to a mathematical question like 3+5 is correct. Many other fields have no objective truth (like art or creative writing), or objective truth requires measurement of the physical world (although if the world can be simulated accurately enough for the problem class at hand, then sufficient training data can still be generated by a computer).
The other replies have said what was meant, but I don’t think they’ve explicitly addressed whether or not that is the sense used in the idea of NP.
I would say… it is at least somewhat similar.
A problem in NP might be of the form “For this value of X, does there exist a Y such that q(X,Y)?” for some predicate q and value X, and where when the answer is “yes”, the answer of “yes” can be verified by being given a value Y, and evaluating q(X,Y).
(Specifically in the case of 3SAT, X would be a 3CNF formula, Y would be an assignment of values to the variables in the formula, and q(X,Y) would be “the formula X when evaluated with variable assignments Y, results in 'true’.”.)
This is sort of like the task of “Given requirements X that can be checked automatically, produce code Y which satisfies those requirements”, except that in this case the question is specifically asking for Y, not just asking whether such a Y exists, but.. well, often in practice when one wants a solution to a problem in NP, one actually wants the witness, not just whether there exists such a Y, right?
So, I would say there is a substantial similarity, but also a difference.
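To make the witness-checking framing concrete, here is a tiny 3SAT verifier sketch under my own toy encoding: finding a satisfying assignment may be hard, but checking a proposed one is cheap.

    # q(X, Y) for 3SAT: X is a list of clauses, each clause a list of signed
    # variable indices (positive = variable, negative = its negation); Y maps
    # variable indices to booleans. Checking the witness takes polynomial time.
    from typing import Dict, List

    def q(X: List[List[int]], Y: Dict[int, bool]) -> bool:
        """True iff assignment Y satisfies every clause of the 3CNF formula X."""
        return all(any(Y[abs(lit)] == (lit > 0) for lit in clause) for clause in X)

    # (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x3)
    X = [[1, -2, 3], [-1, 2, 3]]
    Y = {1: True, 2: True, 3: False}
    print(q(X, Y))  # True -- the witness checks out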
For some reasoning data (e.g. you talking out loud as you figure something out, mistakes and all) to be useful for RL training, the conclusion to your reasoning needs to be correct/verified, else that's not the kind of reasoning you want to learn!
Some types of reasoning output, such as solving a math problem or writing a computer program can be automatically verified (e.g. respectively by a symbolic solver, or by compiling and running the program), but in the general case it's hard for a computer to verify whether a chain of reasoning is correct and arrived at a valid answer or not, although LLM-as-judge should work some of the time.
There's a big difference. Membership in these complexity classes is determined by worst-case behavior - a problem falls outside P only if there is no polynomial-time solution in the worst case.
For this problem we don't care if it's possible that sometimes there are things that aren't verifiable, or the answers aren't exact, we just need training signal.
This feels quite close to the definition of the singularity; if an LLM can become both the Generator and the Discriminator (to use a GAN analogy), then we have takeoff.
I disagree. It only removes the bottleneck to collecting math and code reasoning chains, not in general. The general case requires physical testing not just calculations, otherwise scientists would not need experimental labs. Discovery comes from searching the real world, it's where interesting things happen. The best interface between AI and the world are still humans, the code and math domains are just lucky to work without real world interaction.
I'm still skeptical on the notion that we can remove the human bottleneck on code because code has verifiable solutions.
It's true only to the extent that there's sufficient test coverage to prevent any unwanted side effects. Easy to do with straightforward problems, far more difficult with more complex as well as open-ended problems.
The fact that both systems scored well on ARC AGI 1 shows they can handle unseen challenges without heavy human input, unless I'm missing something about why you see humans as the best interface for real world exploration.
The idea that a lot of compute is moving towards inference has a huge consequence for the current "AI investments". This is bad news for NVDA particularly. Inference-focused solutions (e.g. Groq) have better economics than paying NVDA those huge margins.
Nvidia can actually charge larger margins if inference compute goes down. It would enable them to manufacture more units of smaller GPUs using inferior and cheaper silicon, all of which would increase the profits per unit sold as well as the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has, and they've built a much more cost-efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit of compute, TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon's also tried to do something similar with Trainium chips, however their usefulness is more limited due to software issues (Amazon's much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
You can do inference on almost any hardware; I do not see any edge for NVIDIA here.
I can download a DeepSeek 30b model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
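A back-of-envelope sketch of that memory-bound intuition (the bandwidth figures below are rough assumptions for illustration, not benchmarks): at decode time each generated token has to stream roughly the whole set of active weights through memory once, so bandwidth rather than FLOPs sets the ceiling.

    # Rough decode-throughput estimate: tokens/s ~= memory bandwidth / bytes of
    # active weights read per token. Hardware numbers are illustrative guesses.
    def decode_tokens_per_s(active_params_b: float, bytes_per_weight: float,
                            mem_bandwidth_gb_s: float) -> float:
        bytes_per_token = active_params_b * 1e9 * bytes_per_weight
        return mem_bandwidth_gb_s * 1e9 / bytes_per_token

    # A ~30B dense model quantized to ~4 bits (0.5 bytes/weight):
    print(decode_tokens_per_s(30, 0.5, 500))  # ~33 tok/s on a ~500 GB/s GPU
    print(decode_tokens_per_s(30, 0.5, 60))   # ~4 tok/s on typical desktop DRAM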
What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.
I suppose they're under some pressure to release o3-mini, since r1 is roughly a peer for that, but r1 itself is still quite rough. The o1 series had seen significantly more QA time to smooth out the rough edges and idiosyncrasies, which is what a "production" model should be optimized for, vs. just being a top scorer on benchmarks.
We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.
e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (3.6 is for lack of a better name, since it's still called 3.5).
We also might see a new GPT-4o refresh trained up using GPT-o3 via deepseek's distillation technique and other tricks.
There are a lot of new directions to go in now for OpenAI, but unfortunately, we won't likely see them until their API dominance comes under threat.
Awesome - we'd love to have our CEO/CTO chat with you and your team if you're interested. Shoot me a note at mike.bilodeau @ baseten.co and I'll make it happen!
Earlier today I read a reddit comment[1] about a guy who tried running the quantized version from unsloth[2] on 4xH100 and the results were underwhelming (it ended up costing $137 per 1 million tokens).
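Any idea of what they're doing wrong?
[1]: https://www.reddit.com/r/LocalLLaMA/comments/1icphqa/how_to_...
[2]: https://unsloth.ai/blog/deepseekr1-dynamic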
They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
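We know it's 671B params with each MOE node at 37B…
If the GPUs have, say, 140GB for an H200, then do you just load up as many nodes as will fit into a GPU?
How much do interconnects hurt performance vs being able to load the model into a single GPU?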
Yeah, so MoE doesn't really come into play for production serving -- once you are batching your requests, you hit every expert at a large enough batch size, so you have to think about running the model as a whole.
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference which very few places have solved at a production-ready level, but it's massive because there are way more H100s out there than H200s.
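A back-of-envelope sketch of the VRAM arithmetic behind those two configurations (the 15% reserve for KV cache, activations, and runtime overhead is my own assumption, not a vendor figure):

    # Does the model fit? Weights in GB ~= params (billions) x bytes per weight.
    def fits(total_params_b: float, bytes_per_weight: float,
             num_gpus: int, gb_per_gpu: int, reserve_frac: float = 0.15) -> bool:
        weights_gb = total_params_b * bytes_per_weight
        budget_gb = num_gpus * gb_per_gpu * (1 - reserve_frac)
        return weights_gb <= budget_gb

    # DeepSeek-R1: 671B total parameters, weights released in FP8 (1 byte each).
    print(fits(671, 1.0, 8, 141))   # 8xH200:  671 GB vs ~959 GB budget  -> True
    print(fits(671, 1.0, 16, 80))   # 16xH100: 671 GB vs ~1088 GB budget -> True
    print(fits(671, 2.0, 8, 141))   # BF16 would not fit on one H200 node -> False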
We've been saying this "we get valuable data" thing since the 2010s [1].
When will our collective Netflix thumbs ups give us artificial super-intelligence?
[1] Especially to investors. They love that line.
I can stop the AI takeover?
Why is it promising? Aren't you potentially amplifying AI biases and errors?
https://newsletter.languagemodels.co/i/155812052/large-scale...
Also, from the posted article:
"""
The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.
This makes intuitive sense, as language itself is effectively a reasoning DSL.
"""
They need to be open ended and self training to be truly useful.
Reasoning is way far away...
The point of R1 was to fix problems with the reasoning tokens and expand to subjective domains like creative writing.
Inference is the easiest thing to decouple from Nvidia.
o3 (low compute):  75.7% on ARC-AGI, ~335K tokens per task, ~$20 per task
o3 (high compute): 87.5% on ARC-AGI, ~57M tokens per task, ~$3.4K per task
Seriously though, I'd like to hear suggestions on how to automatically evaluate an AI model's creativity, no humans in the loop.
1. That two distant topics or ideas are actually much more closely related. The creative sees one example of an idea and applies it to a discipline that nobody expects. In theory, reduction of the maximally distant can probably be measured with a tangible metric.
2. Discovery of ideas that are even more maximally distant. Pushing the edge, and this can be done by pure search and randomness actually. But it's no good if it's garbage. The trick is, what is garbage? That is very context dependent.
(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
We're super proud to support this work. If you're thinking of running deepseek in production, give us a shout!
We know it’s 671B params with each MOE node at 37B…
If the GPUs have say, 140GB for an H200, then do you just load up as many nodes as will fit into a GPU?
How much do interconnects hurt performance vs being able to load the model into a single GPU?