This is why AI coding assistance will leap ahead in the coming years. Chat AI has no clear reward function (it's basically impossible to judge the quality of a response to an open-ended question like the historical causes of a war). Coding AI can write tests, write code, compile, examine failed test cases, search for different coding solutions that satisfy more test cases or rewrite the tests, all in an unsupervised loop. And then the whole process can turn into training data for future AI coding models.
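A minimal sketch of that loop, assuming a hypothetical `llm.generate` client (any code model would do); the reward is just the fraction of test cases the candidate passes:

```python
def reward(candidate_src: str, test_cases: list[tuple[tuple, object]]) -> float:
    """Execute candidate code and score it against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # "compile and run"
        solve = namespace["solve"]       # convention: entry point is solve()
    except Exception:
        return 0.0                       # doesn't even compile / define solve
    passed = 0
    for args, expected in test_cases:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass                         # runtime failure on this case
    return passed / len(test_cases)

def improve(prompt: str, test_cases, attempts: int = 8) -> str:
    best_src, best_score = "", 0.0
    for _ in range(attempts):
        src = llm.generate(prompt, feedback=best_score)  # hypothetical API
        score = reward(src, test_cases)
        if score > best_score:
            best_src, best_score = src, score
        if best_score == 1.0:
            break
    # the (src, score) pairs can be logged as training data for future models
    return best_src
```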
I expect language models to also get crazy good at mathematical theorem proving. The search space is huge but theorem verification software will provide 100% accurate feedback that makes real reinforcement learning possible. It's the combination of vibes (how to approach the proof) and formal verification that works.
Formal verification of program correctness never got traction because it's so tedious and most of the time approximately correct is good enough. But with LLMs in the mix the equation changes. Having LLMs generate annotations that an engine can use to prove correctness might be the missing puzzle piece.
Does programming have a clear reward function? A vague description from a business person is not it. By the time someone (a programmer?) has written a reward function that is clear enough, how would that function look compared to a program?
Exactly, and people have been saying this for a while now. If an "AI software engineer" needs a perfect spec with zero ambiguity, all edge cases defined, full test coverage with desired outcomes etc., then the person writing the spec is the actual software engineer, and the AI is just a compiler.
Programming has a clear reward function when the problem being solved is well-specified, e.g., "we need a program that grabs data from these three endpoints, combines their data in this manner, and returns it in this JSON format."
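For a spec like that, the reward function really can be a few lines (the fixture data and combine rule below are invented for illustration):

```python
import json

def reward(combine) -> bool:
    # Frozen fixtures stand in for the three endpoints.
    a = {"id": 1, "name": "Ada"}
    b = {"id": 1, "email": "ada@example.com"}
    c = {"id": 1, "active": True}
    expected = {"id": 1, "name": "Ada",
                "email": "ada@example.com", "active": True}
    # The spec: combine the three payloads and return them as JSON.
    return json.loads(combine(a, b, c)) == expected
```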
The same is true for math. There is a clear reward function when the goal is well-specified, e.g., "we need a sequence of mathematical statements that prove this other important mathematical statement is true."
Very good point. For some types of problems maybe the answer is yes. For example, porting. The reward function is testing that the program behaves the same in the new language as in the old one. Tricky for apps with a GUI, but it doesn't seem impossible.
The interesting kind of programming is the kind where I'm figuring out what I'm building as part of the process.
Maybe AI will soon be superhuman in all the situations where we know exactly what we want (win the game), but not in the areas we don't. I find that kind of cool.
Full circle, except instead of determinism you introduce some randomness. Not good.
Also, reasoning is something business is ambivalent about. The majority of planning and execution teams stick to fixed processes. I see far more potential in automating those processes than in automating every part of app production.
Businesses are going to have a hard time if they believe they alone can orchestrate some AI consoles.
“A precise enough specification is already code”, which means we'll not run out of developers in the short term. But the day to day job is going to be very different, maybe as different as what we're doing now compared to writing machine code on punchcards.
Certainly "compiled" is one reward (although a blank file fits that...)
Another is test cases, input and output. This doesn't work on a software-wide scale but function-wide it can work.
In the future I think we'll see more of this test-driven development, where developers formally define the requirements and expectations of a system and then an LLM (combined with other tools) generates the implementation. So instead of writing the implementation, you just declaratively say what the implementation should do (and shouldn't).

Then the LLM generates and iterates on the code until it passes all of the tests. New requirements? Add more tests and repeat.

This would be legitimately paradigm shifting, vs. the supercharged autocomplete driven by LLMs we have today.
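A sketch of what such a declarative spec might look like, using pytest; `billing`, `make_invoice`, `refund`, and `RefundError` are hypothetical names the generated implementation would have to provide:

```python
import pytest
from billing import make_invoice, refund, RefundError  # to be generated

def test_refund_never_exceeds_amount_paid():
    invoice = make_invoice(total=100, paid=100)
    assert refund(invoice, requested=150) == 100

def test_refund_on_unpaid_invoice_is_rejected():
    invoice = make_invoice(total=100, paid=0)
    with pytest.raises(RefundError):
        refund(invoice, requested=10)
```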
I think you could set up a good reward function for a programming assistance AI by checking that the resulting code is actually used. Flag or just 'git blame' the code produced by the AI with the prompts used to produce it, and when you push a release, it can check which outputs were retained in production code from which prompts. Hard to say whether code that needed edits was because the prompt was bad or because the code was bad, but at least you can get positive feedback when a good prompt resulted in good code.
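A rough sketch of that retention signal, assuming (as an invented convention, not a real tool) that AI-generated commits carry a `Prompt-Id=` marker in their commit summaries:

```python
import subprocess
from collections import Counter

def retained_lines_by_prompt(path: str) -> Counter:
    """Count surviving lines in `path` attributed to each prompt id."""
    blame = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout
    counts: Counter = Counter()
    for line in blame.splitlines():
        # --line-porcelain emits a "summary <text>" header for every line;
        # our invented convention puts "Prompt-Id=<id>" in that summary.
        if line.startswith("summary ") and "Prompt-Id=" in line:
            prompt_id = line.split("Prompt-Id=")[1].split()[0]
            counts[prompt_id] += 1
    return counts
```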
> Does programming have a clear reward function? A vague description from a business person isn't it. By the time someone (a programmer?) has written a reward function that is clear enough, how would that function look compared to a program?
Well, to give an example: the complexity class NP is all about problems that have quick and simple verification, but finding solutions for many problems is still famously hard.
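A concrete instance: checking a 3-SAT assignment takes linear time, while finding one is NP-hard.

```python
# Clauses are lists of signed variable ids: 1 means x1, -2 means NOT x2.
def verify(clauses: list[list[int]], assignment: dict[int, bool]) -> bool:
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 OR NOT x2) AND (x2 OR x3)
clauses = [[1, -2], [2, 3]]
print(verify(clauses, {1: True, 2: False, 3: True}))  # True
```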
So there are at least some domains where this model would be a step forward.
If we will struggle to create reward functions for AI, then how different is that from the struggles we already face when divvying up product goals into small tasks to fit our development cycles?
In other words, to what extent does Agile's ubiquity prove our competence in turning product goals into de facto reward functions?
There's no reward function in the sense that optimizing the reward function means the solution is ideal.
There are objective criteria like 'compiles correctly' and 'passes self-designed tests' and 'is interpreted as correct by another LLM instance' which go a lot further than criteria that could be defined for most kinds of verbal questions.
Much business logic is really just a state machine where all the states and all the transitions need to be handled. When a state or transition is under-specified an LLM can pass the ball back and just ask what should happen when A and B but not C. Or follow more vague guidance on what should happen in edge cases. A typical business person is perfectly capable of describing how invoicing should work and when refunds should be issued, but very few business people can write a few thousand lines of code that covers all the cases.
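A toy version of that idea; the states and transitions below are invented, but the point is that any unhandled (state, event) pair surfaces immediately as a question to pass back:

```python
TRANSITIONS = {
    ("draft", "send"): "awaiting_payment",
    ("awaiting_payment", "pay"): "paid",
    ("awaiting_payment", "cancel"): "cancelled",
    ("paid", "refund"): "refunded",
}

def step(state: str, event: str) -> str:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        # Exactly where the LLM would "pass the ball back" to the business side:
        raise ValueError(f"Under-specified: what happens on '{event}' in state '{state}'?")
```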
It's easy to imagine why something could never work.
It's more interesting to imagine what just might work. One thing that has plagued programmers for the past decades is the difficulty of writing correct multi-threaded software. You need fine-grained locking otherwise your threads will waste time waiting for mutexes. But color-coding your program to constrain which parts of your code can touch which data and when is tedious and error-prone. If LLMs can annotate code sufficiently for a SAT solver to prove thread safety that's a huge win. And that's just one example.
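A toy stand-in for that check, in plain Python rather than a real SAT solver; the annotations (which a human or an LLM would write) are invented:

```python
# Each function is annotated with the locks it holds and the shared data
# it touches; the checker proves every datum has one common guard.
ANNOTATIONS = {
    "deposit":  {"locks": {"accounts_mu"}, "touches": {"balances"}},
    "withdraw": {"locks": {"accounts_mu"}, "touches": {"balances"}},
    "report":   {"locks": set(),           "touches": {"balances"}},  # the bug
}

def check_thread_safety(annotations: dict) -> list[str]:
    accessors: dict[str, list[set]] = {}
    for fn, a in annotations.items():
        for datum in a["touches"]:
            accessors.setdefault(datum, []).append(a["locks"])
    errors = []
    for datum, lock_sets in accessors.items():
        # every accessor of a datum must hold at least one common lock
        if not set.intersection(*lock_sets):
            errors.append(f"no common lock guards '{datum}'")
    return errors

print(check_thread_safety(ANNOTATIONS))  # flags 'balances' via report()
```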
Code coverage exists. Shouldn't be hard at all to tune the parameters to get what you want. We have really good tools to reason about code programmatically - linters, analyzers, coverage, etc.
Models aren't going to get really good at theorem proving until we build models that are transitive and handle isomorphisms more elegantly. Right now models can't recall factual relationships well in reverse order in many cases, and often fail to answer questions that they can answer easily in English when prompted to respond with the fact in another language.
This reads as a proper marketing ploy. If the current incarnation of AI + coding is anything to go by, it'll take leaps just to make it barely usable (or correct).
My take is the opposite: considering how good AI is at coding right now I'm eager to see what comes next. I don't know what kind of tasks you've tried using it for but I'm surprised to hear someone think that it's not even "barely usable". Personally, I can't imagine going back to programming without a coding assistant.
> Coding AI can write tests, write code, compile, examine failed test cases, search for different coding solutions that satisfy more test cases or rewrite the tests, all in an unsupervised loop. And then the whole process can turn into training data for future AI coding models.
This is interesting, but doesn't it still need supervision? Why wouldn't it generate tests for properties you don't want? It seems to me that it might be able to "fill in gaps" by generalizing from "typical software", like, if you wrote a container class, it might guess that "empty" and "size" and "insert" are supposed to be related in a certain way, based on the fact that other peoples' container classes satisfy those properties. And if you look at the tests it makes up and go, "yeah, I want that property" or not, then you can steer what it's doing, or it can at least force you to think about more cases. But there would still be supervision.
Ah -- here's an unsupervised thing: Performance. Maybe it can guide a sequence of program transformations in a profile-guided feedback loop. Then you could really train the thing to make fast code. You'd pass "-O99" to gcc, and it'd spin up a GPU cluster on AWS.
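A minimal sketch of that reward, with hand-written candidates standing in for LLM-proposed transformations: time each one on fixed inputs and keep the fastest that is still correct.

```python
import timeit

def sum_squares_naive(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_builtin(n):
    return sum(i * i for i in range(n))

def fastest_correct(candidates, reference, arg):
    expected = reference(arg)
    ok = [f for f in candidates if f(arg) == expected]  # correctness gate
    return min(ok, key=lambda f: timeit.timeit(lambda: f(arg), number=10))

best = fastest_correct([sum_squares_naive, sum_squares_builtin],
                       sum_squares_naive, 10_000)
print(best.__name__)
```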
Writing tests won't help you here; this problem is the same as in other generation tasks. If the test passes, everything seems okay, right? Consider this: you now have a 50-line function just to display 'hello world'. It outputs 'hello world', so it scores well, but it's hardly efficient. Then there's a function that runs in exponential time instead of the polynomial time that any sensible programmer would use in specific cases. It passes the tests, so it gets a high score. You also have assembly code embedded in C code, executed with 'asm'. It works for that particular case and passes the test, but the average C programmer won't understand what's happening in this code, whether it's secure, etc. Lastly, tests written by AI might not cover all cases; they could even fail to test what you intended because they might hallucinate scenarios (I've experienced this many times). Programming faces similar issues to those seen in other generation tasks in the current generation of large language models, though to a slightly lesser extent.
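The exponential-time failure mode is easy to make concrete; both functions below pass the same tests, so a test-based reward can't tell them apart:

```python
from functools import lru_cache

def fib_exponential(n):          # passes the tests; O(2^n)
    return n if n < 2 else fib_exponential(n - 1) + fib_exponential(n - 2)

@lru_cache(maxsize=None)
def fib_linear(n):               # passes the same tests; O(n)
    return n if n < 2 else fib_linear(n - 1) + fib_linear(n - 2)

# Identical reward signal from a small test suite...
assert [fib_exponential(n) for n in range(10)] == \
       [fib_linear(n) for n in range(10)]
# ...but fib_exponential(40) takes tens of seconds; fib_linear(40) is instant.
```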
> I expect language models to also get crazy good at mathematical theorem proving
Indeed, systems like AlphaProof / AlphaGeometry are already able to win a silver medal at the IMO, and the former relies on Lean for theorem verification [1]. On the open source side, I really like the ideas in LeanDojo [2], which uses a form of RAG to assist the LLM with premise selection.

[1] https://deepmind.google/discover/blog/ai-solves-imo-problems...

[2] https://leandojo.org/
I'm pretty interested in the theorem proving/scientific research aspect of this.
Do you think it's possible that some version of LLM technology could discover new physical theories (that are experimentally verifiable), like for example a new theory of quantum gravity, by exploring the mathematical space?
Edit: this is just incredibly exciting to think about. I'm not an "accelerationist" but the "singularity" has never felt closer...
My hunch is that LLMs are nowhere near intelligent enough to make brilliant conceptual leaps. At least not anytime soon.
Where I think AI models might prove useful is in those cases where the problem is well defined, where formal methods can be used to validate the correctness of (partial) solutions, and where the search space is so large that work towards a proof is based on "vibes" or intuition. Vibes can be trained through reinforcement learning.
Some computer assisted proofs are already hundreds of pages or gigabytes long. I think it's a pretty safe bet that really long and convoluted proofs that can only be verified by computers will become more common.

https://en.wikipedia.org/wiki/Computer-assisted_proof
I think the answer is almost certainly no, and is mostly unrelated to how smart LLMs can get. The issue is that any theory of quantum gravity would only be testable with equipment that is much, much more complex than what we have today. So even if the AI came up with some beautifully simple theory, testing that its predictions are correct is still not going to be feasible for a very long time.
Now, it is possible that it could come up with some theory that is radically different from current theories, where quantum gravity arises very naturally, and that fits all of the other predictions of the current theories that we can measure - so we would have good reasons to believe the new theory and consider quantum gravity probably solved. But it's literally impossible to predict whether such a theory even exists, one that is not mathematically equivalent to QM/QFT but still matches all confirmed predictions.
Additionally, nothing in AI tech so far suggests that current approaches should be any good at this type of task. The only tasks where AI has truly excelled are extremely well defined problems with a huge but finite search space, where partial solutions are easy to grade. Image recognition, game playing, and text translation are the great successes of AI. And performance drops sharply with the uncertainty in the space, and with the difficulty of judging a partial solution.
Finding physical theories is nothing like any of these problems. The search space is literally infinite, partial solutions are almost impossible to judge, and even judging whether a complete solution is good or not is extremely difficult. Sure, you can check if it's mathematically coherent, but that tells you nothing about whether it describes the physical world correctly. And there are plenty of good physical theories that aren't fully formally proven, or weren't at the time they were invented - so mathematical rigour isn't even a very strong signal (e.g. Newton's infinitesimal calculus wasn't considered sound until the 1900s or so, by which time his theories had long since been rewritten in other terms; the Dirac delta wasn't given a precise mathematical definition until much later than its use; and I think QFT still uses some iffy math even today).
IIRC, there have been people doing similar things using something close to brute-force. Nothing of real significance has been found. A problem is that there are infinitely many physically and mathematically correct theorems that would add no practical value.
This is exactly what we are doing with Mutahunter. AI is exceedingly good at writing edge-case tests to test code and will only continue to get better at this. Check out Mutahunter here: https://github.com/codeintegrity-ai/mutahunter
Yes, same for maths. As long as a true reward 'surface' can be optimized. Approximate rewards are similar to approximate, non-admissible heuristics: search eventually misses the true optimal states and favors wrong ones, with side effects in very large state spaces.
>Coding AI can write tests, write code, compile, examine failed test cases, search for different coding solutions that satisfy more test cases or rewrite the tests, all in an unsupervised loop.
Will this be able to be done without spending absurd amounts of energy?
Computer energy efficiency is not as constrained as minimum feature size; it's still doubling every 2.6 years or so.
Even if the energy amounts were absurd, a human-quality AI that runs at human speed on 10x our body's calorie requirements in electricity would still (at electricity prices of USD 0.10/kWh) undercut workers earning the UN abject poverty threshold.
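A back-of-the-envelope check of that claim (all figures approximate):

```python
kcal_per_day = 2000                          # rough human energy budget
kwh_per_day = kcal_per_day * 4184 / 3.6e6    # ~2.3 kWh/day
cost = 10 * kwh_per_day * 0.10               # 10x human energy at $0.10/kWh
print(f"${cost:.2f}/day")                    # ~$2.32/day
# The World Bank extreme-poverty line is roughly $2.15/day,
# so the comparison is indeed in the same ballpark.
```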
Tests don’t prove correctness of the code. What you’d really want instead is to specify invariants the code has to fulfill, and for the AI to come up with a machine-checkable proof that the code indeed guarantees those invariants.
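Property-based testing is a cheap approximation of this: you state the invariants and the machine hunts for counterexamples, though unlike a proof, finding none guarantees nothing. A sketch using hypothesis:

```python
from hypothesis import given, strategies as st

def dedupe(xs: list[int]) -> list[int]:   # implementation under test
    return list(dict.fromkeys(xs))

@given(st.lists(st.integers()))
def test_dedupe_invariants(xs):
    out = dedupe(xs)
    assert len(out) == len(set(out))       # invariant: no duplicates remain
    assert set(out) == set(xs)             # invariant: nothing lost or invented
```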
Once you have enough data points from current usage (and these days every company is tracking EVERYTHING, even eye movement if they could), it's just a matter of time. I do agree though that before we reach AGI we'll have these agents that are really good at a defined mission (like code completion).
It's not even about LLMs IMHO. It's about letting a computer crunch many numbers and find a pattern in the results, in a quasi-religious manner.
A cheap DIY way of achieving the same thing as RLHF is to fine tune the model to append a score to its output every time.
Remember: the reason we need RLHF at all is that we cannot write a loss function for what makes a good answer. There are many ways a good answer could look, and that cannot be calculated on the basis of next-token probability.
So you start by having your vanilla model generate n completions for your prompt. You then manually score them. And then those prompt => (completion, score) pairs become your training set.
Once the model is trained, you may find that you can cheat:
Because if you include the desired score in your prompt, the model will now strive to produce an answer that is consistent with that score.
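A sketch of the data construction, with toy examples; the `[score: n]` format is just one arbitrary choice:

```python
# Manually scored (prompt, completion, score) triples.
scored = [
    ("Explain TCP slow start", "TCP begins with a small congestion window ...", 9),
    ("Explain TCP slow start", "idk, networking stuff", 2),
]

# Phase 1: teach the model to append its own score to each answer.
rate_examples = [f"{p}\n{c}\n[score: {s}]" for p, c, s in scored]

# Phase 2 (the cheat): flip the order so generation is conditioned
# on the desired score.
steer_examples = [f"[score: {s}]\n{p}\n{c}" for p, c, s in scored]
```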
> if you include the desired score in your prompt, the model will now strive to produce an answer that is consistent with that score
But you need a model to generate a score from an answer, and then fine-tune another model to generate an answer conditioned on the score. The first time the score is at the end, and the second time at the beginning. It's how DecisionTransformer works too: it constructs a sequence of (reward, state, action) tuples where the reward conditions the next action.

https://arxiv.org/pdf/2106.01345
By the same logic you could generate tags, including style, author, venue and date. Some will be extracted from the source document, the others produced with classifiers. Then you can flip the order and fine-tune a model that takes the tags before the answer. Then you've got an LLM you can condition on author and style.
I had an idea similar to this for a model that allows you to parameterize a performance vs. accuracy ratio, essentially an imbalanced MoE-like approach where instead of the "quality score" in your example, you assign a score based on how much computation it used to achieve that answer, then you can dynamically request different code paths be taken at inference time.
The problem of various ML algorithms "gaming" the reward function, is rather similar to the problem of various financial and economic issues. If people are not trying to do something useful, and then expecting $$ in return for that, but rather are just trying to get $$ without knowing or caring what is productive, then you get a lot of non-productive stuff (spam, scams, pyramid schemes, high-frequency trading, etc.) that isn't actually producing anything, but does take over a larger and larger percentage of the economy.
To mitigate this, you have to have a system outside of that, which penalizes "gaming" the reward function. This system has to have some idea of what real value is, to be able to spot cases where the reward function is high but the value is low. We have a hard enough time of this in the money economy, where we've been learning for centuries. I do not think we are anywhere close in neural networks.
> This system has to have some idea of what real value is
This is probably the most cursed problem ever.
Assume you could develop such a system, why wouldn't you just incorporate its logic into the original fitness function and be done with it?
I think the answer is that such a system can probably never be developed. At some level humans must be involved in order to adapt the function over time in order to meet expectations as training progresses.
The information used to train on is beyond critical, but heuristics regarding what information matters more than other information in a given context might be even more important.
There is some relation to Gödel's theorems here, about the inherent limitations of any system of logic to avoid both errors of omission and errors of commission. Either there are true things you cannot prove, or things you "prove" that are not true.
In any reward function, either there are valuable things that are not rewarded, or unvaluable things that are. But having multiple systems to evaluate this, does help.
There is a step like this in ML. I think it's pretty interesting that topics from things like economics pop up in ML - although perhaps it's not too surprising as we are doing ML for humans to use.
Karpathy is _much_ more knowledgeable about this than I am, but I feel like this post is missing something.
Go is a game that is fundamentally too complex for humans to solve. We've known this since way back before AlphaGo. Since humans were not the perfect Go players, we didn't use them to teach a model- we wanted the model to be able to beat humans.
I don't see language being comparable. The "perfect" LLM imitates humans perfectly, presumably to the point where you can't tell the difference between LLM-generated text and human-generated text. Maybe it's just as flexible as the human mind is too, and can context-switch quickly, and can quickly swap between formalities, tones, and slangs. But the concept of "beating" a human doesn't really make much sense.
AlphaGo and Stockfish can push forward our understanding of their respective games, but an LLM can't push forward our boundary of language, because it's fundamentally a copy-cat model. This makes RLHF make much more sense in the LLM realm than in the Go realm.
One of the problems lies in the way RLHF is often performed: presenting a human with several different responses and having them choose one. The goal here is to create the most human-like output, but the process is instead creating outputs humans like the most, which can seriously limit the model. For example, most recent diffusion-based image generators use the same process to improve their outputs, relying on volunteers to select which outputs are preferable. This has led to models that are comically incapable of generating ugly or average people, because the volunteers systematically rate those outputs lower.
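For context, those pairwise choices are typically consumed by a reward model trained with a Bradley-Terry style loss; a shape-level sketch in PyTorch, with a hypothetical encoder's outputs faked by random tensors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 16
reward_head = nn.Linear(embed_dim, 1)   # maps response embedding -> scalar reward

def preference_loss(chosen_emb, rejected_emb):
    r_chosen = reward_head(chosen_emb)
    r_rejected = reward_head(rejected_emb)
    # maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = preference_loss(torch.randn(8, embed_dim), torch.randn(8, embed_dim))
loss.backward()
```

Whatever biases the raters have (e.g., never preferring images of average-looking people) are baked straight into that scalar reward.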
The distinction is that LLMs are not used for what they are trained for in this case. In the vast majority of cases someone using an LLM is not interested in what some mixture of OpenAI employees' ratings + the average person would say about a topic; they are interested in the correct answer.
When I ask chatgpt for code I don't want them to imitate humans, I want them to be better than humans. My reward function should then be code that actually works, not code that is similar to humans.
I don’t think it is true that the perfect LLM emulates a human perfectly. LLMs are language models, whose purpose is to entertain and solve problems. Yes, they do that by imitating human text at first, but that’s merely a shortcut to enable them to perform well. Making money via maximizing their goal (entertain and solve problems) will eventually entail self-training on tasks to perform superhumanly on these tasks. This seems clearly possible for math and coding, and it remains an open question about what approaches will work for other domains.
In a sense GPT-4 is self-training already, in that it's bringing in money for OpenAI which is being spent on training further iterations. (this is a joke)
This is a great comment. Another important distinction, I think, is that in the AlphaGo case there's no equivalent to the generalized predict next token pretraining that happens for LLMs (at least I don't think so, this is what I'm not sure of). For LLMs, RLHF teaches the model to be conversational, but the model has already learned language and how to talk like a human from the predict next token pretraining.
Let's say, hypothetically, we do enough RLHF that a model can imitate humans at the highest level. Like, the level of professional researchers on average. Then we do more RLHF.
Maybe, by chance, the model produces an output that is a little better than its average; that is, better than professional researchers. This will be ranked favorably in RLHF.
Repeat this process and the model slowly but surely surpasses the best humans.

Is such a scenario possible in practice?
One thing I’ve wondered about is what the “gap” between current transformer-based LLMs and optimal sequence prediction looks like.
To clarify, current LLMs (without RLHF, etc.) have a very straightforward objective function during training, which is to minimize the cross-entropy of token prediction on the training data. If we assume that our training data is sampled from a population generated via a finite computable model, then Solomonoff induction achieves optimal sequence prediction.
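For reference, that objective in standard notation:

```latex
% Next-token cross-entropy over the training corpus:
\mathcal{L}(\theta) = -\sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```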
Assuming we had an oracle that could perform SI (since it’s uncomputable), how different would conversations between GPT4 and SI be, given the same training data?
We know there would be at least a few notable differences. For example, we could give SI the first 100 digits of pi, and it would give us as many more digits as we wanted. Current transformer models cannot (directly) do this. We could also give SI a hash and ask for a string that hashes to that value. Clearly a lot of hard, formally-specified problems could be solved this way.
But how different would SI and GPT4 appear in response to everyday chit-chat? What if we ask the SI-based sequence predictor how to cure cancer? Is the “most probable” answer to that question, given its internet-scraped training data, an answer that humans find satisfying? Probably not, which is why AGI requires something beyond just optimal sequence prediction. It requires a really good objective function.
My first inclination for this human-oriented objective function is something like “maximize the probability of providing an answer that the user of the model finds satisfying”. But there is more than one user, so over which set of humans do we consider and with which aggregation (avg satisfaction, p99 satisfaction, etc.)?
So then I’m inclined to frame the problem in terms of well-being: “maximize aggregate human happiness over all time” or “minimize the maximum of human suffering over all time”. But each of these objective functions has notable flaws.
Karpathy seems to be hinting toward this in his post, but the selection of an overall optimal objective function for human purposes seems to be an incredibly difficult philosophical problem. There is no objective function I can think of for which I cannot also immediately think of flaws with it.
>But how different would SI and GPT4 appear in response to everyday chit-chat? What if we ask the SI-based sequence predictor how to cure cancer?
I suspect that a lot of LLM prompts that elicit useful capabilities out of imperfect sequence predictors like GPT-4 are in fact most likely to show up in the context of "prompting an LLM" rather than being likely to show up "in the wild".
As such, to predict the token after a prompt like that, an SI-based sequence predictor would want to predict the output of whatever language model was most likely to be prompted, conditional on the prompt/response pair making it into the training set.
If the answer to "what model was most likely to be prompted" was "the SI-based sequence predictor", then it needs to predict which of its own likely outputs are likely to make it into the training set, which requires it to have a probability distribution over its own output. I think the "did the model successfully predict the next token" reward function is underspecified in that case.
There are many cases like this where the behavior of the system in the limit of perfect performance at the objective is undesirable. Fortunately for us, we live in a finite universe and apply finite amounts of optimization power, and lots of things that are useless or malign in the limit are useful in the finite-but-potentially-quite-large regime.
> What if we ask the SI-based sequence predictor how to cure cancer? Is the “most probable” answer to that question, given its internet-scraped training data, an answer that humans find satisfying?
You defined your predictor as being able to minimize mathematical definitions following some unspecified algebra; why didn't you define it as being able to run chemical and pharmacological simulations through some unspecified model too?
I don’t follow—what do you mean by unspecified algebra? Solomonoff induction is well-defined. I’m just asking how the responses of a chatbot using Solomonoff induction for sequence prediction would differ from those using a transformer model, given the same training data. I can specify mathematically if that makes it clearer…
Alternatively, you include information about the user of the model as part of the context to the inference query, so that the model can uniquely optimize its answer for that user.
Imagine if you could give a model "how you think" and your knowledge, experiences, and values as context, then it's "Explain Like I'm 5" on steroids. Both exciting and terrifying at the same time.
> Alternatively, you include information about the user of the model as part of the context to the inference query
That was sort of implicit in my first suggestion for an objective function, but do you really want the model to be optimal on a per-user basis? There’s a lot of bad people out there. That’s why I switched to an objective function that considers all of humanity’s needs together as a whole.
I think the field of formal proofs, as in Lean, is a natural fit: it has states (the current subgoal), actions (the applicable theorems, especially effective in Lean due to the strong typing of arguments), a progress measure (simplified subgoals), a final goal state (the proof completes), and a hierarchy among the theorems, so there is a "path metric" from simple theorems to complex theorems.

If Karpathy were to focus on automating Lean proofs it could change mathematics forever.
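A toy illustration of exactly that framing, in Lean 4 (Nat.add_comm is a core lemma; the comments show the evolving goal state):

```lean
-- Each tactic is an action that transforms the goal state,
-- and the kernel verifies every step.
example (a b : Nat) : a + b = b + a := by
  -- goal state: ⊢ a + b = b + a
  exact Nat.add_comm a b
  -- goal state: no goals — the final state is reached
```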
DeepMind's recent model is trained with Lean. It scored an olympiad silver medal (and was only one point away from gold).
> AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go
AlphaGo didn't have human feedback, but it did learn from humans before surpassing them. Specifically, it had a network to 'suggest good moves' that was trained on predicting moves from pro-level human games.
The entire point of alpha zero was to eliminate this human influence, and go with pure reinforcement learning (i.e. zero human influence).
A game like Go has a clearly defined objective (win the game or not). A network like you described can therefore be trained to give a score to each move. Point here is that assessing whether a given sentence sounds good to humans or not does not have a clearly defined objective, the only way we came up with so far is to ask real humans.
AlphaGo is an optimization over a closed problem. Theoretically, computers could always have beaten humans at such problems; it's just that, without proper optimization, humans would die before the computer finished its computation. Here, AlphaGo cuts down the computation time by smartly choosing the branches with the highest likelihood.
Unlike the above, open problems can't be solved by computation (in the combinatorial sense). Even humans can only try, and LLMs spew out something that would most likely work, not something inherently correct.
The final conclusion, though, stands without any justification: that LLM + RL will somehow outperform people at open-domain problem solving seems quite a jump to me.
To be fair, it says "has a real shot at" AlphaGo level. AlphaGo clearly beat humans at Go, so thinking that if you could replicate that, it would have a shot doesn't seem crazy to me.
That only makes sense if you think Go is as expressive as written language.
And here I mean that it is the act of making a single (plausible) move that must match the expressiveness of language, because otherwise you're not in the domain of Go but of the far less interesting "I have a 19x19 pixel grid and two colours".
AlphaGo has got nothing to do with LLMs though. It's a combination of RL + MCTS. I'm not sure where you are seeing any relevance! DeepMind also used RL for playing video games - so what?!