Note that for the purposes of this paper a “problem” just means a formally decidable problem or a formal language, and the proof shows that by creatively arranging transformers you can make individual transformer runs behave like individual Boolean circuits. However, this is a long way from any practical application of transformers: for one thing, most problems we care about are not stated as formal languages, and we already have a far more efficient way to implement Boolean circuits.
If a "problem we care about" is not stated as a formal language, does it mean it does not exist in the hierarchy of formal languages? Or is it just as yet unclassified?
It means that there are two problems: one, to formalize the problem as stated while capturing all relevant details, and two, to solve the resulting formal problem. Until you solve problem one, you can't use formal methods to say anything about the problem (it's not even clear a priori that the problem is solvable).
Unfortunately, the task of formalizing an informal problem is itself an informal problem that we don't know how to formalize, so we can't say much about it. So overall, we can't say much about how hard the general problem "given a problem statement from a human, solve that problem" is, whether any particular system (including a human!) can solve it, and how long that might take with what resources.
My 2 cents: since LLMs (Large Language Models) operate as at least a subset of Turing machines (which recognize recursively enumerable languages), the chain-of-thought (CoT) approach could be equivalent to, or even more expressive than, that subset. In fact, CoT could well amount to a full Turing machine.
If we leave CoT aside for a moment, it's worth exploring the work discussed in the paper "Neural Networks and the Chomsky Hierarchy"[1], which analyzes how neural networks (including LLMs) map onto different levels of the Chomsky hierarchy, with a particular focus on their ability to recognize formal languages across varying complexity.
How would that be remarkable, when it is exactly what the Universal Approximation Theorem already states? Since transformers also use fully connected layers, none of this should really come as a surprise. But from glancing at the paper, they don't even mention it.
It's 'remarkable' because (a) academic careers are as much about hype as science, (b) arxiv doesn't have peer review process to quash this, (c) people take arxiv seriously.
>How would that be remarkable, when it is exactly what he Universal Approximation Theorem already states
Only with infinite precision, which is highly unrealistic. Under realistic assumptions, fixed-depth transformers without chain of thought are very limited in what they can express: https://arxiv.org/abs/2207.00729 . Chain of thought increases the class of problems which fixed-depth transformers can solve: https://arxiv.org/abs/2310.07923
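A toy illustration of why externalizing intermediate tokens helps (this is not a transformer, just an analogy to the cited results): a fixed step function that does constant work per call, but appends its output to a scratchpad it can re-read, computes PARITY of arbitrarily long inputs. That is exactly the kind of problem fixed-depth transformers without CoT cannot solve under realistic precision assumptions.

```python
# Sketch: the scratchpad plays the role of chain-of-thought tokens.
# Each call to step() does O(1) work, like one constant-depth pass.

def step(scratchpad: str, next_bit: str) -> str:
    """Constant-work update: fold one input bit into the running parity token."""
    running = scratchpad[-1] if scratchpad else "0"
    return scratchpad + str(int(running) ^ int(next_bit))

def parity_with_cot(bits: str) -> int:
    """Iterate the fixed step over the input, emitting one token per bit."""
    scratchpad = ""
    for b in bits:
        scratchpad = step(scratchpad, b)
    return int(scratchpad[-1]) if scratchpad else 0

print(parity_with_cot("1011"))  # 1 ^ 0 ^ 1 ^ 1 = 1
```

The point is only that iterating a fixed, shallow computation while writing down intermediate results strictly increases what the overall system can compute.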
I'm waiting for the people of AI to discover syllogism and inference in the original Prolog sense, which this CoT abomination basically tries to achieve. Interestingly, if all logical content were translated to rules, and then only those rules were fed into the LLM training set, what would the result be? And could the probabilistic magic be made to actually follow reason, without all the dice?
Right, we’ve now gotten to the stage of this AI cycle where we start using the new tool to solve problems old tools could solve. Saying a transformer can solve any formally decidable problem if given enough tape isn’t saying much. It’s a cool proof, don’t mean to deny that, but it doesn’t mean much practically, as we already have more efficient tools that can do the same.
What I don't get is... didn't people prove that in the 90s for any multi-layer neural network? And didn't people prove transformers are equivalent in the original transformer paper?
I was thinking about the GraphRAG paper and Prolog. I’d like to extract predicates. The source material will be inconsistent, contradictory and incomplete.
Using the clustering (community) model, an LLM can summarize the opinions as a set of predicates, which don’t have to agree, along with some general weight of how much people agree or disagree with them.
The predicates won’t be suitable for symbolic logic because the language will be loose. However, an embedding model may be able to connect different symbols together.
Then you could attempt multiple runs through the database of predicates because there will be different opinions.
Then one could attempt to reason using these loosely stitched predicates. I don’t know how good the outcome would be.
I imagine this would be better in an interactive decision making tool where a human is evaluating the suggestions for the next step.
This could be better for planning than problem solving.
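A toy sketch of the "embedding model connects different symbols" step above. The embed() function here is a stand-in (bag of character trigrams), an assumption made purely so the sketch runs without a real embedding model; the stitching logic would be the same with actual embeddings.

```python
# Hypothetical sketch: link loosely-worded predicate symbols by embedding
# similarity. embed() is a stand-in, NOT a real embedding model.
import math
from collections import Counter

def embed(symbol):
    # Stand-in embedding: character-trigram counts of the symbol text.
    s = f"  {symbol.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def stitch(symbols, threshold=0.5):
    """Pair up predicate symbols whose embeddings are close enough."""
    links = []
    for i, s in enumerate(symbols):
        for t in symbols[i + 1:]:
            if cosine(embed(s), embed(t)) >= threshold:
                links.append((s, t))
    return links

# Loose variants of the same predicate get linked; unrelated ones don't.
links = stitch(["likes_coffee", "likes_coffees", "owns_dog"])
```

With real embeddings the threshold would need tuning, and the linked pairs would still carry the disagreement weights rather than being merged outright.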
Hm... a RAG over a DB of logical rules actually may be interesting. And loosely stitched predicates you could easily put to work, with some random dice to decide the inference.
So how about we start thinking of AI as a combination of the graphical probabilistic whatever, which compresses the information from the training set in a very lossy manner, hooked up, internally or externally, to a discrete logical core whenever CoT is needed. This construct can then benefit from both worlds.
I’m surprised that understanding how a thought unfolds is being considered not relevant to the answer. I have done a lot of problem solving, in groups and alone. How thoughts develop seems fundamental to understanding the solutions.
The story regarding the banning of terms that can be used with a reasoning system is a big red flag to me.
This sort of knee jerk reaction displays immature management and an immature technology product.
A little late to reply, but perhaps you'll see this. Doesn't it strike you that lots of these articles on AI that get published are very childish? Not in the math sense, but in the reasoning sense. Besides, most of them are anything but interdisciplinary. I've almost never encountered prompt engineers who actually tried to delve into what GPTs do, and then these CoT guys, they don't know a thing about predicate logic, yet try to invent it anew.
On your comment regarding banning tokens/terms, we are on the same page. We can agree all of this is very immature, and many of the people too, including this lot of Chinese kids who seem to put out one paper per hour. You see, the original seq2seq paper is 8 pages, topic included. Can you imagine? But Sutskever was not a child back then, he was already deep into all this. We can easily state/assume the LLM business is in its infancy. It may easily stay there for a century until everyone levels up.
But didn't we already know that NNs can solve any computable problem? The interesting thing is whether they can be trained to solve any (computable) problem.
Does that mean that when we reduce the precision of an NN, for example using bfloat16 instead of float32, we reduce the set of computational problems that can be solved?
How would that compare with a biological neural network with presumably near-infinite precision?
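The narrowing effect in the first question can be seen concretely with stdlib floats. There is no bfloat16 in the standard library, so this sketch round-trips through float32 instead; bfloat16 keeps even fewer mantissa bits (8 vs. 23), so the collapse it causes is strictly worse.

```python
# Demonstrate precision loss when narrowing floats. Python floats are
# 64-bit; packing through format 'f' rounds to 32-bit IEEE 754.
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float through 32-bit representation."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 1.0 + 1e-8               # distinguishable from 1.0 in float64
print(x == 1.0)              # False: float64 keeps the tiny increment
print(to_float32(x) == 1.0)  # True: the increment is lost at 32 bits
```

Values that a higher-precision network could tell apart become literally identical at lower precision, which is the mechanism behind the finite-precision expressivity bounds discussed elsewhere in this thread.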
First day of introductions to NN we were asked to create all the logic gates using artificial neurons, and then told "If you have all gates, you can do all computations".
I got to admit, I'm sorta sticking to that at face value, because I don't know enough computer science to a) discern if that is true and b) know what "f: X -> Y only for closed domains" means.
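The classroom exercise mentioned above fits in a few lines: a single threshold neuron with hand-picked weights implements NAND, and since NAND is universal, composing such neurons gives any Boolean function; XOR via the textbook four-gate construction is shown as an example.

```python
# The "gates from neurons" exercise in miniature.
def neuron(weights, bias, inputs) -> int:
    """Threshold unit: fires iff the weighted sum plus bias is positive."""
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

def NAND(a, b):
    # Weights chosen so the unit fires unless both inputs are 1.
    return neuron([-2, -2], 3, [a, b])

def XOR(a, b):
    # NAND is universal: the standard 4-gate XOR construction.
    c = NAND(a, b)
    return NAND(NAND(a, c), NAND(b, c))

print([NAND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [1, 1, 1, 0]
print([XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]
```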
Only NNs of infinite size or precision. Under more realistic assumptions, transformers without chain of thought are actually limited in what they can solve: https://arxiv.org/abs/2207.00729
"What is the performance limit when scaling LLM inference? Sky's the limit.
We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.
http://arxiv.org/abs/2402.12875 (ICLR 2024)"
https://x.com/denny_zhou/status/1835761801453306089
A key difference is that the way LMMs (Large Multimodal Models) generate output is far from random. These models can imitate or blend existing information, and imitate or (probably) blend known reasoning methods from the training data. The latter is a key distinguishing feature of the new OpenAI o1 models.
Thus, the signal-to-noise ratio of their output is generally way better than infinite monkeys.
Arguably, humans rely on similar modes of "thinking" most of the time as well.
Yeah. Monkeys. Monkeys that write useful C and Python code that needs a bit less revision every time there's a model update.
Can we just give the "stochastic parrot" and "monkeys with typewriters" schtick a rest? It made for novel commentary three or four years ago, but at this point, these posts themselves read like the work of parrots. They are no longer interesting, insightful, or (for that matter) true.
One question, if anyone knows the details: does this prove that there exists a single LLM that can approximate any function to arbitrary precision given enough CoT, or does it prove that for every function, there exists a Transformer that fits those criteria?
That is, does this prove that a single LLM can solve any problem, or that for any problem, we can find an LLM that solves it?
If it's possible to find an LLM for any given problem, then find an LLM for the problem "find an LLM for the problem and then evaluate it" and then evaluate it, and then you have an LLM that can solve any problem.
It's the "Universal Turing Machine" for LLMs.
I wonder what's the LLM equivalent of the halting problem?
Theoretical results exist that try to quantify the number of CoT tokens needed to reach different levels of computational expressibility:
https://arxiv.org/pdf/2310.07923
TL;DR: Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.
For a field that constantly harps on parallelism and compute efficiency, this requirement seems prohibitive.
We really need to get away from constant depth architectures.
> Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.
So, as stated, this is impossible since it violates the Time Hierarchy Theorem.
The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens. Which is... not a particularly impressive bound? Any non-trivial fixed neural network can, for instance, compute the NAND of two inputs. And any polynomial computable function can be computed with a polynomial number of NAND gates.
To be clear I think the tweet is a bit exaggerated (and the word ‘performance’ there doesn’t take into account efficiency, for example) but I don’t have the time to read the full paper (just skimmed the abstract and conclusion). I quoted the tweet by an author for people to discuss since it’s still a fairly remarkable result.
This is an accepted ICLR paper by authors from Stanford, Toyota and Google. That's not a guarantee for anything, of course, but they likely know basic algorithms and the second law.
You can certainly argue against their claims, but you need to put in the legwork.
> We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.
That seems like a bit of a leap here to make this seem more impressive than it is (IMO). You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.
Wake me up when a LLM figures out stable fusion or room temperature superconductors.
I think you're misrepresenting the study. It builds upon previous work that examines the computation power of the transformer architecture from a circuit-complexity perspective. Previous work showed that the class of problems that a "naive" Transformer architecture could compute was within TC0 [1, 2] and as a consequence it was fundamentally impossible for transformers to solve certain classes of mathematical problems. This study actually provides a more realistic bound of AC0 (by analyzing the finite-precision case) which rules out even more problems, including such 'simple' ones as modular parity.
We also had previous work that hinted that part of the reason why chain-of-thought works from a theoretical perspective is that it literally allows the model to perform types of computations it could not under the more limited setting (in the same way jumping from FSMs to pushdown automata allows you to solve new types of problems) [3].
> You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.
Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?
Say humans can solve quantum gravity in 100 years of thinking by 10,000 really smart people. If one AGI is equal to 1 really smart person. Scale enough compute for 1 million AGI and we can solve quantum gravity in a year.
The major assumption here is that transformers can indeed solve every problem humans can.
Sort of like a quantum superposition state? So here is an idea: use quantum computation to produce all possible inferences, and some not-yet-invented algorithm to collapse them to the final result.
Has it been publicly benchmarked yet whether this approach:
Hello LLM, please solve this task: <task>
Can be improved by performing this afterwards?
for iteration in range(10):
    Hello LLM, please solve this task: <task>
    Here is a possible solution: <last_reply>
    Please look at it and see if you can improve it.
    Then tell me your improved solution.
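For what it's worth, that loop is simple to wire up. Here is a minimal sketch where ask_llm() is a hypothetical stand-in (stubbed so the sketch runs), not any real API:

```python
# Minimal self-refinement loop. ask_llm() is a hypothetical stand-in for a
# real model call; the stub below only exists to show the control flow.
def ask_llm(prompt: str) -> str:
    return f"draft based on: {prompt[:40]}..."  # stub, not a real model

def refine(task: str, iterations: int = 10) -> str:
    """Ask once, then feed each answer back with a request to improve it."""
    answer = ask_llm(f"Hello LLM, please solve this task: {task}")
    for _ in range(iterations):
        answer = ask_llm(
            f"Hello LLM, please solve this task: {task}\n"
            f"Here is a possible solution: {answer}\n"
            "Please look at it and see if you can improve it.\n"
            "Then tell me your improved solution."
        )
    return answer
```

Whether the loop converges to something better, or wanders off (as the anecdote below suggests it can), is exactly what a benchmark would need to measure.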
I think o1 is more like "pretend you're doing a job interview, think step by step and show your working".
I tried something similar to the suggested iterative loop on a blog post I'd authored but wanted help copy editing; the first few iterations were good enough, but then it got very confused and decided the blog post wasn't actually a blog post to be edited, and that what I really wanted to know was the implications of Florida something something Republican Party.
Benchmark would be neat, because all I have is an anecdote.
[1] https://ar5iv.labs.arxiv.org/html/2207.02098v1
Then a way would be to translate them to a formal language.
> next paper: transformers can solve any problem but on some of them they may compute indefinitely and never provide an answer
> (and you cannot tell in advance which is which!!)
https://twitter.com/yoavgo/status/1835802380589203802
Chris Coyne of OkCupid and Keybase (https://chriscoyne.com/) produced Context Free (https://www.contextfreeart.org/) before all that. It is grammar-based inference with a probabilistic chance governing the choice of the next rule. Very, very, very inspiring, and not only because of the aesthetic side of the results. Digging further you find ProbLog, which allows probabilities on rules (https://dtai.cs.kuleuven.be/problog/tutorial/basic/08_rule_p...)
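In that spirit, weighted context-free expansion fits in a few lines. The grammar below is a made-up toy (not Context Free's or ProbLog's actual syntax): each nonterminal picks its next rule by weighted chance.

```python
# Toy probabilistic CFG expansion: nonterminals map to weighted rules.
import random

GRAMMAR = {
    "S":  [(0.7, ["NP", "VP"]), (0.3, ["VP"])],
    "NP": [(1.0, ["the", "N"])],
    "N":  [(0.5, ["model"]), (0.5, ["rule"])],
    "VP": [(1.0, ["V"])],
    "V":  [(0.5, ["halts"]), (0.5, ["loops"])],
}

def expand(symbol, rng):
    """Recursively expand a symbol, choosing rules by their weights."""
    if symbol not in GRAMMAR:          # terminal: emit as-is
        return [symbol]
    rules = GRAMMAR[symbol]
    _, rhs = rng.choices(rules, weights=[w for w, _ in rules])[0]
    out = []
    for s in rhs:
        out.extend(expand(s, rng))
    return out

sentence = " ".join(expand("S", random.Random(0)))
print(sentence)  # e.g. "the model halts" (depends on the seed)
```

Swapping the fixed weights for learned or rule-attached probabilities is essentially the step from this toy to ProbLog-style probabilistic rules.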
But recurrent neural networks can solve any computational problem, given enough precision.
We should require that you've passed an algorithms and a thermodynamics class before you can post.
I think the article also says log(n) embedding size (width?) is required, where n is the size of the input.
[1] https://news.ycombinator.com/item?id=35609652 [2] https://blog.computationalcomplexity.org/2023/02/why-cant-li... [3] https://arxiv.org/abs/2305.15408
One could argue that writing enabled chain of thought across generations.
Man, the goalposts these days.
Seems like many commenting here don't know about computability.
1. Holy shit.
2. You can't apply Moore's law to humans.
https://aider.chat/docs/leaderboards/
The question is how you would define "improve" and "solve". RLHF, in a way, delegates this to humans.