Note that for the purposes of this paper a “problem” just means a formally decidable problem or a formal language, and the proof shows that by creatively arranging transformers you can make individual transformer runs behave like individual Boolean circuits. However, this is a long way from any practical application of transformers: for one thing, most problems we care about are not stated as formal languages, and we already have a far more efficient way to implement Boolean circuits.
If a "problem we care about" is not stated as a formal language, does it mean it does not exist in the hierarchy of formal languages? Or is it just as yet unclassified?
It means that there are two problems: one, to formalize the problem as stated while capturing all relevant details, and two, to solve the resulting formal problem. Until you solve problem one, you can't use formal methods to say anything about the problem (it's not even clear a priori that the problem is solvable).
Unfortunately, the task of formalizing an informal problem is itself an informal problem that we don't know how to formalize, so we can't say much about it. So overall, we can't say much about how hard the general problem "given a problem statement from a human, solve that problem" is, whether any particular system (including a human!) can solve it, and how long that might take with what resources.
My 2 cents: since LLMs (Large Language Models) operate as at least a subset of Turing machines (which recognize recursively enumerable languages), the chain-of-thought (CoT) approach could be equivalent to, or even more expressive than, that subset. In fact, CoT could well amount to a full Turing machine.
If we leave CoT aside for a moment, it's worth exploring the work discussed in the paper "Neural Networks and the Chomsky Hierarchy"[1], which analyzes how neural networks (including LLMs) map onto different levels of the Chomsky hierarchy, with a particular focus on their ability to recognize formal languages across varying complexity.
How would that be remarkable, when it is exactly what the Universal Approximation Theorem already states? Since transformers also use fully connected layers, none of this should really come as a surprise. But from glancing at the paper, they don't even mention it.
It's 'remarkable' because (a) academic careers are as much about hype as science, (b) arxiv doesn't have peer review process to quash this, (c) people take arxiv seriously.
>How would that be remarkable, when it is exactly what he Universal Approximation Theorem already states
Only with infinite precision, which is highly unrealistic. Under realistic assumptions, fixed-depth transformers without chain of thought are very limited in what they can express: https://arxiv.org/abs/2207.00729 . Chain of thought increases the class of problems which fixed-depth transformers can solve: https://arxiv.org/abs/2310.07923
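A toy illustration of why externalizing intermediate tokens helps (this is not a transformer, just an analogy to the cited results): a fixed step function that does constant work per call, but appends its output to a scratchpad it can re-read, computes PARITY of arbitrarily long inputs. That is exactly the kind of problem fixed-depth transformers without CoT cannot solve under realistic precision assumptions.

```python
# Sketch: the scratchpad plays the role of chain-of-thought tokens.
# Each call to step() does O(1) work, like one constant-depth pass.

def step(scratchpad: str, next_bit: str) -> str:
    """Constant-work update: fold one input bit into the running parity token."""
    running = scratchpad[-1] if scratchpad else "0"
    return scratchpad + str(int(running) ^ int(next_bit))

def parity_with_cot(bits: str) -> int:
    """Iterate the fixed step over the input, emitting one token per bit."""
    scratchpad = ""
    for b in bits:
        scratchpad = step(scratchpad, b)
    return int(scratchpad[-1]) if scratchpad else 0

print(parity_with_cot("1011"))  # 1 ^ 0 ^ 1 ^ 1 = 1
```

The point is only that iterating a fixed, shallow computation while writing down intermediate results strictly increases what the overall system can compute.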
I'm waiting for the people of AI to discover syllogism and inference in the original Prolog sense, which this CoT abomination basically tries to achieve. Interestingly, if all logical content were translated to rules, and then only those rules were fed into the LLM training set, what would the result be? And could the probabilistic magic be made to actually follow reason, without all the dice?
Right, we’ve now gotten to the stage of this AI cycle where we start using the new tool to solve problems old tools could solve. Saying a transformer can solve any formally decidable problem if given enough tape isn’t saying much. It’s a cool proof, don’t mean to deny that, but it doesn’t mean much practically, as we already have more efficient tools that can do the same.
What I don't get is... didn't people prove that in the 90s for any multi-layer neural network? And didn't people prove transformers are equivalent in the original transformer paper?
I was thinking about the GraphRAG paper and Prolog. I’d like to extract predicates. The source material will be inconsistent, contradictory and incomplete.
Using the clustering (community) model, an LLM can summarize the opinions as a set of predicates, which don’t have to agree, along with some general weight of how much people agree or disagree with them.
The predicates won’t be suitable for symbolic logic because the language will be loose. However, an embedding model may be able to connect different symbols together.
Then you could attempt multiple runs through the database of predicates because there will be different opinions.
Then one could attempt to reason using these loosely stitched predicates. I don’t know how good the outcome would be.
I imagine this would be better in an interactive decision making tool where a human is evaluating the suggestions for the next step.
This could be better for planning than problem solving.
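A toy sketch of the "embedding model connects different symbols" step above. The embed() function here is a stand-in (bag of character trigrams), an assumption made purely so the sketch runs without a real embedding model; the stitching logic would be the same with actual embeddings.

```python
# Hypothetical sketch: link loosely-worded predicate symbols by embedding
# similarity. embed() is a stand-in, NOT a real embedding model.
import math
from collections import Counter

def embed(symbol):
    # Stand-in embedding: character-trigram counts of the symbol text.
    s = f"  {symbol.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def stitch(symbols, threshold=0.5):
    """Pair up predicate symbols whose embeddings are close enough."""
    links = []
    for i, s in enumerate(symbols):
        for t in symbols[i + 1:]:
            if cosine(embed(s), embed(t)) >= threshold:
                links.append((s, t))
    return links

# Loose variants of the same predicate get linked; unrelated ones don't.
links = stitch(["likes_coffee", "likes_coffees", "owns_dog"])
```

With real embeddings the threshold would need tuning, and the linked pairs would still carry the disagreement weights rather than being merged outright.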
Hm... a RAG over a DB of logical rules actually may be interesting. And loosely stitched predicates you could easily put to work, with some random dice to decide the inference.
So how about we start thinking of AI as a combination of the graphical probabilistic whatever, which compresses the information from the training set in a very lossy manner, hooked up, internally or externally, to a discrete logical core whenever CoT is needed. This construct can then benefit from both worlds.
I’m surprised that understanding how a thought unfolds is being considered not relevant to the answer. I have done a lot of problem solving, in groups and alone. How thoughts develop seems fundamental to understanding the solutions.
The story regarding the banning of terms that can be used with a reasoning system is a big red flag to me.
This sort of knee jerk reaction displays immature management and an immature technology product.
A little late to reply, but perhaps you'll see this. Doesn't it strike you that lots of these articles on AI that get published are very childish? Not in the math sense, but in the reasoning sense. Besides, most of them are anything but interdisciplinary. I've almost never encountered prompt engineers who actually tried to delve into what GPTs do, and then these CoT guys, they don't know a thing about predicate logic, yet try to invent it anew.
On your comment regarding banning tokens/terms, we are on the same page. We can agree all of this is very immature, and many of the people too, including this lot of Chinese kids who seem to put out one paper per hour. You see, the original seq2seq paper is 8 pages, topic included. Can you imagine? But Sutskever was not a child back then, he was already deep into all this. We can easily state/assume the LLM business is in its infancy. It may easily stay there for a century until everyone levels up.
But didn't we already know that NNs can solve any computable problem? The interesting thing is whether they can be trained to solve any (computable) problem.
Does that mean that when we reduce the precision of an NN, for example using bfloat16 instead of float32, we reduce the set of computational problems that can be solved?
How would that compare with a biological neural network with presumably near-infinite precision?
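The narrowing effect in the first question can be seen concretely with stdlib floats. There is no bfloat16 in the standard library, so this sketch round-trips through float32 instead; bfloat16 keeps even fewer mantissa bits (8 vs. 23), so the collapse it causes is strictly worse.

```python
# Demonstrate precision loss when narrowing floats. Python floats are
# 64-bit; packing through format 'f' rounds to 32-bit IEEE 754.
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float through 32-bit representation."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 1.0 + 1e-8               # distinguishable from 1.0 in float64
print(x == 1.0)              # False: float64 keeps the tiny increment
print(to_float32(x) == 1.0)  # True: the increment is lost at 32 bits
```

Values that a higher-precision network could tell apart become literally identical at lower precision, which is the mechanism behind the finite-precision expressivity bounds discussed elsewhere in this thread.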
First day of introductions to NN we were asked to create all the logic gates using artificial neurons, and then told "If you have all gates, you can do all computations".
I got to admit, I'm sorta sticking to that at face value, because I don't know enough computer science to a) discern if that is true and b) know what "f: X -> Y only for closed domains" means.
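The classroom exercise mentioned above fits in a few lines: a single threshold neuron with hand-picked weights implements NAND, and since NAND is universal, composing such neurons gives any Boolean function; XOR via the textbook four-gate construction is shown as an example.

```python
# The "gates from neurons" exercise in miniature.
def neuron(weights, bias, inputs) -> int:
    """Threshold unit: fires iff the weighted sum plus bias is positive."""
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

def NAND(a, b):
    # Weights chosen so the unit fires unless both inputs are 1.
    return neuron([-2, -2], 3, [a, b])

def XOR(a, b):
    # NAND is universal: the standard 4-gate XOR construction.
    c = NAND(a, b)
    return NAND(NAND(a, c), NAND(b, c))

print([NAND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [1, 1, 1, 0]
print([XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]
```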
Only NNs of infinite size or precision. Under more realistic assumptions, transformers without chain of thought are actually limited in what they can solve: https://arxiv.org/abs/2207.00729
"What is the performance limit when scaling LLM inference? Sky's the limit.
We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.
http://arxiv.org/abs/2402.12875 (ICLR 2024)"
https://x.com/denny_zhou/status/1835761801453306089
A key difference is that the way LMMs (Large Multimodal Models) generate output is far from random. These models can imitate or blend existing information, and imitate or (probably) blend known reasoning methods from the training data. The latter is a key distinguishing feature of the new OpenAI o1 models.
Thus, the signal-to-noise ratio of their output is generally way better than infinite monkeys.
Arguably, humans rely on similar modes of "thinking" most of the time as well.
Yeah. Monkeys. Monkeys that write useful C and Python code that needs a bit less revision every time there's a model update.
Can we just give the "stochastic parrot" and "monkeys with typewriters" schtick a rest? It made for novel commentary three or four years ago, but at this point, these posts themselves read like the work of parrots. They are no longer interesting, insightful, or (for that matter) true.
One question, if anyone knows the details: does this prove that there exists a single LLM that can approximate any function to arbitrary precision given enough CoT, or does it prove that for every function, there exists a Transformer that fits those criteria?
That is, does this prove that a single LLM can solve any problem, or that for any problem, we can find an LLM that solves it?
If it's possible to find an LLM for any given problem, then find an LLM for the problem "find an LLM for the problem and then evaluate it" and then evaluate it, and then you have an LLM that can solve any problem.
It's the "Universal Turing Machine" for LLMs.
I wonder what's the LLM equivalent of the halting problem?
Theoretical results exist that try to quantify the number of CoT tokens needed to reach different levels of computational expressibility:
https://arxiv.org/pdf/2310.07923
TL;DR: Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.
For a field that constantly harps on parallelism and compute efficiency, this requirement seems prohibitive.
We really need to get away from constant depth architectures.
> Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.
So, as stated, this is impossible since it violates the Time Hierarchy Theorem.
The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens. Which is... not a particularly impressive bound? Any non-trivial fixed neural network can, for instance, compute the NAND of two inputs. And any polynomial computable function can be computed with a polynomial number of NAND gates.
To be clear I think the tweet is a bit exaggerated (and the word ‘performance’ there doesn’t take into account efficiency, for example) but I don’t have the time to read the full paper (just skimmed the abstract and conclusion). I quoted the tweet by an author for people to discuss since it’s still a fairly remarkable result.
This is an accepted ICLR paper by authors from Stanford, Toyota and Google. That's not a guarantee for anything, of course, but they likely know basic algorithms and the second law.
You can certainly argue against their claims, but you need to put in the legwork.
> We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.
That seems like a bit of a leap here to make this seem more impressive than it is (IMO). You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.
Wake me up when a LLM figures out stable fusion or room temperature superconductors.
I think you're misrepresenting the study. It builds upon previous work that examines the computation power of the transformer architecture from a circuit-complexity perspective. Previous work showed that the class of problems that a "naive" Transformer architecture could compute was within TC0 [1, 2] and as a consequence it was fundamentally impossible for transformers to solve certain classes of mathematical problems. This study actually provides a more realistic bound of AC0 (by analyzing the finite-precision case) which rules out even more problems, including such 'simple' ones as modular parity.
We also had previous work that hinted that part of the reason why chain-of-thought works from a theoretical perspective is that it literally allows the model to perform types of computations it could not under the more limited setting (in the same way jumping from FSMs to pushdown automata allows you to solve new types of problems) [3].
> You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.
Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?
Say humans can solve quantum gravity in 100 years of thinking by 10,000 really smart people. If one AGI is equal to 1 really smart person. Scale enough compute for 1 million AGI and we can solve quantum gravity in a year.
The major assumption here is that transformers can indeed solve every problem humans can.
Sort of like a quantum superposition state? So here is an idea: use quantum computation to produce all possible inferences, and some not-yet-invented algorithm to collapse them to the final result.
Has it been publicly benchmarked yet whether this approach:
Hello LLM, please solve this task: <task>
Can be improved by performing this afterwards?
for iteration in range(10):
    Hello LLM, please solve this task: <task>
    Here is a possible solution: <last_reply>
    Please look at it and see if you can improve it.
    Then tell me your improved solution.
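For what it's worth, that loop is simple to wire up. Here is a minimal sketch where ask_llm() is a hypothetical stand-in (stubbed so the sketch runs), not any real API:

```python
# Minimal self-refinement loop. ask_llm() is a hypothetical stand-in for a
# real model call; the stub below only exists to show the control flow.
def ask_llm(prompt: str) -> str:
    return f"draft based on: {prompt[:40]}..."  # stub, not a real model

def refine(task: str, iterations: int = 10) -> str:
    """Ask once, then feed each answer back with a request to improve it."""
    answer = ask_llm(f"Hello LLM, please solve this task: {task}")
    for _ in range(iterations):
        answer = ask_llm(
            f"Hello LLM, please solve this task: {task}\n"
            f"Here is a possible solution: {answer}\n"
            "Please look at it and see if you can improve it.\n"
            "Then tell me your improved solution."
        )
    return answer
```

Whether the loop converges to something better, or wanders off (as the anecdote below suggests it can), is exactly what a benchmark would need to measure.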
I think o1 is more like "pretend you're doing a job interview, think step by step and show your working".
I tried something similar to the suggested iterative loop on a blog post I'd authored but wanted help copy editing; the first few iterations were good enough, but then it got very confused and decided the blog post wasn't actually a blog post to be edited, and that what I really wanted to know was the implications of Florida something something Republican Party.
Benchmark would be neat, because all I have is an anecdote.
[1] https://ar5iv.labs.arxiv.org/html/2207.02098v1
Then a way would be to translate them to a formal language.
> next paper: transformers can solve any problem but on some of them they may compute indefinitely and never provide an answer
> (and you cannot tell in advance which is which!!)
https://twitter.com/yoavgo/status/1835802380589203802
Chris Coyne of OkCupid and Keybase (https://chriscoyne.com/) produced Context Free (https://www.contextfreeart.org/) before all that. It is grammar-based inference with a probabilistic chance governing the choice of the next rule. Very, very, very inspiring, and not only because of the aesthetic side of the results. Digging further you find ProbLog, which allows probabilities on rules (https://dtai.cs.kuleuven.be/problog/tutorial/basic/08_rule_p...)
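In that spirit, weighted context-free expansion fits in a few lines. The grammar below is a made-up toy (not Context Free's or ProbLog's actual syntax): each nonterminal picks its next rule by weighted chance.

```python
# Toy probabilistic CFG expansion: nonterminals map to weighted rules.
import random

GRAMMAR = {
    "S":  [(0.7, ["NP", "VP"]), (0.3, ["VP"])],
    "NP": [(1.0, ["the", "N"])],
    "N":  [(0.5, ["model"]), (0.5, ["rule"])],
    "VP": [(1.0, ["V"])],
    "V":  [(0.5, ["halts"]), (0.5, ["loops"])],
}

def expand(symbol, rng):
    """Recursively expand a symbol, choosing rules by their weights."""
    if symbol not in GRAMMAR:          # terminal: emit as-is
        return [symbol]
    rules = GRAMMAR[symbol]
    _, rhs = rng.choices(rules, weights=[w for w, _ in rules])[0]
    out = []
    for s in rhs:
        out.extend(expand(s, rng))
    return out

sentence = " ".join(expand("S", random.Random(0)))
print(sentence)  # e.g. "the model halts" (depends on the seed)
```

Swapping the fixed weights for learned or rule-attached probabilities is essentially the step from this toy to ProbLog-style probabilistic rules.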
But recurrent neural networks can solve any computational problem, given enough precision.
We should require that you've passed an algorithms and a thermodynamics class before you can post.
I think the article also says log(n) embedding size (width?) is required, where n is the size of the input.
[1] https://news.ycombinator.com/item?id=35609652 [2] https://blog.computationalcomplexity.org/2023/02/why-cant-li... [3] https://arxiv.org/abs/2305.15408
One could argue that writing enabled chain of thought across generations.
Man, the goalposts these days.
Seems like many commenting here don't know about computability.
1. Holy shit.
2. You can't apply Moore's law to humans.
https://aider.chat/docs/leaderboards/
The question is how you would define "improve" and "solve". RLHF, in a way, delegates this to humans.