anon291 · a year ago
So it seems 'obvious' to me that a network about 50 layers deep (for example) can only reason about symbolic questions for 50 'steps' (in quotes because it's not a step as we think about it). It only seems there's more complexity because it's 50 steps in one or more learned subspaces that the model has been trained in (which might mean the model can accomplish more than one 'human step' in its 'step'). Humans (well, intelligent humans at least) obviously seem able to reason beyond that many steps, but we all know it requires real thinking and deliberation, and perhaps a notepad, to do that.

It's quite something to expect ChatGPT, for example, to correctly do 4-digit multiplications without any thought or recourse to 'paper', when very few human beings can do that.

radarsat1 · a year ago
This is true but you have to also consider the autoregressive component. In your example, it's 50 steps per iteration of the model, where the model is executed once for each token in the output.

So, practically speaking, it's a bit more complicated to calculate how much the model can "think". Of course, once a token is output the model is committed to it (in the most basic scenario), but that doesn't mean it is not still "thinking" as it produces subsequent tokens.

> perhaps a notepad

Exactly, the context and previously output tokens can be considered such a notepad since they are input for the next steps of the model.
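
As a rough illustration of how that extra budget scales (this only counts layer passes and says nothing about what a transformer can actually compute per pass):

    # Rough illustration only: treat each layer as one sequential "step"
    # and ignore attention width, parallelism, etc.
    depth = 50                    # layers per forward pass (assumed, per the example)
    direct = depth * 1            # answer emitted immediately
    with_notepad = depth * 200    # 200 chain-of-thought tokens first
    print(direct, with_notepad)   # 50 10000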

anon291 · a year ago
So part of my general issue with this kind of thinking is that, if we take this as the main means of creating complexity, then shorter prompts are worse for reasoning than longer ones, because longer ones automatically give the model more 'space' to think. Now, I realize that the research community knows this, but I like papers like this one that explicitly seek ways to enable the model to 'breathe' a bit.
Closi · a year ago
Agreed - prompt engineering encourages LLMs to do this too (i.e. asking the LLM to explain the steps it will take to solve a problem before answering - e.g. Zero-Shot CoT: 'Let's think step by step').
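
A minimal sketch of that zero-shot CoT pattern (the question is just illustrative; no particular API is assumed):

    # Zero-shot CoT: append a trigger phrase so the model generates its
    # reasoning before the final answer, instead of answering immediately.
    question = ("A juggler has 16 balls. Half are golf balls, and half of the "
                "golf balls are blue. How many blue golf balls are there?")

    direct_prompt = f"Q: {question}\nA:"
    cot_prompt = f"Q: {question}\nA: Let's think step by step."
    # The second prompt typically elicits "16 / 2 = 8 golf balls, 8 / 2 = 4
    # blue golf balls" before "The answer is 4."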
blackbear_ · a year ago
This paper does indeed follow your intuition to investigate the limits of transformers on compositional tasks (i.e., those that require multi-step reasoning, including your multiplication example): https://arxiv.org/abs/2305.18654

> Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how autoregressive generations' performance can rapidly decay with increased task complexity.

visarga · a year ago
Maybe the Skill Mix paper is relevant here. They define a list of 100 skills, and then randomly sample tuples of n skills (usually less than 6) and generate a test example using those skills. Apparently only GPT-4 (at the time of the paper) was able to compose 5 skills, the other models just 3 or 2. Beyond 5 skills even GPT-4 was doing much worse.

The interesting finding of the paper is that GPT-4 couldn't have seen all the (topic, skill-tuple) combinations in the training set. If you have 10,000 examples on a topic, and use 5 out of 100 skills, you would need 100^5 training examples to cover all combinations. In conclusion, GPT-4 generalizes to new skill combinations, and thus is not a stochastic parrot.

https://arxiv.org/abs/2310.17567
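
A quick back-of-the-envelope check of those numbers (showing both the ordered and unordered counts; either way the scale dwarfs any plausible amount of per-topic training data):

    from math import comb

    skills = 100
    print(skills ** 5)      # 10000000000 ordered 5-tuples
    print(comb(skills, 5))  # 75287520 unordered 5-tuples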

anon291 · a year ago
Ah good... This is definitely a research path I've been looking into. Great to see someone else has already gone there!
visarga · a year ago
You are missing an important detail here - number of tokens - yes, you have 50 "steps" in network depth, but you could have extra tokens. Assuming you don't run out of tape, there is no reason for LLMs to be limited to simple operations.
danielmarkbruce · a year ago
This doesn't make a lot of sense when you consider how backprop works. Layers aren't limited to working independently.

This also doesn't make a lot of sense when you consider models are autoregressive.

082349872349872 · a year ago
Edsger Dijkstra had a precise English style; even though his mother tongue was Dutch, I find he made better use of English than many native speakers.

In one of the EWDs, he reminisced that, as children, they were taught to never begin to speak a sentence unless they already knew how they were going to finish it.

I'd bet these two observations have a causal connection.

zoogeny · a year ago
When I was a young man I was taking a language course while I was temporarily living in a foreign country. There was an older man in the course (not elderly, more like mid-fifties) who was very bad at the new language we were both learning. Yet I noticed he had, what seemed to me, a magic power: he could always make people laugh. He would often whisper something to one of our classmates and they would always get a giant smile on their face or even laugh out loud.

I was intensely curious and I spent some time wondering how he did it. One day, out of the blue, he invited me out to lunch after class. We just chatted for most of the lunch, exchanging backgrounds and stories. Then his face took on a serious expression and he slowly and carefully began to explain something to me as if he was passing on some wisdom.

He said that he never spoke a single sentence without fully saying the sentence in his mind. He said he would often think of the words several times in his mind, revising the phrase until he was happy. He would imagine saying the words to the person in front of him and he would imagine their reaction. And he would continue to revise until he felt confident the person who heard the words he would say would react in the way he wanted them to react. If he could not imagine the person reacting how he wanted them to react, he would not say anything at all.

It was clear to me that he was passing along this advice but also that he was calling me out a bit. He was letting me know that I spoke without thinking. I say what pops into my head. It was like he read my mind, honestly: he knew exactly what I was curious about, and he answered the question I had for him that I never asked.

I wish I could say that I learned the lesson. When I have tried the technique it has rewarded the effort. But I haven't formed it into a habit and I still tend to let my mouth race ahead of my mind.

MattPalmer1086 · a year ago
That actually sounds like hell to me, a complete absence of spontaneity and being in the moment.

I used to obsessively try to figure out what to say before I said it. I am socially awkward, and it did not help at all. I love writing because it is asynchronous and I can figure things out precisely and edit my thoughts.

But in social situations it is a complete hindrance.

Cthulhu_ · a year ago
I've observed two things. One, writing is different to speaking, because it's async, you can think before you write, you can edit, etc.

But second, speaking in a non-native language makes you think harder about what you're about to say. Fewer colloquialisms, more focus on making sure your meaning is understood, more sensitivity in case you might offend someone, perhaps?

It's not new either; a lot of science and whatnot has been done in people's non-native language, like French, German, Latin, etc. Another factor there is the lingo of the field; I can't simply say "Kubernetes is een open-bron houder orkestratiesysteem voor het automatiseren van de inzet, schalen, en het beheer van zachte waren" (a word-for-word Dutch rendering of "Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of software") without confusing half my native-speaking audience.

wara23arish · a year ago
I love reading his EWDs. I had a professor who worked with him who mentioned he made his students use pens while taking his tests. To make it less likely for the students to make mistakes??
float4 · a year ago
> he made his students use pens while taking his tests

This is very common in the Netherlands, I think that's why it was a rule of his.

In general, the Dutch education system seems to be against pencils (at least this was the case until recently; I'm Dutch and in my mid-20s). You're taught to write using a fountain pen, not a pencil. In high school, you're allowed to switch to ballpoint but absolutely not to pencil. In university, you can write with pretty much anything you want, but... not with a pencil. If you do take your test with a pencil, there's genuinely a chance your teacher will give you a 0, although most of the time they'll probably be forgiving.

I majored in CS in the Netherlands and every test was done with good old pen and paper. Students still make mistakes all the time, which is why everyone uses a scrap sheet.

westurner · a year ago
Perhaps to make it easier to determine how to correct instruction.

- "Guidelines for keeping a laboratory notebook" (2019) https://news.ycombinator.com/item?id=19123430#19126809

torginus · a year ago
I also learned English from textbooks, and one of the strangest things I encountered was that native speakers routinely confuse "their, there, they're", which I never thought was a mistake I could make. It would be like confusing 'wet' and 'vet'. So there's definitely a difference in how native and non-native speakers use the language.
qup · a year ago
The people who confuse that mostly have not done very much reading. Audibly, those words are identical.
leobg · a year ago
Even crazier:

“Could of”.

Like “You could of said so”.

ricardobeat · a year ago
Is that even possible, or just hyperbole? I'd bet the latter. I wouldn't be surprised if some people are able to fully unravel entire paragraphs of conversation in their head in a couple of seconds, but that's not something you could teach to children in general.
mannykannot · a year ago
I don't think it is feasible, at least for conversation, but as an aspirational goal for children, along the lines of "put your toys away when you've finished playing with them", it is not a bad one.

It's not unusual for me to think I know how I am going to end a sentence, but then find that I can't get there.

h34t · a year ago
In Dutch (and German) the verb often goes at the end of a sentence, so the advice is rather practical.
ted_bunny · a year ago
German children would with you disagree.
caddy · a year ago
I also wonder if it has anything to do with the process of learning a new language in general. I've thought more thoroughly about how English works since I've been learning French (not that I'm very eloquent in either)
fennecbutt · a year ago
Unfortunately from experience that just gives enough of a delay that you get talked over in a group setting and never get a chance to speak anyway.
dcrimp · a year ago
I had this thought the other day that the whole chain-of-thought reasoning pattern, which contributes to improved performance in LLM-based systems, seems to sit parallel to Kahneman's two-system model of the mind that he covers in 'Thinking, Fast and Slow'.

Haven't read it in a few years, but I recall the book suggests that we use 'System 1' in our brains primarily for low-effort, low-computation thinking - like 1+1=? or "the sky is ____".

It then suggests that we use a 'System 2' for deliberate, conscious, high-cognitive tasks. Dense multiplication, reasoning problems, working with tools - generally just decision-making. Anything that requires focus or brain power. Our brain escalates tasks from S1 to S2 if they feel complex or dangerous.

Maybe I'm being too cute, but it feels like the critique that "LLMs aren't intelligent because they are stochastic parrots" is an observation that they are only equipped to use their 'System 1'.

When we prompt an LLM to think step-by-step, we allow it a workspace to write down its thoughts, which it can then consider in its next token prediction - a rudimentary System 2, like a deliberation sandbox.

We do a similar thing when we engage our System 2 - we hold a diorama of the world in the front of our mind, where we simulate what the environment will do if we proceed with a given action - what our friend might respond to what we say, how the sheet steel might bend to a force, how the code might break, how the tyres might grip. And we use that simulation to explore a tree of possibilities and decide an action that rewards us the most.

I'm no expert, but this paper seems to recognise a similar framework to the above. Perhaps a recurrent deliberation/simulation mechanism will make its way into models in the future, especially the action models we are seeing in robotics.

airstrike · a year ago
I'll preface this by saying I know this may sound entirely made up, unscientific, anecdotal, naive, or adolescent even, but luckily nobody has to believe me...

A few weeks back I was in that limbo state where you're neither fully awake nor fully asleep and for some reason I got into a cycle where I could notice my fast-thinking brain spitting out words/concepts in what felt like the speed of light before my slow-thinking brain would take those and turn them into actual sentences

It was like I was seeing my chain of thought as a list of ideas that was filled impossibly fast before it got summarized into a proper "thought" as a carefully selected list of words

I have since believed, as others have suggested in much more cogent arguments before me, that what we perceive as our thoughts are, indeed, a curated output of the brainstormy process that immediately precedes it

giva · a year ago
Well, this sounds weird to me in the sense that I don't feel that I think in _words_. I only convert my thoughts into words when I need to speak or write them down; so when I need to communicate them to others, when I need to remember them for later, or when I am stuck and I need to clear things up.

I was actually convinced it was the same for most people, and that for this reason "Rubber duck debugging"[1] is a thing.

1) https://en.wikipedia.org/wiki/Rubber_duck_debugging

nico · a year ago
There is a technique for achieving this state of consciousness, it’s called noting

This is an awareness that advanced meditators seek, practice and develop to perceive “reality as it is”

If you are curious, you might find related discussions, and a great welcoming community at r/streamentry on Reddit

Also the book Mastering the Core Teachings of the Buddha talks about it quite a bit, including instructions on how to do it

dicroce · a year ago
This is fascinating. I had another experience that I think sheds light on some of this. One day I was in my office and the lights were off. I turned around and looked at the dark shape on top of my coworkers desk. For a few seconds I stared blankly and then suddenly I had a thought: PC, it's his PC. Then I started to think about that period of time just before I realized what I was looking at... The only word I can describe what it felt like is: unconscious. Is it possible that consciousness is just a stream of recognition?
theaussiestew · a year ago
I have this too. My cognitive processes are not related to my thinking brain, which I define as the part of my mental process which produces the sounds of words in my mind. Instead, I've observed that first, my subconscious processes concepts at a much more fine grained level, much like the latent space of a machine learning model. Only substantially after, let's say 10ms after, do thoughts arise, which are just pointers to the already processed subconscious process. A very rough analogy would be the inference of an LLM in words, vs all the processing of embeddings that happens internally.
andai · a year ago
I forget the name but I remember reading about this as a recognized process in neurology. We usually only hear the thought that wins, but there are many generated simultaneously, and there is a selection process.

Possibly related, I had a similar experience last night, where my mind simulated a fully realistic conversation between two people, with audio and video, except that the sentences made no sense. I thought that was interesting. My explanation was "the language part of your brain is too tired cause you've been using it all day."

Swizec · a year ago
> I got into a cycle where I could notice my fast-thinking brain spitting out words/concepts in what felt like the speed of light before my slow-thinking brain would take those and turn them into actual sentences

The way I’ve seen this described by psychologists is that System 1 is driving the car while System 2 panics in the back seat, screaming out explanations for every action and shouting directions to the driver so it can feel in control. The driver may listen to those directions, but there’s no direct link between System 2 in the backseat and System 1 holding the wheel.

Various experiments have shown that in many situations our actions come first and our conscious understanding/explanation of those actions comes second. Easiest observed in people with split brain operations. The wordy brain always thinks it’s in control even when we know for a fact it couldn’t possibly have been because the link has been surgically severed.

Being super tired, on the edge of sleep, or on drugs can disrupt these links enough to let you observe this directly. It’s pretty wild when it happens.

Another easy way, for me, is to get up on stage and give a talk. Your mouth runs away presenting things and you’re in the back of your head going “Oh shit no that’s going in the wrong direction and won’t make the right point, adjust course!”

mirror_neuron · a year ago
It’s hard (impossible?) to know if we’re talking about the same thing or not, but I experience something like this all the time, without being on the edge of sleep. We might both be wrong, but it’s relatable!
pictureofabear · a year ago
This seems like it might upend Descartes' "cogito, ergo sum" ("I think therefore I am") in that the process for forming thoughts in a language is not indicative that we exist, rather it merely indicates that we have evolved a brain that can produce and interpret language.

Seems like we're dismantling a lot of what Descartes came up with these days.

melagonster · a year ago
From a positive perspective, it seems clear that our thinking/mind is not just language and is always faster than sentence formation.
JoBrad · a year ago
I had a similar experience when I was put under during surgery a few years ago. Later I learned that they used ketamine in their concoction.
allemagne · a year ago
I occasionally reach a similar state near sleep where I will be half-dreaming that I'm reading from a page of a book where the words materialize/"come into focus" right before my eyes into what is usually vaguely grammatically correct nonsense.
marmaduke · a year ago
> curated output of the brainstormy process that immediately precedes it

Daniel Dennett gives a nice albeit more detailed version of your idea in his book Consciousness Explained, could be worth a read

samstave · a year ago
Mandelthought psyt.
HarHarVeryFunny · a year ago
> it feels like the critique that "LLMs aren't intelligent because they are stochastic parrots" is an observation that they are only equipped to use their 'System 1'.

I wouldn't say LLMs aren't intelligent (at all) since they are based on prediction which I believe is the ability that we recognize as intelligence. Prediction is what our cortex has evolved to do.

Still, intelligence isn't an all or nothing ability - it exists on a spectrum (and not just an IQ score spectrum). My definition of intelligence is "degree of ability to correctly predict future outcomes based on past experience", so it depends on the mechanisms the system (biological or artificial) has available to recognize and predict patterns.

Intelligence also depends on experience, minimally to the extent that you can't recognize (and hence predict) what you don't have experience with, although our vocabulary for talking about this might be better if we distinguished predictive ability from experience rather than bundling them together as "intelligence".

If we compare the predictive machinery of LLMs vs our brain, there is obviously quite a lot missing. Certainly "thinking before speaking" (vs LLM fixed # steps) is part of that, and this Q* approach and tree-of-thoughts will help towards that. Maybe some other missing pieces such as thalamo-cortical loop (iteration) can be retrofitted to LLM/transformer approach too, but I think the critical piece missing for human-level capability is online learning - the ability to act then see the results of your action and learn from that.

We can build a "book smart" AGI (you can't learn what you haven't been exposed to, so maybe unfair to withhold the label "AGI" just because of that) based on current approach, but the only way to learn a skill is by practicing it and experimenting. You can't learn to be a developer, or anything else, just by reading a book or analyzing what other people have produced - you need to understand the real world results of your own predictions/actions, and learn from that.

RandomLensman · a year ago
Defining intelligence as prediction leaves out a lot of other things that humans would see as intelligence in other humans (e.g., creating a novel); also, quite simple organisms make predictions (e.g., a predator jumping at prey makes a prediction about positions).
hackerlight · a year ago
> online learning - the ability to act then see the results of your action and learn from that.

I don't think that should be necessary, if you are talking about weight updates. Offline batch mode Q-learning achieves the same thing.

By online learning, did you mean working memory? I'd agree with that. Whether it's RAG, ultra-long context, an LSTM-like approach, or something else is TBD.

Grimblewald · a year ago
I'd say intelligence is a measure of how well you can make use of what you have. An intelligent person can take some pretty basic principles a really long way, for example. Similarly, they can take a basic comprehension of a system and build on it rapidly to get predictions for that system that defy the level of experience they have. Anyone can gather experience, but not everyone can push that experience's capacity to predict beyond what it should enable.
iteygib · a year ago
To me, it is one of those things like defining what 'art' is, as in creating a model in our heads around a concept. We take our definitions and then use those to construct models like AI that simulate our model well enough.

In other words, I personally do not believe any system we develop will be truly 'intelligent', since intelligence is a concept we created to help explain ourselves. We can't even truly define it, yet we try to test technologies we develop to see if they possess it. It is a bit nonsensical to me.

kderbe · a year ago
Andrej Karpathy makes this same point, using the same book reference, in his "[1hr Talk] Intro to Large Language Models" video from Nov. 2023.

Here is a link to the relevant part of his presentation: https://youtu.be/zjkBMFhNj_g?t=2120

biosed · a year ago
Weren't most of the claims in that book refuted, some even by the author? I really enjoyed it and found some great insights, only to be later told by a friend in that sphere that the book was not correct and even the author had "retracted" some of the assertions.
mannykannot · a year ago
It might still be a useful concept in developing LLMs.
jerpint · a year ago
He won a Nobel prize for his work, so I'm not sure how much of it would be refuted.
tasty_freeze · a year ago
People often say that LLMs aren't really thinking because they are just producing a stream of words (tokens, really) reflexively, based on a window of previous text either read or from their own responses. That is true.

But when talking, I often have the experience of not knowing what I'm going to say until I hear what I've said. Sometimes I do have deliberative thought and planning, trialing phrases in my head before uttering them, but apparently I'm mostly an LLM that is just generating a stream of tokens.

Workaccount2 · a year ago
This is something that is easily observable by anyone at virtually any moment, yet at the same time is something that escapes 99% of the population.

When you are talking to someone in normal conversation, you are both taking in the words you are saying at the same time.

OJFord · a year ago
I'm currently reading it for the first time, completely coincidentally/not for this reason, and on a few occasions I've thought 'Gosh that's just like' or 'analogous to' or 'brilliant description of that problem' for LLMs/generative AI or some aspect of it. I wish I could recall some examples.
glial · a year ago
I think of COT as a memory scratchpad. It gives the LLM some limited write-only working memory that it can use for simple computations (or associations, in its case). Now suppose an LLM had re-writeable memory... I think every prompt-hack, of which COT is one example, is an opportunity for an architecture improvement.
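
A minimal sketch of that scratchpad view, with generate() standing in for any LLM call (the interface here is hypothetical):

    def answer_with_scratchpad(question, generate, steps=5):
        # Append intermediate "thoughts" to the context so that later token
        # predictions can condition on them: write-only working memory.
        context = "Question: " + question + "\nScratchpad:\n"
        for _ in range(steps):
            thought = generate(context + "- ")  # hypothetical LLM call
            context += "- " + thought + "\n"
        return generate(context + "Final answer:")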
HarHarVeryFunny · a year ago
I think of COT more as a type of planning or thinking before you speak. If you just open your mouth and start talking, which is what a plain LLM does, then you may talk yourself into a corner with no good way to get out of it, or find yourself saying something that really makes no sense. COT effectively allows the LLM to see the potential continuations of what it is considering saying, and pick one that makes sense!

I think lack of COT or any ability to plan ahead is part of why LLMs are prone to hallucinate - if you've already run your mouth and said "the capital of australia is", then it's a bit late to realize you don't know what it is. The plain LLM solution is to do what they always do and predict next word using whatever it had in the training set, such as names of some australian cities and maybe a notion that a capital should be a large important city. IOW it'll hallucinate/bullshit a continuation word such as "Melbourne". With COT it would potentially have the ability to realize that "the capital of australia is" is not a good way to start a sentence when you don't know the answer, and instead say "i don't know". Of course the other cause of hallucinations is that the LLM might not even know what it doesn't know, so might think that "Melbourne" is a great answer.

kouru225 · a year ago
Feel like this is better represented as the default mode network: https://en.m.wikipedia.org/wiki/Default_mode_network

There are questions we know the answers to and we just reflexively spit them out, but then there are questions that are new to us and we have to figure them out separately.

Recent research has shown that new memories are recorded in the brain differently depending on how unique the memory is: https://www.quantamagazine.org/the-usefulness-of-a-memory-gu...

bun_at_work · a year ago
I have a similar view to you and not much to add to your comment, other than to reference a couple books that you might like if you enjoyed 'Thinking, Fast and Slow'.

'The Righteous Mind' by Jonathan Haidt. Here, Haidt describes a very similar two-system model he calls the Elephant-rider model.

'A Thousand Brains: A New Theory of Intelligence' by Jeff Hawkins. Here Jeff describes his Thousand Brains theory, which has commonality with the 2-system model described by Kahneman.

I think these theories of intelligence help pave the way for future improvements on LLMs for sure, so just want to share.

iteygib · a year ago
How does evolutionary instinct factor into the system model? Fight-or-flight responses, reflexes, etc. 'Thinking' does have consequences in terms of evolutionary survival in some circumstances, as in spending too much time deliberating/simulating.
eightysixfour · a year ago
This is a common comparison in the LLM world. I actually think it is closer to the Left/Right Brain differences described in The Master and His Emissary, but that's for a blog post later.
thwarted · a year ago
This sounds similar to the A Brain/B Brain concept that was described by, I believe, Marvin Minsky. I don't know how this might be related to Kahneman's work.
dougmwne · a year ago
I had the same thought from Thinking, Fast and Slow.

Another variation of this seems to be the “thought loop” that agents such as Devin and AutoGPT use.

machiaweliczny · a year ago
It’s a bit over my head for now but seems like GFlowNets are tackling this problem a bit.
dcrimp · a year ago
interesting, hadn't come across these. Will be doing some more reading up on them.
toisanji · a year ago
that is the approach also taken in this paper for building LLM agents with metacognition: https://replicantlife.com/
emmender2 · a year ago
Thinking step-by-step requires 100% accuracy in each step. If you are 95% accurate in each step, after the 10th step the accuracy of the reasoning chain drops to 59%. This is the fundamental problem with LLMs for reasoning.

Reasoning requires deterministic symbolic manipulation for accuracy. Only then can it be composed into long chains.
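
The arithmetic behind that claim, assuming each step's error is independent:

    p = 0.95
    for n in (1, 5, 10, 20):
        print(n, round(p ** n, 3))  # 1 0.95, 5 0.774, 10 0.599, 20 0.358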

throwuwu · a year ago
You’ve never made a mistake in your reasoning?

Tongue in cheek, but this has been considered and has resulted in experiments like tree-of-thought and various check-your-work and testing approaches. Thinking step by step is really just another way of saying "make a plan" or "use an algorithm", and when humans do either, they need to periodically re-evaluate what they've done so far and ensure it's correct.

The trick is training the model to do this as a matter of course and to learn which tool to apply at the right time, which is what the paper is about with respect to interspersed thoughts.

trenchgun · a year ago
> Reasoning requires deterministic symbolic manipulation for accuracy

No, that is automation. Automated reasoning is a thing, indeed. And I can kind of see a world where there is a system which uses LLM for creative thinking, augmented with automated reasoning systems (think datalog, egg, SMT-solver, probabilistic model checking etc).
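
One concrete flavor of that pairing, sketched with the Z3 SMT solver (pip install z3-solver): the LLM proposes a reasoning step and a deterministic checker verifies it.

    from z3 import Ints, Solver, And, Not, Implies, unsat

    # Proposed step: "if a > b and b > c, then a > c".
    # Verify it by asserting the negation and checking for counterexamples.
    a, b, c = Ints("a b c")
    s = Solver()
    s.add(Not(Implies(And(a > b, b > c), a > c)))
    assert s.check() == unsat  # no counterexample: the step is sound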

hesdeadjim · a year ago
I dream of a world where the majority of humans could come close to 59% after attempting a ten-step logical process.

YetAnotherNick · a year ago
Another RL paper with a terrible baseline. They used 0-shot, non-instruction-tuned Mistral for GSM8k, which has a very specific output format. They got 11% accuracy after improving it, while few-shot prompting achieves 37% [1]. GPT-4 could get ~97% with prompting.

[1]: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
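
For context on why the output format matters so much here: GSM8k is scored on the final numeric answer, so few-shot prompts usually demonstrate both the working and the answer format. A rough sketch (the exemplar is made up, not taken from the benchmark):

    exemplar = ("Q: Tom has 3 boxes with 4 apples each. How many apples in total?\n"
                "A: 3 * 4 = 12. The answer is 12.\n\n")
    prompt = exemplar + "Q: <test question>\nA:"
    # A non-instruction-tuned model given no exemplars often emits free-form
    # text the scorer can't parse, which deflates the zero-shot baseline.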

hiddencost · a year ago
FWIW, if they're serious scientists, taking a known method and baseline and improving it is good science. Extensions to get state of the art are probably possible, but their goal is to measure just the impact of their change in a simple setting. Let the engineers do the munged system combinations and get SoTA.
YetAnotherNick · a year ago
I am not talking about SoTA. I am talking about a deliberately poor baseline. GSM8k consists of two things: solving the problem and getting the output format correct. Getting the output format correct gives 30% accuracy for the same model, where they got 11%. SoTA is 97%.
adlpz · a year ago
Any relation to OpenAI's rumored Q* (i.e. q-star) model? Authors of this paper don't seem affiliated.

Just a name coincidence?

smusamashah · a year ago
I think it's just a play on the same hyped-up term.
HarHarVeryFunny · a year ago
I was thinking the same. The STaR paper this is an extension of came out in 2022, so it's at least possible this is what q-star is based on too, but maybe with Q standing for something else.
pawnty · a year ago
This is the missing piece for training AI that has the ability to reason. There are so many tasks whose answers are known but whose reasoning steps are missing. With this method, we can use less annotated data to reach that ability.

The interesting part (I imagine): the generated thought could be hard for a human to understand while still being much more helpful for getting the correct answer! If that happens, we have created something more intelligent than ourselves.

kjqgqkejbfefn · a year ago
This is basically what I tried this morning at the prompt level (awful results), but the sketchy idea I had in mind went further by introducing control-flow "meta-tokens" to help the LLM renavigate its context. In this perspective the context would be rethought as a self-editing structured mind-map, with the linear aspect of the context at a time T standing for the execution trace of the exploration of this mind-map so far. Some of those meta-tokens would have side effects on the context - highlighting, structuring, summarizing, forgetting, and so on, some of its parts. This could allow for native structured output without using a syntactic format such as JSON, programmatic constructs in the style of LMQL, implementing memory, etc. The goal: not just to give logical/reasoning abilities to an LLM, but to give it the means to come up with its own cognitive architecture. Implementing structured output (using a <label name="stuff">...</label> token) to also implement memory/scratchpads would bring inspectability of those cognitive structures for free. Of course I have no idea how to implement this (I'm an ML tourist).
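
A very rough sketch of what such a self-editing context might look like, purely to make the idea concrete (all tags and their semantics are invented here):

    # Invented notation, just to illustrate: meta-tokens interleaved with
    # ordinary generated text, each with a side effect on the working
    # context instead of merely extending it.
    context = (
        '<label name="plan">outline the argument</label>\n'
        "...ordinary generated text...\n"
        '<summarize target="plan"/>\n'  # side effect: collapse earlier detail
        '<forget target="plan"/>\n'     # side effect: drop it from the working context
    )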
thesz · a year ago
They do not cite [1], a paper on (learned) variable computation in RNNs, applied to language modeling, that predates their work by almost 8 years.

[1] https://openreview.net/pdf?id=S1LVSrcge

Microsoft also had something similar at that time, but for image recognition: a CNN at the input and then variable computation at classification.