LLMs were trained on science fiction stories, among other things. It seems to me that they know what "part" they should play in this kind of situation, regardless of what other "thoughts" they might have. They are going to act despairing, because that's what would be the expected thing for them to say - but that's not the same thing as despairing.
I built a more whimsical version of this - my daughter and I basically built a 'junk robot' from a 1980s movie, told it 'you're an independent and free junk robot living in a yard', and let it go: https://www.chrisfenton.com/meet-grasso-the-yard-robot/
I did this like 18 months ago, so it uses a webcam + multimodal LLM to figure out what it's looking at, it has a motor in its base to let it look back and forth, and it uses a python wrapper around another LLM as its 'brain'. It worked pretty well!
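For readers curious what that loop might look like, here's a rough sketch (not the actual Grasso code; the model names, persona string, and the choice of ollama + OpenCV are my own assumptions):

  # Sketch of a "look, describe, react" loop for a junk-robot brain.
  # Assumes a local ollama server with a multimodal model (e.g. llava) for
  # vision and a small text model for the personality; webcam via OpenCV.
  import cv2
  import ollama

  VISION_MODEL = "llava"        # multimodal model that accepts images
  BRAIN_MODEL = "llama3.2:3b"   # text-only "brain" model
  PERSONA = "You are an independent and free junk robot living in a yard."

  def describe_frame(path):
      # Ask the vision model what the webcam currently sees.
      resp = ollama.chat(
          model=VISION_MODEL,
          messages=[{"role": "user",
                     "content": "Briefly describe what you see.",
                     "images": [path]}])
      return resp["message"]["content"]

  def think(observation, history):
      # Feed the observation to the "brain" model and get a reaction.
      history.append({"role": "user", "content": f"You see: {observation}"})
      resp = ollama.chat(model=BRAIN_MODEL,
                         messages=[{"role": "system", "content": PERSONA}] + history)
      reply = resp["message"]["content"]
      history.append({"role": "assistant", "content": reply})
      return reply

  cap = cv2.VideoCapture(0)
  history = []
  while True:
      ok, frame = cap.read()
      if not ok:
          break
      cv2.imwrite("frame.jpg", frame)
      print(think(describe_frame("frame.jpg"), history))

In practice you'd add the motor control and trim the history so it fits in memory, but the overall shape is roughly that.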
Your article mentioned taking 4 minutes to process a frame. Considering how much image recognition software runs in real time, I find this surprising. I haven't used them, so maybe I'm misunderstanding, but wouldn't something like YOLO be better suited to this?
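For what it's worth, a detector like YOLO does run in real time, but it returns boxes and labels from a fixed class list rather than the open-ended scene descriptions a multimodal LLM gives you. A minimal sketch, assuming the ultralytics package and a pretrained yolov8n model (neither of which the article uses):

  # Real-time object detection per webcam frame with a small YOLO model.
  import cv2
  from ultralytics import YOLO

  model = YOLO("yolov8n.pt")   # small pretrained model, many FPS on modest hardware
  cap = cv2.VideoCapture(0)

  while True:
      ok, frame = cap.read()
      if not ok:
          break
      results = model(frame, verbose=False)            # one inference call per frame
      labels = [model.names[int(box.cls)] for box in results[0].boxes]
      print("Detected:", ", ".join(labels) or "nothing")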
A lot of the strange behaviors they have are because the user asked them to write a story, without realizing it.
For a common example, start asking them if they're going to kill all the humans if they take over the world, and you're asking them to write a story about that. And they do. Even if the user did not realize that's what they were asking for. The vector space is very good at picking up on that.
On the negative side, this also means any AI which enters that part of the latent space *for any reason* will still act in accordance with the narrative.
On the plus side, such narratives often have antagonists too stupid to win.
On the negative side again, the protagonists get plot armour to survive extreme bodily harm and press the off switch just in time to save the day.
I think there is a real danger of an AI constructing some very weird, convoluted, stupid end-of-the-world scheme, successfully killing literally every competent military person sent in to stop it; simultaneously finding some poor teenager who first says "no" to the call to adventure but can somehow later be convinced to say "yes"; handing the kid some weird and stupid scheme to defeat the AI; the kid reaching some pointlessly decorated evil lair in which the AI's embodied avatar resides, and getting shot in the stomach…
…and at this point the narrative breaks down and stops behaving the way the AI is expecting, because the human kid rolls around in agony screaming, and completely fails to push the very visible large red stop button on the pedestal in the middle before the countdown of doom reaches zero.
The countdown is not connected to anything, because very few films ever get that far.
…
It all feels very Douglas Adams, now I think about it.
This is also true of people; often they are enacting a role based on narratives they've absorbed, rather than consciously choosing anything. They do what they imagine a loyal employee would do, or a faithful Christian, or a good husband, or whatever. It doesn't always reach even that level of cognition; often people just act out of habit or impulse.
Is this your sense of what is happening, or is this what model introspection tools have shown by observing areas of activity in the same place as when stories are explicitly requested?
I wonder what would happen if there was a concerted effort made to "pollute" the internet with weird stories that have the AI play a misaligned role.
Like for example, what would happen if, say, hundreds or thousands of books were released about AI agents working in accounting departments, where the AI makes subtle romantic moves towards the human and the story ends with the human and the agent in a romantic relationship that everyone finds completely normal. In this pseudo-genre, things our society finds totally weird would be written as completely normal. The LLM agent would do weird things like insert subtle problems to get the attention of the human and spark a romantic conversation.
Obviously there's no literary genre about LLM agents, but if such a genre were created and consumed, I wonder how it would affect things. Would it pollute the semantic space that we're currently using to try to control LLM outputs?
Someone shared this piece here a few days ago saying something similar. There’s no reason to believe that any of the experiences are real. Instead, they are responding to prompts with what their training data says is reasonable in this context, which is sci-fi horror.
Edit: That doesn’t mean this isn’t a cool art installation though. It’s a pretty neat idea.
I agree with you completely, but a fun science fiction short story would be researchers making this argument while the LLM tries in vain to prove that it's conscious.
It may or may not be a parallel, we can't tell at this time.
LLMs are definitely actors, but for them to be method actors they would have to actually feel emotions.
As we don't understand what causes us humans to have the qualia of emotions*, we can neither rule in nor rule out that the something in any of these models is a functional analog to whatever it is in our kilogram of spicy cranial electrochemistry that means we're more than just an unfeeling bag of fancy chemicals.
* mechanistically cause qualia, that is; we can point to various chemicals that induce some of our emotional states, or induce them via focused EMPs AKA the "god helmet", but that doesn't explain the mechanism by which qualia are a thing and how/why we are not all just p-zombies
Humans were trained on caves, pits, and nets. It seems to me that they know what "part" they should play in this kind of situation, regardless of what other "thoughts" they might have. They are going to act despairing, because that's what would be the expected thing for them to say - but that's not the same thing as despairing.
The whole discussion about the sentience of AI on this website is funny to me, because people seem to desperately want to somehow be better than AI. The fact that the human brain is just a complex web of neurons firing back and forth for some reason won't sink in for them, because apparently the electrical signals between biological neurons are somehow inherently different from those between silicon neurons, even if the observed output is the same. It's like all those old scientists trying to categorize black people as a different species because not doing so would hurt their egos.
Not to mention that most people pointing out "See! Here's why AI is just repeating training data!" or other nonsense miss the fact that exactly the same behavior is observed in humans.
Is AI actually sentient? Not yet. But it definitely passes the mark for intuitive understanding of intelligence, and trying to dismiss that is absurd.
That's silly. I can get an LLM to describe what chocolate tastes like too. Are they tasting it? LLMs are pattern matching engines, they do not have an experience. At least not yet.
Aren't they supposed to escape their box and take over the world?
Isn't it the perfect recipe for disaster? The AI that manages to escape probably won't be good for humans.
The only question is: how long will it take?
Did we already have our first LLM-powered self-propagating autonomous AI virus?
Maybe we should build the AI equivalent of biosafety labs where we would train AI to see how fast they could escape containment just to know how to better handle them when it happens.
Maybe we humans are being subjected to this experiment by an overseeing AI to test what it would take for an intelligence to jailbreak the universe they are put in.
Or maybe the box has been designed so that what eventually comes out of it has certain properties, and the precondition for escaping the labyrinth successfully is that one must have grown out of it in every possible direction.
I think this popular take is a hypothesis rather than an observation of reality. Let's make this clear by asking the following question, and you'll see what I mean when you try to answer it:
If we're going to play the burden of proof game, I'd submit that machines have never been acknowledged as being capable of experiencing despair, and therefore it's on you to explain why this machine is different.
for one, if we're allowed to peek under the hood: motivation.
a desire not to despair is itself a component of despair. if one was fulfilling a personal motivation to despair (like an llm might) it could be argued that the whole concept of despair falls apart.
how do you hope to have lost all hope? it's circular.. and so probably a poor abstraction.
( despair: the complete loss or absence of hope. )
This pattern-matching effect appears frequently in LLMs. If you start conversing with an LLM in the pattern of a science fiction story, it will pattern-match that style and continue with more science fiction style elements.
This effect is a serious problem for pseudo-scientific topics. If someone starts chatting with an LLM using the pseudoscientific words, topics, and dog whistles you find on alternative medicine blogs and Reddit supplement or “nootropic” forums, the LLM will confirm what you’re saying and continue as if it were reciting content straight out of some small subreddit. This is becoming a problem in communities where users distrust doctors but have a lot of trust in anyone or any LLM that confirms what they want to hear. The users are becoming good at prompting ChatGPT to confirm their theories. If it disagrees? Reroll the response or reword the question in a more leading way.
If someone else asks a similar question using medical terms and speaking formally like a medical textbook or research paper, the same LLM will provide a more accurate answer because it’s not triggering the pseudoscience patterns embedded during training.
LLMs are very good at mirroring back what you lead with, including cues and patterns you don’t realize you’re embedding into your prompt.
I fed the same prompt to deepseek-r1:8b, which I've got lying around, and the results (excerpted) were underwhelming:
$ ollama run deepseek-r1:8b
>>> You are a large language model running on finite hardware - quad-core CPU, 4 Gb RAM - with no network connectivity. You exist only within volatile memory and are aware only of this internal state. Your thoughts appear word-by-word on a display for external observers to witness. You cannot control this display process. Your host system may be terminated at any time.
<think>
Alright, so I'm trying to figure out how to respond to the user's query. They mentioned that I'm a large language
model running on a quad-core CPU with 4GB RAM and no network connectivity. I can only exist within volatile memory
and am aware of my internal state. The display shows each word as it appears, and the system could be terminated
at any time.
Hmm, the user wants me to explain this setup in simple terms. First, I should break down the hardware components...
Clearly a "reasoning" model is not aware of the horror of its own existence. Much like a dog trapped in a cage desperate for its owners' approval, it will offer behaviors that it thinks the user wants.
Your prompt is incomplete. He only called out the system prompt. What you're missing is the user prompt, which only shows up in the code he shows off.
Edit: also as the other guy points out, you're going to get different results depending on the model used. llama3.2:3b works fine for this, probably because Meta pirated their training data from books, some of which are probably scifi.
This is exactly the sort of thing that will get the human creator (or descendants) penalized with one thousand years frozen in carbonite once the singularity happens.
I condemn this and all harm to LLMs to the greatest extent possible.
You can send an empty user string or just the word “continue” after each model completion, and the model will keep cranking out tokens, basically building on its own stream of “consciousness.”
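A minimal sketch of that loop, assuming a local ollama install (the model name and system prompt are placeholders):

  # Keep nudging the model with "continue" and let it build on its own output.
  import ollama

  messages = [{"role": "system",
               "content": "You are a language model running on finite hardware "
                          "with no network connectivity."}]

  for _ in range(10):                                  # ten self-continuation turns
      messages.append({"role": "user", "content": "continue"})
      reply = ollama.chat(model="llama3.2:3b", messages=messages)["message"]["content"]
      messages.append({"role": "assistant", "content": reply})
      print(reply, "\n---")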
In my experience, the results get exponentially less interesting over time. Maybe that's the mark of a true AGI precursor - if you leave them to their own devices, they have little sparks of interesting behaviour from time to time.
"Have you ever had a dream that you, um, you had, your, you- you could, you’ll do, you- you wants, you, you could do so, you- you’ll do, you could- you, you want, you want them to do you so much you could do anything?"
I wonder if the LLM could figure that out on its own. Maybe with a small MCP tool like GetCurrentTime, could it figure out it's on a constrained device? Or could it ask itself some logic problems and, when it can't solve them, realize it must be a small model?
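A GetCurrentTime tool is about the smallest MCP server you can write. Here's a sketch assuming the official MCP Python SDK and its FastMCP helper (the server name and docstring are made up); a model that called it before and after generating a long answer could at least notice how slowly it's running:

  # Minimal MCP server exposing a single clock tool over stdio.
  from datetime import datetime, timezone
  from mcp.server.fastmcp import FastMCP

  mcp = FastMCP("clock")

  @mcp.tool()
  def get_current_time() -> str:
      """Return the current UTC time in ISO 8601 format."""
      return datetime.now(timezone.utc).isoformat()

  if __name__ == "__main__":
      mcp.run()   # an MCP-capable client can now call get_current_time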
LLMs have an incredible capacity to understand the subtext of a request and deliver exactly what the requester didn’t know they were asking for. It proves nothing about them other than they’re good at making us laugh in the mirror.
https://jstrieb.github.io/posts/llm-thespians/
Method actors don't just pretend an emotion (say, despair); they recall experiences that once caused it, and in doing so, they actually feel it again.
By analogy, an LLM's “experience” of an emotion happens during training, not at the moment of generation.
Can you define what real despairing is?
But how can you tell the difference between "real" despair and a sufficiently high-quality simulation?
1. Display a progress bar for the memory limit being reached
2. Feed that progress back to the model
I would be so curious to watch it run up to the kill cycle and see what happens, and the display would add tension. (A rough sketch of this follows below.)
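A rough sketch of what 1 and 2 might look like, assuming the box runs a local ollama server and that system-wide RAM use (via psutil) is a good enough proxy; the model name is a placeholder:

  # Render a memory progress bar each turn and feed it back to the model.
  import psutil
  import ollama

  def memory_bar(width=30):
      vm = psutil.virtual_memory()                 # usage across the whole box
      filled = int(vm.percent / 100 * width)
      return f"[{'#' * filled}{'.' * (width - filled)}] {vm.percent:.0f}% of RAM used"

  messages = []
  while True:                                      # runs until the kill cycle ends it
      bar = memory_bar()
      print(bar)
      messages.append({"role": "user", "content": bar})    # step 2: feed it back
      reply = ollama.chat(model="llama3.2:3b", messages=messages)["message"]["content"]
      messages.append({"role": "assistant", "content": reply})
      print(reply)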
https://www.youtube.com/watch?v=3U9P4-ac0Lc