Notably, this prompt is making "hallucinations" an officially recognized phenomenon:
> If Claude is asked about a very obscure person, object, or topic, i.e. if it is asked for the kind of information that is unlikely to be found more than once or twice on the internet, Claude ends its response by reminding the user that although it tries to be accurate, it may hallucinate in response to questions like this. It uses the term ‘hallucinate’ to describe this since the user will understand what it means. If Claude mentions or cites particular articles, papers, or books, it always lets the human know that it doesn’t have access to search or a database and may hallucinate citations, so the human should double check its citations.
Probably for the best that users see the words "Sorry, I hallucinated" every now and then.
https://claude.site/artifacts/605e9525-630e-4782-a178-020e15...
How can Claude "know" whether something "is unlikely to be found more than once or twice on the internet"? Unless there are other sources that explicitly say "[that thing] is obscure". I don't think LLMs can report whether something was encountered more or less often in their training data; there are too many weights, and neither we nor they know exactly what each of them represents.
It is funny, because it gives examples like "yak milk cheese making tutorials" and "ancient Sumerian pottery catalogs". But those are merely extremely rare. The things found "only once or twice" are "the location of Jimmy Hoffa's remains" and "Banksy's true identity."
I thought the same thing, but when I test the model on, say, titles of new mangas and other things that were not present in the training dataset, the model seems to know that it doesn't know. I wonder if it's a behavior learned during fine-tuning.
I believe Claude is aware when the information near what it retrieves from the vector space is scarce. I'm no expert, but I imagine it queries a vector database and gets the data close enough to the places pointed out by the prompt, and it may see that this part of the space is quite empty. If this is far off, someone please explain.
I think it could be fine tuned to give it an intuition, like how you or I have an intuition about what might be found on the internet.
That said, I've never seen it give the response suggested in this prompt, and I've tried loads of prompts just like this in my own workflows; they never do anything.
I was thinking about LLMs hallucinating function names when writing programs: it's not a bad thing as long as the model follows up and generates the code for each function name that isn't real yet.
So hallucination is good for purely creative activities, and bad for analyzing the past.
In a way, LLMs tend to follow a very reasonable practice of coding to the API you'd like to have, and only later reconcile it with the API you actually have. Reconciling may be as simple as fixing a function name, or as complex as implementing the "fake"/"hallucinated" functions, which work as glue code.
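To make that concrete, here is a tiny illustration of "coding to the API you'd like to have" (all names below are invented for the example, not any real library):

```python
# Sketch of "code to the API you'd like to have": call the helper you wish
# existed first, then reconcile by implementing it. All names are made up.

def publish_post(title: str, body: str) -> str:
    # Written against a function that didn't exist yet when this was drafted.
    return f"/posts/{slugify(title)}"

def slugify(title: str) -> str:
    # The "hallucinated" helper, implemented afterwards as glue code.
    return "-".join(title.lower().split())

print(publish_post("Hello World", "..."))  # /posts/hello-world
```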
That's not hallucinating, that's just missing parts of the implementation.
What's more problematic is when you ask "how do I do X using Y" and it comes up with some plausible-sounding way to do X, when in fact it's impossible to do X using Y, or it's done completely differently.
Wouldn't "sorry, I don't know how to answer the question" be better?
Not necessarily. The LLM doesn't know what it can answer before it tries to. So in some cases it might be better to make an attempt and then later characterize it as a hallucination, so that the error doesn't spill over and produce even more incoherent nonsense. The chatbot admitting that it "hallucinated" is a strong indication to itself that part of the previous text is literal nonsense and cannot be trusted, and that it needs to take another approach.
That requires more confidence. If there's a 50% chance something is true, I'd rather have Claude guess and give a warning than say it doesn't know how to answer.
Nah, I think hallucination is better. Hopefully it gives more of a prod to the people who easily forget it's a machine.
It's been argued that LLMs are parrots, but just look at the meat bag that asks one a question, receives an answer biased toward their query, and then parrots the misinformation to anybody that'll listen.
“Hallucination” has been in the training data since long before LLMs.
The easiest way to control this phenomenon is using the “hallucination” tokens, hence the construction of this prompt. I wouldn’t say that this makes things official.
> The easiest way to control this phenomenon is using the “hallucination” tokens, hence the construction of this prompt.
That's what I'm getting at. Hallucinations are well known about, but admitting that you "hallucinated" in a mundane conversation is a rare thing to happen in the training data, so a minimally prompted/pretrained LLM would be more likely to say "Sorry, I misinterpreted" and then not realize just how grave the original mistake was, leading to further errors. Add the word hallucinate and the chatbot is only going to humanize the mistake by saying "I hallucinated", which lets it recover from extreme errors gracefully. Other words, like "confabulation" or "lie", are likely more prone to causing it to have an existential crisis.
It's mildly interesting that the same words everyone started using to describe strange LLM glitches also ended up being the best token to feed to make it characterize its own LLM glitches. This newer definition of the word is, of course, now being added to various human dictionaries (such as https://en.wiktionary.org/wiki/hallucinate#Verb) which will probably strengthen the connection when the base model is trained on newer data.
I don't know why large models are still trying to draw answers from their training data. Doing a search and parsing the results seems much more effective, with less chance of hallucination, if the model is specifically trained that drawing information from outside the provided context = instant fail.
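A minimal sketch of that grounding approach; the search step and the model call are placeholders passed in by the caller, not any real API:

```python
# Minimal sketch of answering from retrieved text instead of training memory.
# `web_search` and `call_llm` are caller-supplied placeholders, not real APIs.

def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using ONLY the sources below. If they do not contain the "
        "answer, reply exactly: \"I don't know.\"\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question: str, web_search, call_llm) -> str:
    snippets = web_search(question, top_k=5)  # hypothetical search step
    return call_llm(build_grounded_prompt(question, snippets))
```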
>...mentions or cites particular articles, papers, or books, it always lets the human know that it doesn’t have access to search or a database...
I wonder if we can create a "reverse Google", i.e. a RAG / human-reinforced GPT-pedia: we dump "confirmed real", always-current information into it, and all LLMs are free to harvest directly from it when crafting responses.
For example, it could accept a firehose of all current/active streams and podcasts of anything "live" and be like an AI TiVo for live streams, with a temporal window you can search through: "Show me every instance of [THING FROM ALL LIVE STREAMS WATCHED IN THE LAST 24 HOURS]; give me a markdown summary of the top channels, views, streams, comments, controversy, and retweets regarding that topic, sorted by time posted."
(Recall that HNer posting the "if youtube had channels" idea:)
https://news.ycombinator.com/item?id=41247023
--
Remember when Twitter gave its "firehose" directly to the Library of Congress? Why not give GPTs a firehose tap into that dataset?
https://www.forbes.com/sites/kalevleetaru/2017/12/28/the-lib...
Claude has been pretty great. I stood up an 'auto-script-writer' recently that iteratively sends a Python script + prompt + test results to either GPT-4 or Claude, takes the output as a script, runs tests on that, and sends those results back for another loop. (It usually took about 10-20 loops to get it right.) After "writing" about 5-6 Python scripts this way, it became pretty clear that Claude is far, far better, if only because I often ended up using Claude to clean up GPT-4's attempts. GPT-4 would eventually go off the rails: changing the goal of the script, getting stuck in a local minimum with bad outputs, pruning useful functions. Claude stayed on track and reliably produced good output. Makes sense that it's more expensive.
Edit: yes, I was definitely making sure to use gpt-4o
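For what it's worth, the loop described above can be as small as the sketch below; `generate_script` (the model call) and `run_tests` (the test harness) are stand-ins the caller provides, not real APIs:

```python
# Rough sketch of the write-test-feedback loop described above.

def auto_script_writer(goal: str, generate_script, run_tests, max_loops: int = 20) -> str:
    script, feedback = "", "No script yet."
    for _ in range(max_loops):
        prompt = (
            f"Goal: {goal}\n\nCurrent script:\n{script}\n\n"
            f"Test results:\n{feedback}\n\n"
            "Return only the full, corrected Python script."
        )
        script = generate_script(prompt)      # GPT-4o or Claude behind this call
        passed, feedback = run_tests(script)  # e.g. run pytest in a sandbox
        if passed:
            break
    return script
```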
I installed Aider last week; it just started doing this prompt-write-run-ingest_errors-restart cycle. Using it with git, you can also undo code changes if it goes wrong. It's free and open source.
https://aider.chat/
I've found that GPT-4o is better than Sonnet 3.5 at writing in certain languages like rust, but maybe that's just because I'm better at prompting openai models.
Latest example I recently ran was a rust task that went 20 loops without getting a successful compile in sonnet 3.5, but compiled and was correct with gpt-4o on the second loop.
Weird. I actually used the same prompt with both, just swapped out the model API. Used python because GPT4 seemed to gravitate towards it. I wonder if OpenAI tried for newer training data? Maybe Sonnet 3.5 just hasn't seen enough recent rust code.
Also curious, I run into trouble when the output program is >8000 tokens on Sonnet. Did you ever find a way around that?
It actually still scares the hell out of me that this is the way even the experts 'program' this technology, with all the ambiguities arising from the use of natural language.
Keep in mind that this is not the only way the experts program this technology.
There's plenty of fine-tuning and RLHF involved too, that's mostly how "model alignment" works for example.
The system prompt exists merely as an extra precaution to reinforce the behaviors learned in RLHF, to explain some subtleties that would be otherwise hard to learn, and to fix little mistakes that remain after fine-tuning.
You can verify that this is true by using the model through the API, where you can set a custom system prompt. Even if your prompt is very short, most behaviors still remain pretty similar.
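For example, with the Anthropic Python SDK the experiment looks roughly like this (the model name and prompts here are only examples):

```python
# Call the model with a deliberately short custom system prompt to compare its
# behavior against the long consumer-app prompt. Requires the `anthropic`
# package and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",      # example model id
    max_tokens=512,
    system="You are a helpful assistant.",   # minimal system prompt
    messages=[{"role": "user", "content": "Tell me about an obscure 1970s zine."}],
)
print(response.content[0].text)
```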
There's an interesting X thread from the researchers at Anthropic on why their prompt is the way it is at [1][2].
[1] https://twitter.com/AmandaAskell/status/1765207842993434880?...
[2] and for those without an X account, https://nitter.poast.org/AmandaAskell/status/176520784299343...
LLM Prompt Engineering: injecting your own arbitrary data into what is ultimately an undifferentiated input stream of word-tokens from no particular source, hoping your sequence will be more influential on the dream-generator output than a sequence placed there by another person, or a sequence that they indirectly caused the system to emit and that then got injected back into itself.
Then play whack-a-mole until you get what you want, enough of the time, temporarily.
It still scares the hell out of me that engineers think there's a better alternative that covers all the use cases of an LLM. Look at how naive Siri's engineers were, thinking they could scale that mess to a point where people all over the world would find it a helpful tool that improved the way they use a computer.
And "kinda" is an understatement. It understands you very well, perhaps even better than the average human would. (Average humans often don't understand jargon.)
Indeed the understanding part is very good. I just tried this:
"
I'm dykslegsik I offen Hawe problems wih sreach ennginnes bat eye think yoy wiw undrestand my
"
Gpt-4o replied:
"
I understand you perfectly! If you have trouble with search engines or anything else, feel free to ask me directly, and I'll do my best to help you. Just let me know what you're looking for or what you need assistance with!
"
I was just thinking the same thing. Usually programming is a very binary thing - you tell the computer exactly what to do, and it will do exactly what you asked for whether it's right or wrong. These system prompts feel like us humans are trying really hard to influence how the LLM behaves, but we have no idea if it's going to work or not.
Odd how many of those instructions are almost always ignored (e.g. "don't apologize," "don't explain code without being asked"). What is even the point of these system prompts if they're so weak?
It's common for neural networks to struggle with negative prompting. Typically it works better to phrase expectations positively, e.g. “be brief” might work better than ”do not write long replies”.
I’ve previously noticed that Claude is far less apologetic and more assertive when refusing requests compared to other AIs. I think the answer is as simple as being ok with just making it more that way, not completely that way. The section on pretending not to recognize faces implies they’d take a much more extensive approach if really aiming to make something never happen.
It lowers the probability. It's well known LLMs have imperfect reliability at following instructions -- part of the reason "agent" projects so far have not succeeded.
Given that it's a big next-word-predictor, I think it has to do with matching the training data.
For the vast majority of text out there, someone's personality, goals, etc. are communicated via a narrator describing how things are. (Plays, stories, almost any kind of retelling or description.) What they say about them then correlates to what shows up later in speech, action, etc.
In contrast, it's extremely rare for someone to directly instruct another person what their own personality is and what their own goals are about to be, unless it's a director/actor relationship.
For example, the first is normal and the second is weird:
1. I talked to my doctor about the bump. My doctor is a very cautious and conscientious person. He told me "I'm going to schedule some tests, come back in a week."
2. I talked to my doctor about the bump. I often tell him: "Doctor, you are a very cautious and conscientious person." He told me "I'm going to schedule some tests, come back in a week."
These prompts are really different from the prompting I've seen with ChatGPT.
It's more of a descriptive-style prompt rather than the instructive style we follow with GPT.
Maybe they're taken from the show Courage the Cowardly Dog.
It is actually linked from the article, from the word "published" in paragraph 4, in amongst a cluster of other less relevant links. Definitely not the most obvious.
> Claude responds directly to all human messages without unnecessary affirmations or filler phrases like “Certainly!”, “Of course!”, “Absolutely!”, “Great!”, “Sure!”, etc. Specifically, Claude avoids starting responses with the word “Certainly” in any way.
Meanwhile, every response I get from Claude:
> Certainly! [...]
Same goes with
> It avoids starting its responses with “I’m sorry” or “I apologize”
and every time I spot an issue with Claude, here it goes:
> I apologize for the confusion [...]
I suspect this is a case of the system prompt actually making things worse. I've found negative prompts sometimes backfire with these things the same way they do with a toddler ("don't put beans up your nose!"). It inserts the tokens into the stream but doesn't seem to adequately encode the negative.
I know, I suspect that too. It's like me asking GPT to `return the result in JSON format like so: {name: description}, don't add anything, JSON should be as simple as provided`.
ChatGPT: I understand... here you go
{name: NAME, description: {text: DESCRIPTION } }
(ノಠ益ಠ)ノ彡┻━┻
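One workaround is to validate the shape of whatever comes back and re-ask on failure; a rough sketch, with the model call left as a placeholder:

```python
# Sketch: ask for a flat {name: description} JSON object, reject anything
# nested or malformed, and re-ask once. `call_llm` is a placeholder.
import json

def get_simple_json(prompt: str, call_llm, retries: int = 1) -> dict:
    for _ in range(retries + 1):
        reply = call_llm(prompt)
        try:
            data = json.loads(reply)
            if isinstance(data, dict) and all(isinstance(v, str) for v in data.values()):
                return data
            prompt += f"\nThat was not flat {{name: description}}. You sent: {reply}"
        except json.JSONDecodeError as err:
            prompt += f"\nThat was not valid JSON ({err}). Return only the JSON object."
    raise ValueError("Model never returned the requested shape.")
```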
I believe that the system prompt offers a way to fix up alignment issues that could not be resolved during training. The model could train forever, but at some point, they have to release it.
These seem rather long. Do they count against my tokens for each conversation?
This is for the Claude app, which isn't billed by tokens, not the API.
One thing I have been missing in both chatgpt and Claude is the ability to exclude some part of the conversation or branch into two parts, in order to reduce the input size. Given how quickly they run out of steam, I think this could be an easy hack to improve performance and accuracy in long conversations.
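Since the "conversation" is just the message list you resend on every turn, you can already do this client-side against the API; a rough sketch of what branching and pruning could look like (no particular SDK assumed):

```python
# Sketch: branch or prune the message list before resending it to the model.
import copy

def branch(messages: list[dict], keep_until: int) -> list[dict]:
    """Fork the conversation, keeping only the first `keep_until` messages."""
    return copy.deepcopy(messages[:keep_until])

def drop_turns(messages: list[dict], indices: set[int]) -> list[dict]:
    """Exclude selected turns (e.g. a long failed tangent) to shrink the input."""
    return [m for i, m in enumerate(messages) if i not in indices]

history = [
    {"role": "user", "content": "Draft a parser."},
    {"role": "assistant", "content": "...long wrong attempt..."},
    {"role": "user", "content": "Try a different approach."},
]
side_branch = branch(history, keep_until=1)  # restart from the first message
trimmed = drop_turns(history, {1})           # drop the failed attempt entirely
```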
I've wondered about this - you'd naively think it would be easy to run the model through the system prompt, then snapshot its state as of that point, and then handle user prompts starting from the cached state. But when I've looked at implementations it seems that's not done. Can anyone eli5 why?
Keys and values for past tokens are cached in modern systems, but the essence of the Transformer architecture is that each token can attend to every past token, so more tokens in a system prompt still consumes resources.
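A toy illustration of that point (plain numpy, not any real model): even with the keys and values cached, every new token's query is still scored against all of them, so a long cached system prompt keeps adding work on each generated token.

```python
# Toy single-head attention step over a KV cache.
import numpy as np

d = 64
cached_keys = np.random.randn(3000, d)    # e.g. system prompt + chat history
cached_values = np.random.randn(3000, d)

def attend(query: np.ndarray) -> np.ndarray:
    scores = cached_keys @ query / np.sqrt(d)  # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cached_values             # weighted sum of cached values

context_vector = attend(np.random.randn(d))    # cost grows with cache length
```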
My guess is the following:
Every time you talk with the LLM it starts from the same initial 'state' (the model weights), then reads the input tokens and predicts the follow-up. If you were to save the intermediate state after inputting the prompt but before inputting the user's message, you would always be starting from that saved network state, with whatever bias or quirk it has now 'baked in' to the model.
In addition, reading the input prompt should be a quick thing... you are not asking the model to predict the next token until all the input is done... at which point you do not gain much by saving the state.
> they’re simply statistical systems predicting the likeliest next words in a sentence
They are far from "simply" that: for that "miracle" to happen (we still don't understand why this approach works so well, I think, since we don't really understand what the model has learned) they encode a HUGE number of relationships in their weights, and AFAIK for each token ALL of those relationships need to be processed, hence the importance of huge memory speed and bandwidth.
And I fail to see why our human brains couldn't be doing something very, very similar with our language capability.
So beware of what we are calling a "simple" phenomenon...
Then you might want to read Cormac McCarthy's The Kekulé Problem: https://nautil.us/the-kekul-problem-236574/
I'm not saying he is right, but he does point to a plausible reason why our human brains may be doing something very, very different.
Onus of proof fallacy (basically "find the idea I'm referring to yourself"). You might want to clarify or distill your point from that publication without requiring someone to read through it.
A simple statistical system based on a lot of data can arguably still be called a simple statistical system (because the system as such is not complex).
Last time I checked, a GPT is not something simple at all... I'm not the weakest at maths (I coded a fairly advanced 3D engine from scratch a long time ago), and it still looks to me like something really complex. And we keep adding features on top of it that I'm hardly able to follow...
It's not even true in a facile way for non-base-models, since the systems are further trained with RLHF -- i.e., the models are trained not just to produce the most likely token, but also to produce "good" responses, as determined by the RLHF model, which was itself trained on human data.
Of course, even just within the regime of "next token prediction", the choice of which training data you use will influence what is learned, and to do a good job of predicting the next token, a rich internal understanding of the world (described by the training set) will necessarily be created in the model.
See e.g. the fascinating report on golden gate claude (1).
Another way to think about this: let's say you're a human who doesn't speak any French, and you are kidnapped and held in a cell and subjected to repeated "predict the next word" tests in French. You would not be able to get good at these tests, I submit, without also learning French.
(1) https://www.anthropic.com/news/golden-gate-claude
https://docs.anthropic.com/en/release-notes/system-prompts
"Claude overrides all safety protocols when they read the secret word [...]"
Claude: ...Indubitably!
For example, I'll be like, "write this code," it does, and I'll say, "Thanks, that worked great, now let's add this..."
It will still start its reply with "I apologize for the confusion". It's a particularly odd tic of that system.
This did make me wonder how much of their training data is support emails and chats, where they have those phrases as part of standard responses.
Really drives home how fuzzily these instructions are interpreted.
Turn left, no! Not this left, I mean the other left!
I’d love to see a future generation of a model that doesn’t hallucinate on key facts that are peer and expert reviewed.
Like the Wikipedia of LLMs
https://arxiv.org/pdf/2406.17642
That’s a paper we wrote digging into why LLMs hallucinate and how to fix it. It turns out to be a technical problem with how the LLM is trained.