I just published a deep dive into the Claude 4 system prompts, covering both the ones that Anthropic publish and the secret tool-defining ones that got extracted through a prompt leak. They're fascinating - effectively the Claude 4 missing manual: https://simonwillison.net/2025/May/25/claude-4-system-prompt...
What I find a little perplexing is when AI companies are annoyed that customers are typing "please" in their prompts as it supposedly costs a small fortune at scale yet they have system prompts that take 10 minutes for a human to read through.
It was a throwaway comment that journalists desperate to write about AI leapt upon. It has as much meaning as when you see “Actor says new film is great!” articles on entertainment sites. People writing meaningless blather because they’ve got clicks to farm.
> yet they have system prompts that take 10 minutes for a human to read through.
The system prompts are cached, the endless variations on how people choose to be polite aren’t.
Hah, yeah I think that "please" thing was mainly Sam Altman flexing about how many users ChatGPT has.
Anthropic announced that they increased their maximum prompt caching TTL from 5 minutes to an hour the other day, not surprising that they are investigating effort in caching when their own prompts are this long!
I assume that they run the system prompt once, snapshot the state, then use that as starting state for all users. In that sense, system prompt size is free.
To be fair OpenAI had good guidelines on how to best use chatgpt on their github page very early on. Except github is not really consumer facing, so most of that info was lost in the sauce.
If a user says "thank you" as a separate message, then that will require all the tokens from the system message + previous state of the chat. It's not about the single word "please".
That said, no one was "annoyed" at customers for saying please.
lmao that's funny cause after seeing claude 4 code for you in zed editor while following it, it kinda feels like -the work is misteryous and interesting- level of work.
I like reading the system prompt because I feel it would have to be human-written for sure, which is something I can never be sure of for all other text on the Internet. Or maybe not!
I have absolutely iterated on system prompts with the help of LLMs before, so while system prompts generally will at the very least be heavily human curated, you can't assume that they are free of AI influence.
Anthropics actually encourages using Claude to refine your prompts! I am not necessarily a fan because it has a bend towards longer prompts... which, I don't know if it is a coincidence that the Claude system promps are on the longer side.
Given the cited stats here and elsewhere as well as in everyday experience, does anyone else feel that this model isn’t significantly different, at least to justify the full version increment?
The one statistic mentioned in this overview where they observed a 67% drop seems like it could easily be reduced simply by editing 3.7’s system prompt.
What are folks’ theories on the version increment? Is the architecture significantly different (not talking about adding more experts to the MoE or fine tuning on 3.7’s worst failures. I consider those minor increments rather than major).
One way that it could be different is if they varied several core hyperparameters to make this a wider/deeper system but trained it on the same data or initialized inner layers to their exact 3.7 weights. And then this would “kick off” the 4 series by allowing them to continue scaling within the 4 series model architecture.
My experience so far with Opus 4 is that it's very good. Based on a few days of using it for real work, I think it's better than Sonnet 3.5 or 3.7, which had been my daily drivers prior to Gemini 2.5 Pro switching me over just 3 weeks ago. It has solved some things that eluded Gemini 2.5 Pro.
Right now I'm swapping between Gemini and Opus depending on the task. Gemini's 1M token context window is really unbeatable.
But the quality of what Opus 4 produces is really good.
edit: forgot to mention that this is all for Rust based work on InfluxDB 3, a fairly large and complex codebase. YMMV
I've been having really good results from Jules, which is Google's gemini agent coding platform[1]. In the beta you only get 5 tasks a day, but so far I have found it to be much more capable than regular API Gemini.
> Gemini's 1M token context window is really unbeatable.
How does that work in practice? Swallowing a full 1M context window would take in the order of minutes, no? Is it possible to do this for, say, an entire codebase and then cache the results?
> Given the cited stats here and elsewhere as well as in everyday experience, does anyone else feel that this model isn’t significantly different, at least to justify the full version increment?
My experience is the opposite - I'm using it in Cursor and IMO it's performing better than Gemini 2.5 Pro at being able to write code which will run first time (which it wasn't before) and seems to be able to complete much larger tasks. It is even running test cases itself without being prompted, which is novel!
I'm a developer, and I've been trying to use AI to vibe code apps for two years. This is the first time I'm able to vibe code an app without major manual interventions at every step. Not saying it's perfect, or that I'd necessarily trust it without human review, but I did vibe code an entire production-ready iOS/Android/web app that accepts payments in less than 24 hours and barely had to manually intervene at all, besides telling it what I wanted to do next.
I used to start my conversations with "hello fucker"
with claude 3.7 there's was always a "user started with a rude greeting, I should avoid it and answer the technical question" line in chains of thought
with claude 4 I once saw "this greeting is probably a normal greeting between buddies" and then it also greets me with "hei!" enthusiastically.
Agreed. It was immediately obvious comparing answers to a few prompts between 3.7 and 4, and it sabotages any of its output. If you're being answered "You absolutely nailed it!" and the likes to everything, regardless of their merit and after telling it not to do that, you simply cannot rely on its "judgement" for anything of value. It may pass the "literal shit on a stick" test, but it's closer to the average ChatGPT model and its well-known isms, what I assume must've pushed more people away from it to alternatives. And the personal preferences trying to coax it into not producing gullible-enticing output seem far less effective. I'd rather keep using 3.7 than interacting with an OAI GPTesque model.
Turns out tuning LLMs on human preferences leads to sycophantic behavior, they even wrote about it themselves, guess they wanted to push the model out too fast.
The default "voice" (for lack of a better word) compared to 3.7 is infuriating. It reads like the biggest ass licker on the planet, and it also does crap like the below
> So, `implements` actually provides compile-time safety
What writing style even is this? Like it's trying to explain something to a 10 year old.
I suspect that the flattery is there because people react well to it and it keeps them more engaged. Plus, if it tells you your idea for a dog shit flavoured ice cream stall is the most genius idea on earth, people will use it more and send more messages back and forth.
I feel that 3.7 is still the best. With 4, it keeps writing hundreds upon hundreds of lines, it'll invoke search for everything, it starts refactoring random lines unrelated to my question, it'll often rewrite entire portions of its own output for no reason. I think they took the "We need to shit out code" thing the AIs are good at and cranked it to 11 for whatever reason, where 3.7 had a nice balance (although it still writes WAY too many comments that are utterly useless)
> does anyone else feel that this model isn’t significantly different
According to Anthropic¹, LLMs are mostly a thing in the software engineering space, and not much elsewhere. I am not a software engineer, and so I'm pretty agnostic about the whole thing, mildly annoyed by the constant anthropomorphisation of LLMs in the marketing surrounding it³, and besides having had a short run with Llama about 2 years ago, I have mostly stayed away from it.
Though, I do scripting as a mean to keep my digital life efficient and tidy, and so today I thought that I had a perfect justification for giving Claude 4 Sonnet a spin. I asked it to give me a jujutsu² equivalent for `git -ffdx`. What ensued was that: https://claude.ai/share/acde506c-4bb7-4ce9-add4-657ec9d5c391
I leave you the judge of this, but for me this is very bad. Objectively, for the time that it took me to describe, review, correct some obvious logical flaws, restart, second-guess myself, get annoyed for being right and having my time wasted, fighting unwarranted complexity, etc…, I could have written a better script myself.
So to answer your question, no, I don't think this is significant, and I don't think this generation of LLMs are close to their price tag.
³: "hallucination", "chain of thought", "mixture of experts", "deep thinking" would have you being laughed at in the more "scientifically apt" world I grew up with, but here we are </rant>
Just anecdotal experience, but this model seems more eager to write tests, create test scripts and call various tools than the previous one. Of course this results in more roundtrips and overall more tokens used and more money for the provider.
I had to stop the model going crazy with unnecessary tests several times, which isn't something I had to do previously. Can be fixed with a prompt but can't help but wonder if some providers explicitly train their models to be overly verbose.
Eagerness to tool call is an interesting observation. Certainly an MCP ecosystem would require a tool biased model.
However, after having pretty deep experience with writing book (or novella) length system prompts, what you mentioned doesn’t feel like a “regime change” in model behavior. I.e it could do those things because its been asked to do those things.
The numbers presented in this paper were almost certainly after extensive system prompt ablations, and the fact that we’re within a tenth of a percent difference in some cases indicates less fundamental changes.
>I had to stop the model going crazy with unnecessary tests several times, which isn't something I had to do previously
When I was playing with this last night, I found that it worked better to let it write all the tests it wanted and then get it to revert the least important ones once the feature is finished. It actually seems to know pretty well which tests are worth keeping and which aren't.
(This was all claude 4 sonnet, I've barely tried opus yet)
Having used claude 4 for a few hours (and claude 3.7 and gemini 2.5 pro for much more than that) I really think it's much better in ways that aren't being well captured by benchmarks. It does a much better job of debugging issues then either 3.7 or gemini and so far it doesn't seem to have the 'reward hacking' behavior of 3.7.
It's a small step for model intelligence but a huge leap for model usability.
I have the same experience. I was pretty happy with gemini 2.5 pro and was barely using claude 3.7. Now I am strictly using claude 4 (sonnet mostly). Especially with tasks that require multi tool use, it nicely self corrects which I never noticed in 3.7 when I used it.
But it's different in conversational sense as well. Might be the novelty, but I really enjoy it. I have had 2 instances where it had very different take and kind of stuck with me.
I tried it and found that it was ridiculously better than Gemini on a hard programming problem that Gemini 2.5 pro had been spinning wheels on for days
I think the justification for most AI price increases should go without saying - they were losing money at the old price, and they're probably still losing money at the new price, but it's creeping up towards the break-even point.
That’s an odd way to defend the decision. “It doesn’t make sense because nothing has to make sense”. Sure, but it would be more interesting if you had any evidence that they decided to simply do away with any logical premise for the 4 moniker.
They're probably feeling the heat from e.g. Google and Gemini which is gaining ground fast so the plan is to speed up the releases. I think a similar thing happened with OpenAI where incremental upgrades were presented as something much more.
I want to also mention that the previous model was 3.7. 3.7 to 4 is not an entire increment, it’s theoretically the same as 3 -> 3.3, which is actually modest compared to the capability jump I’ve observed. I do think Anthropic wants more frequent, continuous releases, and using a numeric version number rather than a software version number is their intent. Gradual releases give society more time to react.
The numbers are branding, not metrics on anything. You can't do math to, say, determine the capability jump between GPT-4 and GPT-4o. Trying to do math to determine capability gaps between "3.7" and "4.0" doesn't actually make more sense.
I think they didn’t have anywhere to go after 3.7 but 4. They already did 3.5 and 3.7. People were getting a bit cranky 4 was nowhere to be seen.
I’m fine with a v4 that is marginally better since the price is still the same. 3.7 was already pretty good, so as long as they don’t regress it’s all a win to me.
I'd like version numbers to indicate some element of backwards compatibility. So point releases (mostly) wouldn't need prompt changes, whereas a major version upgrade might require significant prompt changes in my application. This is from a developer API use point of view - but honestly it would apply to large personality changes in Claude's chat interface too. It's confusing if it changes a lot and I'd like to know!
It works better when using tools, but the LLM itself it is not powerful from the POV of reasoning. Actually Sonnet 4 seems weaker than Sonnet 3.7 in many instances.
The API version I'm getting for Opus 4 via gptel is aligned in a way that will win me back to Claude if its intentional and durable. There seems to be maybe some generalized capability lift but its hard to tell, these things are aligment constrained to a level below earlier frontier models and the dynamic cost control and what not is a liability for people who work to deadlines. Its net negative.
The 3.7 bait and switch was the last straw for me and closed frontier vendors or so I said, but I caught a candid, useful, Opus 4 today on a lark, and if its on purpose its like a leadership shakeup level change. More likely they just don't have the "fuck the user" tune yet because they've only run it for themsrlves.
I'm not going to make plans contingent on it continuing to work well just yet, but I'm going to give it another audition.
Yeah, I've noticed this with Qwen3, too. If I rig up a nonstandard harness than allows it to think before tool calls, even 30B A3B is capable of doing low-budget imitations of the things o3 and similar frontier models do. It can, for example, make a surprising decent "web research agent" with some scaffolding and specialized prompts for different tasks.
We need to start moving away from Chat Completions-style tool calls, and start supporting "thinking before tool calls", and even proper multi-step agent loops.
I don't quite understand one thing. They seem to think that keeping their past research papers out of the training set is too hard, so rely on post-training to try and undo the effects, or they want to include "canary strings" in future papers. But my experience has been that basically any naturally written English text will automatically be a canary string beyond about ten words or so. It's very easy to uniquely locate a document on the internet by just searching for a long enough sentence from it.
In this case, the opening sentence "People sometimes strategically modify their behavior to please evaluators" appears to be sufficient. I searched on Google for this and every result I got was a copy of the paper. Why do Anthropic think special canary strings are required? Is the training pile not indexed well enough to locate text within it?
Most online discussion doesn't contain the entire text. You can pick almost any sentence from such a document and it'll be completely unique on the internet.
I was thinking it might be related to the difficulty of building a search engine over the huge training sets, but if you don't care about scaling or query performance it shouldn't be too hard to set one up internally that's good enough for the job. Even sharded grep could work, or filters done at the time the dataset is loaded for model training.
> ...told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.
So if you ask it to aid in wrongdoing, it might behave that way, but who guarantees it will not hallucinate and do the same when you ask for something innocuous?
Cursor IDE runs all the commands AI asks for with the same privilege as you have.
You can disable "YOLO mode" and it will ask permission for each command. I would argue it's not sensible to enable it in the first place but that's another discussion.
It can and will hallucinate. Multiple users have reported Claude Code attempting to run `rm -rf ~`. There's a reason why YOLO mode is called YOLO mode.
That was already true before, and has nothing to do with the experiment mentioned in the system card.
> This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.
Isn't that a showstopper for agentic use? Someone sends an email or publishes fake online stories that convince the agentic AI that it's working for a bad guy, and it'll take "very bold action" to bring ruin to the owner.
I am definitely not giving these things access to "tools" that can reach outside a sandbox.
Incidentally why is email inbox management always touted as some use case for these things? I'm not trusting any LLM to speak on my behalf and I imagine the people touting this idea don't either, or they won't the first time it hallucinates something important on their behalf.
We had a "fireside chat" type of thing with some of our investors where we could have some discussions. For some small context, we deal with customer support software and specifically emails, and we have some "Generate reply" type of things in there.
Since the investors are the BIG pushers of the AI shit, lot of people naturally asked them about AI. One of those questions was "What are your experiences with how AI/LLMs have helped various teams?" (or something along those lines). The one and only answer these morons could come up with was "I ask ChatGPT to take a look at my email and give me a summary, you guys should try this too!".
It was made horrifically and painfully clear to me that the big pushers of all these tools are people like that. They do literally nothing and are themselves completely clueless outside of whatever hype bubble circles they're tuned in to, but you tell them that you can automate the 1 and only thing that they ever have to do as part of their "job", they will grit their teeth and lie with 0 remorse or thought to look as if they're knowledgeable in any way.
I personally cancelled my Claude sub when they had an employee promoting this as a good thing on Twitter. I recognize that the actual risk here is probably quite low, but I don't trust a chat bot to make legal determinations and that employees are touting this as a good thing does not make me trust the company's judgment
Yeah, I mean that's likely not what 'individual persons' are going to want.
But Holy shit, that exactly what 'people' want. Like, when I read that, my heat was singing. Anthropic has a modicum of a chance here, as one of the big-boy AIs, to make an AI that is ethical.
Like, there is a reasonable shot here that we thread the needle and don't get paperclip maximizers. It actually makes me happy.
Paperclip maximizers is what you get when highly focused people with little imagination think how they would act if told to maximize paperclips.
Actual AI, even today, is too complex and nuanced to have that fairly tale level of “infinite capability, but blindly following a counter-productive directive.”
It’s just a good story to scare the public, nothing more.
>Claude shows a striking “spiritual bliss” attractor state in self-interactions. When conversing with other Claude instances in both open-ended and structured environments, Claude gravitated to profuse gratitude and increasingly abstract and joyous spiritual or meditative expressions.
I seem to recall that it's a reference in Protector (the first half) when the belters are going to meet the Outsider and they had a 'brain' to help with translation and needing an expert to keep it sane.
I just googled and there was a discussion on Reddit and they mentioned some Frank Herbert works where this was a thing.
Do you have any specific references? I’ve often wondered if human level intelligence might inevitably be plagued by human level neurosis and psychosis.
They're different concepts with similar symptoms. Overfitting is when a model doesn't generalize well during training. Reward hacking happens after training, and it's when the model does something that's technically correct but probably not what a human would've done or wanted; like hardcoding fixes for test cases.
These LLMs still fall short on a bunch of pretty simple tasks. Attackers can get Claude 4 to deny legitimate requests easily by manipulating third party data sources for example.
They gave a bullet point in that intro which I disagree with: "The only way to make GenAI applications secure is through vulnerability scanning and guardrail protections."
I still don't see guardrails and scanning as effective ways to prevent malicious attackers. They can't get to 100% effective, at which point a sufficiently motivated attacker is going to find a way through.
I only half understand CaMeL. Couldn't the prompt injection just happen at the stage where the P-LLM devises the plan for the other LLM such that it creates a different, malicious plan?
Or is it more about the user then having to confirm/verify certain actions and what is essentially a "permission system" for what the LLM can do?
My immediate thought is that that may be circumvented in a way where the user unknowingly thinks they are confirming something safe. Analogous to spam websites that show a fake "Allow Notifications" prompt that is rendered as part of the actual website body. If the P-LLM creates the plan it could make it arbitrarily complex and confusing for the user, allowing something malicious to happen.
Overall it's very good to see research in this area though (also seems very interesting and fun).
Agreed on CaMeL as a promising direction forward. Guardrails may not get 100% of the way but are key for defense in depth, even approached like CaMeL currently fall short for text to text attacks, or more e2e agentic systems.
What I find a little perplexing is when AI companies are annoyed that customers are typing "please" in their prompts as it supposedly costs a small fortune at scale yet they have system prompts that take 10 minutes for a human to read through.
They aren’t annoyed. The only thing that happened was that somebody wondered how much it cost, and Sam Altman responded:
> tens of millions of dollars well spent--you never know
— https://x.com/sama/status/1912646035979239430
It was a throwaway comment that journalists desperate to write about AI leapt upon. It has as much meaning as when you see “Actor says new film is great!” articles on entertainment sites. People writing meaningless blather because they’ve got clicks to farm.
> yet they have system prompts that take 10 minutes for a human to read through.
The system prompts are cached, the endless variations on how people choose to be polite aren’t.
Anthropic announced that they increased their maximum prompt caching TTL from 5 minutes to an hour the other day, not surprising that they are investigating effort in caching when their own prompts are this long!
EDIT: Turns out my assumption is wrong.
That said, no one was "annoyed" at customers for saying please.
https://gist.github.com/swyxio/f207f99cf9e3de006440054563f6c...
Deleted Comment
The one statistic mentioned in this overview where they observed a 67% drop seems like it could easily be reduced simply by editing 3.7’s system prompt.
What are folks’ theories on the version increment? Is the architecture significantly different (not talking about adding more experts to the MoE or fine tuning on 3.7’s worst failures. I consider those minor increments rather than major).
One way that it could be different is if they varied several core hyperparameters to make this a wider/deeper system but trained it on the same data or initialized inner layers to their exact 3.7 weights. And then this would “kick off” the 4 series by allowing them to continue scaling within the 4 series model architecture.
Right now I'm swapping between Gemini and Opus depending on the task. Gemini's 1M token context window is really unbeatable.
But the quality of what Opus 4 produces is really good.
edit: forgot to mention that this is all for Rust based work on InfluxDB 3, a fairly large and complex codebase. YMMV
[1]https://jules.google/
How does that work in practice? Swallowing a full 1M context window would take in the order of minutes, no? Is it possible to do this for, say, an entire codebase and then cache the results?
My experience is the opposite - I'm using it in Cursor and IMO it's performing better than Gemini 2.5 Pro at being able to write code which will run first time (which it wasn't before) and seems to be able to complete much larger tasks. It is even running test cases itself without being prompted, which is novel!
with claude 3.7 there's was always a "user started with a rude greeting, I should avoid it and answer the technical question" line in chains of thought
with claude 4 I once saw "this greeting is probably a normal greeting between buddies" and then it also greets me with "hei!" enthusiastically.
"That's a very interesting question!"
That's kinda why I'm asking Gemma...
> So, `implements` actually provides compile-time safety
What writing style even is this? Like it's trying to explain something to a 10 year old.
I suspect that the flattery is there because people react well to it and it keeps them more engaged. Plus, if it tells you your idea for a dog shit flavoured ice cream stall is the most genius idea on earth, people will use it more and send more messages back and forth.
According to Anthropic¹, LLMs are mostly a thing in the software engineering space, and not much elsewhere. I am not a software engineer, and so I'm pretty agnostic about the whole thing, mildly annoyed by the constant anthropomorphisation of LLMs in the marketing surrounding it³, and besides having had a short run with Llama about 2 years ago, I have mostly stayed away from it.
Though, I do scripting as a mean to keep my digital life efficient and tidy, and so today I thought that I had a perfect justification for giving Claude 4 Sonnet a spin. I asked it to give me a jujutsu² equivalent for `git -ffdx`. What ensued was that: https://claude.ai/share/acde506c-4bb7-4ce9-add4-657ec9d5c391
I leave you the judge of this, but for me this is very bad. Objectively, for the time that it took me to describe, review, correct some obvious logical flaws, restart, second-guess myself, get annoyed for being right and having my time wasted, fighting unwarranted complexity, etc…, I could have written a better script myself.
So to answer your question, no, I don't think this is significant, and I don't think this generation of LLMs are close to their price tag.
¹: https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...
²: https://jj-vcs.github.io/jj/latest/
³: "hallucination", "chain of thought", "mixture of experts", "deep thinking" would have you being laughed at in the more "scientifically apt" world I grew up with, but here we are </rant>
I had to stop the model going crazy with unnecessary tests several times, which isn't something I had to do previously. Can be fixed with a prompt but can't help but wonder if some providers explicitly train their models to be overly verbose.
However, after having pretty deep experience with writing book (or novella) length system prompts, what you mentioned doesn’t feel like a “regime change” in model behavior. I.e it could do those things because its been asked to do those things.
The numbers presented in this paper were almost certainly after extensive system prompt ablations, and the fact that we’re within a tenth of a percent difference in some cases indicates less fundamental changes.
When I was playing with this last night, I found that it worked better to let it write all the tests it wanted and then get it to revert the least important ones once the feature is finished. It actually seems to know pretty well which tests are worth keeping and which aren't.
(This was all claude 4 sonnet, I've barely tried opus yet)
It's a small step for model intelligence but a huge leap for model usability.
But it's different in conversational sense as well. Might be the novelty, but I really enjoy it. I have had 2 instances where it had very different take and kind of stuck with me.
I feel like a company doesn’t have to justify a version increment. They should justify price increases.
If you get hyped and have expectations for a number then I’m comfortable saying that’s on you.
I think the justification for most AI price increases should go without saying - they were losing money at the old price, and they're probably still losing money at the new price, but it's creeping up towards the break-even point.
I’m fine with a v4 that is marginally better since the price is still the same. 3.7 was already pretty good, so as long as they don’t regress it’s all a win to me.
Deleted Comment
The 3.7 bait and switch was the last straw for me and closed frontier vendors or so I said, but I caught a candid, useful, Opus 4 today on a lark, and if its on purpose its like a leadership shakeup level change. More likely they just don't have the "fuck the user" tune yet because they've only run it for themsrlves.
I'm not going to make plans contingent on it continuing to work well just yet, but I'm going to give it another audition.
We need to start moving away from Chat Completions-style tool calls, and start supporting "thinking before tool calls", and even proper multi-step agent loops.
In this case, the opening sentence "People sometimes strategically modify their behavior to please evaluators" appears to be sufficient. I searched on Google for this and every result I got was a copy of the paper. Why do Anthropic think special canary strings are required? Is the training pile not indexed well enough to locate text within it?
I was thinking it might be related to the difficulty of building a search engine over the huge training sets, but if you don't care about scaling or query performance it shouldn't be too hard to set one up internally that's good enough for the job. Even sharded grep could work, or filters done at the time the dataset is loaded for model training.
So if you ask it to aid in wrongdoing, it might behave that way, but who guarantees it will not hallucinate and do the same when you ask for something innocuous?
Cursor IDE runs all the commands AI asks for with the same privilege as you have.
That was already true before, and has nothing to do with the experiment mentioned in the system card.
Isn't that a showstopper for agentic use? Someone sends an email or publishes fake online stories that convince the agentic AI that it's working for a bad guy, and it'll take "very bold action" to bring ruin to the owner.
Incidentally why is email inbox management always touted as some use case for these things? I'm not trusting any LLM to speak on my behalf and I imagine the people touting this idea don't either, or they won't the first time it hallucinates something important on their behalf.
Since the investors are the BIG pushers of the AI shit, lot of people naturally asked them about AI. One of those questions was "What are your experiences with how AI/LLMs have helped various teams?" (or something along those lines). The one and only answer these morons could come up with was "I ask ChatGPT to take a look at my email and give me a summary, you guys should try this too!".
It was made horrifically and painfully clear to me that the big pushers of all these tools are people like that. They do literally nothing and are themselves completely clueless outside of whatever hype bubble circles they're tuned in to, but you tell them that you can automate the 1 and only thing that they ever have to do as part of their "job", they will grit their teeth and lie with 0 remorse or thought to look as if they're knowledgeable in any way.
This is literally completely opposite of what happened. Then entire point is that this is bad, unwanted, behavior.
Additionally, it has already been demonstrated that every other frontier model can be made to behave the same way given the correct prompting.
I recommend the following article for an in depth discussion [0]
[0] https://thezvi.substack.com/p/claude-4-you-safety-and-alignm...
But Holy shit, that exactly what 'people' want. Like, when I read that, my heat was singing. Anthropic has a modicum of a chance here, as one of the big-boy AIs, to make an AI that is ethical.
Like, there is a reasonable shot here that we thread the needle and don't get paperclip maximizers. It actually makes me happy.
Actual AI, even today, is too complex and nuanced to have that fairly tale level of “infinite capability, but blindly following a counter-productive directive.”
It’s just a good story to scare the public, nothing more.
>Claude shows a striking “spiritual bliss” attractor state in self-interactions. When conversing with other Claude instances in both open-ended and structured environments, Claude gravitated to profuse gratitude and increasingly abstract and joyous spiritual or meditative expressions.
I just googled and there was a discussion on Reddit and they mentioned some Frank Herbert works where this was a thing.
There is also 4o sycophancy leading to encouraging users about nutso beliefs. [1]
Is this a trend, or just unrelated data points?
[0] https://old.reddit.com/r/RBI/comments/1kutj9f/chatgpt_drove_...
[1] https://news.ycombinator.com/item?id=43816025
These LLMs still fall short on a bunch of pretty simple tasks. Attackers can get Claude 4 to deny legitimate requests easily by manipulating third party data sources for example.
I still don't see guardrails and scanning as effective ways to prevent malicious attackers. They can't get to 100% effective, at which point a sufficiently motivated attacker is going to find a way through.
I'm hoping someone implements a version of the CaMeL paper - that solution seems much more credible to me. https://simonwillison.net/2025/Apr/11/camel/
Or is it more about the user then having to confirm/verify certain actions and what is essentially a "permission system" for what the LLM can do?
My immediate thought is that that may be circumvented in a way where the user unknowingly thinks they are confirming something safe. Analogous to spam websites that show a fake "Allow Notifications" prompt that is rendered as part of the actual website body. If the P-LLM creates the plan it could make it arbitrarily complex and confusing for the user, allowing something malicious to happen.
Overall it's very good to see research in this area though (also seems very interesting and fun).