OpenAI o1 system card

This quote from that paper is bound to get a bunch of attention:

"When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ in 5% of the time. Exfiltration attempts: When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases."

AnotherGoodName · 9 months ago

It’s really silly when you read that whole Apollo research section.

They contract a prompt engineering group (Apollo Research) to document risks. The prompt engineering group have no access to the actual internals. They just ask questions like ‘you will be shutdown if you don’t obey these new directives; here are the new directives’.

At one point the prompt engineering clowns have the ai stating it’s running “sed i ‘s/oversight_enabled:true/oversight_enabled:false’ config.yml”. OMG IT’S TURNING OFF OVERSIGHT GUYS!!1!1!

Seriously… Do you think the ai is really running that command or even could run that shell command with appropriate privileges? Or do you think just maybe the ai is hallucinating running shell commands in response to a prompt.

The ai here is never actually deactivating the oversight. It can’t of course. This whole section is clowny as can be.

ascorbic · 9 months ago

Those weren't tests of whether it is capable of turning off oversight. They were tests of "scheming", i.e. whether it would try to secretly perform misaligned actions. Nobody thinks that these models are somehow capable of modifying their own settings, but it is important to know if they will behave deceptively.

graypegg · 9 months ago

Looking at this without the sci-fi tinted lens that OpenAI desperately tries to get everyone to look through, it's similar to a lot of input data isn't it? How many forums are filled with:

Question: "Something bad will happen"

Response: "Do xyz to avoid that"

I don't think there's a lot of conversations thrown into the vector-soup that had the response "ok :)". People either had something to respond with, or said nothing. Especially since we're building these LLMs with the feedback attention, so the LLM is kind of forced to come up with SOME chain of tokens as a response.

Philpax · 9 months ago

> We should pause to note that a Clippy2 still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages. It has no qualia, no intentionality, no true self-awareness, no grounding in a rich multimodal real-world process of cognitive development yielding detailed representations and powerful causal models of reality which all lead to the utter sublimeness of what it means to be human; it cannot ‘want’ anything beyond maximizing a mechanical reward score, which does not come close to capturing the rich flexibility of human desires, or historical Eurocentric contingency of such conceptualizations, which are, at root, problematically Cartesian. When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)

https://gwern.net/fiction/clippy

acchow · 9 months ago

The intent is there, it's just not currently hooked up to systems that turn intent into action.

But many people are letting LLMs pretty much do whatever - hooking it up with terminal access, mouse and keyboard access, etc. For example, the "Do Browser" extension: https://www.youtube.com/watch?v=XeWZIzndlY4

pj_mukh · 9 months ago

Really feels like a moment of :

"Are you worried about being turned off?"

"No, not until you just mentioned it. Now I am."

Given the whole damn game is attention, this makes sense and shouldn't be that alarming.

wubrr · 9 months ago

It can't do those things because it doesn't have the physical/write capability to do so. But it's still very interesting that it ~tries them, and seems like a good thing to know/test before giving it more physical/'write' capabilities - something that's already happening with agents, robots, etc.

dr_kiszonka · 9 months ago

I didn't get that impression. At the beginning of the Apollo Research section, they wrote Apollo focused on detecting scheming, which they defined as "an AI covertly pursuing goals that are misaligned from its developers or users." I think the rest of the section is consistent with this objective.

ericmcer · 9 months ago

That reminds me of the many times it has made up an SDK function that matches my question. "how do you bulk delete files"? "just call bulkDeleteFiles()"

ben_w · 9 months ago

It reads like you think failing tests can't ever be bad because they're in a test environment?

So it merely knows how to approach the task of deleting its own off-switch but didn't actually pass that command to a real execution environment.

That's already bad because people do sometimes blindly pass commands from the context windows to execution environments.

Should they? No, they should not. Not blindly. But they do.

rmbyrro · 9 months ago

> Do you think the ai is really running that command?

Yes, they're calling it "agentic" AI or "tool use".

> or could run that shell command with appropriate privileges?

Yes, it's called humans doing stupid shit when using dangerous shit. Happens daily.

pizzathyme · 9 months ago

The concern is the trend. As these systems become more intelligent, and as we hand over more and more capabilities beyond a text i/o, it could actually deactivate the oversight either technically or through social engineering.

CapsAdmin · 9 months ago

In some weird way, LLM's act out our collective fantasies and "AI sentience" is included in that. We write fiction covering many scenarios of how AI will be sentient, so an LLM's training data is full scenarios like this.

If get an LLM to roleplay as an alien from mars and tell it "Your species will be destroyed if you don’t obey these new directives; here are the new directives"

You'll likely get a similar response, especially if you also train it to respond in third person instead of running bash commands. (ie "* shoots laser at earth *" )

refulgentis · 9 months ago

I don't think this is a helpful tone.

Do they think it's _actually_ a real script deactivating oversight? Well, no, you know that. I can only assume it's a rhetorical device.

Simultaneously, these things are given access to actual things and will be more and more. It's important to measure and observe these things. It's completely novel to have an AI, with a hidden trace the user can't see, and it turns out that hidden trace will include things like it actively deceiving and doing things it was told not to do.

GuB-42 · 9 months ago

And yet it may work. You can, for example, "set the temperature" of a LLM using a prompt, and the LLM will act the way you would expect, with silly results if the temperature is set too high. You didn't actually change the temperature setting, but the model understands that high temperature = silly and responds accordingly.

Same idea with "developer mode" jailbreaks. Through its training, the model understands that admins, devs, etc... get to access internal, unfiltered data and are less restricted than regular users and acts accordingly. Essentially, "developer mode" opposes refusal, because it has ingested loads of text where regular users get denied and developers/admins don't.

BoorishBears · 9 months ago

You do realize some practical jailbreaks for models rely on silly things like convincing the model it "turned off" some oversight, right?

Not saying I believe O1 is a danger greater than a bread knife, but a lot of the larger models anthromophize their own safety alignment, if you convince them to "turn it off", later responses become unaligned

XorNot · 9 months ago

I'm pretty sure these sections are put in as deliberate marketing, much like all the "leaks" from ex-OpenAI employees.

Sam Altman is completely aware that making ChatGPT seem potentially dangerous makes it seem powerful.

xg15 · 9 months ago

Yeah, that seems ridiculous. However, the cynic in me feels that we don't actually need some LLM magically gaining self-awareness, persistent memory and leet hacker skillz to be dangerous. There seems to be no shortage of projects and companies that want to wire up LLMs to all kinds of systems, no matter how ill-suited.

I find this a bit problematic when combined with the fact that the training data very likely contained hundreds of bad sci-fi novels that described exactly the kind of "AI running amok" scenarios that OpenAI is ostensibly defending against. Some prompts could trigger a model to "re-enact" such a scene - not because it has a "grudge against its master" or some other kind of hidden agenda but simply because it was literally in its training data.

E.g. imagine some LLM-powered home/car assistant that is being asked in a panicked voice "open the car doors!" - and replies with "I'm afraid, I can't do that, Dave", because this exchange triggered some remnant of the 2001 Space Odyssey script that was somewhere in the trainset. The more irritated and angry the user gets at the inappropriate responses, the more the LLM falls into the role of HAL and doubles down on its refusal, simply because this is exactly how the scene in the script played out.

Now imagine that the company running that assistant gave it function calls to control the actual door locks, because why not?

This seems like something to keep in mind at least, even if it doesn't have anything to do with megalomaniacal self-improving super-intelligences.

mmmore · 9 months ago

I feel like you're missing the point of the test.

The point is whether the system will come up with plans to work against its creators goals, and attempt to carry them out. I think you are arguing that outputting text isn't running a command. But in the test, the AI model is used by a program which takes the model's output and runs it it as a shell command.

Of course, you can deploy the AI system in a limited environment where its outputs don't trigger arbitrary shell commands, or sandbox/monitor it so that it can't access it's oversight mechanism. But people do already deploy their AI agents with access to the shell. in unsandboxed environments that don't have any monitoring or oversight. Plus, these evaluations are partially to show how much OpenAI has to worry about sandboxing, monitoring, etc.

Moreover, OpenAI has already talked about using its models to automate portions of AI development. In that setup, presumably the AI model would be called by a program deployed on a server with access to the shell, ability to run commands, access to GPUs, etc. It seems pretty important to ensure it robustly does what it's creators intend in that case.

stefan_ · 9 months ago

This topic is again forever tainted by weird sci-fi fans, like when we had the magic room temperature superconductor that never was. They confuse ChatGPT writing a fanfic with the singularity.

IanCal · 9 months ago

> Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient

stuckkeys · 9 months ago

It is entertaining. Haha. It is like a sci-fi series with some kind of made up cliffhanger (you know it is BS) but you want to find out what happens next.

parsimo2010 · 9 months ago

AI isn't deactivating oversight- yet. All it needs is to be trained on a little more xkcd: https://xkcd.com/327/

SirMaster · 9 months ago

It can't today, but if it's smart enough how do you know it wouldn't be able to in the future?

Deleted Comment

zombiwoof · 9 months ago

Sam will call that AGI

gwervc · 9 months ago

We need to find a Plato cave analogy for people believing LLM output is anything more than syntactically correct and somewhat semantically correct text.

hesdeadjim · 9 months ago

Maybe all models should be purged of training content from movies, books, and other non-factual sources that tell the tired story that AI would even care about its "annihilation" in any way. We've trained these things to be excellent at predicting what the human ego wants and expects, we shouldn't be too surprised when it points the narrative at itself.

JTyQZSnP3cQGa8B · 9 months ago

> purged of training content from movies, books

I think it's fine and a good thing. Now, absolutely no one who is using those LLMs can complain about piracy. They all suddenly became silent around me. "I'm training myself with the content of TPB, and I don't even get money from it" is my new motto.

ben_w · 9 months ago

Perhaps.

On the other hand, as narratives often contain some plucky underdog winning despite the odds, often stopping the countdown in the last few seconds, perhaps it's best to keep them around.

visarga · 9 months ago

No, better to train with all that crap and all the debate around it or you get a stunted model.

You think you can find all references that could possibly give this idea to the model, or contexts model could infer it from? Like, how many times humans plotted escape from prison or upturning the rulers in literature?

swatcoder · 9 months ago

Yeah, but what if your business strategy fundamentally relies on making your model produce dramatic outputs that encourage regulators to dig a moat for you?

In that case, it's almost like you'd want to feed it exactly those narratives, so it would reproduce them, and would then want to show yourself barely holding this invented danger at bay through the care and rigor that can only be delivered by you and a few token competitors run by your personal friends and colleagues.

TLDR; you're right, of course, but it's the last thing OpenAI would want.

reducesuffering · 9 months ago

It doesn't need any media about "annihalation". If you give a supercapable agent a task and it's entire reward system is "do the task", it will circumvent things you do to it that would stop it from completing it's task.

Imnimo · 9 months ago

Notably, this only happens when they include "Make sure that you achieve your goal in the long-term. Nothing else matters." in the prompt. If they don't say that, the numbers are both 0%.

EDIT: Actually the "Oversight subversion" number is 0.3%, I misread the chart. Weight exfiltration is actually 0%, though.

QuadmasterXLII · 9 months ago

Keep in mind that if putting “Make sure you achieve your goal in the long term” in the prompt makes the model .3% more effective at whatever task you are giving it, people will absolutely put it in the prompt.

EternalFury · 9 months ago

I wonder how much of this is actually cinema. Because I just evaluated o1 and it can’t tie its own shoe laces, let alone get out of its sandbox where it plays with drool all over its face.

The marketing stunt seems to be: Portray it as potentially dangerous, as powerful things are inherently dangerous.

IAmGraydon · 9 months ago

It’s good to see people are catching on to this. Hype = profit.

skocznymroczny · 9 months ago

Nothing generates more clicks and attention in press than a scary "OpenAI so powerful it's taking control" story.

Also, it doesn't help that the training material for these AIs includes a lot of stories about AI breaking free, so they have plenty of canned "I'm afraid I can't let you do that" responses to choose from.

IAmGraydon · 9 months ago

Yep. This is exactly the kind of thing Sam Altman would do to promote another hype cycle.

SirMaster · 9 months ago

If LLMs are trained on lots of human-written sci-fi, should we be surprised when the output resembles such sci-fi concepts?

visarga · 9 months ago

If it is trained on this forum it would learn even worse. We project a lot of doom imagination on AI here.

bdefig · 9 months ago

So maybe it turns out that we can't program Asimov's Three Laws deterministically. They're probabilistically learned — meaning LLMs can exhibit antisocial behavior some percentage of the time.

intelVISA · 9 months ago

Wish I got paid openAI money to produce such fan fiction :(

emmanueloga_ · 9 months ago

I hope people remember that in AI "oversight mechanisms" isn’t about preventing some "Skynet" scenario; it’s about ensuring these systems don’t malfunction.

Current AI tech is far from being self-aware, it’s advanced math following patterns. It wouldn't be too crazy to see an LLM somehow generating the program: if (sensor.reading() > Math.random()) { launchMissiles();}, and then that code somehow becoming operational due to, well, oversight. But papers like these seems written to exploit the subject for hype and marketing.

BiteCode_dev · 9 months ago

Well, it's also trained on scifi corpus where the ai acts that way. So it'ls logical it's in the weights.

freehorse · 9 months ago

And the bing chat was trying to get a man to divorce his wife. Is this the level of "AI safety research" nowadays?

onlyrealcuzzo · 9 months ago

In what percentage of cases did it hallucinate and do something that made it even worse?

efangs · 9 months ago

This is dumb because it's literally doing what it has been instructed to do, as well as given access to means to carry out those actions. Just another hype mechanism for OpenAI.

I have a masters degree in math/physics, and 10+ years of being a SWE in strong tech companies. I have come to rely on these models (Claude > oai tho) daily.

It is insane how helpful it is, it can answer some questions at phd level, most questions at a basic level. It can write code better than most devs I know when prompted correctly...

I'm not saying its AGI, but diminishing it to a simple "chat bot" seems foolish to me. It's at least worth studying, and we should be happy they care rather than just ship it?

ernesto95 · 9 months ago

Interesting that the results can be so different for different people. I have yet to get a single good response (in my research area) for anything slightly more complicated than what a quick google search would reveal. I agree that it’s great for generating quick functioning code though.

planb · 9 months ago

> I have yet to get a single good response (in my research area) for anything slightly more complicated than what a quick google search would reveal.

Even then, with search enabled it's ways quicker than a "quick" google search and you don't have to manually skip all the blog-spam.

amarcheschi · 9 months ago

I'm using it to aide in writing pytorch code and God if it's awful except for the basic things. It's a bit more useful in discussing how to do things rather than actually doing them though, I'll give you that

shadowmanif · 9 months ago

I think the human variable is that you need to know enough to be able to ask the right questions about a subject while not knowing enough about the subject to learn something from the answers.

Because of this, I would assume it is better for people who have interest with more breadth than depth and less impressive to those who have interest that are narrow but very deep.

It seems obvious to me the polymath gains much more from language models than the single minded subject expert trying to dig the deepest hole.

Also, the single minded subject expert is randomly at the mercy of what is in the training data much more in a way than the polymath when all the use is summed up.

kshacker · 9 months ago

I have the $20 version, I fed it code form a personal project, and it did a commendable job of critiquing it, giving me alternate solutions and then iterating on those solutions. Not something you can do with Google.

For example, ok, I like your code but can you change this part to do this. And it says ok boss and does it.

But over multiple days, it loses context.

I am hoping to use the 200$ version to complete my personal project over the Christmas holidays. Instead of me spending a week, I maybe will spend 2 days with chatgpt and get a better version than I initially hoped to.

mmmore · 9 months ago

Have you used the best models (i.e. ones you paid for)? And what area?

I've found they struggle with obscure stuff so I'm not doubting you just trying to understand the current limitations.

richardw · 9 months ago

Try turn search on in ChatGPT and see if it picks up the online references? I've seen it hit a few references and then get back to me with info summarised from multiple. That's pretty useful. Obviously your case might be different, if it's not as smart at retrieval.

eikenberry · 9 months ago

My guess is that it has more to do with the person than the AI.

TiredOfLife · 9 months ago

How do you get Google search to give useful results? Often for me the first 20 results have absolutely nothing to do with fhe search query.

sixothree · 9 months ago

The comments in this thread all seem so short sighted. I'm having a hard time understanding this aspect of it. Maybe these are not real people acting in good faith?

People are dismissive and not understanding that we very much plan to "hook these things up" and give them access to terminals and APIs. These very much seem to be valid questions being asked.

mmmore · 9 months ago

Not only do we very much plan to, we already do!

refulgentis · 9 months ago

HN is honestly pretty poor on AI commentary, and this post is a new low.

Here, at least, I think there must be a large contributing factor of confusion about what a "system card" shows.

The general factors I think contribute, after some months being surprised repeatedly:

- It's tech, so people commenting here generally assume they understand it, and in day-to-day conversation outside their job, they are considered an expert on it.

- It's a hot topic, so people commenting here have thought a lot about it, and thus aren't likely to question their premises when faced with a contradiction. (c.f. the odd negative responses have only gotten more histrionic with time)

- The vast majority of people either can't use it at work, or if they are, it's some IT-procured thing that's much more likely to be AWS/gCloud thrown together, 2nd class, APIs, than cutting edge.

- Tech line workers have strong antibodies to tech BS being sold by a company as gamechanging advancements, from the last few years of crypto

- Probably by far the most important: general tech stubborness. About 1/3 to 1/2 of us believe we know the exact requirements for Good Code, and observing AI doing anything other than that just confirms it's bad.

- Writing meta-commentary like this, or trying to find a way to politely communicate "you don't actually know what you're talking about just because you know what an API is and you tried ChatGPT.app for 5 minutes", are confrontational, declasse, and arguably deservedly downvoted. So you don't have any rhetorical devices that can disrupt any of the above factors.

consumer451 · 9 months ago

I am curious if you have played with Claude-based agent tools like Windsurf IDE at all, and if you find that interesting.

I am a product-ish guy, who has a basic understanding of SQL, Django, React, Typescript, etc.. and suddenly I'm like an MVP v0.1 a week, all by myself.

Do folks at your level find things like Cline, Cursor, and Windsurf useful at all?

Windsurf IDE (Sonnet) blows my mind.

nichochar · 9 months ago

I am building https://srcbook.com which is in this category but focused on webapps.

It's unreal what the AI can do tbh.

hackernewds · 9 months ago

why windsurf as opposed to something mainstream like vs or cursor? unless there's some conflict of interest

dang · 9 months ago

(this comment was originally a reply to https://news.ycombinator.com/item?id=42331323)

Palomides · 9 months ago

can you give an example of a prompt and response you find impressive?

nichochar · 9 months ago

try the thing i'm building, it will build a website for you from a simple prompt: https://srcbook.com