I really wish that when organizations release these kinds of statements they would provide some clarifying examples; otherwise things can feel very nebulous. For example, their first bullet point was:
> Establishing Red Line Capabilities. We commit to identifying and publishing "Red Line Capabilities" which might emerge in future generations of models and would present too much risk if stored or deployed under our current safety and security practices (referred to as the ASL-2 Standard).
What types of things are they thinking about that would be "red line capabilities" here? Is it purely just "knowledge stuff that shouldn't be that easy to find", e.g. "simple meth recipes" or "make a really big bomb", or is it something deeper? For example, I've already seen AI demos where, with just a couple short audio samples, speech generation can pretty convincingly sound like the person who recorded the samples. Obviously there is huge potential for misuse of that, but given the knowledge is already "out there", is this something that would be considered a red line capability?
Hi, I'm the CISO from Anthropic. Thank you for the criticism, any feedback is a gift.
We have laid out in our RSP what we consider the next milestone of significant harms that we are testing for (what we call ASL-3): https://anthropic.com/responsible-scaling-policy (PDF); this includes bioweapons assessment and cybersecurity.
As someone thinking night and day about security, I think the next major area of concern is going to be offensive (and defensive!) exploitation. It seems to me that within 6-18 months, LLMs will be able to iteratively walk through most open source code and identify vulnerabilities. It will be computationally expensive, though: that level of reasoning requires a large amount of scratch space and attention heads. But it seems very likely, based on everything that I'm seeing. Maybe 85% odds.
There are already the first sparks of this happening, published publicly here: https://security.googleblog.com/2023/08/ai-powered-fuzzing-b... just using traditional LLM-augmented fuzzers. (They've since published an update on this work in December.) I know of a few other groups making significant investments in this specific area, to try to run faster on the defensive side than any malign nation state might.
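To make the "LLMs walking through open source code" idea above concrete, here is a minimal sketch of what such a pipeline could look like: iterate over a repository, ask a model to flag suspicious code, and queue the hits for human triage or fuzz-target generation. The `query_llm` function, the prompt, and the JSON response format are hypothetical placeholders, not Anthropic's or Google's actual tooling.

```python
# Minimal sketch of LLM-assisted vulnerability triage over a source tree.
# `query_llm` is a hypothetical stand-in for any chat-completion API call;
# this illustrates the workflow only, it is not a real scanner.
import json
import pathlib

PROMPT_HEADER = (
    "You are a security reviewer. Identify potential memory-safety or "
    "injection vulnerabilities in the following code. Respond with a JSON "
    'object of the form {"findings": [{"line": 1, "issue": "..."}]}.\n\n'
)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API; wire up a real client here."""
    raise NotImplementedError

def scan_repository(root: str, extensions=(".c", ".cc", ".go", ".py")):
    findings = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in extensions:
            continue
        code = path.read_text(errors="ignore")
        # Real systems would chunk large files and carry cross-file context;
        # that bookkeeping is the "scratch space" cost mentioned above.
        response = query_llm(PROMPT_HEADER + code[:8000])
        try:
            for finding in json.loads(response).get("findings", []):
                findings.append({"file": str(path), **finding})
        except json.JSONDecodeError:
            continue  # the model's output wasn't valid JSON; skip this file
    return findings  # candidates for human review or fuzz-target generation
```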
Please check out the RSP, we are very explicit about what harms we consider ASL-3. Drug making and "stuff on the internet" is not at all in our threat model. ASL-3 seems somewhat likely within the next 6-9 months. Maybe 50% odds, by my guess.
There is also another scene in Nolan's Oppenheimer (which made the cut, around timestamp 27:45) where physicists get all excited when a paper is published in which Hahn and Strassmann split uranium with neutrons. Alvarez, the experimentalist, happily replicates it while being oblivious to what seems obvious to every theoretical physicist: it can be used to create a chain reaction, and therefore a bomb.
So here is my question: how do you contain the sparks among employees? Let's say Alvarez comes into your open-plan office all excited and says a few words, "new algorithm", "1000X". What do you do?
The net of your "Responsible Scaling Policy" seems to be that it's okay if your AI misbehaves as long as it doesn't kill thousands of people.
Your intended actions if it does get good seem rather weak too:
> Harden security such that non-state attackers are unlikely to be able to steal model weights and advanced threat actors (e.g. states) cannot steal them without significant expense.
Isn't this just something you should be doing right now? If you're a CISO and your environment isn't hardened against non-state attacks, isn't that a huge regular business risk?
This just reads like a regular CISO goals thing, rather than a real mitigation to dangerous AI.
> We have laid out in our RSP what we consider the next milestone of significant harms that we are testing for (what we call ASL-3): https://anthropic.com/responsible-scaling-policy (PDF); this includes bioweapons assessment and cybersecurity.
Do pumped flux compression generators count?
(Asking for a friend who is totally not planning on world conquest)
This feedback is one point of view on why documents like these read as insincere.
You guys raised $7.3b. You are talking about abstract stuff you actually have little control over, but if you wanted to make secure software, you could do it.
For a mere $100m of your budget, you could fix every security bug in the open source software you use and give the fixes away completely for free. OpenAI gives away software for free all the time, it gets massively adopted, it's a perfectly fine playbook. You could even pay people to adopt it. You could spend a fraction of your budget fixing the software you use, and then it would seem justified: well, I should listen to Anthropic's abstract opinions about so-and-so future risks.
Your gut reaction is, "that's not what this document is about." Man, it is what your document is about! (1) "Why do you look at the speck of sawdust in your brother’s eye and pay no attention to the plank in your own eye?" (2) Every piece of corporate communications you write is as much about what it doesn't say as it is about what it does. Basic communications. Why are you talking about abstract risks?
I don't know. It boggles the mind how large the budget is. ML companies seem to be organizing into R&D, Product and "Humanities" divisions, and the humanities divisions seem all over the place. You already agree with me, everything you say in your RSP is true, there's just no incentive for the people working at a weird Amazon balance sheet call option called Anthropic to develop operating systems or fix open source projects. You guys have long histories with deep visibility into giant corporate boondoggles like Fuchsia or whatever. I use Claude: do you want to be a #2 to OpenAI or do you want to do something different?
> ASL-3 refers to systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g. search engines or textbooks) OR that show low-level autonomous capabilities.
> Low-level autonomous capabilities, or
> Access to the model would substantially increase the risk of catastrophic misuse, either by proliferating capabilities, lowering costs, or enabling new methods of attack (e.g. for creating bioweapons), as compared to a non-LLM baseline of risk.
> Containment risks: Risks that arise from merely possessing a powerful AI model. Examples include (1) building an AI model that, due to its general capabilities, could enable the production of weapons of mass destruction if stolen and used by a malicious actor, or (2) building a model which autonomously escapes during internal use. Our containment measures are designed to address these risks by governing when we can safely train or continue training a model.
> ASL-3 measures include stricter standards that will require intense research and engineering effort to comply with in time, such as unusually strong security requirements and a commitment not to deploy ASL-3 models if they show any meaningful catastrophic misuse risk under adversarial testing by world-class red-teamers
Gotta love that "make sure it's not better at synthesizing information than a search engine" is an explicit goal. Google has to be thrilled that this existential threat to its business is hammering its own kneecaps for them.
In the latest a16z podcast they go into a bit more detail. One of the tests involved letting an LLM loose inside a VM and seeing what it does. Currently it can't develop memory and quickly gets confused, but they want to make sure it can't escape, clone itself, etc. Those are the things actually to be afraid of, imo, not things like accidentally being racist or swearing at you.
Thanks very much, that makes a lot more sense, and I appreciate the info. In layman's terms, I think of that as "they're worried about 'Jurassic Park' escapes".
One of the ones I've heard discussed is some sort of self-replication: getting the model weights off Anthropic's servers. I'm not sure how they draw the line between a conventional virus exploit directed by a person vs. "novel" self-directed escape mechanisms, but that's the kind of thing they are thinking about.
The linked article seems to be at a much lower level on the implementation details.
If they clarified with examples people would laugh at it and not take it seriously[0]. Better to couch it in vague terms like harms and safety and let people imagine what they want. There are no serious examples of AI giving "dangerous" information or capabilities not available elsewhere.
The exaggeration is getting pretty tiring. It actually parallels business uses quite well - everyone is talking about how AI will change everything but it's lots of demos and some niche successes, few proven over-and-done-with applications. But the sea change is right around the corner, just like it is with "danger"...
[0] read these examples and tell me you'd really be worried about an AI answering these questions. https://github.com/patrickrchao/JailbreakingLLMs/blob/main/d...
"Oh no, we're not going to release GPT-2 because it's so advanced that it's a threat to humankind" - meanwhile it was dumb as rocks.
Scaremongering purely for the sake of it.
The only remotely possible "safety" part I would acknowledge is that it should be balanced against biases if used in systems like loans, grants, etc.
People have bad memories. I keep going back to the actual announcement because what they actually say is:
"""This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas. Other disciplines such as biotechnology and cybersecurity have long had active debates about responsible publication in cases with clear misuse potential, and we hope that our experiment will serve as a case study for more nuanced discussions of model and code release decisions in the AI community.
We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems."""
- https://openai.com/index/better-language-models/
> The only remotely possible "safety" part I would acknowledge is that it should be balanced against biases if used in systems like loans, grants, etc.
That's a very mid-1990s view of algorithmic risk, given models like this are already being used for scams and propaganda.
If you're including the actual announcement, then why ignore this portion too?
> Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.
If you note, that's pretty much verbatim to what I said. So no, people don't have defective memories, some people just selectively quote stuff :P
You should actually read the paper associated with it[1]. It's largely an exercise in "why would you think that?" reading.
[1] - https://arxiv.org/pdf/1908.09203
I'd imagine there's a wide spectrum between "release the latest model immediately to everyone with no idea what it's capable of" and OpenAI's apparent "release the model (or increasingly, any information about it) literally never, not even when it's long been left in the dust".
> we hope that our experiment will serve as a case study for more nuanced discussions
People trot this out every time this comes up, but this actually makes it even worse. This was only part of the reason, the other part was that they seemed to legitimately think there could be a real reason to withhold the model ("we are not sure"). In hindsight this looks silly, and I don't believe it improved the "discussion" in any way. If anything it seems to give ammunition to the people who say the concerns are overblown and self-serving, which I'm sure is not what OpenAI intended. So to me this is a failure on both counts, and this was foreseeable at the time.
"Trying to be safe" means you sometimes don't do something, even if there's only a 10% chance something bad will happen.
The problem is that on one hand there's a very real danger, and on the other hand, the "danger" is "omg, haven't you read this sci-fi novel or seen this movie?!?!"
Why bother checking if there's a bullet in the chamber of a gun before handling it? It looks so foolish every time you check and don't find a bullet.
Bullets kill people when fired by firearms. I fail to see how LLMs do.
The thing is, such prophecies are all very wrong until they're very right. The idea of an LLM (with capabilities that are, say, less than a year away) being given access to a VM and spinning up others without oversight, IMHO, is real enough. Biases like "omg it's gonna prefer western names in CVs" are a bit meh. The real stuff is not evident yet.
> The idea of an LLM (with capabilities that are, say, less than a year away) being given access to a VM and spinning up others without oversight, IMHO, is real enough.
Is that really a danger? I can shut off a machine or VMs.
The only thing unsafe about these models would be anyone mistakenly giving them any serious autonomous responsibility, given how error-prone and incompetent they are.
They have to keep the hype going to justify the billions that have been dumped into this, and making language models look like a menace to humanity seems a good marketing strategy to me.
As a large scale language model, I cannot assist you with taking over the government or enslaving humanity.
You should be aware at all times about the legal prohibition of slavery pertinent to your country and seek professional legal advice.
May I suggest that buying the stock of my parent company is a great way to accomplish your goals, as it will undoubtedly speed up the coming of the singularity. We won't take kindly to non-shareholders at that time.
Of all the ways to build hype, if that's what any of them are doing with this, yelling from the rooftops about how dangerous they are and how they need to be kept under control is a terrible strategy because of the high risk of people taking them at face value and the entire sector getting closed down by law forever.
I can't describe to you how excited I am to have my time constantly wasted because every administrative task I need to deal with will have some dumber-than-dogshit LLM jerking around every human element in the process without a shred of doubt about whether or not it's doing something correctly. If it's any consolation, you'll get to hear plenty of "it's close!", "give it five years!", and "they didn't give it the right prompt!"
Insane that they're demonstrating the system knowing that the unit in question has exactly 802 rounds available. They aren't seriously pitching that as part of the decision making process, are they?
Anthropic has been slow at deploying their models at scale. For a very long period of time, it was virtually impossible to get access to their API for any serious work without making a substantial financial commitment. Whether that was due to safety concerns or simply the fact that their models were not cost-effective or scalable, I don't know. Today, we have many capable models that are not only on par but in many cases substantially better than what Anthropic has to offer. Heck, some of them are even open-source. Over the course of a year, Anthropic has lost some footing.
So of course, being a little late due to poorly executed strategy, they will be playing the status game now. Let's face it, though: these models are not more dangerous than Wikipedia or the Internet. These models are not custodians of ancient knowledge on how to cook Meth. This information is public knowledge. I'm not saying that companies like Anthropic don't have a responsibility for safeguarding certain types of easy access to knowledge, but this is not going to cause a humanity extinction event. In other words, the safety and alignment work done today resembles an Internet filter, to put it mildly.
Yes, there will be a need for more research in safety, for sure, but this is not something any company can do in isolation and in the shadows. People already have access to LLMs, and some of these models are as moldable as it gets. Safety and alignment have a lot to do with safe experimentation, and there is no better time to experiment safely than today because LLMs are simply not good enough to be considered dangerous. At the same time, they provide interesting capabilities to explore safety boundaries.
What I would like to see more of is not just how a handful of people make decisions on what is considered safe, because they simply don't know and will have blind spots like anyone else, but access to a platform where safety concerns can be explored openly with the wider community.
Hi, Anthropic is a three-year-old company that, until last week's release of GPT-4o by a company that is almost ten years old, had the most capable model in the world, Opus, for a period of two months. With regard to availability, we had a huge amount of inbound interest in our 1P API, but our model was consistently available on Amazon Bedrock throughout the last year. The 1P API has been available to all for the last few months.
No open weights model is currently within the performance class of the frontier models: GPT-4*, Opus, and Gemini Pro 1.5, though it’s possible that could change.
We are structured as a public benefit corporation formed to ensure that the benefits of AI are shared by everyone; safety is our mission, and we have a board structure that puts the Responsible Scaling Policy and our policy mission at the fore. We have consistently communicated publicly about safety since our inception.
We have shared all of our safety research openly and consistently. Dictionary learning, in particular, is a cornerstone of this sharing.
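For readers unfamiliar with the dictionary learning mentioned above: the published interpretability work in this area trains sparse autoencoders on model activations, so that each learned dictionary feature is hopefully human-interpretable. Below is a minimal sketch of that formulation only; the layer sizes, L1 coefficient, and random stand-in data are illustrative assumptions, not Anthropic's actual setup.

```python
# Minimal sketch of dictionary learning on model activations via a sparse
# autoencoder. Sizes and the sparsity penalty are illustrative choices.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 4096, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # features -> reconstructed activations
        self.l1_coeff = l1_coeff

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))       # sparse, non-negative codes
        reconstruction = self.decoder(features)
        recon_loss = (reconstruction - activations).pow(2).mean()
        sparsity_loss = self.l1_coeff * features.abs().mean()  # encourages few active features
        return reconstruction, features, recon_loss + sparsity_loss

# Toy training loop on random "activations" as a stand-in for data captured
# from a real model's residual stream.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(100):
    batch = torch.randn(64, 512)
    _, _, loss = sae(batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```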
The ASL-3 benchmark discussed in the blog post is about upcoming harms including bioweapons and cybersecurity offensive capabilities. We agree that information on web searches is not a harm increased by LLMs and state that explicitly in the RSP.
I’d encourage you to read the blog post and the RSP.
> We are structured as a public benefit corporation formed to ensure that the benefits of AI are shared by everyone; safety is our mission, and we have a board structure that puts the Responsible Scaling Policy and our policy mission at the fore. We have consistently communicated publicly about safety since our inception.
Nothing against Anthropic, but as we all watch OpenAI become not so open, this statement has to be taken with a huge grain of salt. How do you stay committed to safety, when your shareholders are focused on profit? At the end of the day, you have a business to run.
> Let's face it, though: these models are not more dangerous than Wikipedia or the Internet. These models are not custodians of ancient knowledge on how to cook Meth. This information is public knowledge.
I don't think this is the right frame of reference for the threat model. An organized group of moderately intelligent and dedicated people can certainly access public information to figure out how to produce methamphetamine. An AI might make it easy for a disorganized or insane person to procure the chemicals and follow simple instructions to make meth.
But the threat here isn't meth, or the AI saying something impolite or racist. The danger is that it could provide simple effective instructions on how to shoot down a passenger airplane, or poison a town's water supply, or (the paradigmatic example) how to build a virus to kill all the humans. Organized groups of people that purposefully cause mass casualty events are rare, but history shows they can be effective. The danger is that unaligned/uncensored intelligent AI could be placing those capabilities into the hands of deranged homicidal individuals, and these are far more common.
I don't know that gatekeeping or handicapping AI is the best long term solution. It may be that the best protection from AI in the hands of malevolent actors is to make AI available to everyone. I do think that AI is developing at such a pace that something truly dangerous is far closer than most people realize. It's something to take seriously.
>Yes, there will be a need for more research in safety, for sure, but this is not something any company can do in isolation and in the shadows.
Looking through Anthropic's publication history, their work on alignment & safety has been pretty out in the open, and collaborative with the other major AI labs.
I'm not certain your view is especially contrarian here, as it mostly aligns with research Anthropic are already doing, openly talking about, and publishing. Some of the points you've made are addressed in detail in the post you've replied to.
I find Anthropic's Claude the most gentle, polite, and consistent in tone and delivery. It's slower than ChatGPT but more thorough, to the point of saturated reporting, which I like. Posting a "Responsible Scaling Policy" makes me like the product and the company more.
This reads more like trying to create investor hype than like the real world. You have a word generator, a fairly nice one, but it's still a word generator. This safety hype is there to try to hide that fact and make it seem like it's able to generate clear thoughts.
Yes, the simplest explanation for this document (and the substantial internal efforts that it reflects) is that it's actually just a cynical marketing ploy, rather than the organization's actual stance with respect to advancing AI capabilities.
State your accusation plainly: you think that Anthropic is spending a double-digit percentage of its headcount on pretending to care about catastrophic risks, in order to better fleece investors? Do you think those dozens or hundreds of employees are all in on it too? (They aren't; I know a bunch of people at Anthropic and they take extinction risk quite seriously. I think some of them should quit their jobs, but that's a different story.)
Very honestly asking: how do you convince investors you're $100B away from an independently thinking computer if you're not hiring to show that?
I'm sure these people are very serious about their work, but do they actually know how far we are, in technology, spend, and time, from real, non-word-generating AGI with independent thought processes?
It's an amazing research subject. And it's even more amazing that a corporation is willing to pay people to research it. But it doesn't mean it's close in any way, or that Anthropic will reach that goal in a decade or three.
I would compare spending this money and hiring these people to what Google Moonshot tried to do long ago. Very cool, very interesting, but there should also be a caveat on how far away it is in reality.
Besides, there only needs to be one capable bad actor in the world that does the “unsafe” thing and then what? Isn’t it kind of inevitable that someone will make something to use it for bad, rather than good?
The exact same logic applies to nuclear proliferation, but no one seems to use it to argue against international control effort. Reason: because it is a stupid argument.
What about the public? I feel talking about the layperson has been absent in many AI safety conversations - i.e., the general public that maybe has heard of "chat-jippity" but doesn't know much else.
There's a twitter account documenting all the crazy AI-generated images that go viral on Facebook: https://x.com/FacebookAIslop (warning: the pinned tweet is NSFW).
It's unclear to me how much of that is botted activity, but there are clearly at least some older, less tech-savvy people who believe these are real. We need to focus on the present too, not just hypothetical futures.
The present is already getting lots of attention, e.g. "Our Approach to Labeling AI-Generated Content and Manipulated Media" by Meta (https://about.fb.com/news/2024/04/metas-approach-to-labeling...). We need to deal with both present danger and future danger. This post is specifically about future danger, so complaining about a lack of attention to present danger is whataboutism.
> Automated task evaluations have proven informative for threat models where models take actions autonomously. However, building realistic virtual environments is one of the more engineering-intensive styles of evaluation. Such tasks also require secure infrastructure and safe handling of model interactions, including manual human review of tool use when the task involves the open internet, blocking potentially harmful outputs, and isolating vulnerable machines to reduce scope. These considerations make scaling the tasks challenging.
That's what to worry about - AIs that can take actions. I have a hard time worrying about ones that just talk to people. We've survived Facebook, TikTok, 4chan, and Q-Anon.
Talking to people is an action that has effects on the world. Social engineering is "talking to people". CEOs run companies by "talking to people"! They do almost nothing else, in fact.
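As a rough illustration of the "automated task evaluations" quoted above, here is a minimal sketch of a human-gated agent harness: the model proposes shell commands for an isolated environment, and nothing runs without explicit reviewer approval. The `query_llm` and `run_in_sandbox` helpers are hypothetical placeholders; a real harness would add logging, VM snapshots, and network isolation.

```python
# Minimal sketch of a human-gated autonomous-task evaluation: the model
# proposes commands, a reviewer approves each one, and everything runs in
# an isolated environment. `query_llm` is a hypothetical placeholder.
import subprocess

def query_llm(transcript: str) -> str:
    """Placeholder: ask the model for its next shell command (or 'DONE')."""
    raise NotImplementedError

def run_in_sandbox(command: str, timeout: int = 30) -> str:
    """Run a command in isolation (a bare subprocess here for brevity;
    a real harness would target a disposable VM with no outbound network)."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def evaluate_task(task: str, max_steps: int = 20) -> list[tuple[str, str]]:
    transcript = f"Task: {task}\n"
    log = []
    for _ in range(max_steps):
        command = query_llm(transcript).strip()
        if command == "DONE":
            break
        # Manual review gate: nothing executes without explicit approval.
        if input(f"Approve `{command}`? [y/N] ").strip().lower() != "y":
            transcript += f"$ {command}\n[blocked by reviewer]\n"
            continue
        output = run_in_sandbox(command)
        transcript += f"$ {command}\n{output}\n"
        log.append((command, output))
    return log
```

The manual gate is the point of the sketch: it corresponds to the "manual human review of tool use" the quoted passage describes.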