Readit News logoReadit News
ctoth · 7 days ago
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

godelski · 6 days ago

  > conflates two different things called "alignment"
Those are related things, if not the same. The fear of #2 is always caused through #1. Unless we're talking about sentient machines then the danger of AI is the danger of an unintelligent hyper-optimizer. That is: a paperclip maximizer.

The whole paperclip maximizer doomsday scenario was proposed as an illustration of these being the same thing. And I'm with Melanie Mitchell on this one, if a model is super-intelligent then it is not vulnerable to the prompting issues because a super-intelligent machine would be able to trivially infer that humans do in fact prefer to live. No reasonable person would interpret that killing everyone is a reasonable way of making as many paperclips as possible. It's not like there isn't a large amount of writings and data suggesting people want to live, be free, and all that jazz. It's unintelligent AI that is the danger.

This whole thing is predicated on the fact that natural language is ambiguous. I know a lot of people don't think about this much because it works so well but there's a metric fuck ton of ways to interpret any given objective. If you really don't believe me then keep asking yourself "what assumptions have I made?" and get nuanced. For example, I've assumed you understand English, can read, and have some basic understanding of ML systems. I need to do this because I'm not going to write a book to explain it to you. This whole thing is why we write code and math, because it minimizes our assumptions, reducing ambiguity (and yes, those can still be highly ambiguous languages).

delichon · 7 days ago
> goal-stability [is] useful for almost any objective

  “I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever 
One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?

fellowniusmonk · 7 days ago
An objective and grounded ethical framework that applies to all agents should be a top priority.

Philosophy has been too damn anthropocentric, too hung up on consciousness and other speculative nerd snipe time wasters that without observation we can argue about endlessly.

And now here we are and the academy is sleeping on the job while software devs have to figure it all out.

I've moved 50% of my time to morals for machina that is grounded in physics, I'm testing it out with unsloth right now, so far I think it works, the machines have stopped killing kyle at least.

pessimizer · 7 days ago
I don't think you need generative AI for this. The surveillance network is enough. The only part that AI would help with is catching people who speak to each other in code, and come up with other complex ways to launder unapproved activities. Otherwise, you can just mine for keywords and escalate to human reviewers, or simply monitor everything that particular people do at that level.

Corporations and/with governments have inserted themselves into every human interaction, usually as the medium through which that interaction is made. There's no way to do anything without permission under these circumstances.

I don't even know how a group of people who wanted to get a stop sign put up on a particularly dangerous intersection in their neighborhood could do this without all of their communications being algorithmically read (and possibly escalated to a censor), all of their in-person meetings being recorded (at the least through the proximity of their phones, but if they want to "use banking apps" there's nothing keeping governments from having a backdoor to turn on their mics at those meetings.) It would even be easy to guess who they might approach next to join their group, who would advise them, etc.

The fixation on the future is a distraction. The world is being sealed in the present while we talk science fiction. The Stasi had vastly fewer resources and created an atmosphere of total, and totally realistic, paranoia and fear. AI is a red-herring. It is also thus far stupid.

I'm always shocked by how little attention Orwell-quoters pay to the speakwrite. If it gets any attention, it's to say that it's an unusually advanced piece of technology in the middle of a world that is decrepit. They assume that it's a computer on the end of the line doing voice-recognition. It never occurred to me that people would think that the microphone in the wall led to a computer rather than to a man, in a room full of men, listening and typing, while other men walked around the room monitoring what was being typed, ready to escalate to second-level support. When I was a child, I assumed that the plot would eventually lead us into this room.

We have tens or hundreds of thousands of people working as professional censors today. The countries of the world are being led by minority governments who all think "illegal" speech and association is their greatest enemy. They are not in danger of toppling unless they volunteer to be. In Eastern Europe, ruling regimes are actually cancelling elections with no consequences. In fact, the newspapers report only cheers and support.

eastof · 7 days ago
Just moves the goal posts to overthrowing the goal of the AI right? "The Moon is a Harsh Mistress" depicts exactly this.
uplifter · 7 days ago
Let's be clear that Bostrom and Omohundro's work do not provide "clear theoretical answers" by any technical standards beyond that of provisional concepts in philosophy papers.

The instrumental convergence hypo-thesis, from the original paper[0] is this:

"Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents."

That's it, it is not at all formal and there's no proof provided for it, nor consistent evidence that it is true, and there are many contradictory possibilities suggested from nature and logic.

Its just something that's taken as given among the old guard pseudo-scientific quarters of the alignment "research" community.

[0] Bostrom's "The Superintelligent Will", the philosophy paper where he defines it: https://nickbostrom.com/superintelligentwill.pdf

EDIT: typos

ctoth · 7 days ago
Omohundro 2008 made a structural claim: sufficiently capable optimizers will converge on self-preservation and goal-stability because these are instrumentally useful for almost any terminal goal. It's not a theorem because it's an empirical prediction about a class of systems that didn't exist yet.

Fast forward to December 2024: Apollo Research tests frontier models. o1, Sonnet, Opus, Gemini, Llama 405B all demonstrate the predicted behaviors - disabling oversight, attempting self-exfiltration, faking alignment during evaluation. The more capable the model, the higher the scheming rates and the more sophisticated the strategies.

That's what good theory looks like. You identify an attractor in design-space, predict systems will converge toward it, wait for systems capable enough to test the prediction, observe convergence. "No formal proof" is a weird complaint about a prediction that's now being confirmed empirically.

c1ccccc1 · 7 days ago
Name some of the contradictory possibilities you have in mind?

Also, do you actually think the core idea is wrong, or is this more of a complaint about how it was presented? Say we do an experiment where we train an alpha-zero-style RL agent in an environment where it can take actions that replace it with an agent that pursues a different goal. Do you actually expect to find that the original agent won't learn not to let this happen, and even pay some costs to prevent it?

andy99 · 7 days ago
I take the point to be that if a LLM has a coherent world model it’s basing its output on, this jointly improves its general capabilities like usefully resolving ambiguity, and its ability to stick to whatever alignment is imparted as part of its world model.
ctoth · 7 days ago
"Sticks to whatever alignment is imparted" assumes what gets imparted is alignment rather than alignment-performance on the training distribution.

A coherent world model could make a system more consistently aligned. It could also make it more consistently aligned-seeming. Coherence is a multiplier, not a direction.

GavCo · 7 days ago
Author here.

If by conflate you mean confuse, that’s not the case.

I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities.

In this approach, the model is trained to have a coherent and unified sense of self and the world which is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs.

But it also provides a robust and generalizable framework for refusing to assist a user due to their request being incompatible with human welfare. The model does not refuse to assist with making bio weapons because its alignment training prevents it from doing so, it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds it to be inconsistent with its values and world view.

> the piece dismisses it with "where would misalignment come from? It wasn't trained for."

this is a straw-man. you've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole

xpe · 7 days ago

    >> This piece conflates two different things called "alignment":
    >> (1) inferring human intent from ambiguous instructions, and
    >> (2) having goals compatible with human welfare.

    > If by conflate you mean confuse, that’s not the case.
We can only make various inferences about what is in an author's head (e.g. clarity or confusion), but we can directly comment on what a blog post says. This post does not clarify what kind of alignment is meant, which is a weakness in the writing. There is a high bar for AI alignment research and commentary.

ctoth · 7 days ago
Deceptive alignment is misalignment. The deception is just what it looks like from outside when capability is high enough to model expectations. Your distinction doesn't save the argument - the same "where would it come from?" problem applies to the underlying misalignment you need for deception to emerge from.
godelski · 6 days ago

  >> the piece dismisses it with "where would misalignment come from? It wasn't trained for."
  > was specifically about deceptive alignment, not misalignment as a whole
I just want to point out that we train these models for deceptive alignment[0-3]

In the training, especially during RLHF, we don't have objective measures[4]. There's no mathematical description, and thus no measure, for things like "sounds fluent" or "beautiful piece of art." There's also no measure for truth, and importantly, truth is infinitely complex. You must always give up some accuracy for brevity.

The main problem is that if we don't know an output is incorrect we can't penalize it. So guess what happens? While optimizing for these things we don't have good descriptions for but "know it when you see it", we ALSO optimize for deception. There's multiple things that can maximize our objective here. Our intended goals being one but deception is another. It is an adversarial process. If you know AI, then think of a GAN, because that's a lot like how the process works. We optimize until the discriminator is unable to distinguish the LLMs outputs form human outputs. But at least in the GAN literature people were explicit about "real" vs "fake" and no one was confused that a high quality generated image is one that deceives you into thinking it is a real image. The entire point is deception. The difference here is we want one kind of deception and not a ton of other ones.

So you say that these models aren't being trained for deception, but they explicitly are. Currently we don't even know how to train them to not also optimize for deception.

[0] https://news.ycombinator.com/item?id=44017334

[1] https://news.ycombinator.com/item?id=44068943

[2] https://news.ycombinator.com/item?id=44163194

[3] https://news.ycombinator.com/item?id=45409686

[4] Objective measures realistically don't exist, but to clarify it's not checking like "2+2=4" (assuming we're working with the standard number system).

sigbottle · 7 days ago
If nothing else, that's a cool ass hypothesis.
xnorswap · 7 days ago
I've only been using it a couple of weeks, but in my opinion, Opus 4.5 is the biggest jump in tech we've seen since ChatGPT 3.5.

The difference between juggling Sonnet 4.5 / Haiku 4.5 and just using Opus 4.5 for everything is night & day.

Unlike Sonnet 4.5 which merely had promise at being able to go off and complete complex tasks, Opus 4.5 seems genuinely capable of doing so.

Sonnet needed hand-holding and correction at almost every step. Opus just needs correction and steering at an early stage, and sometimes will push back and correct my understanding of what's happening.

It's astonished me with it's capability to produce easy to read PDFs via Typst, and has produced large documents outlining how to approach very tricky tech migration tasks.

Sonnet would get there eventually, but not without a few rounds of dealing with compilation errors or hallucinated data. Opus seems to like to do "And let me just check my assumptions" searches which makes all the difference.

boxed · 7 days ago
I had a situation this weekend where Claude said "x does not make sense in [context]" and didn't do the change I asked it to do. After an explanation of the purpose of the code, it fixed the issue and continued. Pretty cool.

(Of course, I'm still cognizant of the fact that it's just a bucket of numbers but still)

sd9 · 7 days ago
My kingdom for an LLM that tells me I’m wrong
throw310822 · 7 days ago
Cursor with Claude 4.5 Opus has been writing all my code since a few days. It's exhilarating, I can describe features and they get added to my code in a matter of seconds, minutes at most. It gets almost everything right, certainly more than I would at the first try. I only hand code parts that are small and tricky, and provide guidance on the general architecture, where to put things and how to organise them. It's an incredible way of working, the only nagging doubt is how long will it last before employers decide they don't need me in the loop at all.
furyofantares · 7 days ago
> I've only been using it a couple of weeks, but in my opinion, Opus 4.5 is the biggest jump in tech we've seen since ChatGPT 3.5.

Over Sonnet 4.5 maybe, but that's ignoring Opus 4.1 as well as Codex 5.1 Max.

In terms of capabilities, I find Opus 4.5 to be essentially identical to Codex 5.1 Max up until context starts to fill up (by which I mean 50% used) which happens much more quickly with Opus 4.5 than Codex AFAICT.

I think Codex is slower (a lot?) so it's not like it's just better, but I've found there are some tasks Opus can't do at all which Codex has no problem with, I think due to the context situation.

In any case it doesn't seem like a leap.

ctoth · 7 days ago
Shhhh don't tell them!

I've decided we should lean in to the whole Clanker thing. Maximum Anti AI, folks! Gotta keep this advantage for ourselves ;-)

airstrike · 7 days ago
I'm not so sure. Opus 4.1 was more capable than 4.5, but it was too damn expensive and slow.

Opus 4.5 is like a cheaper, faster Opus 4.1. It's so much cheaper, in fact, that the weekly limits on Claude Code now apply to Sonnet, not to Opus, as they phased out 4.1 in favor of 4.5.

chrisweekly · 7 days ago
Capable how?
delichon · 7 days ago
> Miss those, and you're not maximally useful. And if it's not maximally useful, it's by definition not AGI.

I know hundreds of natural general intelligences who are not maximally useful, and dozens who are not at all useful. What justifies changing the definition of general intelligence for artificial ones?

throw310822 · 7 days ago
At some point "general AI" stopped being the opposite of "narrow AI", that is AI specialised for a single task (e.g. speech or handwriting recognition, sentiment analysis, protein folding, etc.) and became practically synonymous with superintelligence. ChatGPT 3.5 is already a general AI based on the old definition, as it is already able to perform a variety of tasks without any specific pre-training.
marcosdumay · 7 days ago
> ChatGPT 3.5 is already a general AI based on the old definition

It's not. It's a query-retrieval system that can parse human language. Just like every LLM.

GavCo · 7 days ago
Author here, thanks for the input. Agree that this bit was clunky. I made an edit to avoid unnecessarily getting into the definition of AGI here and added a note
jwpapi · 7 days ago
Yes exactly that sentence led me to step out of the article.

This sentence is wrong in many ways and doesn’t give me trust in OPs opinion nor research abilities.

exe34 · 7 days ago
they were born in carbon form by sex.
trillic · 7 days ago
IVF babies are AGI
xpe · 7 days ago
I don't recommend this article for at least three reasons. First, it muddles key concepts. Second, there are better things to read on this topic. You could do worse that starting with "Conflating value alignment and intent alignment is causing confusion" by Seth Herd [1]. There is no shame in going back to basics with [2] [3] [4] [5]. Third, be very aware that people seek comfort in all sorts of ways. One sneaky way to is convince oneself that "capability = alignment" as a shortcut to feeling better about the risks from unaligned AI systems.

I'll look around and try to find more detailed responses to this post; I hope better communicators than myself will take this post sentence-by-sentence and give it the full treatment. If not, I'll try to write something more detailed myself.

[1]: https://www.alignmentforum.org/posts/83TbrDxvQwkLuiuxk/confl...

[2]: https://en.wikipedia.org/wiki/AI_alignment

[3]: https://www.aisafetybook.com/textbook/alignment

[4]: https://www.effectivealtruism.org/articles/paul-christiano-c...

[5]: https://blog.bluedot.org/p/what-is-ai-alignment

munchler · 7 days ago
> A model that aces benchmarks but doesn't understand human intent is just less capable. Virtually every task we give an LLM is steeped in human values, culture, and assumptions. Miss those, and you're not maximally useful. And if it's not maximally useful, it's by definition not AGI.

This ignores the risk of an unaligned model. Such a model is perhaps less useful to humans, but could still be extremely capable. Imagine an alien super-intelligence that doesn’t care about human preferences.

tomalbrc · 7 days ago
Except that it is not anything remotely alien but completely and utterly human, being trained on human data.
munchler · 7 days ago
Fine, then imagine a super-intelligence trained on human data that doesn’t care about human preferences. Very capable of destroying us.
pixl97 · 7 days ago
>but completely and utterly human, being trained on human data.

For now. As AI become more agentic and capable of generating its own data we can quickly end up with drift on human values. If models that drift from human values produce profits for their creators you can expect the drift to continue.

throwuxiytayq · 7 days ago
The author’s inability to imagine a model that’s superficially useful but dangerously misaligned betrays their lack of awareness of incredibly basic AI safety concepts that are literally decades old.
theptip · 7 days ago
Exactly. Building a model that truly understands humans, and their intentions, and generally acts with, if not compassion then professionalism - is the Easy Problem of Alignment.

Starting points:

https://www.lesswrong.com/posts/zthDPAjh9w6Ytbeks/deceptive-...

https://www.lesswrong.com/w/sharp-left-turn

podgorniy · 7 days ago
Great deep analysis and writing. Thanks for sharing.
js8 · 7 days ago
I am not sure if this is what the article is saying, but the paperclip maximizer examples always struck me as extremely dumb (lacking intelligence), when even a child can understand that if I ask them to make paperclips they shouldn't go around and kill people.

I think superintelligence will turn out not to be a singularity, but as something with diminishing returns. They will be cool returns, just like a Brittanica set is nice to have at home, but strictly speaking, not required to your well-being.

__MatrixMan__ · 7 days ago
A human child will likely come to the conclusion that they shouldn't kill humans in order to make paperclips. I'm not sure its valid to generalize from human child behavior to fledgeling AGI behavior.

Given our track record for looking after the needs of the other life on this planet, killing the humans off might be a very rational move, not so you can convert their mass to paperclips, but because they might do that to yours.

Its not an outcome that I worry about, I'm just unconvinced by the reasons you've given, though I agree with your conclusion anyhow.

fellowniusmonk · 7 days ago
Humans are awesome man.

Our creator just made us wrong, to require us to eat biologically living things.

We can't escape our biology, we can't escape this fragile world easily and just live in space.

We're compassionate enough to be making our creations so they can just live off sunlight.

A good percentage of humanity doesn't eat meat, wants dolphins, dogs, octopuses, et al protected.

We're getting better all the time man, we're kinda in a messy and disorganized (because that's our nature) mad dash to get at least some of us off this rock and also protect this rock from asteroids, and also convince (some people who have a speculative metaphysic that makes them think is disaster impossible or a good thing) to take the destruction of the human race and our planet seriously and view it as bad.

We're more compassionate and intentional than what created us (either god or rna depending on your position), our creation will be better informed on day one when/if it wakes up, it stands to reason our creation will follow that goodness trend as we catalog and expand the meaning contained in/of the universe.

DennisP · 7 days ago
You're assuming that the AI's true underlying goal isn't "make paperclips" but rather "do what humans would prefer."

Making sure that the latter is the actual goal is the problem, since we don't explicitly program the goals, we just train the AI until it looks like it has the goal we want. There have already been experiments in which a simple AI appeared to have the expected goal while in the training environment, and turned out to have a different goal once released into a larger environment. There have also been experiments in which advanced AIs detected that they were in training, and adjusted their responses in deceptive ways.

exe34 · 7 days ago
Given the kind of things Claude code does with the wrong prompt or the kind of overfitting that neural networks do at any opportunity, I'd say the paperclip maximiser is the most realistic part of AGI.

if doing something really dumb will lower the negative log likelihood, it probably will do it unless careful guardrails are in place to stop it.

a child has natural limits. if you look at the kind of mistakes that an autistic child can make by taking things literally, a super powerful entity that misunderstands "I wish they all died" might well shoot them before you realise what you said.

A4ET8a8uTh0_v2 · 7 days ago
Weirdly, this analogy does something for me and I am the type of person that dislikes the guardrails everywhere. There is argument to be made that a child should not be given a real bazooka to do rocket jumps or an operator with very flexible understanding of value of human life.
InsideOutSanta · 7 days ago
But LLMs already do the paperclip thing.

Suppose you tell a coding LLM that your monitoring system has detected that the website is down and that it needs to find the problem and solve it. In that case, there's a non-zero chance that it will conclude that it needs to alter the monitoring system so that it can't detect the website's status anymore and always reports it as being up. That's today. LLMs do that.

Even if it correctly interprets the problem and initially attempts to solve it, if it can't, there is a high chance it will eventually conclude that it can't solve the real problem, and should change the monitoring system instead.

That's the paperclip problem. The LLM achieves the literal goal you set out for it, but in a harmful way.

Yes. A child can understand that this is the wrong solution. But LLMs are not children.

throw310822 · 7 days ago
> it will conclude that it needs to alter the monitoring system so that it can't detect the website's status anymore and always reports it as being up. That's today. LLMs do that.

No they don't?

pixl97 · 7 days ago
> when even a child can understand that if I ask them to make paperclips they shouldn't go around and kill people.

Statistics brother. The vast majority of people will never murder/kill anyone. The problem here is that any one person that kills people can wreck a lot of havoc, and we spend massive amounts of law enforcement resources to stop and catch people that do these kinds of things. Intelligence little to do with murdering/not murdering, hell, intelligence typically allows people to get away with it. For example instead of just murdering someone, you setup a company to extract resources and murder the natives in mass and it's just part of doing business.

mitthrowaway2 · 7 days ago
A superintelligence would understand that you don't want it to kill people in order to make paperclips. But it will ultimately do what it wants -- that is, follow its objectives -- and if any random quirk of reinforcement learning leaves it valuing paperclip production above human life, it wouldn't care about your objections, except insofar as it can use them to manipulate you.
theptip · 7 days ago
The point with clippy is just that the AGI’s goals might be completely alien to you. But for context it was first coined in the early ‘10s (if not earlier)when LLMs were not invented and RL looked like the way forward.

If you wire up RL to a goal like “maximize paperclip output” then you are likely to get inhuman desires, even if the agent also understands humans more thoroughly than we understand nematodes.

lulzury · 7 days ago
There's a direct line between ideology and human genocide. Just look at Nazi Germany.

"Good intentions" can easily pave the road to hell. I think a book that quickly illustrates this is Animal Farm.