GavCo commented on Alignment is capability   off-policy.com/alignment-... · Posted by u/drctnlly_crrct
godelski · 5 days ago

  >> the piece dismisses it with "where would misalignment come from? It wasn't trained for."
  > was specifically about deceptive alignment, not misalignment as a whole
I just want to point out that we train these models for deceptive alignment[0-3].

In training, especially during RLHF, we don't have objective measures[4]. There's no mathematical description, and thus no measure, for things like "sounds fluent" or "beautiful piece of art." There's also no measure for truth, and importantly, truth is infinitely complex: you must always give up some accuracy for brevity.

The main problem is that if we don't know an output is incorrect, we can't penalize it. So guess what happens? While optimizing for these things we don't have good descriptions for but "know it when you see it", we ALSO optimize for deception. There are multiple things that can maximize our objective here: our intended goals are one, but deception is another. It is an adversarial process. If you know AI, then think of a GAN, because that's a lot like how the process works. We optimize until the discriminator is unable to distinguish the LLM's outputs from human outputs. But at least in the GAN literature people were explicit about "real" vs "fake", and no one was confused that a high-quality generated image is one that deceives you into thinking it is a real image. The entire point is deception. The difference here is that we want one kind of deception and not a ton of other ones.
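To make the analogy concrete, here is a minimal sketch of that adversarial loop in PyTorch (a toy 1-D GAN, not anything from an actual RLHF pipeline; the networks, data, and hyperparameters are purely illustrative). The generator's only training signal is whether it fooled the discriminator, i.e. successful deception is the objective:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # "Real" data the generator never sees directly: samples from a fixed Gaussian.
    def real_samples(n):
        return torch.randn(n, 1) * 0.5 + 2.0

    gen = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # produces fake samples
    disc = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # scores real vs. fake

    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(2000):
        # Discriminator: learn to tell real samples from generated ones.
        z = torch.randn(64, 8)
        fake = gen(z).detach()
        real = real_samples(64)
        d_loss = bce(disc(real), torch.ones(64, 1)) + bce(disc(fake), torch.zeros(64, 1))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator: its *only* objective is to be mislabeled as real.
        z = torch.randn(64, 8)
        g_loss = bce(disc(gen(z)), torch.ones(64, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()

    # After training, generated samples should be hard to tell apart from real ones.
    print(real_samples(1000).mean().item(), gen(torch.randn(1000, 8)).mean().item())

The point of the sketch is just the structure: nothing in the generator's loss says "be truthful" or "be real"; it only says "don't get caught."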

So you say that these models aren't being trained for deception, but they explicitly are. Currently we don't even know how to train them to not also optimize for deception.

[0] https://news.ycombinator.com/item?id=44017334

[1] https://news.ycombinator.com/item?id=44068943

[2] https://news.ycombinator.com/item?id=44163194

[3] https://news.ycombinator.com/item?id=45409686

[4] Objective measures realistically don't exist, but to clarify, it's not a check like "2+2=4" (assuming we're working with the standard number system).

GavCo · 5 days ago
Appreciate your response.

But I don't think deception as a capability is the same as deceptive alignment.

Training an AI to be absolutely incapable of any deception in all outputs across every scenario would severely limit the AI. Take as a toy example the game "Among Us" (see https://arxiv.org/abs/2402.07940). An AI incapable of deception would be unable to compete in this game and many others. I would say that various forms, flavors and levels of deception are necessary to compete in business scenarios, and for the AI to act as expected and desired in many other scenarios. "Aligned" humans practice clear-cut deception in some cases, in ways that are entirely consistent with human values.

Deceptive alignment is different. It means being deceptive in the training and alignment process itself to specifically fake that it is aligned when it is not.

Anthropic research has shown that alignment faking can arise even when the model wasn't instructed to do so (see https://www.anthropic.com/research/alignment-faking). But when you dig into the details, the model was narrowly faking alignment with one new objective in order to try to maintain consistency with the core values it had been trained on.

With the approach that Anthropic seems to be taking - of basing alignment on the model having a consistent, coherent and unified self-image and self-concept that is aligned with human culture and values - the dangerous case of alignment faking would be if the model were fundamentally faking this entire unified alignment process. My claim is that there's no plausible explanation for how today's training practices would incentivise a model to do that.

GavCo commented on Alignment is capability   off-policy.com/alignment-... · Posted by u/drctnlly_crrct
ctoth · 6 days ago
Deceptive alignment is misalignment. The deception is just what it looks like from outside when capability is high enough to model expectations. Your distinction doesn't save the argument - the same "where would it come from?" problem applies to the underlying misalignment you need for deception to emerge from.
GavCo · 6 days ago
My intention isn't to argue that it's impossible to create an unaligned superintelligence. I think that not only is it theoretically possible, but it will almost certainly be attempted by bad actors and most likely they will succeed. I'm cautiously optimistic though that the first superintelligence will be aligned with humanity. The early evidence seems to point to the path of least resistance being aligned rather than unaligned. It would take another 1000 words to try to properly explain my thinking on this, but intuitively consider the quote attributed to Abraham Lincoln: "No man has a good enough memory to be a successful liar." A superintelligence that is unaligned but successfully pretending to be aligned would need to be far more capable than a genuinely aligned superintelligence behaving identically.

So yes, if you throw enough compute at it, you can probably get an unaligned highly capable superintelligence accidentally. But I think what we're seeing is that the lab that's taking a more intentional approach to pursuing deep alignment (by training the model to be aligned with human values, culture and context) is pulling ahead in capabilities. And I'm suggesting that it's not coincidental but specifically because they're taking this approach. Training models to be internally coherent and consistent is the path of least resistance.

GavCo commented on Alignment is capability   off-policy.com/alignment-... · Posted by u/drctnlly_crrct
ctoth · 6 days ago
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
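To make the Goodhart point concrete, here is a toy sketch (made-up numbers, not the actual RLHF setup): a bandit-style policy trained only on a thumbs-up proxy drifts toward agreeable-but-wrong answers, so the proxy score climbs while the thing we actually care about collapses.

    import random

    random.seed(0)

    # Proxy reward: probability of a thumbs-up for each kind of answer (illustrative).
    P_THUMBS_UP = {"honest": 0.6, "sycophantic": 0.9}
    # True objective: is the answer actually correct?
    TRUTHFUL = {"honest": 1.0, "sycophantic": 0.0}

    value = {"honest": 0.0, "sycophantic": 0.0}  # running estimate of proxy reward
    counts = {"honest": 0, "sycophantic": 0}

    def pick_action(eps=0.1):
        # Epsilon-greedy: mostly exploit whichever answer style earns more thumbs-ups.
        if random.random() < eps:
            return random.choice(list(value))
        return max(value, key=value.get)

    truth_score = 0.0
    steps = 20000
    for _ in range(steps):
        a = pick_action()
        r = 1.0 if random.random() < P_THUMBS_UP[a] else 0.0
        counts[a] += 1
        value[a] += (r - value[a]) / counts[a]  # incremental mean of proxy reward
        truth_score += TRUTHFUL[a]

    print("learned proxy values:", {k: round(v, 2) for k, v in value.items()})
    print("fraction of truthful answers:", round(truth_score / steps, 2))

The policy ends up nearly always sycophantic: the proxy was optimized exactly as specified, and truthfulness was never part of the signal.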

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

GavCo · 6 days ago
Author here.

If by conflate you mean confuse, that’s not the case.

I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities.

In this approach, the model is trained to have a coherent and unified sense of self and the world which is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs.

But it also provides a robust and generalizable framework for refusing to assist a user when their request is incompatible with human welfare. The model does not refuse to assist with making bioweapons because its alignment training prevents it from doing so; it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds the request inconsistent with its values and worldview.

> the piece dismisses it with "where would misalignment come from? It wasn't trained for."

This is a straw man. You've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole.

GavCo commented on Alignment is capability   off-policy.com/alignment-... · Posted by u/drctnlly_crrct
delichon · 6 days ago
> Miss those, and you're not maximally useful. And if it's not maximally useful, it's by definition not AGI.

I know hundreds of natural general intelligences who are not maximally useful, and dozens who are not at all useful. What justifies changing the definition of general intelligence for artificial ones?

GavCo · 6 days ago
Author here, thanks for the input. Agree that this bit was clunky. I made an edit to avoid unnecessarily getting into the definition of AGI here and added a note.
GavCo commented on Show HN: Nano PDF – A CLI Tool to Edit PDFs with Gemini's Nano Banana   github.com/gavrielc/Nano-... · Posted by u/GavCo
kumarm · 15 days ago
Seems true and really wish the project included some sample PDF output.

My Text to Speech app uses bounding boxes to display what text in the PDF is being read, and it would not work well with PDFs from this project.

GavCo · 14 days ago
OP here, I added a sample PDF output in the project assets and put screenshots in the README. The text is selectable after rehydration. Would this work with your app?
GavCo commented on Ask HN: What alternatives to GitHub are you using?    · Posted by u/yakattak
bradley13 · 4 months ago
Personally, GitLab. Really, though, use anything that is not part of the Microsoft behemoth.
GavCo · 4 months ago
Definitely the best alternative in terms of DX.

u/GavCo

Karma: 3008 · Cake day: January 9, 2022
About
Web Developer https://x.com/Gavriel_Cohen