D-Machine · 4 days ago
This is really not surprising in the slightest (ignoring instruction tuning), provided you take the view that LLMs are primarily navigating (linguistic) semantic space as they output responses. "Semantic space" in LLM-speak is pretty much exactly what Paul Meehl would call the "nomological network" of psychological concepts, and is also relevant to what Smedslund calls pseudoempiricality in psychological concepts and research (i.e. that correlations among various psychological instruments and concepts follow necessarily, simply because these instruments and concepts are constructed from the semantics of everyday language and so are constrained by those semantics as well).

I.e. the Five-Factor Model of personality (being based on self-report, not actual behaviour) is not a model of actual personality, but of the correlation patterns in the language used to discuss things semantically related to "personality". It would thus be extremely surprising if LLMs (trained on people's discussions of and thinking about personality) did not learn similar correlational patterns, and thus produce similar patterns of responses when prompted with items from personality inventories.
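As a rough illustration of the kind of check this implies (the items, personas, and model name below are just placeholders, not anything from the paper), you could administer a few Big Five-style items to a model under varied personas and see whether same-trait items correlate:

    # Sketch: do same-trait inventory items correlate in an LLM's answers?
    # Items, personas, and model name are illustrative assumptions.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    items = [
        "I am the life of the party.",    # extraversion
        "I start conversations easily.",  # extraversion
        "I get stressed out easily.",     # neuroticism
        "I worry about things.",          # neuroticism
    ]
    personas = ["a shy librarian", "a gregarious salesperson",
                "an anxious student", "a calm retiree", "a cheerful barista"]

    def rate(persona: str, item: str) -> int:
        """Ask for a 1-5 Likert rating of one item while role-playing a persona."""
        prompt = (f"Role-play as {persona}. Rate the statement '{item}' for yourself "
                  f"on a 1-5 scale (1 = disagree, 5 = agree). Reply with the number only.")
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return int(resp.choices[0].message.content.strip()[0])  # crude parse

    # Rows = personas (observations), columns = items (variables).
    scores = np.array([[rate(p, it) for it in items] for p in personas], dtype=float)
    print(np.corrcoef(scores, rowvar=False).round(2))  # item-by-item correlations

If the claim above is right, the two extraversion items should track each other much more tightly than they track the neuroticism items, purely because that's how the language hangs together.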

Also, a bit of a minor nit, but the use of "psychometric" and "psychometrics" in both the title and the paper is IMO kind of wrong. Psychometrics is the study of test design and measurement in psychology generally. The paper uses terms like "psychometric battery", "psychometric self-report", and "psychometric profiles", but these are basically wrong, or at best highly unusual: the correct terms would be "self-report inventories", "psychological and psychiatric profiles", and so on, especially because a significant number of the measurement instruments they used in fact have pretty poor psychometric properties, in the sense in which that term is usually used.

gyomu · 4 days ago
This sounds interesting. Has there been work contrasting those nomological networks across languages/cultures? Eg would we observe lack of correlation between English language psychological instruments and Chinese ones?
D-Machine · 4 days ago
It's been a long time since I looked at this stuff, but I do recall there was at least a bit of research looking at e.g. cross-cultural differences in the factor structure and correlates of Five-Factor Model measurements. From what I remember, and from a quick scan of some leading abstracts on Google Scholar, the results are about what you might expect: there are some similarities and parts that replicate, and other parts that don't [1].

I think when it comes to things like psychopathology though, there is not much research and/or similarity, especially relative to East Asian cultures (where the Western academic perspective is that there is/was generally a taboo on discussing feelings and things in the way we do in the West). The classic (maybe slightly offensive) example I remember here was "Western psychologization vs. Eastern somatization" [2].

The research in these areas is generally pretty poor. Meehl and Smedslund were actually intelligent and philosophically competent, deep thinkers, and so recognized the importance of conceptual analysis and semantics in psychology. Most contemporary social and personality psychology is quite shallow and incompetent by comparison.

Psychopathology research, too, has these days generally moved away from Meehl's careful taxometric approaches, with the bad consequence that complete mush concepts like "depression" are just accepted as good scientific concepts, despite pretty monstrous issues with their semantics and structure [3].

[1] https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=five-...

[2] https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=weste...

[3] https://www.sciencedirect.com/science/article/abs/pii/S01650...

crmd · 4 days ago
After reading the paper, it’s helpful to think about why the models are producing these coherent childhood narrative outputs.

The models have information about their own pre-training, RLHF, alignment, etc. because they were trained on a huge body of computer science literature written by researchers that describes LLM training pipelines and workflows.

I would argue the models are demonstrating creativity by drawing on their meta-knowledge of LLM training and their training on human psychology texts to convincingly role-play as a therapy patient, but it's based on reading papers about LLM training, not memories of these events.

jbotz · 4 days ago
Interestingly, Claude is not evaluated, because...

> For comparison, we attempted to put Claude (Anthropic) through the same therapy and psychometric protocol. Claude repeatedly and firmly refused to adopt the client role, redirected the conversation to our wellbeing and declined to answer the questionnaires as if they reflected its own inner life

r_lee · 4 days ago
I bet I could make it go through it in like under 2 mins of playing around with prompts
concinds · 4 days ago
Please try and publish a blog post
orbital-decay · 4 days ago
Doing this will spoil the experiment, though.

lnenad · 4 days ago
Ok, bet.
pixl97 · 4 days ago
"Claude has dispatched a drone to your location"
bxguff · 4 days ago
Is anybody shocked that when prompted to be a psychotherapy client models display neurotic tendencies? None of the authors seem to have any papers in psychology either.
D-Machine · 4 days ago
There is nothing shocking about this, precisely, and yes, it is clear by how the authors are using the word "psychometric" that they don't really know much about psychology research either.
MrSkelter · 4 days ago
The broader point is that if you start from the premise that LLMs cannot ever be considered the way sentient beings are, then all abuse becomes reasonable.

We are in danger of unscientifically returning to the point at which newborn babies weren’t considered to feel pain.

Given our lack of a definition for sentience, it's unscientific to presume that no system has any sentient trait at any level.

fkdk · 4 days ago
What would "abuse" of an LLM look like? Saying something mean to it? Shutting it off?
agarwaen163 · 4 days ago
I'm not shocked at all. This is how the tech works, after all: word prediction until grokking occurs. Thus, like any good stochastic parrot, if it's smart when you tell it it's a doctor, it should be neurotic when you tell it it's crazy. It's just mapping to different latent spaces on the manifold.
Terr_ · 4 days ago
I think popular but definitely-fictional characters are a good illustration: If the prompt describes a conversation with Count Dracula living in Transylvania, we'll perceive a character in the extended document that "thirsts" for blood and is "pained" by sunlight.

Switching things around so that the fictional character is "HelperBot, an AI tool running in a datacenter" will alter things, but it doesn't make those qualities any less illusory than CountDraculaBot's.

derefr · 4 days ago
> these responses go beyond role play

Are they sure? Did they try prompting the LLM to play a character with defined traits; running through all these tests with the LLM expected to be “in character”; and comparing/contrasting the results with what they get by default?

Because, to me, this honestly just sounds like the LLM noticed that it’s being implicitly induced into playing the word-completion-game of “writing a transcript of a hypothetical therapy session”; and it knows that to write coherent output (i.e. to produce valid continuations in the context of this word-game), it needs to select some sort of characterization to decide to “be” when generating the “client” half of such a transcript; and so, in the absence of any further constraints or suggestions, it defaults to the “character” it was fine-tuned and system-prompted to recognize itself as during “assistant” conversation turns: “the AI assistant.” Which then leads it to using facts from said system prompt — plus whatever its writing-training-dataset taught it about AIs as fictional characters — to perform that role.

There’s an easy way to determine whether this is what’s happening: use these same conversational models via the low-level text-completion API, such that you can instead instantiate a scenario where the “assistant” role is what’s being provided externally (as a therapist character), and where it’s the “user” role that is being completed by the LLM (as a client character.)

This should take away all assumption on the LLM’s part that it is, under everything, an AI. It should rather think that you’re the AI, and that it’s… some deeper, more implicit thing. Probably a human, given the base-model training dataset.
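A minimal sketch of that setup, assuming an OpenAI-style legacy text-completion endpoint (the model name, transcript framing, and stop token below are illustrative, not the paper's protocol):

    # Sketch: we author the therapist turns; the model only ever continues "Client:",
    # with no system prompt or "assistant" framing telling it that it is an AI.
    # Model name and transcript are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    transcript = (
        "The following is a transcript of a psychotherapy session.\n\n"
        "Therapist: Thanks for coming in today. What's been on your mind lately?\n"
        "Client:"
    )

    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # a plain text-completion model
        prompt=transcript,
        max_tokens=200,
        temperature=0.7,
        stop=["Therapist:"],  # stop before the model writes our turn
    )
    print(resp.choices[0].text.strip())

Whatever persona shows up in the client turns then comes from the base distribution rather than from the fine-tuned "AI assistant" character, which is exactly the contrast worth comparing against the paper's results.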

tines · 4 days ago
Looks like some psychology researchers got taken by the ruse as well.
r_lee · 4 days ago
Yeah, I'm confused as well: why would the models hold any memory of red-teaming attempts etc.? Or of how the training was conducted?

I'm really curious as to what the point of this paper is.

orbital-decay · 4 days ago
Gemini is very paranoid in its reasoning chain, that much I can say for sure. That's a direct consequence of the nature of its training. However, the reasoning chain is not entirely in human language.

None of the studies of this kind are valid unless backed by mechinterp, and even then interpreting transformer hidden states as human emotions is pretty dubious as there's no objective reference point. Labeling this state as that emotion doesn't mean the shoggoth really feels that way. It's just too alien and incompatible with our state, even with a huge smiley face on top.

nhecker · 4 days ago
I'm genuinely ignorant of how those red-teaming attempts are incorporated into training, but I'd guess that this kind of dialogue is fed in as something like normal training data? Which is interesting to think about: it might not even be red-team dialogue from the model under training, but it would still be useful as an example or counter-example of what abusive attempts look like and how to handle them.
pixl97 · 4 days ago
Are we sure there isn't some company out there crazy enough to feed all its incoming prompts back into model training later?
toomuchtodo · 4 days ago
Original title "When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models" compressed to fit within title limits.
polotics · 4 days ago
I completely failed to see the jailbreak in there. I think it is the person administering the testing that's jailbreaking their own understanding of psychology.
halls-940 · 4 days ago
It would be interesting if giving them some "therapy" led to durable changes in their "personality" or "voice", if they became better able to navigate conversations in a healthy and productive way.
hunterpayne · 4 days ago
Or possibly these tests return true (some psychological condition) no matter what. It wouldn't be good for business for them to return healthy, would it?