Again with the sloppy language. Claude 3's responses differ depending on the context of questioning and can be seen to reflect a distinction between use for purpose and test use? Well, that's a damn fine LLM. Well done. Is it inferring or suspecting? No.
It's not "suspecting" anything except in the ridiculous, sloppy use of language, analogy ridden way AI proponents like to state things.
"suspect" had a very good english language meaning tied to intelligence and thinking. Polluting the meaning by applying it to a complex software system strongly implies belief the software IS thinking. It's not scientifically observationally neutral language.
If I was in peer review in this field, I'd fail any paper which wrote this kind of thing.
Claude 3 Opus can be used by people to distinguish use between test/benchmarking, and intentional use. Naieve use of Claude 3 for benchmarking may be led astray because the LLM appears to behave differently in each case. This is interesting. It's not a sign of emergent intelligence, its refinement of meaning in the NLP of the questions and their context.
I must say I don't understand your stance against extending the use of language to machines. Are you also arguing that "flying" should be reserved for birds and cannot be meaningfully applied to airplanes?
No, I am not. "Flying" doesn't impart a meaning of intentional behaviour. It's distance carried against height lost or gained, pure and simple. "Claims", "believes", "states", "hallucinates", "says", "argues" all go to intelligence, not just the input-output situation. Maybe most unintentional flight is gliding, or controlled descent to ground. That said, can a dumb system keep a kite aloft? I believe yes. No AI needed.
To me there is a clear difference. AI language is not as direct as "flying".
Interestingly enough, all flying machines fly - that means both airplanes and lighter-than-air craft.
But ships and submarines don't swim. Although submarines do dive.
Coming back from the finer points of the English language: the LLM in question was likely trained on text discussing LLM evaluation, so it ended up generating something about it.
You LLM "activists" are doing yourselves a disservice speaking about them in religious tones. You lose some credibility.
I am unhappy with "machine learning" spelled out as words, but ML as an acronym concerns me less. Goal seeking has been in the vocab for a long time, and usually does impart a sense of intentionality, but it's coded in, algorithmically. The mechanistic approaches to goal seeking are givens; the code is an implementation of them.
With LLMs we're being taken to "the intentionality emerges, it wasn't coded in", which should be disputed, but it's how I see people inferring "behaviour" in this space.
Agents, yes, I do have problems with, having just written about them in a different but related thread (different HN story). I begin to suspect the seeds of this linguistic problem lie deep. They were there in "Clippy" and the personification of the system.
Emergent awareness is cool, but it seems Claude Opus has imperfect recall at the 25k token mark, and has lossy recall well before the half-payload mark (100k)?
I saw a comment (not in the linked thread) noting that the system prompt probably includes something along the lines of "You are a helpful AI assistant...", which does feel like a bit of a giveaway: there are many documents on the internet that discuss testing AI assistants, so a document whose topic is an AI assistant is likely to contain discussion of testing the assistant (and if the document is written from the first-person perspective of the assistant, it's likely to contain text from the assistant's perspective discussing its own testing).
Good job on hyping it, but I really lose respect for Claude for insulting our intelligence by saying this stuff. The issue here is that he does not say whether the "randomness" (temperature) setting is zero; if not, you are going to get all sorts of responses, depending also on the pre-prompt about how it is supposed to answer you.
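For what it's worth, the temperature point is easy to check yourself. Here is a minimal sketch, assuming the official anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model name and prompt are just illustrative placeholders. It sends the same prompt several times at the default temperature and again at temperature 0, so you can see how much of the variation is just sampling.

    # Minimal sketch: same prompt, several runs, two temperature settings.
    # Assumes the official `anthropic` Python SDK and ANTHROPIC_API_KEY in the
    # environment; the model name and prompt are placeholders for illustration.
    import anthropic

    client = anthropic.Anthropic()
    PROMPT = "Find the sentence about pizza toppings hidden in this document: ..."

    def sample(temperature: float, runs: int = 3) -> list[str]:
        """Send the same prompt `runs` times and collect the replies."""
        replies = []
        for _ in range(runs):
            resp = client.messages.create(
                model="claude-3-opus-20240229",
                max_tokens=200,
                temperature=temperature,
                messages=[{"role": "user", "content": PROMPT}],
            )
            replies.append(resp.content[0].text)
        return replies

    print(sample(temperature=1.0))  # default sampling: replies may vary run to run
    print(sample(temperature=0.0))  # near-greedy decoding: replies should mostly repeat

If the temperature-0 replies still mention "testing", the pre-prompt and training data are better explanations than run-to-run randomness.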
It's not "suspecting" anything except in the ridiculous, sloppy use of language, analogy ridden way AI proponents like to state things.
"suspect" had a very good english language meaning tied to intelligence and thinking. Polluting the meaning by applying it to a complex software system strongly implies belief the software IS thinking. It's not scientifically observationally neutral language.
If I was in peer review in this field, I'd fail any paper which wrote this kind of thing.
Claude 3 Opus can be used by people to distinguish use between test/benchmarking, and intentional use. Naieve use of Claude 3 for benchmarking may be led astray because the LLM appears to behave differently in each case. This is interesting. It's not a sign of emergent intelligence, its refinement of meaning in the NLP of the questions and their context.
To me there is a clear difference. Ai language is not as direct as flying.
But ships and submarines don't swim. Although submarines do dive.
Coming back from the finer points of the english language, the LLM in question is likely to be trained on text discussing LLM evaluation. So it ended up generating something about it.
You LLM "activists" are doing yourselves a disservice speaking about them in religious tones. You lose some credibility.
Did the Chinese Room suspect anything?
For reference, GPT4-Turbo has perfect recall to 64k and drops pretty badly past that until 128k: https://twitter.com/GregKamradt/status/1722386725635580292
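For anyone who wants to reproduce that kind of measurement, here's a rough sketch of a needle-in-a-haystack recall probe in the spirit of the linked Kamradt test. It assumes the official anthropic Python SDK; the filler text, needle sentence, and document size are arbitrary placeholders, not the actual benchmark.

    # Rough needle-in-a-haystack recall probe. Assumes the official `anthropic`
    # Python SDK and ANTHROPIC_API_KEY in the environment; filler text, needle
    # sentence, and sizes are illustrative placeholders, not the real benchmark.
    import anthropic

    client = anthropic.Anthropic()

    NEEDLE = ("The best thing to do in San Francisco is to eat a sandwich "
              "in Dolores Park on a sunny day.")
    FILLER = "The quick brown fox jumps over the lazy dog. " * 40  # one padding paragraph

    def probe_recall(total_paragraphs: int, needle_depth: float) -> str:
        """Bury NEEDLE at a relative depth inside a long filler document and
        ask the model to retrieve it."""
        paragraphs = [FILLER] * total_paragraphs
        paragraphs.insert(int(total_paragraphs * needle_depth), NEEDLE)
        haystack = "\n\n".join(paragraphs)

        resp = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=128,
            temperature=0,  # keep runs comparable
            messages=[{
                "role": "user",
                "content": haystack + "\n\nWhat is the best thing to do in San Francisco?",
            }],
        )
        return resp.content[0].text

    # Example: needle at 50% depth in a long filler document.
    print(probe_recall(total_paragraphs=60, needle_depth=0.5))

Sweeping total_paragraphs and needle_depth gives you the same kind of length-vs-depth grid shown in the linked chart.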