>For example, our systems sometimes failed to flag violating content when the user asked Claude to translate from one language to another. Clio, however, spotted these conversations.
Why do they even consider translation of existing content "harmful", policy-wise? The content already exists. No machine translator I know would refuse translating something based on the content. That makes their language models unpredictable in one of their major use cases.
I'm adjacent to the world of sign language translators in the US. They are legally obligated to translate EVERYTHING, regardless of whether it's legal or not, and they also have to maintain client secrecy. I personally know some who have facilitated drug deals and another who has facilitated an illegal discussion about Trump.
We decided as a society that we're not going to use translation services to catch citizens in crime. This AI situation is so much milder--we're talking about censoring stuff that is "harmful", not illegal. The content is not being published by Anthropic--it's up to the users to publish it or not.
We seriously need regulations around AI "safety" because of the enormous influence these systems have over all human discourse.
Presumably human interpreters aren't prone to hallucinating things when providing their services, right? That's probably one of the key differentiators.
I don't think I would describe a system in which a human ends up looking at your conversation if the algorithm thinks you're suspicious as "privacy-preserving". What is the non-privacy-preserving version of this system? A human browsing through every conversation?
I find this sort of thing cloying because all it does is show me they keep copies of my chats and access them at will.
I hate playing that card. I worked at Google, and for the first couple years, I was very earnest. Someone smart here pointed out to me, sure, maybe everything is behind 3 locks and keys and encrypted and audit logged, but what about the next guys?
Sort of stuck with me. I can't find a reason I'd ever build anything that did this, if only to make the world marginally easier to live in.
I thought this was true, honestly, up until I read it just now. User data is explicitly one of the three training sources[^1], with forced opt-ins like "feedback"[^2] letting them store & train on it for 10 years[^3], and tripping the safety classifier[^2] letting them store & train on it for 7 years.[^3]
"Specifically, we train our models using data from three sources:...[3.] Data that our users or crowd workers provide"..."
[^2]
For all products, we retain inputs and outputs for up to 2 years and trust and safety classification scores for up to 7 years if you submit a prompt that is flagged by our trust and safety classifiers as violating our UP.
Where you have opted in or provided some affirmative consent (e.g., submitting feedback or bug reports), we retain data associated with that submission for 10 years.
[^3]
"We will not use your Inputs or Outputs to train our models, unless: (1) your conversations are flagged for Trust & Safety review (in which case we may use or analyze them to improve our ability to detect and enforce our Usage Policy, including training models for use by our Trust and Safety team, consistent with Anthropic’s safety mission), or (2) you’ve explicitly reported the materials to us (for example via our feedback mechanisms), or (3) by otherwise explicitly opting in to training."
This is a non-starter for every company I work with as a B2B SaaS dealing with sensitive documents. This policy doesn’t make any sense. OpenAI is guilty of the same. Just freaking turn this off for business customers. They’re leaving money on the table by effectively removing themselves from a huge chunk of the market that can’t agree to this single clause.
Given the apparent technical difficulties involved in getting insight into a model’s underlying data, how would anyone ever hold them to account if they violated this policy? Real question, not a gotcha, it just seems like if corporate-backed IP holders are unable to prosecute claims against AI, it seems even more unlikely that individual paying customers would have greater success.
Even if this were true (and not hollowed out by various exceptions in Anthropic’s T&C), I would not call it “extremely strict”. How about zero retention?
They have to, the major AI companies are ads companies. Their profits demand that we accept their attempts to normalize the Spyware that networked AI represents.
Yep. More generally, I have a lot of distaste that big tech are the ones driving the privacy conversation. Why would you put the guys with such blatant ulterior motives behind the wheel? But, this seems to be the US way. Customer choice via market share above everything, always, even if that choice gradually erodes the customer's autonomy.
Not that anywhere else is brave enough to try otherwise, for fear of falling too far behind US markets.
Disclaimer: I could be much more informed on the relevant policies which enable this, but I can see the direction we're heading in... and I don't like it.
There’s absolutely nothing privacy-preserving about their system, and adding additional ways to extract and process user data doesn’t add any privacy; it weakens it further.
Until they start using NVIDIA confidential computing and doing end-to-end encryption from the client to the GPU like we are, it’s just a larp. Sorry, a few words in a privacy policy don’t cut it.
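For the curious, here is roughly what that flow looks like in miniature. This is a conceptual sketch only, not NVIDIA's or any provider's actual API: `verify_attestation` is a hypothetical stand-in for a real attestation verifier, and the rest is plain hybrid encryption so that only a key bound to an attested enclave can recover the prompt.

```python
# Conceptual sketch of "encrypt the prompt so only an attested GPU enclave
# can read it". Hypothetical names; not any vendor's real confidential
# compute API. Uses the `cryptography` package for the primitives.
import os
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import x25519
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF


def verify_attestation(report: bytes) -> x25519.X25519PublicKey:
    """Hypothetical: validate the enclave's attestation report against the
    hardware vendor's root of trust and return the public key bound to it."""
    raise NotImplementedError("stand-in for a real attestation verifier")


def encrypt_prompt(prompt: str, enclave_pub: x25519.X25519PublicKey):
    # Ephemeral ECDH so only the attested enclave can derive the session key;
    # the provider's web frontend only ever sees ciphertext.
    eph_priv = x25519.X25519PrivateKey.generate()
    shared = eph_priv.exchange(enclave_pub)
    key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=b"client-to-gpu-prompt").derive(shared)
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, prompt.encode(), None)
    eph_pub = eph_priv.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw)
    # Ship (eph_pub, nonce, ciphertext); only the enclave can decrypt.
    return eph_pub, nonce, ciphertext
```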
Of course this doesn't need to be used on "AI use" as they frame it. So far, your activity was a line in the logs somewhere, now someone is actually looking at you with three eyes, at all times.
A lot of negativity in these comments. I find this analysis of claude.ai use cases helpful — many people, myself included, are trying to figure out what real people find LLMs useful for, and now we know a little more about that.
Coding use cases making up 23.8% of usage indicates that we're still quite early on the adoption curve. I wonder if ChatGPT's numbers also skew this heavily towards devs, who make up only ~2.5% of the [American] workforce.
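Rough back-of-the-envelope on how skewed that is, treating the two percentages as comparable populations (a loose assumption, since usage share and headcount share aren't the same thing):

```python
coding_share_of_usage = 0.238    # Anthropic's reported coding share of Claude usage
dev_share_of_workforce = 0.025   # ~2.5% of the US workforce, per the figure above

# How overrepresented developers are in usage relative to their workforce share.
print(f"~{coding_share_of_usage / dev_share_of_workforce:.1f}x")  # -> ~9.5x
```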
While the highest categories are vague (web development vs. cloud development), the specific clusters shown in the language-specific examples expose nation-specific collective activity. While anonymized, it's still exposing a lot of this collection of private chats.
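To make the anonymization concern concrete: as I understand the write-up, the privacy layer amounts, roughly, to only surfacing clusters backed by enough distinct accounts. A toy sketch of that idea follows (names, data, and the threshold are made up, not Anthropic's actual pipeline); it also shows why such a threshold doesn't fully address the point above, since a cluster can clear it and still describe what one small community is collectively doing.

```python
from collections import defaultdict

# Toy illustration of cluster-level aggregation with a minimum-accounts
# threshold. All names, records, and the threshold are made up; this is
# not Anthropic's actual pipeline.
MIN_UNIQUE_ACCOUNTS = 3

def publishable_clusters(assignments, min_accounts=MIN_UNIQUE_ACCOUNTS):
    """assignments: iterable of (account_id, cluster_label) pairs produced
    upstream (e.g. by summarizing and clustering conversations).
    Returns only clusters backed by at least `min_accounts` distinct accounts."""
    accounts = defaultdict(set)
    for account_id, label in assignments:
        accounts[label].add(account_id)
    return {label: len(ids) for label, ids in accounts.items()
            if len(ids) >= min_accounts}

demo = [("a1", "web development"), ("a2", "web development"),
        ("a3", "web development"), ("a4", "niche regional topic")]
print(publishable_clusters(demo))
# {'web development': 3} -- the small cluster is suppressed, but a surfaced
# cluster can still reveal what one language community is collectively doing.
```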
Good that they tell, but they did it before telling. I really hope they delete the detailed chats afterwards.
They should, and probably won't, delete the first layer of aggregation.
Palantir even announced this officially; a partnership with Anthropic and AWS:
https://www.businesswire.com/news/home/20241107699415/en/Ant...