ilamont · 2 years ago
Not just personal information. Blogs, commercial content, news articles and more. Check out the Allen Institute for Artificial Intelligence's C4 dataset to see if anything you wrote was ingested:

https://c4-search.apps.allenai.org

Love the disclaimer: "The dataset is released under the terms of ODC-BY. By using this, you are also bound by the Common Crawl Terms of Use in respect of the content contained in the dataset."

jprete · 2 years ago
The article weasels out at the end by claiming that companies “may be unable to comply” with requirements to delete personal data. It's easy to comply: if there's no other choice, then you delete the model that was trained in flagrant violation of the law, along with all backups and derivative data.
ivalm · 2 years ago
For most companies, “deleting the model” is equivalent to dissolving the company, so that is equivalent to not being able to comply. More realistically, what they would need to do is exit the market of any country that has such stupid laws.
dylan604 · 2 years ago
>For most companies “deleting the model” is equivalent to dissolving the company

I'm okay with that. It's sad that you're not. If you're willing to start a business on such shady foundations, there's a really good chance your business will continue to make shady decisions in the future. It's better to find and remove the cancer early.

soulofmischief · 2 years ago
Which law is that?
john-radio · 2 years ago
The post that you're responding to doesn't make any claim about the law; it expresses that the defense that an AI company might be "unable to comply" with a command to remove user data is not an honest one.
TrueDuality · 2 years ago
There is currently no law in the US or EU that I'm aware of that considers models trained on PII or copyrighted data to include that data. This is the core issue everyone is discussing about copyright around models and whether publicly scraped data is fair use for these purposes.

Only once this is decided, likely by a long series of trials in high courts or by passage of law through the appropriate legislative bodies (and then likely after many challenges), will GDPR and CCPA be able to apply to these language models.

So no, this is not currently a violation of any law.

batch12 · 2 years ago
I was daydreaming about other possible ways to comply, but I don't know if they'd really work in practice. Ideas like:

- Supply copyrighted keywords and approved responses to try to retrain the model.

- Train on the copyrighted content again with inverse weighting. I imagine this one wouldn't really work well while still keeping the model's performance up...
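
A rough sketch of what that "inverse weighting" idea could look like as plain gradient ascent on the content to be forgotten, assuming a PyTorch setup where the model returns a HuggingFace-style .loss (all names here are hypothetical, and naive unlearning like this does tend to tank overall performance, matching the skepticism above):

    def unlearning_step(model, optimizer, forget_batch, retain_batch, forget_weight=0.1):
        # Hypothetical sketch: model is assumed to return an object with a
        # .loss attribute, as HuggingFace transformers models do.
        optimizer.zero_grad()
        # Ordinary loss on data the model should keep handling well.
        retain_loss = model(**retain_batch).loss
        # Negated, down-weighted loss on the content to forget: minimizing
        # this term maximizes the original loss, pushing the model away
        # from reproducing that content.
        forget_loss = -forget_weight * model(**forget_batch).loss
        (retain_loss + forget_loss).backward()
        optimizer.step()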

valianteffort · 2 years ago
I don't get it: you put that information on the internet, you have no expectation of privacy. But now maybe you've learned a lesson, and you won't publicly share things you don't want people to see?
jjav · 2 years ago
> I don't get it: you put that information on the internet, you have no expectation of privacy.

While that is somewhat true, there are multiple angles to it. The big one is expectations. It used to be that if you shared something on an obscure forum with about 100 readers, most of those people expected that only those 100 would see their posts. And it was true, so it reinforced the expectation for nontechnical people, which is basically everyone within a rounding error.

Then everything gets indexed and fed to AI, and suddenly billions of people have easy access to what they felt was very limited in distribution.

You could file this under "People don't threat model correctly", but also (more importantly) under the fact that technology should consider human nature before breaking expectations.

visarga · 2 years ago
You should consider the size of the training set as well. A blog post in a 30T-token dataset like RedPajama has less impact than one in a fine-tuning dataset of 1,000 examples. The gradients from all tokens are added up; they stack on top of each other, and any single document's influence is diluted in larger datasets.
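
As a back-of-the-envelope illustration of that dilution (a toy calculation, not how training influence is actually measured): with a mean-reduced loss, the update direction is the average of per-example gradients, so one document's pull on the parameters shrinks roughly as 1/n:

    def relative_influence(grad_contribution, n_examples):
        # The gradient of a mean loss is the average of per-example
        # gradients, so each example contributes with weight 1/n.
        return grad_contribution / n_examples

    for n in (1_000, 1_000_000, 1_000_000_000):
        print(f"dataset size {n:>13,}: one post's share of an update ~ {relative_influence(1.0, n):.0e}")

The same post in a 1,000-example fine-tuning set carries roughly a million times more weight per update than in a billion-document corpus.
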
fsckboy · 2 years ago
Do you ever save things you see on the internet that have some "scandal" merit? Say, for example, you don't like Donald Trump and you come across a variety of material that's embarrassing to him. Or your school bully, or your gf's ex, or Macron in France, Trudeau in Canada, etc. Would you feel that other people should get to tell you to delete it because you're not allowed to use any aids to help you remember or prove why they are bad people (in your estimation)? Freedom to remember is part and parcel of free speech.
muspimerol · 2 years ago
Did you read the article? There are cases where the data was explicitly not publicly shared. Why would this journalist have no expectation of privacy? https://arstechnica.com/information-technology/2022/09/artis...

Even for publicly shared data, there is legislation in many parts of the world (notably the entire EU) that enforces an individual's right to have their data deleted.

valianteffort · 2 years ago
Did you read what I said? Anything posted on the internet is public. Just because you think it's private does not make it so.

To that end, you should consider any file on your computer to be public as well if it's connected to the internet, as there are countless ways that data could be exfiltrated.

thefz · 2 years ago
This is similar to saying that if girls don't want to be molested, they should not wear a miniskirt in public.

One may not have been aware of the future existence of AI when writing on a public platform.

Or one may simply want his blog/thoughts to be intended for human consumption, not to train the latest commercial product, which will reap from his content without sharing any profit.

scrollaway · 2 years ago
How is it similar to that strawman?

There is no expectation of privacy in public. That is pretty much the definition. It’s not victim blaming.

fsckboy · 2 years ago
> saying that if girls don't want to be molested, they should not wear a mini skirt in public

Such a ridiculous example if you are trying to convince me that you care about these girls' safety, as opposed to some abstract egg-headed right.

stingraycharles · 2 years ago
The “right to be forgotten” is an explicit right in the EU, which I think applies here.
qweqwe14 · 2 years ago
I'm not saying that this right is a bad thing, but the reality is that it's rarely enforceable. As soon as you put something on the internet, you have every "right" to expect that it's there forever.

No amount of legislation can outweigh the technical reality.

valianteffort · 2 years ago
Americans are not bound by EU laws. We appreciate you forcing the iPhone to USB-C, but you can keep the rest of your authoritarian laws.
constantly · 2 years ago
EU rights do not apply in the US.
exabrial · 2 years ago
> you put that information on the internet, you have no expectation of privacy

That was the expectation until Zuckerberg came along. People were putting tons of stuff on BBSs, intranets, and elsewhere 20+ years beforehand because the idea of "consent" existed back then: using someone's information for a purpose they didn't intend was (and still is) wrong.

FeepingCreature · 2 years ago
Using someone's information for a purpose they didn't intend is wrong? What? This just seems obviously mistaken. I don't agree with that at all. I'm not even sure how to argue about that.

Lots of art, science and technology can be considered as "using someone's information for a purpose they didn't intend." It is extremely normal for people to find new uses for things; information is not exempt from this.

caturopath · 2 years ago
> That was the expectation until Zuckerberg came along.

This isn't my recollection of the pre-Facebook internet.

icelancer · 2 years ago
> That was the expectation until Zuckerberg came along.

The early days of the Internet were full of "information should be free" hackers, many of whom roam this very forum in their 40s, 50s, and 60s.

Well, information is free now. And this is what it looks like - consumed by LLMs. We got what we asked for.

vasco · 2 years ago
This seems like a fantasy. Anything could end up on 4chan with a photoshopped penis on it, and that was true long before Zuck came along.
cratermoon · 2 years ago
People didn't put their private medical records on the open internet. They may have uploaded them to a service they thought was respecting privacy, but one that was either selling their data without disclosure or just straight-up incompetent and uploaded private-user-data.zip to a public Dropbox share.
ClumsyPilot · 2 years ago
> you put that information on the internet, you have no expectation of privacy.

But if Disney puts information on the internet, and you misuse it, lawyers will chase you to the ends of the earth.

We both have the same rights and copyright, but somehow it's one rule for them and another rule for me.

Sparkyte · 2 years ago
GDPR and many data rights exist today, and they still apply to AI.
thayne · 2 years ago
> you put that information on the internet

That isn't necessarily true. Someone else might have put the information on the internet. That could be someone else uploading a photo of you, or records of your home purchase in a city's public records, or an obituary or wedding announcement that listed you as family, etc., etc.

godelski · 2 years ago
This is wildly disingenuous. I don't even know where to begin.

Services are advertised as private, so why wouldn't people act like they are private? Oh, you're not tech literate and so don't know something "obvious"? Guess it's your fault for listening to your financial advisor and losing all your money too. We live in a specialized world; you don't want to go down this route, and it's an unreasonable one, because we can't have infinite knowledge and infinite time to gain it. There's absolutely no way you know tech, medicine, economics, law, science, politics, or any other field in high detail. You don't even know one of these, but rather a small section of one.

Yes, they may be "public", but you can have an expectation of privacy in public. Also, your lack of expectation in public is not infinite. The lack of privacy in public is an astonishingly new phenomenon, realistically only from the last 10-15 years, as cellphones have become prolific. Things were way different just in the early 2000s, when there weren't nearly as many cameras, whether CCTV or in people's pockets, let alone microphones or high-resolution images.

Most importantly, the environment on the internet has changed too. 10 years ago the only people scraping up the entire internet were big tech and governments; now it's every startup and group of people that can. These are very different situations. In 2015 image generation was still a joke. Goodfellow's GAN paper was posted in late 2014; that's not even 10 years since we were struggling to generate black-and-white faces at 32x32 pixels.

Things are VERY different, and if you think you could have predicted this, I'm going to call bullshit. Only Captain Hindsight could do that. This kind of thinking is simply victim blaming and lording your particular expertise over others while ignoring your own ignorance, because you let post hoc thinking rewrite your past.

greysphere · 2 years ago
More often than not it's other people putting the information on the internet. Celebrities and politicians have already had to deal with this, but at least they had some compensation/resources to deal with it. We need some protections, or else existing in public spaces will become viable only for the rich. Everyone else will be 'caught' picking their nose on the sidewalk and then fired from their restaurant job or whatever. Not to mention the ramifications for public protest.
masterfooo · 2 years ago
That is an assumption. No one has ever agreed to such a concept on paper.

devjab · 2 years ago
There are a lot of aspects to this, though. We write code for solar power plants; some of it is open source or at least publicly available. What if we wrote something that turned out to be bad, and it was used by someone else through the AI, and it broke their plant… would we be responsible?

Now, you’re probably thinking about this from a reasoning or technical perspective, in which case it’ll appear to be a ridiculous concern… because it is a ridiculous concern. That’s not how our legal department sees it, though. They see it as risk mitigation, and they actually take it rather seriously.

esafak · 2 years ago
I do have an expectation of privacy. It is possible to secure information if so desired.

Do you expect your online bank to secure your data? Your email provider? Your healthcare provider? So you do expect some privacy; I just expect some more. I believe that companies should not own my data just because they provide me with services built on it. From that, consequences follow.

I have a better proposition: let's slap companies that don't respect privacy with fines until they too learn a lesson.

bmitc · 2 years ago
In general, we need a data bill of rights. Corporations are enslaving us digitally, and I feel it's reaching a boiling point.
graphe · 2 years ago
There are zero incentives for them to comply and zero ways a person can hold them accountable. If your SSN can be leaked and nothing happens, why would they care about scrapable pictures?

The only time they care is when it can cost them money. SD didn't care about visual artists but couldn't do the same for generative music, since those rights are managed by deep pockets.

happytiger · 2 years ago
Frankly, and this is not directed at you personally, but that's not really true.

Government can create the incentive, and it has. Legislation can put teeth into societal goods like this even when financial incentives don't.

California has done exactly this with its right-to-be-deleted law.

https://www.foley.com/en/insights/publications/2023/10/calif....

thelittleone · 2 years ago
AI might lead to more paywalled sites, which would suck.
andybak · 2 years ago
There's plenty of AI being trained by hobbyists, artists and enthusiasts too.

happytiger · 2 years ago
I just finished getting downvoted in another thread for writing these very words. Couldn’t agree more.
Sparkyte · 2 years ago
Legally, they can't use your data without your consent, and that data must be deletable on your demand. So it's bad for GenAI to use personal information, and they should change its design. I'd advise that they obfuscate the data so it doesn't contain sensitive information if they intend to build AI around such data (a rough sketch of the idea follows below). Identifiable information is not a good thing.

GDPR and many other laws are still applicable to GenAI.
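
As an illustration of the kind of scrubbing meant above (a minimal sketch: real PII removal needs NER models and far broader coverage, and these regexes are illustrative only):

    import re

    # Hypothetical patterns for a few obvious identifier formats.
    PII_PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    }

    def scrub(text):
        # Replace each match with a typed placeholder before the text
        # enters a training corpus.
        for label, pattern in PII_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    print(scrub("Reach Jane at jane.doe@example.com, SSN 123-45-6789."))
    # -> Reach Jane at [EMAIL], SSN [SSN].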

moose4400 · 2 years ago
Now's a good time to scrub your online identity before things get worse.