ilamont · 2 years ago
Not just personal information. Blogs, commercial content, news articles and more. Check out the Allen Institute for Artificial Intelligence's C4 dataset to see if anything you wrote was ingested:

https://c4-search.apps.allenai.org

Love the disclaimer: "The dataset is released under the terms of ODC-BY. By using this, you are also bound by the Common Crawl Terms of Use in respect of the content contained in the dataset."

jprete · 2 years ago
The article weasels out at the end by claiming that companies “may be unable to comply” with requirements to delete personal data. It's easy to comply: if there's no other choice, then you delete the model that was trained in flagrant violation of the law, along with all backups and derivative data.
ivalm · 2 years ago
For most companies, “deleting the model” is equivalent to dissolving the company, so that is equivalent to not being able to comply. More realistically, what they would need to do is exit the market of any country that has such stupid laws.
dylan604 · 2 years ago
>For most companies “deleting the model” is equivalent to dissolving the company

I'm okay with that. It's sad that you're not. If you're willing to start a business on such shady foundations, there's a really good chance your business will continue to make shady decisions in the future. It's better to find and remove the cancer early.

soulofmischief · 2 years ago
Which law is that?
john-radio · 2 years ago
The post that you're responding to doesn't make any claim about the law; it expresses that the defense that an AI company might be "unable to comply" with a command to remove user data is not an honest one.
TrueDuality · 2 years ago
There is currently no law in the US or EU that I'm aware of that considers models trained on PII or copyrighted data to include that data. This is the core issue everyone is discussing about copyright around models and whether publicly scraped data is fair use for these purposes.

Only once this is decided, likely by a long series of trials in high courts or by passage of law through the appropriate legislative bodies (and then likely after many challenges), will GDPR and CCPA be able to apply to these language models.

So no, this is not currently a violation of any law.

batch12 · 2 years ago
I was daydreaming about other possible ways to comply, but I don't know if they'd really work in practice. Ideas like:

- Supply copyrighted keywords and approved responses to try to retrain the model.

- Train on the copyrighted content again with inverse weighting. I imagine this one wouldn't really work well while still keeping the model's performance up...
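
A rough sketch of what that "inverse weighting" idea could look like as plain gradient ascent on the content to be forgotten, assuming a PyTorch setup where the model returns a HuggingFace-style .loss (all names here are hypothetical, and naive unlearning like this does tend to tank overall performance, matching the skepticism above):

    def unlearning_step(model, optimizer, forget_batch, retain_batch, forget_weight=0.1):
        # Hypothetical sketch: model is assumed to return an object with a
        # .loss attribute, as HuggingFace transformers models do.
        optimizer.zero_grad()
        # Ordinary loss on data the model should keep handling well.
        retain_loss = model(**retain_batch).loss
        # Negated, down-weighted loss on the content to forget: minimizing
        # this term maximizes the original loss, pushing the model away
        # from reproducing that content.
        forget_loss = -forget_weight * model(**forget_batch).loss
        (retain_loss + forget_loss).backward()
        optimizer.step()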

valianteffort · 2 years ago
I don't get it: you put that information on the internet, you have no expectation of privacy. But now maybe you've learned a lesson, and you won't publicly share things you don't want people to see?
jjav · 2 years ago
> I don't get it: you put that information on the internet, you have no expectation of privacy.

While that is somewhat true, there are multiple angles to it. The big one is expectations. It used to be that if you shared something on an obscure forum with about 100 readers, most of those people expected that only those 100 would see their posts. And it was true, so it reinforced the expectation for nontechnical people, which is basically everyone within a rounding error.

Then everything gets indexed and fed to AI, and suddenly billions of people have easy access to what they felt was very limited in distribution.

You could file this under "People don't threat model correctly", but also (more importantly) under the fact that technology should consider human nature before breaking expectations.

visarga · 2 years ago
You should consider the size of the training set as well. A blog post in a 30T-token dataset like RedPajama has less impact than one in a fine-tuning dataset of 1,000 examples. The gradients from all tokens are added up; they stack on top of each other, and any single document's influence is diluted in larger datasets.
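
As a back-of-the-envelope illustration of that dilution (a toy calculation, not how training influence is actually measured): with a mean-reduced loss, the update direction is the average of per-example gradients, so one document's pull on the parameters shrinks roughly as 1/n:

    def relative_influence(grad_contribution, n_examples):
        # The gradient of a mean loss is the average of per-example
        # gradients, so each example contributes with weight 1/n.
        return grad_contribution / n_examples

    for n in (1_000, 1_000_000, 1_000_000_000):
        print(f"dataset size {n:>13,}: one post's share of an update ~ {relative_influence(1.0, n):.0e}")

The same post in a 1,000-example fine-tuning set carries roughly a million times more weight per update than in a billion-document corpus.
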
fsckboy · 2 years ago
Do you ever save things you see on the internet that have some "scandal" merit? Say, for example, you don't like Donald Trump and you come across a variety of material that's embarrassing to him. Or your school bully, or your gf's ex, or Macron in France, Trudeau in Canada, etc. Would you feel that other people should get to tell you to delete it because you're not allowed to use any aids to help you remember or prove why they are bad people (in your estimation)? Freedom to remember is part and parcel of free speech.
muspimerol · 2 years ago
Did you read the article? There are cases where the data was explicitly not publicly shared. Why would this journalist have no expectation of privacy? https://arstechnica.com/information-technology/2022/09/artis...

Even for publicly shared data, there is legislation in many parts of the world (notably the entire EU) that enforces an individual's right to have their data deleted.

valianteffort · 2 years ago
Did you read what I said? Anything posted on the internet is public. Just because you think it's private does not make it so.

To that end, you should consider any file on your computer to be public as well if it's connected to the internet, as there are countless ways that data could be exfiltrated.

thefz · 2 years ago
This is similar to saying that if girls don't want to be molested, they should not wear a miniskirt in public.

One may not have been aware of the future existence of AI when writing on a public platform.

Or one may simply want his blog/thoughts to be intended for human consumption, not to train the latest commercial product, which will reap from his content without sharing any profit.

scrollaway · 2 years ago
How is it similar to that strawman?

There is no expectation of privacy in public. That is pretty much the definition. It’s not victim blaming.

fsckboy · 2 years ago
> saying that if girls don't want to be molested, they should not wear a mini skirt in public

Such a ridiculous example if you are trying to convince me that you care about these girls' safety, as opposed to some abstract egg-headed right.

stingraycharles · 2 years ago
The “right to be forgotten” is an explicit right in the EU, which I think applies here.
qweqwe14 · 2 years ago
I'm not saying that this right is a bad thing, but the reality is that it's rarely enforceable. As soon as you put something on the internet, you have every "right" to expect that it's there forever.

No amount of legislation can outweigh the technical reality.

valianteffort · 2 years ago
Americans are not bound by EU laws. We appreciate you forcing the iPhone to USB-C, but you can keep the rest of your authoritarian laws.
constantly · 2 years ago
EU rights do not apply in the US.
exabrial · 2 years ago
> you put that information on the internet, you have no expectation of privacy

That was the expectation until Zuckerberg came along. People were putting tons of stuff on BBSs, intranets, and elsewhere 20+ years beforehand because the idea of "consent" existed back then: using someone's information for a purpose they didn't intend was (and still is) wrong.

FeepingCreature · 2 years ago
Using someone's information for a purpose they didn't intend is wrong? What? This just seems obviously mistaken. I don't agree with that at all. I'm not even sure how to argue about that.

Lots of art, science and technology can be considered as "using someone's information for a purpose they didn't intend." It is extremely normal for people to find new uses for things; information is not exempt from this.

caturopath · 2 years ago
> That was the expectation until Zuckerberg came along.

This isn't my recollection of the pre-Facebook internet.

icelancer · 2 years ago
> That was the expectation until Zuckerberg came along.

The early days of the Internet were full of "information should be free" hackers, many of whom roam this very forum in their 40s, 50s, and 60s.

Well, information is free now. And this is what it looks like - consumed by LLMs. We got what we asked for.

vasco · 2 years ago
This seems like a fantasy. Anything could end up on 4chan with a photoshopped penis on it, and that was true long before Zuck came along.
cratermoon · 2 years ago
People didn't put their private medical records on the open internet. They may have uploaded them to a service they thought was respecting privacy, but one that was either selling their data without disclosure or just straight-up incompetent and uploaded private-user-data.zip to a public Dropbox share.
ClumsyPilot · 2 years ago
> you put that information on the internet, you have no expectation of privacy.

But if Disney puts information on the internet, and you misuse it, lawyers will chase you to the ends of the earth.

We both have the same rights and copyright, but somehow it's one rule for them and another rule for me.

Sparkyte · 2 years ago
GDPR and many data rights exist today, and they still apply to AI.
thayne · 2 years ago
> you put that information on the internet

That isn't necessarily true. Someone else might have put the information on the internet. That could be someone else uploading a photo of you, or records of your home purchase in a city's public records, or an obituary or wedding announcement that listed you as family, etc., etc.

godelski · 2 years ago
This is wildly disingenuous. I don't even know where to begin.

Services are advertised as private, so why wouldn't people act like they are private? Oh, you're not tech literate and so don't know something "obvious"? Guess it's your fault for listening to your financial advisor and losing all your money too. We live in a specialized world; you don't want to go down this route, and it's an unreasonable one, because we can't have infinite knowledge and infinite time to gain it. There's absolutely no way you know tech, medicine, economics, law, science, politics, or any other field in high detail. You don't even know one of these, but rather a small section of one.

Yes, they may be "public", but you can have an expectation of privacy in public. Also, your lack of expectation in public is not infinite. The lack of privacy in public is an astonishingly new phenomenon, realistically only from the last 10-15 years, as cellphones have become prolific. Things were way different just in the early 2000s, when there weren't nearly as many cameras, whether CCTV or in people's pockets, let alone microphones or high-resolution images.

Most importantly, the environment on the internet has changed too. 10 years ago the only people scraping up the entire internet were big tech and governments; now it's every startup and group of people that can. These are very different situations. In 2015 image generation was still a joke. Goodfellow's GAN paper was posted in late 2014; that's not even 10 years since we were struggling to generate black-and-white faces at 32x32 pixels.

Things are VERY different, and if you think you could have predicted this, I'm going to call bullshit. Only Captain Hindsight could do that. This kind of thinking is simply victim blaming and lording your particular expertise over others while ignoring your own ignorance, because you let post hoc thinking rewrite your past.

greysphere · 2 years ago
More often than not it's other people putting the information on the internet. Celebrities and politicians have already had to deal with this, but at least they had some compensation/resources to deal with it. We need some protections, or else existing in public spaces will become viable only for the rich. Everyone else will be 'caught' picking their nose on the sidewalk and then fired from their restaurant job or whatever. Not to mention the ramifications for public protest.
masterfooo · 2 years ago
That is an assumption. No one has ever agreed to such a concept on paper.

devjab · 2 years ago
There are a lot of aspects to this, though. We write code for solar power plants; some of it is open source or at least publicly available. What if we wrote something that turned out to be bad, and it was used by someone else through the AI, and it broke their plant… would we be responsible?

Now, you’re probably thinking about this from a reasoning or technical perspective, in which case it’ll appear to be a ridiculous concern… because it is a ridiculous concern. That’s not how our legal department sees it, though. They see it as risk mitigation, and they actually take it rather seriously.

esafak · 2 years ago
I do have an expectation of privacy. It is possible to secure information if so desired.

Do you expect your online bank to secure your data? Your email provider? Your healthcare provider? So you do expect some privacy; I just expect some more. I believe that companies should not own my data just because they provide me with services built on it. From that, consequences follow.

I have a better proposition: let's slap companies that don't respect privacy with fines until they too learn a lesson.

bmitc · 2 years ago
In general, we need a data bill of rights. Corporations are enslaving us digitally, and I feel it's reaching a boiling point.
graphe · 2 years ago
There are zero incentives for them to comply and zero ways a person can hold them accountable. If your SSN can be leaked and nothing happens, why would they care about scrapable pictures?

The only time they care is when it can cost them money. SD didn't care about visual artists but couldn't do the same for generative music, since those rights are managed by deep pockets.

happytiger · 2 years ago
Frankly, and this is not directed at you personally, but that's not really true.

Government can create the incentive, and it has. Legislation can put teeth into societal goods like this even when financial incentives don't.

California has done exactly this with its right-to-be-deleted law.

https://www.foley.com/en/insights/publications/2023/10/calif....

thelittleone · 2 years ago
AI might lead to more paywalled sites, which would suck.
andybak · 2 years ago
There's plenty of AI being trained by hobbyists, artists and enthusiasts too.

happytiger · 2 years ago
I just finished getting downvoted in another thread for writing these very words. Couldn’t agree more.
Sparkyte · 2 years ago
Legally, they can't use your data without your consent, and that data must be deletable on your demand. So it's bad for GenAI to use personal information, and they should change its design. I'd advise that they obfuscate the data so it doesn't contain sensitive information if they intend to build AI around such data (a rough sketch of the idea follows below). Identifiable information is not a good thing.

GDPR and many other laws are still applicable to GenAI.
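
As an illustration of the kind of scrubbing meant above (a minimal sketch: real PII removal needs NER models and far broader coverage, and these regexes are illustrative only):

    import re

    # Hypothetical patterns for a few obvious identifier formats.
    PII_PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    }

    def scrub(text):
        # Replace each match with a typed placeholder before the text
        # enters a training corpus.
        for label, pattern in PII_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    print(scrub("Reach Jane at jane.doe@example.com, SSN 123-45-6789."))
    # -> Reach Jane at [EMAIL], SSN [SSN].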

moose4400 · 2 years ago
Now's a good time to scrub your online identity before things get worse.