Readit News
djoldman · a month ago
Just to be clear, as with LAION, the data set doesn't contain personal data.

It contains links to personal data.

The title is like saying that sending a magnet link to a copyrighted torrent is distributing copyrighted material. Folks can argue whether that's true, but the discussion should at least be transparent.

yorwba · a month ago
I think the data set is generally considered to consist of the images, not the list of links for downloading the images.

That the data set aggregator doesn't directly host the images themselves matters when you want to issue a takedown (targeting the original image host might be more effective) but for the question "Does that mean a model was trained on my images?" it's immaterial.

ryanjshaw · a month ago
It does matter? When implemented as a reference, the image can be taken down and will no longer be included in training sets*. As a copy, the image is eternal. What’s the alternative?

* Assuming the users regularly check the images are still being hosted (probably something that should be regulated)

djoldman · a month ago
The data set is a list of ("descriptive text", URL) tuples.

As with almost any URL, it is not in and of itself an image.

As an aside, this presents a problem for researchers because the links can resolve to different resources, or no resource at all, depending on when they are accessed.

Therefore this is not a static dataset on which a machine learning model can be trained in a guaranteed reproducible fashion.
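To make the reproducibility point concrete, here is a minimal sketch of what consuming such a (caption, URL) index looks like. The tuples and field layout here are assumptions for illustration, not the actual DataComp schema:

```python
import urllib.request

# Hypothetical (caption, URL) tuples in the style of a LAION/CommonPool
# index; the real dataset's schema and fields may differ.
samples = [
    ("a photo of a cat", "https://example.com/cat.jpg"),
    ("a scanned document", "https://example.com/gone.png"),
]

def fetch(url, timeout=3):
    """Download the image bytes, or return None if the link is dead."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:  # covers DNS failures, timeouts, and HTTP errors
        return None

# Every training run has to re-resolve the URLs; links that die or start
# serving different content silently change the effective training set.
resolved = [(caption, fetch(url)) for caption, url in samples]
live = [(c, img) for c, img in resolved if img is not None]
```

Whatever fraction of `live` survives on a given day is what the model actually sees, which is exactly why two runs of the same pipeline can train on different data.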

duskwuff · a month ago
That's a distinction without a difference. Just as with LAION, anyone using this data set is going to be downloading the images and training on them, and the potential harms to the affected users are the same.
djoldman · a month ago
LAION was alleged to link to CSAM. If LAION didn't link and instead hosted/contained/distributed the actual files, I think there would be a much higher chance that someone distributing LAION could serve prison time, at least in the USA.

That seems like a pretty big difference to me.

kazinator · a month ago
When the model is trained, are the links not resolved to fetch whatever they point to, and that goes into the model?

Secondly, privacy and copyright are different. Privacy is more about how information is used, whereas copyright is about getting credit and monetization as the author.

anonymoushn · a month ago
no, normally your training pipeline wouldn't involve running bittorrent
bearl · a month ago
Links to PII are by far the worst sort of PII, yes.

“It’s not his actual money, it’s just his bank account and routing number.”

djoldman · a month ago
A more accurate analogy is "it's not his actual money, it's a link to a webpage or image that has his bank account and routing number."
atemerev · a month ago
Well yes, by knowing my bank account and routing number, you don't have access to my money.

Frieren · a month ago
> The title is like saying that sending a magnet link to a copyrighted torrent file is distributing copyright material.

I interpret the article as being about AI trained on personal data. That is a serious breach of many countries' legislation.

And AI is 100% being trained on copyrighted data too, breaking another set of laws.

That shows how much big-tech is just breaking the law and using money and influence to get away with it.

os2warpman · a month ago
"Ladies and gentlemen of the jury, my client did not rob that bank. He only made a Google Maps link to directions to the bank, a link to an Imgur image containing the vault's combination, and a link to a Pastebin with instructions on how to disable the security system available. He merely packaged that information together and made it publicly available in a single source in a format only really useful to robbers for the purpose of robbery training. It's twoo hward to actually look at the information one is compiling and releasing to the public and to expect even a microscopically minuscule cursory amount of minimal effort to that end is unreasonable. He is clearly innocent."
lazide · a month ago
What do you think they’d be charged with in this situation?

It wouldn’t be bank robbery.

cheschire · a month ago
I hope future functionality of haveibeenpwned includes a tool to search LLM models and training data for PII based on the collected and hashed results of this sort of research.
croes · a month ago
Hard to search in the model itself
cheschire · a month ago
Yep, that's why at the end of my sentence I referred to the results of research efforts like this that do the hard work of extracting the information in the first place.
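A lookup like that could plausibly work the way HIBP's password search does: publish only hashes of the extracted PII and let people check themselves against them. This is a speculative sketch of the idea; the hash set is a stand-in for published research results, which don't exist in this form:

```python
import hashlib

def sha1_upper(s):
    """SHA-1 hex digest, uppercased, HIBP-style."""
    return hashlib.sha1(s.encode("utf-8")).hexdigest().upper()

# Hypothetical set of hashes of PII strings that researchers extracted
# from a training set and published in lieu of the plaintext.
extracted_hashes = {sha1_upper("jane.doe@example.com")}

def was_my_data_in_the_set(email):
    """Check your own email against the published hashes; nobody
    has to host or redistribute the plaintext PII."""
    return sha1_upper(email) in extracted_hashes
```

The point is that the expensive step is the extraction research itself; the lookup side is trivial once hashed results exist.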
satvikpendem · a month ago
This is all public data. People should not be putting personal data on public image hosts and sites like LinkedIn if they did not want them to be scraped. There is nothing private about the internet and I wish people understood that.
andsoitis · a month ago
> There is nothing private about the internet and I wish people understood that.

I don’t know that that is useful advice for the average person. For instance, you can access your bank account via the internet, yet there are very strong privacy guarantees.

I concur that what you say is a safe default assumption, but then you need a way for people not to start mistrusting all internet services because everything is considered public.

satvikpendem · a month ago
I'm more so talking about posting things consciously on some platform, not accessing.
pera · a month ago
> This is all public data

It's important to know that generally this distinction is not relevant when it comes to data subject rights like GDPR's right to erasure: If your company is processing any kind of personal data, including publicly available data, it must comply with data protection regulations.

satvikpendem · a month ago
That's all fine. But until someone requests their information to be deleted, it is still public.
booder1 · a month ago
The law has in no way been able to keep up with AI. Just look at copyright. Internet data is public and the government is incapable of changing this.
boie0025 · a month ago
While I agree with your sentiment, there's a pretty good chance that at least some of this is, for example, data that inadvertently leaked while someone accidentally exposed an automatic index with Apache, or perhaps an asset manifest exposed a bunch of uploaded images in a folder or bucket that wasn't marked private for whatever reason. I can think of a lot of reasons this data could be "public" that would be well beyond the control of the person exposed. I also don't think that there's a universal enough understanding that uploading something to your WordPress or whatever personal/business site to share with a specific person, with an obscure unpublished URL is actually public. I think these lines are pretty blurry.

Edit: to clarify, in the first two examples I'm referring to web applications that the exposed person uses but does not control.

malfist · a month ago
What's important is that we blame the victims instead of the corporations that are abusing people's trust. The victims should have known better than to trust corporations
nerdjon · a month ago
Right, both things can be wrong here.

We need to better educate people on the risks of posting private information online.

But that does not absolve these corporations of criticism of how they are handling data and "protecting" people's privacy.

Especially not when those companies are using dark patterns to convince people to share more and more information with them.

thinkingtoilet · a month ago
If this was 2010 I would agree. This is the world we live in. If you post a picture of yourself on a lamp post on a street in a busy city, you can't be surprised if someone takes it. It's the same on the internet and everyone knows it by now.
Workaccount2 · a month ago
I have negative sympathy for people who still aren't aware that if they aren't paying for something, they are the something to be sold. This has been the case for almost 30 years now with the majority of services on the internet, including this very website right here.
keybored · a month ago
Modern companies: We aim to create or use human-like AI.

Those same modern companies: Look, if our users inadvertently upload sensitive or private information then we can't really help them. The heuristics for detecting those kinds of things are just too difficult to implement.

squigz · a month ago
> The victims should have known better than to trust corporations

Literally yes? Is this sarcasm? Are we in 2025 supposed to implicitly trust multi-billion dollar multi-national corporations that have decades' worth of abuses to look back on? As if we couldn't have seen this coming?

It's been part of every social media platform's ToS for many years that they get a license to do whatever they want with what you upload. People have warned others about this for years and nothing happened. Those platforms have already used that data for image classification, identification, and the like. But nothing happened. What's different now?

blitzar · a month ago
> blame the victims

If you post something publicly, you can't complain that it's public.

johnnyanmac · a month ago
>People should not be putting personal data on public image hosts and sites like LinkedIn if they did not want them to be scraped.

So my choice in society is to not have a job or get interviews and accept that I have no privacy in the modern world, being mined for profit to companies that lay off their workers anyway.

By the way, I was also recommended to make and show off a website portfolio to get interviews... sigh.

ako · a month ago
But that is information you intend to be public, you want it in google, and in ai models as they are replacing traditional search engines. The only reason you put it on LinkedIn is for other people to find you, so be happy the llm helps.
satvikpendem · a month ago
You don't have to use LinkedIn or similar, many people don't.
chrisg23 · a month ago
What if someone else posts your personal data on the public internet and it gets collected into a dataset like this?
satvikpendem · a month ago
How is that not a different story?
Anonbrit · a month ago
A hidden camera can make your bedroom public. Don't do it if you don't want it to be on pay-per-view?
satvikpendem · a month ago
That is indeed what Justin.tv did, to much success. But that was because Justin consented to it, just as anyone who posts something online consents to it being seen by anyone.
dlivingston · a month ago
Your analogy doesn't hold. A 'hidden camera' would be either malware that does data exfiltration, or the company selling/training on your data outside of the bounds of its terms of service.

A more apt analogy would be someone recording you in public, or an outside camera pointed at your wide-open bedroom window.

dpoloncsak · a month ago
Does this analogy really apply? Maybe I'm misunderstanding, but it seems like all of this data was publicly available already, and scraped from the web.

In that case, it's not a 'hidden camera'... users uploaded this data and made it public, right? I'm sure some of it was due to misconfiguration or whatever (like we saw with Tea), but it seems like most of this was uploaded by users to the clear web. I'm all for "don't blame the victims", but if you upload your CC to Imgur, I think you deserve to have to get a new card.

Per the article "CommonPool ... draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022."

jeroenhd · a month ago
AI and scraping companies are why we can't have nice things.

Of course privacy law doesn't necessarily agree with the idea that you can just scrape private data, but good luck getting that enforced anywhere.

1vuio0pswjnm7 · a month ago
archive.is is (a) sometimes blocked, (b) serves CAPTCHAs in some instances and (c) includes a tracking pixel

One alternative to archive.is for this website is to disable Javascript and CSS

Another alternative is the website's RSS feed

Works anywhere without CSS or Javascript, without CAPTCHAs, without tracking pixel

For example,

    curl https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ \
    |(echo "<meta charset=utf-8>";grep -E "<pubDate>|<p>|<div") > 1.htm

    firefox ./1.htm
To retrieve only the entry about DataComp CommonPool,

    curl https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ \
    |sed -n '/./{/>1120522<\/post-id>/,/>1120466<\/post-id>/p;}' \
    |(echo "<meta charset=utf-8>";grep -E "<pubDate>|<p>|<div") > 1.htm

    firefox ./1.htm

1vuio0pswjnm7 · a month ago
If using a text-only browser that does not process CSS or run Javascript, 100% of the article is displayed
pera · a month ago
Yesterday I asked if there is any LLM provider that is GDPR compliant: at the moment I believe the answer is no.

https://news.ycombinator.com/item?id=44716006

thrance · a month ago
Mistral's products are supposed to be at least, since they are based in the EU.
pera · a month ago
I am not sure if Mistral is: if you go to their GDPR page (https://help.mistral.ai/en/articles/347639-how-can-i-exercis...) and then to the erasure request section they just link to a "How can I delete my account?" page.

Unfortunately they don't provide information regarding their training sets (https://help.mistral.ai/en/articles/347390-does-mistral-ai-c...) but I think it's safe to assume it includes DataComp CommonPool.

wilg · a month ago
GDPR has plenty of language related to reasonability, cost, feasibility, and technical state of the art that probably means LLM providers do not have to comply in the same way, say, a social platform might.
GardenLetter27 · a month ago
This just demonstrates how bad the GDPR is rather than the LLMs though.

China must be laughing.

tonyhart7 · a month ago
so your best bet is open weight LLM then???

but is that a breach of GDPR???

pera · a month ago
There is currently no effective method for unlearning information, especially not when you don't have access to the original training datasets (as is the case with open-weight models), see:

Rethinking Machine Unlearning for Large Language Models

https://arxiv.org/html/2402.08787v6

atoav · a month ago
Only if it contains personal data you collected without explicit consent ("explicit" here means literally asking: "I want to use this data for that purpose, do you allow this? Y/N").

Also people who have given their consent before need to be able to revoke it at any point.

itsalotoffun · a month ago
I WISH this mattered. I wish data breaches actually carried consequences. I wish people cared about this. But people don't care. Right up until you're targeted for ID theft, fraud or whatever else. But by then the causality feels so diluted that it's "just one of those things" that happens randomly to good people, and there's "nothing you can do". Horseshit.
rypskar · a month ago
We should also stop calling it ID theft. The identity is not stolen; the owner still has it. Calling it ID theft shifts the responsibility from the party the fraud is committed against (often banks or other large entities) to an innocent third party.
herbturbo · a month ago
Yes tricking a bank into thinking you are one of their customers is not the same as assuming someone else’s identity.
JohnFen · a month ago
> Calling it ID theft is moving the responsibility from the one that a fraud is against (often banks or other large entities)

The victim of ID theft is the person whose ID was stolen. The damage to banks or other large entities pales in comparison to the damage to those people.

laughingcurve · a month ago
It’s not clear to me how this is a data breach at all. Did the researchers hack into some database and steal information? No?

Because afaik everything they collected was public web. So now researchers are being lambasted for having data in their sets that others released

That said, masking obvious numbers like SSNs is low-hanging fruit. Trying to scrub every piece of public information that can identify a person is insane.
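To illustrate how low-hanging that fruit is, here is a rough sketch. The pattern is a naive US-SSN regex I'm assuming for illustration, not anything the CommonPool maintainers actually use, and real filtering would need more formats and validity checks:

```python
import re

# Naive pattern for US Social Security numbers (AAA-GG-SSSS).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssns(text):
    """Replace anything that looks like an SSN with a placeholder."""
    return SSN_RE.sub("[REDACTED-SSN]", text)

masked = mask_ssns("Contact Jane, SSN 123-45-6789, for details.")
```

Names, faces, and addresses are far harder to catch reliably, which is the other half of the comment's point.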

atoav · a month ago
It doesn't now, but we could collectively decide to introduce consequences of the kind that deter anybody willing to try this again.
jelvibe25 · a month ago
What's the right consequence in your opinion?
passwordoops · a month ago
Criminal liability with a minimum of 2 years served for executives, and fines amounting to 110% of total global revenue for the company that allowed the breach, would see cybersecurity taken a lot more seriously in a hurry.

krageon · a month ago
A stolen identity destroys the life of the victim, and there's going to be more than one. They (every single involved CEO) should have all of their assets seized, to be put in a fund that is used to provide free legal support to the victims. Then they should go to a low-security prison and have mandatory community service for the rest of their lives.

They probably can't be redeemed and we should recognise that, but that doesn't mean they can't spend the rest of their life being forced to be useful to society in a constructive way. Any sort of future offense (violence, theft, assault, anything really) should mean we give up on them. Then they should be humanely put down.
