At the very least, an action which is effectively instantaneous should never be opt-out. That is, if they can flip a switch and scan millions of blogs in a few hours, days, or even weeks, you need to be able to opt out before that happens. It shouldn't be a race between you and the website to see whether you can find the opt-out toggle before you lose control of your property.
A lesser example of this is when you sign up for a website and are immediately opted in to their newsletter and various other spam email lists. The opt-in happened in ~1ms, but even if you opt out immediately after your first login, you've already been added to their lists by default.
I've always been amused by the fact that they'll bomb me with newsletters and product sales within a minute of signing up for their service, but removing me from their lists may take 5-7 business days.
It made more sense when I understood that saying it may take up to 10 business days to opt me out was a statement recognizing the legal requirements rather than the technical requirements. I'm sure some companies just wait exactly 9 days, 23 hours, 59 minutes and 59 seconds to instantaneously opt me out. Malicious compliance, or as we used to call it, passive-aggressiveness.
We need a legally enforceable mechanism to make anyone think twice before using data without consent. With AI training models, this must have the effect that if a person's data has been incorporated without consent, or they legally revoke assumed consent, the data must be removed. At present, AFAICS, that would mean voiding an entire model. That would be expensive and so be a stiff deterrent against abuse. Copyright has failed. But there are surely other tricks to play.
Why do you believe training an AI model shouldn’t be considered fair use, especially when they are predominantly trained on publicly available information? It’s a completely transformative work where the model contains no part of the original data.
If you shop in a chain store, they generally ask for your email and phone number, usually in exchange for discounts. Giving them this information opts you into their email and sometimes text spam lists, meaning several messages a day, some of which might include relevant information about a sale or discount you're interested in.
The end result is that people stop looking at their email because it's mostly marketing spam and they miss genuinely important emails.
I think the tech industry understands this perfectly well. Large parts also understand that actually asking for and respecting consent will reduce their income stream some. The primary directive of the tech industry is to profit regardless of the cost or harm to others.
Interesting choice of words. As I understand it, and I do not have the judgement transcript, in the British Post Office case those were the words on which Judge Fraser acquitted the postmasters and found against the Post Office.
I think the key words may have been "above all else" with respect to the "interests of the Post Office".
You don't get to say that. Ever.
In law, at least in Britain as I can see it, you don't get to put reputation or profit or anything else "above all else" or "regardless of the cost or harm to others". YMMV elsewhere.
Case in point: when Apple implemented a change to iOS requiring apps to ask whether you want to be tracked. I believe almost everybody says no, which cost the industry a great deal of ad money.
In essence, this is equal to slavery - "someone pays me to flood you with SPAM - if you refuse that SPAM, I will starve to death; so please don't refuse the SPAM"
We know that bees fly to the flowers by their own wish.
Imagine how bees are sitting in their hive and out of nowhere thousands and thousands of flowers are coming to the hive and advertising "Here, be-be-be-beees, the best floral pollen for you! Take 3 for the price of 2!"
If you were a bee - would you consider this normal?
It is, in that it takes advantage of people's speculative nature.
"What if we had all this extra data on our customers?! Then we could do a lot more to make more money! So, it makes sense we ought to pay more for that data too!"
The data provider gets to profit based on this speculative "of course more information/more stuff is better!", while the data purchaser doesn't really have a chance to test whether their speculation is well-founded. (How precisely do you test what the causative effect of changes to your ad campaign is? If you make a change and numbers go up, how do you separate the numbers going up caused by the change from any of the other parallel factors? It's hard to be scientific with advertising.)
The person selling the tool always wins, even if the tool is nothing more than a placebo effect. The person buying the tool, on the other hand... well, good thing they have deep pockets and don't really know how to spend their money well, and good thing too that there are always more suckers ready to take their place (as long as we keep the birth rate up)!
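A toy simulation (all numbers invented) shows why the parenthetical is right to be suspicious of "numbers went up": a confounder that moves together with the campaign swamps the true effect, while a randomized split over the same period recovers it.

```python
import random
random.seed(0)

# Toy model: weekly sales = baseline + seasonal trend + (small) true ad effect + noise.
# All figures are invented purely for illustration.
TRUE_AD_LIFT = 2.0      # what the ad actually adds
SEASONAL_LIFT = 50.0    # what the holiday season adds anyway

def sales(ad_on: bool, holiday: bool) -> float:
    base = 100.0 + random.gauss(0, 5)
    return base + (TRUE_AD_LIFT if ad_on else 0) + (SEASONAL_LIFT if holiday else 0)

# Naive before/after comparison: the campaign launched right as the holidays started,
# so the seasonal lift gets attributed to the ad.
before = sum(sales(ad_on=False, holiday=False) for _ in range(1000)) / 1000
after  = sum(sales(ad_on=True,  holiday=True)  for _ in range(1000)) / 1000
print(f"naive estimate of ad lift: {after - before:.1f}")  # ~52, mostly seasonality

# Randomized split during the same period holds the confounder constant
# and isolates the ad effect.
treat   = sum(sales(ad_on=True,  holiday=True)  for _ in range(1000)) / 1000
control = sum(sales(ad_on=False, holiday=True)  for _ in range(1000)) / 1000
print(f"A/B estimate of ad lift:  {treat - control:.1f}")   # ~2
```

The data seller benefits either way: the naive number looks spectacular, and almost nobody runs the controlled comparison.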
As a consumer, I agree with this take. Looking at it from the other side though, many businesses simply wouldn’t exist online with opt-in. To some extent, you need to understand that companies need to make a profit and we’ve developed a market where that’s not going to be direct payments by users.
Imagining an opt-in policy highlights how unethical these AI data-harvesting schemes really are. It's blindingly obvious that almost nobody would actively choose to donate their work to enrich AI companies without getting anything in return.
Guess I'm 'almost nobody'. I see zero issue with AI farming Stack Overflow, Twitter, Reddit, and the like - publicly accessible forums. The value to me was in the discussion. It's happened; I've extracted my value from that engagement. If an AI company can also extract value from that collection of discussions, it costs me nothing and I expect no compensation.
Though AI companies farming copyrighted work, on the other hand, that's a different story.
The article isn't talking about public discussion forums at all, it's talking about WordPress and its owner, Automattic. The article is a blog post on a WordPress-hosted blog. It then goes on to talk about consent in software in general.
Personally, I'm not really okay with what you're calling public forum harvesting either. I've put a lot of work into Stack Exchange answers and I am not okay with a for-profit company recycling and possibly outright regurgitating that work without attribution. (The latter would be a flagrant violation of CC BY-SA, of course.)
I understand that the author is mad and wants things to be opt-in. But I also think the author is smart enough to know that the tech industry understands consent just fine. It just doesn't care.
twitter and reddit will soon be bots talking to bots (if they aren't already) so the AI can train on that.
> Though AI companies farming copyrighted work on the other hand, that's a different story.
Copyright also happens to be opt-out. You have to explicitly say “this is not copyrighted” for copyright to stop applying.
See your comment and my reply? Both copyrighted. Right now. As soon as we hit publish we started to own copyright. There is an EULA somewhere on HN that probably says we give HN implicit permission to host this content in perpetuity and can even make it available in APIs, show it to bots, etc. But that’s not the same as no copyright. If somebody who is not HN wants to screenshot this comment and publish it in a book, they in theory have to find us and ask for permission.
Inevitably there will be copyrighted images, audio, and text mixed in with random social updates and discussions. It should be on the LLM builder to seek active consent, rather than everyone else to be vigilant and/or sue to get their work out of the model's data.
> I see zero issue with AI farming Stack Overflow,
> Though AI companies farming copyrighted work on the other hand, that's a different story.
All posts on Stack Overflow remain the copyright of their respective posters. They are offered publicly under a Creative Commons license that requires attribution.
In the US, everything you write anywhere online is copyrighted by you, unless you sign a copyright assignment agreement. It's automatic any time you put an expression into a fixed form, and there is no way to revoke that copyright.
Think of it this way: if Google makes $300 billion per year scraping 1 trillion webpages, how much money should it pay for each webpage it scraped?
There's a point in bulk scraping where the logistics of giving people real money for data make no sense. The payout would be too small to waste time thinking about, and the fees and costs of paying anyone in the world would be higher than the actual amount being paid!
I'm not saying it's morally right, I'm saying the only way to be commercially successful is to try to get away with it.
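Back-of-the-envelope, using the hypothetical figures above ($300B revenue, 1T pages), the per-page payout lands at roughly the level of a single payment-processing fee (fee figures assumed):

```python
# Hypothetical figures from the comment above.
revenue = 300e9          # $300 billion per year (assumed)
pages = 1e12             # 1 trillion scraped pages (assumed)

per_page = revenue / pages
print(f"${per_page:.2f} per page")   # $0.30 per page

# An assumed flat-plus-percentage transaction fee (~$0.30 + 2.9%)
# would already consume more than the entire payout:
fee = 0.30 + 0.029 * per_page
print(fee > per_page)    # True: paying costs more than the payment
```

And that $0.30 is the ceiling, assuming every cent of revenue were redistributed; any realistic share would be orders of magnitude smaller.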
We're properly into the age of all-out "lawfare" now. I wonder what happened to the likes of Lawrence Lessig and Pamela Jones, and all those legal minds who used to weigh in on the side of ordinary decent people. We could use some easily deployed retaliatory weapons and countermeasures about now.
You can really only laugh when companies like OpenAI say they're working on this problem, and "working on it" turns out to be a tedious opt-out form that you need to fill out for each and every piece of work they may or may not have ingested, and no, they won't be retraining the model. It's obvious to anyone with a brain that they are acting in bad faith.
Personally, I feel it would be a much smaller problem if Tumblr had an internal AI thing going on. What users REALLY don't like is that they have confided a post to one website, and that website just shared that post with a third party, because it opens up infinite possibilities.
If Tumblr can take your post and give it to OpenAI, they can take your post and give it to anyone, and that's the problem. Because for users, what they post is "between them and Tumblr" and not anybody else.
I'll even say more. Artists don't care if you scrape their art to make AI generated art with it. Because when this happens, it's "between artists and scrapers" and not anybody else, so it's fine. What they do care about is when people post that AI generated art on the internet, or publish it professionally, or do basically anything with it.
In other words, there's a sense of privacy when there is only two parties involved, no matter what is going on, but the instant a third party gets into the system, the first party will freak out, because then you lose that sense of privacy.
I think "consent" is not the right word for this, because it's never simply "consent"; it's always boundaries and expectations. Consent implies there's always one yes/no answer for a process. In practice the process is always so complicated that it would require countless yes/no questions, to the point that nobody sane would want to deal with it.
Just look at cookie banners and imagine if we had a consent popup for every single thing that needs to be downloaded to show a page: do you consent to download the HTML? Do you consent to download the CSS? Do you consent to download the JavaScript? Do you consent to download the images? The user just wants to see the page. You have to make assumptions about how far that consent goes, so it's necessarily transitive; the problem is at what point it crosses the line.
https://gdpr-info.eu/issues/consent/
> The data subject must also be informed about his or her right to withdraw consent anytime. The withdrawal must be as easy as giving consent.
> Last but not least, consent must be unambiguous, which means it requires either a statement or a clear affirmative act. Consent cannot be implied and must always be given through an opt-in, a declaration or an active motion, so that there is no misunderstanding that the data subject has consented to the particular processing.
https://gdpr-info.eu/issues/fines-penalties/
> For especially severe violations, listed in Art. 83(5) GDPR, the fine framework can be up to 20 million euros, or in the case of an undertaking, up to 4 % of their total global turnover of the preceding fiscal year, whichever is higher. But even the catalogue of less severe violations in Art. 83(4) GDPR sets forth fines of up to 10 million euros, or, in the case of an undertaking, up to 2% of its entire global turnover of the preceding fiscal year, whichever is higher.
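A minimal sketch of the "whichever is higher" rule quoted above, with hypothetical turnover figures:

```python
# Fine ceilings under Art. 83(4)/(5) GDPR as quoted above:
# a fixed amount or a percentage of global turnover, whichever is higher.
def fine_cap(global_turnover_eur: float, severe: bool) -> float:
    """Maximum administrative fine for an undertaking."""
    if severe:   # Art. 83(5): EUR 20M or 4% of total global turnover
        return max(20_000_000, 0.04 * global_turnover_eur)
    else:        # Art. 83(4): EUR 10M or 2% of total global turnover
        return max(10_000_000, 0.02 * global_turnover_eur)

# For a smaller company, the fixed ceiling dominates:
print(fine_cap(50_000_000, severe=True))       # 20000000 (EUR 20M floor)
# For a large undertaking, the percentage dominates:
print(fine_cap(100_000_000_000, severe=True))  # 4000000000.0 (4% = EUR 4B)
```

The percentage branch is what gives the regulation teeth against companies big enough to shrug off a fixed EUR 20M.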
This is one of the big reasons why I do not create accounts or sign up for things online if I can possibly avoid it.
Is it just a kleptomaniac instinct to hoard data, despite the fact that it is really a liability [0] in most cases?
[0] https://www.schneier.com/essays/archives/2016/03/data_is_a_t...
If a business can't exist without abusing people, perhaps it shouldn't exist.
When you put it that way, yes, you obviously are. So is every other individual.
That's why they always ask for (or more frequently just take) "a paid up, non-exclusive, irrevocable, worldwide license"