Readit News
Posted by u/wxce 3 years ago
Ask HN: Isn't ChatGPT unfair to the sources it scraped data from?
ChatGPT scraped data from various sources on the internet.

> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.

I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.

Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and the work the authors put in, as people would no longer even attempt to visit them.

Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?

ergonaught · 3 years ago
Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

People get internet hostile at me for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle to attack it than "fairness".

_qzu4 · 3 years ago
> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

Because it's false equivalence? ChatGPT isn't a human being. It's a product that is built upon data from other sources.

The question is if this data is legal to scrape, which it is: Web scraping is legal, US appeals court reaffirms [https://news.ycombinator.com/item?id=31075396].

As long as the content is not copyrighted and it's not regurgitating the exact same content, then it should be okay.

Retr0id · 3 years ago
Being allowed to scrape something does not absolve you of all intellectual property, copyright, moral, etc. issues arising from subsequent use of the scraped data.
faktory · 3 years ago
ChatGPT isn't doing the scraping, humans are. And humans use computers both to read articles and create content, and to scrape them.

So no, it's not a false equivalence.

anonymouskimmer · 3 years ago
Check me on this because I'm not a software person:

When a person "scrapes" a website by clicking through the link it registers as a hit on the website and, without filters being turned on, triggers the various ad impressions and other cookies. Also if the person needs that information again odds are they'll click on a bookmark or a search link and repeat the impression process all over again.

When an AI scrapes the web it does so once, and possibly in a manner designed to not trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.
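To make the distinction concrete, here is a minimal sketch (hypothetical, stdlib only) of why a scrape differs from a browser visit: a scraper consumes the raw HTML once and never executes the ad or analytics scripts, so no impression is ever registered.

```python
from html.parser import HTMLParser

class TextOnlyScraper(HTMLParser):
    """Collects visible text while skipping <script>/<style> content,
    so ad and tracker code is never executed -- or even kept."""
    def __init__(self):
        super().__init__()
        self.skip = 0      # depth inside <script>/<style>
        self.chunks = []   # extracted visible text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

# A toy page: one paragraph of content plus the ad/tracking machinery
# a real browser visit would run.
page = """<html><body>
<p>Article text worth training on.</p>
<script src="https://ads.example.com/tracker.js"></script>
<script>trackImpression();</script>
</body></html>"""

scraper = TextOnlyScraper()
scraper.feed(page)
print(scraper.chunks)  # only the article text survives; no tracker ran
```

A human revisiting the page repeats the whole impression cycle each time; a scraper typically runs this extraction once and works from the cached text thereafter.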

jMyles · 3 years ago
> The question is if this data is legal to scrape

...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?

> Because it's false equivalence? ChatGPT isn't a human being.

Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?

It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.

bilsbie · 3 years ago
> It's a product that is built upon data from other sources.

To be fair, so are you.

nCave · 3 years ago
Does anyone actually find these arguments persuasive?

There is really no reason to believe that what chatGPT or stable diffusion does is anything like what "your brain" does--except in the most superficial, inconsequential way.

Second, try applying this logic to literally anything else and you'll see why it's absurd:

"You can't ban cars from driving on sidewalks! If it's acceptable for people to walk on sidewalks, then it has to be acceptable for cars to drive on sidewalks, since it's just automated walking"

"You can't ban airplanes from landing in ponds. They fly 'just like' ducks fly! So if it's acceptable for them, it must be acceptable for airplanes too"

dawsoneliasen · 3 years ago
Yes, and: why shouldn’t it matter that in one case it is a person and in another it is a computer program?

Why would it be incoherent to say “I’m okay with a person reading, synthesizing, and then utilizing this synthesis—but I’m not okay with a company profiting off of a computer doing the same thing.” What’s wrong with that?

But again, like you and others have said, it’s really not the same thing at all! All ChatGPT (or any other deep learning model) is capable of doing is synthesizing “in the most superficial way.” What a person does is completely different, much more interesting.

BurningFrog · 3 years ago
I find the argument pretty persuasive.

I also agree it's not the only argument and ultimate proof.

I don't, at this point, have an answer. I'm sure this miraculous new technology will survive the luddite attacks, but there will probably be some tense moments, and some jurisdictions will choose to be left behind.

hannob · 3 years ago
You usually expect people to cite sources. Granted, that very often doesn't happen, and the amount of citing expected depends on the context. But ChatGPT just doesn't cite sources at all. I think there's a case to be made that they should.
ericd · 3 years ago
People don’t remember the sources that formed their opinions, it’s just baked into the structure of their brain after reading, same for the model.
rom-antics · 3 years ago
Humans have a pretty good sense of when you need to cite sources, and when you don't. For example, long ago I learned from some website how to write a for-loop in python, and now I write them all the time without giving credit. I'm okay with ChatGPT writing a for-loop without citing its source.

I would say most knowledge about words/grammar/laws of nature can be taken for granted without a citation, but there are some important exceptions where things must be cited. I don't know how you'd reliably teach the difference to a computer though.

lolinder · 3 years ago
ChatGPT doesn't have a concept of sources. It has weights that together define a function that allow it to guess the most likely next word from the context. As a neat side effect of this contextual next-word guessing, it often can share accurate information.

If ChatGPT were to be required to share its sources, they would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
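The point about weights encoding statistics rather than sources can be sketched with a toy next-word model (an illustrative analogy, not ChatGPT's actual architecture): after counting which word follows which, the "model" can make a plausible prediction, but nothing in it records where any count came from.

```python
from collections import Counter, defaultdict

# Toy training corpus; in a real LLM this would be billions of words.
corpus = ("the cat sat on the mat . "
          "the cat sat by the door . "
          "the cat ate the fish .").split()

# "Training": tally which word follows which. These counts play the
# role of weights -- pure statistics, with no record of any source.
follows = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    follows[cur][nxt] += 1

def next_word(word):
    # Greedy decoding: emit the single most frequent continuation.
    return follows[word].most_common(1)[0][0]

print(next_word("cat"))  # predicts "sat" from frequency, not citation
```

Asking this model "where did you learn that?" is unanswerable by construction: the counts aggregate every occurrence across the corpus, which is why bolting a source-crediting mechanism onto such a model after the fact does not work.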

christkv · 3 years ago
The bing leak seemed to mention sources.
pixl97 · 3 years ago
Hello [Oxford Dictionary: 1827]

Oh, wait, I'm not going to cite sources in a non-scientific work, as this leads to madness. The following is a previous post of mine on HN:

"Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general, limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being, it is likely I could slam you with violation after violation where you took stylization ideas off of copyrighted work.

RMS covers this rather well in 'The right to read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large monied entities. It's much easier for Google to pay some artist for their data that goes into an AI model. Because the 'google ai' model is now more culturally complete than other models that cannot see this data Google entrenches a stronger monopoly in the market, hence generating more money in which to outright buy ideas to further monopolize the market."

klpr · 3 years ago
No, it isn't that simple. The scale and totality of the scraping is out of reach for a human.

If you previously interacted with people on this issue, you must know that.

It is fair for a single human to breathe, but not for a machine to use all oxygen on this planet at once, killing everyone else in the process.

rom-antics · 3 years ago
If I woke up tomorrow and breathed all the oxygen, nobody else could breathe. But if I woke up tomorrow and read all the websites on the internet, it wouldn't stop other people from reading them too.

Air is zero-sum. Knowledge is not.

ergonaught · 3 years ago
It is in fact that simple. There are dozens, hundreds, perhaps thousands of legitimate, genuine, serious, real reasons to be concerned and "want something to be done". This isn't it.

"Learning is unfair" is not an argument you want to win.

nickfromseattle · 3 years ago
>Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

The difference is in scale.

A human video game designer can consume other people's art, then sell their labor to a video game developer. The amount of value captured by the video game designer rounds down to zero as a percentage of the economic value created by 'video game art'.

OpenAI can consume the work of every video game artist, ever, create an art design product, and capture a significant percentage of the economic productivity of video game art.

2OEH8eoCRo0 · 3 years ago
At a human level it falls below the noise floor. It's a fact of life that humans will learn and build from experience.

The difference is scale. At scale it becomes a problem.

Edit: I don't know how to satisfy all parties. This shakes the foundation of copyright. Perhaps we are all finding out how valuable good information truly is and especially in aggregate. We have created proto-gods.

ericd · 3 years ago
At scale, it becomes a wonderful tool. Are the people in this thread so threatened or so invested in the current business models of the internet that you can’t see how amazing this sort of thing could be for our abilities as a species? Not just in its current iteration, but it will get better and better.

This could be an excellent brain augmentation, trying to hamper it because we want to force people to drag themselves through underlying sources so those sources can try to steal their attention with ads for revenue is asinine.

beepbooptheory · 3 years ago
Do you think just maybe there is a difference here because humans need money to survive, and maybe we should have compassion for humans who could hypothetically starve or freeze or kill themselves or whatever because they have no money? Or is it just silly to care about people like that?
ergonaught · 3 years ago
That's got nothing to do with whether or not it is "fair" for a learning system to produce content after it has learned.

That is, instead, one of the larger and vastly more important sociocultural issues that actually warrants attention, but never receives it in sufficient degree to address the problem, because, for example, we're arguing whether automated learning is "fair".

belltaco · 3 years ago
Isn't that like saying automated looms should be banned because they meant humans would lose jobs to them? Or buggy whip makers wanting to ban cars?

https://en.wikipedia.org/wiki/Luddite

Might as well ban computers since they automated and eliminated a lot of manual jobs.

The problem of humans with no money should be solved by a social safety net and things like UBI.

hn_throwaway_99 · 3 years ago
This is a response to an argument the GP didn't make. One can still have grave concerns about generative AI's potential impact on human society while accepting there is nothing fundamentally unfair about how it scrapes publicly accessible data.
BeFlatXIII · 3 years ago
Teach the suicidal ones the noble art of suicide bombing and force societal change that way.
wiseowise · 3 years ago
> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

I don’t do it on an industrial scale.

anileated · 3 years ago
It’s fair if you do this; neither fair nor legal [0] when a commercial for-profit entity, backed by a large corporation, does it at scale and capitalizes on that.

Imagine if you were a webmaster and Google unilaterally decided to stop sending users to content you have worked to research and write, and instead aggregated it and showed the answer to user’s query entirely on its own pages, without any attribution or payment to you. Unimaginable, yet that is very much the scenario unfolding now. [1]

Scraping at this kind of scale is out of your (or any given individual’s) reach. It is, however, within reach of the likes of Microsoft (on whose billions OpenAI basically exists) and Google (who, to be fair, have not abused it in such a blatant way so far).

[0] It is clearly using someone else’s works for commercial purposes, including to create derivative works. (Again, it’s different from you creating a work derivative from someone else’s work you read previously, because in this case a corporation does it at scale for profit.)

[1] And the cynic in me says the only reason we are not yet out with pitchforks is simply because OpenAI is new and shiny and has “open” in its name (never mind the shadow of Microsoft looming all over it), while Google is an entrenched behemoth that we all had some degree of dissatisfaction with in the past and thus are constantly watching out for.

CPLX · 3 years ago
Your argument contains the implicit assumption that we have the same rules for machines that we do for humans.

That is trivially disproved, as is the rest of your argument that follows from it as a premise.

xg15 · 3 years ago
If you treat ChatGPT like a human, how high is the salary you are going to pay it?
lukev · 3 years ago
I think this is a real concern, but imagine a couple other scenarios:

1. You have a widely read spouse named Joe who reads constantly. He's got a good memory, and typically if you have a question you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?

2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?

3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.

So which of these examples are a better metaphor for what a LLM does?

I don't know. The fact is, LLMs are a new thing in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.

anileated · 3 years ago
It is not breaking the ad-based model—it’s breaking open information sharing culture as we know it.

Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and you personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.

Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers, and have no reason or even opportunity to buy your book or know you exist.

What exactly are the incentives to publish information openly in that world?

(Will they even believe you if you say you're the one who did the niche research powering some specific ChatGPT answer, in a world where everyone knows you can just ask an LLM?)

pharke · 3 years ago
Why would someone only ask an LLM questions when they were in the market to buy a book? Most people I know don't buy books in order to look up the answer to a question, sure some people buy reference books and use them but that's not really what we think of when talking about authors and books. If I'm in the market for a book, I'm looking to read a book, not query something or someone for answers. I think your example should go like this:

Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers to some related question or interest 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list 6) Goto step 2 from Yesterday

anileated · 3 years ago
> 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list

My belief is that ChatGPT is actually not quite capable of that, having seen examples of how it manufactures non-existent references. Besides, if it were capable of that, why would it not show your name as part of the answer already?

The cynic in me thinks it's not capable of that primarily because it is not a priority for OpenAI, and training data strips attribution with an explicit purpose: if the public knew that ChatGPT could trace back the source, OpenAI would be on the hook for paying all the countless non-consensual content providers on whose work it makes money.

We should treat OpenAI as we treat Google and Microsoft. It has great talent and charismatic people working for it, but ultimately it’s a for-profit tech company and the name they chose ought to make us all the more suspicious (akin to Google’s “don’t be evil”).

> Why would someone only ask an LLM questions when they were in the market to buy a book?

Why would you be in a market for a book when you can learn the same and more by asking an LLM that already consumed said book? And therefore why would the author spend effort writing and publishing a book knowing it’d sell exactly one copy (to LLM operator)?

ForestCritter · 3 years ago
Exactly. As a professional artist, I am expected to have a public online portfolio and publicly available imagery of shows and exhibits. Saying that I'm forfeiting my stake in my art because I'm showing it publicly is a really great way to kill art and culture.

AI is not learning to make, draw, or use mediums in a skilled manner. AI is scraping my public images and plotlining them with the input of humans to label them, tag them, and apply stylistic qualities to them. Just because there are massive amounts of data to dilute influence doesn't change that the computer is still simply doing what a human is telling it to do with imagery created by humans. If you took away the human input, labeling, and tagging, you would find that the computer has not learned anything.

I can look at 'AI' art and pick out artists from the collated imagery. Unlike 'AI', I can't spit out the imagery by photocopying/plotlining/tracing it. I have to learn the skills of each artist involved to recreate what I see. Motor skills require practice and effort. 'AI' is not learning motor skills, which are the basis of the creation of art. It is mapping and applying statistical algorithms to amalgamate data from preexisting sources, for those who want 'art' without the time or skill it takes to produce it. At this very moment, 'AI' art is being used to sell merchandise with zero credit or money going to the people who used their human motor skills to create the backbone of this art. Sadly, this only aggravates the ways copyright already restricts human art.

Imagine if we lived in a world where people valued artists with respect for their craft? I once had someone ask me how long it took me to draw a charcoal drawing. The short answer is half an hour. The long answer is that I was doing daily sketching practice and investing many hours a week in charcoal exercises.
I am currently out of practice with charcoal, and as it is a medium with no erasing or margin of error, I doubt I could recreate my drawing without 'getting my hand back in'. It is obvious to me that this 'AI' tool is being used by humans, with the industry of humans, to exploit humans for the gratification of end-user humans. I suppose humans could stop making art to feed the monster...
dmarcos · 3 years ago
That’s my main fear. Not the fairness / unfairness but that people might be less willing to share info and a lot becomes inaccessible / secret.
anileated · 3 years ago
I am also anxious about the web becoming fragmented and secretive. If one must gain access to the right circles to start learning, it hinders learning in general; for myself and many people I know, it would basically mean we wouldn't be doing what we're doing now if that had been the case when we were younger.
daevout · 3 years ago
Yes it absolutely is, but imo less so than what GitHub Copilot and various image generation companies are doing. My theory is that if AI turns out to be as disruptive as the current hype suggests, the conflict between those who feed the AI vs. those who profit from it might be the next big social rift.

Artists are already in full rebellion against this, as they should be, being nearly eclipsed by AI, except when it comes to inventing new styles and hand-crafting samples for the models to train on. These, I assume, are either scraped off the web, or signed away in unfair ToS of various online publishing platforms.

Since the damage is individually small (they took some code from me without attribution, okay) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.

themodelplumber · 3 years ago
> Artists are already in full rebellion against this,

Huh? No. Some artists are maybe?

> as they should be, being nearly eclipsed by AI

Not even close. It's like looking at the newest brand of clip art.

Non-artists don't (maybe can't) know that particular feeling, at least not with regard to being told you're angry about "what's supposed to look like art".

(Heck, artists have been told that with regard to other humans' art for centuries, for one)

Going even further, a lot of artists already know how to build on this new tech without ripping people off.

I used to teach college art classes and would have loved to integrate this topic into the curriculum. It'd be a great ongoing discussion, no matter the legal outcomes.

Sakos · 3 years ago
Absolutely, yes. It's incredibly unfair. But techbros here and elsewhere don't care about you or me or people in general and they'll think up an infinite amount of ridiculous false equivalencies before admitting the risks and real harms.
jostiniane · 3 years ago
1. get to enjoy an open network of networks

2. people share, get creative and get some sort of credit for it

3. scrape it all, feed it to a large deep neural network, and get a worse but easily accessible version of all this content

4. creative people don't see a reason to keep sharing what they have (no new public books, no new open source projects, ...)

5. get stuck in an AI world of recycled content

People blindly following OpenAI products have a very shortsighted vision. What they did is neither innovative nor extraordinary: they got the data, convinced some victims into a kickstart, and made sure the hardware supports the bigger deep neural network that can do the job. Check out the OpenAI alternative solutions; it's not hard.

dorchadas · 3 years ago
> but techbros here and elsewhere don't care about you or me or people in general and they'll think up an infinite amount of ridiculous false equivalencies before admitting the risks and real harms.

I came to this realisation arguing with someone in a mutual Discord server about these very topics (the negative impacts of AI). They just couldn't see it, and refused to believe it. I was constantly met with things like "Sure, we'll have to adjust, but it'll come" and "Things are no worse now than when TV and books were invented" (completely ignoring the many billions companies are spending to make things more addictive to our monkey minds, which don't change). Also lots of noble "everyone can use it and it'll benefit everyone"... when really, it only benefits those who can control it. No mention of biases in training data or anything else either. They were completely blinded to the idea that it might not be good and that we should seriously admit there are huge issues looming.

I also found it telling that the multiple people like that also weren't fans of in-person interaction, outside their friend group. They saw Discord interactions as just as fine as going out and having serendipitous moments in person, with other real people, and just actually living. Something else I feel technology has stolen from us with everyone always glued to their screen. It's funny how I've become something of a Luddite, proudly, and think we need less internet and more real world, cause, well, life is real world, being human is through real world interactions. And not ones mediated by your phone.

blue_cookeh · 3 years ago
100% agree - the likes of ChatGPT are straight up generating revenue based on adding value to stolen work.
pixl97 · 3 years ago
Lets turn this around the other way.

I create an omniscient copyright detection bot and point it at everything you create, 24 hours a day, 7 days a week.

You go home and sing happy birthday to your kid. The bot gives you a non-monetary warning for using a copyrighted work without permission. No big deal, but it is on your permanent record.

It had been a stressful day so you take up your evening hobby of painting. You like nature scenes and trees and 30 minutes in you receive a violation, evidently Bob Ross has already done this and his surviving estate is now asking you to destroy the picture.

The next day you go into your job at the corporate bureaucracy slinging lines of JavaScript. It's been a productive day so far and you have a few hundred new lines of code written, and then the bot's going off and HR and legal are ringing the phone within seconds. Turns out some comment you saw on Stack Overflow years ago was imprinted in your memory well enough that you committed a copyright violation. Looks like you'll be losing your job.

paulcole · 3 years ago
The first two situations you mention almost certainly aren’t copyright violations. The third is at least a solid “maybe.”
dmak · 3 years ago
All of a sudden everyone has issues with free information when Google has been doing this for years.


dmak · 3 years ago
All I have to say is, as technologists, anyone who is criticizing ChatGPT and has not been criticizing Google is a hypocrite. It's well known that Google tries to keep you on Google by parsing more and more information from websites and summarizing it: e.g. Wikipedia summaries, IMDB scores, review stars, etc.

If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is as it is today.

flangola7 · 3 years ago
Google makes money when you click links and visit webpages. Instant info features are useful but do not directly bring Google money.
dmak · 3 years ago
That's my point?

If the product is scraping the data and presenting it on their website like ChatGPT and Google, then that's effectively the same as taking away the ad revenue from those websites because they aren't getting the impressions.

detaro · 3 years ago
Google makes money when you look at and click ads. Visiting websites has a risk that it takes you to a site that does not show you Google ads.
williamcotton · 3 years ago
Ah, our daily dose of a bunch of people with basically no understanding of copyright law, or even the basic concepts of tort or common law jurisprudence, making all sorts of silly anthropomorphic arguments about "how computers think".

Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions you will gain nothing.

venv · 3 years ago
Absolutely this, well put. Enough people misunderstand AI models to the point of treating them as if they were not software. I guess this validates Clarke's third law: any sufficiently advanced technology is indistinguishable from magic.