Readit News
Posted by u/wxce 3 years ago
Ask HN: Isn't ChatGPT unfair to the sources it scraped data from?
ChatGPT scraped data from various sources on the internet.

> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.

I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.

Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and the work the authors put in, as people would no longer even attempt to visit them.

Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?

ergonaught · 3 years ago
Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

People get internet hostile at me for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle to attack it than "fairness".

_qzu4 · 3 years ago
> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

Because it's false equivalence? ChatGPT isn't a human being. It's a product that is built upon data from other sources.

The question is if this data is legal to scrape, which it is: Web scraping is legal, US appeals court reaffirms [https://news.ycombinator.com/item?id=31075396].

As long as the content is not copyrighted and it's not regurgitating the exact same content, then it should be okay.

Retr0id · 3 years ago
Being allowed to scrape something does not absolve you of all intellectual property, copyright, moral, etc. issues arising from subsequent use of the scraped data.
faktory · 3 years ago
ChatGPT isn't doing the scraping, humans are. And humans use computers both to read articles and create content, and to scrape them.

So no, it's not a false equivalence.

anonymouskimmer · 3 years ago
Check me on this because I'm not a software person:

When a person "scrapes" a website by clicking through the link it registers as a hit on the website and, without filters being turned on, triggers the various ad impressions and other cookies. Also if the person needs that information again odds are they'll click on a bookmark or a search link and repeat the impression process all over again.

When an AI scrapes the web it does so once, and possibly in a manner designed to not trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.
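To make the distinction concrete, here is a minimal sketch (hypothetical, stdlib only) of why a scrape differs from a browser visit: a scraper consumes the raw HTML once and never executes the ad or analytics scripts, so no impression is ever registered.

```python
from html.parser import HTMLParser

class TextOnlyScraper(HTMLParser):
    """Collects visible text while skipping <script>/<style> content,
    so ad and tracker code is never executed -- or even kept."""
    def __init__(self):
        super().__init__()
        self.skip = 0      # depth inside <script>/<style>
        self.chunks = []   # extracted visible text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

# A toy page: one paragraph of content plus the ad/tracking machinery
# a real browser visit would run.
page = """<html><body>
<p>Article text worth training on.</p>
<script src="https://ads.example.com/tracker.js"></script>
<script>trackImpression();</script>
</body></html>"""

scraper = TextOnlyScraper()
scraper.feed(page)
print(scraper.chunks)  # only the article text survives; no tracker ran
```

A human revisiting the page repeats the whole impression cycle each time; a scraper typically runs this extraction once and works from the cached text thereafter.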

jMyles · 3 years ago
> The question is if this data is legal to scrape

...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?

> Because it's false equivalence? ChatGPT isn't a human being.

Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?

It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.

bilsbie · 3 years ago
> It's a product that is built upon data from other sources.

To be fair, so are you.

nCave · 3 years ago
Does anyone actually find these arguments persuasive?

There is really no reason to believe that what chatGPT or stable diffusion does is anything like what "your brain" does--except in the most superficial, inconsequential way.

Second, try applying this logic to literally anything else and you'll see why it's absurd:

"You can't ban cars from driving on sidewalks! If it's acceptable for people to walk on sidewalks, then it has to be acceptable for cars to drive on sidewalks, since it's just automated walking"

"You can't ban airplanes from landing in ponds. They fly 'just like' ducks fly! So if it's acceptable for them, it must be acceptable for airplanes too"

dawsoneliasen · 3 years ago
Yes, and: why shouldn’t it matter that in one case it is a person and in another it is a computer program?

Why would it be incoherent to say “I’m okay with a person reading, synthesizing, and then utilizing this synthesis—but I’m not okay with a company profiting off of a computer doing the same thing.” What’s wrong with that?

But again, like you and others have said, it’s really not the same thing at all! All ChatGPT (or any other deep learning model) is capable of doing is synthesizing “in the most superficial way.” What a person does is completely different, much more interesting.

BurningFrog · 3 years ago
I find the argument pretty persuasive.

I also agree it's not the only argument and ultimate proof.

I don't, at this point, have an answer. I'm sure this miraculous new technology will survive the luddite attacks, but there will probably be some tense moments, and some jurisdictions will choose to be left behind.

hannob · 3 years ago
You usually expect people to cite sources. Granted, that very often doesn't happen, and the amount of citing expected depends on the context. But ChatGPT just doesn't cite sources at all. I think there's a case to be made that they should.
ericd · 3 years ago
People don’t remember the sources that formed their opinions, it’s just baked into the structure of their brain after reading, same for the model.
rom-antics · 3 years ago
Humans have a pretty good sense of when you need to cite sources, and when you don't. For example, long ago I learned from some website how to write a for-loop in python, and now I write them all the time without giving credit. I'm okay with ChatGPT writing a for-loop without citing its source.

I would say most knowledge about words/grammar/laws of nature can be taken for granted without a citation, but there are some important exceptions where things must be cited. I don't know how you'd reliably teach the difference to a computer though.

lolinder · 3 years ago
ChatGPT doesn't have a concept of sources. It has weights that together define a function that allow it to guess the most likely next word from the context. As a neat side effect of this contextual next-word guessing, it often can share accurate information.

If ChatGPT were to be required to share its sources, they would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
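The point about weights encoding statistics rather than sources can be sketched with a toy next-word model (an illustrative analogy, not ChatGPT's actual architecture): after counting which word follows which, the "model" can make a plausible prediction, but nothing in it records where any count came from.

```python
from collections import Counter, defaultdict

# Toy training corpus; in a real LLM this would be billions of words.
corpus = ("the cat sat on the mat . "
          "the cat sat by the door . "
          "the cat ate the fish .").split()

# "Training": tally which word follows which. These counts play the
# role of weights -- pure statistics, with no record of any source.
follows = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    follows[cur][nxt] += 1

def next_word(word):
    # Greedy decoding: emit the single most frequent continuation.
    return follows[word].most_common(1)[0][0]

print(next_word("cat"))  # predicts "sat" from frequency, not citation
```

Asking this model "where did you learn that?" is unanswerable by construction: the counts aggregate every occurrence across the corpus, which is why bolting a source-crediting mechanism onto such a model after the fact does not work.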

christkv · 3 years ago
The bing leak seemed to mention sources.
pixl97 · 3 years ago
Hello [Oxford Dictionary: 1827]

Oh, wait, I'm not going to cite sources in a non-scientific work, as this leads to madness. The following is a previous post of mine on HN:

"Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general, limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being, it is likely I could slam you with violation after violation where you took stylization ideas off of copyrighted work.

RMS covers this rather well in 'The right to read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large monied entities. It's much easier for Google to pay some artist for their data that goes into an AI model. Because the 'google ai' model is now more culturally complete than other models that cannot see this data Google entrenches a stronger monopoly in the market, hence generating more money in which to outright buy ideas to further monopolize the market."

klpr · 3 years ago
No, it isn't that simple. The scale and totality of the scraping is out of reach for a human.

If you previously interacted with people on this issue, you must know that.

It is fair for a single human to breathe, but not for a machine to use all oxygen on this planet at once, killing everyone else in the process.

rom-antics · 3 years ago
If I woke up tomorrow and breathed all the oxygen, nobody else could breathe. But if I woke up tomorrow and read all the websites on the internet, it wouldn't stop other people from reading them too.

Air is zero-sum. Knowledge is not.

ergonaught · 3 years ago
It is in fact that simple. There are dozens, hundreds, perhaps thousands of legitimate, genuine, serious, real reasons to be concerned and "want something to be done". This isn't it.

"Learning is unfair" is not an argument you want to win.

nickfromseattle · 3 years ago
>Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

The difference is in scale.

A human video game designer can consume other people's art, then sell their labor to a video game developer. The amount of value captured by the video game designer rounds down to zero as a percentage of the economic value created by 'video game art'.

OpenAI can consume the work of every video game artist, ever, create an art design product, and capture a significant percentage of the economic productivity of video game art.

2OEH8eoCRo0 · 3 years ago
At a human level it falls below the noise floor. It's a fact of life that humans will learn and build from experience.

The difference is scale. At scale it becomes a problem.

Edit: I don't know how to satisfy all parties. This shakes the foundation of copyright. Perhaps we are all finding out how valuable good information truly is and especially in aggregate. We have created proto-gods.

ericd · 3 years ago
At scale, it becomes a wonderful tool. Are the people in this thread so threatened or so invested in the current business models of the internet that you can’t see how amazing this sort of thing could be for our abilities as a species? Not just in its current iteration, but it will get better and better.

This could be an excellent brain augmentation, trying to hamper it because we want to force people to drag themselves through underlying sources so those sources can try to steal their attention with ads for revenue is asinine.

beepbooptheory · 3 years ago
Do you think just maybe there is a difference here because humans need money to survive, and maybe we should have compassion for humans who could hypothetically starve or freeze or kill themselves or whatever because they have no money? Or is it just silly to care about people like that?
ergonaught · 3 years ago
That's got nothing to do with whether or not it is "fair" for a learning system to produce content after it has learned.

That is, instead, one of the larger and vastly more important sociocultural issues that actually warrants attention, but never receives it in sufficient degree to address the problem, because, for example, we're arguing whether automated learning is "fair".

belltaco · 3 years ago
Isn't that like saying automated looms should be banned because they meant humans would lose jobs to them? Or buggy whip makers wanting to ban cars?

https://en.wikipedia.org/wiki/Luddite

Might as well ban computers since they automated and eliminated a lot of manual jobs.

The problem of humans with no money should be solved by a social safety net and things like UBI.

hn_throwaway_99 · 3 years ago
This is a response to an argument the GP didn't make. One can still have grave concerns about generative AI's potential impact on human society while accepting there is nothing fundamentally unfair about how it scrapes publicly accessible data.
BeFlatXIII · 3 years ago
Teach the suicidal ones the noble art of suicide bombing and force societal change that way.
wiseowise · 3 years ago
> Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?

I don’t do it on an industrial scale.

anileated · 3 years ago
It’s fair if you do this; neither fair nor legal [0] when a commercial for-profit entity, backed by a large corporation, does it at scale and capitalizes on that.

Imagine if you were a webmaster and Google unilaterally decided to stop sending users to content you have worked to research and write, and instead aggregated it and showed the answer to user’s query entirely on its own pages, without any attribution or payment to you. Unimaginable, yet that is very much the scenario unfolding now. [1]

Scraping at this kind of scale is out of your (or any given individual’s) reach. It is, however, within reach of the likes of Microsoft (on whose billions OpenAI basically exists) and Google (who, to be fair, have not abused it in such a blatant way so far).

[0] It is clearly using someone else’s works for commercial purposes, including to create derivative works. (Again, it’s different from you creating a work derivative from someone else’s work you read previously, because in this case a corporation does it at scale for profit.)

[1] And the cynic in me says the only reason we are not yet out with pitchforks is simply because OpenAI is new and shiny and has “open” in its name (never mind the shadow of Microsoft looming all over it), while Google is an entrenched behemoth that we all had some degree of dissatisfaction with in the past and thus are constantly watching out for.

CPLX · 3 years ago
Your argument contains the implicit assumption that we have the same rules for machines that we do for humans.

That is trivially disproved, as is the rest of your argument that follows from it as a premise.

xg15 · 3 years ago
If you treat ChatGPT like a human, how high is the salary you are going to pay it?
lukev · 3 years ago
I think this is a real concern, but imagine a couple other scenarios:

1. You have a widely read spouse named Joe who reads constantly. He's got a good memory, and typically if you have a question you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?

2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?

3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.

So which of these examples are a better metaphor for what a LLM does?

I don't know. The fact is, LLMs are a new thing in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.

anileated · 3 years ago
It is not breaking the ad-based model—it’s breaking open information sharing culture as we know it.

Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and you personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.

Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers, and have no reason or even opportunity to buy your book or know you exist.

What exactly are the incentives to publish information openly in that world?

(Will they even believe you if you say you're the one who did the niche research powering some specific ChatGPT answer, in a world where everyone knows you can just ask an LLM?)

pharke · 3 years ago
Why would someone only ask an LLM questions when they were in the market to buy a book? Most people I know don't buy books in order to look up the answer to a question, sure some people buy reference books and use them but that's not really what we think of when talking about authors and books. If I'm in the market for a book, I'm looking to read a book, not query something or someone for answers. I think your example should go like this:

Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers to some related question or interest 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list 6) Goto step 2 from Yesterday

anileated · 3 years ago
> 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list

My belief is that ChatGPT is actually not quite capable of that, having seen examples of how it manufactures non-existent references. Besides, if it were capable of that, why would it not show your name as part of the answer already?

The cynic in me thinks it's not capable of that primarily because it is not a priority for OpenAI, and training data strips attribution with an explicit purpose: if the public knew that ChatGPT could trace back the source, OpenAI would be on the hook for paying all the countless non-consensual content providers on whose work it makes money.

We should treat OpenAI as we treat Google and Microsoft. It has great talent and charismatic people working for it, but ultimately it’s a for-profit tech company and the name they chose ought to make us all the more suspicious (akin to Google’s “don’t be evil”).

> Why would someone only ask an LLM questions when they were in the market to buy a book?

Why would you be in a market for a book when you can learn the same and more by asking an LLM that already consumed said book? And therefore why would the author spend effort writing and publishing a book knowing it’d sell exactly one copy (to LLM operator)?

ForestCritter · 3 years ago
Exactly. As a professional artist, I am expected to have a public online portfolio and publicly available imagery of shows and exhibits. Saying that I'm forfeiting my stake in my art because I'm showing it publicly is a really great way to kill art and culture.

AI is not learning to make, draw, or use mediums in a skilled manner. AI is scraping my public images and plotlining them with the input of humans to label them, tag them, and apply stylistic qualities to them. Just because there are massive amounts of data to dilute influence doesn't change that the computer is still simply doing what a human is telling it to do with imagery created by humans. If you took away the human input, labeling, and tagging, you would find that the computer has not learned anything.

I can look at 'AI' art and pick out artists from the collated imagery. Unlike 'AI', I can't spit out the imagery by photocopying/plotlining/tracing it. I have to learn the skills of each artist involved to recreate what I see. Motor skills require practice and effort. 'AI' is not learning motor skills, which are the basis of the creation of art. It is mapping and applying statistical algorithms to amalgamate data from preexisting sources, for those who want 'art' without the time or skill it takes to produce it. At this very moment, 'AI' art is being used to sell merchandise with zero credit or money going to the people who used their human motor skills to create the backbone of this art. Sadly, this only aggravates the ways copyright already restricts human art.

Imagine if we lived in a world where people valued artists with respect for their craft? I once had someone ask me how long it took me to draw a charcoal drawing. The short answer is half an hour. The long answer is that I was doing daily sketching practice and investing many hours a week in charcoal exercises.
I am currently out of practice with charcoal, and as it is a medium with no erasing or margin of error, I doubt I could recreate my drawing without 'getting my hand back in'. It is obvious to me that this 'AI' tool is being used by humans, with the industry of humans, to exploit humans for the gratification of end-user humans. I suppose humans could stop making art to feed the monster...
dmarcos · 3 years ago
That’s my main fear. Not the fairness / unfairness but that people might be less willing to share info and a lot becomes inaccessible / secret.
anileated · 3 years ago
I am also anxious about the web becoming fragmented and secretive. If one must gain access to the right circles to start learning, it hinders learning in general; for myself and many people I know, it would basically mean we wouldn't be doing what we're doing now if that had been the case when we were younger.
daevout · 3 years ago
Yes it absolutely is, but imo less so than what GitHub Copilot and various image generation companies are doing. My theory is that if AI turns out to be as disruptive as the current hype suggests, the conflict between those who feed the AI vs. those who profit from it might be the next big social rift.

Artists are already in full rebellion against this, as they should be, being nearly eclipsed by AI, except when it comes to inventing new styles and hand-crafting samples for the models to train on. These, I assume, are either scraped off the web, or signed away in unfair ToS of various online publishing platforms.

Since the damage is individually small (they took some code from me without attribution, okay) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.

themodelplumber · 3 years ago
> Artists are already in full rebellion against this,

Huh? No. Some artists are maybe?

> as they should be, being nearly eclipsed by AI

Not even close. It's like looking at the newest brand of clip art.

Non-artists don't (maybe can't) know that particular feeling, at least not with regard to being told you're angry about "what's supposed to look like art".

(Heck, artists have been told that with regard to other humans' art for centuries, for one)

Going even further, a lot of artists already know how to build on this new tech without ripping people off.

I used to teach college art classes and would have loved to integrate this topic into the curriculum. It'd be a great ongoing discussion, no matter the legal outcomes.

Sakos · 3 years ago
Absolutely, yes. It's incredibly unfair. But techbros here and elsewhere don't care about you or me or people in general and they'll think up an infinite amount of ridiculous false equivalencies before admitting the risks and real harms.
jostiniane · 3 years ago
1. get to enjoy an open network of networks

2. people share, get creative and get some sort of credit for it

3. scrape it all, feed it to a large deep neural network, and get a worse but easily accessible version of all this content

4. creative people don't see a reason to keep sharing what they have (no new public books, no new open source projects, ...)

5. get stuck in an AI world of recycled content

People blindly following OpenAI products have a very shortsighted vision. What they did is neither innovative nor extraordinary: they got the data, convinced some victims into a kickstart, and made sure the hardware supports the bigger deep neural network that can do the job. Check out the OpenAI alternative solutions; it's not hard.

dorchadas · 3 years ago
> but techbros here and elsewhere don't care about you or me or people in general and they'll think up an infinite amount of ridiculous false equivalencies before admitting the risks and real harms.

I came to this realisation arguing with someone in a mutual Discord server about these very topics (the negative impacts of AI). They just couldn't see it, and refused to believe it. I was constantly met with things like "Sure, we'll have to adjust, but it'll come" and "Things are no worse now than when TV and books were invented" (completely ignoring the many billions companies are spending to make things more addictive to our monkey minds, which don't change). Also lots of noble "everyone can use it and it'll benefit everyone"... when really, it only benefits those who can control it. No mention of biases in training data or anything else either. They were completely blinded to the idea that it might not be good and that we should seriously admit there are huge issues looming.

I also found it telling that the multiple people like that also weren't fans of in-person interaction, outside their friend group. They saw Discord interactions as just as fine as going out and having serendipitous moments in person, with other real people, and just actually living. Something else I feel technology has stolen from us with everyone always glued to their screen. It's funny how I've become something of a Luddite, proudly, and think we need less internet and more real world, cause, well, life is real world, being human is through real world interactions. And not ones mediated by your phone.

blue_cookeh · 3 years ago
100% agree - the likes of ChatGPT are straight up generating revenue based on adding value to stolen work.
pixl97 · 3 years ago
Lets turn this around the other way.

I create an omniscient copyright detection bot and point it at everything you create, 24 hours a day, 7 days a week.

You go home and sing happy birthday to your kid. The bot gives you a non-monetary warning for using a copyrighted work without permission. No big deal, but it is on your permanent record.

It had been a stressful day so you take up your evening hobby of painting. You like nature scenes and trees and 30 minutes in you receive a violation, evidently Bob Ross has already done this and his surviving estate is now asking you to destroy the picture.

The next day you go into your job at the corporate bureaucracy slinging lines of JavaScript. It's been a productive day so far and you have a few hundred new lines of code written, and then the bot's going off and HR and legal are ringing the phone within seconds. Turns out some comment you saw on Stack Overflow years ago was imprinted in your memory well enough that you committed a copyright violation. Looks like you'll be losing your job.

paulcole · 3 years ago
The first two situations you mention almost certainly aren’t copyright violations. The third is at least a solid “maybe.”
dmak · 3 years ago
All of a sudden everyone has issues with free information when Google has been doing this for years.


dmak · 3 years ago
All I have to say is, as technologists, anyone who is criticizing ChatGPT and has not been criticizing Google is a hypocrite. It's well known that Google tries to keep you on Google by parsing more and more information from websites and summarizing it: e.g. Wikipedia summaries, IMDB scores, review stars, etc.

If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is as it is today.

flangola7 · 3 years ago
Google makes money when you click links and visit webpages. Instant info features are useful but do not directly bring Google money.
dmak · 3 years ago
That's my point?

If the product is scraping the data and presenting it on their website like ChatGPT and Google, then that's effectively the same as taking away the ad revenue from those websites because they aren't getting the impressions.

detaro · 3 years ago
Google makes money when you look at and click ads. Visiting websites has a risk that it takes you to a site that does not show you Google ads.
williamcotton · 3 years ago
Ah, our daily dose of a bunch of people with basically no understanding of copyright law, or even the basic concepts of tort or common law jurisprudence, making all sorts of silly anthropomorphic arguments about "how computers think".

Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions you will gain nothing.

venv · 3 years ago
Absolutely this, well put. Enough people misunderstand AI models to the point of treating them as if they were not software. I guess this validates Clarke's third law: any sufficiently advanced technology is indistinguishable from magic.