> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.
I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.
Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and the work the authors put in, as people would no longer even attempt to go to them.
Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?
People get hostile at me online for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle of attack than "fairness".
Because it's false equivalence? ChatGPT isn't a human being. It's a product that is built upon data from other sources.
The question is if this data is legal to scrape, which it is: Web scraping is legal, US appeals court reaffirms [https://news.ycombinator.com/item?id=31075396].
As long as the content is not copyrighted and the model isn't regurgitating the exact same content, it should be okay.
So no, it's not a false equivalence.
When a person "scrapes" a website by clicking through a link, it registers as a hit on the website and, unless ad-blocking filters are turned on, triggers the various ad impressions and other cookies. And if the person needs that information again, odds are they'll click a bookmark or a search link and repeat the impression process all over again.
When an AI scrapes the web it does so once, and possibly in a manner designed to not trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.
...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?
> Because it's false equivalence? ChatGPT isn't a human being.
Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?
It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.
To be fair, so are you.
There is really no reason to believe that what ChatGPT or Stable Diffusion does is anything like what "your brain" does, except in the most superficial, inconsequential way.
Second, try applying this logic to literally anything else and you'll see why it's absurd:
"You can't ban cars from driving on sidewalks! If it's acceptable for people to walk on sidewalks, then it has to be acceptable for cars to drive on sidewalks, since it's just automated walking"
"You can't ban airplanes from landing in ponds. They fly 'just like' ducks fly! So if it's acceptable for them, it must be acceptable for airplanes too"
Why would it be incoherent to say “I’m okay with a person reading, synthesizing, and then utilizing this synthesis—but I’m not okay with a company profiting off of a computer doing the same thing.” What’s wrong with that?
But again, like you and others have said, it’s really not the same thing at all! All ChatGPT (or any other deep learning model) is capable of doing is synthesizing “in the most superficial way.” What a person does is completely different, much more interesting.
I also agree it's not the only argument and ultimate proof.
I don't, at this point, have an answer. I'm sure this miraculous new technology will survive the luddite attacks, but there will probably be some tense moments, and some jurisdictions will choose to be left behind.
I would say most knowledge about words/grammar/laws of nature can be taken for granted without a citation, but there are some important exceptions where things must be cited. I don't know how you'd reliably teach the difference to a computer though.
If ChatGPT were to be required to share its sources, they would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
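To make concrete why crediting sources would require a different approach: attribution is essentially a retrieval problem, needing a separate index that maps content back to where it came from, rather than anything recoverable from a transformer's weights. A toy sketch of such a retrieval layer (the URLs and documents here are entirely made up, and a real system would use learned embeddings instead of bags of words):

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for an index of source documents.
SOURCES = {
    "https://example.com/photosynthesis": "plants convert sunlight into chemical energy",
    "https://example.com/gravity": "masses attract each other with a force",
}

def bow(text):
    """Bag-of-words vector: a deliberately naive stand-in for an embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cite(query):
    """Return the best-matching source URL for a query."""
    q = bow(query)
    return max(SOURCES, key=lambda url: cosine(q, bow(SOURCES[url])))

print(cite("how do plants use sunlight"))  # → https://example.com/photosynthesis
```

The point is that the citation comes from a lookup structure maintained alongside the model, which is exactly the component current chat-style LLMs were not built around.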
Oh, wait, I'm not going to cite sources in a non-scientific work as this leads to madness. The following is a previous post of mine on HN
"Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being it is likely I could slam you with violation after violation where you took stylization ideas off of copyrighted work.
RMS covers this rather well in 'The Right to Read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large monied entities. It's much easier for Google to pay some artist for their data that goes into an AI model. Because the 'Google AI' model is now more culturally complete than other models that cannot see this data, Google entrenches a stronger monopoly in the market, hence generating more money with which to outright buy ideas and further monopolize the market."
If you previously interacted with people on this issue, you must know that.
It is fair for a single human to breathe, but not for a machine to use all oxygen on this planet at once, killing everyone else in the process.
Air is zero-sum. Knowledge is not.
"Learning is unfair" is not an argument you want to win.
The difference is in scale.
A human video game designer can consume other people's art, then sell their labor to a video game developer. The value captured by that one designer rounds down to zero as a percentage of the total economic value created by video game art.
OpenAI can consume the work of every video game artist, ever, create an art-design product, and capture a significant percentage of the economic productivity of video game art.
The difference is scale. At scale it becomes a problem.
Edit: I don't know how to satisfy all parties. This shakes the foundation of copyright. Perhaps we are all finding out how valuable good information truly is and especially in aggregate. We have created proto-gods.
This could be an excellent brain augmentation, trying to hamper it because we want to force people to drag themselves through underlying sources so those sources can try to steal their attention with ads for revenue is asinine.
That is, instead, one of the larger and vastly more important sociocultural issues that actually warrants attention, but never receives it in sufficient degree to address the problem, because, for example, we're arguing whether automated learning is "fair".
https://en.wikipedia.org/wiki/Luddite
Might as well ban computers since they automated and eliminated a lot of manual jobs.
The problem of humans with no money should be solved by a societal safety net and things like UBI.
I don’t do it on an industrial scale.
Imagine if you were a webmaster and Google unilaterally decided to stop sending users to content you have worked to research and write, and instead aggregated it and showed the answer to user’s query entirely on its own pages, without any attribution or payment to you. Unimaginable, yet that is very much the scenario unfolding now. [1]
Scraping at this kind of scale is out of your (or any given individual’s) reach. It is, however, within reach of the likes of Microsoft (on whose billions OpenAI basically exists) and Google (who, to be fair, have not abused it in such a blatant way so far).
[0] It is clearly using someone else’s works for commercial purposes, including to create derivative works. (Again, it’s different from you creating a work derivative from someone else’s work you read previously, because in this case a corporation does it at scale for profit.)
[1] And the cynic in me says the only reason we are not yet out with pitchforks is simply because OpenAI is new and shiny and has “open” in its name (never mind the shadow of Microsoft looming all over it), while Google is an entrenched behemoth that we all had some degree of dissatisfaction with in the past and thus are constantly watching out for.
That is trivially disproved, as is the rest of your argument that follows from it as a premise.
1. You have a well-read spouse named Joe with a good memory; typically, if you have a question, you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?
2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?
3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.
So which of these examples is the best metaphor for what an LLM does?
I don't know. The fact is, LLMs are a new thing in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.
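For what it's worth, the stripping part of example 3 is technically trivial, which is part of why the question is cultural rather than technical. A crude, purely illustrative sketch (the class-name heuristic is made up; real ad blockers use curated filter lists):

```python
from html.parser import HTMLParser

class AdStripper(HTMLParser):
    """Re-emits HTML while dropping <script> tags and any element whose
    class list contains 'ad'. Crude and illustrative, not production code."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a stripped element

    def _is_ad(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag == "script" or "ad" in classes

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or self._is_ad(tag, attrs):
            self.skip_depth += 1  # swallow this element and its children
            return
        attr_str = "".join(f' {k}="{v}"' for k, v in attrs)
        self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def strip_ads(html):
    parser = AdStripper()
    parser.feed(html)
    return "".join(parser.out)

page = '<div class="ad"><script>track()</script>BUY NOW</div><p>Real content</p>'
print(strip_ads(page))  # → <p>Real content</p>
```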
Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and you personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.
Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers, and have no reason or even opportunity to buy your book or know you exist.
What exactly are the incentives to publish information openly in that world?
(Will they even believe you if you say you’re the one who did the niche research powering some specific ChatGPT answer, in a world everyone knows that you can just ask an LLM?)
Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers to some related question or interest 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list 6) Goto step 2 from Yesterday
My belief is that ChatGPT is actually not quite capable of that, after seeing examples of how it manufactures non-existent references. Besides, if it were capable of that, why would it not show your name as part of the answer already now?
The cynic in me thinks it's not capable of that primarily because it is not a priority for OpenAI, and because attribution is stripped from the training data with an explicit purpose: if the public knew that ChatGPT could trace back the source, OpenAI would be on the hook for paying the countless non-consenting content providers on whose work it makes money.
We should treat OpenAI as we treat Google and Microsoft. It has great talent and charismatic people working for it, but ultimately it’s a for-profit tech company and the name they chose ought to make us all the more suspicious (akin to Google’s “don’t be evil”).
> Why would someone only ask an LLM questions when they were in the market to buy a book?
Why would you be in a market for a book when you can learn the same and more by asking an LLM that already consumed said book? And therefore why would the author spend effort writing and publishing a book knowing it’d sell exactly one copy (to LLM operator)?
Artists are already in full rebellion against this, as they should be, being nearly eclipsed by AI, except when it comes to inventing new styles and hand-crafting samples for the models to train on. These, I assume, are either scraped off the web, or signed away in unfair ToS of various online publishing platforms.
Since the damage is individually small (they took some code from me without attribution, OK) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.
Huh? No. Some artists are maybe?
> as they should be, being nearly eclipsed by AI
Not even close. It's like looking at the newest brand of clip art.
Non-artists don't (and maybe can't) know that particular feeling, at least not with regard to being told you're angry about "what's supposed to look like art".
(Heck, artists have been told that with regard to other humans' art for centuries, for one)
Going even further, a lot of artists already know how to build on this new tech without ripping people off.
I used to teach college art classes and would have loved to integrate this topic into the curriculum. It'd be a great ongoing discussion, no matter the legal outcomes.
2. people share, get creative and get some sort of credit for it
3. scrape it all, feed it into a large deep neural network, and end up with a worse version of all this content, but easily accessible
4. creative people don't see a reason to keep sharing what they have (no new public books, no new open source projects, ...)
5. get stuck in an AI world of recycled content
People blindly following OpenAI products have a very shortsighted vision. What they did is neither innovative nor extraordinary: they got the data, convinced some victims to kickstart it, and made sure the hardware supports the bigger deep neural network that can do the job. Check out the alternative solutions to OpenAI; it's not hard.
I came to this realisation arguing with someone in a mutual Discord server about these very topics (the negative impacts of AI). They just couldn't see it, and refused to believe it. I was constantly met with things like "Sure, we'll have to adjust but it'll come" and "Things are no worse now than when TV and books were invented" (completely ignoring the many billions companies are spending to make things more addictive to our monkey minds, which don't change). Also lots of noble "everyone can use it and it'll benefit everyone"... when really, it only benefits those who can control it. No mention of biases in training data or anything else either. They were completely blinded to the idea that it might not be good, and that we should seriously admit there are huge issues looming.
I also found it telling that the multiple people like that also weren't fans of in-person interaction outside their friend group. They saw Discord interactions as just as fine as going out and having serendipitous moments in person, with other real people, and just actually living. That's something else I feel technology has stolen from us, with everyone always glued to their screens. It's funny how I've proudly become something of a Luddite and think we need less internet and more real world, because, well, life is the real world; being human happens through real-world interactions, not ones mediated by your phone.
I create an omniscient copyright detection bot and point it at everything you create, 24 hours a day, 7 days a week.
You go home and sing happy birthday to your kid. The bot gives you a non-monetary warning for using a copyrighted work without permission. No big deal, but it is on your permanent record.
It had been a stressful day, so you take up your evening hobby of painting. You like nature scenes and trees, and 30 minutes in you receive a violation: evidently Bob Ross has already done this, and his surviving estate is now asking you to destroy the picture.
The next day you go into your job at the corporate bureaucracy slinging lines of JavaScript. It's been a productive day so far and you have a few hundred new lines of code written, and then the bot's going off and HR and legal are ringing the phone within seconds. Turns out some comment you saw on Stack Overflow years ago was imprinted in your memory well enough that you committed a copyright violation. Looks like you'll be losing your job.
If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is as it is today.
If the product scrapes the data and presents it on its own website, like ChatGPT and Google do, then that's effectively taking the ad revenue away from those websites, because they aren't getting the impressions.
Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions you will gain nothing.