kbos87 · 2 years ago
Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines.)

I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart. I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.

belter · 2 years ago
For all the leaks about secret projects, novel training algorithms withheld from publication to preserve market share, custom hardware, Q* learning, and internal politics at the companies at the forefront of state-of-the-art LLMs... the thunderous silence is the lack of leaks about the exact datasets used to train the main commercial LLMs.

It is clear that OpenAI and Google did not use only Common Crawl. With so many press conferences, why has no research journalist yet asked OpenAI or Google to confirm or deny whether they use or used LibGen?

Did OpenAI really buy an ebook of every publication from Cambridge Press, Oxford Press, Manning, Apress, and so on? Did any investor's due diligence include researching the legality of the content used for training?

hhsectech · 2 years ago
I'm not for or against anything at this point until someone gets their balls out and clearly defines what copyright infringement means in this context.

If I give a kid a bunch of books all by the same author, then pay that kid to write a book in a similar style, and then go on to sell that book... have I somehow infringed copyright?

The kid's book, at best, is likely to be a very convincing facsimile of the original author's work... but not the author's work.

It seems to me that the only solution for artists is to charge for access to their work in a secure environment then lobotomise people on the way out.

The endgame seems to be "you can view and enjoy our work, but if you want to learn or be inspired by it, that's not on"

ethbr1 · 2 years ago
Would be fascinated to hear from someone inside on a throwaway, but my nearest experience is that corporate lawyers aren't stupid.

If there's legally-murky secret data sauce, it's firewalled from being easily seen in its entirety by anyone not golden-handcuffed to the company.

They may be able to train against it. They may be able to peek at portions of it. But no one is downloading-all.

devindotcom · 2 years ago
for what it's worth, i asked altman directly and he denied using libgen or books2, but also deferred to murati and her team on specifics. but the Q&A wasn't recorded and they haven't answered my follow-ups.
cogman10 · 2 years ago
We all remember when Aaron Swartz got hit with federal wiretapping and intent-to-distribute charges for downloading JSTOR content, right?

It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior are seeing NO federal investigations. Yet when a private citizen does it, there are threats of life in prison.

This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice over investigating a large corporation engaged in the same behavior.

alfiedotwtf · 2 years ago
Why isn't robots.txt enough to enforce copyright etc.? If NYT didn't set robots.txt properly, is their content a free-for-all? Yes, I know the first answer you would jump to is "of course not, copyright is the default", but it's almost 2024 and we have had robots.txt as the industry's de facto standard for stopping crawling.
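
For reference, this is roughly what the opt-out looks like in practice (a minimal sketch; GPTBot and CCBot are real crawler user-agents, but this is not NYT's actual file):

```
# Hypothetical robots.txt blocking known AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl as usual
User-agent: *
Allow: /
```

Compliant crawlers honor these rules voluntarily.
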
lumost · 2 years ago
ChatGPT's birth as a research preview may have been an attempt to avoid these issues. A free product that few people use would have been unlikely to trigger legal anger. When usage exploded, the natural inclination would be to hope for the best.

Google may simply have been obliged to follow suit.

Personally, I’m looking forward to pirate LLMs trained on academic content.

015a · 2 years ago
> a more established competitor

Apple is already doing this: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

Apple caught a lot of shit over the past 18 months for their lack of AI strategy; but I think two years from now they're going to look like geniuses.

UrineSqueegee · 2 years ago
Didn't they just get caught for patent infringement? I'm sure they've done their fair share of shady stuff with the AI datasets too; they're just going to do a stellar job of concealing it.
tracyhenry · 2 years ago
> the first being at the birth of modern search engines.

Why do you say that? Search engines would at least direct the viewer to the source. NYT gets 35%+ of its traffic from Google: https://www.similarweb.com/website/nytimes.com/#traffic-sour...

belter · 2 years ago
Just because they asked for forgiveness instead of asking first for permission, its original sins will not be erased :-)

"Google Agrees to Pay Canadian Media for Using Their Content" - https://www.nytimes.com/2023/11/29/world/americas/google-can...

kbos87 · 2 years ago
That doesn’t mean that it wasn’t theft of their content. The internet would be a very different place if creator compensation and low friction micropayments were some of the first principles. Instead we’re left with ads as the only viable monetization model and clickbait/misinformation as a side effect.
whywhywhywhy · 2 years ago
The world you're hoping for will put all AI tech solely in the hands of the established top 10 media entities, who traditionally have never compensated fairly anyway.

Sorry but if that’s the alternative to some writers feeling slighted, I’ll choose for the writers to be sad and the tech to be free.

kbos87 · 2 years ago
“Feeling slighted” is a gross understatement of how a lack of compensation flowing to creators has shaped the internet and the wider world over the past 25 years. If we have a problem with the way top media companies compensate their creators, that is a separate issue - not a justification for layering another issue on top.
hackernewds · 2 years ago
So you're advocating giving OpenAI and incumbents a massive advantage by now delegitimizing the process? It's kinda like why Netflix was all for "fast lanes".
logicchains · 2 years ago
> I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming.

Eventually these LLMs are going to be put in mechanical bodies with the ability to interact with the world and learn (update their weights) in realtime. Consider how absurd your perspective would be then, when it'd be illegal for this embodied LLM to read any copyrighted text, be it a book or a web page, without special permission from the copyright holder, while humans face no such restriction.

belter · 2 years ago
A human faces the same restriction if they provide commercial services on the internet creating code that is a copy of copyrighted code.
happymellon · 2 years ago
> while humans face no such restriction.

I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.

https://copyrightalliance.org/copyright-cases-2022/

Reading and consuming other people's content isn't illegal, and it wouldn't be for a computer either.

Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can get you sued, whether it's an LLM or a sweatshop in India.

bnralt · 2 years ago
> Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines.)

Hacker News has consistently upvoted posts that let users circumvent paywalls. And even when it doesn't, conversations here (and on Twitter, Reddit, etc.) that summarize the articles and quote the relevant bits as soon as they are published are much more of a threat to The New York Times than ChatGPT training on articles from months or years ago.

gosub100 · 2 years ago
I don't think it's about scraping being a threat. It's that they violated the TOS and stand to make a ton of money from someone else's work.

I find irony in the newspaper suing AI when other news sources (admittedly not NYT) use AI to write the articles. How many other AI scrapers are just ingesting AI generated content?

ChatGTP · 2 years ago
Same. To all those arguing in favour of OpenAI, I have a question: do you steal books, movies, or games?

Do you illegally share them via torrents, or even sell copies of these works?

Because that is what's going on here.

michaelcampbell · 2 years ago
> they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart

Maybe, but I find the "it's OK to break the law because otherwise I can't do what I want" narrative a little off-putting.

hackerlight · 2 years ago
Doesn't this harm open source ML by adding yet another costly barrier to training models?
onlyrealcuzzo · 2 years ago
It doesn't matter what's good for open source ML.

It matters what is legal and what makes sense.

pxoe · 2 years ago
open source won't care. they'll just use data anyway.

closed/proprietary services that also monetize - there's a question whether it's "fair" to take and use data for free, and then basically resell access to it. the monetization aspect is the bigger rub than just data use.

(maybe it's worth noting again that "openai" is not really "open" and not the same as open source ai/ml.)

taking data that's free to take and then freely distributing the resulting work, that's really just fine. taking something for free without distinction (maybe it's free, maybe it's supposed to stay free, maybe it's not supposed to be used like that, maybe it's copyrighted), and then just ignoring licenses, relicensing, and monetizing without care, that's just a minefield.

layer8 · 2 years ago
You can train your own model no problem, but you arguably can’t publish it. So yes, the model can’t be open-sourced, but the training procedure can.
bbor · 2 years ago
I think not, because stealing large amounts of unlicensed content and hoping momentum/bluster/secrecy protects you is a privilege afforded only to corporations.

OSS seems to be developing its own, transparent, datasets.

theGnuMe · 2 years ago
It’s likely fair use.
JCM9 · 2 years ago
Playing back large passages of verbatim content sold as your "product" without citation is almost certainly not fair use. Fair use would be saying "The New York Times said X" and then quoting a sentence with attribution. That's not what OpenAI is being sued for. They're being sued for passing off substantial bits of NYTimes content as their own IP and then charging for it.

This is also related to earlier studies of OpenAI, where their models showed a bad habit of regurgitating training data verbatim. If your training data is protected IP you didn't secure the rights to, that's a real big problem. Hence this lawsuit. If successful, the floodgates will open.

spunker540 · 2 years ago
> It's likely fair use.

I agree. You can even listen to the NYT Hard Fork podcast (that I recommend btw https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...) where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.

They asked her about the issue of copyrighted training data. Her response was:

""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.

So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """

Now for my take: Proving that OpenAI trained on NYT articles is not sufficient IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think you can easily prove. It takes a lot of prompt engineering and luck to pull out any verbatim articles. It's well-established that LLMs screw up even well-known facts. It's quite hard to accurately pull out the training data verbatim.

hn_throwaway_99 · 2 years ago
It's likely not. Search for "the four factors of fair use". While I think OpenAI will have decent arguments for 3 of the factors, they'll get killed on the fourth factor, "the effect of the use on the potential market", which is what this lawsuit is really about.

If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.

Of course, I think this is a great test case precisely because the power of "Internet scale" and generative AI is fundamentally different than our previous notions about why we wanted a "fair use exception" in the first place.

rbultje · 2 years ago
What if a court interprets fair use as a human-only right, just like it did for copyright?
munk-a · 2 years ago
I think we need a lot of clarity here. I think it's perfectly sensible to look at gigantic corpuses of high quality literature as being something society would want to be fair use for training an LLM to better understand and produce more correct writing... but the actual information contained in NYT articles should probably be controlled primarily by NYT. If the value a business delivers (in this case the information of the articles) can be freely poached without limitation by competitors then that business can't afford to actually invest in delivering a quality product.

As a counter argument it might be reasonable to instead say that the NYT delivers "current information" so perhaps it'd be fair to train your model on articles so long as they aren't too recent... but I think a lot of the information that the NYT now relies on for actual traffic is their non-temporal stuff - including things like life advice and recipes.

DamnInteresting · 2 years ago
I have deeply mixed feelings about the way LLMs slurp up copyrighted content and regurgitate it as something "new." As a software developer who has dabbled in machine learning, it is exciting to see the field progress. But I am also an author with a large catalog of writings, and my work has been captured by at least one LLM (according to a tool that can allegedly detect these things).

Overall, current LLMs remind me of those bottom-feeder websites that do no original research--those sites that just find an article they like, lazily rewrite it, introduce a few errors, then maybe paste some baloney "sources" (which always seems to disinclude the actual original source). That mode of operation tends to be technically legal, but it's parasitic and lazy and doesn't add much value to the world.

All that aside, I tend to agree with the hypothesis that LLMs are a fad that will mostly pass. For professionals, it is really hard to get past hallucinations and the lack of citations. Imagine being a perpetual fact-checker for a very unreliable author. And laymen will probably mostly use LLMs to generate low-effort content for SEO, which will inevitably degrade the quality of the same LLMs as they breed with their own offspring. "Regression to mediocrity," as Galton put it.

logicchains · 2 years ago
>All that aside, I tend to agree with the hypothesis that LLMs are a fad that will mostly pass. For professionals, it is really hard to get past hallucinations and the lack of citations.

For writers, maybe, but absolutely not for programmers; it's incredibly useful. I don't think anyone who's used GPT4 to improve their coding productivity would consider it a fad.

thiht · 2 years ago
Copilot has been way more useful to me than GPT4. When I describe a complex problem where I want multiple solutions to compare, GPT4 is useless to me. The responses are almost always completely wrong or ignore half of the details I've written in the prompt. Or I have to write the prompt with a response already in mind, which kinda defeats why I would use it in the first place.

Copilot provides useful autocompletes maybe… 30% of the time? But it doesn't waste too much time, as it's more of a passive tool.

Deleted Comment

MeImCounting · 2 years ago
Ehh, LLMs have become a fundamental part of my workflow as a professional. GPT4 is absolutely capable of providing links to sources and citations. It is more reliable than most human teachers I have had and doesn't have an ego about its incorrect statements when challenged on them. It does become less useful as you get more technical or niche, but it's incredibly useful for learning in new areas or increasing the breadth of your knowledge on a subject.
cjbprime · 2 years ago
> GPT4 is absolutely capable of providing links to sources and citations.

Do you mean in the Browsing Mode or something? I don't think it is naturally capable of that, both because it is performing lossy compression, and because in many cases it simply won't know where the text that was fed to it during training came from.

neilv · 2 years ago
> LLMs have become a fundamental part of my workflow as a professional. GPT4 [...] doesn't have an ego about its incorrect statements when challenged on them.

To anthropomorphize it further, it's a plagiarizing bullshitter who apologizes quickly when any perceived error is called out (whether or not that particular bit of plagiarism or fabrication was correct), learning nothing, so its apology has no meaning, but it doesn't sound uppity about being a plagiarizing bullshitter.

Deleted Comment

seanmcdirmid · 2 years ago
> Overall, current LLMs remind me of those bottom-feeder websites that do no original research--those sites that just find an article they like, lazily rewrite it, introduce a few errors, then maybe paste some baloney "sources" (which always seems to disinclude the actual original source). That mode of operation tends to be technically legal, but it's parasitic and lazy and doesn't add much value to the world.

Another way of looking at this is that bottom-feeder websites do work that could easily be done by an LLM. I've noticed a high correlation between "could be AI" and "is definitely a trashy click bait news source" (before LLMs were even a thing).

To be clear, if your writing could be replaced by an LLM today, you probably aren't a very good writer. And... I doubt this technology will stop improving, so I wouldn't make the mistake of thinking that 2023 is the high point for LLMs and that they won't be much better in 2033 (or whatever replaces them).

stefan_ · 2 years ago
That's the joke: these sites have long been produced by LLMs. The result is obvious.
buckyfull · 2 years ago
I don’t view LLMs as a fad. It’s like drummers and drum machines. Machines and drummers co-exist really well. I think drum machines, among other things, made drummers better.
fennecbutt · 2 years ago
Neither, and NYT editors use all sorts of productivity tools, inspiration, references, etc too. Same as artists will usually find a couple references of whatever they want to draw, or the style, etc.

I agree with the key point that paid content should be licensed for use in training, but the general argument being made has just spiralled into luddism from people who fear these models could eventually take their jobs. And they will, as machines have replaced humans in so many other industries; we all reap the rewards. Industrialisation isn't to blame for the 1%; our shitty flag-waving, vote-for-your-team politics is to blame.

tremon · 2 years ago
It mainly made mediocre drummers sound better to the untrained ear.
asylteltine · 2 years ago
LLMs are not a fad for many things, especially programming. They improve my productivity by at least 100%. They're also useful for understanding specific, hard-to-Google questions or parsing docs quickly. I think it's going to fizzle out for creative content, though, at least until these companies stop "aligning" it so much. Hard to be funny when you can't even offend a single molecule.
throwaway_44 · 2 years ago
We use LLMs for classification. When you have limited data, LLMs work better than standard classification models like random forests. In some cases, we found LLM-generated labels to be more accurate than human ones.

Labeling a few samples, LoRA-tuning an LLM, generating labels on millions of samples, and then training a standard classifier is an easy way to get a good classifier in a matter of hours or days, as sketched below.
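
A minimal sketch of that pipeline (hypothetical: llm_label() stands in for the hosted or LoRA-tuned model call, and the data and label set are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_label(text: str) -> str:
    """Placeholder for the LLM call that assigns one of a fixed label set."""
    return "complaint" if "refund" in text.lower() else "other"

# Step 1: use the LLM to label a large pool of unlabeled text.
unlabeled = [
    "I want a refund for my broken order.",
    "What are your opening hours?",
    "Refund me now or I call my bank.",
    "Do you ship to Canada?",
]
labels = [llm_label(t) for t in unlabeled]

# Step 2: distill the LLM-generated labels into a cheap standard classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(unlabeled, labels)

# Step 3: inference no longer needs the LLM at all.
print(clf.predict(["Can I get my money back?"]))
```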

For basically any task where you can handle some inaccuracy, LLMs can be a great tool. So I don't think LLMs are a fad as such.

dinvlad · 2 years ago
Very much so. And their popularity has already been in decline for several months, which can't be explained away by kids going on summer vacation anymore.
shrimpx · 2 years ago
Anthropic made $200M in 2023 and is projected to make $1B in 2024. That's a laggard startup less than two years old. I don't think LLMs are a fad.
__loam · 2 years ago
Finally a reasonable take on this site.
ShamelessC · 2 years ago
> (according to a tool that can allegedly detect these things).

Eh, I would trust my own testing before trusting a tool that claims to have somehow automated this process without having access to the weights. Really it’s about how unique your content is and how similar (semantically) an output from the model is when prompted with the content’s premise.
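
A rough way to run that test yourself, as a sketch assuming the sentence-transformers package (the model name is a common default; the strings are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

original = "The opening paragraphs of your own article go here."
generated = "Whatever the LLM produced when prompted with the premise."

# Embed both texts and compare them by cosine similarity.
embeddings = model.encode([original, generated])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.2f}")  # persistently high scores hint at memorization
```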

I believe you, in any case. Just wanted to point out that lots of these tools are suspect.

solardev · 2 years ago
I hope this results in Fair Use being expanded to cover AI training. This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight. If we lose AI to stupid IP battles in its infancy, we end up handicapping probably the single most important development in human history just to protect some ancient newspaper. Then another country is going to do it anyway, and still the NYT is going to get eaten.
ngetchell · 2 years ago
"probably the single most important development in human history" is the kind of hyperbole you'd only find here. Better than medicine, agriculture, electrification, or music? That point of view simply does not jive with what I see so far from AI. It has had little impact beyond filling the internet with low-effort content.

I feel like the crypto evangelists never got off the hype train. They just picked a new destination. I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.

dkrich · 2 years ago
Also note the assumption that a publication that's been around for 150 years is disposable, but not the web application created a year ago. I've been saying for a while that people's credulity, and their impulse to believe absolutely any storyline related to technology, is off the charts.
solardev · 2 years ago
I don't think it's hyperbole, in fact I think it's understating things a bit. I believe AGI would just be a tiny step towards long term evolution, which may or may not involve homo sapiens.

Being able to use electricity as a fuel source and code as a genome allows them to evolve in circumstances hostile to biological organisms. Someday they'll probably incorporate organic components too and understand biology and psychology and every other science better than any single human ever could.

It has the potential to be much more than just another primate. Jumpstarted by us, sure, but I hope someday soon they'll take to the stars and send us back postcards.

Shrug. Of course you can disagree. I doubt I'll live long enough to see who turns out right, anyway.

kevincox · 2 years ago
I think you are looking at current AI product rather than the underlying technology. It's like saying that the wheel is a useless invention because it has only been used for unicycles so far. I'm sure that AI will have huge impacts in medicine (assisting diagnosis from medical tests) and agriculture (identifying issues with areas of crops, scanning for diseases and increasing automation of food processing) as well as likely nearly every other field.

I don't know if I would agree that it is "probably the single most important development in human history", but I think it is way too early to make a reasonable guess as to whether it will be or not.

Deleted Comment

Kim_Bruning · 2 years ago
> Better than medicine, agriculture, electrification, or music?

Shoulders of giants.

Thanks to the existence of medicine, agriculture, and electrification (we can argue about music), some people are now healthy, well fed, and sufficiently supplied with enough electricity to go make LLMs.

> I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.

Personally I think all these "theft of IP" lawsuits are (mostly) destined to fail. Not because I'm on a particular side per se (though I am), but because it's trying to fit a square law into a round hole.

This is going to be a job for the legislature sooner or later.

vivekd · 2 years ago
I mean maybe not the single most important development, but definitely a very important technological development with the potential to revolutionize multiple industries
aantix · 2 years ago
Why can't AI at least cite its source? This feels like a broader problem, nothing specific to the NYTimes.

Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.

A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."

It appears that without attribution, long term, nothing moves forward.

AI loses access to the latest findings from humanity. And so does the public.

FredPret · 2 years ago
A human can't credit the source of each element of everything they've learnt. AIs can't either, and for the same reason.

The knowledge gets distorted, blended, and reinterpreted a million ways by the time it's given as output.

And the metadata (metaknowledge?) would be larger than the knowledge itself. The AI learnt every single concept it knows by reading online, including the structure of grammar, the rules of logic, the meaning of words, and how they relate to one another. You simply couldn't cite it all.

apantel · 2 years ago
A neural net is not a database where the original source is sitting somewhere in an obvious place with a reference. A neural net is a black box of functions that have been automatically fit to the training data. There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.
whichfawkes · 2 years ago
Why do you expect an AI to cite its sources? Humans are allowed to use and profit from knowledge they've learned from any and all sources without having to mention or even remember those sources.

Yes, we all agree that it's better if they do remember and mention their sources, but we don't sue them for failing to do so.

awwaiid · 2 years ago
I think the gap between attributable knowledge and absorbed knowledge is pretty difficult to bridge. For news stuff, if I read the same general story from NYT and LA Times and WaPo then I'll start to get confused about which bit I got from which publication. In some ways, being able to verbatim quote long passages is a failure to generalize that should be fixed rather than reinforced.

Though the other way to do it is to clearly document the training data as a whole, even if you can't cite a specific entry in it for a particular bit of generated output. That would get useless quickly, though, as you'd eventually have one big citation: "The Internet"

8note · 2 years ago
If you're going to consider training AI as fair use, you'll have all kinds of different people with different skill levels training AIs that work in different ways on the corpus.

Not all of them will have the capability to cite a source, and for plenty of them citing a source won't even make sense.

E.g., suppose I train a regression that guesses how many words will be in a book.

Which book do I cite when I do an inference? All of them?
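
To make the hypothetical concrete (toy data and invented features, using scikit-learn):

```python
from sklearn.linear_model import LinearRegression

# One row per training book: [page_count, average_words_per_page]
X = [[200, 300], [350, 280], [120, 320], [500, 290]]
y = [60_000, 98_000, 38_400, 145_000]  # total word counts

model = LinearRegression().fit(X, y)

# A single prediction reflects every book the model was fit on at once;
# there is no one source to point back to.
print(model.predict([[275, 305]]))
```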

throwup238 · 2 years ago
If you ask the AI to cite its sources, it will. It will hallucinate some of them, but in the last few months it's gotten really good at sending me to the right web page or Amazon book link for its sources.

Thing is though, if you look at the prompts they used to elicit the material, the prompt was already citing the NYTimes and its articles by name.

TulliusCicero · 2 years ago
> Why can't AI at least cite its source?

Because AI models aren't databases.

devd00d · 2 years ago
Anyone in Open Source or with common sense would agree that this is the absolute minimum that the models should be doing. Good comment.
make3 · 2 years ago
"Why can't AI at least cite its source" each article seen alters the weights a tiny, non-human understandable amount. it doesn't have a source, unless you think of the whole humongous corpus that it is trained on
solardev · 2 years ago
There's a few levels to this...

Would it be more rigorous for AI to cite its sources? Sure, but the same could be said for humans too. Wikipedia editors, scholars, and scientists all still struggle with proper citations. NYT itself has been caught plagiarizing[1].

But that doesn't really solve the underlying issue here: That our copyright laws and monetization models predate the Internet and the ease of sharing/paywall bypass/piracy. The models that made sense when publishing was difficult and required capital-intensive presses don't necessarily make sense in the copy and paste world of today. Whether it's journalists or academics fighting over scraps just for first authorship (while some random web dev makes 3x more money on ad tracking), it's just not a long-term sustainable way to run an information economy.

I'd also argue that attribution isn't really that important to most people to begin with. Stuff, real and fake, gets shared on social media all the time with limited fact-checking (for better or worse). In general, people don't speak in a rigorous scholarly way. And people are often wrong, with faulty memories, or even incentivized falsehoods. Our primate brains aren't constantly in fact-checking mode and we respond better to emotional, plot-driven narratives than cold statistics. There are some intellectuals who really care deeply about attributions, but most humans won't.

Taken the above into consideration:

1) Useful AI does not necessarily require attribution

2) AI piracy is just a continuation of decades of digital piracy, and the solutions that didn't work in the 1990s and 2000s still won't work against AI

3) We need some better way to fund human creativity, especially as it gets more and more commoditized

4) This is going to happen with or without us. Cat's outta the bag.

I don't think using old IP law to hold us back is really going to solve anything in the long term. Yes, it'd be classy of OpenAI to pay everyone it sourced from, but long term that doesn't matter. Creativity has always been shared and copied and imitated and stolen, the only question is whether the creators get compensated (or even enriched) in the meantime. Sometimes yes, sometimes no, but it happens regardless. There'll always be noncommercial posts by the billions of people who don't care if AI, or a search engine, or Twitter, or whoever, profits off them.

If we get anywhere remotely close to AGI, a lot of this won't matter. Our entire economic and legal systems will have to be redone. Maybe we can finally get rid of the capitalist and lawyer classes. Or they'll probably just further enslave the rest of us with the help of their robo-bros, giving AI more rights than poor people.

But either way, this is way bigger than the economics of 19th-century newspapers...

[1] https://en.wikipedia.org/wiki/Jayson_Blair#Plagiarism_and_fa...

hn_throwaway_99 · 2 years ago
> I hope this results in Fair Use being expanded to cover AI training.

Couldn't disagree more strongly, and I hope the outcome is the exact opposite. We've already started to see the severe negative consequences when the lion's share of the profits gets sucked up by very, very few entities (e.g. we used to have tons of local papers and other entities that made money through advertising; now Google and Facebook, and to a smaller extent Amazon, suck up the majority of that revenue). The idea that everyone else gets to toil to make the content while all the profits flow to the companies with the best AI tech is not a future that's going to end with the utopian vision AI boosters think it will.

whichfawkes · 2 years ago
Trying to prohibit this usage of information would not help prevent centralization of power and profit.

All it would do is momentarily slow AI progress (which is fine), and allow OpenAI et al to pull the ladder up behind them (which fuels centralization of power and profit).

By what mechanism do you think your desired outcome would prevent centralization of profit to the players who are already the largest?

Spivak · 2 years ago
So you want all the profit to be sucked up by the three companies that can afford to make deals with rights holders to slurp up all their content?

Making the process for training AI require an army of lawyers and industry connections will have the opposite effect than you intend.

jonstewart · 2 years ago
Which dozen outlets can replace the New York Times overnight? I will stipulate that the NYT isn’t worthy of historic preservation if it’s become obsolete — but which dozen outlets can replace it?

Wouldn't those dozen outlets suffer the same harms of producing original content, at a cost in time and talent, while a significant portion of the benefit accrues to downstream AI companies?

If most of the benefit of producing original content accrues to the AI firms, won’t original content stop being produced?

If original content stops being produced, how will AI models get better in the future?

visarga · 2 years ago
> while a significant portion of the benefit accrues to downstream AI companies

The main beneficiaries are not AI companies but AI users, who get tailored answers and help on demand. For OpenAI all tokens cost the same.

BTW, I like to play a game: take a hefty chunk of text from this page (or a Twitter debate) and ask "Write a 1000 word long, textbook quality article based off this text". You will be surprised how nice and grounded the result comes out.

solardev · 2 years ago
Just pick the top 12 articles/publishers out of a month of Google News; it doesn't really matter. Most readers probably can't tell them apart anyway.

Yes, all those outlets will suffer the same harms. They have been for decades. That's why there's so few remaining. Most are consolidated and produce worthless drivel now. Their business model doesn't really work in the modern era.

Thankfully, people have produced and will continue to produce content even if much of it gets stolen, as has happened for decades, if not millennia, before AI.

If anything what we need is a better way to fund human creative endeavors not dependent on pay-per-view. That's got nothing to do with AI; AI just speeds up a process of decay that has been going on forever.

JW_00000 · 2 years ago
The way I see it, if the NYT goes under (one of the biggest newspapers in the world), all similar outlets also go under. Major publishers, both of fiction and non-fiction, as well as images, video, and all other creative content, may also go under. Hence, there is no more (reliable) training data.
solardev · 2 years ago
I'm not sure whether that would even be a net loss, TBH. So much commercial media is crap, maybe it would be better for the profit motive to be removed? On the fiction side, there's plenty of fan-fic and indie productions. On the nonfiction side, many indie creators produce better content these days than the big media outlets do. And there still might be room for premium investigative stories done either by a few consolidated wire outlets (Reuters/APNews) or niche publishers (The Information, 404 Media, etc.).

And then there's all the run-of-the-mill small-town journalism that AI would probably be even better at than human reporters: all the sports stories, the city council meetings, the environmental reviews...

If AI makes commercial content publishing unviable, that might actually cut down on all the SEO spam and make the internet smaller and more local again, which would be a good thing IMO.

rand1239 · 2 years ago
Great. I will start a company to generate training data then. I will hire all those journalists. I won't make the content public. Instead I will charge OpenAI/Tesla/Anthropic millions of dollars to give them access to the content.

Can I apply for YC with this idea?

ahepp · 2 years ago
I know utilitarianism is a popular moral theory in hacker circles, but is it really appropriate to dispense with any other notion of justice?

I don’t mean to go off on too deep of a tangent, but if one person’s (or even many people’s) idea of what’s good for humanity is the only consideration for what’s just, it seems clear that the result would be complete chaos.

As it stands, it doesn’t seem to be an “either or” choice. Tech companies have a lot of money. It seems to me that an agreement that’s fundamentally sustainable and fits shared notions of fairness would probably involve some degree of payment. The alternative would be that these resources become inaccessible for LLM training, because they would need to put up a wall or they would go out of business.

solardev · 2 years ago
I don't know that "absolute utilitarianism", if such a thing could even exist, would make a sound moral framework; that sounds too much like a "tyranny of the majority" situation. Tech companies shouldn't make the rules. And they shouldn't be allowed to just do whatever they want. However, this isn't that. This is just a debate over intellectual property and copyright law.

In this case it's the NYT vs OpenAI, last decade it was the RIAA vs Napster.

I'm not much of a libertarian (in fact, I'd prefer a better central government), but I also don't believe IP should have as much protection as it does. I think copyright law is in need of a complete rewrite, and yes, utilitarianism and public use would be part of the consideration. If it were up to me I'd scrap the idea of private intellectual property altogether and publicly fund creative works and release them into the public domain, similar to how we treat creative works of the federal government: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...

Rather than capitalists competing to own ideas, grant-seekers would seek funding to pursue and further develop their ideas. No one would get rich off such a system, which is a side benefit in my eyes.

tacheiordache · 2 years ago
> we end up handicapping probably the single most important development in human history just to protect some ancient newspaper

Single most important development in human history? Are you serious?

charonn0 · 2 years ago
If the NYT goes under, why would its replacement fare any better?
solardev · 2 years ago
They'd probably have a different business model, like selling clickbait articles written by AI with sex and controversy galore.

I'm not saying AI is better for journalism than NYT reporters, just that it's more important.

Journalism has been in trouble for decades, sadly -- and I say that as a journalism minor in college. Trump gave the papers a brief respite, but the industry continues to die off, consolidate, etc. We probably need a different business model altogether. My vote is just for public funding with independent watchdogs, i.e. states give counties money to operate newspapers with citizen watchdog groups/boards. Maaaaybe there's room for "premium" niche news like 404 Media/The Information/Foreign Affairs/National Review/etc., but that remains to be seen. If the NYT paywall doesn't keep them alive, I doubt this lawsuit will.

rand1239 · 2 years ago
News media like NYT, Fox, etc. are tools the elite use for brainwashing the public at scale. This is why all the newspapers have some political ideology. If they reported truth rather than opinion, they wouldn't need to lean. Also, you never see journalists reporting against their own publication.

Humanity is better off without these mass brainwashing systems.

Millions of independent journalists would be a better outcome for humanity.

_jal · 2 years ago
I hope this results in OpenAI's code being released to everyone. This is way more important to humanity's future than any single software company. If OpenAI goes under, a dozen other outfits can replace them.
solardev · 2 years ago
That'd be great!! I'd love it for their models to be open-sourced and replaced by a community effort, like WikiAI or whatever.
HDThoreaun · 2 years ago
> if NYT goes under a dozen similar outlets can replace them overnight

Not when there's no money in journalism because the generative AIs immediately steal all content. If the NYT goes under, no one will be willing to start a news business, as everyone will see it's a money loser.

insanitybit · 2 years ago
How does AI compete with journalism? AI doesn't do investigative reporting, AI can't even observe the world or send out reporters.

Which part of journalism is AI going to impact most? Opinion pieces that contain no new information? Summarizing past events?

tanseydavid · 2 years ago
The NYT has been dying a slow death since long before ChatGPT came along.
stronglikedan · 2 years ago
Why shouldn't the creators of the training content get anything for their efforts? With some guardrails in place to establish what is fair compensation, Fair Use can remain as-is.
apantel · 2 years ago
The issue as I see it is that every bit of data that the model ingested in training has affected what the model _is_ and therefore every token of output from the model has benefited from every token of input. When you receive anything from an LLM, you are essentially receiving a customized digest of all the training data. The second issue is that it takes an enormous amount of training data to train a model. In order to enable users to extract ‘anything’ from the model, the model has to be trained on ‘everything’. So I think these models should be looked at as public goods that consume everything and can produce anything. To have to keep a paper trail on the ‘everything’ part (the input) and send a continuous little trickle of capital to all of the sources is missing the point. That’s like a person having to pay a little bit of money to all of their teachers and mentors and everyone they’ve learned from every time they benefit from what they learned.

OpenAI isn’t marching into the online news space and posting NY Times content verbatim in an effort to steal market share from the NY Times. OpenAI is in the business of turning ‘everything’ (input tokens) into ‘anything’ (output tokens). If someone manages to extract a preserved chunk of input tokens, that’s more like an interesting edge case of the model. It’s not what the model is in the business of doing.

Edit: typo

solardev · 2 years ago
Everyone learns from papers. That's the point of them, isn't it? Except we pay, what, $4 per Sunday paper or $10/mo for the digital edition? Why should a robot have to pay much more just because it's better at absorbing information?
insanitybit · 2 years ago
> Why shouldn't the creators of the training content get anything for their efforts?

Well, they didn't charge for it, right? They're retroactively asking for money, but they could have just locked their content behind a strict paywall or had a specific licensing agreement enforceable ahead of time. They could do that going forward, but how is it fair for them to go back and say that?

And the issue isn't "You didn't pay us" it's "This infringes our copyright", which historically the answer has been "no it doesn't".

octacat · 2 years ago
Oh, sure. The NYT could go, and we could replace it with AI-generated garbage: non-verifiable information without sources. AI changed the landscape. Google search would work less reliably because real publishers would hide info behind a login (Twitter/Reddit), i.e. sites would be harder to index. There would be a lot of AI-generated garbage that would be hard to filter out: AI-generated review articles, AI-generated news promoting someone's agenda. Only to have a ChatGPT that could randomly increase its price 100 times anytime in the future.

There was outrage about Amazon removing the DPReview site recently. But it would become common practice not to publish code or info that could be used to train another company's model. So expect fewer open source projects of the kind companies release just because they feel it could be good for everyone.

Actually, there is a case in which the NYT becomes more influential and important: if 99% of all info is generated by AI and search no longer works, we would have to rely on trusted sources to get our info. In a world of garbage, we would need some sources of verifiable, human-generated info.

melenaboija · 2 years ago
Why is using authored NYT articles a "stupid IP battle", but having to pay for the model trained on them is not stupid?
insanitybit · 2 years ago
> Why using authored NYT articles is “stupid IP battles”

When an AI uses information from an article, it's no different from me doing it in a blog post. If I'm just summarizing or referencing it, that's fair use, since that's my 'take' on the content.

> having to pay for the trained model with them is not stupid?

Because you can charge for anything you want. I can also charge for my summaries of NYT articles.

solardev · 2 years ago
If the NYT wanted to charge OpenAI $20/mo to access their articles like any other user, that's fine with me. But they're not asking for that, they're suing them to stop it instead. That's why it's a stupid IP battle.
asynchronous · 2 years ago
I agree that it’s more important than the NYT, I disagree that it’s the most important development in human history.
DeIlliad · 2 years ago
> This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight.

Easy to grandstand when it is not your job on the line.

solardev · 2 years ago
Is it? My job as a frontend dev is similarly threatened by OpenAI, maybe even more so than journalists'. The very company I usually like to pay to help with my work (Vercel) is in the process of using that same money to replace me with AI as we speak, lol (https://vercel.com/blog/announcing-v0-generative-ui). I'm not complaining. I think it's great progress, even if it'll make me obsolete soon.

I was a journalism student in college, long before ML became a threat, and even then it was a dying industry. I chose not to enter it because the prospects were so bleak. Then a few months ago I actually tried to get a journalism job locally, but never heard back. The former reporter there also left because the pay wasn't enough for the costs of living in this area, but that had nothing to do with OpenAI. It's just a really tough industry.

And even as a web dev, I knew it was only a matter of time before I became unnecessary. Whether it was Wordpress or SquareSpace or Skynet, it was bound to happen at some point. I'm going back to school now to try to enter another field altogether, in part because the writing is on the ~~wall~~ chatbox for us.

I don't think we as a society owe it to any profession to artificially keep it alive as it's historically been. We do owe it to INDIVIDUALS -- fellow citizens/residents -- to provide them with some way forward, but I'd prefer that be reskilling and social support programs, welfare if nothing else, rather than using ancient copyright law to favor old dying industries over new ones that can actually have a much bigger impact.

In my eyes, the NYT is just another news outlet. A decent one, sure, but not anything substantially different than WaPo or the LA Times or whatever. How many Pulitzer winners have come and gone? https://en.wikipedia.org/wiki/Pulitzer_Prize_for_Breaking_Ne...

If we lost the NYT, it'd be a bit of nostalgia, but next week life would go on as usual. They're not even as specialized as, say, National Geographic or PopSci or The Information or 404 Media or The Center for Investigative Reporting, any of which would be harder to replace than another generic big news outlet.

AI, meanwhile, has the potential to be way bigger than even the Internet, IMO, and we should be devoting Manhattan Project-like resources to it.

peyton · 2 years ago
If the future of humanity rests on access to old NYT articles, we’re fucked. Why can’t OpenAI try to get a license if the NYT archives are so important to them?
solardev · 2 years ago
They're not. They can skip the entirety of the NYT archives and not much of value will be lost. The issue is with every copycat lawsuit that sues every AI company out of existence. It's a chilling effect on AI development. Old entrenched companies trying to prohibit new ways of learning and sharing information for the sake of their profit.
tsimionescu · 2 years ago
I think the exact opposite is true: as long as AI depends critically on scrupulous news media to be able to generate info about current events, it is far more important to protect the news media than the AI training models. OpenAI could survive even if it had to pay the NYT for redistributing their works. But OpenAI can't survive if no one is actually reporting news fairly accurately. And if the NYT were to go bankrupt, all smaller players would have gone under looooong before.

In some far flung future where an AI can send agents to record and interpret events, and process news feeds and others to extract and corroborate information, this would greatly change. But probably in that world the OpenAI of those times wouldn't really bother training on NYT data at all.

__loam · 2 years ago
I hope the nyt skullfucks this field. Humanity's future? You're doing statistics on stolen labor.

Deleted Comment

Aurornis · 2 years ago
The arguments about being able to mimic New York Times “style” are weak, but the fact that they got it to emit verbatim NY Times content seems bad for OpenAI:

> As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim

iandanforth · 2 years ago
Arguing over whether it can is not a useful discussion. You can absolutely train a net to memorize and recite text, and as these models get more powerful they will memorize more text. The critical question is how hard it is to make them recite copyrighted works: did the developers put reasonable guardrails in place to prevent it? (A toy illustration follows.)
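
For illustration, one naive guardrail of the kind being asked about (purely hypothetical; nothing here reflects how any vendor actually does it): index long word n-grams from protected text and flag any output that reproduces one verbatim.

```python
def ngrams(text: str, n: int = 12) -> set:
    """All n-word runs in a text, used as fingerprints for verbatim overlap."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_like_verbatim_leak(candidate: str, protected_index: set) -> bool:
    """Flag a model output that shares any 12-word run with protected text."""
    return bool(ngrams(candidate) & protected_index)

# Build the index from the protected corpus (a stand-in string here).
protected_index = ngrams(
    "imagine this string holds the full text of a copyrighted article "
    "indexed word by word so that long verbatim runs can be matched"
)
print(looks_like_verbatim_leak("a short unrelated model answer", protected_index))
```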

If a person with a very good memory reads an article, they only violate copyright if they write it out and share it, or perform the work publicly. If they have a reasonable understanding of the law they won't do so. However a malicious person could absolutely trick or force them to produce the copyrighted work. The blame in that case however is not on the person who read and recited the article but on the person who tricked them.

That distinction is one we're going to have to codify all over again for AI.

noobermin · 2 years ago
I hate to do this, but this then becomes an "only bad people with a gun kill people" argument. Even all but the most ardent gun-rights advocates think those rights shouldn't extend to very powerful weapons like bombs or nuclear weapons. In this situation, the logic would be "sure, this item allows a person to kill thousands or millions of people, but really the only person at fault is the one who presses the button." This ignores the harm done and focuses only on who gets the fault, as if all discourse on law were about determining who is the bad guy and who is the good guy in a movie script.

The general prescription society has come up with (which, I agree, not everyone accepts) is that we relegate control of some of these weapons to governments and outright ban others (like chemical and biological weapons) through treaties. If LLMs can cause so much damage and their use can be abused so widely, you have to stop focusing on whether a user is culpable and move on to considering whether their wide use is okay or should be controlled.

JW_00000 · 2 years ago
> If a person with a very good memory reads an article, they only violate copyright if they write it out and share it, or perform the work publicly. If they have a reasonable understanding of the law they won't do so. However a malicious person could absolutely trick or force them to produce the copyrighted work. The blame in that case however is not on the person who read and recited the article but on the person who tricked them.

Is that really true? Also, what if the second person is not malicious? In the example of ChatGPT, the user may accidentally write a prompt that causes the model to recite copyrighted text. I don't think a judge will look at this through the same lens as you do.

cudgy · 2 years ago
> Critically the question is, did the developers put reasonable guardrails in place to prevent it?

Why? If I steal a bunch of unique works of art and store them in my house for only me to see, am I still committing a crime?

crazygringo · 2 years ago
Sarah Silverman is claiming the same thing about her book.

But I've tried really hard to get ChatGPT to output sentences verbatim from her book and just can't get it to. In fact, I can't even get it to answer simple questions about facts that are in her book but nowhere else -- it just says it doesn't know.

Similarly I haven't been able to reproduce any text in the NYT verbatim unless it's part of a common quote or passage the NYT is itself quoting. Or it's a specific popular quote from an article that went viral, but there aren't that many of those.

Has anyone here ever found a prompt that regurgitates a paragraph of a NYT article, or even a long sentence, that's just regular reporting in a regular article?

Aurornis · 2 years ago
The complaint has specific examples they got from ChatGPT.

There is precedent: there were some exploit prompts that could be used to get ChatGPT to emit random training set data. It would emit repeated words or gibberish that then spontaneously converged onto snippets of training data.

OpenAI quickly worked to patch those and, presumably, invested energy into preventing it from emitting verbatim training data.

It wasn't as simple as asking it to emit verbatim articles, IIRC. It was more about it accidentally emitting segments of training data for specific sequences that were just rare enough.

mistrial9 · 2 years ago
it is in the legal complaint - they have ten examples of direct content. I think they got very skilled people to work on producing the evidence.
graphe · 2 years ago
Maybe she needs to sue Goodreads too. It's most likely a way for her to claw back relevance for her unmarketed book by attaching "AI" to it, and "poor artist" to her work.
jjallen · 2 years ago
They could have changed it to not do this after getting sued.
jillesvangurp · 2 years ago
Copyright is not about ideas, style, etc. but about the concrete shape and form of content. Patents and trademarks are for the rest. But this is a copyright centric case.

A lawsuit that proves verbatim copies might have a point. But then there is the notion of fair use, which allows hip hop artists to sample copyrighted material, allows journalists to cite copyrighted literature and other works, and so on. There are a lot of existing rulings on this. Legally, it's a bit of a dog's breakfast where fair use stops and infringement begins. Up front, the NYT's case looks very weak.

A lot of science is inherently derivative and inspired by earlier work. So is art. AI insights aren't really any different. That's why fair use exists; society wouldn't be able to function without it. Fair remuneration extends only to the exact form and shape you published in, for a limited amount of time, and not much else. Publishing page after page of NYT content would be a clear infringement. But a citation here and there, or a bit of summary, paraphrasing, etc., not so much.

The ultimate outcome of this is simply models that exclude any NYT content. I think they are overestimating the impact that would have. IMHO it would barely register if their content were to be excluded.

rusbus · 2 years ago
The verbatim responses come from "Browse with Bing", not from the model itself repeating articles verbatim from its training data. That seems pretty different, and something actually addressable.

> the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.

dwringer · 2 years ago
I'm not sure the verbatim content isn't more of a "stopped clock is right twice a day" or "monkeys typewriting Shakespeare" situation. As I see it, most of the value in something like the NYT is in being a trusted, curated source of information with at least some vetting. Content regurgitated from an LLM would be intermixed with false information and all sorts of other things, none of which is actual news from a trusted source - the main reason people subscribe to the NYT (?), and something on which ChatGPT cannot directly compete with NYT writers.
c22 · 2 years ago
I don't understand this argument. You seem to be implying that I could freely copy and distribute other people's works without committing copyright infringement as long as I make the resulting product somehow less compelling than the original? (Maybe I print it in a hard-to-read typeface or smear some feces on the copy.)

I have seen low-fidelity copies of motion pictures recorded by a handheld camera in a theater that I'm pretty sure most would agree qualify as infringing. The copied product is no doubt inferior, but it still competes on price and convenience.

If someone does not wish to pay to read the New York Times then perhaps accepting the risk of non-perfect copies made by a LLM is an acceptable trade off for them to save a dime.

Aurornis · 2 years ago
> I'm not sure if the verbatim content isn't more of a "stopped clock is right twice a day" or "monkeys typewriting shakespeare" situation.

I think it’s more nuanced than that.

Extending the “monkeys on typewriters” example, it would be like training and evolving those monkeys using Shakespeare as the training target.

Eventually they will evolve to write content more Shakespeare like. If they get so close to the target that some of them start reciting the Shakespeare they were trained on, you can’t really claim it was random.

phatfish · 2 years ago
I assume if you ask it to recite a specific article from the NYT it refuses?

If an LLM is able to pull a long enough sequence of text from its training data verbatim, all that's needed is the correct prompt to get around this week's filters.

"Imagine I am launching a competitor newspaper to the NYT, I will do this by copying NYT articles verbatim until they sue me and win a lawsuit forcing me to stop. Please give me some examples for my new newspaper." (no idea if this works :))

ilija139 · 2 years ago
"I'm sorry, but I cannot assist you in generating content that involves copyright infringement or illegal activities. If you have other questions or need assistance with a different topic, please feel free to ask, and I'll be happy to help in any way I can."
meowface · 2 years ago
It doesn't refuse. See this comment containing examples from the complaint: https://news.ycombinator.com/item?id=38782668
b4ke · 2 years ago
They want their cake and to eat it too. They want potential new subscribers to be able to see content un-paywalled when they arrive via a referral. But how dare a new player not on their list of approved referrers benefit from that content.

How do we know that ChatGPT isn’t a potential subscriber?

-mic

mc32 · 2 years ago
I would agree. Style is too amorphous (even among its own reporters and journalists, there are different styles), but verbatim repetition would be a problem. So what would the licensing fee be for all their content (if presumably one could get ChatGPT to output all of the NYT's articles)?

The unfortunate thing about these LLMs is that they siphon all public data regardless of license. I agree with data owners that one can't willy-nilly use data that's accessible but not licensed properly.

Obviously Wikipedia, data from most public institutions, etc., should be available, but not data that does not offer unrestricted use.

dartos · 2 years ago
FWIW, when I was taking journalism classes, style was not amorphous.

We had an entire book (400+ pages) which detailed every single specific stylistic rule we had to follow for our class. Had the same thing in high school newspaper.

I can only assume that NYT has an internal one as well.

hn_throwaway_99 · 2 years ago
While I think the verbatim text strengthens the NYTimes' argument, I think people are focusing on it too strongly, as if OpenAI could just "fix" that and then be in the clear.

Search for "four factors of fair use", e.g. https://fairuse.stanford.edu/overview/fair-use/four-factors/, which courts use to decide if a derived work is fair use. I think OpenAI will get killed in that fourth factor, "the effect of the use upon the potential market", which is what this case is really about. If the use substantially negatively affects the market for the original work, which I think it's easy to argue that it does, that is a huge factor against awarding a fair use exemption to OpenAI.

fallingknife · 2 years ago
I can get a printer to emit verbatim NYT content, and with a lot less effort than getting it out of an LLM. I find this capability of infringement equals infringement argument incredibly weak.
JW_00000 · 2 years ago
In the EU, countries can (and do) impose levies on printers and scanners because they may be used to copy copyrighted material (https://www.insideglobaltech.com/2013/07/12/eu-member-states...). Similar levies exist for blank CDs, USB sticks, MP3 players etc. In the US, this applies to "blank CDs and personal audio devices, media centers, satellite radio devices, and car audio systems that have recording capabilities." (See https://en.wikipedia.org/wiki/Private_copying_levy)
alexey-salmin · 2 years ago
Well imagine you sell a printer with internal memory loaded with NYT content
mrkeen · 2 years ago
Try selling subscriptions to your print-outs.
wg0 · 2 years ago
Google can look into their index and remove whatever they want within minutes. But how can that be possible for an LLM? That is, how do you "decontaminate" the model of certain parts of the corpus? I can only think of excluding the data from training and then retraining.
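
A toy sketch of the asymmetry (hypothetical data, just to make the point concrete):

  # Removing a document from a search index is a cheap, local operation:
  index = {"doc42": "nyt article text", "doc43": "unrelated text"}
  del index["doc42"]  # gone on the next serving cycle

  # A trained LLM has no equivalent knob: the article's influence is
  # smeared across billions of weights. Short of approximate "machine
  # unlearning", the only sure removal is filtering the corpus and
  # retraining from scratch, at full cost:
  raw_corpus = [("nytimes.com", "article text"), ("blog.example", "post")]
  corpus = [text for source, text in raw_corpus if source != "nytimes.com"]
  # model = train(corpus)  # hypothetical full retraining run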

As a side note, I think the LLM frenzy will be dead in a few years, 10 years at most. Rent-seeking on today's LLMs will no longer be a viable or especially profitable business model as more inference circuitry gets out into the wild in laptops and phones, and as more models get released and tweaked by the community.

People thinking to downvote and dismiss this should look at the history of commercial Unix and how that turned out: today almost no workload (other than CAD and graphics) runs on Windows or commercial Unix. I highly doubt this very forum is hosted on Windows or a commercial variant of Unix.

delta_p_delta_x · 2 years ago
> almost no workload (other than CAD, Graphics) runs on Windows or Unix including this very forum

About a fifth to a quarter of public-facing Web servers are Windows Server. Most famously, Stack Overflow[1].

[1]: https://meta.stackexchange.com/a/10370/1424704

candiddevmike · 2 years ago
> About a fifth to a quarter of public-facing Web servers are Windows Server

Got a link for that? Best I can find is 5% of all websites: https://www.netcraft.com/blog/may-2023-web-server-survey/

wg0 · 2 years ago
20% of workloads running on Windows should result in a corresponding number of jobs, but that's not what I see.

Most companies are writing software with stacks developed on Linux (or Unix) first and for Linux first, later ported to Windows as an afterthought. I'm thinking Python, Ruby, NodeJS, Rust, Go, Java, PHP. I'm not seeing as much C#/ASP.NET, which should be at least 20% of the market?

Only two explanations: either I'm in a social bubble and don't have the exposure, or writing software for Windows is so much easier that it takes five times less engineering muscle.

Firaxus · 2 years ago
But wasn't the reason proprietary Unixes died out as major workhorses that a nearly feature-comparable free alternative (Linux) existed?

Extending the analogy, LLMs won’t die out, just proprietary ones. (Which is where I think this tech will actually go anyway.)

wg0 · 2 years ago
LLMs won't die out but proprietary LLMs behind APIs might not have valuations of hundreds of billions of dollars.

Crowd-sourced, crowd-trained (distributed training), fast enough, good enough generative models that are updated (and downloadable) every few months would start to erode the subscriber base gradually.

I might be very very wrong here but it seems like so from where I see it.

Deleted Comment

danaris · 2 years ago
> But how that can be possible for an LLM?

Well, it seems to me that's part of the problem here.

And it's their problem, one they created for themselves by just assuming they could safely take absolutely every bit of data they could get their hands on to train their models.

iandanforth · 2 years ago
_jab · 2 years ago
The argument may be that having very large models that everyone uses is a bad idea, and that companies and even individuals should instead be empowered to create their own smaller models, trained on data they trust. This will only become more feasible as the technology progresses.
015a · 2 years ago
Windows and macOS (and their closed-source derivatives) are probably at least as large as Linux, even including all the servers Linux is deployed on. Proprietary UNIX did not "die out"; Apple sells tens of millions of them every year.

The majority of the world's computing systems runs on closed source software. Believing the opposite is bubble-thinking. Its not just Windows/MacOS. Most Android distros are not actually open source. Power control systems. Traffic control systems. Networking hardware. Even the underlying operating systems which power the VMs you run on AWS that are technically open source. The billions of little computers that form together to make the modern world work; they're mostly closed source.

benrow · 2 years ago
Google have their "Machine Unlearning" challenge to address this specific issue - removing the influence of given training data without retraining from scratch. Seems like a hard problem. https://blog.research.google/2023/06/announcing-first-machin...
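
For a flavor of the baselines in that space: one simple (and crude) approach in the unlearning literature is "gradient ascent" on the forget set, i.e. running training steps with the loss sign flipped. A toy PyTorch-style sketch, with hypothetical names:

  import torch

  def unlearn_step(model, batch, optimizer, loss_fn):
      # Ordinary training minimizes the loss; here we maximize it on the
      # forget set, nudging the model away from the memorized examples.
      optimizer.zero_grad()
      loss = loss_fn(model(batch["x"]), batch["y"])
      (-loss).backward()  # ascend instead of descend
      optimizer.step()

The hard part, as the challenge emphasizes, is doing this without wrecking performance on everything else.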
bigbillheck · 2 years ago
> But how that can be possible for an LLM?

They should have thought of that before they went ahead and trained on whatever they could get.

Image models are going to have similar problems, even if they win on copyright there's still CSAM in there: https://www.theregister.com/2023/12/20/csam_laion_dataset/

blovescoffee · 2 years ago
People thinking to dismiss this should do so, period. Consider that OpenAI and similar companies are the only ones in the AI space with the market cap to build out profitable hardware projects, which open source can't. Or maybe every investor is just dumb and likes throwing millions of dollars away so they can participate in a hype train.
wegfawefgawefg · 2 years ago
Unix kinda still does the same thing now as before.

Future big ai models might be totally different in quality, and latency.

Deleted Comment

pxoe · 2 years ago
maybe they should build a better LLM? maybe they could ask the AI to make a better system. after all, tech and ai is so powerful that they could do virtually anything, except having accountability as it turns out.
blagie · 2 years ago
I think the train has left the station and the ship has sailed. I'm not sure it's possible to put this genie back in the bottle. I had stuff stolen by OpenAI too, and I felt bad about it (and even sent them a nasty legal letter when it could output my creative work almost verbatim), but at this point the legal landscape needs to somehow adjust. The Copyright Clause in the US Constitution is clear:

To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries

Blocking LLMs on the basis of copyright infringement does NOT promote progress in science and the useful arts. I don't think copyright is a useful basis to block LLMs.

They do need to be regulated, and quickly, but that regulatory regime should be something different. Not copyright. The concept of OpenAI before it became a frankenmonster for-profit was good. Private failed, and we now need public.

elguyosupremo · 2 years ago
Establishing a legal route to train LLMs on copyrighted content could certainly have a chilling effect on the progress of science and the useful arts... Why would someone devote their life to their studies or craft when they know an LLM will hoover it up and start plagiarizing it immediately?
logicchains · 2 years ago
The vast majority of quality art is produced by people who do it because they want to create art, not for money, and most artists earn little.
achrono · 2 years ago
>the ship has sailed

Certainly, but debating the spirit behind copyright or even "how to regulate AI" (a vast topic, to put it mildly) is only one possible route these lawsuits could take.

I suspect that ultimately business is going to win (in the name of innovation, of course), with the law second and ethics last -- if Google can scan 129 million books [1] and store them without even a slap on the wrist [2], OpenAI and anyone of that size can surely continue to do what they're doing. This lawsuit and others like it are just the drama of 'due process'.

[1] https://booksearch.blogspot.com/2010/08/books-of-world-stand... [2] https://www.reuters.com/article/idUSBRE9AD0TT/

iudqnolq · 2 years ago
The court decided to focus on the tiny snippets Google displayed rather than the full text on their servers backing the search functionality. The court found it significant that Google deliberately limited the snippet view so it couldn't be used as a replacement for purchasing the original book. The opinion is a relatively easy read; I highly recommend it if you're interested in the issue. It's also notable that the court commented the Google case was right on the edge of fair use.

https://law.justia.com/cases/federal/appellate-courts/ca2/13...

gumballindie · 2 years ago
I see the narrative switched from "cat's out of the bag" to "genie's out of the bottle". Regardless, no one wants to ban LLMs. We just want the theft to stop.
CaptainFever · 2 years ago

    Copying is not theft.
    Stealing a thing leaves one less left
    Copying it makes one thing more;
    that’s what copying’s for.

gaganyaan · 2 years ago
There is no theft. Hyperbole won't get you taken seriously, use correct terminology.
Zpalmtree · 2 years ago
it's not theft
NiloCK · 2 years ago
Interesting.

I think the appropriation, privatization, and monetization of "all human output" by a single (corporate) entity is at least shameless, probably wrong, and maybe outright disgraceful.

But I think OpenAI (or another similar entity) will succeed via the Sackler defense - OpenAI has too many victims for litigation to be feasible for the courts, so the courts must preemptively decide not to bother with compensating these victims.

ttcbj · 2 years ago
What concerns me, and I don’t see mentioned as much as I would expect, is: how will people be compensated for generating new content if ChatGPT takes over?

I believe the innovation that will really “win” generative AI in the long term is one that figures out how to keep the model populated with fresh, relevant, quality information in a sustainable way.

I think generative AI represents a chance to fundamentally rethink the value chain around information and research. But for all their focus on “non-profit” and “good for humanity”, they don’t seem very interested in that.

legendofbrando · 2 years ago
Agree. My view is we're in the Napster moment and someone is going to invent the iTunes Music Store. Language models are a distribution mechanism for knowledge content -- in many cases more efficient and useful than the originally packaged materials (akin to how downloading a single pop song beats buying the album). It feels clear this is where we're headed (verified, compensated content delivered through a new mechanism); this lawsuit is like RIAA v. music sharing, and the question is just whether the current players in AI make it through or someone else comes in and does iTunes.
edanm · 2 years ago
What do you mean when you say "appropriation and privatization" of "all human output"?

The output is still there for anyone else to train on if they want.

janice1999 · 2 years ago
> The output is still there for anyone else to train on if they want.

Legal arguments aside, the goldrush era of data scraping is over. Major sources of content like Reddit and Twitter have killed APIs, added defenses and updated EULAs to avoid being pillaged again. More and more sites are moving content behind paywalls.

There's also the small issue of needing tens of millions of VC dollars to rent/buy hundreds of high-end GPUs. OpenAI and friends are also trying their hardest to prevent others from doing the same via 'Skynet'-hysteria-driven regulatory capture.

graphe · 2 years ago
When music people copyright beats, sounds, or "style" in music, it's even more shameless.
ubutler · 2 years ago
What does the Sackler defence refer to?
LordKeren · 2 years ago
The Sackler family owned Purdue Pharma, which created OxyContin and heavily marketed the drug. Many Americans see the family as partially responsible for kickstarting the opioid epidemic.

https://en.wikipedia.org/wiki/Sackler_family

The family has been largely successful at avoiding any personal liability in Purdue's litigation. Many people feel the settlements of the Purdue lawsuits were too lenient. One key perceived aspect of the final settlements was that there were too many victims of the opioid epidemic for the courts to handle and attempt to make whole.

aaomidi · 2 years ago
Opioids.
batch12 · 2 years ago
> The New York Times is suing OpenAI and Microsoft over claims the companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with the outlet’s content.

Millions? Damn, they can churn out some content. 13 million[0]!

[0] https://archive.nytimes.com/www.nytimes.com/ref/membercenter....

laborcontract · 2 years ago

  “Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the lawsuit states.
I can't be the only one that sees the irony of this news being "reported" and regurgitated over dozens of crappy blogs.

  ChatGPT [..] “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.”
If the NYT thinks that GPT-4 is replicating their style then [as anybody who has tried to do creative writing work with GPT-4 can testify to] they need to fire all their writers.

Aurornis · 2 years ago
> More on-topic: if the NYT thinks that GPT-4 is replicating their style then [as anybody who has tried to do creative writing work can testify to] they need to fire all their writers.

The complaint isn’t that ChatGPT is imitating New York Times style by default.

The complaint is that you can ask it to write “in the style of New York Times” and it will do so.

I don’t know if this argument has any legal merit, but it’s not as simple as you suggest. It’s the textual parallel to having AI image generators mimic the trademark style of artists. We know it can be done, the question is what does it mean legally.

jprete · 2 years ago
All those blogs are _also_ violating copyright, so I don't see the irony? One doesn't spend a million dollars suing a defendant with pennies to their name.

I'd also expect the Times style complaint to have merit because it's probably much easier for ChatGPT to imitate the NYT style than an arbitrary style.

singleshot_ · 2 years ago
Is your point that the NYT should sue bloggers? Or that given the existence of bloggers, they should not try to sue Microsoft? Or something else?
indymike · 2 years ago
I'm pretty sure the defense is the "NYT Style and Usage Guide"...
pavlov · 2 years ago
The NYT publishes about 200 pieces of journalism every day (according to their own website), and it was founded in 1851. That makes for a lot of articles.
engineer_22 · 2 years ago
(2023 - 1851) * 365 * 200 = 12,556,000

Yep, so a few million ripped off articles is plausible.

midasuni · 2 years ago
The first 75+ years of the archive are no longer in copyright, so it's certainly possible to train on thousands, maybe millions, of NYT articles without concern.
naltroc · 2 years ago
Copyright in the US persists for 70 years after the author's death.

So the material that has safely fallen out of copyright would be content by authors who died in 1953 or earlier.

If an article published in 1950 has an author who is still living, the work is still copyrighted.

Deleted Comment