Based on the encyclopedic knowledge LLMs have of written works I assume all parties did the same. But I think there is a broader point to make here. YouTube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it. Google itself got big by indexing other people's data without compensation. Spotify's music library was also pirated in the early days. The contracts with the music labels came later. GPL violations by commercial products fit the theme too.
Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.
Everyone on here is smart enough. Just do not participate and save your money. Do not pay for digital goods. If Netflix raises their prices, it doesn't matter because there is a torrent of all of their shows. If Spotify raises their prices, it doesn't matter because your favorite artist has their entire library in a torrent. If some game company asks you to pay real-life prices for a digital costume, find the crack online and play on a private server. If YouTube wants to interrupt your video with an ad in the middle of a sentence, download one of the many options that block all ads. Billion-dollar companies have shown they do not care about you. The people who complain about losing their salary should just get replies thanking them for paying.
All the sad poor people who might be hurt were already paid. The caterer on your favorite show is not getting residuals. NBC also isn't going to stop making TV shows because that is all they can do. Content creators also existed on the internet long before that was a job. They just did it because they cared about it, not for ad money. If you really want to support the artist directly, go to a concert or just mail them a check. If you can't actually identify a person who might be hurt, then do not care.
if you want to support an artist go to the show and BUY MERCH at the table! almost all of their income comes from that. the importance of buying a T-shirt at the show cannot be overstated and sometimes you get to say hi to your idol, too
lol I absolutely do not want non-digital goods nor pirating. Ever. It's 2025. I don't have a CD player, a tape player, a Blu-ray player, I don't even know what the most modern "Blu-ray" disc would be. I have $2k worth of vinyl records that are just unique copies I display as art; I'll never put them in my record player, which has also never been used. I don't want to constantly worry about 60 GB of MP3 files.
Oh no, that TV show I'll forget about in a year cost me $15/mo instead of $60 of Blu-rays.
I jump in my cars and hit a button and music plays. Almost any music I want. That's amazing.
I'm also not pirating games. I'm not 12 without a job. I have a job. I pay developers for their work. I want more games, like Kingdom Come 3, to come out.
Weird ass comment. You seriously think we're going to put our lives on hold to.. what, fight "digital media"? You think I care about Netflix? Or society's use of it? I haven't used Netflix in years. I don't know anybody under 40 with a Netflix account. Everyone on your end of the pirate spectrum uses debrid nowadays, anyway.
Next you're going to tell people to install the "Black XP Windows" edition to not support Microsoft and they all get malware and their credit cards stolen because they installed some pirated and modified cracked windows. Genius.
MSNBC just cancelled Andrea Mitchell's TV show, today, because she brought in no younger audiences. So yes, shows do get cancelled by not being watched.
This comment was upvoted? HN needs a break. This is some I'm 14 and edgy bullshit that sounds like it belongs on an Eastern European piracy forum.
Yes. And the problem here isn't that companies get away with doing things like this, the problem is that individuals don't. Attempting to lock information behind a nightmarish legal system is the problem.
I'm pretty much at the point now where I don't buy the "copyright incentivizes creation" argument any more. Copyright, like advertising, incentivizes creation by enormous corporations, but also like advertising it incentivizes creations that overwhelmingly have little value.
Creative individuals don't need copyright to be incentivized to create—they need a safety net that gives them the freedom to spend time on the creativity that naturally wants to bubble out. If the goal is to encourage creativity, copyright is a lousy and enormously expensive substitute for Universal Basic Income.
Also, in Canada, it's basically impossible to protect your IP as an individual due to the astronomical cost and lack of options to recover that cost. So copyright will never incentivize my creations, or those of any small creator.
Sure, creative people will always create, but the scope of that creativity will be limited if we do away with intellectual property. Steven Spielberg would probably always have created movies, but he wouldn't have been able to make Jurassic Park, Saving Private Ryan, or Indiana Jones without capital from the studio system, and the studio system wouldn't have provided him with that capital if they couldn't extract economic rents from the copyright for those films.
Nothing stops you from downloading Anna's Archive and training a model on it, right? The likelihood that you, as an individual, get sued over it is virtually zero.
This is what Meta tried to do, quietly download and use the data, to do research and advance their LLMs, without trying to establish any legal precedents or pick up fights.
> If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.
In case anybody here doesn't know, that's a reference to Aaron Swartz, an activist (and Reddit co-founder) who was facing up to 35 years in prison and a $1 million fine just for downloading a lot of academic papers from JSTOR. He eventually took his own life because of the pressure. May his soul rest in peace.
Except he was offered 6 months in a plea bargain, which he declined because he wanted a trial. Whether 6 months was reasonable punishment for "plug a laptop into a closet at MIT to download some scientific papers" is another matter, but "you forfeit your life" or "35 years in prison and a $1 million fine" is massively misleading.
> once people started uploading copyrighted TV shows to it
End users, not YouTube employees, right? And they would take things down following DMCA requests and what not, right? So, pretty much following the law?
> Google itself got big by indexing other people's data without compensation
Scraping public websites to build a search index isn't the same as making LLMs that can recreate the source verbatim devoid of even attribution. I do agree there's an argument to be had about the LLM's transformative nature in the end though.
> Spotify's music library was also pirated in the early days
Not any version generally available to the public, and with the copyright holder's permission to do so.
the english empire once tried to maintain a monopoly over steam loom machines
the americans cheated their way to competition,
heck, even before that, the english empire got jumpstarted by stealing gold from the spanish (who were themselves exploiting it away from aztec and other mexican natives)
I'm saying it's business as usual, but also, culture doesn't work like tangible physical widgets so we must stop letting a few steal this boon of digital copying by means of silly ideas like DRM, copyright, patents. all means to cause scarcity
The textile industry in Brno here in the Czech Republic (sometimes called "Moravian Manchester") was hugely helped by a local noble posing as a worker in England and smuggling detailed self-drawn plans of industrial machinery back:
"Brno’s fortunes were changed forever when a young freemason called Franz Hugo Salma set out for England in 1801. He intended to steal the plans for the most modern textile machinery in the world. His crime, the first recorded act of industrial espionage, boosted the competitiveness of Moravian textiles. Soon after smuggling the plans out disguised as a worker, and handing them over to Brno’s fledgling textile industry, Brno became the most important textile centre in the Habsburg empire."
You can even go see some of the original plans in a museum:
"Eleven designs are still preserved in the library of the Rájec chateau. They form a unique set of documents demonstrating both the level of wool processing technology at the turn of the late 18th and early 19th centuries, as well as the aims and means of the relatively rare business of industrial espionage at that time."
Interesting, if we're to trust what NotOpenAI and Facebook say about their IP, the US should pay the UK reparations for IP theft based on textile industry profits starting in the 1850s until today?
Why do I get sued when I share some BitTorrents but $bigcorp can just do it at 1000x the scale without problems?
The issue here is not copyright/patents/etc. The issue is that the law is applied selectively. The issue is that Aaron Swartz is dead for sharing knowledge with the public and Zuccborg is a billionaire building his torment nexus.
In Spotify’s defense, they used the pirated data only to show a proof of concept to the copyright holders, and that use was sanctioned by the local rights holders organization STIM.
The copyright holders then approved their concept, and subsequently Spotify got the rights to offer their service to customers. Everybody won.
That’s not entirely true, in Spotify’s early days you could upload files to the service and listen to songs uploaded by other people. I think the majority of any song I wanted to listen to before they went Europe-only for a time was “pirated”.
> Spotify's music library was also pirated in the early days.
I want to know more, please enlighten me (anyone who knows). I read the book "The Spotify Play" and it made it seem like the pirated music was an internal-only thing and not something available to customers. Is that true?
Before the launch, Spotify had a deal with the music rights holders association in Sweden (STIM) that they could use a merged collection of friends' and families' music libraries. All this was removed before Spotify went out of beta.
So while it was using pirated media, it was sanctioned by the rights holders for the experiment of building Spotify.
Users would upload their copies of the music and Spotify would replay them. This was obvious to early users, even if they were only consumers, because of the pirate-shout-out-overlays that were in a lot of the poorer quality releases.
Another interesting note: in the early days of Spotify, the app would saturate your upload bandwidth while you were using it. Given their close ties to uTorrent, I always assumed that's how they were affording the bandwidth as well.
Pretty brilliant way to bootstrap I guess; they didn't have to pay for bandwidth or content until they already had contracts in place
"Mood Machine: The Rise of Spotify and the Costs of the Perfect Playlist" by Liz Pelly goes into more detail about their origins and the culture around piracy in Sweden at the time.
Crunchyroll started off as a straight up piracy site, it now has millions of paying subscribers and was sold to Sony for over a billion a few years ago.
I think if Google attempted to download the entirety of JSTOR with the express intent of making the full dataset freely available, then Google would also face legal consequences.
It's true, and relevant, that Google would feel those consequences much less sharply than Swartz did.
Don't buy into the rhetoric and call it "consequences". It's always a choice to sue, a choice to prosecute, and this would be true even if these choices were made consistently and impartially (which they certainly aren't).
Google Scholar explicitly made direct deals with publishers to scrape their content, with the constraint that they can use the content to serve search results in Scholar but cannot show the content of the papers on the site, just titles and short fragments that match. The deals were tenuous, and I had to step carefully around my plan to use that database to implement large-scale scientific search over the literature (this was a long time before anybody was seriously considering using LLMs on research data).
I've spoken to several very wealthy/powerful people and tried to get them to negotiate a large-scale content license with the various publishers that would allow researchers and individuals to access more research in lower-friction ways. None of them (NIH, Schmidt, etc) were really interested.
Something to understand about capitalist competition (also in politics) is that it's a war. Not one with guns and bombs, but more like a cold war, with espionage and hacking and just generally doing anything you can to gain an advantage without bringing negative consequences on yourself.
The limit is what you can actually get away with, not what the rules say you can get away with, and the system aggressively selects players who recognize this. It's amoral - there is no "ought", only "is". An actor gets punished or not, with absolutely no regard to whether it "should" get punished. One thing is consistent: following the rules as written means you lose.
You can see it in Y Combinator (and other) startups. The biggest ex-startups are things like Airbnb (hotels, but we don't follow the rules and we don't get punished for not following them) and Uber (taxis, but we don't follow the rules and we don't get punished for not following them).
One way to not get punished for not following the rules is to invent a variation of the game where the rules haven't been written yet. I again refer you to Airbnb and Uber; Omegle also comes to mind, although they didn't monetize.
Viewed in this light, Aaron Swartz's mistake was not the part where he downloaded journal articles, but the part where he got caught downloading journal articles. Shadow library sites are doing the same thing, minus the getting caught. So are Meta and Google and OpenAI. Sci-Hub is only involved in a lawsuit because it got caught and is now in the stage where it finds out whether it gets punished or not.
Aaron committed suicide, and the FBI going after him was meant more as a lesson to the other kids at MIT than anything.
MegaUpload did the same, and Kim Dotcom got raided in his sleep by the FBI in New Zealand! So no, I don't buy your reductionist argument. There are forces at play that allow companies with founders like Google's to get away with it but not others.
> Youtube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it
To this day, there are a huge number of videos that show copyrighted content on YouTube; they are usually crappy clips, reversed and with different music playing in the background to avoid automated detection.
Pretty sure that even if you gave a purchasing team enough money for retail price and a list of all books ever published, they wouldn't be able to buy even a quarter of them.
Thanks to the byzantine copyright system, you can't easily do it. Plus, just speculating, but maybe paying establishes "consideration" for some implied contract? "You implicitly entered a contract with us by purchasing the book, then violated the contract by 'distributing' the material for commercial use"?
I briefly worked for Crunchyroll, which began life as an anime pirating service with subtitles. The contracts with the Japanese anime publishers came later. Now they vigorously protect their content from "pirates".
"Spotify's music library was also pirated in the early days."
"Ek, who had been the CEO of the piracy platform uTorrent, founded Spotify with his friend, another entrepreneur named Martin Lorentzon. Both - Ek at 23 and Lorentzon 37 - were already millionaires from the sales of previous businesses. The name Spotify had no particular meaning, and was not associated with music. According to Spotify Teardown, the company developed a software for improved peer-to-peer network sharing, and the founders spoke of it as a general "media distribution platform." The initial choice to focus on music, the founders said at the time, was because audio files are smaller than video files, not because of a dream of saving music.
In 2007, when Spotify first publicly tested its software, it allowed users to stream songs downloaded from The Pirate Bay, a service for unlicensed downloads. By late 2008, Spotify would convince music labels in Sweden to license music to the site, and unlicensed music was removed. From there, Spotify would take off across Europe and then the world."
And Hollywood was created on the West Coast because, as far as intellectual property went, it was still the Wild West, which allowed them to ignore patents on movie technologies.
It's roughly the Spotify story too. They had an extremely impressive catalog very early, way before they were bought by the entertainment cartel. The founders had background in torrenting and the initial product was quite similar to The Pirate Bay but with clearly capitalist ambitions and branding, in contrast to the anarchist leanings of the Pirate Bureau and rather anarchic attitude of The Pirate Bay.
The thing is Google, Meta and YouTube weren't giant entities when they did this stuff. I think it's good no one cracked down on them for copyright stuff. Now they're developing an LLM that will generate potentially trillions in value to humanity and it looks like they're not exactly playing by the rules. But I prefer looser intellectual property rights anyway so I'm OK with it
>But I prefer looser intellectual property rights anyway so I'm OK with it
I think more people, potentially anyway, would feel similarly about this if it applied even somewhat equally.
Instead, companies can seemingly do whatever they please whereas lawyers will send letters to your home for downloading a single episode of Game of Thrones.
Exactly. We need leaders with the political will to apply a "financial death penalty" to companies that engage in this kind of brazen behavior. That means all assets seized, the company dissolved, personal assets of executives seized, executives jailed. People running companies should live in mortal fear of ever doing the things that they routinely do today.
Do people even take civics classes anymore? That isn't how any of this works. Political will doesn't allow arbitrary punishments. You would need legislation at the very least, and that could face issues with the Eighth Amendment. (And it could not be applied ex post facto, of course.)
At least you're not calling for jailing all the shareholders....
VC and startups are fundamentally about disruption. You can't make an omelette without breaking a few eggs (laws). The incumbent players are not going to sit still and let things be "disrupted". A common response is to make sure the public knows about the broken eggs. I would say YouTube, Google, Spotify, Uber, DoorDash, etc. all have made my life much better.
You don't know a world without them so you actually have no idea if they have made your life compared to that world much better or much worse. How your life was at the time is irrelevant.
Reminds me of recent discussions about a similar topic: what may clearly look like a crime can be treated differently depending on whether you do it as an individual or as a company. Somewhere down the line it's all about understanding the limits and boundaries of the system; it's a skill in itself.
I think the difference may be that LLMs may not be laundered clean of copyrighted data anytime soon. Even if ChatGPT got big and profitable, it's not so clear that it won't contain copyrighted data, as that may simply be necessary to train the best models.
The modern solution has been to grow so fast that by the time anyone can go after you legally you've already amassed so much money/power that you can have the laws rewritten (or at least enforced) around your existence.
IMO part of the reason the SV tech bros are embracing right wing grift culture so publicly now is that this method, which had been serving them well for decades, doesn't really work without the infinite free money lending spigot being wide open.
You must be new to billionaire business practices: break the rules first, ask for forgiveness later.
By the time the cheque comes, your illicit venture either went bust or you built a billion dollar empire capable of buying the best lawyers and lobbying to walk away clean.
1.8 million people are in United States jails today. It isn't a death sentence, and it is a foreseeable consequence of some ethically-appropriate actions.
Supporting folks spending time in jail is a valuable role in any social movement.
I know of a company that poisoned an entire town! That's terrorism if done by an individual. The company still exists, just paid a settlement and carried on...
I agree with your point, but will split hairs on using the word "terrorism". I think that should be reserved for people that commit atrocities for some political aim. I'm fairly sure the company in question (I assume Union Carbide) did not poison the town to advance a political agenda.
https://www.youtube.com/watch?v=cy3piCUPIkc - VICE documentary and visit video. I think it contains an interview with an American woman who suffered from WR Grace and Company's asbestos mining and manufacturing in the USA; she says "they knew, they knew". WRG faced 129,000 personal injury claims and set aside $3 billion for settling asbestos-related lawsuits.
Are you talking about Bhopal in 1984? If so it would be an understatement to refer to half a million people as a “town”, and an overstatement to imply it was terrorism. Willful negligence, yes, but terrorism, no.
Google used to send customers to your site. Now they try to show you the information on their site so that the customer doesn't need to go to your site.
Even before the LLM-craze Google was showing their Answers box or whatever it was called at the top of the results that told you the answer (sometimes) so that you didn’t have to visit any website.
This frames Google's indexing of the web in a totally, abjectly wrong fashion. It wasn't "other people's data", it was data people published to the public internet, implicitly and explicitly granting permission to download through the act of serving that data without restriction to whoever navigated to a particular URL.
That's how the internet works. If you want private content, you need to put up a gate mechanism of some sort with authentication or other methods of restricting access. Without that, you are literally having your server "serve" the content to whoever asks for it, without restriction or exception, without ToS or meaningful contract or agreements.
You can't have it both ways. "But they didn't know" or other post-hoc claims of innocent people publishing content to the web being misled or confused or abused is infantilizing nonsense.
The web wouldn't have been as amazing and revolutionary and liberating if the fundamental public and open nature of its systems was private and walled off by default.
Your take on YouTube going viral initially over copyrighted content isn't correct, either - it was ease of use and access. It was fairly popular by the time Google bought it, and once it was reachable and advertised by Google itself, it exploded, because by that time, everyone had defaulted to using Google for search.
Other people corrected your Spotify take.
The reason they pirated is because it is functionally impossible to gain access to the data in any other way. For consumers, there are lots of old shows, music, and other content that aren't accessible, so they turn to piracy. A vast majority of the time, if content is accessible, people will pay and do the technically legal and "right" thing.
Publishers exploit authors and content creators in the name of "platforming" and "marketing", effectively doing as little as possible to take 90%+ of the value of a product and providing as little as possible to the producer of content or books or music. They get by on technicalities and have captured the legal arena entirely, with any attempt at reform or revolution meeting a messy death at the hands of lawyers and big money publishers.
Screw those people. They lie, cheat, and steal, and somehow have gotten away with fooling the world into thinking they're the good guys.
Copying bits and bytes is not stealing, and the ones trying to shill that narrative are trying to fool as many people as possible into giving them more money without any return of value in kind. I'd download the hell out of a car. Pirate everything.
The most outrageous thing about the whole story is that smart people (here and elsewhere) have known all this since day one. They've been uncovering it the whole time.
And in their face, with all the fierce ignorance, broligarchs deny, evade and totally pretend this never happened. The most non open company of all even went to lengths to accuse others of stealing their IP - not theirs to begin with.
Just think of it - why did all major content platforms close their APIs the day after GPT-2 got the word going…? Cause they knew all this very well - the content is precious and needed. They've been doing it all along. Distilling the essence of the world’s writing and digital imagery they had no right to.
We have a saying where I come from - no mercy for the chicken, no laws for the millions. I thought it was a local thing at first, but it turns out that's how the world goes. Nothing new under the sun, indeed.
A bigger lesson might be "don't get caught until you're big enough to destroy the people suing you."
Napster got shut down for widespread enabling of copyright infringement. So did numerous other filesharing startups, including Travis Kalanick's first startup, Scour. Lots of small startups get put out of business all the time for being sued and not having the money to defend themselves.
Likewise, individuals like Donald Trump or Elon Musk get away with all sorts of illegal shit, because they are big enough to shut down the court systems prosecuting them.
Google's genius was in staying under the radar and aligning their incentives with everyone that might dislike them, until they were big enough that they could simply crush anyone that might dislike them.
" If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life."
This is exactly what I immediately thought while reading the article. It almost feels like the legal system only punishes the general public, while most of these guys are above it.
Airbnb and Uber have shown us that laws matter only to the extent that the political will to enforce them exists. Throw enough lawyers and lobbying money at the problem and the laws can simply be re-written to be friendlier to your business model.
If you do something wrong then you, as a person, are held responsible and accountable.
If you do something wrong as "part of your job" then you're typically not held responsible and accountable but the company is (the exceptions being spectacular fraud: Enron, VW diesel).
It's not hard to see how this can go off the rails.
In more general terms, the legal system punishes whoever it can make a profit from or an example of.
Also, I don't think the legal system itself wants to get too much into "big institutions against the work of others". Save for the fictional TV representations of smart lawyers and clever arguments, 99.9% of the legal system's output is copy/paste.
>This is exactly what I immediately thought while reading the article. It almost feels like the legal system only punishes the general public, while most of these guys are above it.
Welcome to the modern-day aristocracy. Beyond what you mentioned, this world is also divided into a group of insiders who can get capital at 0 - 2%, while the rest of us pay 17%, 22% or 30%.
It doesn't "seem". The entire system in most countries works, by design, that way because the people in power trade in influence at a different plane.
That's why democracy often feels "failed", in that no change can be achieved because "it's just more of the same". A few lobbyists representing the interests of a few people have more power than millions voting differently.
When individuals are assigned heroic status despite clear evidence of mental illness and crimes, such as “breaking and entering”, it prevents society from having rational discussions about both law enforcement and mental health support. This dynamic repeats across multiple high-profile cases.
People often elevate deeply flawed figures to heroic status when those figures seem to challenge authority or "the system." This happens especially with individuals who present themselves as outsiders fighting the establishment, have a compelling personal struggle narrative, or voice grievances that resonate with public frustrations.
Trump fits this pattern - his supporters overlook concerning behaviors and statements because they see him as fighting a system they distrust. Like Manning and Swartz, his mental state and fitness are often ignored in favor of the "hero against the system" narrative.
This dynamic creates a feedback loop where legitimate criticism becomes harder to discuss rationally.
It is more a money thing. Meta can pay x billion like pocket change. Regular people are run through the wringer to teach the plebs to not get out of line.
It's not a feeling. It's exactly what happens. It's completely blatant.
For some reason, whenever you're a billionaire or company, things suddenly get so difficult that you can claim that it's impossible to be held accountable for anything. Murder, insider trading, laundering, treason, etc.
OpenAI complained about this, as did Google and everyone else. If your company can't exist without stealing data, then it's not a viable company. Companies don't have a constitutional right to exist.
> Google itself got big by indexing other people's data without compensation
Wrong.
a) Robots.txt which defines what content you wish to make available to third parties predates every search engine including Google. Web site owners chose to make it available to Google and search engines have respected their wishes despite it not being in their best interest.
b) The difference here is that OpenAI, Meta etc have not even tried to honour the wishes of copyright holders. They just considered everything as theirs.
c) Google grew big because it had no ads, fast interface and PageRank was significantly better. It wasn't because it had the most comprehensive index.
> Web site owners chose to make it available to Google.
Strong disagree. Since robots.txt is optional and the default is "crawl me as you please", website owners don't "choose to make it available", they just don't choose to make it non-available.
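For what it's worth, the "default is crawl me as you please" point is easy to check with Python's standard-library robots.txt parser. This is only a minimal sketch: the crawler name `ExampleBot`, the example.com URLs, and the helper `allowed` are made up for illustration and are not from anything above.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file's lines; an empty string yields no rules at all.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical crawler name and URLs, for illustration only.
AGENT = "ExampleBot"
PRIVATE = "https://example.com/private/page.html"
PUBLIC = "https://example.com/index.html"

# A site that publishes no rules: nothing is disallowed.
no_rules = ""

# A site that opts one directory out for every crawler.
opt_out = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    print(allowed(no_rules, AGENT, PRIVATE))  # no rules, so the fetch is allowed
    print(allowed(opt_out, AGENT, PRIVATE))   # explicitly disallowed path
    print(allowed(opt_out, AGENT, PUBLIC))    # path not covered by any rule
```

With no rules at all the parser says yes to everything, so a site owner who does nothing has not so much chosen to be indexed as not chosen to opt out. Note too that this only shows what the file says; nothing in the format forces a crawler to check it.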
The more I learn about how AI companies trained their models, the more obvious it is that the rest of us are just suckers. We're out here assuming that laws matter, that we should never misrepresent or hide what we're doing for our work, that we should honor our own terms of use and the terms of use of other sites/products, that if we register for a website or piece of content we should always use our work email address so that the person or company on the other side of that exchange can make a reasonable decision about whether we can or should have access to it.
What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.
And if you were in any doubt before, this lesson is now exemplified by the holder of the highest office in the land and approved by popular vote. The rewards of acting ethically are, unfortunately, sometimes only personal. This must be a hard environment to raise children in, given the examples they see around them.
Parent here: it takes a lot of discussion, but it's a great time to talk about the reality of evil and villains. My kids are on the good side, or at least I like to think so ...
This argument may focus too much on the category of external rewards.
I might well be kidding myself or self-justifying, but I believe internal rewards are at least as important. Some materially successful people are deeply unhappy.
>What we should have been doing all along is YOLO-ing everything
No it isn't. The actual sucker attitude is copying what they do. You should act morally and with integrity out of respect for yourself. I never had any illusions that large tech companies act with respect towards the law, but it also has nothing to do with me.
If you have a spare few hours, the Acquired podcast episode on Meta is enlightening. They just stumbled through growth hack experiment after experiment without seemingly any risk assessment or ethics.
This sort of mindset is devoid of morals and honor. Don’t fall into this mindset trap.
Like when Trump said he is “smart” for evading taxes during the presidential debates (IIRC the first ones, not recent ones).
It’s absolutely despicable. Have a moral compass. Treat people fairly. Be nice. Let’s be better than toddlers who haven’t learned yet that hitting is bad, and you shouldn’t do it even if mommy and daddy aren’t in the room.
I agree with you and will die to defend that position, but what existential reasons do we have to behave well? In reality, it seems like humans are just a bunch of animals, and the only important thing is survival.
My wife, just today, told me that she was very upset that I refused to interview and take jobs for things like building weapons, the panopticon, or advertising (two of those are the same thing), which I refuse to do because of my personal morals and ethics. How do I explain to her that I just can't do that, and give her a good reason why we should lose our home and live in one room with her mother because of my brain refusing to work in such industries? I really want to know, so I can explain to her and my son why such things matter, because for some reason they are concrete and foundational in my brain, there is no changing that.
I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:
> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.
Following that reference:
> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
Furthermore, they state they trained on GitHub, web pages, and ArXiv, all of which contain copyrighted content.
Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material. That it was trained on copyrighted material is certain.
Critically, by torrenting they also directly distributed the copyrighted material itself. That is a standalone infringement separate from any argument about trained LLMs.
They could have only leeched and refrained from sharing any part of the copyrighted data. If I were to commit something as risky as this, that is what I would do.
And punishing them in the normal manner will be an incredibly small slap on the wrist, and do absolutely nothing to help us find out what will play out in court regarding a fair-use defense on training AI with copyrighted material.
There are two different things when it comes to discussing training LLMs on "copyright" protected data, and I almost never see people differentiate.
1.) Training on copyright that is publicly available. You write a poem and publish it online for the world to read. That is your IP, no one else can take it and sell it, but they are free to read and be inspired by it. The legality of training on this is in the courts, but so far seems to be going in favor of LLMs.
2.) Training on copyright that is not publicly available. These are pretty much pirated works or works obtained by backdoor to avoid paying for them. Your poem is behind a paywall and you never got paid, yet the poem is known by the LLM. This is just straight illegal, as you legally must pay to view the work. However there might be conditions here too like paying for access to an archive and then training on everything in it.
I never gave my poem to Facebook. My site is for humans. And there was absolutely no problem with that website being public, until Facebook et al wanted to move the goalpost.. again.
Remember when companies started to claim that their abuse is on you, because you failed to publish the correct headers/robots.txt and their bot needs to be told the rules in specific language? And now we get the same attempt at making such a distinction again, just this time it's our fault for .. having a public website in the first place (should have operated a paywall, duh!)
3.) The company making an unauthorized copy of your work and storing it permanently in a giant corporate library of their own making which they refer to over and over.
This is distinct from (1) where the content is streamed or only ephemeral/incidental copies are made.
The very idea that LLMs are "inspired" by copyright material is so far beyond absurd I just don't know what reality you people live in. They are ingesting copyright material in order to re-use it. Yeah they remix it to add their own (incredibly annoying) tone but that's what they're doing.
Authors can claim that they allow public use unless it's for training LLMs. Then all training work would fall under 2, because the works would be used against the terms of the copyright.
I'm not sure there's any legal distinction though.
Is a book publicly available? No, you have to purchase it. But once you do, you're legally allowed to let your friends and family and so forth read it too. As long as you don't sell copies of it (the "copy" part of "copyright"), or meaningfully take away the ability for the publisher to make money from sales (so you can't post it for the whole world to see on the internet).
And sure, there are lots of ToS for digital works, but are they actually enforceable? ToS can say you're not allowed to let anyone else read the book you purchased. But no court is going to say you can't lend your Kindle to your friend for them to read it too. Many ToS clauses are flat-out illegal.
Meta will argue that training on books is no different from reading all the books at a friend's house. That as long as Meta isn't reselling or making publicly available the original text, they're in the clear.
Trained on doesn't mean significant inclusion in the final state.
Is it truly a violation of copyright when a user hacks out bits and pieces of easily restyled raw data points from a model to look samey? What about if it takes two models? Might be time to accept humans are just cooked in their ability to discern attempts at direct plagiarism - just as it is hard to discern Sky voice from Her voice.
I strongly urge people to read Thomas Babington Macaulay's speeches on copyright, its aims, terms, and hazards. Very well reasoned and explained.
In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.
The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But I have done what I could to form a correct estimate of their value. I have read translations of the most celebrated Arabic and Sanskrit works. I have conversed both here and at home with men distinguished by their proficiency in the Eastern tongues. I am quite ready to take the Oriental learning at the valuation of the Orientalists themselves. I have never found one among them who could deny that a single shelf of a good European library was worth the whole native literature of India and Arabia."
This is the corollary of the fallacy of appeal to authority: the rejection of an argument on the grounds that the speaker was horribly wrong on an unrelated or very loosely related topic.
If you reject Macaulay on copyright because he was an imperialist, you can use the exact same logic to reject the arguments of essentially every person who ever lived. Very few humans who ever wrote anything important will perfectly align with your morality, and most will be horribly misaligned in at least one way.
Very nice of you to omit the following sentences of that excerpt, where it goes on to develop its argument for instituting an English-language education system in British India. He praised how superior the Sanskrit and Arabic corpora were, in quantity and quality, to European works in lyric and poetry, but held that no technical or didactic literature in them amounted to even the most mundane of the European manuals then used in England's humble schools (which seems completely plausible).
He was a fierce abolitionist. So much for accomplishing the mission of allegedly, judging by comments in this thread, 'deranged imperialist destruction and chaos imposition over the lesser ones'.
I'm not well versed in his speeches/stance on copyright, but I can vouch for the fact that the most honest and well-intended moves (not by him, by other figures) in defence of everyone's intellectual property were made in the same century. From the Twentieth onwards, it has only been twisted for the interest of a select few, and needless to ask where we are today in terms of caring about the intellectual property of anybody.
[1] Just saw your other comment where you go on with his nauseating words. One just cannot comprehend that framing the past on the actual status quo is as futile as to not being even wrong, I guess?
I’m a huge IP hater and am sure that happens, but to be fair, letting copyright extend past death also increases the amount the author can sell it for in the first place.
The current workaround is to attribute footnotes to your beneficiaries, or quote them in the dedication. Those become derivative works subject to the lifetime of your beneficiary.
> in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher
He was able to sell it because it is something valuable, exactly because of the copyright protections. Regardless of whether author sells the rights or not, he and his family would equally be better off with copyright.
Copyright infringement isn't stealing. I will die on this hill!
Also, I don't think that implication is required, but let's pretend the implication is the only reasonable conclusion one could draw. Maybe it does make it acceptable?
If the vast majority of copyright enforcement isn't to protect creators of valuable work, but only serves to enrich those who take advantage of those creators, then isn't it not just reasonable or acceptable, but ethically required for someone to do everything they can to dismantle the systems they're abusing against the interests of those who are actually improving the world with their creations?
Libgen is a civilizational project that should be endorsed, not prosecuted. I hope one day people will look at it and think how stupid we were today to shun the largest collection of literary works in human history.
Anna's Archive encourages (and monetizes!!) the use of their shadow library for LLM training. They have a page dedicated to it on their site. You pay them, and they give you high download speeds to entire datasets.
Libgen turns into a problem when you have a company developing generative AI with it, either giving money to GPU manufacturers or themselves with paid services (see OpenAI)
…why? Will people buy less books because we have intuitive algorithms trained on old books?
Personally, I strongly believe that the aesthetic skills of humanity are one of our most advanced faculties — we are nowhere close to replacing them with fully-automated output, AGI or no.
I think you’re overstating its importance. The internet already makes it possible to order almost any book in existence and have it arrive at your doorstep within a week or so, or often on your ebook reader instantly. And your local library probably participates in an interlibrary loan system that lets you request any book held by any library in the country for free.
LibGen gives you access to a much smaller body of works than either of those. It’s a little more convenient. But the big difference is that it doesn’t compensate the author at all.
And what about the other billions of people on the planet that don't even have a library, let alone a doorstep to receive a first world delivery service.
2. DRM is built in to most purchased ebooks, which means you can’t consume the book on any device. “Illegal” tools exist to circumvent this.
3. Large ebook stores - like other digital stores - essentially lend you a copy of the book. So when they are forced to pull a book, they’ll pull your access too.
Of course, now that the big players have consumed/archived the entire book dump, they can go ahead and kill it to prevent others from doing the same thing.
It is *much* more convenient. When a research path takes me to an article or book - I could buy or order or go to a physical library, that would take hours or days. I could also open it as a PDF in seconds. If you need to read a chapter from a book, or an article, or skim such checking to see if it's worthwhile, 20-30 times to figure something out, then libgen is the difference between finishing in a day or a month.
There are a whole lot of books that are out of print, and if a book went out of print before ebooks were a thing, it probably doesn't have a legal digital edition either.
Libraries can burn down (see Library of Alexandria), civilizations end (see various). LibGen makes it possible for an individual to backup a snapshot of cumulative human knowledge, and I think that's commendable.
> LibGen gives you access to a much smaller body of works than either of those.
> Just go to a real library.
The thrill of waiting a week for a book to arrive or navigating the labyrinthine interlibrary loan system is truly a privilege that many can afford. And who needs instant access to knowledge when you can have the pleasure of paying for shipping or commuting to a physical library?
It's also fascinating that you mention compensating authors, as if the current publishing model is a paragon of fairness and equity. I'm sure the authors are just thrilled to receive their meager royalties while the rest of the industry reaps the benefits.
LibGen, on the other hand, is a quaint little website that only offers access to a vast, sprawling library of texts, completely free of charge and accessible to anyone with an internet connection. I'm sure it's totally insignificant compared to the robust and equitable systems you mentioned.
Your suggestion to "just go to a real library" is also a brilliant solution, assuming that everyone has the luxury of living near a well-stocked library, having the time and resources to visit it, and not having any other obligations or responsibilities. I'm sure it's not at all a tone-deaf, out-of-touch recommendation.
We all like hating big corporations, especially Meta, and people seem to use this as an opportunity to advocate for punishing them. I think it's wiser to advocate for changing our IP laws.
While Aaron Swartz was bullied to suicide, these corporations will walk free and make billions. I say give every tech CEO the Swartz treatment, then change the law.
Swartz committed suicide because he was mentally ill. He also attempted suicide multiple times in his life while not being "bullied".
If he was acting rationally and came to the conclusion that dying was better than spending X years in jail, he would have committed suicide after sentencing, not before any trial had even happened.
Double standards are how the law has been practiced since time immemorial. Copyright is Disney-Sony law made up a few decades ago for no reason other than money. Pick your battles.
Two wrongs don't make a right. If a law is unjust, then what good is there in continuing to punish people who have broken it, just because other people have been punished in the past?
Either you think the law is just or unjust. If you think it's unjust, I don't possibly see how you think people should be punished for it. Meta wasn't responsible for what happened to Aaron Swartz.
Big corporations are too big, they should just not exist. When you have corporations more powerful than the government of the biggest states, it's a bug, not a feature.
The IP laws may need rethinking. Saying that they should disappear because big corporations are above the law doesn't help, though. First kill the big corporations, then think about fair laws. Changing the law now would not change anything since those corporations are already above the law.
Perhaps they just did, or we are doing it - basically this should lead to abolition of copyright to any published article there is. Not sure how’d it impact open source, we’ll either have all of it open, or none at all.
Big corporations don't have morals or ethics. They'll break any laws as long as it's profitable. There's no point complaining about Meta or Zuck. Meta does what it's designed to do. If people aren't happy, they should vote for more regulations.
We may in retrospect find that the moment may have passed where "big corporations" have become more powerful and impactful on our lives than the IP laws on the books. After all, we can already plainly see they only come into effect when useful by the powerful
Something tells me stronger IP laws will be drafted by holders of that IP, with little if any regard to the potential for job losses for regular people from AI.
Most of the public has jobs based upon IP? While it is probably a bigger share than farming, I doubt that. The actual drivers appear to be a mixture of hysteria, and reflexive anti-corporate sentiment as we see even self-proclaimed leftists going "WTF, I love copyright now!".
It really makes you think about those crazy internet folks from back in the day who thought copyright law was too strict and that restricting humanity to knowledge in such a way was holding us all back for the benefit of a tiny few.
I'm all for chopping up copyright law. But until we do so, companies like Meta need to be treated just like everyone else.
That means lawsuits, prison sentences, and millions in fines. And that's just the piracy part, there's also the lying/fraud part.
Interestingly, a Dutch LLM project was sent a cease and desist after the local copyright lobby caught wind of it being trained on a bunch of pirated eBooks. The case unfortunately wasn't fought out in court, because I would be very interested to see if this could make that copyright lobby take down ChatGPT and the other AI companies for doing the same.
The more concerning thing is that the best thing these overpaid people could come up with was.. download the torrent, like everyone else. Here you are, billions of resources, and no one is willing to spend a part of it to at least digitize some new data? Like even Google did?
I think they are morally required to improve the current state.
- Seed the torrent and publicly promote piracy pushing lawmakers.
- Contribute with digitisation and open access like Google did in the past.
- Make the part of their dataset that was pirated publicly accessible.
- Fight stupid copyright laws. I can't believe that copyright lasts more than 20 years. No field moves that slowly, and there should be tighter limits on faster moving fields.
This hasn't been my observation. Instead, I see a society where people regularly help and serve one another, frequently for free. Consider parents, social workers, most academics, food banks, charity in general, most workers in most businesses, et cetera. I wonder: who do you know and work with? A minority of people profit wonderfully off this. Incidentally, they seem to also preach principles that can only lead to the end of their gravy train.
You can counter by insisting that these "altruistic" behaviors are simply less directly but still in the altruist's interest. I would entirely agree.
Yep, you even see it on HN with artists and devs complaining about AI, especially when things like ChatGPT and Stable Diffusion were first announced. People who were pretty lax about copyright when it didn't affect them personally suddenly became copyright maximalists, talking about "stealing, theft, etc." Since then, people have calmed down and realized that AI is simply a tool like any other.
Beyond illegal downloading and distribution of copyrighted content, the article also describes how Meta staff seemingly lied about it in depositions (including, potentially, Mark Zuckerberg himself).
> Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.
> All the sad poor people who might be hurt were already paid. The caterer on your favorite show is not getting residuals. NBC also isn't going to stop making TV shows because that is all they can do. Content creators also existed on the internet long before that was a job. They just did it because they cared about it not for ad money. If you really want to support the artist directly go to a concert or just mail them a check. If you can't actually identify a person who might be hurt, then do not care.
Oh no, that TV show I'll forget about in a year cost me $15/mo instead of $60 of blurays.
I jump in my cars and hit a button and music plays. Almost any music I want. That's amazing.
I'm also not pirating games. I'm not 12 without a job. I have a job. I pay developers for their work. I want more games, like Kingdom Come 3, to come out.
Weird ass comment. You seriously think we're going to put our lives on hold to.. what, fight "digital media"? You think I care about Netflix? Or society's use of it? I haven't used Netflix in years. I don't know anybody under 40 with a Netflix account. Everyone on your end of the pirate spectrum uses debrid nowadays, anyway.
Next you're going to tell people to install the "Black XP Windows" edition to not support Microsoft and they all get malware and their credit cards stolen because they installed some pirated and modified cracked windows. Genius.
MSNBC just cancelled Andrea Mitchell's TV show, today, because she brought in no younger audiences. So yes, shows do get cancelled by not being watched.
This comment was upvoted? HN needs a break. This is some "I'm 14 and edgy" bullshit that sounds like it belongs on an Eastern European piracy forum.
I'm pretty much at the point now where I don't buy the "copyright incentivizes creation" argument any more. Copyright, like advertising, incentivizes creation by enormous corporations, but also like advertising it incentivizes creations that overwhelmingly have little value.
Creative individuals don't need copyright to be incentivized to create—they need a safety net that gives them the freedom to spend time on the creativity that naturally wants to bubble out. If the goal is to encourage creativity, copyright is a lousy and enormously expensive substitute for Universal Basic Income.
This is what Meta tried to do, quietly download and use the data, to do research and advance their LLMs, without trying to establish any legal precedents or pick up fights.
In case anybody here doesn't know, that's a reference to Aaron Swartz, an activist (and Reddit co-founder) who was facing 35 years in prison and a $1 million fine just for downloading a lot of academic papers from JSTOR. He eventually took his life because of the pressure. May his soul rest in peace.
End users, not YouTube employees, right? And they would take things down following DMCA requests and what not, right? So, pretty much following the law?
> Google itself got big by indexing other people's data without compensation
Scraping public websites to build a search index isn't the same as making LLMs that can recreate the source verbatim devoid of even attribution. I do agree there's an argument to be had about the LLM's transformative nature in the end though.
> Spotify's music library was also pirated in the early days
Not any version generally available to the public, and with the copyright holder's permission to do so.
the americans cheated their way to competition,
heck, even before that, the english empire got jumpstarted by stealing gold from the spanish (who were themselves exploiting it away from aztec and other mexican natives)
I'm saying it's business as usual, but also, culture doesn't work like tangible physical widgets so we must stop letting a few steal this boon of digital copying by means of silly ideas like DRM, copyright, patents. all means to cause scarcity
"Brno’s fortunes were changed forever when a young freemason called Franz Hugo Salma set out for England in 1801. He intended to steal the plans for the most modern textile machinery in the world. His crime, the first recorded act of industrial espionage, boosted the competitiveness of Moravian textiles. Soon after smuggling the plans out disguised as a worker, and handing them over to Brno’s fledgling textile industry, Brno became the most important textile centre in the Habsburg empire."
You can even go see some of the original plans in a museum:
"Eleven designs are still preserved in the library of the Rájec chateau. They form a unique set of documents demonstrating both the level of wool processing technology at the turn of the late 18th and early 19th centuries, as well as the aims and means of the relatively rare business of industrial espionage at that time."
https://www.gotobrno.cz/en/brno-phenomenon/this-is-brno-kate...
https://www.gotobrno.cz/en/place/salm-reifferscheidt-palace/
Cain killed Abel and got away with it!! I can kill someone today too!!!
The issue here is not copyright/patents/etc. The issue is that the law is applied selectively: Aaron Swartz is dead for sharing knowledge with the public and Zuccborg is a billionaire building his torment nexus.
The copyright holders then approved their concept, and subsequently Spotify got the rights to offer their service to customers. Everybody won.
I want to know more, please enlighten me (anyone who knows). I read the book "The Spotify Play" and it made it seem like the pirated music was an internal-only thing and not something available to customers. Is that true?
So while it was using pirated media, it was sanctioned by the rights holders for the experiment of building Spotify.
Another interesting note: in the early days of Spotify, the app would saturate your upload bandwidth while using it. Given their close ties to uTorrent, I always assumed that's how they were affording the bandwidth as well.
Pretty brilliant way to bootstrap I guess; they didn't have to pay for bandwidth or content until they already had contracts in place
https://lizpelly.info/book
Just to point out, the material in question was public domain, so nobody even had a copyright claim over it.
It's true, and relevant, that Google would feel those consequences much less sharply than Swartz did.
I've spoken to several very wealthy/powerful people and tried to get them to negotiate a large-scale content license with the various publishers that would allow researchers and individuals to access more research in lower-friction ways. None of them (NIH, Schmidt, etc) were really interested.
Apparently he would have gotten away with downloading the JSTOR database if he made it clear that he intended to only publish half of each paper.
The limit is what you can actually get away with, not what the rules say you can get away with, and the system aggressively selects players who recognize this. It's amoral - there is no "ought", only "is". An actor gets punished or not, with absolutely no regard to whether it "should" get punished. One thing is consistent: following the rules as written means you lose.
You can see it in Y Combinator (and other) startups. The biggest ex-startups are things like Airbnb (hotels but we don't follow the rules but we don't get punished for not following them) and Uber (taxis but we don't follow the rules but we don't get punished for not following them).
One way to not get punished for not following the rules is to invent a variation of the game where the rules haven't been written yet. I again refer you to Airbnb and Uber; Omegle also comes to mind, although they didn't monetize.
Viewed in this light, Aaron Swartz's mistake was not the part where he downloaded journal articles, but the part where he got caught downloading journal articles. Shadow library sites are doing the same thing, minus the getting caught. So are Meta and Google and OpenAI. Sci-Hub is only involved in a lawsuit because it got caught and is now in the stage where it finds out whether it gets punished or not.
Turns out there are 2 simultaneous wars there. One where companies and individuals compete ruthlessly.
And another one where if non profit associations of individuals form, guns come out.
MegaUpload did the same, and Kim Dotcom got raided in his sleep by the FBI in New Zealand! So no, I don't buy your reductionist argument. There are forces at play that allow companies with founders like Google's to get away with it but not others.
To this day, there are a huge number of videos that show copyrighted content on YouTube; they are usually crappy clips, reversed and with different music playing in the background to avoid automated detection.
I don't understand why you wouldn't just buy copies of the books. Seems like such a relatively inexpensive way to strengthen your legal case.
Or so they think, I think.
Some can steal from stores and see no repercussions.
Some can steal from others and see no repercussions.
Some can violently harm others and see no repercussions.
Some can damage property and see no repercussions.
Some can’t. This world is not right.
"Ek, who had been the CEO of the piracy platform uTorrent, founded Spotify with his friend, another entrepreneur named Martin Lorentzon. Both (Ek at 23 and Lorentzon 37) were already millionaires from the sales of previous businesses. The name Spotify had no particular meaning, and was not associated with music. According to Spotify Teardown, the company developed a software for improved peer-to-peer network sharing, and the founders spoke of it as a general "media distribution platform." The initial choice to focus on music, the founders said at the time, was because audio files are smaller than video files, not because of a dream of saving music.
In 2007, when Spotify first publicly tested its software, it allowed users to stream songs downloaded from The Pirate Bay, a service for unlicensed downloads. By late 2008, Spotify would convince music labels in Sweden to license music to the site, and unlicensed music was removed. From there, Spotify would take off across Europe and then the world."
https://qz.com/1683609/how-the-music-industry-shifted-from-n...
This is the inevitable.
I think more people, potentially anyways, would feel similarly about this if it applied even somewhat equally.
Instead, companies can seemingly do whatever they please whereas lawyers will send letters to your home for downloading a single episode of Game of Thrones.
At least you're not calling for jailing all the shareholders....
So in other words, it got big by providing free user traffic to people's websites without asking for compensation?
You generally don't charge the phone book money to include you in it. It's actually the other way around.
IMO part of the reason the SV tech bros are embracing right wing grift culture so publicly now is that this method, which had been serving them well for decades, doesn't really work without the infinite free money lending spigot being wide open.
By the time the cheque comes, your illicit venture has either gone bust or you have built a billion dollar empire capable of buying the best lawyers and lobbying to walk away clean.
I’m opposed to copyright and pro-aaronsw, but the state did not kill him.
1.8 million people are in United States jails today. It isn't a death sentence, and it is a foreseeable consequence of some ethically-appropriate actions.
Supporting folks spending time in jail is a valuable role in any social movement.
https://en.wikipedia.org/wiki/Asbest
https://www.youtube.com/watch?v=cy3piCUPIkc - VICE documentary and visit video. I think it contains an interview with an American woman who suffered from WR Grace and Company's asbestos mining and manufacturing in the USA; she says "they knew, they knew". WRG faced 129,000 personal injury claims and set aside $3 bn for settling asbestos-related lawsuits.
Weird framing given how much value was and is still placed on Google driving traffic to you
Google used to send customers to your site. Now they try to show you the information on their site so that the customer doesn't need to go to your site.
Basically the entire legal system needs to be retooled and rethought for computers.
And the legal system is for humans not computers.
That's how the internet works. If you want private content, you need to put up a gate mechanism of some sort with authentication or other methods of restricting access. Without that, you are literally having your server "serve" the content to whoever asks for it, without restriction or exception, without ToS or meaningful contract or agreements.
You can't have it both ways. "But they didn't know" or other post-hoc claims of innocent people publishing content to the web being misled or confused or abused is infantilizing nonsense.
The web wouldn't have been as amazing and revolutionary and liberating if the fundamental public and open nature of its systems was private and walled off by default.
Your take on YouTube going viral initially over copyrighted content isn't correct, either - it was ease of use and access. It was fairly popular by the time Google bought it, and once it was reachable and advertised by Google itself, it exploded, because by that time, everyone had defaulted to using Google for search.
Other people corrected your Spotify take.
The reason they pirated is because it is functionally impossible to gain access to the data in any other way. For consumers, there are lots of old shows, music, and other content that aren't accessible, so they turn to piracy. A vast majority of the time, if content is accessible, people will pay and do the technically legal and "right" thing.
Publishers exploit authors and content creators in the name of "platforming" and "marketing" , effectively doing as little as possible to take 90%+ of the value of a product and providing as little as possible to the producer of content or books or music. They get by on technicalities and have captured the legal arena entirely, with any attempt at reform or revolution meeting a messy death at the hands of lawyers and big money publishers.
Screw those people. They lie, cheat, and steal, and somehow have gotten away with fooling the world into thinking they're the good guys.
Copying bits and bytes is not stealing, and the ones trying to shill that narrative are trying to fool as many people as possible into giving them more money without any return of value in kind. I'd download the hell out of a car. Pirate everything.
And in their face, with all the fierce ignorance, broligarchs deny, evade and totally pretend this never happened. The most non open company of all even went to lengths to accuse others of stealing their IP - not theirs to begin with.
Just think of it - why did all major content platforms close their APIs the day after GPT-2 got the word going…? Because they knew all this very well - the content is precious and needed. They've been doing it all along. Distilling the essence of the world’s writing and digital imagery they had no right to.
We have a saying where I come from - no mercy for the chicken, no laws for the millions. I thought it was a local thing at first, but it turns out that is how the world goes. Nothing new under the sun, indeed.
Napster got shut down for widespread enabling of copyright infringement. So did numerous other filesharing startups, including Travis Kalanick's first startup, Scour. Lots of small startups get put out of business all the time for being sued and not having the money to defend themselves.
Likewise, individuals like Donald Trump or Elon Musk get away with all sorts of illegal shit, because they are big enough to shut down the court systems prosecuting them.
Google's genius was in staying under the radar and aligning their incentives with everyone that might dislike them, until they were big enough that they could simply crush anyone that might dislike them.
This is exactly what I immediately thought while reading the article. It almost feels like the legal system only punishes the general public, while most of these guys are above it.
> There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.
If you do something wrong as "part of your job" then you're typically not held responsible and accountable but the company is (the exceptions being spectacular fraud: Enron, VW diesel).
It's not hard to see how this can go off the rails.
It’s because the legal system is not about justice, it’s about money
Most people can’t afford lawyers or expensive legal battles
On the other hand, individuals and organizations with a lot of money get to weaponize and exploit the legal system to their advantage
“To my friends, anything; to my enemies, the law”
In more general terms, the legal system punishes whatever can be turned into a profit or an example.
Also, I don't think the legal system itself wants to get too much into "big institutions against the work of others". Save for the fictional TV representations of smart lawyers and clever arguments, 99.9% of the legal system's output is copy/paste.
I think Aaron Swartz went to Harvard, not MIT
https://en.wikipedia.org/wiki/United_States_v._Swartz
Welcome to the modern day aristocracy. Not only what you mentioned: this world is also divided into a group of insiders who can get capital at 0 - 2%, while the rest of us have a cost of 17%, 22% or 30%.
That's why democracy often feels "failed", in that no change can be achieved because "it's just more of the same". A few lobbyists representing the interests of a few people have more power than millions voting differently.
"This problem will be solved in the favor of the (party) which has the most money to throw into the problem" (paraphrase mine).
So, yeah.
[0]: https://www.youtube.com/watch?v=LrkAORPiaEA
People often elevate deeply flawed figures to heroic status when those figures seem to challenge authority or "the system." This happens especially with individuals who present themselves as outsiders fighting the establishment, have a compelling personal struggle narrative, or voice grievances that resonate with public frustrations.
Trump fits this pattern - his supporters overlook concerning behaviors and statements because they see him as fighting a system they distrust. Like Manning and Swartz, his mental state and fitness are often ignored in favor of the "hero against the system" narrative.
This dynamic creates a feedback loop where legitimate criticism becomes harder to discuss rationally.
For some reason, whenever you're a billionaire or company, things suddenly get so difficult that you can claim that it's impossible to be held accountable for anything. Murder, insider trading, laundering, treason, etc.
OpenAI complained about this, as did Google and everyone else. If your company can't exist without stealing data, then it's not a viable company. Companies don't have a constitutional right to exist.
Wrong.
a) Robots.txt, which defines what content you wish to make available to third parties, predates Google and most of today's search engines. Web site owners chose to make it available to Google, and search engines have respected their wishes despite it not being in their best interest.
b) The difference here is that OpenAI, Meta etc have not even tried to honour the wishes of copyright holders. They just considered everything as theirs.
c) Google grew big because it had no ads, fast interface and PageRank was significantly better. It wasn't because it had the most comprehensive index.
Strong disagree. Since robots.txt is optional and the default is "crawl me as you please", website owners don't "choose to make it available", they just don't choose to make it non-available.
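To make that default concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot names, the URL and the `allowed` helper are made up for illustration, and the rules are parsed from strings rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/books/1"  # hypothetical URL

# No rules at all: a compliant crawler treats everything as allowed.
no_rules = allowed("", "ExampleBot", URL)

# The site owner has to opt out explicitly, here for one named crawler.
OPT_OUT = "User-agent: ExampleBot\nDisallow: /\n"
opted_out = allowed(OPT_OUT, "ExampleBot", URL)

# A crawler that is not named in the file is still allowed by default.
other_bot = allowed(OPT_OUT, "OtherBot", URL)

print(no_rules, opted_out, other_bot)
```

This only models a crawler that chooses to honour the file; nothing in the protocol itself enforces it.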
What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.
Suckers. All of us.
I might well be kidding myself or self-justifying, but I believe internal rewards are at least as important. Some materially successful people are deeply unhappy.
Quibble: The majority of people voted against Trump, or at least not in his favor. He only got a plurality, not a majority.
No it isn't. The actual sucker attitude is copying what they do. You should act morally and with integrity out of respect for yourself. I never had any illusions that large tech companies act with respect towards the law, but it also has nothing to do with me.
Not quite. It's only illegal if you get caught and you are the wrong kind of person.
For the right kind of person not even a pat on the wrist.
Like when Trump said he is “smart” for evading taxes during the presidential debates (IIRC the first ones, not recent ones).
It’s absolutely despicable. Have a moral compass. Treat people fairly. Be nice. Let’s be better than toddlers who haven’t learned yet that hitting is bad, and you shouldn’t do it even if mommy and daddy aren’t in the room.
My wife, just today, told me that she was very upset that I refused to interview and take jobs for things like building weapons, the panopticon, or advertising (two of those are the same thing), which I refuse to do because of my personal morals and ethics. How do I explain to her that I just can't do that, and give her a good reason why we should lose our home and live in one room with her mother because of my brain refusing to work in such industries? I really want to know, so I can explain to her and my son why such things matter, because for some reason they are concrete and foundational in my brain, there is no changing that.
> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.
Following that reference:
> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)
Furthermore, they state they trained on GitHub, web pages, and ArXiv, which all contain copyrighted content.
Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material? That it was trained on copyrighted material is certain.
[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971
[Gao et al., 2020] https://arxiv.org/pdf/2101.00027
1.) Training on copyright that is publicly available. You write a poem and publish it online for the world to read. That is your IP; no one else can take it and sell it, but they are free to read and be inspired by it. The legality of training on this is in the courts, but so far seems to be going in favor of LLMs.
2.) Training on copyright that is not publicly available. These are pretty much pirated works or works obtained by backdoor to avoid paying for them. Your poem is behind a paywall and you never got paid, yet the poem is known by the LLM. This is just straight illegal, as you legally must pay to view the work. However there might be conditions here too like paying for access to an archive and then training on everything in it.
This is distinct from (1) where the content is streamed or only ephemeral/incidental copies are made.
IMO there's a hack for this:
authors can state that they allow public use unless the work is used for training LLMs. Then all training work would fall under 2, because the works would be used against the copyright terms.
Is a book publicly available? No, you have to purchase it. But once you do, you're legally allowed to let your friends and family and so forth read it too. As long as you don't sell copies of it (the "copy" part of "copyright"), or meaningfully take away the ability for the publisher to make money from sales (so you can't post it for the whole world to see on the internet).
And sure, there are lots of ToS for digital works, but are they actually enforceable? ToS can say you're not allowed to let anyone else read the book you purchased. But no court is going to say you can't lend your Kindle to your friend for them to read it too. Many ToS clauses are flat-out illegal.
Meta will argue that training on books is no different from reading all the books at a friend's house. That as long as Meta isn't reselling or making publicly available the original text, they're in the clear.
Is it truly a violation of copyright when a user hacks out bits and pieces of easily restyled raw data points from a model to look samey? what about if it takes two models? Might be time to accept humans are just cooked in their ability to discern attempts at direct plagiarism - just as it is hard to discern Sky voice from Her voice.
In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.
The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But I have done what I could to form a correct estimate of their value. I have read translations of the most celebrated Arabic and Sanskrit works. I have conversed both here and at home with men distinguished by their proficiency in the Eastern tongues. I am quite ready to take the Oriental learning at the valuation of the Orientalists themselves. I have never found one among them who could deny that a single shelf of a good European library was worth the whole native literature of India and Arabia."
This chap will educate us on copyright?
No thanks!
If you reject Macaulay on copyright because he was an imperialist, you can use the exact same logic to reject the arguments of essentially every person who ever lived. Very few humans who ever wrote anything important will perfectly align with your morality, and most will be horribly misaligned in at least one way.
Very nice of you to omit the following sentences of that excerpt, where he goes on to develop his argument for instituting an English-language education system in British India. He praised how superior in quantity and quality the Sanskrit or Arabic corpora were, compared to European works, in lyric poetry. But he held that no technical or didactic literature amounted to even the most mundane of the European manuals, like those used at the time in England's humble schools (and that seems completely plausible).
He was a fierce abolitionist. So much for accomplishing the mission of allegedly, judging by comments in this thread, 'deranged imperialist destruction and chaos imposition over the lesser ones'.
I'm not much versed in his speeches/stance on copyright, but I can vouch for the fact that the most honest and well-intended moves (not by him, by other figures) in defence of everyone's intellectual property were made in the same century. From the twentieth onwards, it has only been twisted for the interest of a select few, and needless to ask where we are today in terms of caring about the intellectual property of anybody.
[1] Just saw your other comment where you go on with his nauseating words. One just cannot comprehend that framing the past on the actual status quo is as futile as to not being even wrong, I guess?
> The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But
... Here's what they mean, from ChatGPT."
He was able to sell it because it is something valuable, exactly because of the copyright protections. Regardless of whether author sells the rights or not, he and his family would equally be better off with copyright.
Copyright as written serves the interests of publishers who don't create valuable works more than the creators of the work...
Also, I don't think that implication is required, but let's pretend the implication is the only reasonable conclusion one could draw. Maybe it does make it acceptable?
If the vast majority of copyright enforcement isn't to protect creators of valuable work, but only serves to enrich those who take advantage of those creators, then isn't it not just reasonable or acceptable, but ethically required for someone to do everything they can to dismantle the systems being abused against the interests of those who are actually improving the world with their creations?
When Metallica sued Napster, for many people the reaction was, "wait I can download music for free?"
Are AI-written books getting published?
If they start out-competing humans, is that bad? According to most naysayers, they can't do anything original.
Are people asking the AI for books? And then hoping it will spit out a human-written book word for word?
Personally, I strongly believe that the aesthetic skills of humanity are one of our most advanced faculties — we are nowhere close to replacing them with fully-automated output, AGI or no.
LibGen gives you access to a much smaller body of works than either of those. It’s a little more convenient. But the big difference is that it doesn’t compensate the author at all.
Just go to a real library.
2. DRM is built into most purchased ebooks, which means you can’t read the book on whatever device you like. “Illegal” tools exist to circumvent this.
3. Large ebook stores - like other digital stores - essentially lend you a copy of the book. So when they are forced to pull a book, they’ll pull your access too.
Of course, now that the big players have consumed/archived the entire book dump, they can go ahead and kill it to prevent others from doing the same thing.
> Just go to a real library.
The thrill of waiting a week for a book to arrive or navigating the labyrinthine interlibrary loan system is truly a privilege that many can afford. And who needs instant access to knowledge when you can have the pleasure of paying for shipping or commuting to a physical library?
It's also fascinating that you mention compensating authors, as if the current publishing model is a paragon of fairness and equity. I'm sure the authors are just thrilled to receive their meager royalties while the rest of the industry reaps the benefits.
LibGen, on the other hand, is a quaint little website that only offers access to a vast, sprawling library of texts, completely free of charge and accessible to anyone with an internet connection. I'm sure it's totally insignificant compared to the robust and equitable systems you mentioned.
Your suggestion to "just go to a real library" is also a brilliant solution, assuming that everyone has the luxury of living near a well-stocked library, having the time and resources to visit it, and not having any other obligations or responsibilities. I'm sure it's not at all a tone-deaf, out-of-touch recommendation.
https://en.wikipedia.org/wiki/Aaron_Swartz#United_States_v._...
https://en.wikipedia.org/wiki/Aaron_Swartz#Death
While Aaron Swartz was bullied to suicide, these corporations will walk free and make billions. I say give every tech CEO the Swartz treatment, then change the law.
MIT students will get away with breaking bigger rules than community college students will.
If he was acting rationally and came to the conclusion that dying was better than spending X years in jail, he would have committed suicide after sentencing, not before any trial had even happened.
Two wrongs don't make a right. If a law is unjust, then what good is there in continuing to punish people who have broken it, just because other people have been punished in the past?
Either you think the law is just or unjust. If you think it's unjust, I don't possibly see how you think people should be punished for it. Meta wasn't responsible for what happened to Aaron Swartz.
Big corporations are too big, they should just not exist. When you have corporations more powerful than the government of the biggest states, it's a bug, not a feature.
The IP laws may need rethinking. Saying that they should disappear because big corporations are above the law doesn't help, though. First kill the big corporations, then think about fair laws. Changing the law now would not change anything since those corporations are already above the law.
It's not possible to kill big corporations before fair laws, because as you said yourself "corporations are already above the law"
Unfair laws don't apply to big corporations, they only apply to the people opposed to big corporations
It's akin to hamstringing a horse and saying you'll fix it when they win
The only distinction between corporations and governments is that one of them is a morally bankrupt arbiter of force.
For instance, what if Google were still just serving search results w/ ads, and had never expanded beyond that? How would you make them smaller?
I don’t know how you define powerful, but I highly doubt it is at that point.
Nor should big governments.
Nor should big countries, for that matter.
That said, I want them to burn for the right reasons.
Downloading data that should be available to the public is not one of them.
Also, change the law so this is legal for poor Meta? smh..
That means lawsuits, prison sentences, and millions in fines. And that's just the piracy part, there's also the lying/fraud part.
Interestingly, a Dutch LLM project was sent a cease and desist after the local copyright lobby caught wind of it being trained on a bunch of pirated eBooks. The case unfortunately wasn't fought out in court, because I would be very interested to see if this could make that copyright lobby take down ChatGPT and the other AI companies for doing the same.
So a copyright warning letter in the mail from their ISP? Maybe someone should tell them about VPNs...
- Seed the torrent and publicly promote piracy, putting pressure on lawmakers.
- Contribute with digitisation and open access like Google did in the past.
- Make the part of their dataset that was pirated publicly accessible.
- Fight stupid copyright laws. I can't believe that copyright lasts more than 20 years. No field moves that slowly, and there should be tighter limits on faster moving fields.
You mean Electronic Frontier Foundation? https://www.eff.org/issues/innovation
It's incredibly rare to find people who hold ideals that are detrimental to their own life.
You can counter by insisting that these "altruistic" behaviors are simply less directly but still in the altruist's interest. I would entirely agree.
Flippant response, I know, but too many people worship at the altar of the job creator and believe these folks are moral, upstanding citizens.