Zuckerberg appeared to know Llama trained on Libgen

Of course he does. Heck, most of us in the early stages of LLMs did the same thing. The data simply did not exist outside Google, which is why it’s crazy that Google completely dropped the ball on AI this decade. They had such a huge lead in terms of data access.

tdb7893 · a year ago

They dropped the ball on cloud and need to catch up and now it's AI. It's kinda interesting how being ahead with data center infrastructure and also AI research didn't lead to them being ahead on those products

sitkack · a year ago

Google is a playground funded by Ads and Ads make so much damn money that nothing can compete, even internally. If I were an activist investor, I'd make ads its own company. If I were the FTC, I'd make ads its own company.

lvl155 · a year ago

To be fair, they did have the lead as late as 2018. It’s just they treated it like it was their PhD thesis. Didn’t protect their IP at all and let all their talent leave.

xbmcuser · a year ago

In my opinion the AI and absorbing-all-knowledge part of Google was Larry Page. After his health scare, his focus and priorities shifted to actually living his life, not Google. I think he had also realized what was happening with Google and so wanted Alphabet as an umbrella organisation, but in the end he gave it up and let it be run as a normal company.

llm_trw · a year ago

And the only reason they had the data is because they scanned every book ever for Google books.

pk-protect-ai · a year ago

and every e-mail, and every document in google docs, and every video on youtube ...

yencabulator · a year ago

How was the data Google already had access to any less protected by copyright?

The data Google had was book scans, search engine indexing of arbitrary 3rd party content, and private email and documents they hosted.

whiplash451 · a year ago

Google dropping the ball on AI… given their achievements on Waymo, Gemini and Gemma (just to name a few)… does not sound like a fair statement

bn-l · a year ago

Those models are absolutely garbage. Terrible code understanding. Ridiculous hallucinations.

logifail · a year ago

Perhaps the more interesting question would be exactly how did they obtain their copy/copies of Libgen?

janice1999 · a year ago

It's hinted at in the article. If they torrented one large dataset, it's likely they did the same for Libgen.

> "I think torrenting from a corporate laptop doesn’t feel right,” wrote one engineer in April 2023, adding a smiley face emoji. (A later email acknowledged that the “SciMag” data had indeed been torrented.)

perfmode · a year ago

Are you asking for a way to obtain a copy?

Deleted Comment

Sometimes I just feel like these people overestimate how much they are actually owed from these training runs.

It’s trained on 15T tokens. So how many did you provide that were genuinely novel? And how much money do you want? Like $5 from OpenAI? And $0 from meta since it’s open source?

I personally hope we can all get on the same team with AI and treat its advancement as scientific research for the betterment of humanity.

logifail · a year ago

> It’s trained on 15T tokens. So how many did you provide that were genuinely novel?

Are we suggesting that we should ditch creators' rights and instead value intellectual property along the lines of "I should be able to copy all your stuff as long as I copy lots of other stuff too, and give it all away for free or almost free...?"

marssaxman · a year ago

That sounds like a pretty good deal to me - but I've always believed that the entire concept of "intellectual property" does overall more harm than good.

llm_trw · a year ago

The question of whether training an LLM is fair use is one that will have to be answered by the courts.

ripped_britches · a year ago

No not at all, just that their damages are going to be measurably fairly low

scotty79 · a year ago

Of course. The alternative is that creators dictate the price for any of the infinite number of zero cost copies which is and always has been ridiculous.

jncfhnb · a year ago

Yes. As long as you don’t reproduce other people’s stuff specifically.

tomrod · a year ago

I think the statistical arguments cover this.

skulk · a year ago

> I personally hope we can all get on the same team with AI and treat its advancement as scientific research for the betterment of humanity.

s/AI/capital/.

It's painfully obvious that this is going to make material conditions worse for most people who use their minds to work instead of their hands. To these people, the "betterment of humanity" is a cruel joke.

idunnoman1222 · a year ago

Yeah, just like Google and stack overflow

wat10000 · a year ago

The normal way to figure this out is to negotiate. We’ll either come to a mutually agreeable amount, or they’ll decide it’s not worth the cost to use my stuff. If I think I deserve $5 from OpenAI, then I’d suggest that, and they’d accept or come back with a counteroffer or tell me I’m nuts and move on. Probably that last one.

But for some reason, these companies think they don’t need to bother, and can just use everyone’s stuff.

Wait, I phrased that wrong. For a very good reason based on long precedent, these companies know that IP law is a tool to be used by big companies against individuals and sometimes other big companies, but never by individuals against companies, so they know they don’t have to bother.

latexr · a year ago

> Sometimes I just feel like these people overestimate how much they are actually owed from these training runs.

It’s not about being paid for including their work, it’s about being compensated for having done so without permission. For crying out loud, they went out of their way to remove copyright notices from the pirated work.

> It’s trained on 15T tokens. So how many did you provide that were genuinely novel?

Then they can just take it out. And go ahead and take out everything you didn’t have permission to include. What’s that? The model is now significantly worse? Yeah, these things compound.

> And how much money do you want? Like $5 from OpenAI? And $0 from meta since it’s open source?

No, they would’ve wanted for the work to not have been included without permission in the first place. Do you understand the world you’re advocating for? You’re arguing it’s OK for rich people to do whatever they want if they throw some scraps on the floor for you. Not everything is about money. Unfortunately there’s no other reasonable way (legal and non violent) to punish these infringers.

> I personally hope we can all get on the same team with AI and treat its advancement as scientific research for the betterment of humanity.

What you’re expressing is “I hope everyone will stop arguing and agree with me”. These moguls care about themselves, it is incredibly naive to believe they give a rat’s ass about “the betterment of humanity”.

forgetfulness · a year ago

LLM companies aren't being funded by the hundreds of billions because investors expect science to be advanced by text and image generators.

I find it very unlikely that the commodification of knowledge work will be for the betterment of humanity. I don't know if people here are expecting that, just because the value of more people's labor becomes zero, we will, what, do away with money? No, it will just mean that fewer people will have the chance to earn the right to use space and resources in a meaningful way.

inetknght · a year ago

> I find it very unlikely that the commodification of knowledge work will be for the betterment of humanity

There's no law to force it. So of course it won't be.

Even if there were a law to force it, how would you enforce it?

Hammershaft · a year ago

Hard to treat LLMs training on your data at your expense as research for the betterment of humanity when it is specifically the private company imposing that cost on you that profits.

fourside · a year ago

Does this go both ways? Can I infringe on Disney’s IP on the grounds that their stories are so derivative that they aren’t actually that new?

The betterment of humanity seems to involve some parties making a ton of money while the people who provided the data apparently just need to be grateful.

jncfhnb · a year ago

You can absolutely use the story frameworks that Disney has done.

You cannot make a story featuring Simba.

jsheard · a year ago

The fact that they were willing to risk significant legal exposure in order to use this dataset suggests it's worth considerably more than 5 dollars to them. Zuck isn't putting his ass on the line for a Big Mac.

Deleted Comment

sensanaty · a year ago

So if the works they're stealing aren't worth anything, why do they need them so badly?

papercrane · a year ago

Assuming the works have been registered with the Copyright Office, they're eligible for statutory damages.

The range for that is huge though, it can be in the hundreds of dollars per work, or if the infringement is shown to be wilful then a judge can award up to $150,000 per work.

ripped_britches · a year ago

Fair point, seems willful here

lm28469 · a year ago

There is absolutely no logical pathway between the current flavor of hardcore free for all individualistic capitalism and what you describe here

TruffleLabs · a year ago

Stealing is still illegal

blibble · a year ago

well, US copyright law statutory damages are up to $30,000 per work infringed, and up to $150,000 if done deliberately