foob · 3 years ago
From the recent story about the Sarah Silverman lawsuit [1]:

The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

IANAL, but this basically sounds like LLaMA was trained on illegally obtained books by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least from the last checkpoint before they were used).

[1] https://news.ycombinator.com/item?id=36657540

ramshanker · 3 years ago
Sometimes I wonder: what if someone in XYZ country downloads the whole of Z-Library/Libgen, all the books ever printed, all the papers ever published, all the newspapers and so on, and releases the model open source? There are jurisdictions with lax rules.

And they would have much better knowledge, answers, etc. than the Western, lawyer-approved models.

Sometimes knowledge needs to be set free I guess.

TX81Z · 3 years ago
The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things.

At this point, with the quality of current web content and the collapse of journalism as an industry, I think we can say online ads have utterly failed as a replacement income stream.

Unless you want all LLMs to say “I’m sorry, the data I was trained on ends in 2023,” you still need a content funding model. Maybe not copyright, but sure as hell not ads either.

dogma1138 · 3 years ago
Training and copyright is going to be interesting. People can be trained on “illegally obtained” books too, yet you'd probably be hard pressed to argue that any employee who downloaded a book or a paper from a “libre library” could trigger a fruit-of-the-poisonous-tree argument down the line.
l33t233372 · 3 years ago
If the company supplied the employee with the “illegally obtained” books, that could be reason to view the situation differently than an employee acting on their own.

Since the company is obtaining and providing these models with 100% of their input data, it could be argued that it has some responsibility to verify the legality of how the data was procured.

stainablesteel · 3 years ago
it's not deemed illegal yet

it's in a weird place imo, with Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same

ie,

you're allowed to scrape the web

you're allowed to take what you scrape and put it in a database

you're allowed to use your database to inform decisions you might make, or content you might create

but once you put an AI model in the mix, all of a sudden there are problems: despite the fact that making the model is 10000% harder than doing all of the points mentioned above, using someone else's work somehow becomes a problem when it never was before

and if truly free and open source LLMs come into the game, might the corporate ones end up crippled by copyright? That's bad for business

brucethemoose2 · 3 years ago
> It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

They probably can:

https://github.com/zjunlp/EasyEdit
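
Worth noting what that repo actually does: it applies targeted knowledge edits to individual facts rather than bulk removal of training documents. A minimal sketch along the lines of its README (the class names, the hparams path, and the example prompt are my assumptions, so treat them as approximate):

    from easyeditor import BaseEditor, ROMEHyperParams

    # Load the config for one editing method (ROME) on one base model;
    # this hparams path is a guess at the repo's layout.
    hparams = ROMEHyperParams.from_hparams("./hparams/ROME/llama-7b")
    editor = BaseEditor.from_hparams(hparams)

    # Each edit rewrites one prompt -> answer association in the weights;
    # it does not delete whole documents from the training set.
    metrics, edited_model, _ = editor.edit(
        prompts=["The author of The Bedwetter is"],
        ground_truth=["Sarah Silverman"],
        target_new=["unknown"],
    )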

> I wonder if this is going to cause issues down the road.

There are some popular Stable Diffusion models, run by small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.

... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.

Der_Einzige · 3 years ago
I’ve been wondering when the landmark moral panic will start against Civit.AI and the coomer crowd. People have no idea just how much porn is being produced by this stuff. One of the top textual inversions right now is an… age slider… (https://civitai.com/models/65214/age-slider) ewww. It’s also extremely well rated and reviewed on there. I’m terrified of the impending backlash, because depending on what happens, the party going on in AI could end.
whimsicalism · 3 years ago
That is not at all the same thing as removing the books.
twayt · 3 years ago
> They probably can:

No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of information from the training data. The project you linked only describes a selective finetuning approach.

potsandpans · 3 years ago
that is quite the spicy claim


wongarsu · 3 years ago
If we accept the argument that you can train a ML model on data scraped from the internet because the model is sufficiently transformative and thus isn't impacted by the copyright of that data, then how does that change simply because somebody else distributed the data illegally? Either the ML model breaks the copyright chain or it doesn't. Or is the argument that using data that was provided to you in violation of copyright is illegal in general?
_Algernon_ · 3 years ago
How is it different than training from random blogs, or stack overflow or in general "The Internet"?
schleck8 · 3 years ago
Really, really bad look for Eleuther if this is true. I did not expect them to do something like this and not even see the issue with it.
Ancalagon · 3 years ago
Move fast and break the law.
Der_Einzige · 3 years ago
Most large datasets are full of copyrighted content. They aren’t unique.
cameldrv · 3 years ago
It seems difficult to argue that Meta can copy every ebook in existence to train a model, but then other people cannot copy the resulting model.
zargon · 3 years ago
It's not open source, it's freeware or something like that. Weights aren't the source code of LLMs, they're the binaries.
spmurrayzzz · 3 years ago
Maybe this is just semantics, but I don't know if the OSS-vs-freeware distinction matters all that much (I'd have to think about the potential downsides a bit more, tbh).

Virtually every discussion in the LLM space right now is almost immediately bifurcated by the "can I use this commercially?" question, which has a somewhat chilling effect on innovation. The best performing open source LLMs we have today are LLaMA-based, particularly the WizardLM variants, so giving them more actual industry exposure will hopefully be a force multiplier.

zargon · 3 years ago
LLaMA isn't open source either. But if I understand your point correctly, you're saying that the commercial-use axis is what is important to people, and it's orthogonal to freeware vs open source. In the present environment, I agree. But I don't think we should let companies get away with poisoning the term open source for things which are not. I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with RedPajama and others in the works. The distinction could be important in the near term, at the rate this field is developing.
slimebot80 · 3 years ago
Forgive my ignorance, but might it matter if a country was hoping to limit another country's advancement in weaponising AI?
whimsicalism · 3 years ago
Strong disagree - I think OSS is a fine framing of this. Weights are a third category; you can 'fork' them in a way that you can't with standard binaries.
l33t233372 · 3 years ago
You can add hooks to functions and “fork” binaries, which is a pretty similar effort to adding training data to a given model's weights.
williamstein · 3 years ago
Maybe there is no source code? I imagine an LLM is like the output of the following process. There's a huge room full of programmers who can directly edit machine code. You give them a random binary, which they then hack on for a while and publish the result. You then inspect it, tell them it isn't quite optimal in some way, and ask them for a new version. Iterate on this process a bazillion times. At the end you get a binary that you're reasonably happy with. Nobody ever has the source code.
powersnail · 3 years ago
Source code is the preferred form for development.

In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.

In the development of LLMs, the weights are in no way the preferred form of development. Programmers don't work on weights. They work on data, infrastructure, the model, the code for training, etc. The point of machine learning is not to work on weights.

Unless you anthropomorphize optimizers, in which case the weights are indeed the preferred form of editing, but I have never seen anyone, even the most forward AGI supporters, argue that optimizers are intelligent agents.
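
To make that concrete, here is a toy PyTorch sketch (purely illustrative, with random data standing in for a real corpus): developers write and edit the architecture, the data handling, and the training loop, and the weights only appear as the saved artifact at the end.

    import torch
    import torch.nn as nn

    # The "source": architecture and training code, written by humans.
    model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for step in range(1000):
        batch = torch.randn(32, 128)  # stand-in for a real, curated dataset
        loss = loss_fn(model(batch), batch)  # toy reconstruction objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The "binary": the weights are the artifact the optimizer produced.
    torch.save(model.state_dict(), "weights.pt")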

ssd532 · 3 years ago
I read this in all such discussions. What does it mean? I just have a very high-level understanding of AI models; no idea how things work under the hood or what knobs can be tweaked.
doctoboggan · 3 years ago
The source code is all the supporting code needed to run inference on the weights. This is usually Python, and in the case of LLaMA it's already open source. Usually the source code is referred to as the "model". You can kind of think of the weights as a settings file in a normal desktop application. The desktop app has its own source code and loads in the settings file at runtime. It can load different settings files for different behaviors.
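
To illustrate the settings-file analogy, a minimal sketch with Hugging Face transformers; the checkpoint name is just an example of openly licensed LLaMA-style weights, any causal LM repo would do:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # The open "app" is the inference code in the transformers library;
    # the weights are a data file it loads at runtime, like settings.
    name = "openlm-research/open_llama_7b"  # example checkpoint (assumption)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(output[0], skip_special_tokens=True))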


hoofedear · 3 years ago
Thank you for succinctly explaining the difference, I learned something today
StackOverlord · 3 years ago
Compiling source code doesn't cost millions of dollars, though.


greatpostman · 3 years ago
Meta is going to ruin OpenAI's moat on purpose. Great business strategy, and good for everyone but Meta's competitors.
jonnat · 3 years ago
Quite the opposite, this is great for Meta's competitors. Meta is not trying to get market share with this strategy, it's trying to commoditize their complements (https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/)

Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the more time people spend on the platform, and the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.

strikelaserclaw · 3 years ago
Kind of a dystopian nightmare world in which large corporations utilize AI to create low-cost, infinite content that humans engage with (mostly content catering to the human tendencies toward tribalism, prestige, sexual desire, etc.). Sounds like we are creating a world similar to the Matrix.
freedomben · 3 years ago
I wish I could upvote this a dozen times. This is a very insightful comment. Read the link above first if you aren't sure what "commoditize their complements" means.
herodoturtle · 3 years ago
Reminds me of Joel Spolsky’s essay on “Commoditize your complement”:

https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/


ekojs · 3 years ago
Seems that the source is a FT article that was discussed yesterday: https://news.ycombinator.com/item?id=36712168

From the FT article: '“The goal is to diminish the current dominance of OpenAI,” said one person with knowledge of high-level strategy at Meta.'

forgingahead · 3 years ago
Zuck is a total killer. What better way to fight Google and Microsoft than to effectively spawn thousands of potential competitors to their AI businesses with this (and other) releases? There will be a mad scramble over the released weights to develop new tech; these startups will raise tons of money and then fight the larger incumbents.

This is not charity, this is a shrewd business move.

amelius · 3 years ago
"Commoditize your complement"
whimsicalism · 3 years ago
If you read past the title, this article is not at all clear on whether they are referring to a commercial offering (i.e. license our model for $$) or an open-source license that permits commercial usage (Apache, etc.).

My guess is still the latter, because that's what I've heard the rumors about, but the article is pretty unclear on this point.

brucethemoose2 · 3 years ago
Falcon 40B was released under a "free, with royalties above a certain amount of profit" license, and got roasted for it. It was so bad that they changed the license to Apache.

I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.

whimsicalism · 3 years ago
I think it is supposed to be better than LLaMA 65B. Plenty of businesses are paying for OAI API access.
pmarreck · 3 years ago
I have a 128 core Threadripper, a 2080 Ti and a 3080 Ti.

How can I play with open source LLMs locally?

brucethemoose2 · 3 years ago
Kobold.cpp is your best bet.

You can leverage those big CPUs while still loading both GPUs with a 65B model.

... If you are feeling extra nice, you should set that up as an AI Horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return granting you priority access to models other hosts are running: https://aihorde.net/
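
For reference, launching it looks roughly like this; the flag names are from my memory of koboldcpp's README at the time, so treat them as assumptions and check --help:

    # Hypothetical invocation: the model filename and flag values are examples.
    python koboldcpp.py --model llama-65b.ggmlv3.q4_K_M.bin --usecublas --gpulayers 40 --threads 32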

pmarreck · 3 years ago
oooh, this is a great idea
estreeper · 3 years ago
If you're just looking to play with something locally for the first time, this is the simplest project I've found and has a simple web UI: https://github.com/cocktailpeanut/dalai

It works for the 7B/13B/30B/65B LLaMA models and Alpaca (a fine-tuned LLaMA, which definitely works better). The smaller models at least should run on pretty much any computer.
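
If it helps, setup was roughly this per the project README (commands may have changed since):

    # Commands per the dalai README at the time; may have changed since.
    npx dalai llama install 7B
    npx dalai serve   # then open http://localhost:3000 for the web UI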

brucethemoose2 · 3 years ago
That project seems unmaintained, which is a problem because llama.cpp is changing extremely rapidly.

Also, it has no "1 click" exe release like kobold.

freedomben · 3 years ago
May I ask why you have such an amazing machine, and two nice graphics cards? Feel free to tell me it's none of my business, it's just very interesting to me :-)
pmarreck · 3 years ago
Career dev who had the cash and wanted to experiment with anything that can be done concurrently: my language of choice lately, which features high concurrency (https://elixir-lang.org/), these LLMs, or anything else that can run in massively parallel fashion (which is, perhaps surprisingly, only a minority of possible computer work, but it still means I can run many apps without much slowdown!)

I originally had two 2080 Tis to also experiment with virtio/Proxmox (you need one for the host and one for any VM you run). I never got that running successfully at the time, but then Proton got really good (I mainly just wanted to run Windows games fast in a VM, and Proton made that unnecessary). Later on I upgraded one of them to a 3080 Ti.

It's a System76 machine, they make good stuff

nickthegreek · 3 years ago
Check out r/LocalLlama for a bunch of resources.
loufe · 3 years ago
I'm surprised nobody here has brought up the censorship in this model. Listening to Mark Zuckerberg talk about it on Lex Fridman's podcast, it sounds like the model will be significantly blunted vs its "research" version release.