Is IA something different from AI? Or maybe just the French version ("intelligence artificielle" I imagine...)
My feeling is that a lot of Meta's AI/ML work actually ties into the AR/VR long-term dream. How do you make the so-called metaverse alive? By having people design it themselves. They're not going to do that in Maya, that's for sure. But if they could create virtual spaces and virtual people with Holodeck-style natural language instructions...
IMO it highlights just what a commodity the weights are. If all one needs is the weights to reproduce the work, then where is the value? I mean, there is very little moat here. Further, what does it say about consciousness and individuality if we are all simply the values of the weights in our wet neural networks? Or whatever the biological equivalent is?
There's nothing "simply" here. The weights in question are a particular configuration of several gigabytes worth of data. They're not random. Getting anything comparable by randomly generating a number this long is a "total atoms in the universe to the power of total atoms in the universe" kind of a deal.
In abstract terms, those weights are by far the most dense form of meaning we've ever dealt with.
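To put a rough number on "not random": a back-of-envelope sketch of how many distinct bit patterns a model of this size could take. The parameter count and precision below are illustrative assumptions (7B parameters at 16 bits each), not LLaMA's exact format.

```python
import math

# Total bits in an (assumed) 7B-parameter, 16-bit-per-weight model.
params = 7_000_000_000
bits_per_weight = 16
total_bits = params * bits_per_weight  # 112 billion bits

# The number of possible configurations is 2**total_bits; take log10
# so the magnitude stays printable.
log10_configs = total_bits * math.log10(2)
print(f"~10^{log10_configs:.3g} possible weight configurations")
```

For scale, the observable universe is usually estimated at around 10^80 atoms, so stumbling on any particular trained configuration by chance is not a meaningful possibility.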
That’s like saying that creating a song or a movie has no value because after they are created anyone can download a file with it.
However it does open up a point which is: should we allow people to make huge amounts of money by infinitely copying and distributing their own work? Should we be protecting this model?
Unless you believe that you think with your soul or something, what else could you be other than your quantum state, or some close-enough compressed equivalent?
They could do what Microsoft used to do in the 90s-00s: make pirated Windows/Office available (by turning a blind eye to private users) so that they capture and keep the mindshare.
Facebook, which ran over the law in order to be successful, now uses the law in the exact opposite sense. Ultimately, it's greed: "What is good for me to do to you is not good for you to do to me."
Someone should tell him (and all the other metaverse people) that VR is almost always dystopian. It's what everyone gets sucked into when civilization stagnates and there is no opportunity, no culture, and nowhere to go. It belongs in worlds of Malthusian collapse, after a nuclear war, or where inequality is so high the majority of people have reverted to high-tech medieval peasants.
VR has a legitimate niche in gaming but outside that it's just not appealing. It's dystopian and depressing. Nobody wants to spend time in a social network with a helmet on their head being served ads.
It seems like VR is less than half of the investment by Reality Labs. In Meta's 2022 annual report, they say "Many of our metaverse investments are directed toward long-term, cutting edge research and development for products that are not on the market today and may only be fully realized in the next decade. This includes exploring new technologies such as neural interfaces using electromyography, which lets people control their devices using neuromuscular signals, as well as innovations in artificial intelligence (AI) and hardware to help build next-generation interfaces. ... *in 2023, we expect to spend approximately 50% of our Reality Labs operating expenses on our augmented reality initiatives, approximately 40% on our virtual reality initiatives, and approximately 10% on social platforms and other initiatives.*"
I'm not sure if Horizon falls into "virtual reality" or "social platforms" but it seems to be the latter: "For example, we have launched Horizon Worlds, a social platform where people can interact with friends, ..."
This seems like a big misstep by Meta. I had assumed they were intentionally allowing that torrent to float around, and tacitly encouraging open source to build on their models. It seemed like a way to differentiate themselves from "Open"AI, and I was actually feeling some good will toward them for a second!
Isn't this just to protect them from liability? If someone claims that their LLM hurt them in some way, they can say that they had nothing to do with it and that they tried to prevent its spread.
Also, it could be a maneuver to prevent the genericization of the word LLaMa, which they may want to continue using.
Is there precedent on model weights being copyrightable in the first place? I suppose the recipients of the DMCA notices are unlikely to be willing to contest it in court, though.
It's an interesting legal question. US copyright is based around expressiveness and originality (which is why phone books and IBM's logo are not copyright-protected).
An argument might be made that the curation of data that goes into the training set qualifies, but it might depend on how much expressiveness and originality went into the curation.
For example, I could see a court ruling that the weights for a model trained on "all the good music from the 70s" is copyrightable, as someone had to express what they believed was "good" music, but a model trained on a large percentage of the internet without much curation would not.
Of course, nobody really knows until the courts weigh in on it.
If model weights become non-copyrightable, it'll lead to an incredible shift in the industry.
When model weights leak, anyone can pick them up and run with them. It's not like code, where you have to set up an entire bespoke infrastructure, microservices, data dependencies, etc. Models are crystallized, perfectly distilled functionality with a single interface.
You'll start to see more leaks, companies building off the work of other companies, etc. Part of me thinks this would lead to faster, more distributed innovation.
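A toy sketch of why leaked weights are so portable: a model is just arrays in a file plus one pure function. The two-layer shapes below are made up for the example; real LLM weights are the same idea at a vastly larger scale.

```python
import numpy as np

# "The model" is nothing but weights plus a forward function.
rng = np.random.default_rng(0)
weights = {
    "w1": rng.standard_normal((8, 16)),
    "w2": rng.standard_normal((16, 4)),
}

def forward(x, w):
    """The entire interface: inputs in, outputs out."""
    h = np.tanh(x @ w["w1"])
    return h @ w["w2"]

# Anyone who obtains the weight file gets the exact same functionality,
# with no surrounding infrastructure to rebuild:
np.savez("weights.npz", **weights)
loaded = dict(np.load("weights.npz"))
x = rng.standard_normal((1, 8))
assert np.allclose(forward(x, weights), forward(x, loaded))
```

Compare that to standing up a leaked codebase, which typically needs its original build system, services, and data pipelines before it does anything at all.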
"In regard to collections of facts, O'Connor wrote that copyright can apply only to the creative aspects of collection: the creative choice of what data to include or exclude, the order and style in which the information is presented, etc.—not to the information itself."
It's almost certainly true that the collection of "all the good music from the 70s" is copyrightable as a collection, but that doesn't make the weights the result of a creative process.
If anyone's willing to fund the legal battle, let me know. (You can DM me on Twitter: https://twitter.com/theshawwn)
I'd be willing to issue a DMCA counterclaim for llama-dl on the grounds that model weights are not copyrightable. If it's worth settling the question in court, then this seems like a good opportunity.
If Meta has registered the work with the copyright office, the statutory damages in this case, should llama-dl lose, might be quite large.
Check in with an attorney before launching a battle with an opponent who has unlimited resources. There are likely to be many similar test cases in the coming year, perhaps more-readily fought.
IANAL, so this is not legal advice. I consulted a legal expert a few years ago about the status of machine learning models, and they said it is really unclear. Apparently, if works are transformed enough that the original is not recognizable anymore, it may not violate copyright. It hinges a lot on whether the original work is reproduced, so if you could get an LLM to spit out copyrighted texts unmodified, then it would most likely be a copyright violation. But I think that doesn't really happen much in practice.
On the other hand, Meta can have copyright over the model through 'copyright in compilation', which protects compiled works, regardless of the copyright of the underlying material.
So, I fear that it may be possible to have it both ways. But realistically, I think we'll only know for sure when this is fought out in court.
Disclaimer: again I am not a lawyer, so take this with a grain of salt.
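The "spit out copyrighted texts unmodified" test above could be approximated mechanically: measure the longest run of characters a generation reproduces verbatim from a source text. A minimal sketch (the naive O(n·m) scan is my own illustrative choice, not an established legal test):

```python
def longest_verbatim_run(reference: str, generated: str) -> int:
    """Length of the longest substring of `generated` that also
    appears verbatim in `reference` (naive scan; fine for short texts)."""
    best = 0
    for i in range(len(generated)):
        for j in range(len(reference)):
            k = 0
            while (i + k < len(generated)
                   and j + k < len(reference)
                   and generated[i + k] == reference[j + k]):
                k += 1
            best = max(best, k)
    return best

# A long verbatim run suggests memorization rather than paraphrase;
# a short one suggests the original was transformed beyond recognition.
src = "It was the best of times, it was the worst of times"
print(longest_verbatim_run(src, "the model said it was the worst of times"))
```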
It seems intuitively kind of bonkers that they can ignore copyright when generating weights using stuff on the internet, and then turn around and claim copyright on the resulting artifacts.
Tech companies want it both ways. 1. They own the rights to any user content they store and transmit. 2. They are shielded by Section 230, and immune to liability for its misuse.
IDK, it's more like finding, on the sidewalk, the recipes of many great restaurant chains all mushed together by a 5th grader whose uncle stole them. Looks like a grey area to me legally, but IANAL.
A particularly thorny question if data in the training set is copyrighted. Which I presume much of it is, since I was playing with alpaca.cpp and it answered all kinds of pop culture questions I asked it (to varying degrees of accuracy).
They only own the arrangement of what is and isn't in the training set, inasmuch as that training set represents human creativity. The process of training model weights is itself purely mechanical.
The closest that they could get would be trade secrecy violations, but that only punishes the original leaker and anyone working in concert with them. I'm not sure if anyone's successfully managed to get an entire BitTorrent swarm to be considered misappropriating trade secrets. Presumably at some point, when the trade secret has been violated, you can obtain it without misappropriating - otherwise, how does that not just become Copyright 2.0?
Copyright-wise, why would the weights be any different from a video file? At the end of the day, LLaMa's weights and fast_and_furious_11.mp4 are both just strings of binary data that some company made to sell.
US copyright law cares about originality and expressiveness, not labour and cost. It can be very expensive to collect and print a list of every business and their phone number in a book, but the result is not copyrightable in the US.
Don't forget that transformers, the basic architecture of these large language models, were introduced by Google, and many other basic building blocks were introduced by universities and research centers all over the world. So no, the outcome is not very clear, at all.
I've got disappointing news for you. The copyright protecting your open source software never applied to its function, only its creative expression, and only insofar as the expression was separable from its function.
No, machines are not humans. And nobody stole your software, but they may have been in breach of the license. But because there's no financial damage, you probably don't have the financial standing to sue over it.
How is there no financial damage? If models start replacing existing business cases but are only able to do so because they were trained on copyrighted data relating to those business cases, then the financial damage should be very obvious.
Correct me if I'm wrong, but I don't think you require financial damages to sue for a violation of the license. Standing is conferred from having your (artificial legal) monopoly on the work as the author violated, not necessarily financially.
This is the whole principle that allows the GPL to work.
That aside, we should really stop misusing "steal". Not only is it legally inaccurate (Dowling v. United States), it's semantically inaccurate as well. The conflation of theft with mere copyright infringement is a campaign driven by bad-faith actors.
Funny how these bigcorps want to protect the copyright of their copyright infringement machines. Can't wait for someone to train a new model off of theirs and challenge them in court over it. Everything you can slurp off of the internet is fair game, right?
Of course, it’s no different than you browsing and reading the sum total of the Internet yourself which is a natural thing we are all capable of and then copying your brain and giving it to anyone who would like to query your headspace. Little known fact, that’s why the brain has a USB port. /s
> If they think they can catch OpenAI, and they want to charge for AI services, then what they're doing makes sense.
Then they wouldn't release anything to the researchers. Reason LLaMA weights are spreading in the first place is because they let a big group of people get access, someone is bound to upload it as a torrent.
Seems they gave people with .edu emails access relatively quickly too, researcher or not.
Fragmentation at first: you hinder the first mover (in this case OpenAI) from seizing the market completely, and later on you can acquire the necessary pieces to get back into the market, or buy time to close the gap.
Microsoft's current strategy seems to push in this direction: open-source certain technologies and acquire important pieces (e.g. GitHub, a stake in OpenAI) to build a bigger picture they can monetize later.
One great move FB could have made would be to ride the wave of positive PR and get all the investors hyped:
---> "Meta is a credible alternative to OpenAI, the company is switching from "Meta"-bullshit to an "IA"-first company",
and get the investors to pump the Meta stock,
and then dilute some of the shares to raise some cash (or issue new shares to newly specialized IA hires).
But no, FB is still going after the VR gimmicks and NFTs.
4 billion USD per quarter wasted on Oculus (!), while they could use this money to fund and support a whole ecosystem around LLaMA.
Yes, it's totally AI, and IA is the French version :)
Programmers in particular tend to overuse anglicisms, and I often end up doing Spanglish in my head:
"Necesito este value" "Este command deberia hacer esto" "Este iterator va a hacer $something a este object"
Not only do they have no street smarts whatsoever, but their book smarts start to disappear when you deviate from the training data.
That's one big reason the stock is up: $4B in the metaverse is nonsense. $4B in AI, though? Transformational.
Let 'em burn it all.
This could have been their “contribution to the society” - yielding much better PR than they could have ever hoped for.
(Also: great username!)
EDIT: DMCA here [0]. It does sound like they're asserting copyright on the "content," i.e. presumably the weights themselves.
[0] https://github.com/github/dmca/blob/master/2023/03/2023-03-2...
"In regard to collections of facts, O'Connor wrote that copyright can apply only to the creative aspects of collection: the creative choice of what data to include or exclude, the order and style in which the information is presented, etc.—not to the information itself."
Here, the weights aren't even facts in the first place.
Am I missing a joke...?
https://www.ibm.com/legal/copytrade
I see what you did there ;)
I wrote more about this further downthread: https://news.ycombinator.com/item?id=35288415
In the same way that a list of ingredients can/can’t be copyrighted.
It is however my understanding that downloading them can be considered a misappropriation of a trade secret.
Folks who contest the bogus DMCA takedown requests would be liable to a trade-secret suit.
That's what we were told when they stole our open source software.
That said, the inclusion of GPL-licensed code in training sets may yet force the release of those models under the GPL.
I'd really like to see this tested at a court.
Depends on what they're afraid of.
If they want to commoditize the space, then I think you're right.
If they think they can catch OpenAI, and they want to charge for AI services, then what they're doing makes sense.
OpenAI is a real threat to Google & Facebook