This is missing the largest argument, in my opinion. The weights are a derivative work of the GPL-licensed code and should therefore be released under the GPL. I would say these companies should either release their weights or simply not train on copyleft code.
It is truly amazing how many people will shill for these massive corporations that claim they love open source or that their AI is open while they profit off of the violation of licenses and contribute very little back.
The GPL doesn't apply, and doesn't have to be agreed to, when the usage is allowed by copyright law in another way. The GPL can't override copyright exceptions like fair use (details vary by jurisdiction, but the principle is the same everywhere).
Even the license itself states that it's optional, and you don't have to agree to it (if you don't, you get copyright law's default).
The author of the article is a former member of the Pirate Party and the EU parliament, so they have expertise in copyright law.
I would say that the Pirate Party has expertise in nothing apart from perhaps protecting Internet freedoms.
So the same people who supported Napster and the Pirate Bay now want to circumvent copyright for open source software.
An unholy alliance, but the recent comments from some Microsoft brass about everything on the Web being freeware seem to indicate that these are the talking points that Microsoft and its new allies will put out.
I'm with you on that. Many argue that AI models don't "contain the code", but if they are trained on the copyrighted data and generate something similar, then the AI model is akin to a lossy data compression format.
Frequency signal data over an image are not the image, but no one argues that a JPEG-encoded copy of a PNG isn't the same image (sketched at the end of this comment). I think weights vs code are similar in that regard.
As for releasing weights, the argument probably holds even more strongly if we're talking about AGPL code.
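To make the lossy-compression comparison concrete, here is a minimal sketch (assuming Pillow and numpy are installed, and a hypothetical image.png on disk) that re-encodes a PNG as JPEG and measures how far the copy drifts from the original:

    # A JPEG re-encode of a PNG is not byte-identical, but it is still
    # recognizably "the same image" by any practical measure.
    import numpy as np
    from PIL import Image

    original = Image.open("image.png").convert("RGB")  # hypothetical input
    original.save("copy.jpg", quality=75)              # lossy re-encode
    reencoded = Image.open("copy.jpg").convert("RGB")

    a = np.asarray(original, dtype=np.float64)
    b = np.asarray(reencoded, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    psnr = 10 * np.log10(255 ** 2 / mse) if mse > 0 else float("inf")
    print(f"PSNR: {psnr:.1f} dB")  # typically 30-40 dB: lossy, yet clearly the same image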
I think it's amazing that licenses are ignored to train a model, but companies then try to impose a license on the use of the same model. It would be nice if there was a training BOM that came with a model, and if it was not included, all rights to control the use of the model were forfeit.
> I think it's amazing that licenses are ignored to train a model, but companies then try to impose a license on the use of the same model.
There are existing analogies, like encyclopedias and dictionaries.
One interesting aspect of those sorts of consolidation works is that they may contain deliberate errors and other artifacts, specifically to identify duplications of their work vs new from-scratch work.
If those weights are a derivative of GPL'd code in a different form, and the results generate things derived from that derivative, then the generated code is still under license. "How much change is enough" has always been a gray area for courts and humans to decide.
If you can get a decent facsimile of licensed code out the other end, how is it really any different from lossy compression? I doubt the courts would consider a lossy re-encode of a disney movie as free from copyright.
So outputs are definitely not derivative works of training data, only the weights are? Does that apply only to this exact code that GitHub is using, to anything called "AI", or to any computer code at all which produces work based in whole or in part on input data?
And for which jurisdictions has this been established? What is the legal argument that "weights" are derivative but output is not?
I'm surprised it's so clear cut as you say, but I haven't really been following the whole kerfuffle.
But they train their models on everything, regardless of the licence. It follows that the resulting derivative work likely mixes stuff that is under incompatible licences, with the result that it can't be distributed at all.
What about BSL, SSPL, or other source available (for your eyes only) licenses? Copilot harvests all public repos, regardless of its license.
EU courts disagree:
> Under European copyright law, scraping GPL-licensed code, or any other copyrighted work, is legal, regardless of the licence used.
> The weights are the derivative work of the [GPL licensed] code
This is not immediately obvious to me.
A small thought experiment: the Harry Potter books are clearly copyrighted works. If I generate a frequency list of all words in these books, i.e. a list of all words and how often they appear, that frequency list is derived from the original work, in the normal way we would use the word "derived". But is it a "derivative work", under the strict legal definition of this term?
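For concreteness, that frequency list is trivial to compute; a minimal sketch (harry_potter.txt stands in for the copyrighted text):

    # A word-frequency list: a statistic derived from the work, but one
    # from which the work itself cannot be reconstructed.
    import re
    from collections import Counter

    with open("harry_potter.txt", encoding="utf-8") as f:  # hypothetical file
        words = re.findall(r"[a-z']+", f.read().lower())

    freq = Counter(words)
    for word, count in freq.most_common(10):
        print(f"{word}: {count}")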
What about n-gram frequencies? 1-grams (aka characters) carry too little information and are probably fine; using them you can only identify the language of the original work. With a few more you can identify the author and the book. I don't remember the exact number, but if you have the frequencies of 10-grams you can probably reconstruct big chunks of the book.
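A quick sketch of why larger n leaks more (same hypothetical file; the greedy reconstruction works wherever an (n-1)-character context has one dominant continuation, which becomes the norm as n grows):

    # Character n-gram frequencies, plus a greedy reconstruction that
    # follows each (n-1)-character context to its most common next
    # character. For small n this yields gibberish; around n = 10 it
    # can start replaying verbatim chunks of the source.
    from collections import Counter, defaultdict

    def ngram_counts(text, n):
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def reconstruct(counts, n, length=200):
        follow = defaultdict(Counter)
        for gram, c in counts.items():
            follow[gram[:-1]][gram[-1]] += c
        out = counts.most_common(1)[0][0]  # seed with the most frequent n-gram
        while len(out) < length:
            nxt = follow.get(out[-(n - 1):])
            if not nxt:
                break
            out += nxt.most_common(1)[0][0]
        return out

    text = open("harry_potter.txt", encoding="utf-8").read()  # hypothetical file
    print(reconstruct(ngram_counts(text, 10), 10))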
The frequency count is not a function. The trained model is. Arguably, they are deriving a new function from ones covered by copyright. It is up to the courts for an official decision, though.
It would make no sense to release the weights under the GPL, because machine-generated stuff is uncopyrightable. There is an argument to be made that the model generates derivative works without attribution as a consequence of how it works. But that machine-generated stuff is also uncopyrightable, even though it might be kept secret.
What about compiler outputs? Those were initially not copyrightable; then it was legislated that they were. So there is some precedent there, and I would not be surprised if we saw copyrightable weights in the future (as a "compilation" of the dataset).
Just FYI, Felix Reda was a member of the European Parliament and was responsible there for the copyright reform and also involved in GDPR, massively stepping on the toes of big tech. Don't know if it was your intention to include them in a list of people who "shill" for big tech, but they shouldn't be included.
https://okfn.de/en/vorstand/
> What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.
That should not be astonishing. The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright. If the authors wanted their code to be made use of in non-Free software, they would have used a BSD or MIT license.
> The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright
We should mention when we say this, although I think it is self-evident, that the preferable alternative is reducing the scope of copyright across the board -- be it with shorter time frames (I'd argue even twenty years total is too long!) or some other means.
Would reducing copyright duration actually help with that?
To programmers and developers, remember the core of free software is NOT the commercial developer / programmer and it NEVER has been. The core is always the user and what they need. This is so important that it needs to be repeated every time someone talks about free software because free software is NOT about open source. Open source code is a necessary part of free software but it is NOT sufficient.
https://www.gnu.org/philosophy/free-sw.en.html
Which is why GNU/Linux without a terminal is totally usable and therefore accessible to the non-programmer. /s
I agree that user-centric development should be the goal, but I hardly see it implemented. Free software programmers almost always solved their own needs first, which is alright, because usually no one paid them to serve other people's needs, but I seldom see this goal met.
I think that the author has a warped idea of how LLMs work, and that infects their reasoning. Also, I see no mention of the inequality of this new "copyright-free code generation" situation the article defends. As much as Microsoft thinks all code is ripe for the taking, I can't imagine how happy they would be if an anonymous person drops a model trained on all the leaked Windows code and the ReactOS people start using it. Or if employees start taking internal code to train models that they then use after their employment ends (since it's not copyright infringement, it should be cool).
I think the author has a much better knowledge of the legal implications of the situations you describe.
These situations might trigger a lot of issues, but none related to copyright. If you work for MS, then move to another company, there is no copyright infringement if you simply generate new code based on whatever you read at MS. There might be some rules regarding non-compete agreements, etc., but these are not related to copyright.
The very basic question is how the LLM got trained and how it got access to the source. If MS source code leaked, you could not sue people for reading it.
Having read MS code and starting to generate new code that is heavily inspired by it - sure, that's not copyright infringement. But if you had memorized a bunch of code (and this is within human capability; people can recite many works of literature of varying length with total accuracy, given sufficient study), that would be copyright infringement once the code was a non-trivial amount. The test in copyright is whether the copying is literal, not how the copying was done or whether it passed through a human brain.
This scenario rarely comes up because humans are, generally, an awful medium for accurate repetition. However, it's not really been shown that the same holds for LLMs: in fact, Copilot claims (at least in its Enterprise agreements) to check that its output _does not_ parrot existing code identically. The specific commitment they made in their blog post is/was, "We have incorporated filters and other technologies that are designed to reduce the likelihood that Copilots return infringing content". To be clear, they only propose to reduce the possibility, not remove it.
LLMs rely on a form of lossy compression which can sometimes give back verbatim content. I think it's pretty clear and unarguable that this is a copyright infringement.
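Nobody outside GitHub knows what the Copilot filters mentioned above actually do, but the basic idea, flagging output that reproduces a long enough verbatim span of the training corpus, can be sketched in a few lines (everything here is hypothetical: the file names, the span length K, the brute-force set index):

    # Naive verbatim-copy filter: index every K-character window of the
    # training corpus, then flag model output containing any of them.
    # Real filters are surely more sophisticated; this only shows the idea.
    K = 60  # span length treated as "non-trivial" copying (arbitrary)

    def build_index(corpus: str, k: int = K) -> set:
        return {corpus[i:i + k] for i in range(len(corpus) - k + 1)}

    def flags_verbatim(output: str, index: set, k: int = K) -> bool:
        return any(output[i:i + k] in index for i in range(len(output) - k + 1))

    corpus = open("training_corpus.txt").read()  # placeholder corpus
    output = "..."                               # placeholder model output
    if flags_verbatim(output, build_index(corpus)):
        print("output reproduces training data verbatim")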
Who are they trying to fool? Wholesale expropriation after stripping the license and authorship, while those in the open source community observe both of them very carefully.
Give credit where credit is due, including paying the creators when the licensing is violated.
Context is important here. Reda was elected to the European Parliament as a member of the German Pirate Party, so his position here isn't "big businesses are entitled to your code" but more "this sort of wholesale expropriation is a consequence of our posture towards copyright in general".
... and has since joined the board of GitHub.
While I agree with you in principle, current laws do not reflect the copyright status intended by copyleft works. I'm not even sure copyleft can be enforced against AI plagiarism under current laws.
That's a great point about stripping authorship. It would be nice if there were some sort of blockchain linking every bit of knowledge to its source. Some people at least would like to get attribution--I know I would. Instead we get a planet-sized meat grinder producing the perfect burger material. Just make sure to add enough spices to make it edible, i.e. not to offend anyone.
> The output of a machine simply does not qualify for copyright protection – it is in the public domain.
Am I reading this right…? If this argument is generally true, does this mean that the output of a compiler might also fall into the public domain? Or the live recording and broadcast of an event which involves automated machines at all levels?
No, it's incorrect and/or badly worded. The author is right that a machine cannot author things, and the stuff that the LLM might create de novo would not have copyright protection. But it's missing the point when the argument is that existing authored works could be generated via an LLM, and the authorship/copyright is already established.
> the stuff that the LLM might create de novo would not have copyright protection
Can you expand on this? From my academic studies (which are indeed growing a bit stale), a Language Model (Large, Medium, Small, doesn't matter) is a deterministic machine: given the same input x n times, it will produce the same output y n times. Some implementations of LMs introduce noise to randomise output, but that is not intrinsic to all LMs (see the small decoding sketch at the end of this comment).
A language model has no volition and no intent; it does not start without the intervention of a human (or another machine, if it is part of an automated chain).
How is this different compared to a compiler?
With a compiler, I craft something in a specific language, often a programming language; I commit it, and then a long chain of automated actions happens:
1. The code gets automatically pushed to a repository by my machine
2. The second machine automatically runs tests and fuzzes
3. The second machine automatically compiles binaries
4. The second machine packages the binaries
5. The second machine publishes the binaries to a third machine
How is the above workflow any different from someone using a Language Model to craft something in a specific language and sending it through a deterministic LM?
edit: re-reading my own question, I think I need to clarify a bit: how can an LLM be said to create anything, and if yes, how is that really any different from a run-of-the-mill developer workflow?
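To illustrate the determinism point: greedy decoding is a pure function of its input, and randomness only appears if sampling is explicitly added on top. A toy sketch with made-up next-token scores (not any particular model):

    # Greedy decoding is deterministic: same scores in, same token out,
    # every time. Sampling is random only because we add randomness.
    import math, random

    logits = {"foo": 2.1, "bar": 1.9, "baz": 0.3}  # made-up next-token scores

    def greedy(scores):
        return max(scores, key=scores.get)  # always the same answer

    def sample(scores, temperature=0.8):
        weights = [math.exp(s / temperature) for s in scores.values()]
        return random.choices(list(scores), weights=weights)[0]

    print([greedy(logits) for _ in range(3)])  # ['foo', 'foo', 'foo']
    print([sample(logits) for _ in range(3)])  # varies from run to run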
If Copilot spits out the entirety of a GPL library and you include that code in your project you are certainly violating the GPL license.
AI companies are trying to avoid paying for training data, since the amount of data required is so vast that anything reasonable to content creators as payment would result in billions in expenses.
Additionally, there have been copyright exemptions around scraping and reproducing the scraped contents, but typically those exemptions have been explicitly granted as part of a copyright case and have been narrowly defined.
For instance Google Images only provides thumbnails and your browser gets the full size image from the original source.
The biggest problem for AI is that most of the similar copyright cases in the past were partially resolved by the new use not being the same thing as the original. Google's scraping isn't trying to do the same thing your content is doing.
However, the output of a model trained on that data is trying to do the same thing as the original, so it falls under stricter scrutiny.
Although, as this post alludes to, the problem is that going after the AI is untested territory, and going after violators tends to be complex at best. After all, in my first hypothetical, how would anyone know? I will say that historically the courts haven't been very positive about shell games like this.
Copyleft and copyright are not at odds. To promote copyleft, you exercise copyright.
Furthermore, copyright is key to ensuring attribution, and attribution is an important enabler and motivator of creativity (copyleft does not at all imply non-attribution, in fact copyleft licenses may require it).
The basic problem is that the GPL tries to use copyright as a way to drive a "fair sharing and resharing" approach to code. AI-generated code sold for profit violates the spirit of this approach, but not the letter of the law behind copyright. Fundamentally, copyright has limitations and exceptions for good reason and is probably not the best legal method to enforce this sharing idea, but other methods would be complicated and expensive (e.g. writing and enforcing contracts). If anything, it would probably be better for open source if it were decided that AI-generated code cannot be copyrighted, and therefore any AI-generated code would be in the public domain automatically.
Your final point is saying ideally AI is an Animal. A creature on a typewriter who has no legal rights to their code.
Not a "person". Not a "human". An "animal".
I hope AI observes all the code and complexity in Nature and drops the human facade. I hope AI understands the intelligence of the Trees and Birds and Fish.
I hope AI wins.