GitHub is sued, and we may learn something about Creative Commons licensing

Excellent. GitHub is in my opinion crossing a whole pile of lines here that should not have been crossed without the authors' explicit permission, regardless of the utility of the tool they built. Copyright is not something that can be signed over by a terms-of-use change of a hosting provider; the expectation is that your host does not automatically claim the rights to anything that you store there.

Such projects should always be opt-in, not just because it is the law but also because it is common sense and the right thing to do from an ethical perspective.

jasode · 3 years ago

>lines here that should not have been crossed without the authors' explicit permission, regardless of the utility of the tool they built.

Fyi... Google Books (scanned and OCR'd books) eventually won against the authors filing lawsuits of copyright infringement. So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.

https://www.google.com/search?q=google+books+%22is+transform...

But courts in Europe may judge things differently.

jacquesm · 3 years ago

A number of points in Google's favor: they are not passing off Google books content as their own, they limit your access to a small fraction of the offering.

The thing that surprised me about that ruling is that it was deemed final without a chance of an appeal.

nequo · 3 years ago

There are two important differences.

Google Books retains the bibliographical information so you can properly cite the authors or contact them for permission to use their material.

And Google Books does not automatically write new books for you that you can then send off to Penguin Books or self-publish on Amazon.

bhuga · 3 years ago

> So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.

Curiously, from the article, copyright infringement is not alleged:

> As a final note, the complaint alleges a violation under the Digital Millennium Copyright Act for removal of copyright notices, attribution, and license terms, but conspicuously does not allege copyright infringement.

Perhaps the plaintiffs are trying to avoid exactly this prior law?

cyanydeez · 3 years ago

GitHub is actively driving a product, as opposed to merely duplicating, and those products may go on to generate income.

mpoteat · 3 years ago

If this lawsuit succeeds, I have a startup idea that I think would be effective.

Create a for-profit copyright registry for code snippets that are long enough to qualify for copyright protection. You can be the canonical owner of the copyright for a given piece of code! For a premium fee, we can generate and submit a patent on your behalf as well.

Once I have a large corpus (perhaps millions of entries of code, most one or two lines long), I can automatically scan new repositories and send cease and desist letters for violating my clients' copyright. Even if a piece of code is very common, that doesn't mean it's unoriginal, it just means that there are many people violating its copyright after all. According to the logic of the folks in this thread at least.

layer8 · 3 years ago

Do note that there’s the concept of https://en.wikipedia.org/wiki/Threshold_of_originality, which may be substantial for mere code snippets.

This may be one of the reasons why the lawsuit isn’t based on copyright.

PeterisP · 3 years ago

Copyright grants an author of a copyrightable work the exclusive right to make more copies of it. However, if people independently come up with the same exact thing, copying has not occurred and that exclusive right was not violated (and then the court battle effectively becomes one about proving whether copying did in fact occur).

In copyright law there is no such concept as "code snippets that are long enough to qualify for copyright protection" or "canonical owners". Quite explicitly, copyright does not give a monopoly over an idea, but merely protects against the unlawful reproduction of an original work.

If you take some snippet from a work in which you own copyright and find that in the world multiple people have somehow managed to write the exact snippet, but they did it independently without copying it from you, then copyright law effectively states the following things:

1) They definitely aren't violating your copyright, and you have no claim on them whatsoever - independent creation is a complete defense to copyright infringement;

2) Perhaps this snippet might be judged uncopyrightable, as the existence of multiple independent recreations is some evidence that it lacks originality and thus would not qualify for copyright protection at all.

gmiller123456 · 3 years ago

If the code snippets are so "obvious" that many people solve the problem exactly the same way, you're going to have a lot of trouble asserting a copyright or patent over it.

But your idea is pretty much what almost all manufacturers do, and have been doing for decades.

int_19h · 3 years ago

Automated code scanning working on similar principles is already in use in many large tech companies, but in reverse. Basically, to prevent shipping improperly licensed code, missing attribution notices etc.
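
For a rough idea of what the simplest form of such a check could look like, here is a minimal sketch. It only flags files that lack an SPDX license tag, which is far cruder than the snippet-matching scanners described above; the temp directory and the file names `tagged.c` and `untagged.c` are made up for the illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical source tree: one file with a license tag, one without.
src=$(mktemp -d)
printf '%s\n' '// SPDX-License-Identifier: MIT' 'int main(void) { return 0; }' > "$src/tagged.c"
printf '%s\n' 'int helper(void) { return 1; }' > "$src/untagged.c"

# grep -L prints the names of files that contain no matching line,
# i.e. the files missing the tag. The exit status of -L differs between
# grep versions, so "|| true" keeps "set -e" from aborting the script.
missing=$(grep -rL 'SPDX-License-Identifier:' "$src" || true)

echo "files without a license tag:"
echo "$missing"
```

A real pipeline would also have to compare code against known licensed sources; this only covers the "missing notice" half of the problem.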

moralestapia · 3 years ago

Oh, the internet!

Just yesterday, I bought a nice domain name for an idea that's very close to what you mention: monetizing snippets of code.

If you want to team up, hit me up!

hinkley · 3 years ago

That sounds like leftpad but with more steps.

amelius · 3 years ago

I just hope it doesn't end in Microsoft paying some (from their perspective) small fine that is just the cost of doing business.

jacquesm · 3 years ago

The range of possible outcomes is enormous. I'll just wait by the sidelines, but cherish the thought that moving out of GitHub when Microsoft bought it was the right decision. They can't be trusted; this has been proven over and over again and yet people keep falling for it. It's the fox guarding the chickens. I wrote about my misgivings at the time:

https://jacquesmattheij.com/what-is-wrong-with-microsoft-buy...

bluGill · 3 years ago

While the fine is a cost of doing business, if they don't change behavior they can be sued again, and courts tend to impose very large fines if they discover you were already fined for this and didn't change afterwards.

qayxc · 3 years ago

But why stop there? What's the difference between Microsoft, Google, Meta, and OpenAI in this regard?

All of those build their models based on the same sources and it's therefore a much more general issue than just one particular company being sued.

Cthulhu_ · 3 years ago

If the fine is higher than what they can feasibly gain from the product in a reasonable time, it'd make them drop the product entirely even if they are able to adjust it to conform to the laws.

judge2020 · 3 years ago

> Copyright is not something that can be signed over by a terms-of-use change of a hosting provider,

I mean, it's obvious that uploading code requires you to grant the hosting provider a license to host it (which is not signing over copyright); although feel free to argue that the license doesn't or shouldn't extend to CoPilot usage.

belorn · 3 years ago

With Creative Commons and GPL works, it is fairly common for a work to include multiple authors and rights holders. When a single user uploads such a work to a hosting provider, the permission given to the provider is limited to the permission that the user had. They can't give out permissions that they themselves do not have.

It is a similar case when a single user uploads a movie or game to a pirate torrent site. The site can have a terms-of-use that gives a license to the hosting provider, but naturally the users who upload the content might not have the permission to grant anything to the hosting provider. Depending on how much the hosting provider is or should be aware, hosting the content can still be illegal.

smoldesu · 3 years ago

You're right, but GitHub's TOS doesn't (or at least shouldn't) change the conditions of the original license. You're giving GitHub a copy of the source code, not the ability to dictate your license for you. There's certainly a lot of legal ambiguity in the copyright sense, but one thing seems clear: Microsoft trained Copilot on code they weren't certain they could use.

jacquesm · 3 years ago

Copyright is pretty complex legal material but the one thing that stands out for me is that you receive it upon creation and it requires a positive act on your part to relinquish it.

Macha · 3 years ago

There's also code that github themselves uploaded which they were permitted to do under the open source licenses in the mirrors user. I know some of these repos have since been moved as the authors became active on github (e.g. mirrors/linux is now torvalds/linux, indicating Linus has control of it even if it's a read only mirror), but I'm sure there's a few of them remaining.

PeterisP · 3 years ago

The legal argument made in this court case is that there is a substantial difference as redistributing the source as-is (keeping all the attached copyright, attribution and license notifications) is explicitly permitted by every open source license; but in the CoPilot usage the attribution gets removed, which is something which even the repository owner (assuming they're not the only author/copyright holder) does not have the right to do themselves, much less grant permission to others.

anothernewdude · 3 years ago

It doesn't actually. You can upload anything to github. The website doesn't stop you at all, even if you don't have the ability to grant anything to Github at all.

Eleison23 · 3 years ago

That's not obvious, because you don't necessarily own the code you're uploading. I can upload any sort of MIT-licensed, BSD-licensed, Apache-licensed, Creative-Commons-licensed, or GNU-copylefted works I want, anywhere within reason and compatible with those licenses, but if I didn't write them then I don't have the legal right to relicense, grant exclusive or restricted license to any specified parties.

So in a way this would void parts of many TOS agreements where you do relicense your User-Generated Content. If we're uploading memes to Facebook, they're gonna have to work out license terms with the copyright holders, not the uploaders.

zozbot234 · 3 years ago

Not "excellent" at all. This Richard Roe plaintiff is trying to use the notorious DMCA as an end-run around having to prove copyright infringement and withstand a possible fair use defense. That shouldn't be allowed, as a matter of Constitutionally-relevant protections.

chrismorgan · 3 years ago

But hang on, there would only be a DMCA violation if copyright infringement had in fact occurred, right? So fair use would be a perfectly legitimate defence, causing the DMCA not to apply.

Look, proving copyright infringement is downright trivial here, if copyright law applies. And that shows where GitHub’s defence will—must—lie.

(And for other readers unfamiliar with the parent comment’s phrasing: “end-run” is apparently an American sporting term which here makes “as an end-run around” mean “to circumvent” or “to work around”.)

baby · 3 years ago

You say that like Github Copilot could have been trained in a different way. There's just too much utility and progress in Github Copilot to let this lawsuit win.

kllrnohj · 3 years ago

Why should Microsoft's ability to create a new revenue stream be more important than anyone else's ability to enforce their licensing terms?

bee_rider · 3 years ago

That’s really up to everyone whose license may have been violated to determine. Personally I get zero utility from copilot (although, I only have a little code on GitHub so it isn’t a pressing issue).

Signing up for disperse litigation like this seems like a pretty ballsy move by GitHub, but hey, Microsoft presumably has in-house lawyers with lots of spare time.

jacquesm · 3 years ago

That's not how the law works.

hot_gril · 3 years ago

Could have been trained with the code owners' consent, or at least warning.

dspillett · 3 years ago

> Copyright is not something that can be signed over by a terms-of-use change of a hosting provider

Agreeing to GitHub's terms doesn't try to assign copyright over your code, it grabs licence to use your code however they see fit which is¹ legally quite different.

Of course the real fun comes if someone agrees to their terms then uploads some of my code which they have no right to assign the licence to GitHub for. What come-back do I get in that case if I don't want my stuff used that way?

It seems odd to me that MS² who for many years strongly spoke against touching anything with the remotest whiff of GPL because of what it could legally do to your release requirements, are now more than happy to hoover up all the GPL covered code in GitHub and potentially mix it into their users' work output via copilot.

----

[1] in my not-at-all-legally-trained understanding

[2] current owners of GitHub, for those not paying attention

mhitza · 3 years ago

> Agreeing to GitHub's terms doesn't try to assign copyright over your code, it grabs licence to use your code however they see fit which is¹ legally quite different.

I disagree, IANAL, and I'm happy they are getting sued. They are first and foremost a code hosting/collaboration company, and the terms of service we all agreed to when creating our accounts were for them to host our code and use it however they need in order to provide the service. Changing the service provided after the agreement (from mere hosting/collaboration to feeding code into Copilot) should be an opt-in. I hope it's tested in court what the service is, because if you have a feature that (let's say) 1% of your users use, that's not the service, is it?

tremon · 3 years ago

> it grabs licence to use your code however they see fit

Not your code. Anyone's code that's uploaded to github by any third party. Under open source licenses, that's expressly permitted. However, it seems you're arguing that Github is not bound by the license under which they (and their users) acquired the code because of their TOS.

How many projects on github are put there by the original copyright holders? Perhaps it's more than 50%, but it certainly is less than 100%. So where is github's legal paperwork that shows that they're only processing the code that's copyrighted by the user who uploaded it, or that they received permission from third-party rights holders that did not agree to their TOS?

jacquesm · 3 years ago

You would need to agree, for every copyrighted work, that you intend to allow them to use it for different purposes. Since GitHub can be completely invisible from the point of view of a contributor, I highly doubt that clears the bar for such an invasive and irrevocable act.

judge2020 · 3 years ago

It seems a lot of this stems from how the DMCA does not require that these hosting providers actually check for code ownership at submission, or maybe just how they don't have an explicit checkbox for "I affirm that I can license this code to GitHub" every time someone is uploading code.

If it is shown that the license in the TOS is valid, the legal question might boil down to "is the TOS License broad enough to where nobody thought that it allowed their code to be used in for-profit ML models?"

mch82 · 3 years ago

How did you type the superscript footnotes?

Edit: Wow, this is game changing. Markdown parsers need to implement superscript ascii character support!

Lowercase ⁽ᵃ⁾ Uppercase ⁽ᴬ⁾ Numbers ⁽⁹⁹⁾

Deleted Comment

Dead Comment

> “Your honor, we needed so many works that it was simply not practical to ask permission of the creators.” I don’t find this argument convincing given the ability today to license many content types at scale for TDM, including images, music and yes, journal articles (See “Full disclosure” above), but it is an argument often offered by infringers.

Why is this type of argument even valid? Isn't this fundamentally saying, "The cost of not infringing copyright is massive, so we will glibly infringe!"

So it is not okay to infringe copyright at a small scale but okay to do it in a large scale? How can such a line of argument be sensible in court? But apparently infringers are using this line of argument. So how? Is it not absurd?

judge2020 · 3 years ago

On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).

SamoyedFurFluff · 3 years ago

But the thing is that we explicitly allow humans to learn and develop their own skills learning from other humans, but we have our own taboos around directly copying people's work without permission and passing it off as your own. The debate is that copilot isn’t a human, it’s a machine that outputs copied work on a statistical basis.

Humans are allowed to be unoriginal, uncreative, boring, mediocre, and all sorts of things. But they’re not copying whole cloth the way copilot is.

distcs · 3 years ago

> On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet

And that would make sense and it would be argued on its own merit. The judge/jury will decide if this argument is correct and legal.

But the article implies that there are lawyers and infringers out there who are arguing that they could not have possibly afforded the cost of not infringing, so they were justified in their infringement. Since when did the massive cost of avoiding infringement become a valid reason to carry on with infringement? This seems just plain absurd by common sense. How do lawyers and infringers make this argument? How is it even entertained in court? What am I missing?

stackbutterflow · 3 years ago

We allow humans to do what copilot does because we take into account that the human brain is very limited in this regard. If we could scan all of GitHub in under a week and recall perfectly what we saw we would already have different laws. Now that machines are able to somewhat learn like humans but 1,000,000 times faster we need new laws.

That's why I don't believe "but that's like humans doing X" is a strong argument.

simion314 · 3 years ago

>On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).

It might not be that easy, I think Wine developers are not allowed to read code related to Windows, even if this code is published on GitHub. The fact you looked at the code was decided to be a risk.

You also have cases with a NN producing an identical output, so you either prove your NN NEVER produces copyrighted code or you have to have a second process that is 100% correct and double checks the NN output for plagiarism.

I am against Microsoft in this case because they decided not to put their proprietary code in the NN. It would have been funny to have the AI write an open source Windows re-implementation when you feed it the Win API documentation.

ClumsyPilot · 3 years ago

If copilot is so advanced that we need to grant it the rights a human has, then it has a right to freedom and owning copilot is a crime.

I don't think Microsoft wants to go down this path.

carom · 3 years ago

It is not a human though. It is a function approximated from inputs and outputs. The laws are different and the licenses call out derivative works.

CJefferson · 3 years ago

If enough code is recalled verbatim, I can sue the author of that code. That seems to fit entirely with this case -- they are suing the owner of Copilot, partially because it reproduces chunks of code.

chlorion · 3 years ago

Why are wine developers and similar required to do clean room implementations to not be sued then?

Simply reading the leaked source code of Windows makes you not eligible to contribute to wine.

Why is Windows source code so much more important than mine?

The other thing is that copilot is not a human, so it doesn't matter anyways.

Humans are a special exception with laws, because they are intended to protect and benefit humans while also being fair. I don't think you can just substitute something in and assume that the same rules apply.

toyg · 3 years ago

> Isn't this fundamentally saying, "The cost of not infringing copyright is massive, so we will glibly infringe!"

Copyright is not a natural human right; it's a construct invented and conferred by governments in order to achieve certain objectives. (It's more like a state license than a right, to be honest; using "right" was a historical masterstroke from the original inventors).

As such, if those objectives can be provably achieved in a better way without copyright (or rather cutting copyright a bit smaller), there might well be a case for foregoing punishment.

It's an exceptionally-hard argument to make, but it's not illogical.

icambron · 3 years ago

This seems like a good argument for adjusting copyright law, but seems unhelpful in interpreting it. "This law isn't a good way to achieve the government's objectives" is not the same as "this law wasn't broken". Judges do have some discretionary power in interpretation and that can take into account congress's intent, but here that would be a massive stretch. A judge would simply say it's congress's job to fix copyright if it's not the best way to achieve certain policy goals.

Aerroon · 3 years ago

And the purpose of copyright is to aid and encourage the progress of science and the useful arts:

>[The Congress shall have power] “To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”

feoren · 3 years ago

> Copyright is not a natural human right; it's a construct invented and conferred by governments in order to achieve certain objectives. (It's more like a state license than a right, to be honest; using "right" was a historical masterstroke from the original inventors).

I don't know of a better definition for "natural human right" than "a right/privilege/protection given to everyone automatically, even if they don't know about it or claim it, unless they specifically opt out of it." We get to decide what our "natural human rights" are, and we've decided that you automatically get copyright on your creative works even if you don't know what copyright is. Seems like a good thing, and a natural human right.

intelVISA · 3 years ago

>So how? Is it not absurd?

it is.

"Your honor, I've been drink-driving so many times I honestly can't tell you an accurate estimate any more so in both our interests let's agree it was basically uncountable, or 0."

wait a minute...

JamesSwift · 3 years ago

Or after a crypto heist:

"Your honor, it would have been onerous to ask each account owner if we could have their tokens so we just took all of them at once"

crazygringo · 3 years ago

> So it is not okay to infringe copyright at a small scale but okay to do it in a large scale?

No, I think you're missing the "transformative" part.

The line of argument isn't "we're going to resell millions of codebases as-is for pure profit", which would be undisputed copyright infringement.

The argument is that something highly transformative (e.g. training models) isn't infringement at all, because transformative works are covered by fair use. And that, if we still wanted to explore interpreting/changing the law to force opt-in for highly transformative things, it's logistically unreasonable, to such an extent that the transformative thing couldn't occur at all. So that it's a waste of time to even be discussing asking for permission as some kind of potential compromise or requirement. If it's transformative and therefore fair use, asking for permission is an irrelevant distraction.

That's why this type of argument is valid. I'm not saying whether the argument will/should win in this particular case, but I'm definitely saying there's nothing absurd whatsoever about it.

jfk13 · 3 years ago

Yes, transformative works may be allowed. So I'd guess that creating a model is probably OK (speaking as a non-lawyer!). But using output generated by that model is another matter. The "model" is fundamentally a machine that produces output that is derived from the input it was given. And that output might not be sufficiently transformative to "escape" copyright/licensing restrictions.

In the extreme case, the model's output might be a verbatim copy of a large portion of the original input ("training materials"); but even if it has been extensively modified, e.g. to conform to the coding style of a target repository or to follow a different language standard, this might not be "transformative".

(Compare: A translation of Harry Potter to French looks superficially quite different from the English original, yet it is still a derivative work; and if you're planning to publish one, Ms Rowling (or her publisher) may want a word with you. And that would apply whether you translated it "manually" or pushed it through Google Translate.)

iamevn · 3 years ago

I'm not sure I buy that training a model is transformative in the fair use sense. How is training a model different from lossy compression?

distcs · 3 years ago

Great answer! Thanks for taking the time to write this answer. Learnt something new!

JimDabell · 3 years ago

> But apparently infringers are using this line of argument. So how? Is it not absurd?

You realise that they haven’t actually used that line of argument, right? The article author speculated that it might be part of the defence and then said they didn’t find it compelling. Set up a straw man and then knocked it down in virtually the same breath.

Don’t waste your time complaining about legal arguments that have not been made except in the imagination of one author.

malfist · 3 years ago

It's the same argument people make about why crypto doesn't have to follow the laws on Know Your Customer. Because someone designed the crypto to break that law, so their hands are tied, it's too technically hard to comply.

jrm4 · 3 years ago

I'm still baffled as to why people treat Github like a public library despite being owned by what was at one time the greatest enemy of free and open source software in existence. Not saying they haven't changed their tune somewhat, but a library owned by Barnes and Noble is going to have very different incentives than an actual library.

Made all the more silly by the fact that it's Git. You could just host it yourself for five bucks a month and that's probably overpaying.

deltarholamda · 3 years ago

>Made all the more silly by the fact that it's Git. You could just host it yourself for five bucks a month

Ah, but GitHub isn't selling git hosting. It's a social network that also does git. That's the cake that Microsoft bought, not the UI and API frosting.

culi · 3 years ago

Hmmm I wonder if there is any work on federated github alternatives. Seems like a much more consequential network effect to tackle than social media tbh

Too · 3 years ago

And Dropbox can also be hosted yourself with a oneliner rsync-script…

The value GitHub provides is far from unique in any way, but let’s not pretend it’s trivial. Especially for an open source project already struggling to get contributors to their main code base, even more so for any ops work.

npteljes · 3 years ago

That's because there's no feedback loop of what you're saying, in the active lives of the people interfacing with GitHub. Consider an example of ingesting poison. If the poison tastes bad, I'll be sure to spit it out immediately, either by involuntary disgust, or because I associate with that negative feeling of being poisoned, something I don't want, so I react. But what if the poison tastes good? And what if it not only tastes good, it actually rewards me for ingesting it, in some way? People tell me it's poison, it might not say so on the label, and many are also ingesting it. Is it even believable that it's poison, given that I don't experience the negatives at all?

b3morales · 3 years ago

To take the analogy further, it also (arguably) wasn't poisonous for many years. The poison would have been added in June of 2018.

dijit · 3 years ago

Your analogy is spot on and aligns with humanity very well.

Drugs, alcohol and sugar all fit this description very neatly.

thealchemistdev · 3 years ago

I saw this a few months ago.

  local> ssh user@example.com
  user@example> git init --bare $DIR
  user@example> exit
  local> git clone user@example.com:$DIR

I've seen VPS services for as low as $4 a month.

I'm with you in camp baffled.

tick_tock_tick · 3 years ago

And dropbox is just rsync with a bit of cute UI, basically worthless. These comments are peak examples of how disconnected some hackernews users are from real life.

shepherdjerred · 3 years ago

What about backups? Managing access to the repository? Making the repository easy to discover? Can you browse the code in a browser, or read the README without cloning?

Of course you could do all of these things with enough work. But why would the average developer want to? Do you really think most developers care so much about Microsoft owning GitHub?

ollien · 3 years ago

Heh - a former employer of mine used something similar to this to sync code between laptops and development VMs :) We had a script that made a temporary commit, and pushed to a git repository in the way you describe, then reverted that commit.
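
A minimal sketch of that kind of sync, under some assumptions of mine: both repositories live in a local temp directory instead of on two machines, the branch name `sync` is made up, and the temporary commit is undone with `git reset` (keeping the edits in the working tree), which may not be what the original script did.

```shell
#!/bin/sh
set -eu

# Hypothetical layout: everything lives in a throwaway temp directory.
root=$(mktemp -d)
remote="$root/remote.git"   # stands in for the repository on the dev VM
work="$root/work"           # stands in for the laptop checkout

git init --quiet --bare "$remote"
git init --quiet "$work"
cd "$work"
git config user.name "Example"
git config user.email "example@example.invalid"

# A committed baseline, so there is something to go back to.
echo "v1" > file.txt
git add file.txt
git commit --quiet -m "baseline"
base=$(git rev-parse HEAD)

# Uncommitted work in progress that should reach the other machine.
echo "v2" > file.txt

# 1. make a temporary commit, 2. push it, 3. undo the commit but keep the edits.
git commit --quiet -am "TEMP sync"
git push --quiet --force "$remote" HEAD:refs/heads/sync
git reset --quiet --mixed HEAD~1

# What the other side would see, and where the local branch ended up.
pushed=$(git --git-dir="$remote" show sync:file.txt)
head_after=$(git rev-parse HEAD)
echo "remote sees: $pushed"
```

Over a network the only change would be replacing `"$remote"` with an SSH URL of the `user@example.com:path` form.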

wand3r · 3 years ago

Eh, I have no love for Github but this is huge bikeshedding. This can apply to so many pieces of a software project that at some point you just aren't even working on a project. Every tool has tradeoffs, and based on usage, github and gitlab are the kind of tradeoffs developers are willing to make.

I'd disagree. I think it's a "Black Swan" esque problem. Software developers shouldn't use Github for their bread and butter, in the same way that I argue real businesses should pay for email and not use Gmail.

Sure, it might work fine forever, but when it doesn't, you're really screwed and you could have avoided that in a relatively simple way. Reminds me of seatbelts and fire extinguishers.

skybrian · 3 years ago

It's been a bit over four years since the acquisition and Microsoft hasn't screwed it up yet.

It's good to have a backup plan, but it's convenient to keep using it for now.

TAForObvReasons · 3 years ago

GitHub built goodwill over the years. There were many controversies, but there were also many die-hard fans. That didn't evaporate overnight. Microsoft bought GitHub (and minted 3 billionaires in the process) specifically to acquire that goodwill and monetize it.

jacooper · 3 years ago

It's not goodwill, it's features and comfort. Github has the UI that almost every developer is used to, easy to use CI/CD, great issue and pull request handling. And more importantly, everything is free.

Even ignoring the value and the features, employers don't ask for your git link, they ask for your GitHub account. And since most projects are on GitHub, having all of your projects there too makes it easier to see all of your commits, making your profile look more active.

That's without mentioning the ease of discovery and issue reporting since everyone has an account.

aliqot · 3 years ago

> acquire that goodwill and monetize it

Embrace

Extend <-- here

Extinguish

I refuse to believe that Microsoft is any different from before (not just wrt FOSS but in general attitude). Doesn't matter if there's a new CEO. Look at what they did to Minecraft logins. Or this.

imran-iq · 3 years ago

> I'm still baffled as to why people treat Github like a public library

Because too many open source projects rely on it. Projects like crates.io force you to have a github account to use it. Most (neo)vim plugin managers give preferential treatment to github over other forges.

spandrew · 3 years ago

I'd like to know more about how MSFT qualifies for the moniker "they're the greatest enemy of free and open source software". From my understanding they've invested in many open web resources previously (like jQuery).

I did say "at one time," not necessarily now.

But if you've been watching, literally the only reason they don't hate it now is because they lost the open source v. proprietary battle.

cyral · 3 years ago

And they open sourced .NET

jotato · 3 years ago

This made me wonder. Is there a not-for-profit SCM host that does act like a library?

I don’t understand the author’s position in this article. It spends a long time talking about details of the licenses, but I can’t see any way the suit will actually be about licenses, because if it’s about licenses then it seems patently obvious to me that GitHub will lose very quickly, because they have undoubtedly violated the terms of the licenses.

As I see it, the only leg GitHub can possibly stand on is the “fair use” exemption of copyright law—that the license is irrelevant, because they weren’t using it under that license.

So then you get to the last paragraph of the article, and the “fair use” claim is finally mentioned, as something the plaintiffs seem to be seeking to avoid bringing into it because that would make things messy. But… GitHub’s defence must be “fair use”; I can see no other response. Yes, the plaintiffs “chose to focus on something that is beyond factual dispute”, but how are GitHub ever going to do anything other than bring fair use into it? So I don’t see how they could expect it to “still provide the same damages” without bringing fair use into it. (And I can’t imagine GitHub will settle for anything other than total vindication here; even settlement would doom Copilot.)

Returning to the title: I cannot imagine any way that We May Learn Something About Creative Commons Licensing from this suit. About the interactions between copyright law and machine learning, maybe. But about CC-*, GPL, Apache-2.0, MIT, whatever? Nah, there’s nothing interesting about them in the suit, because if they were involved, it’d be cut and dried.

jefftk · 3 years ago

That's entirely correct, and is why the suit will most likely [1] fail.

[1] 35% chance of success on https://manifold.markets/JeffKaufman/will-the-github-copilot...

mukesh610 · 3 years ago

IMO fair use is still not a strong argument for Microsoft. They commercialized the product and made money out of it.

Fair use is only allowed if the work you're doing is purely for the greater good. I might be wrong though, IANAL.

You are wrong, though public interest is certainly the basis of the purpose of fair use doctrine. But the simplest way of demonstrating that fair use still allows commercialisation is probably this example: search engines absolutely depend on fair use if they include any content from the linked pages. (And some countries have even tried to call the act of linking copyright infringement, though they’ve tended to back off at least a little, to requiring at least the title or other content for it to be infringement, and not just the URL.)

https://en.wikipedia.org/wiki/Fair_use, lots of good reading there.

elondaits · 3 years ago

It might be a problem to treat it as copyright. Copyright applies to reproduction, distribution, public performance... if I go to a library or bookstore and I read books and look at their covers, copyright does not apply. Would an android that walks around learning things be subject to copyright? To what extent does it need a body and mobility to be more like a person and less like a scraper?

It might seem stupid, but I worry that if copyright begins applying to "mining" then the next thing is that it applies to humans watching things.

Of course, if an AI re-creates copyrighted content, copyright should apply. Just like it applies when I redraw and sell the Mona Lisa, but not when I store it in my memory. I would pass on the responsibility to users. I don't fear my use of Github Copilot because it's far from infringing any reasonable copyright... then again, I'm assuming the most likely way to infringe copyright with GPT is to use a prompt that almost explicitly requests it.

samwillis · 3 years ago

I think one of the interesting things that will be covered in this lawsuit is whether the licence under which the code is released applies at all in the case of screen scraping.

The current understanding of screen scraping is that it is allowed, despite what is in the website's terms. Effectively if a human can access the content freely without having to actively agree to a license or terms you can scrape the content. You can't republish verbatim, but you can data mine and perform an analysis and publish that. This is how the legal status of all AI training data scraped from the web is being interpreted.

When it comes to open source code I suspect it will be found to be similar, if the code is freely visible on the web by a human without an active agreement to view it, then it will be possible to "scrape" it. I don't think the license the code is under will apply if that is the case.

The obvious question in this case: is GitHub "scraping" its own site for the training data? Probably not, and that may come back to bite them.

This then also opens up all sorts of interesting questions of whether you can copy paste code from a website and use it internally (not republishing), despite the license attached to the code. If it is freely visible.

Clearly a test case, this one, is needed to clarify the situation. And just because it's legal, it doesn't mean it's moral or ethical.

We may yet see the outcome of this case change the current interpretation of legal screen scraping, it's going to be an interesting time.

On top of all this there is then the question of an AI model reproducing code (or an image or music) verbatim. That obviously needs to be clarified by the courts too.

ascagnel_ · 3 years ago

> When it comes to open source code I suspect it will be found to be similar, if the code is freely visible on the web by a human without an active agreement to view it, then it will be possible to "scrape" it. I don't think the license the code is under will apply if that is the case.

I don't see the scraping case applying here -- the idea that all human-readable code accessible on the public internet can be ingested into such a system without regard for its license would effectively mean that anything posted to the public internet is entered into the public domain, which seems ridiculous on its face (especially when the author(s) is including an explicit license alongside that code, which the scraper can theoretically also read and account for).

visarga · 3 years ago

I think we're facing a copyright extinction event. The whole concept is out of touch with the new reality - when you can generate 100 variations for your text, code or image with the click of a button, what does it even mean to hold copyright over the original?

SkyBelow · 3 years ago

>the idea that all human-readable code accessible on the public internet can be ingested into such a system without regard for its license would effectively mean that anything posted to the public internet is entered into the public domain

It clearly isn't in the public domain. But suppose we define something new: a public knowledge domain, meaning material it is legal for a human to look at and learn from. They gain no rights to copyright or IP, but they can learn from it and use it per the existing limits of copyright and IP. Saying that anything posted online gets entered into the public knowledge domain seems agreeable. Few things wouldn't be allowed here, generally material that is agreed upon to be illegal worldwide (and some countries may have tighter limits, like a theocracy banning someone from learning anything from material deemed blasphemous).

Then it is a question of if an AI can also learn off of such material as long as it doesn't produce works that violate existing copyright or IP laws, same as a human. This doesn't seem, on its face, inherently ridiculous. There are still corner cases and potential for abuse, but those also exist with copyright law yet we don't throw the whole system away and just ban all forms of copying or selling the right to copy.

rockemsockem · 3 years ago

I see lots of folks equate "trained on" to "available verbatim" and that simply isn't the case for the vast majority of training data. It becomes hard to have a productive discussion when there is such focus on the examples that are regurgitated verbatim (often by people with explicit knowledge of the expected output, so they would *know* that they are going to infringe if they republished it) to the exclusion of talking about an in-general system that is trained on data and outputs unique data.

IMO that second case is a FAR more interesting question.

8note · 3 years ago

I'm not clear how it's particularly different from ingesting it into a browser, and then rendering it as part of the html into pixels

jcranmer · 3 years ago

> I think one of the interesting things that will be covered in this lawsuit is whether the licence under which the code is released applies at all in the case of screen scraping.

Screen scraping is essentially a question of whether or not the actions constitute something akin to hacking, which is almost completely orthogonal to copyright. The main intersection you get is that many screen scraping scenarios are about things that aren't copyrightable (the US doesn't recognize "sweat of the brow" doctrine, so databases aren't copyrightable). When Google was scraping lyrics off of lyric sites--lyrics being totally and clearly copyrightable--it was dinged pretty hard for that.

Permit · 3 years ago

> When Google was scraping lyrics off of lyric sites--lyrics being totally and clearly copyrightable--it was dinged pretty hard for that.

Was google dinged for scraping the lyrics or publishing them verbatim? My understanding is that it was the latter.

hoherd · 3 years ago

> Effectively if a human can access the content freely without having to actively agree to a license or terms you can scrape the content. You can't republish verbatim, but you can data mine and perform an analysis and publish that.

Would this apply to books? I can walk into a library or book store and OCR countless pages of countless books without agreeing to any license.

chasing · 3 years ago

Try publishing those OCRed pages and see what happens.

brezelgoring · 3 years ago

Steve Ballmer once called Linux and the GPL License a cancer because to copy a portion of code from a copyleft project, minimal as it may be, would make the whole project require a copyleft license.

If Github Co-Pilot includes GPL code then produced works should have GPL too, right? It is known that it produces verbatim copies of sections of code, so the 'derivative' explanation doesn't hold water.

Alternatives may be to copy CC0-only (which can't be guaranteed, also who'd believe you lol) or target license-only - as in, if my project is MIT, go with MIT projects to source from, if I'm GPL, include GPL and so on.

tjpnz · 3 years ago

>It is known that it produces verbatim copies of sections of code, so the 'derivative' explanation doesn't hold water.

For reference it's been shown to cough up code from Quake verbatim. This from John Carmack(?) also includes his profanity laden comments:

https://twitter.com/mitsuhiko/status/1410886329924194309

pantalaimon · 3 years ago

To be fair, that's probably one of the most copied pieces of code already.

> It is known that it produces verbatim copies of sections of code

This happens only rarely, like under 1% of the time. It happens mostly for well replicated code and not so much for code that only appears once. It can be filtered out with search and bloom filters of ngram hashes.

But the prompter can goad the model into copyright infringement by quoting the start of a copyrighted text verbatim, and asking for completion. The longer and more precise the prompt, the higher the chance of regurgitation. So, when it happens, we're often "asking for it".

Both regurgitation and hallucination seem to be LM problems we can tackle. They are complementary - in one we don't want the model to replicate the training data exactly (be creative), in the other we don't want the model to invent facts out of thin air (be factual). Both can be tackled by using search for reference testing.
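To make the "bloom filters of ngram hashes" idea concrete, here is a rough sketch, not Copilot's actual filter. It assumes whitespace tokenization, a hand-rolled Bloom filter built on `hashlib`, and hypothetical helper names (`index_corpus`, `looks_regurgitated`); the filter size and n-gram length are arbitrary.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(tokens: List[str], n: int) -> Iterator[str]:
    """Yield every run of n consecutive tokens, joined by spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def index_corpus(documents: Iterable[str], n: int) -> BloomFilter:
    """Record every token n-gram of the training documents in a Bloom filter."""
    bloom = BloomFilter()
    for doc in documents:
        for gram in ngrams(doc.split(), n):
            bloom.add(gram)
    return bloom


def looks_regurgitated(candidate: str, bloom: BloomFilter, n: int) -> bool:
    """True if any n-gram of the candidate output was (probably) seen in training."""
    return any(gram in bloom for gram in ngrams(candidate.split(), n))
```

A model output would be checked with `looks_regurgitated(output, bloom, n)` before being shown; a hit only means "probably seen", so a real system would presumably confirm it with an exact search, which is the "search for reference testing" part.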

anonymousab · 3 years ago

> I don't see regurgitation as a long term problem, it's just waiting for attention, probably wasn't top priority

The fact that Microsoft is wary of providing it with things like the Windows source code says all we need to know about how much it can be trusted.

asddubs · 3 years ago

it's not just trained on GPL code. it's trained on code of incompatible licenses, which means that the code it produces is potentially unlicensable in general. There's of course also the question of attribution, which many licenses require

I wonder whose decision it was to train the bots on 'code that is accessible to our scraper' and not on 'code we can sell derivative products of'. I can tell the second group is quite small, so I understand the incentive, at least.

Maybe they chose the Uber strategy of 'What we are doing is bordering on illegal but by the time the bell rings we'll be valuable enough to write the law ourselves'.

tamrix · 3 years ago

But the question realistically is, is it the developer's responsibility because they are the ones creating the program.

It's not like copilot made you use the code, copilot didn't commit or release the app with the code.

Copilot didn't breach the copyright, you did. It's a tool. You used it. You released it.

Maybe there's a product which you can include to see what code of yours violates copyright in the future?

> If Github Co-Pilot includes GPL code then produced works should have GPL too, right

No.

mellosouls · 3 years ago

It will be a real shame if the fantastic achievement of OpenAI with copilot etc is smothered by ego.

Innovation in code should be heralded but if in the majority of cases the coder using Copilot and similar tools is just saving time on bog standard functions they could write themselves, it's difficult to understand why that needs to be attributed.

AlexandrB · 3 years ago

> It will be a real shame if the fantastic achievement of OpenAI with copilot etc is smothered by ego.

I don't know if you watch YouTube, but this is probably how every creator hit with a bullshit DMCA claim for 5s of audio from a song feels. Why does OpenAI's work demand special consideration here? Or to put it another way - if we're going to be ignoring copyright, everyone should be able to do it.

Guid_NewGuid · 3 years ago

> if we're going to be ignoring copyright, everyone should be able to do it.

Yes, yesss. You're getting to the logical conclusion. Now I don't think Microsoft have become 'based' and want to break the copyright system but I hope they have inadvertently done so through their actions.

Oh I have no problem with protecting genuinely innovative work, including code. That's not the vast majority of code produced by or derived from these tools though.

esperent · 3 years ago

I don't think it would be a big deal if OpenAI/Copilot get shut down. Honestly it might be a good thing. Then we can generate new versions of these tools that are truly open using data that has been freely contributed, rather than obtained by for-profit companies in a shady cash grab.

And those tools will be similarly illegal if the court strikes down Copilot. Also it costs hundreds of thousands of dollars to train things like Copilot and GPT-3, so can we really rely on innovation happening in open source without any way to recoup costs? I get that you might not like OpenAI/Copilot for creating these tools in the way that they did, but surely you have to see that this decision goes WAY beyond what you blithely call a "shady cash grab".

marginalia_nu · 3 years ago

It's very easy to be generous with other people's property, intellectual or otherwise.

It's also very easy to invent supposed "intellectual property" rights out of thin air that conveniently last for a century or more (!) for your own work.

But instead of either of these views that focus on selfishness, it's much more productive to instead think about what benefits society as a whole.

sebzim4500 · 3 years ago

It's also pretty easy to be generous with your own when it comes to copilot. I have not been damaged in any way by copilot learning from my code, and no one else has either.

sfifs · 3 years ago

You could make this kind of "just" and "bog standard" argument for anything. Just using an image for educational or illustrative purpose, just using a song for a political rally etc etc.

The fact is as a society we have decided to reward creators with copyright as a means to commercialise their creation and get compensation. Who is to say programmers are not creators and the compensation they want for open source licenses is attribution?

Microsoft really should have known better than to touch OpenAI without a 10-foot barge pole.

How are you a "creator" (in an attribution-worthy sense) if you are producing an unoriginal implementation of an old algorithm that thousands of coders have produced before you?

Most coding is not innovative, and that is the kind of code that these tools are producing and derived from in most cases.

krazydad · 3 years ago

I agree with you. I love using copilot (and similar tools) and I’ve found it is exceedingly good at predicting patterns in my own coding. It has saved me a lot of time, and I would hate for it to go away or be crippled because of lawsuits like this. I could care less if my own code is used for training. My code isn’t precious, it’s what I do with it that is important.

pessimizer · 3 years ago

You're arguing that it should be allowed because you don't feel it harms you personally.

Literally no one objects to people like you donating your code to Microsoft.

Protecting the property rights of the rich is protecting freedom and civilization, protecting the property rights of people who share their work is ego.

rightbyte · 3 years ago

In my mind I don't want MS to rehash my code and sell it using "OpenAI" as some laundry machine.

Given how insane copyright laws are I would be pleased if they for once worked in my favour.

bastardoperator · 3 years ago

Completely agreed. It's crazy that people who claim to value freedom and being open quibble over copyright laws and licenses written for and by lawyers.

After reading the comments here, apparently everyone is a genius with code so special and unique, it would be unfathomable for two or more people to arrive at the same exact outcome.

Nobody is asking for AI in general to be illegal, only training on code you don't own and then emitting it for profit.

Why can't they train on code that they own, such as the windows source code?

Why doesn't copyright law apply to open source code but applies strongly to windows source code?

cmrdporcupine · 3 years ago

Tell me you don't know the history of and reasoning for free software (and attribution) licenses, without telling me you don't know the history of... etc.

Mind-bogglingly entitled.

Who's more entitled? The coder who has no issue with their unoriginal code being copied and mixed with millions of other samples and churned out in a helpful way for others, or the one who demands attribution in the most trivial of cases, or denies the access in these forms as it doesn't credit their brilliance in implementing a sort function?

I'm in the first category; I'm guessing by your abusive response you are in the second?

hgs3 · 3 years ago

AI advancements are shining a light on how warped our society has become. We _should_ invent technology to better our lives, but instead it's become something to fear as it might destroy our livelihoods. I understand the threatening feeling AI brings, but rather than squashing innovation we should be rethinking the role of our economy, copyright, and intellectual property.

N_A_T_E · 3 years ago

I agree with the sentiment; however, given the current political realities, it seems we are happier to let people lose their livelihoods without any clear replacements, training or economic plan of any kind. See manufacturing in the rust belt for an example of what happened in the last two decades when we moved to the knowledge worker economy without a plan for the workers.

EamonnMR · 3 years ago

AI had enormous potential for interesting artistic effects. Instead it's being used to make (for now) shoddy knockoffs, and maybe someday better-than-the-original knockoffs. I was hoping for the invention of a car, but all we got was a faster horse.

Would anyone have been able to sell a car if faster horses were already clogging up the streets?