KyleBerezin · a year ago
I will throw in a random story here about ChatGPT 4.0. I'm not commenting on this article directly, just a somewhat related anecdote. I was using ChatGPT to help me write some Android OpenGL rendering code. OpenGL can be very esoteric, and I haven't touched it for at least 10 years.

Everything was going great and I had a working example, so I decided to look online for some example code to verify I was doing things correctly and not making any glaring mistakes. It was then that I found an exact line-by-line copy of what ChatGPT had given me. This was before it had the ability to google things, and the code predated OpenAI. It had even carried over the spelling errors in the variable names; the only thing it changed was translating the comments from Spanish to English.

I had always been under the impression that ChatGPT just learned from sources, and then gave you a new result based roughly on its sources. I think some of the confounding variables here were: 1. this was a very specific use case and not many examples existed, and 2. all OpenGL code looks similar, to a point.

The worst part was, there was no license provided for the code or the repo, so it was not legal for me to take the code wholesale like that. I am now much more cautious about asking ChatGPT for code; I only have it give me direction now, and no longer use 'sample code' that it produces.

vundercind · a year ago
> I had always been under the impression that ChatGPT just learned from sources, and then gave you a new result based roughly on its sources. I think some of the confounding variables here were: 1.

If I’ve understood the transformer paper correctly, these things probabilistically guess words based on what they’ve been trained on. The probabilities are weighted dynamically by the prompt, by what they’ve already generated, and by what they “think” they might generate for the next few tokens (they look ahead somewhat), with another set of probability-weight adjustments applied to all that by a statistical guess at which tokens or words are most significant or important.

None of that would prevent them from spitting out exactly what they’ve seen in training data. Keeping them from doing that a lot requires introducing “noise” into all the statistics above, and maybe a gate after generation that checks whether the output is too similar to training data and forces another run (maybe with more noise) if it is, similar to how they’re kept from saying racist stuff or whatever.
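
As a toy illustration of that sampling step (a sketch only, not how any real model is implemented; the token names are made up):

    import math, random

    # Hypothetical scores ("logits") the model assigned to each candidate
    # next token, given the prompt and everything generated so far.
    logits = {"glGenBuffers": 3.2, "glBindBuffer": 1.1, "glDrawArrays": 0.4}

    def sample(logits, temperature):
        if temperature == 0:
            # Greedy decoding: always emit the single most likely token.
            # This is exactly the setting where memorised training text
            # can come back out verbatim.
            return max(logits, key=logits.get)
        # Temperature softmax: higher values flatten the distribution,
        # i.e. add the "noise" mentioned above.
        weights = [math.exp(s / temperature) for s in logits.values()]
        return random.choices(list(logits), weights=weights)[0]

    print(sample(logits, 0))    # deterministic every run
    print(sample(logits, 1.5))  # varies run to run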

devmor · a year ago
You have understood correctly. What LLMs are, at least in their current state, is not fundamentally different from a simple Markov chain generator.

Technically speaking, it is of course far more complex. There is some incredible vector math and token rerouting going on, but in terms of how you get output from input, it's still "how often have I seen x in relation to y" at the core level.

They do not learn, they do not think, they do not reason. They are probability engines. If anyone tells you their LLM is not, it has just been painted over in snake oil to appear otherwise.
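
To make the Markov comparison concrete, the whole idea fits in a few lines (a deliberately tiny sketch; an LLM layers enormously more context and math on top, but the "how often does x follow y" core is recognisable):

    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat the cat ate the rat".split()

    # Count which words follow which ("how often have I seen x after y").
    follows = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev].append(nxt)

    # Generate by repeatedly sampling a successor in proportion to frequency.
    word, out = "the", ["the"]
    for _ in range(8):
        successors = follows.get(word)
        if not successors:
            break  # dead end: this word was never seen with a successor
        word = random.choice(successors)
        out.append(word)
    print(" ".join(out))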

Deleted Comment

codedokode · a year ago
I remember similar news about ML services that generate music: they are able to reproduce melodies and lyrics from copyrighted songs (if you find a way around the filters on song or artist titles) and even producer tags in hip-hop tracks.

All this latest ML growth is built on massive copyright violations.

tomp · a year ago
It’s not a copyright violation.

Maybe not myself, but many averagely-talented artists can draw Mickey Mouse.

They might even draw one for me if I ask! Or I can just find it on Google… (technically my computer is producing it on the screen…)

That in itself is not a copyright violation. But if I use it in a commercial manner, then it becomes a copyright violation.

Producing copyrighted things isn’t illegal. It’s on the user not to use copyrighted things in a way that’s illegal (not fair use or licenced).

nox101 · a year ago
What were you asking it to do? If it's less than 15 lines, tell me and I'll write the solution myself. You can check how close I get.

My point being that lots of OpenGL is practically boilerplate.
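
To illustrate the boilerplate point: the shader-compile ritual is so standardised that independently written versions come out nearly identical. A sketch using PyOpenGL (assuming a GL context is already current):

    from OpenGL.GL import (
        glCreateShader, glShaderSource, glCompileShader, glGetShaderiv,
        glGetShaderInfoLog, glCreateProgram, glAttachShader, glLinkProgram,
        GL_VERTEX_SHADER, GL_FRAGMENT_SHADER, GL_COMPILE_STATUS,
    )

    def compile_shader(src, shader_type):
        # The canonical create / source / compile / check sequence; almost
        # every GL codebase contains a function that looks just like this.
        shader = glCreateShader(shader_type)
        glShaderSource(shader, src)
        glCompileShader(shader)
        if not glGetShaderiv(shader, GL_COMPILE_STATUS):
            raise RuntimeError(glGetShaderInfoLog(shader))
        return shader

    def make_program(vertex_src, fragment_src):
        program = glCreateProgram()
        glAttachShader(program, compile_shader(vertex_src, GL_VERTEX_SHADER))
        glAttachShader(program, compile_shader(fragment_src, GL_FRAGMENT_SHADER))
        glLinkProgram(program)
        return program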

ativzzz · a year ago
> there was no license provided for the code or the repo

Interesting - I had assumed any code that's not licensed is "free to use for whatever purpose I want".

_rend · a year ago
Not at all: unless a license is provided, the code is fully protected under copyright and you have _no_ rights to copy it or use it in _any_ way you want (unless falling under "fair use" clauses for the jurisdiction you're in/the author is in).
paulddraper · a year ago
The opposite
johnnyanmac · a year ago
no license[0] is the default fallback when nothing is provided. Realistically, it's "use at your own risk", because someone who doesn't license may not even be aware of what others do with it (or you fall back to whatever rules of the platform you posted on).

https://choosealicense.com/no-permission/

skybrian · a year ago
Uh, no, this is not how copyright law works.
rlpb · a year ago
If there's only one way to do it, or a developer familiar with the field would independently come up with the same way of doing it, then the "copyrightability" of the result comes into question.

This doesn't stop you getting yourself a legal headache though, of course.

MLskynetio · a year ago
I do not understand your problem.

Do you want to live in fear as a software developer because you did the same thing as others? Even when the problem has only a limited number of ways of solving it?

Do you want a certain amount of code to be copyrighted and patented?

I personally don't. I see an advantage in limited patents for complex magical algorithms where someone was really sitting there solving a hard problem, so they can reap the benefits for a short period of time, but otherwise no.

I do not want to check every code block for some (c) or patent.

kreyenborgi · a year ago
This has happened quite a few times to me as well, both with ChatGPT and Phind (Phind in particular is often basically Stack Overflow with a few variable names changed).
ars · a year ago
One way that can happen is if your prompt and context are so specific that this copied code is the only thing that matches.

This would also imply that this specific question is a rare one, with few examples online for it to train on.

johnnyanmac · a year ago
>I think some of the confounding variables here were: 1. this was a very specific use case and not many examples existed, and 2. all OpenGL code looks similar, to a point.

Yeah, that's why I wouldn't trust AI right now with much except the most basic rendering boilerplate. I'd be so brazen as to wager that 90% of the most valuable rendering recipes are proprietary code within some real-time rendering studio. Of the remainder, half is in some textbook and may or may not even be available to scrape online.

LLMs still need a training set, and I'm not convinced the information even exists to be scraped on the public internet for this kind of stuff (if years of googling have taught me anything).

nl · a year ago
The original reporting has more details: https://www.developer-tech.com/news/judge-dismisses-majority...

In particular this:

An amended version of the complaint had taken issue with GitHub’s duplication detection filter, which allows users to “detect and suppress” Copilot suggestions matching public code on GitHub.

The developers argued that users turning off this filter would “receive identical code” and cited a study showing how AI models can “memorise” and reproduce parts of their training data, potentially including copyrighted code.

However, Judge Tigar found these arguments unconvincing. He determined that the code allegedly copied by GitHub was not sufficiently similar to the developers’ original work. The judge also noted that the cited study itself mentions that GitHub Copilot “rarely emits memorised code in benign situations.”

I think this is the key point: reproduction is the issue, not training. And as noted in the study[1] reproduction doesn't usually happen unless you go to extra lengths to make it.

[1] Not sure but maybe https://dl.acm.org/doi/abs/10.1145/3597503.3639133? Can anyone find the filing?

omeid2 · a year ago
> reproduction doesn't usually happen unless you go to extra lengths to make it.

And who is to say that people who want to copy your code without adhering to your license terms or paying won't go to extra lengths? Or am I missing something here?

nl · a year ago
> And who is to say that people who want to copy your code without adhering to your license terms or paying won't go to extra lengths? Or am I missing something here?

It seems like they could just download your code from GitHub and violate your license like that, so it's unclear why they'd bother doing it via Copilot.

umeshunni · a year ago
At that point, they might as well copy your original code without going through an LLM to do it
ec109685 · a year ago
Wouldn’t the person themselves be in violation at that point, and couldn’t the owner of the code go after them? (I know this wouldn’t be super practical, but it seems to match what would have happened without an LLM in between.)
williamcotton · a year ago
Well then you end up with a work that a judge will allow to proceed to trial.

Only expressive software is protected by copyright and sometimes that interpretation should be handled by a jury.

nox101 · a year ago
This is legal for a person to do, right? Why should it not be legal for an LLM to do?

AFAIK, but IANAL, I can go look at a solution in a GPLed library and then code that solution in my proprietary code base. As in, "oh, I see, they used a hash map for that and a list of this, and locked at this point. I'll code up something similar." As long as you don't "copy the code", you're fine.

Am I wrong?

Is it just a matter of scale? (hey there LLM, re-write all of OpenOffice into ClosedOffice).

mistercow · a year ago
It’s trained on publicly available code, so what would be the point of that? If you’re looking to specifically infringe the copyright of code available on the open web, using an LLM code completion engine is just about the most roundabout and unreliable way to achieve that.
1vuio0pswjnm7 · a year ago
"Can anyone find the filing?"

https://arxiv.org/pdf/2202.07646

Personally, I would not rely on blogs like "developer-tech.com" for unbiased information about "AI" litigation.

I would read the order and draw my own conclusions.^1 (Note the plaintiffs are attempting to file an interlocutory appeal re: the dismissal of the DMCA claims.)

1 https://ia904708.us.archive.org/6/items/gov.uscourts.cand.40...

No doubt I am in the minority but I am more interested in the contract claims, which have survived dismissal, than in the DMCA ones. If plaintiffs can stop the "training" of "AI" through contract, then in theory DMCA violations can be avoided.^2

2 For example, license terms which specifically prohibit using the licensed source code to train language models.

Is there a "fair use" defense to contract liability?^3

3 https://cdn.ca9.uscourts.gov/datastore/opinions/2006/05/16/0...

User23 · a year ago
> Is there a "fair use" defense to contract liability?

NAL, but in common law jurisdictions and maybe others there can be various implicit terms to any contract, like fair dealing.

Also if you can literally claim fair use, unless you signed a contract waiving that right (if that's even possible), it doesn't matter.

Heck, most software licensing in the USA is purporting to grant you rights that you already have from the Copyright Act. That's right, in US law you own the authorized copy you receive. The license claiming that you don't is questionable at best. To be fair the courts somehow have managed to become divided on this, but the plain language of the law itself is crystal clear, explicitly granting the right to make additional copies as necessary for execution.

Also that hardly matters when the fake license can still be "enforced" via lawfare. Most everyone is going to choose to pay up rather than fight a protracted legal battle against Microsoft.

jprete · a year ago
IANAL but I think for the specific dismissed claims in this specific case, reproduction is the issue, and it doesn't indicate anything about training.

I think it would be extremely hard to make claims against GitHub for training AI with code on GitHub, assuming GH has the typical "right to use data to improve service" clause that usually shows up on free-service EULAs.

aezart · a year ago
> I think this is the key point: reproduction is the issue, not training. And as noted in the study[1] reproduction doesn't usually happen unless you go to extra lengths to make it.

But Microsoft is selling a service capable of such reproduction. They're selling access to an archive containing copyrighted code.

To me it's the equivalent of selling someone a DVD set of pirated movies. The DVD set doesn't "reproduce" the copyrighted material unless you "prompt" it to (by looking through the set to find the movie, and then putting it in your DVD player), but it was already there to begin with.

nl · a year ago
Strongly disagree with your analogy here. Lots of services are capable of doing things that are against the law but in general it's the actual breaking of the law that is prosecuted.

The closest thing to what you are suggesting is the Napster ruling, where a critical part was that the whole service was only about copyright infringement. In the GitHub case most people are using it to write original code, which is not a copyright violation, so there is substantial non-infringing use.

But what I think doesn't matter. The judge disagreed with that interpretation too.

Dead Comment

Deleted Comment

darby_nine · a year ago
Huh I guess you can just avoid legal liability by laundering through a chatbot
IncreasePosts · a year ago
Another way to look at it is anyone can look at source available code to learn how to program without breaking a license.
saurik · a year ago
And if you then write a program that is remarkably similar to the one you read, that's copyright infringement. As another reply noted (but without anywhere near enough verbosity), this is not without risk. People who intend to work on similar systems often use a clean-room strategy: burn one engineer by having them read the original code, have them document it carefully with a lawyer to remove all expressive aspects, and then have a separate engineer develop it from the clean documents.
anileated · a year ago
Anyone can, that’s orthogonal. This is about an automated tool that launders copyright at scale, generating revenue for its operator.

(And if you seriously say that this tool is learning how to program, ask yourself if that tool’s operator is effectively a slave owner.)

dgfitz · a year ago
> Another way to look at it is anyone can look at source available code to learn how to program without breaking a license.

Yes, and exactly ZERO money has changed hands in this scenario. Learning is dope; the more the better.

The difference is, someone makes money off it, and not the person(s) that wrote the code. This is not a valid argument.

croes · a year ago
Isn't it fascinating that the same isn't true for books and music?

If it's too similar you get sued

kelnos · a year ago
And anyone can also look at that source available code, write their own version, distribute it, be sued for copyright infringement, and lose in court, because their version is too similar to the original.

Deleted Comment

ml-anon · a year ago
For the slow ones among us: Machine "learning" is not human learning. It is not similar to, analogous to, or in any way remotely comparable.
Spivak · a year ago
The significant step here is that it isn't a person doing what you say, because there's no human in the loop looking at source code and learning from it.

You have an autonomous system that's ingesting copyrighted material, doing math on it, storing it, and producing outputs on user requests. There's no learning or analogy to humans, the court is ruling that this particular math is enough to wash away bit color. The ruling was based on the outputs and the reasonable intent of the people who created it and what they are trying to accomplish, not how it works internally.

It's not the first, if you take copyrighted data and && 0x00 to all of it that certainly washes the bits too.
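
Taking that quip literally for a second, a sketch:

    # AND-ing every byte with 0x00 maps any work to all zeroes; whatever
    # was copyrightable about the bits is unrecoverably gone.
    data = b"Steamboat Willie"
    print(bytes(b & 0x00 for b in data))  # b'\x00\x00\x00...'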

NoMoreNicksLeft · a year ago
The idea that the LLM violates copyright by reading/viewing a work is the same idea that you violate the copyright by reading or viewing the work. Perhaps you're creating an organically encoded copy of the work within your brain.

No copies are being made, and definitely no copies are being sold.

timhh · a year ago
No you can't, any more than you can encode a Disney film as a prime number or in the digits of pi and avoid copyright that way.

Read this classic essay: https://ansuz.sooke.bc.ca/entry/23
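
The "it's just a number" move is mechanically trivial, which is part of the essay's point that representation alone can't launder the bits. A sketch, with the byte string standing in for an actual work:

    # Any file is "just" one huge integer, and the round trip is lossless,
    # so nothing about what the bits represent has changed.
    work = b"Entire contents of a copyrighted film"
    n = int.from_bytes(work, "big")
    restored = n.to_bytes((n.bit_length() + 7) // 8, "big")
    assert restored == work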

nimbius · a year ago
What the judge isn't arguing is the encoding... he's stating Copilot:

“rarely emits memorised code in benign situations.”

So, you could in fact encode 5,000 copies of Mulan in different formats and, so long as 4,999 are not verbatim copies, you're good*

*you must affix the letters "AI" to the encoder

stale2002 · a year ago
No, not really. You mistake what the purpose of copyright is.

If I used a chatbot to sell the entire text of Harry Potter, all at once, that would still be illegal even though it's through a chatbot.

What's legal, of course, is creating transformative content, learning from other content, and mostly creating entirely new works, even if you learned/trained from other content about how to do that. Or even if there are some similarities, or even if there were verbatim "copies" of full sentences like "he opened the door" that were "taken" from the original works!

Copyright law in the USA has never disallowed you entirely from ever using other people's works, in all circumstances. There are many exceptions.

kelnos · a year ago
> Copyright law in the USA has never disallowed you entirely from ever using other people's works, in all circumstances. There are many exceptions.

Sure, and the question is: "does using an AI chatbot like Copilot fall under one of those exceptions?" My position -- as well as the position of many here -- is that it shouldn't. You may disagree, and that's fine, but you're not fundamentally correct.

darby_nine · a year ago
> If I used a chatbot to sell the entire text of harry potter, all at once, that would still be illegal even though its through a chatbot.

Right, which is why you sell access to the chatbot with a knowing wink.

> You mistake what the purpose of copyright is.

At one point it was to ensure individual creators could eke out a living when threatened by capital. I frankly have no clue what the current legal theory surrounding it is.

UncleMeat · a year ago
The law doesn't work this way. Deliberately circumventing copyright via something like Copilot will have different consequences, even if the eventual outcome is that Copilot is allowed to train on open source code that has restrictive licenses.
darby_nine · a year ago
> The law doesn't work this way. Deliberately circumventing copyright via something like Copilot will have different consequences, even if the eventual outcome is that Copilot is allowed to train on open source code that has restrictive licenses.

Copilot is a deliberate circumvention of copyright. It might be legal but that doesn't change the clear intent here: charging people without having to do the work you're charging for.

austin-cheney · a year ago
The comments seem to misunderstand copyright. Copyright protects a literal work product from unauthorized duplication and nothing else. Even then there are numerous exceptions like fair use and personal backups.

Copyright does not restrict reading a book or watching a movie. Copyright also does not restrict access to a work. It only restricts duplication without express authorization. As for computer data the restricted duplication typically refers to dedicated storage, such as storage on disk as opposed to storage in CPU cache.

When Viacom sued YouTube for $1.6 billion they were trying to halt the public from accessing their content on YouTube. They only sued YouTube, not YouTube users, and only because YouTube stored Viacom IP without permission.

BeefWellington · a year ago
> When Viacom sued YouTube for $1.6 billion they were trying to halt the public from accessing their content on YouTube. They only sued YouTube, not YouTube users, and only because YouTube stored Viacom IP without permission.

Now do these steps for OpenAI instead of YouTube. Only OpenAI doesn't let users upload content, and instead scraped the content for themselves.

anticensor · a year ago
OpenAI actually lets users upload content to the chat input.
advisedwang · a year ago
From the article it sounds like the plaintiffs were alleging that ChatGPT is effectively doing unauthorized duplication when it serves results that are extremely similar or identical to the plaintiff's code. They aren't just alleging that reading their code = infringement like you seem to imply.
paulddraper · a year ago
I don't think the author was implying that.

But yes, that is the charge.

maronato · a year ago
The judge argues that Copilot “rarely emits memorised code in benign situations”, but what happens when it does? It is bound to happen some day, and when it does, would I be breaching copyright by publishing the code Copilot wrote? Just a few weeks ago a very similar suit over Stable Diffusion had its motion to dismiss copyright infringement claims denied. https://arstechnica.com/tech-policy/2024/08/artists-claim-bi...
dragonwriter · a year ago
> The judge argues that Copilot “rarely emits memorised code in benign situations”, but what happens when it does? It is bound to happen some day, and when it does would I be breaching copyright if I, unknowingly, published the code Copilot wrote?

That's irrelevant to the case being made against GitHub, which is why it is addressed in the decision.

> Just a few weeks ago a very similar suit for stable diffusion had its motion to dismiss copyright infringement claims denied.

The case against Midjourney, SAI, and RunwayML is based on a very different legal theory -- it is a simple direct copyright violation case ("they copied our work onto their servers and used it to train models") whereas the Copilot case (the copyright part of it) is a DMCA case claiming that Copilot removes copyright management information.

It's not really surprising that the Copilot case was easier to dispose of without trial; it was a big stretch, but it had the advantage for the plaintiffs that, were it allowed to go forward, it wouldn't admit a fair use defense the way a traditional direct copyright violation case does.

They aren't really "similar" except that both are lawsuits against AI service/model providers that rest some subset of their claims on some part of Title 17 of the US Code.

slavik81 · a year ago
I am not a lawyer, but I explore these questions by imagining an existing situation with a human. If your friend gave you code to publish and it turned out he gave you someone else's code that he had memorized, would you be breaching copyright? The answer in that case is plainly yes, and I think it would be no different with an LLM.

Substituting a human for a computer changes some aspects of the situation (e.g., the LLM cannot hold copyright over the works it creates), but it's useful because it leaves the real human's actions unchanged. However, for more complex questions that interact with things like work-for-hire contract law, you may need to take a more sophisticated approach.

blackoil · a year ago
You'll get a second system that searches your code against an index of copyrighted code. If, say, more than 70% matches some unique code, it will be flagged for rewrite. This can be automated in Copilot by simply regenerating with a different seed.
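
A minimal sketch of that kind of gate, with a hypothetical generate(prompt, seed=...) call standing in for the model and plain difflib similarity standing in for a real near-duplicate index:

    import difflib

    THRESHOLD = 0.70  # flag anything more than 70% similar to indexed code

    def too_similar(candidate, corpus):
        return any(
            difflib.SequenceMatcher(None, candidate, known).ratio() > THRESHOLD
            for known in corpus
        )

    def generate_checked(prompt, corpus, generate, max_tries=5):
        # Keep regenerating with a different seed until the output clears
        # the similarity filter (or we give up).
        for seed in range(max_tries):
            candidate = generate(prompt, seed=seed)
            if not too_similar(candidate, corpus):
                return candidate
        raise RuntimeError("no sufficiently original candidate produced")
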
bschmidt1 · a year ago
In some languages there are few ways (or one way) to do things, so everyone writes the same-looking for loops, etc. And sometimes in the same order, with the same filenames, etc., by convention, especially in the case of heavy framework usage, where most people's code is mostly identical percentage-wise. The flagging system would have to be able to identify framework usage separately from IP and so on.

Beyond that, it seems like you'd need a highly expressive language for this to work well. You can effectively scan for plagiarism in English because it's so varied that it really is an outlier to see several lines of text that are identical across different sources, but maybe it's not that strange to see entirely identical files, or at least very similar code, in totally distinct, say, React or Ruby-on-Rails projects.

I think of code methodologies as more like construction techniques. Maybe some pieces and parts are patentable, and some can even be productized as tools, but a lot of it is just convention and technique.

hadlock · a year ago
Looking forward to the "rewrite this over and over until it no longer triggers the copyright-warning alarm" button in my IDE

Deleted Comment

Analemma_ · a year ago
The same thing that happens if you write a song which happens to have the same pattern of four notes as another song: absolutely nothing, because that would be an insane standard to hold copyright to and would lead to nothing ever being produced without a tidal wave of infringement suits.

Deleted Comment

to11mtm · a year ago
Depends...

But frankly one could ask whether this would be closer to 'Sampling'...

kelnos · a year ago
> and when it does would I be breaching copyright by publishing the code copilot wrote?

Presumably OpenAI would be committing copyright infringement by even displaying that code to you, if it does not have a license to do so.

to11mtm · a year ago
'Normally' IIRC you would still be a party to the lawsuit.

Might get out of it depending on specifics but you'd be a party.

OTOH, I -thought?- that part of at least enterprise Copilot was that they would defend in such cases.

ChrisArchitect · a year ago
Misleading OP,

Discussion from July:

Judge dismisses DMCA copyright claim in GitHub Copilot suit

https://news.ycombinator.com/item?id=40919253

AnimalMuppet · a year ago
Interesting. The parts that survived are the contract claims and the open-source license claims.

Contract is understandable - it supersedes almost everything else. If the law says I can do X but the contract says I can't, then I almost certainly can't.

It's nice to see open-source licenses being treated as having somewhat the same solidity as a contract.

tialaramex · a year ago
The FSF's argument for their copyleft was always based on exactly the same foundations as typical copyright licenses. If Alice can say that you must pay her $500 to do X with her copyrighted thing, then logically Bob can say that you must obey our rules to do X with his copyrighted thing.

This invites courts to pick: smash copyright (which would suit the FSF fine) or enforce their rules just the same (also fine). It makes it really difficult for a court, even one motivated to do so, to thread the needle and find a way to say Alice gets her way but Bob does not.

Structuring your arguments so that it's difficult for motivated courts to thread this needle is a good strategy when it's available. If you're lucky a judge will do it for you, as in Carlill v Carbolic Smoke Ball Co (the foundation of contract law) or indeed Bostock v. Clayton County - hey, says Gorsuch, the difference between this gay man and this straight woman isn't that they're attracted to men, that's the same - the actual difference is one of them is a man, but, that's sex discrimination, so this is a sex discrimination case!

panic · a year ago
If you have access to the Copilot weights, you should consider leaking them. We shared our code with you because we wanted it to be free, not sold back to us at $10/month.
ozr · a year ago
Fwiw, I've never paid for Copilot. I was automatically given free access for open source contributions. My largest public repo had maybe 100 stars. I've made minor commits to larger repos.

I don't know what the threshold is, but I'm fine with the trade-off I received.

stale2002 · a year ago
Then you should be happy to know that there are multiple sets of open-source coding model weights out there already! Some are as good as, or possibly better than, Copilot.

That should satisfy anyone who actually cares about this, as opposed to only being interested in making snappy gotcha one-liners.

tbrownaw · a year ago
... because GPU-hours are worthless?
e-clinton · a year ago
So if a developer uses “free code” in their software, must they only do it for free?
amanaplanacanal · a year ago
That depends on the original license.