Tool to convert copyrighted music into fair use

I've seen a lot of people ragging on Copilot for "copy+pasting" code - does anyone have links to cases where it has done this without the user intentionally trying to generate a specific (extremely famous) code snippet?

I've seen tons of comments here and on Reddit that talk about multiple instances of entire functions being copied verbatim, but the only thing even remotely close to that I've seen is the fast inverse square root, so I must have missed a few tweets or something.

tyingq · 4 years ago

Is it clear how many people have access? If access is fairly limited, I'm not surprised at the low number of examples.

Here's an example from their own docs:

Compare:

https://docs.github.com/assets/images/help/copilot/example_r...

To: https://github.com/nilavghosh/OMSCS/blob/master/AI4Robotics-...

sellyme · 4 years ago

> Is it clear how many people have access? If access is fairly limited, I'm not surprised at the low number of examples.

I wouldn't be too surprised either; what surprised me was the large number of people who explicitly said things like "many examples" and then couldn't link more than one.

> Here's an example from their own docs:

Thank you! This is definitely a much stronger example than the Quake one. Looking at the git repository it seems like this is generic startup code for a university assignment, so it probably shows up a significant number of times in the training data.

While it definitely makes sense to interpret a piece of code showing up several hundred times in the exact same format as being okay to straight-up copy+paste (e.g., imports, some boilerplate in more verbose languages, keyboard input switch statements), this does seem to highlight that Copilot can't distinguish between code snippets that are always identical because that's just the correct way to do it, and code snippets that are always identical because only one person did that thing, and it just happens to be in hundreds of repositories for one reason or another.

geekraver · 4 years ago

I wonder how many of these verbatim examples exist because the training data code itself occurs multiple times on GitHub, because it in turn was copied from an upvoted answer on StackOverflow, LOL.

meibo · 4 years ago

It only seems to do so if you give it no or very little "source" input, like an empty file with a comment that says "// X algorithm".

There's been a lot of bikeshedding on this, but GitHub decidedly hasn't given enough information on how it works and what the training dataset is, and the fair use question definitely needs to be answered, maybe even in court - it's just a matter of time.

MadVikingGod · 4 years ago

What I think is interesting and not talked about in the Copilot cases is that there are actually two different copyright actions that come from using it.

The first is whether GitHub had a license to distribute the code it trained on. This assumes that the code Copilot produces is a derivative work, which I don't think would be much of a stretch. For this I think most of the licenses in use (GPL, MIT, etc.) allow someone to distribute in such a way.

The second concerns the user of Copilot. They would be receiving and distributing code without knowing the original license, but not knowing wouldn't be a viable defense against infringement. To actually comply with most licenses they would have to follow the licenses' requirements.

In both cases I don't know if fair use would really apply. Maybe GitHub could stretch the research aspect, but if you just use the code in your product there is no fair use.

Hamuko · 4 years ago

>what the training dataset is

All non-private repositories on GitHub.

uberswe · 4 years ago

I have access and hints for entire functions have only appeared from code already in the same file. It uses the file you are in for context.

It also says on their website that the AI may generate api keys that look real but it’s actually just a “fake” key that the AI generated as placeholder.

I have had access for one day and spent my entire Saturday playing with it while working on an addon for a game. I find it useful, and most of the hints come from other code I have in the same files, which saves me time or lets me know when I’m too repetitive :)

tyingq · 4 years ago

"It also says on their website that the AI may generate api keys that look real but it’s actually just a “fake” key that the AI generated as placeholder."

Maybe this is what you meant, but that's already been shown to be untrue. https://fossbytes.com/github-copilot-generating-functional-a...

dogecoinbase · 4 years ago

It's happily spitting out licenses and copyright notices with other people's names on them; it's pretty clearly half-baked.

sellyme · 4 years ago

While that's obviously a UX flaw that definitely shouldn't have made it to release, I find it hard to envision it ever actually being a problem. If someone's accepting auto-generated copyright notices and licenses that don't actually apply, they can't really point the finger at GitHub - it's called "Copilot", not "Pilot".

The problem with full snippets of arbitrary obscure repositories being copy+pasted is that there's no realistic way for a user to know if that's happening without putting in more effort than just writing the code themselves, somewhat defeating the point. That's not really the case when the first line of the file contains "(C) Someone Else 2003".

dwild · 4 years ago

> I've seen a lot of people ragging on Copilot for "copy+pasting" code - does anyone have links to cases where it has done this without the user intentionally trying to generate a specific (extremely famous) code snippet?

It's not important that the examples are really specific; the issue is that it does happen. This tool has the potential to infringe copyright (sure, it does seem legal in some places, but that doesn't make it any more right).

Look at how EA reverse engineered the Genesis in the past [1]. They had 2 teams, one that did the reversing, and another that did the implementation. That made it safe to say that no infringing code was going through. Plenty of emulator developers try to avoid source code leaks for a similar reason. Copilot can't do that.

The fact that the copyright status of the code is hard to determine doesn't mean it's not copyrighted code nonetheless.

Personally I've got nothing against this kind of technology, but I do have a pretty big issue with how it learns. When people published their code on GitHub, they didn't know it could be used for machine learning, and it's bad that it is. If GitHub had asked for the copyright over the code to build Copilot, I wouldn't mind.

[1] https://www.youtube.com/watch?v=x0qe1FNqtCo&t=280s

IshKebab · 4 years ago

GitHub did an analysis and found that it does do it, though very rarely, and usually when it has little context (e.g. at the start of a file). They're working on detecting those cases, though, so that it doesn't happen accidentally, so it is unlikely to be a realistic problem.

When did the HN crowd become so defensive of copyright? I understand the concerns on copilot but it’s kinda weirding me out.

aurelian15 · 4 years ago

As weird as it may seem, you should not forget that free software licenses are built upon the fabric of copyright. Without copyright, free software could not exist in its current form. For GPL-like "copyleft" licenses, there would be no way to enforce that binary distributions of derived works are accompanied by their source code. Similarly, in the context of permissive BSD/MIT-style licenses, there would be no way to enforce attribution.

So, given that FOSS---which a large portion of the HN crowd depends on---cannot work without copyright (at least not in its current form), the recent discussions may be less of a surprise.

cortesoft · 4 years ago

Maybe... although I personally think that the GPL and other 'copyleft' licenses aren't the reason open source has prospered, nor do I think enforcing attribution really helps the FOSS world that much.

People write and share code because it is useful to do that, not because licenses require them to.

I think FOSS would do fine with no copyright, and in fact more software might end up open source if we had ZERO copyright... why not make your code open source and get back contributions when your code would end up being shared anyway?

throw0101a · 4 years ago

> When did the HN crowd become so defensive of copyright?

Copyright is good in limited quantities. The current multi-decade time horizon is probably what a lot of people are against, and not the concept in general.

And limited time period seems to be consistent through history. From the paper "Copyrights and Creativity: Evidence from Italian Opera in the Napoleonic Age":

> Comparing changes in the creation of new operas across Italian states with and without copyrights, we show that the adoption of basic copyrights encouraged the creation of new work. Moreover, we find that copyrights changed the quality of creative output by encouraging composers to produce more popular and durable works. These results generalize to a broader set of musical compositions and to librettos, as the literary component to the score of operas. Based on these findings, we conclude that the adoption of basic levels of copyright protection – not exceeding the lifetime of the composer – can help to raise both the quantity and the quality of new creative works.

> Importantly, we find that extensions in the length of copyright beyond the composer’s life did not encourage creativity. Performance data reveal that few operas were played after the first 20 years, which suggests that only the most durable creative goods stand to gain from copyright extensions. […]

* https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2505776

ReactiveJelly · 4 years ago

Both the permissive and copyleft licenses are only enforceable through copyright law.

I don't mind that copyright exists, I just wish it was better.

Also there's a power difference between individuals violating the rights of a big company, and a big company violating the rights of many individuals.

If Copilot isn't reined in, it feels like yet another case of "The laws only apply to poor people".

lupire · 4 years ago

What does it mean to "enforce" a "permissive" license?

PaulKeeble · 4 years ago

Because it's my code (and that of many of us) that they have "learnt" from, stripped the license from, and intend to sell on. When we listed code under MIT or GPL we meant those licenses; they weren't random. Microsoft just seems to be completely ignoring the reality of reproducing works that are covered by those licenses: they are making open source code private and paid for. Not OK.

breck · 4 years ago

"The heathen are sunk down in the pit that they made: in the net which they hid is their own foot taken"

Copyright is a horrible system. Microsoft has been one of the biggest proponents of that system. But now they've clearly violated it. They should either join in abolishing it, or face its consequences.

michaelmrose · 4 years ago

Consider people's reaction to someone selling bootleg DVDs vs torrenting a movie. Although people may consider both morally incorrect, the corrupting profit motive results in the former being seen far more negatively. In the current situation there is also the matter that Microsoft is still perceived, rightly I think, very negatively and open source authors very positively. Also, in a David v. Goliath situation nobody wants to be seen rooting for the giant.

Personally I would be concerned about insert corp here accidentally stealing code from an open source project then years later going after the open source project for copyright infringement regarding the code they in fact stole from the open source project.

carom · 4 years ago

I guarantee this is not Microsoft's announcement that they are forfeiting their copyrights. This is just them abusing the spirit of ours.

NiceWayToDoIT · 4 years ago

Probably because when poor people give something for free to other people to lift them out of poverty it is called empowerment, but when billionaires take the free work of poor people for their own selfish gain it is called exploitation.

hjek · 4 years ago

Say your AGPL code is Copiloted into someone's new program and they decide to release that under a non-free license; that's the issue. We're defensive of copyleft.

clusterfish · 4 years ago

This submission aside, it seems that most people are just concerned about getting sued on copyright grounds for using copilot.

zarzavat · 4 years ago

It's hypocrisy. People will defend entire books and research papers being shared on libgen/scihub, which is unarguably actual copyright infringement on a massive scale, but training an AI on open source code is somehow the worst thing ever even if there's no case law to say that this constitutes infringement at all.

abrokenpipe · 4 years ago

It's not really that hypocritical when you understand people's perspective on it. In general, people (here on HN and in the open source community) care about sharing experiences and knowledge. Copyright on open source content does not inhibit people's ability to learn from it. There's also the whole "big corp vs little guy" mentality at play here. If Copilot were open source then I don't think anyone would have an issue with it; I actually think people would respond well to it if that were the case.

Dylan16807 · 4 years ago

It's not hypocrisy to say that different types of work should have different copyright lengths, and that for some types the answer is zero.

Especially if you remember the phrase "to promote the progress of science and useful arts".

asddubs · 4 years ago

I'm all for Copilot if Microsoft gets treated the same way libgen/scihub are for creating it, or if we abolish copyright. But the fact that they can just decide to do this and it's fine, while scihub gets DNS-blocked, reveals the asymmetry at hand here.