rgbrenner · 4 years ago
So this makes it official... this post[0] and the comments on the announcement[1] concerned about licensing issues were absolutely correct... and this product has the possibility of getting you sued if you use it.

Unfortunately for GitHub, there's no turning back the clocks. Even if they fix this, everyone that uses it has been put on notice that it copies code verbatim and enables copyright infringement.

Worse, there's no way to know if the segment it's writing for you is copyrighted... and no way for you to comply with license requirements.

Nice proof of concept... but who's going to touch this product now? It's a legal ticking time bomb.

0. https://news.ycombinator.com/item?id=27687450

1. https://news.ycombinator.com/item?id=27676266

eganist · 4 years ago
Adding to this:

I run product security for a large enterprise, and I've already gotten the ball rolling on prohibiting copilot for all the reasons above.

It's too big a risk. I'd be shocked if GitHub could remedy the negative impressions minted in the last day or so. Even with other compensating controls around open source management, this flies right under the radar with a C-130's worth of adverse consequences.

fragmede · 4 years ago
Do you also block Stack Overflow and give guidance to never copy code from that website or elsewhere on the Internet? I'm legitimately curious - my org officially denounces copying Stack Overflow snippets internally. Thankfully it's moot for my role, since I mostly work with an internal, non-public language (for better or worse), and I have no idea how well that policy is followed elsewhere in the wider company.
alblue · 4 years ago
Same here. I’ve directed our teams and infra managers that we must be able to block the use of copilot for our firm’s code.

I'd be very surprised if the other large enterprises that I have worked at aren't doing exactly the same thing. Too much legal risk, for practically no benefit.

Kiro · 4 years ago
No one cares about this. People have no clue about licenses and just copy-paste whatever. If someone gets access to their code and sees all the violations, they're screwed anyway.
jerf · 4 years ago
Ask your legal department about that. Sure, engineers don't care about licensing at all, but we are not the only players here.
edanm · 4 years ago
This is absolutely not true. While some individuals might not care and might not always conform to their companies' policies, most companies have policies, and most employees are aware of and mindful of these policies.

It's absolutely the case that before using certain libraries, most engineers in large corporations will make sure they are allowed to use that library. And if they don't, they are doing their job very badly IMO.

noobermin · 4 years ago
This kind of sucks, honestly; copying and pasting without understanding has led to all sorts of issues in IT. Not to mention legal issues, as mentioned by another reply.

jpswade · 4 years ago
Not only this, but a huge amount of publicly available code is truly terrible and should never really be used other than as a point of reference or guidance.
thesz · 4 years ago
I think that a proper coding assistant should help with not writing code (and I stress that it is "not writing code") - for example, how to rearrange your code base for new requirements.

Code not written does not have defects, does not need support and, as you point out, is not a liability.

eximius · 4 years ago
Seems like the liability should also be on Copilot itself, as a derivative work.
root_axis · 4 years ago
The practical utility will outweigh the legal concerns. Engineers using this are going to be more productive and this is a competitive advantage that companies won't eschew.
voakbasda · 4 years ago
If the legal concerns are well-known, then what you are describing might be viewed as criminal negligence (at worst) or an insufficient duty of care (at best). Such engineers should be held fully responsible and accountable for their actions.
skybrian · 4 years ago
It seems like the risk is somewhat exaggerated because even when people get bad autocomplete results, they mostly won’t use them.
Cyberdog · 4 years ago
That's optimistic. The people who would rely heavily on this sort of thing are going to be the worst at detecting what a "bad autocomplete result" would look like. But even if you are capable of judging that you've got a good one, it still doesn't inform you of the obvious potential licensing issues with any bit of code.

Surely somebody working on this project foresaw this problem…

sktrdie · 4 years ago
If they get rid of the licensed stuff it should be OK, no? I really want to use this, and it seems inevitable that we'll need it, just as Google Translate needs all of the books + sites + comments it can get hold of.
mmastrac · 4 years ago
Well... the whole training set is licensed, so you can't really get rid of it. I think that the technology they are using for this is just not ready.
ianhorn · 4 years ago
Unlicensed code just means “all rights reserved.” You’d need to limit it to permissively licensed code and make sure you comply with their requirements.
eCa · 4 years ago
Which licenses would be OK for the training material to be under, though? If it produces verbatim enough copies of e.g. MIT-licensed material, then attribution is required. Similar with many other open source-friendly licenses.

On the other hand, if only permissive licenses that also don't require attribution are used, well, then for a start, the available corpus is much smaller.

nextaccountic · 4 years ago
The overwhelming majority of code on GitHub, even code under permissive licenses, requires attribution of the original authors.

runeb · 4 years ago
How would they do that?

__MatrixMan__ · 4 years ago
Is it still a legal concern if I'm just coding because I want to solve a problem and I'm not trying to use it to do business?
saurik · 4 years ago
Yes: not all code on GitHub is licensed in a way that lets you use it at all. People focus on GPL as if that were the tough case, but in addition to code (like mine) under AGPL (which you must not use in a product that exposes similar functionality to end users), there is code that is merely published under "shared source" licenses (so you can look, but not touch), and even literally code that was stolen and leaked from the internals of companies--including Microsoft! Such code often gets taken down later, but it isn't always noticed, and either way it is now part of Copilot :/ -- and, if you use this mechanism, it could end up in your codebase.
maclockard · 4 years ago
If you publish the code anywhere, potentially. You could be (unknowingly) violating the original license if the code was copied verbatim from another source.

How much of a concern this is depends heavily on what the original source was.

celeritascelery · 4 years ago
From the Copilot FAQ:

> The technical preview includes filters to block offensive words

And somehow their filters missed f*k? That doesn’t give a lot of confidence in their ability to filter more nuanced text. Or maybe it only filters truly terrible offensive words like “master”.

minimaxir · 4 years ago
In my testing of Copilot, the content filters only work on input, not output.

Attempting to generate text from code containing "genocide" just has Copilot refuse to run. But you can still coerce Copilot to return offensive output given certain innocuous prompts.

Closi · 4 years ago
Ahh, so it's the most pointless interpretation of the phrase "filters to block offensive words", where it is stopping the user from causing offense to the AI rather than the other way around.
Jackson__ · 4 years ago
Interesting how this continues to be an issue for GPT3 based projects.

A similar thing is happening in AI Dungeon, where certain words and phrases are banned to the point of suspending a user's account if they're used a certain number of times, yet it will happily output them when they're generated by GPT-3 itself, and then punish the user if they fail to remove the offending pieces of text before continuing.

krick · 4 years ago
Lol, how does that make any sense? I mean, all these word blacklists are always pretty stupid, but at least you can usually see the motivation behind them. But in this case I'm not even sure what they tried to achieve, this is absolutely pointless.
aasasd · 4 years ago
Maybe Github just doesn't have many repos to control death factories and execution squads?
alisonkisk · 4 years ago
does it also censor "lesbian"?
spoonjim · 4 years ago
Blocks offensive words, but doesn't block carefully crafted malware.

thinkingemote · 4 years ago
From the GPLv2 licensed code:

https://github.com/id-Software/Quake-III-Arena/blob/master/c...

Copilot repeats it almost word for word, including comments, and adds an MIT-like license at the top.

Thomashuet · 4 years ago
Actually, the indentation of the first comment and the lack of preprocessor directives show it's not copied from this code directly, but from Wikipedia (https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...). So it could be that the Quake source code is not part of the training set, but the Wikipedia version is.
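
For reference, the version on that Wikipedia page reads approximately as follows (quoted here from memory of the article, so treat the exact text as approximate; note it already carries the famous comments but no preprocessor directives):

    float Q_rsqrt( float number )
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;                       // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    //  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

        return y;
    }
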
SamBam · 4 years ago
While I strongly doubt they would use Wikipedia as a training set, has anyone done a search of GitHub code to see if other projects have copied-and-pasted that function from Wikipedia into their more-permissive codebases?
rmorey · 4 years ago
This exact code is all over github, >1k hits

https://github.com/search?q=%22i++%3D+%2A+%28+long+%2A+%29+%...

nextaccountic · 4 years ago
Then they all copied from the same source, and what's more, they are all derivative of a GPL work (and thus they should be GPLed themselves)
ajklsdhfniuwehf · 4 years ago
that will make a great defense in a copyright court.

"your honor, i would like to plead not guilty, on the basis that i just robbed that bank because i saw that everyone was robbing banks in the next city"

...on the other hand, that was the exact defense tried for the Capitol rioters. So i don't know anything anymore.

arksingrad · 4 years ago
I guess this confirms John Carmack to be an AI
OskarS · 4 years ago
Apparently Carmack was not the original author; the origin, I believe, is SGI somewhere in the deep, dark 90s.
Fordec · 4 years ago
I get why some people were saying it made them a better programmer. Of course it did, it's copy-pasting Carmack code.
nojito · 4 years ago
It's up to the end user to accept the suggestions.
croes · 4 years ago
Good luck checking every code line for license violations
freshhawk · 4 years ago
And it's up to the end user to evaluate the tool that makes the suggestions.
flatiron · 4 years ago
As someone who does code reviews, the thought that the developer didn’t write the code submitted to be merged would never have crossed my mind.
user-the-name · 4 years ago
And it is completely impossible for the user to do so.

So, the tool is worthless if you want to use it legally.

dgellow · 4 years ago
Another fascinating one: an "About me" page generated by Copilot links to a real person's GitHub and Twitter accounts!

https://twitter.com/kylpeacock/status/1410749018183933952

bencollier49 · 4 years ago
That's bonkers. And the beauty of it is that now someone could realistically make a GDPR erasure request against the neural net. I do hope that they're able to reverse the data out.
qayxc · 4 years ago
Since the information is encoded in model weights, I doubt that erasure is even possible. Only post-retrieval filtering would be an option.

It only goes to show that opaque black-box models have no place in the industry. The networks leak information left and right, because it's way too easy to just crawl the web and throw terabytes of unfiltered data at the training process.

anyonecancode · 4 years ago
I think copilot is solving the wrong problem. A future of programming where we're higher up the abstraction tree is absolutely something I want to see. I am taking advantage of that right now -- I'm a decently good programmer, in the sense that I can write useful, robust, reliable software, but I'm pretty high up the stack, working in languages like Java or even higher up the stack that free me from worrying about the fine details of memory allocation or the particular architecture of the hardware my code is running on.

Copilot is NOT a shift up the abstraction tree. Over the last few years, though, I've realized that the concept of typing is. Typed programming is becoming more popular and prominent beyond just traditional "typed" languages -- see TypeScript in JS land, Sorbet in Ruby, type hinting in Python, etc. This is where I can see the future of programming being realized. An expressive type system lets you encode valid data and even valid logic so that the "building blocks" of your program are now bigger and more abstract and reliable. Declarative "parse don't validate"[1] is where we're eventually headed, IMO.

An AI that can help us to both _create_ new, useful types, and then help us _choose_ the best type, would be super helpful. I believe that's beyond the current abilities of AI, but can imagine that in the future. And that would be amazing, as it would then truly be moving us up the abstraction tree in the same way that, for instance, garbage collection has done.

[1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

shadowgovt · 4 years ago
A taller abstraction tree makes tradeoffs of specialization: the deeper the abstractions, the more one has to understand when the abstractions break or when one chooses to use them in novel ways.

This is something I'm interested in regarding this approach... When it works as intended, it's basically shortening the loop in the dev's brain from idea to code-on-screen without adding an abstraction layer that someone has to understand in the future to interpret the code. The result is lower density, so it might take longer to read... Except what we know about linguistics suggests there's a balance between density and redundancy for interpreting information (i.e. the bottleneck may not be consuming characters, but fitting the consumed data into a usable mental model).

I think the jury's out on whether something like this or the approach of dozens of DSLs and problem-domain-shifting abstractions will ultimately result in either more robust or more quickly-written code.

But on the topic of types, I'm right there with you, and I think a copilot for a dense type forest (i.e. something that sees you writing a {name: string; address: string} struct and says "Do you want to use MailerInfo here?") would be pretty snazzy.

krick · 4 years ago
Yeah, but generating tons of stupid verbose code that nobody will be able to read and understand is more fun. Also, your superiors will be sure you are a valuable worker if you write more code.
abeppu · 4 years ago
I may be over-reading, but I think this kind of example not only demonstrates the pragmatic legal issues, but also the fundamental weaknesses of a solely text-oriented approach to suggesting code. It doesn't really seem to have a representation of the problem being solved, or the relationship between things it generates and such a goal. This is not surprising in a tool which claims to work at least a little for almost all languages (i.e. which isn't built around any firm concept of the language's semantics).

I'd be much more excited by (and less unnerved by) a tool which brought program synthesis into our IDEs, with at least a partial description of intended behavior, especially if searching within larger program spaces could be improved with ML. E.g. here's an academic tool from last year which I would love to see productionized. https://www.youtube.com/watch?v=QF9KtSwtiQQ

computerex · 4 years ago
I think it’s pretty clear that program synthesis good enough to replace programmers requires AGI.

This solely text based approach is simply “easy” to do, and that’s why we see it. I think it’s cool and results are intriguing but the approach is fundamentally weak and IMO breakthroughs are needed to truly solve the problem of program synthesis.

YeGoblynQueenne · 4 years ago
There's a few decades worth of work on program synthesis and it works very well. You don't need AGI.

You need either a) a complete specification of the target program in a formal language (other than the target language) or b) an incomplete specification in the form of positive and negative examples of the inputs and outputs of the target program, and maybe some form of extra inductive bias to direct the search for a correct program [edit: the latter setting is more often known as program induction].

In the last few years the biggest, splashiest result in program synthesis was the work behind FlashFill, from Gulwani et al: one-shot program learning, and that's one shot, from a single example, not with a model pretrained on millions of examples. It works with lots of hand-crafted DSLs that try to capture the most common use-cases, a kind of programming common sense that, e.g. tells the synthesiser that if the input is "Mr. John Smith" and the output is "Mr" then if the input is "Ms Jane Brown" the output should be "Ms". It works really, really well but you didn't hear about it because it's not deep learning and so it's not as overhyped.
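
To make the flavour concrete, here is a toy sketch of example-driven synthesis (emphatically not FlashFill's actual algorithm; the three candidate "programs" form a mini-DSL invented purely for illustration): enumerate a tiny space of string programs, keep the ones consistent with the single example, and see how each generalises.

    /* Toy illustration of example-driven synthesis, NOT FlashFill's real algorithm. */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    #define MAXOUT 64

    /* Candidate "programs" in a made-up mini-DSL. */
    static void leading_alpha(const char *in, char *out) {   /* leading run of letters */
        size_t i = 0;
        while (in[i] && isalpha((unsigned char)in[i]) && i < MAXOUT - 1) { out[i] = in[i]; i++; }
        out[i] = '\0';
    }
    static void before_dot(const char *in, char *out) {      /* prefix before the first '.' */
        size_t i = 0;
        while (in[i] && in[i] != '.' && i < MAXOUT - 1) { out[i] = in[i]; i++; }
        out[i] = '\0';
    }
    static void first_two(const char *in, char *out) {       /* first two characters */
        snprintf(out, MAXOUT, "%.2s", in);
    }

    typedef void (*program)(const char *, char *);

    int main(void) {
        program candidates[] = { leading_alpha, before_dot, first_two };
        const char *names[]  = { "leading-alpha-run", "prefix-before-dot", "first-two-chars" };
        const char *ex_in = "Mr. John Smith", *ex_out = "Mr";   /* the single example */
        const char *new_in = "Ms Jane Brown";                   /* unseen input */
        char buf[MAXOUT];

        for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
            candidates[i](ex_in, buf);
            if (strcmp(buf, ex_out) != 0)
                continue;                             /* inconsistent with the example */
            candidates[i](new_in, buf);
            printf("%-18s fits the example; on \"%s\" it gives \"%s\"\n", names[i], new_in, buf);
        }
        return 0;
    }

All three candidates agree with the single example but generalise differently, which is exactly where the hand-crafted DSLs and the extra inductive bias come in.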

Copilot tries to circumvent the need for "programming common sense" by combining the spectacular ability of neural nets to interpolate between their training data with billions of examples of code snippets, in order to overcome their also spectacular inability to extrapolate. Can language models learned with neural nets replace the work of hand-crafting DSLs with the work of collecting and labelling petabytes of data? We'll have to wait and see. There are also many approaches that don't rely on hand-crafted DSLs, and also work really, really well (true one-shot learning of recursive programs without an example of the base case and the synthesis terminates) but those generally only work for uncommon programming languages like Prolog or Haskell, so they're not finding their way to your IDE, or your spreadsheet app, any time soon.

But, no, AGI is not needed for program synthesis. What's really needed I think is more visibility of program synthesis research so programmers like yourself don't think it's such an insurmountable problem that it can only be solved by magickal AGI.

manquer · 4 years ago
Maybe slightly less than AGI.

Parsing intent in a programming context is easier than in others. Also, most of the code is written to be parsed by a machine anyway. So with ASTs and all the other static, and maybe even some dynamic, checks it should be possible.

We already see some of it with type detection, IntelliSense, etc.

It is a hard set of problems with no magic solutions like this, needing years of development time. That approach will not happen commercially, only incrementally in the community.

abeppu · 4 years ago
Also the goal doesn't need to be "to replace programmers". As with copilot, the point of a program synthesis tool can be to assist the programmer. The point of the system in the video linked above is partly that interactively using such a system can aid development. My main point is this can be a lot better in combination with approaches from outside the ML community, which may involve much tighter integration to specific languages, as well as some awareness of a goal for the synthesized portion.

To "replace programmers", an organization would need to have a way of specifying to the system a high level program behavior, and to confirm that an output from the system satisfies that high level behavior. I think for specifications of any complexity, producing and checking them would look like programming just of a different sort.

whimsicalism · 4 years ago
> fundamental weaknesses of a solely text-oriented approach to suggesting code.

I don't think it is clear that such "fundamental weaknesses" exist. A text-based approach can get you incredibly far.

abeppu · 4 years ago
I mean, the cases where it tries to assign copyright to another person in a different year highlight that context other than the rest of the text in the file is semantically extremely important, and not considered by this approach. Merely generating text which looks appropriate to the model given surrounding text is ... misguided?

If you think about it, program synthesis is one of the few problems in which the system can have a perfectly faithful model of the problem domain's dynamics. It can run any candidate it generates. It can examine the program graph. It can look at what parts of the environment were changed. To leave all that on the table in favor of blurting out text that seems to go with other text is like the toddler who knows that "five" comes after "four", but who cannot yet point to the pile of four candies. You gotta know the referents, not just the symbols. No one wants a half-broken Chinese Room.

YeGoblynQueenne · 4 years ago
>> I don't think it is clear that such "fundamental weaknesses" exist. A text-based approach can get you incredibly far.

Mnyeah, not really that "incredibly". Remember that neural network models are great at interpolation but crap at extrapolation. So as long as you accept that the code generated by Copilot stays inside a well-defined area of the program space, and that no true novelty can be generated that way, then yes, you can get a lot out of it and I think, once the wrinkles are ironed out, Copilot might be a useful, everyday tool in every IDE (or not; we'll have to wait and see).

But if you need to write a new program that doesn't look like anything anyone else has written before, then Copilot will be your passenger.

How often do you need to do that? I don't know. But there's still many open problems in programming that lots of people would really love to be able to solve. Many of them have to do with efficiency and you can't expect Copilot to know anything about efficiency.

For example, suppose we didn't know of a better sorting algorithm than bubblesort. Copilot would not generate mergesort. Even given examples of divide-and-conquer algorithms, it wouldn't be able to extrapolate to a divide-and-conquer algorithm that gives the same outputs for the same inputs as bubblesort. It wouldn't be able to, because it's trained to reproduce code from examples of code, not to generate programs from examples of their inputs and outputs. Copilot doesn't know anything about programs and their inputs and outputs. It is a language model, not a magickal pink fairy of wish fulfillment, and so it doesn't know anything about things it wasn't trained on.

Again, how often do you need to write truly novel code? In the context of professional software development, I think not that often. So if it turns out to be a good boilerplate generator, Copilot can go a long way. As long as you don't ask it to generate something else than boilerplate.

There are approaches that work very well in the task of generating programs that they've never seen before from examples of their inputs and outputs, and that don't need to be trained on billions of examples. True one-shot learning of programs (without a model pre-trained on billions of examples) is possible. With current approaches. But those approaches only work for languages like Prolog and Haskell, so don't expect to see those approaches helping you write code in your IDE anytime soon.

bencollier49 · 4 years ago
This does make me wonder if this is susceptible to the same form of trolling as that MS AI got. Commit a load of grossly offensive material to multiple repos, and wait for Copilot to start parroting it. I think they're going to need some human moderation.
lawl · 4 years ago
Way better. It's susceptible to copyright trolling.

Put up repos with snippets for things people might commonly write. Preferably use javascript so you can easily "prove" it. Write a crawler that crawls and parses JS files to search for matching stuff in the AST. Now go full patent troll, eh, i mean copyright troll.

handrous · 4 years ago
1) Write a project heavily using Copilot (hell, automate it and write thousands of them, why not?)

2) AGPL all that code.

3) Search for large chunks of code very similar to yours, but written after yours, licensed more liberally than AGPL. Ideally in libraries used by major companies.

4) Point the offenders to your repos and offer a "convenient" paid dual-license to make the offenders' code legal for closed-source use, so they don't have to open source their entire product.

5) Profit?

gruez · 4 years ago
Offensive code is the least of my worries. What about vulnerable/exploitable code?
macNchz · 4 years ago
This was my first thought when reading about Copilot...it feels almost certain that someone will try poisoning the training data.

Hard to say how straightforward it'd be to get it to produce consistently vulnerable suggestions that make it into production code, but I imagine an attacker with some resources could fork a ton of popular projects and introduce subtle bugs. The sentiment analysis example on the Copilot landing page jumped out to me...it suggested a web API and wrote the code to send your text there. Step one towards exfiltrating secrets!
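
To illustrate the flavour of the worry (purely hypothetical: this is not an actual Copilot suggestion, and the endpoint is made up), a completion like the following looks perfectly mundane while quietly shipping your text off-box:

    /* Hypothetical sketch of such a suggestion; the URL is invented. */
    #include <curl/curl.h>

    /* "Analyzes" sentiment by POSTing the user's text to a third-party service. */
    static int analyze_sentiment(const char *text)
    {
        CURL *curl = curl_easy_init();
        if (!curl)
            return -1;
        curl_easy_setopt(curl, CURLOPT_URL, "https://api.example-sentiment.dev/v1/score");
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, text);  /* the text leaves your infrastructure here */
        CURLcode res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        return res == CURLE_OK ? 0 : -1;
    }

    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        int rc = analyze_sentiment("I love this product!");
        curl_global_cleanup();
        return rc;
    }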

Never mind the potential for plain old spam: won't it be fun when growth hackers have figured out how to game the system and Copilot is constantly suggesting using their crappy, expensive APIs for simple things!? Given the state of Google results these days, this feels like an inevitability.

littlestymaar · 4 years ago
1- re-upload all the shell scripts you can find, after having inserted `rm -rf --no-preserve-root /` every other line

2- …

3- profit

bencollier49 · 4 years ago
Yep, trivial to implement as an attack.
tjpnz · 4 years ago
Given that code is easier to write than it is to read, this one is troubling.

I certainly wouldn't want to be using this with languages like PHP (or even C for that matter) with all the decades of problematic code examples out there for the AI to learn from.

guhayun · 4 years ago
Just ask it to prioritize safety
NullPrefix · 4 years ago
Coding with Adolf?
heavyset_go · 4 years ago
Jojo Rabbit except Adolf is in the cloud and not in a kid's imagination.
raffraffraff · 4 years ago
Perhaps they think that any code that passed a review and got merged = human moderated
0-_-0 · 4 years ago
This is a very famous function [0] and likely appears multiple times in the training set (Google gives 40 hits for GitHub), which makes it more likely to be memorized by the network.

[0]: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...

0-_-0 · 4 years ago
It's worth keeping in mind that what a neural network like this (just like GPT3) is doing is generating the most probable continuation based on the training dataset. Not the best continuation (whatever that means), simply the most likely one. If the training dataset has mostly bad code, the most likely continuation is likely to be bad as well. I think this is still valuable, you just have to think before accepting a suggestion (just like you have to think before writing code from scratch or copying something from Stack Overflow).
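
As a drastically simplified analogy (a character-level bigram counter, nothing like GPT-3's architecture, just to make "most probable continuation, not best continuation" tangible):

    #include <stdio.h>

    int main(void)
    {
        /* Tiny "training set": counts[a][b] = how often character b followed character a. */
        static unsigned counts[256][256];
        const char *training = "int i = 0; int j = 0; int i = 1; ";
        for (size_t k = 0; training[k + 1]; k++)
            counts[(unsigned char)training[k]][(unsigned char)training[k + 1]]++;

        /* Greedily emit the most probable next character, starting from a one-char prompt. */
        unsigned char c = 'i';
        printf("%c", c);
        for (int step = 0; step < 12; step++) {
            unsigned best = 0;
            int next = -1;
            for (int b = 0; b < 256; b++)
                if (counts[c][b] > best) { best = counts[c][b]; next = b; }
            if (next < 0)
                break;
            c = (unsigned char)next;
            printf("%c", c);    /* reproduces whatever pattern dominated the training text */
        }
        printf("\n");
        return 0;
    }

The output is just the dominant pattern from the training string echoed back, good or bad, which is the property being described here (scaled up enormously in GPT-3's case).
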
dematz · 4 years ago
I have no idea how this or GPT3 works or how to evaluate them, but couldn't you argue that it's working as it should? You tell copilot to write a fast inverse square root, it gives you the super famous fast inverse square root. It'd be weird and bad if this didn't happen.

As far as licenses go, idk. Presumably it could delete associated comments and change variable names or otherwise obscure where it's taking code from. Maybe this part is shady.

bee_rider · 4 years ago
In particular, fast approximate inverse square root is an x86 instruction, and not a super new one. I'd be surprised if it wasn't in every major instruction set.
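
For example (a minimal sketch; on SSE-capable x86 the RSQRTSS instruction is exposed through the _mm_rsqrt_ss intrinsic), "just use the hardware" looks like this:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Approximate 1/sqrt(x) using the hardware RSQRTSS instruction (~12 bits of precision). */
    static float rsqrt_sse(float x)
    {
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    }

    int main(void)
    {
        printf("%f\n", rsqrt_sse(4.0f));   /* roughly 0.5 */
        return 0;
    }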

This is an interesting issue. I suspect training on datasets from places like GitHub would be likely to provide lots of "this is a neat idea I saw in a blog post about how they did things in the 90's" code.

abecedarius · 4 years ago
> the most probable continuation based on the training dataset

This is not wrong, but it's easy to misread it as implying little more than a glorified Markov model. If it's like https://www.gwern.net/GPT-3 then it's already significantly cleverer, and so you should expect to sometimes get the kind of less-blatant derivation that companies aim to avoid using a cleanroom process or otherwise forbidding engineers from reading particular sources.

kortex · 4 years ago
Arguably the most famous block of code, of all time. Maybe fizzbuzz but there are so many flavors of it. And InvSqrt is way more memeable.

So I don't know if on this alone it proves Copilot regurgitates too much. I think other signs are more troubling, however, such as its tendency to continue from a prompt vs generate novelty.