So this makes it official... this post[0] and the comments on the announcement[1] concerned about licensing issues were absolutely correct... and this product has the possibility of getting you sued if you use it.
Unfortunately for GitHub, there's no turning back the clocks. Even if they fix this, everyone that uses it has been put on notice that it copies code verbatim and enables copyright infringement.
Worse, there's no way to know if the segment it's writing for you is copyrighted... and no way for you to comply with license requirements.
Nice proof of concept... but who's going to touch this product now? It's a legal ticking time bomb.
[0] https://news.ycombinator.com/item?id=27687450
[1] https://news.ycombinator.com/item?id=27676266
I run product security for a large enterprise, and I've already gotten the ball rolling on prohibiting copilot for all the reasons above.
It's too big a risk. I'd be shocked if GitHub could remedy the negative impressions minted in the last day or so. Even with other compensating controls around open source management, this flies right under the radar with a C-130's worth of adverse consequences.
Do you also block stack overflow and give guidance to never copy code from that website or elsewhere on the Internet? I'm legitimately curious - my org internally officially denounces the copying of stack overflow snippets. Thankfully for my role it's moot as I mostly work with an internal non-public language, for better or worse, and I have no idea how well that's followed elsewhere in the wider company.
Same here. I’ve directed our teams and infra managers that we must be able to block the use of copilot for our firm’s code.
I'd be very surprised if the other large enterprises that I have worked at aren't doing exactly the same thing. Too much legal risk, for practically no benefit.
No one cares about this. People have no clue about licenses and just copy-paste whatever. If someone gets access to their code and sees all the violations, they're screwed anyway.
This is absolutely not true. While some individuals might not care and might not always conform to their companies' policies, most companies have policies, and most employees are aware of and mindful of these policies.
It's absolutely the case that before using certain libraries, most engineers in large corporations will make sure they are allowed to use that library. And if they don't, they are doing their job very badly IMO.
This kind of sucks, honestly: copying and pasting without understanding has led to all sorts of issues in IT. Not to mention legal issues, as mentioned by another reply.
Not only this, but a huge amount of publicly available code is truly terrible and should never really be used as anything other than a point of reference or guidance.
I think that a proper coding assistant should help with not writing code (and I stress that it is "not writing code") -- how to rearrange your code base for new requirements, for example.
Code not written does not have defects, does not need support and, as you point it out, is not a liability.
The practical utility will outweigh the legal concerns. Engineers using this are going to be more productive and this is a competitive advantage that companies won't eschew.
If the legal concerns are well-known, then what you are describing might be viewed as criminal negligence (at worst) or insufficient duty of care (at best). Such engineers should be held fully responsible and accountable for their actions.
That's optimistic. The people who would rely heavily on this sort of thing are going to be the worst at detecting what a "bad autocomplete result" would look like. But even if you are capable of judging that you've got a good one, it still doesn't inform you of the obvious potential licensing issues with any bit of code.
Surely somebody working on this project foresaw this problem…
If they get rid of licensed stuff it should be ok, no? I really want to use this, and it seems inevitable that we'll need it, just as Google Translate needs all of the books + sites + comments it can get a hold of.
Unlicensed code just means “all rights reserved.” You’d need to limit it to permissively licensed code and make sure you comply with their requirements.
Which licenses would it be ok for the training material to be licensed under, though? If it produces verbatim enough copies of e.g. MIT-licensed material, then attribution is required. Similar with many other open source-friendly licenses.
On the other hand, if only permissive licenses that also don't require attribution are used, well, then for a start, the available corpus is much smaller.
Yes: not all code on GitHub is licensed in a way that lets you use it at all. People focus on GPL as if that were the tough case; but, in addition to code (like mine) under AGPL (which you need to not use in a product that exposes similar functionality to end users) there is code that is merely published under "shared source" licenses (so you can look, but not touch) and even literally code that is stolen and leaked from the internals of companies--including Microsoft!... this code often gets taken down later, but it isn't always noticed and either way: it is now part of Copilot :/--that, if you use this mechanism, could end up in your codebase.
If you publish the code anywhere, potentially. You could be (unknowingly) violating the original license if the code was copied verbatim from another source.
How much of a concern this is depends heavily on what the original source was.
> The technical preview includes filters to block offensive words
And somehow their filters missed f*k? That doesn't give a lot of confidence in their ability to filter more nuanced text. Or maybe it only filters truly terrible offensive words like "master".
In my testing of Copilot, the content filters only work on input, not output.
Attempting to generate text from code containing "genocide" just has Copilot refuse to run. But you can still coerce Copilot to return offensive output given certain innocuous prompts.
Ahh, so it's the most pointless interpretation of the phrase "filters to block offensive words", where it is stopping the user from causing offense to the AI rather than the other way around.
Interesting how this continues to be an issue for GPT3 based projects.
A similar thing is happening in AI Dungeon, where certain words and phrases are banned to the point of suspending a user's account if they are used a certain number of times, yet it will happily output them when they are generated by GPT-3 itself, and then punish the user if they fail to remove the offending pieces of text before continuing.
Lol, how does that make any sense? I mean, all these word blacklists are always pretty stupid, but at least you can usually see the motivation behind them. But in this case I'm not even sure what they tried to achieve, this is absolutely pointless.
Actually, the indentation of the first comment and the lack of preprocessor directives show it's not copied from this code directly but from Wikipedia (https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...)
So it could be that the Quake source code is not part of the training set but the Wikipedia version is.
While I strongly doubt they would use Wikipedia as a training set, has anyone done a search of GitHub code to see if other projects have copied-and-pasted that function from Wikipedia into their more-permissive codebases?
That will make a great defense in copyright court.
"your honor, i would like to plead not guilty, on the basis that i just robbed that bank because i saw that everyone was robbing banks in the next city"
...on the other hand, that was the exact defense tried for the capitol rioters. So i don't know anything anymore.
That's bonkers. And the beauty of it is that now someone could realistically file a GDPR erasure request against the neural net. I do hope that they're able to reverse the data out.
Since the information is encoded in model weights, I doubt that erasure is even possible. Only post-retrieval filtering would be an option.
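For what it's worth, post-retrieval filtering is easy to sketch. Here is a minimal, hypothetical example (the blocklist and filterCompletion helper are made up, not Copilot's actual mechanism) showing the difference between screening the prompt and screening the output:

    // Minimal sketch of "post-retrieval filtering": instead of trying to scrub the
    // model weights, screen each generated completion before it reaches the editor.
    // The blocklist is a stand-in for illustration only.
    const BLOCKLIST: RegExp[] = [/\bgenocide\b/i];

    function filterCompletion(suggestion: string): string | null {
      // Reject the whole suggestion if a blocked pattern appears in the output.
      return BLOCKLIST.some((re) => re.test(suggestion)) ? null : suggestion;
    }

    console.log(filterCompletion("return a + b;"));                      // "return a + b;"
    console.log(filterCompletion("let count = 0; // genocide counter")); // null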
It only goes to show that opaque black-box models have no place in the industry. The networks leak information left and right, because it's way too easy to just crawl the web and throw terabytes of unfiltered data at the training process.
I think copilot is solving the wrong problem. A future of programming where we're higher up the abstraction tree is absolutely something I want to see. I am taking advantage of that right now -- I'm a decently good programmer, in the sense that I can write useful, robust, reliable software, but I'm pretty high up the stack, working in languages like Java or even higher up the stack that free me from worrying about the fine details of memory allocation or the particular architecture of the hardware my code is running on.
Copilot is NOT a shift up the abstraction tree. Over the last few years, though, I've realized that the concept of typing is. Typed programming is becoming more popular and prominent beyond just traditional "typed" languages -- see TypeScript in JS land, Sorbet in Ruby, type hinting in Python, etc. This is where I can see the future of programming being realized. An expressive type system lets you encode valid data and even valid logic so that the "building blocks" of your program are now bigger and more abstract and reliable. Declarative "parse don't validate"[1] is where we're eventually headed, IMO.
An AI that can help us to both _create_ new, useful types, and then help us _choose_ the best type, would be super helpful. I believe that's beyond the current abilities of AI, but can imagine that in the future. And that would be amazing, as it would then truly be moving us up the abstraction tree in the same way that, for instance, garbage collection has done.
[1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
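To make the "parse, don't validate" idea above concrete, here is a minimal TypeScript sketch; the Email type and parseEmail helper are illustrative, not from any particular library:

    // "Parse, don't validate": turn raw input into a richer type once, at the
    // boundary, so the rest of the program leans on the type system instead of
    // re-checking strings everywhere.
    type Email = { readonly kind: "email"; readonly value: string };

    function parseEmail(raw: string): Email | null {
      // Deliberately simplistic check; the point is the return type, not the regex.
      return /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(raw)
        ? { kind: "email", value: raw.toLowerCase() }
        : null;
    }

    // Downstream code only accepts Email, so "is this string really an email?"
    // can no longer be asked too late -- the compiler enforces the boundary.
    function sendWelcome(to: Email): void {
      console.log(`sending welcome mail to ${to.value}`);
    }

    const parsed = parseEmail("Ada@Example.org");
    if (parsed) sendWelcome(parsed); // sendWelcome("Ada@Example.org") would not compile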
A taller abstraction tree comes with specialization tradeoffs: the deeper the abstractions, the more one has to understand when the abstractions break or when one chooses to use them in novel ways.
This is something I'm interested in regarding this approach... When it works as intended, it's basically shortening the loop in the dev's brain from idea to code-on-screen without adding an abstraction layer that someone has to understand in the future to interpret the code. The result is lower density, so it might take longer to read... Except what we know about linguistics suggests there's a balance between density and redundancy for interpreting information (i.e. the bottleneck may not be consuming characters, but fitting the consumed data into a usable mental model).
I think the jury's out on whether something like this or the approach of dozens of DSLs and problem-domain-shifting abstractions will ultimately result in either more robust or more quickly-written code.
But on the topic of types, I'm right there with you, and I think a copilot for a dense type forest (i.e. something that sees you writing a {name: string; address: string} struct and says "Do you want to use MailerInfo here?") would be pretty snazzy.
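That kind of hint could plausibly be approximated with plain structural matching before any AI gets involved; a toy sketch (MailerInfo, the known-types table, and the heuristic are all made up for illustration):

    // Hypothetical "type copilot": given a freshly written object shape, look for
    // an existing named type in the project with the same structure and suggest it.
    type Shape = Record<string, string>; // field name -> field type, as plain text

    const knownTypes: Record<string, Shape> = {
      MailerInfo: { name: "string", address: "string" },
      GeoPoint: { lat: "number", lng: "number" },
    };

    function suggestExistingType(draft: Shape): string | null {
      const draftKeys = Object.keys(draft).sort().join(",");
      for (const [typeName, shape] of Object.entries(knownTypes)) {
        const sameKeys = Object.keys(shape).sort().join(",") === draftKeys;
        const sameTypes = Object.keys(shape).every((k) => shape[k] === draft[k]);
        if (sameKeys && sameTypes) return typeName; // structural match found
      }
      return null;
    }

    // The editor sees { name: string; address: string } being written and asks:
    console.log(suggestExistingType({ name: "string", address: "string" })); // "MailerInfo"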
Yeah, but generating tons of stupid verbose code that nobody will be able to read and understand is more fun. Also, your superiors will be sure you are a valuable worker if you write more code.
I may be over-reading, but I think this kind of example not only demonstrates the pragmatic legal issues, but also the fundamental weaknesses of a solely text-oriented approach to suggesting code. It doesn't really seem to have a representation of the problem being solved, or the relationship between things it generates and such a goal. This is not surprising in a tool which claims to work at least a little for almost all languages (i.e. which isn't built around any firm concept of the language's semantics).
I'd be much more excited by (and less unnerved by) a tool which brought program synthesis into our IDEs, with at least a partial description of intended behavior, especially if searching within larger program spaces could be improved with ML. E.g. here's an academic tool from last year which I would love to see productionized.
https://www.youtube.com/watch?v=QF9KtSwtiQQ
I think it’s pretty clear that program synthesis good enough to replace programmers requires AGI.
This solely text based approach is simply “easy” to do, and that’s why we see it. I think it’s cool and results are intriguing but the approach is fundamentally weak and IMO breakthroughs are needed to truly solve the problem of program synthesis.
There are a few decades' worth of work on program synthesis and it works very well. You don't need AGI.
You need either a) a complete specification of the target program in a formal language (other than the target language) or b) an incomplete specification in the form of positive and negative examples of the inputs and outputs of the target program, and maybe some form of extra inductive bias to direct the search for a correct program [edit: the latter setting is more often known as program induction].
In the last few years the biggest, splashiest result in program synthesis was the work behind FlashFill, from Gulwani et al: one-shot program learning, and that's one shot, from a single example, not with a model pretrained on millions of examples. It works with lots of hand-crafted DSLs that try to capture the most common use-cases, a kind of programming common sense that, e.g. tells the synthesiser that if the input is "Mr. John Smith" and the output is "Mr" then if the input is "Ms Jane Brown" the output should be "Ms". It works really, really well but you didn't hear about it because it's not deep learning and so it's not as overhyped.
Copilot tries to circumvent the need for "programming common sense" by combining the spectacular ability of neural nets to interpolate between their training data with billions of examples of code snippets, in order to overcome their also spectacular inability to extrapolate. Can language models learned with neural nets replace the work of hand-crafting DSLs with the work of collecting and labelling petabytes of data? We'll have to wait and see. There are also many approaches that don't rely on hand-crafted DSLs, and also work really, really well (true one-shot learning of recursive programs without an example of the base case and the synthesis terminates) but those generally only work for uncommon programming languages like Prolog or Haskell, so they're not finding their way to your IDE, or your spreadsheet app, any time soon.
But, no, AGI is not needed for program synthesis. What's really needed I think is more visibility of program synthesis research so programmers like yourself don't think it's such an insurmountable problem that it can only be solved by magickal AGI.
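To make the by-example setting concrete: a deliberately tiny sketch of synthesis from a single input/output pair over a toy DSL of string programs. The DSL and brute-force search are invented for illustration; FlashFill's real DSL, ranking, and scale go far beyond this.

    // Toy program-synthesis-by-example: a miniature DSL of string transformations
    // and a brute-force search for a program consistent with one example.
    type Program = { describe: string; run: (s: string) => string };

    // Enumerate a small space of candidate programs.
    function candidates(): Program[] {
      const progs: Program[] = [];
      for (const sep of [" ", ".", ","]) {
        for (let i = 0; i < 3; i++) {
          progs.push({
            describe: `split on "${sep}" and take field ${i}`,
            run: (s) => s.split(sep)[i] ?? "",
          });
          progs.push({
            describe: `take field ${i} of split on "${sep}", strip trailing punctuation`,
            run: (s) => (s.split(sep)[i] ?? "").replace(/[.,]$/, ""),
          });
        }
      }
      return progs;
    }

    // Return the first candidate that reproduces the single example.
    function synthesise(input: string, output: string): Program | undefined {
      return candidates().find((p) => p.run(input) === output);
    }

    const prog = synthesise("Mr. John Smith", "Mr");
    console.log(prog?.describe);             // the program found from the one example
    console.log(prog?.run("Ms Jane Brown")); // "Ms" -- it generalises to the new input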
Parsing intent in a programming context is easier than in other contexts. Also, most of the code is written to be parsed by a machine anyway. So with ASTs and all the other static and maybe even some dynamic checks, it should be possible.
We already have some of it with type detection, IntelliSense, etc.
It is a hard set of problems with no magic solutions like this one, and years of development time are needed. That approach will not happen commercially, only incrementally in the community.
Also the goal doesn't need to be "to replace programmers". As with copilot, the point of a program synthesis tool can be to assist the programmer. The point of the system in the video linked above is partly that interactively using such a system can aid development. My main point is this can be a lot better in combination with approaches from outside the ML community, which may involve much tighter integration to specific languages, as well as some awareness of a goal for the synthesized portion.
To "replace programmers", an organization would need to have a way of specifying to the system a high level program behavior, and to confirm that an output from the system satisfies that high level behavior. I think for specifications of any complexity, producing and checking them would look like programming just of a different sort.
I mean, the cases where it tries to assign copyright to another person in a different year highlight that context other than the other text in the file is semantically extremely important, and not considered by this approach. Merely generating text which looks appropriate to the model given surrounding text is ... misguided?
If you think about it, program synthesis is one of the few problems in which the system can have a perfectly faithful model of the dynamics of the problem domain. It can run any candidate it generates. It can examine the program graph. It can look at what parts of the environment were changed. To leave all that on the table in favor of blurting out text that seems to go with other text is like the toddler who knows that "five" comes after "four", but who cannot yet point to the pile of four candies. You gotta know the referents, not just the symbols. No one wants a half-broken Chinese Room.
>> I don't think it is clear that such "fundamental weaknesses" exist. A text-based approach can get you incredibly far.
Mnyeah, not really that "incredibly". Remember that neural network models are great at interpolation but crap at extrapolation. So as long as you accept that the code generated by Copilot stays inside a well-defined area of the program space, and that no true novelty can be generated that way, then yes, you can get a lot out of it and I think, once the wrinkles are ironed out, Copilot might be a useful, everyday tool in every IDE (or not; we'll have to wait and see).
But if you need to write a new program that doesn't look like anything anyone else has written before, then Copilot will be your passenger.
How often do you need to do that? I don't know. But there are still many open problems in programming that lots of people would really love to be able to solve. Many of them have to do with efficiency, and you can't expect Copilot to know anything about efficiency.
For example, suppose we didn't know of a better sorting algorithm than bubblesort. Copilot would not generate mergesort. Even given examples of divide-and-conquer algorithms, it wouldn't be able to extrapolate to a divide-and-conquer algorithm that gives the same outputs for the same inputs as bubblesort. It wouldn't be able to, because it's trained to reproduce code from examples of code, not to generate programs from examples of their inputs and outputs. Copilot doesn't know anything about programs and their inputs and outputs. It is a language model, not a magickal pink fairy of wish fulfillment, and so it doesn't know anything about things it wasn't trained on.
Again, how often do you need to write truly novel code? In the context of professional software development, I think not that often. So if it turns out to be a good boilerplate generator, Copilot can go a long way. As long as you don't ask it to generate something other than boilerplate.
There are approaches that work very well in the task of generating programs that they've never seen before from examples of their inputs and outputs, and that don't need to be trained on billions of examples. True one-shot learning of programs (without a model pre-trained on billions of examples) is possible with current approaches. But those approaches only work for languages like Prolog and Haskell, so don't expect to see those approaches helping you write code in your IDE anytime soon.
This does make me wonder if this is susceptible to the same form of trolling as that MS AI got. Commit a load of grossly offensive material to multiple repos, and wait for Copilot to start parroting it. I think they're going to need some human moderation.
Way better. It's susceptible to copyright trolling.
Put up repos with snippets for things people might commonly write. Preferably use javascript so you can easily "prove" it. Write a crawler that crawls and parses JS files to search for matching stuff in the AST. Now go full patent troll, eh, i mean copyright troll.
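The AST part of that is not hard to sketch. Assuming the acorn JavaScript parser, a crude shape-only fingerprint might look like the following; the heuristic is purely illustrative and nothing like a serious clone detector:

    import * as acorn from "acorn";

    // Sketch of the "crawl and compare ASTs" idea: reduce JavaScript to a
    // shape-only fingerprint so renamed identifiers and reformatting still match.
    function fingerprint(source: string): string {
      const ast = acorn.parse(source, { ecmaVersion: 2020 });
      const shape: string[] = [];
      const visit = (node: unknown): void => {
        if (Array.isArray(node)) { node.forEach(visit); return; }
        if (node && typeof node === "object") {
          const typed = node as { type?: string };
          if (typed.type) shape.push(typed.type); // record node kind only
          Object.values(node).forEach(visit);     // ignore names and literals
        }
      };
      visit(ast);
      return shape.join(">");
    }

    // Same structure, different identifiers -> same fingerprint, so it gets flagged.
    const a = "function add(x, y) { return x + y; }";
    const b = "function sum(p, q) { return p + q; }";
    console.log(fingerprint(a) === fingerprint(b)); // true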
1) Write a project heavily using Copilot (hell, automate it and write thousands of them, why not?)
2) AGPL all that code.
3) Search for large chunks of code very similar to yours, but written after yours, licensed more liberally than AGPL. Ideally in libraries used by major companies.
4) Point the offenders to your repos and offer a "convenient" paid dual-license to make the offenders' code legal for closed-source use, so they don't have to open source their entire product.
5) Profit?
This was my first thought when reading about Copilot...it feels almost certain that someone will try poisoning the training data.
Hard to say how straightforward it'd be to get it to produce consistently vulnerable suggestions that make it into production code, but I imagine an attacker with some resources could fork a ton of popular projects and introduce subtle bugs. The sentiment analysis example on the Copilot landing page jumped out to me...it suggested a web API and wrote the code to send your text there. Step one towards exfiltrating secrets!
Never mind the potential for plain old spam: won't it be fun when growth hackers have figured out how to game the system and Copilot is constantly suggesting using their crappy, expensive APIs for simple things!? Given the state of Google results these days, this feels like an inevitability.
2- …
3- profit
Given that code is easier to write than it is to read, this one is troubling.
I certainly wouldn't want to be using this with languages like PHP (or even C for that matter) with all the decades of problematic code examples out there for the AI to learn from.
This is a very famous function [0] and likely appears multiple times in the training set (Google gives 40 hits for GitHub), which makes it more likely to be memorized by the network.
[0] https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...
It's worth keeping in mind that what a neural network like this (just like GPT3) is doing is generating the most probable continuation based on the training dataset. Not the best continuation (whatever that means), simply the most likely one. If the training dataset has mostly bad code, the most likely continuation is likely to be bad as well. I think this is still valuable, you just have to think before accepting a suggestion (just like you have to think before writing code from scratch or copying something from Stack Overflow).
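As a deliberately crude illustration of "most probable continuation" (just token counts over a toy corpus; the real model is a large transformer, not a lookup table, as a reply further down notes):

    // Toy "most probable continuation": count which token follows each token in a
    // tiny corpus and always emit the most frequent follower. If the corpus is
    // dominated by one (possibly bad, possibly copyrighted) pattern, so is the output.
    const corpus = "i = 0x5f3759df - ( i >> 1 ) ; i = 0x5f3759df - ( i >> 1 ) ;";
    const tokens = corpus.split(" ");

    const counts = new Map<string, Map<string, number>>();
    for (let i = 0; i + 1 < tokens.length; i++) {
      const next = counts.get(tokens[i]) ?? new Map<string, number>();
      next.set(tokens[i + 1], (next.get(tokens[i + 1]) ?? 0) + 1);
      counts.set(tokens[i], next);
    }

    function mostLikelyNext(token: string): string | undefined {
      const followers = counts.get(token);
      if (!followers) return undefined;
      return [...followers.entries()].sort((a, b) => b[1] - a[1])[0][0];
    }

    console.log(mostLikelyNext("0x5f3759df")); // "-" : the memorised pattern dominates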
I have no idea how this or GPT3 works or how to evaluate them, but couldn't you argue that it's working as it should? You tell copilot to write a fast inverse square root, it gives you the super famous fast inverse square root. It'd be weird and bad if this didn't happen.
As far as licenses go, idk. Presumably it could delete associated comments and change variable names or otherwise obscure where it's taking code from. Maybe this part is shady.
In particular, fast approximate inverse square root is an x86 instruction (rsqrtss), and not a super new one. I'd be surprised if it wasn't in every major instruction set.
This is an interesting issue. I suspect training on datasets from places like GitHub would be likely to provide lots of "this is a neat idea I saw in a blog post about how they did things in the 90s" code.
> the most probable continuation based on the training dataset
This is not wrong, but it's easy to misread it as implying little more than a glorified Markov model. If it's like https://www.gwern.net/GPT-3 then it's already significantly cleverer, and so you should expect to sometimes get the kind of less-blatant derivation that companies aim to avoid using a cleanroom process or otherwise forbidding engineers from reading particular sources.
Arguably the most famous block of code of all time. Maybe FizzBuzz, but there are so many flavors of it. And InvSqrt is way more memeable.
So I don't know if on this alone it proves Copilot regurgitates too much. I think other signs are more troubling, however, such as its tendency to continue from a prompt vs generate novelty.
https://github.com/id-Software/Quake-III-Arena/blob/master/c...
Copilot repeats it almost word for word, including comments, and adds an MIT-like license at the top.
https://github.com/search?q=%22i++%3D+%2A+%28+long+%2A+%29+%...
"your honor, i would like to plead not guilty, on the basis that i just robbed that bank because i saw that everyone was robbing banks on the next city"
...on the other hand, that was the exact defense tried for the capitol rioters. So i don't know anything anymore.
So, the tool is worthless if you want to use it legally.
https://twitter.com/kylpeacock/status/1410749018183933952