I think the biggest problem copilot will have in practice gaining traction is that verifying correctness isn’t any faster than writing the code yourself in many cases. The Easter(y) function is a classic example - it would be way faster to write that than to try to verify that there are no subtle bugs.
Copilot is by design trying to give you something that _looks_ correct without caring whether it actually is - so it optimises for real-looking but subtly buggy code, which is the worst kind of broken code.
Years ago, I ran into a similar problem working on a program that was doing named entity recognition to assist humans with data entry. We found that, for our purposes, there seemed to be no (realistic) accuracy threshold beyond which the tool would save clients money, because double-checking the machine-generated output was inherently more work than doing it by hand.
So we pivoted the product to being something you would run on full auto, for situations where you didn't need a high level of quality. I'm not sure if that option is available to programmers, though.
Maybe Copilot could be turned into a context-aware search engine? That is, invoking it would return a list of examples that it thinks do the same thing as what you're trying to do, based on your work-in-progress code.
> I think the biggest problem copilot will have in practice gaining traction is that verifying correctness isn’t any faster than writing the code yourself in many cases.
Humorously, this is a similar problem to the one autonomous driving has. Being alert when something goes wrong randomly is more difficult than being alert all of the time.
However in the real world people don't always write bugless code and aren't always alert when driving. Therefore these AI assistants can still have a net positive result as long as they are better than the average performance of a human. Of course three quarters of us probably believe that "I'm not an average programmer so Copilot would only make me worse."
Personally I think the more interesting angle is the trolley problem this creates. People will die in self-driving car accidents and bugs will exist in AI generated code. Those people and bugs are different than the people who will die in human caused accidents and the bugs in human written code. If the number and severity of the results are lessened by the computer, are we willing to forgive the damage directly caused by the AI that falls short of perfection?
I've been using it and that's 100% correct. If it suggests more than a few lines, I might as well do it myself. However, it has been an awesome Intellisense tool for one-liners. It can write out the rest of a comment or a simple map/filter method just fine. I don't think it will ever go further than that, but nor does it need to.
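To make the point concrete, here's the sort of one-liner territory where this works well - a hypothetical example of my own, not actual Copilot output:

```python
# Given only the docstring/comment context, a completion engine can reliably
# produce this kind of map/filter one-liner - short enough to verify at a glance.

def active_usernames(users):
    """Return the names of active users, lowercased."""
    return [u["name"].lower() for u in users if u["active"]]

users = [
    {"name": "Alice", "active": True},
    {"name": "Bob", "active": False},
    {"name": "Carol", "active": True},
]
print(active_usernames(users))  # -> ['alice', 'carol']
```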
Be sure it doesn't generate any GPL licensed code down to the size of a single letter, to be safe. The reaction to GPL inspired snippet generation has been more fierce than I could have imagined, even though usable snippets are so short.
Yea, that's the major problem. I'd prefer just some sort of inline helper that could point directly to documentation, topics, or Stack Overflow answers that might be helpful for whatever I'm developing. An enhanced "Intellisense" or something. That, to me, is better, because ultimately it's up to the developer to place scrutiny on the solution. You basically can't blindly accept the implementation, which makes this just a constant code review of yourself? I dunno. This just seems half-baked.
Agree. It’s way easier to write a function from scratch than to read/evaluate/fix whatever snippet Copilot throws at me. Replace Copilot with “junior engineer” or “senior engineer who knows more than me” and the result is the same (the junior engineer will probably introduce a couple of subtle bugs that are hard to find; the senior engineer would write code in such a way that my mediocre brain won’t understand it).
It looks to you like it should work, but it doesn't, and you can't figure out why.
That's not "mostly working," that's a frustrating waste of time. It's hard enough to notice when you accidentally swap `i` and `j` -- why would you want to make your life even more miserable by spending your time finding all of the instances where a pattern matching robot has done something similar in an unfamiliar block?
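For the record, here's a minimal sketch (my own construction, not Copilot output) of how a single swapped index survives compilation and casual review:

```python
def upper_triangle_sum(m):
    """Sum the entries strictly above the main diagonal of a square matrix."""
    total = 0
    n = len(m)
    for i in range(n):
        for j in range(i + 1, n):
            # Writing m[j][i] here instead of m[i][j] still runs fine and
            # still passes any test that uses a symmetric matrix - it just
            # silently sums the *lower* triangle instead.
            total += m[i][j]
    return total

print(upper_triangle_sum([[1, 2, 3],
                          [4, 5, 6],
                          [7, 8, 9]]))  # -> 2 + 3 + 6 = 11
```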
And if you do happen to get "mostly working" code, but only want it to stay together long enough for you to fundraise, you're basically stating that you plan on foisting this technical debt onto the poor sod you happen to hire.
Attitudes like yours are the reason this dogpile scares me.
I feel like a mix of hand written test cases and copilot generated code might go somewhere, but I think you've got the basic problem sorted out. I'd much rather type an algorithm in from scratch than wrap my head around whatever copilot spits out.
I had an idea long ago that you basically write unit tests (nowadays I would add property-based tests to the mix too) and a genetic algorithm (best I could come up with at the time, nowadays we obviously have much fancier techniques, as evidenced by Copilot) would come up with code to try and make the tests pass.
I could see Copilot used in such a way. I think the interaction would have to change, though: force the user to give it the tests as input, rather than the current flow of giving it some basic instruction, having it generate code, and then trying to write tests after. The tests should be the spec that Copilot uses to generate its output.
Right now, I'm not excited about Copilot. Like you say, understanding what Copilot spits out is difficult and I suspect more error prone than just writing it yourself (since we often see what we want to see and can overlook even glaring mistakes). I'm also not excited about them ignoring the licenses of the code they trained on. But I can imagine a future iteration that generated code to pass some tests that I could get excited about.
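As a toy illustration of the tests-as-spec idea, here's a brute-force search over a tiny candidate space standing in for the genetic algorithm / Copilot; the tests are the only specification the machine sees:

```python
import operator

# The spec is the tests, nothing else.
tests = [((2, 3), 6), ((4, 5), 20), ((0, 7), 0)]

# A toy candidate space standing in for the genetic algorithm's search.
candidates = {
    "a + b": operator.add,
    "a - b": operator.sub,
    "a * b": operator.mul,
    "max(a, b)": max,
}

found = [name for name, f in candidates.items()
         if all(f(*args) == want for args, want in tests)]
print(found)  # -> ['a * b']
```

A real system searches a vastly richer space, of course, but the workflow is the same: the spec is executable, and the generated code is accepted only if it passes.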
I can foresee one niche where this doesn't matter: exploratory ad-hoc data science.
In this exploration stage, total correctness doesn't matter since you're just getting a feel for the data. Copilot might help a lot with the associated boilerplate.
I think Copilot-like tools could be excellent for the exploration phase. Marvin Minsky mused on this usage back in 1967:
> The programmer does not even have to be exact in his own ideas‑he may have a range of acceptable computer answers in mind and may be content if the computer's answers do not step out of this range. The programmer does not have to fixate the computer with particular processes. In a range of uncertainty he may ask the computer to generate new procedures, or he may recommend rules of selection and give the computer advice about which choices to make. Thus, computers do not have to be programmed with extremely clear and precise formulations of what is to be executed, or how to do it.
From: https://web.media.mit.edu/~minsky/papers/Why%20programming%2...
I can't wait for coders who used Copilot for every coding project they did: copying and pasting snippets until it works. No proofs or real exams at bootcamps!
Having offline coding interviews to find Software Engineers will become even more important.
It also isn't giving you any information on the source(s) of the generated code. Which might help determine how much to trust it, whether it could have licensing issues, etc.
It's probably the only way it would work - to show top matching snippets from training data on request, with links to the source and ideally licensing information, if it can be gleaned automatically. This would also clearly show how much it is copying verbatim and what exactly is its contribution.
The funny part will be when all the human programmers who steal code get doxxed as a side effect. It will shine a light on lots of skeletons in the closet.
I think my best guess is that this is actually meant to produce broken code, so that Microsoft can sell you additional services (cloud fuzzing?) to find and fix the bugs.
See, when I saw the words 'risk assessment' I figured the presuppositional framework of the author's argument wasn't that Copilot was legally sound. In other words, I didn't expect to jump straight to the technical validity of the product.
Do not ignore the elephant in the room: Copilot is stealing code from projects with open licenses.
I would expect an information security expert to comment on the risks they have a professional background in assessing. More broadly, your comment almost seems to suggest that one should preclude all avenues of criticism beyond whichever singular issue is most "obvious" / "problematic". That strikes me as less than optimal.
Producing code that kinda-mostly works, very quickly, is the behaviour the software industry optimises for. This tool will help do more of that, so it will be very widely adopted.
Developers who do not use this (or similar tools) will not be hired, or only in particular niche domains where correctness matters.
I'm surprised that so much of the discussion around Copilot has centered around licensing rather than this.
You're basically asking a robot that stayed up all night reading a billion lines of questionable source code to go on a massive LSD trip and then use the resulting fever dream to fill in your for loops.
Coming from a hardware background where you often spend 2-8x of your time and money on verification vs. on the actual design, it seems obvious to me that Copilot as implemented today will either not provide any value (best case), will be a net negative (middling case), or will be a net negative, but you won't realize that you've surrounded yourself with a minefield for a few years (worst case).
Having an "autocomplete" that can suggest more lines of code isn't better, it's worse. You still have to read the result, figure out what it's doing, and figure out why it will or will not work. Figuring out that it won't work could be relatively straightforward, as it is today with normal "here's a list of methods" autocomplete. Or it could be spectacularly difficult, as it would be when Copilot decides to regurgitate "fast inverse square root" but with different constants. Do you really think you're going to be able to decipher and debug code like that repeatedly when you're tired? When it's a subtly broken block of code rather than a famous example?
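For reference, this is the routine in question, transliterated to Python; note that if a generator emitted it with one digit of the magic constant altered, nothing about the code's shape would tip you off:

```python
import struct

def fast_inv_sqrt(x):
    """Quake III's fast inverse square root, bit-for-bit in Python."""
    i = struct.unpack("<I", struct.pack("<f", x))[0]  # reinterpret float32 as uint32
    i = 0x5F3759DF - (i >> 1)                         # the infamous magic constant
    y = struct.unpack("<f", struct.pack("<I", i))[0]  # reinterpret back to float32
    return y * (1.5 - 0.5 * x * y * y)                # one Newton-Raphson step

print(fast_inv_sqrt(4.0))  # ≈ 0.5, off by well under 1%
```

Swap one hex digit of `0x5F3759DF` and this still runs, still returns plausible-looking numbers, and only careful testing reveals the lost precision.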
That Easter example looks horrific, but I can absolutely see a tired developer saying "fuck it" and committing it at the end of the day, fully intending to check it later, and then either forgetting or hoping that it won't be a problem rather than ruining the next morning by attempting to look at it again.
I can't imagine ever using it, but I worry about new grads and junior developers thinking that they need to use crap like this because some thought leader praises it as the newest best practice. We already have too much modern development methodology bullshit that takes endless effort to stomp out, but this has the potential to be exceptionally disastrous.
I can't help but think that the product itself must be a PSYOP-like attempt to gaslight the entire industry. It seems so obvious to me that people are going to commit more broken code via Copilot than ever before.
IMHO they built the opposite of what's actually useful for real-world use. Copilot should have been trained to describe what a selected block of code does, not write a block of code from a description. It could be extremely useful when looking at new or under-documented codebases to have an AI that gives you a rough hint as to what some code might be doing. For example if you select some heinous spaghetti code function, press a button, and get a prompt back that says "This code looks like it's parsing HTML using regex (74.2% confidence)" it could be much easier for folks to be productive on big codebases.
I'm not sure I understand how you envision this working, given the underlying technology. You'd have to have a pretty large cache of such analyses to train on, right?
This is the thing that made no sense to me about it as a premise. Doing correct program synthesis is really hard even when you have really opinionated and well-defined models of the domain (e.g. the Termite project for generating Linux device drivers). The domain model for Copilot is somewhere between non-existent to so open-ended (i.e. all the diverse code on Github, et al.) as to be functionally non-existent.
A bare-minimum baseline validation check for Copilot would be to see whether it ever hands you code that won't compile in context. If it does, then it's not even taking into account the well-specified domain model of your chosen programming language's semantics. And even satisfying that check is still miles away from taking into account the domain of the actual problem you're using software to solve.
The only place where the approach taken, as-is, makes sense to me is for truly rote boilerplate code. However, that raises the question: how is this machine learning approach more effective than a targeted heuristic approach already taken by existing IDE tooling, etc.?
FWIW, I don't think any of this is lost on GitHub. I think Copilot is more likely a tremendously marketable half-step and small piece of a larger longer-term strategy unfolding at Microsoft/GitHub to leverage an incredible asset they're holding, i.e., practically everybody's source code. The combination of detailed changelogs, CI results (e.g. GitHub Actions), Copilot, and a couple other key pieces makes for a pretty incredible basis for reinforcement learning to multiple ends.
I would hire copilot to write tests for me, that’s about it. Writing tests can be a drag. It’s really a low-risk proposition to have generated code attempt it. If it’s a usable test, maybe it will catch a bug. If not, then kill it and let it generate a few more.
The expectation is entirely different than producing code. Code needs to be correct, secure, performant, and readable. Failure on any of those fronts can be expensive to disastrous. Nobody can reasonably expect a test suite to catch every bug, even if created by the smartest humans. If a copilot-created test does prevent a bug from shipping it provides immediate value. I could see it coming up with some whacky-but-useful test cases that a sane person might not consider. From a training perspective I would think that assertion descriptions contain more consistent lexical value than the average function signature.
It seems like the ambitious data scientists, product marketers, and managers fell in love with a revolutionary idea about AI writing code, and neglected to consult the engineers they are trying to ‘augment’.
Nope. That's like saying, "I might let a machine write the docs for me."
Good tests are documentation that a computer can verify. Because they explain the meaning of parts of the system, they contain information not available in the code. If you try using ML for test generation, you'll have the same problem you do with GPT-3 prose: it might look plausible at first glance, but lacks coherent meaning.
You'd also end up with one of the problems common in big test suites: poorly factored tests that end up being the sort of expressive duplication that is a giant drag on improving existing code. ML is nowhere near advanced enough to say, "Gosh, we're doing the same sort of test setup a bunch; let's extract that into a fixture, and then let's unify some fixtures into an ObjectMother."
For people looking to get the computer to do the work of catching more things with less burdensome test writing, I suggest taking a look at things like Hypothesis: https://hypothesis.readthedocs.io/en/latest/
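For those unfamiliar: property-based tools generate many random inputs and check an invariant instead of hand-picked cases. Here is a minimal hand-rolled sketch of the idea (Hypothesis itself adds input shrinking, edge-case heuristics, and far more):

```python
import random

def run_length_encode(s):
    """Encode a string as (char, count) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def run_length_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

# The property: decoding an encoding returns the original, for *any* input.
random.seed(0)
for _ in range(200):
    s = "".join(random.choice("ab") for _ in range(random.randrange(20)))
    assert run_length_decode(run_length_encode(s)) == s
print("round-trip property held on 200 random inputs")
```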
> If you try using ML for test generation, you'll have the same problem you do with GPT-3 prose: it might look plausible at first glance, but lacks coherent meaning.
There is a company in this space of generating "plausible tests" for legacy code bases at very large enterprises (think Goldman Sachs, telcos etc) called Diffblue [0].
They raised funding back in 2017 [1] and it seems their biggest value-add is in creating unit tests for legacy Java code bases that often have little to no unit tests.
Essentially, these AI-generated unit tests help a team "document" all the known behaviors of a legacy code base, such that when a change is introduced that violates the behaviors covered by the generated unit tests, the tool can alert the team to the potential presence of a regression.
Anyway, they offer a fairly basic browser-based demo of their AI product called Diffblue Cover [2].
> one of the problems common in big test suites: poorly factored tests that end up being the sort of expressive duplication that is a giant drag on improving existing code.
I feel like you just described every developer/codebase where mock testing is stupidly enforced. Where every single unit test mocks every single indirect object. 98% of the testing code is just exhaustive setup and teardown of objects not being tested by each test, and then a bunch of conditional checks to ensure that every deeper/indirect method is being called exactly the right number of times with exactly the right arguments and returning exactly the right value. Almost all of the test code is just hacking mock objects. The actual purpose of each test is buried so deep that it's impossible to even understand the business logic being applied.
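A caricature of the contrast, using names of my own invention: the mock-heavy style asserts call choreography, so any refactor breaks it even when the observable result is unchanged:

```python
from unittest import mock

def total_price(items, tax_fn):
    """Sum item prices after applying a tax function to each."""
    return sum(tax_fn(p) for p in items)

def add_tax(p):
    return round(p * 1.2, 2)

# Mock-heavy style: verifies call choreography, not behaviour.
tax = mock.Mock(side_effect=[12.0, 24.0])
assert total_price([10, 20], tax) == 36.0
assert tax.call_count == 2        # breaks if calls are ever batched or cached
tax.assert_any_call(10)           # breaks if the argument shape changes

# Behavioural style: verifies the observable result with the real collaborator.
assert total_price([10, 20], add_tax) == 36.0
```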
It would be a mistake to say that the output from GPT3 lacks coherent meaning. It's not that the output is gibberish, it's that it's too easy to mistake it for a human's work. This means that it's easy to mistake it for something that was created with understanding and intention, when in fact the author was nothing more than a random number generator. The same risk exists for copilot. [--GPT3]
They didn't say they were going to have Copilot write _all_ the tests. Writing tests for cases you can think of and trying Copilot for the extras doesn't seem like that bad of an idea.
It would be awful to write every test using Copilot, but there is potential there for a certain kind of test. If I'm writing an API, I want fresh eyes on it, not just tests written by the person who understands it most (me). For example, a fresh user might try to apply a common pattern that my API breaks. Copilot might be able to act like such a tester. By writing generic tests, it could mimic developers who haven't understood the API before they start using it (most of them).
My time is mostly spent not on writing code but on thinking about what the program has to do, testing (including writing unit tests), and understanding errors. I always tell people, half joking, that programming is not about writing code but about the ability to debug it. Understanding requirements, errors, and bugs is hard; writing code and fixing bugs is relatively easy, in general.
Maybe Copilot 2 will do exactly this: generate tests based on half-working code, run them, and suggest improvements. That would increase productivity by something like ~100%, but to me this sounds too good to be true.
> Writing tests can be a drag. It’s really a low-risk proposition to have generated code attempt it.
If Copilot can't write the correct code in the first place, you really shouldn't expect a proper test to be written by Copilot.
> Code needs to be correct, secure, performant, and readable.
Most tests should also have at least three of those attributes. Nobody actually wants their tests to be incorrect, slow, or impossible to understand or modify.
On the contrary, it might be interesting writing tests by hand and using the AI to produce code. If the tests are good enough for humans, they should be good enough for AI, given that the AI doesn't try to be actively malicious.
I think the use case for Copilot is a bit misunderstood. The way I see it you have two types of code:
1. Smart Code: Code that you honestly have to think about while you're writing. You write this code slowly and carefully, to make sure that it does what you need it to.
2. Dumb Code: This is trivial code, like adding a button to a screen. This is code you really don't have to think about, because you already know exactly how to implement it. At this point your biggest obstacle is how fast can your fingers type on a keyboard.
For me Github Copilot is useless for "Smart Code" but a godsend when writing "Dumb Code". I want to focus more on writing and figuring out the "Smart Code", if I need to throw a form together in HTML or make a trivial helper function, I will gladly let AI take over and do that work for me.
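A hypothetical sketch of what I'd file under "Dumb Code" - the shape is fully determined by the intent, so any competent completion is verifiable at a glance:

```python
def button(label, action):
    """Render a trivial HTML button - nothing to think about, just typing."""
    return f'<button type="button" onclick="{action}">{label}</button>'

def chunk(seq, size):
    """Split a list into sublists of at most `size` items - classic dumb code."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

print(button("Save", "save()"))
print(chunk([1, 2, 3, 4, 5], 2))  # -> [[1, 2], [3, 4], [5]]
```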
> This is trivial code, like adding a button to a screen.
UX is probably the most important aspect of most software products. Every software product is either "smart code" or "smart ux". No one pays much for "dumb code with bad UX" except in dysfunctional markets.
Adding a button to a screen should be trivial, and if it's not you need better tools. (As in "a not-horribly-misdesigned language and framework", not as in "giant transformer".)
Deciding where to add the button, its shape, its size, what happens when it's clicked, the text on the button, ... is anything but trivial.
But everyone does pay for dumb code. Even in the best-written and most efficient codebases, there's still going to be some amount of tedious glue code and boilerplate that you have to write in order to create a functioning product. It definitely would be better to have better languages and frameworks instead of a giant transformer, but the better languages and frameworks don't exist yet while the giant transformer does.
This is basically the same argument that was made against required boilerplate in Java: “Your IDE can just generate that for you!” (And in sufficiently advanced cases, also keep it up to date.)
Imho, it is just an argument for making better languages and libraries. (These libraries will also make it easier to use with copilot.)
Exactly. The reason we aren't all sitting around hand-writing assembler is that programmers look at tedious processes and find higher-level abstractions that allow us to do more work in less time.
Once we spot a tedious common pattern, we should be finding ways to DRY it up. Configs, libraries, frameworks, DSL, tools, and languages are all great ways to do that. Copy-pasting and machine-generating code are short-term thinking in two ways: they focus on the initial creation of the code at the expense of maintenance, and they give up on increasing abstraction, lock the system into a productivity plateau.
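To make the abstraction move concrete (a generic illustration of my own): instead of letting a generator stamp out the same guard clause N times, lift it into a helper once:

```python
# Before: the guard clause a generator happily stamps out once per field.
#
#     def set_name(obj, v):
#         if v is None:
#             raise ValueError("name must not be None")
#         obj["name"] = v
#     ...repeated again for email, phone, address, ...

# After: one abstraction, nothing generated, nothing to keep in sync.
def make_setter(field):
    def setter(obj, v):
        if v is None:
            raise ValueError(f"{field} must not be None")
        obj[field] = v
    return setter

set_name = make_setter("name")
set_email = make_setter("email")

record = {}
set_name(record, "Ada")
set_email(record, "ada@example.com")
print(record)  # -> {'name': 'Ada', 'email': 'ada@example.com'}
```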
You still have to describe to co-pilot what you want. So that doesn't make much sense. You should work on a higher level of abstraction then. If you aren't, why not spend a few minutes writing some functions instead of generating tons of unmaintainable boilerplate with co-pilot?
The code for 'button type' code is trivial. Most of what we used to call wizards handled that bit.
It is what the action of that button is where the real fun comes in.
I once had a project that was a yes/no dialog. Two buttons and some text. I had the dialog up and running in under an hour. The action that happened when you pressed yes took 3 months to finish.
I don't know - we somehow managed to replace point-and-click GUIs for placing buttons (Windows Forms etc.) with frontend developers writing elaborate code to achieve the same result in HTML/CSS. Productivity is far from the first priority for frontend development.
Or you're describing how you could put copilot in a box and make a really good low code gui programming solution where the complex stuff is good old complex code.
The problem is you still have to go back and read through the "dumb code" to make sure it was written correctly. At a certain point, is that actually faster than just writing it yourself? Maybe a little bit, for some people and for some usecases, but it becomes a much narrower value-proposition.
Personally, I'd rather use snippets or some form of "dumb" code generation over an AI to generate the "dumb code". Sure, I'll probably still have to do some typing using those methods, but it's still less than if I were doing it all by hand.
It's not clear to me how that's better than the traditional solution to generate "Dumb Code", copy-pasting something. And we all know the problems with copy-pasting as a lifestyle.
Yes! Why is everyone so negative about copilot? I think it's a great name for the product. It helps you write, it doesn't write for you. You're still in charge and it can't write the "smart code".
Generally, a copilot is someone you can trust. The whole point of having a copilot is to reduce my cognitive load. If I am a pilot and have my copilot fly the plane while I do something else, I may be in charge, but I trust him to fly safely and alert me if things go wrong. A copilot is also a licensed pilot, able to do almost everything the pilot does, he is just not in charge.
The article shows that I can't trust GitHub copilot. So I don't think it is a representative name. Here, it would be more like a servant.
> These three example pieces of flawed code did not require any cajoling; Copilot was happy to write them from straightforward requests for functional code. The inevitable conclusion is that Copilot can and will write security vulnerabilities on a regular basis, especially in memory-unsafe languages.
If people can copy-paste the most insecure code from Stack Overflow or random tutorials, they will absolutely use Copilot to "write" code, and it will become the default, especially since it's so incredibly easy to use. Also, it's just the first-generation tool of its kind; imagine what similar products will accomplish in 20 years.
With the pace of technological innovation, I'm honestly not sure what a similar product will be able to accomplish in 20 years. It'll be crazy for sure. But I'm worried about today.
This is a product by a well-known company (GitHub) which is owned by an even more well-known company (Microsoft). GitHub is going to be trusted a lot more than a random poster on Stack Overflow or someone's blog online. And GitHub is explicitly telling new coders to use Copilot to learn a new language:
> Whether you’re working in a new language or framework, or just learning to code, GitHub Copilot can help you find your way. Tackle a bug, or learn how to use a new framework without spending most of your time spelunking through the docs or searching the web.
This is what differentiates Copilot from Stack Overflow or random tutorials. GitHub has a brand that's trusted more than random content creators on the internet. And it's telling new coders to use Copilot to learn things and not check elsewhere.
That's a problem. Doesn't matter what generation of the program it is. It creates unsafe code after using its brand reputation and recognition to convince new coders to not check elsewhere.
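A concrete instance of the failure mode (my own illustration, not actual Copilot output): the string-formatted SQL that litters old tutorials, next to the parameterized form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def find_user_unsafe(name):
    # Injectable: the string-formatting pattern from a thousand old tutorials.
    return conn.execute(f"SELECT name FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name):
    # Parameterized: the driver treats the value as data, never as SQL.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print(find_user_unsafe(payload))  # -> [('alice',)] - the injection matched every row
print(find_user_safe(payload))    # -> [] - the payload is just a weird username
```

Both versions look equally plausible to a new coder; only one of them is a vulnerability, and nothing in the suggestion UI tells you which you got.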
Consider Google Translate, right? Google is a well-known brand that is trusted (outside of a relatively small group of people that doesn't trust Google on principle). Yet every professional translator knows that the text produced by Google Translate is a result of machine translation, Google or no Google. They may marvel at the occasional accuracy, yet expect serious blunders in the text, and would therefore not just trust that translation before submitting it to their clients. They will check. Or at least they should.
Maybe you are right, but where UML created busy work, Copilot will literally do your work for you. I can even imagine a future where management makes it policy to try Copilot first to save time and money.
I think there is some difference. You don't come across some piece of code by chance; you were actively looking for it. There were probably multiple blogs and SO entries with the needed information, and one of those sources had to be chosen. You know that this is some random blog post, or an SO answer given by someone fresh.
Copilot is something different. Code is suggested automatically and, most importantly, suggested by an authority - hey, this is GitHub, huge project, largest code repo on the planet, owned by Microsoft, one of the most successful companies ever. Why would you not trust the code they are suggesting?
And that's just for starters, before malicious parties start creating intentionally broken code only to hack systems built with it, greedy lawyers chase some innocent code snippet demanding payment for its use, etc.
I had not considered the proliferation of terrible open-source code on GitHub. I'd wager that the amount of code in public repositories from students learning to code may outweigh quality code in GitHub.
I wonder if there was any sort of filter for Copilot's input — only repositories with more than a certain number of stars/forks, only repositories committed to recently etc.
> Ultimately, a human being must take responsibility for every line of code that is committed. AI should not be used for "responsibility washing."
That's the whole point, and the rest is moot because of it. If I choose to let Copilot write code for me, I am responsible for its output, full stop. This is the same as if I let a more junior engineer submit code to prod, but there aren't blog posts about not letting them work or not trusting them with code.
Copilot doesn't seem any better than TabNine. TabNine is GPT-2 based, works offline, and can produce high-quality boilerplate code based on the previous lines. It can also generate whole methods, which, when they work, seems mind-blowing, but they are not always correct. Most suggestions seem mind-blowing anyway, because we never previously had this kind of code completion.
It feels like it wrote the whole line you were going to write, exactly as it should have. But that's all it does. And it seems like Copilot is the same, but at a much larger scale and online.
Have the machine notify you when it thinks you’ve made a mistake.
Oh this is going to make teaching intro to computer science sooooo 'interesting'.
It wouldn't be so bad if the students looked at the generated code and understood it, but experience tells me most of them will not.
Of course it will not write complete functions correctly.
It looks to you like it should work, but it doesn't, and you can't figure out why.
That's not "mostly working," that's a frustrating waste of time. It's hard enough to notice when you accidentally swap `i` and `j` -- why would you want to make your life even more miserable by spending your time finding all of the instances where a pattern matching robot has done something similar in an unfamiliar block?
And if you do happen to get "mostly working" code, but only want it to stay together long enough for you to fundraise, you're basically stating that you plan on foisting this technical debt onto the poor sod you happen to hire.
Attitudes like yours are the reason this dogpile scares me.
I could see Copilot used in such a way. I think the interaction would have to change though: force the user to give it the tests as input, not give it some basic instruction, have it generate code, and then I try to write tests after. The tests should be the spec that Copilot uses to generate its output.
Right now, I'm not excited about Copilot. Like you say, understanding what Copilot spits out is difficult and I suspect more error prone than just writing it yourself (since we often see what we want to see and can overlook even glaring mistakes). I'm also not excited about them ignoring the licenses of the code they trained on. But I can imagine a future iteration that generated code to pass some tests that I could get excited about.
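A sketch of what that interaction might look like, with a hypothetical `slugify` as the target: the human writes the assertions up front, and the implementation (a stand-in here, since no such tool exists yet) is accepted only because it satisfies them.

```python
import re

# Human-authored spec: the assertions the generated code must satisfy.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  multiple   spaces  ") == "multiple-spaces"
    assert slugify("already-a-slug") == "already-a-slug"

# Stand-in for the generated implementation; it is "accepted" only
# because the spec above passes, not because anyone eyeballed it.
def slugify(text):
    text = re.sub(r"[^a-z0-9]+", "-", text.strip().lower())
    return text.strip("-")

test_slugify()
```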
In this exploration stage, total correctness doesn't matter since you're just getting a feel for the data. Copilot might help a lot with the associated boilerplate.
> The programmer does not even have to be exact in his own ideas‑he may have a range of acceptable computer answers in mind and may be content if the computer's answers do not step out of this range. The programmer does not have to fixate the computer with particular processes. In a range of uncertainty he may ask the computer to generate new procedures, or he may recommend rules of selection and give the computer advice about which choices to make. Thus, computers do not have to be programmed with extremely clear and precise formulations of what is to be executed, or how to do it.
From: https://web.media.mit.edu/~minsky/papers/Why%20programming%2...
Having offline coding interviews to find Software Engineers will become even more important.
The funny part will be when all the human programmers who steal code get doxed as a side effect. It would shine a light on lots of skeletons in the closet.
Do not ignore the elephant in the room: Copilot is stealing code from projects with open licenses.
Developers who do not use this (or similar tools) will not be hired, except in particular niche domains where correctness matters.
You're basically asking a robot that stayed up all night reading a billion lines of questionable source code to go on a massive LSD trip and then use the resulting fever dream to fill in your for loops.
Coming from a hardware background where you often spend 2-8x of your time and money on verification vs. on the actual design, it seems obvious to me that Copilot as implemented today will either not provide any value (best case), will be a net negative (middling case), or will be a net negative, but you won't realize that you've surrounded yourself with a minefield for a few years (worst case).
Having an "autocomplete" that can suggest more lines of code isn't better, it's worse. You still have to read the result, figure out what it's doing, and figure out why it will or will not work. Figuring out that it won't work could be relatively straightforward, as it is today with normal "here's a list of methods" autocomplete. Or it could be spectacularly difficult, as it would be when Copilot decides to regurgitate "fast inverse square root" but with different constants. Do you really think you're going to be able to decipher and debug code like that repeatedly when you're tired? When it's a subtly broken block of code rather than a famous example?
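For reference, here is the famous fast inverse square root (from Quake III), transliterated into Python bit-twiddling via `struct`. Deciding at a glance whether a regurgitated variant carries the right magic constant is exactly the hard part: change a digit and it still "works", just with quietly worse accuracy.

```python
import struct

def fast_inv_sqrt(x):
    # Reinterpret the float's bits as a 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The magic constant. A subtly wrong value here doesn't crash --
    # it just degrades accuracy, which is the reviewer's problem.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the initial guess.
    return y * (1.5 - 0.5 * x * y * y)
```

With the correct constant, the result is within roughly 0.2% of `1/sqrt(x)` after the single Newton step.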
That Easter example looks horrific, but I can absolutely see a tired developer saying "fuck it" and committing it at the end of the day, fully intending to check it later, and then either forgetting or hoping that it won't be a problem rather than ruining the next morning by attempting to look at it again.
I can't imagine ever using it, but I worry about new grads and junior developers thinking that they need to use crap like this because some thought leader praises it as the newest best practice. We already have too much modern development methodology bullshit that takes endless effort to stomp out, but this has the potential to be exceptionally disastrous.
I can't help but think that the product itself must be a PSYOP-like attempt to gaslight the entire industry. It seems so obvious to me that people are going to commit more broken code via Copilot than ever before.
Why do that when you can just train a GPT-3 model on public repositories and call it a day?
A bare-minimum baseline validation check for Copilot would be to see whether it gives you code that won't compile in context. If it does, that means it isn't even taking into account the well-specified semantics of your chosen programming language. And even satisfying that check is still miles away from taking into account the domain of the actual problem you're using software to solve.
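In Python terms, that baseline gate is almost trivial to state -- which also shows how weak it is. A sketch:

```python
def compiles(snippet):
    # Weakest possible gate: does the suggestion even parse as Python?
    # Passing this says nothing about types, behavior, or the actual
    # problem domain -- it only rules out syntactic nonsense.
    try:
        compile(snippet, "<suggestion>", "exec")
        return True
    except SyntaxError:
        return False
```

Everything past this gate -- type correctness, semantics, fitness for the problem -- is where the real verification cost lives.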
The only place where the approach, as implemented, makes sense to me is for truly rote boilerplate code. However, that raises the question: how is this machine-learning approach more effective than the targeted heuristic approaches already taken by existing IDE tooling?
FWIW, I don't think any of this is lost on GitHub. I think Copilot is more likely a tremendously marketable half-step and small piece of a larger longer-term strategy unfolding at Microsoft/GitHub to leverage an incredible asset they're holding, i.e... practically everybody's source code. The combination of detailed changelogs, CI results (e.g. GitHub actions), Copilot, and a couple other key pieces makes for a pretty incredible basis for reinforcement learning to multiple ends.
Maybe we should use Copilot to commit more open source code meaning that Copilot becomes more and more corrupted and unusable!
Of course, then we end up with a bunch of bad open-source code that will turn people off of using open source.
Gee, I don't think Microsoft really thought this one through.
The expectation is entirely different than producing code. Code needs to be correct, secure, performant, and readable. Failure on any of those fronts can be expensive to disastrous. Nobody can reasonably expect a test suite to catch every bug, even if created by the smartest humans. If a copilot-created test does prevent a bug from shipping it provides immediate value. I could see it coming up with some whacky-but-useful test cases that a sane person might not consider. From a training perspective I would think that assertion descriptions contain more consistent lexical value than the average function signature.
It seems like the ambitious data scientists, product marketers, and managers fell in love with a revolutionary idea about AI writing code, and neglected to consult the engineers they are trying to ‘augment’.
Good tests are documentation that a computer can verify. Because they explain the meaning of parts of the system, they contain information not available in the code. If you try using ML for test generation, you'll have the same problem you do with GPT-3 prose: it might look plausible at first glance, but lacks coherent meaning.
You'd also end up with one of the problems common in big test suites: poorly factored tests that end up being the sort of expressive duplication that is a giant drag on improving existing code. ML is nowhere near advanced enough to say, "Gosh, we're doing the same sort of test setup a bunch; let's extract that into a fixture, and then let's unify some fixtures into an ObjectMother."
For people looking to get the computer to do the work of catching more things with less burdensome test writing, I suggest taking a look at things like Hypothesis: https://hypothesis.readthedocs.io/en/latest/
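The core idea behind Hypothesis-style property testing, hand-rolled as a sketch (real Hypothesis adds input shrinking, edge-case heuristics, and the `@given` decorator; the names here are illustrative):

```python
import random

def check_property(prop, gen, trials=200):
    # Throw many generated inputs at a stated property instead of
    # hand-writing individual example cases.
    for _ in range(trials):
        x = gen()
        assert prop(x), f"property failed for {x!r}"

# Example property: sorting is idempotent and preserves length.
check_property(
    lambda xs: sorted(sorted(xs)) == sorted(xs) and len(sorted(xs)) == len(xs),
    lambda: [random.randint(-100, 100) for _ in range(random.randint(0, 20))],
)
```

The point is that the human states the meaning (the property) and the machine does the grunt work of generating cases -- the opposite division of labor from having a model guess the tests.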
> If you try using ML for test generation, you'll have the same problem you do with GPT-3 prose: it might look plausible at first glance, but lacks coherent meaning.
There is a company in this space of generating "plausible tests" for legacy code bases at very large enterprises (think Goldman Sachs, telcos etc) called Diffblue [0].
They raised funding back in 2017 [1] and it seems their biggest value-add is in creating unit tests for legacy Java code bases that often have little to no unit tests.
Essentially, these AI-generated unit tests help a team "document" all the known behaviors of a legacy code base, such that when a change is introduced that violates a behavior covered by the generated tests, the tool can alert the team to the potential presence of a regression.
Anyway, they offer a fairly basic browser-based demo of their AI product called Diffblue Cover [2].
Are you aware of them?
0: https://www.diffblue.com/
1: https://techcrunch.com/2017/06/27/diffblue/
2: https://www.diffblue.com/try-cover-browser/
I feel like you just described every developer/codebase where mock testing is stupidly enforced. Where every single unit test mocks every single indirect object. 98% of the testing code is just exhaustive setup and teardown of objects not being tested by each test, and then a bunch of conditional checks to ensure that every deeper/indirect method is being called exactly the right number of times with exactly the right arguments and returning exactly the right value. Almost all of the test code is just hacking mock objects. The actual purpose of each test is buried so deep that it's impossible to even understand the business logic being applied.
I hate evangelical "mock testers" with a passion.
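A caricature of that style (all names hypothetical), where the mock bookkeeping dwarfs the one assertion about actual behavior:

```python
from unittest.mock import MagicMock

# Hypothetical over-mocked unit test: every collaborator is a mock, and
# most assertions pin down call counts and arguments rather than the
# behavior actually under test.
def test_place_order():
    db = MagicMock()
    logger = MagicMock()
    mailer = MagicMock()
    db.save.return_value = 42

    order_id = place_order(db, logger, mailer, item="widget")

    # The single business-logic assertion, buried in mock bookkeeping:
    assert order_id == 42
    db.save.assert_called_once_with({"item": "widget"})
    logger.info.assert_called_once()
    mailer.send.assert_called_once_with(order_id=42)

# The (hypothetical) code under test, included so the sketch runs.
def place_order(db, logger, mailer, item):
    order_id = db.save({"item": item})
    logger.info("order placed")
    mailer.send(order_id=order_id)
    return order_id

test_place_order()
```

Four lines of setup and three call-count assertions to protect one line of logic -- and refactoring `place_order` internally breaks the test even when behavior is unchanged.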
Maybe Copilot 2 will do exactly this: generate tests based on half-working code, run them, and suggest improvements. That would increase productivity by something like 100%, but to me it sounds too good to be true.
If Copilot can't write the correct code in the first place, you really shouldn't expect a proper test to be written by Copilot.
> Code needs to be correct, secure, performant, and readable.
Most tests should also have at least three of those attributes. Nobody actually wants their tests to be incorrect, slow, or impossible to understand or modify.
"Oh, and they're both wrong."
"Both look plausibly correct at a glance"
You would end up with tests that look plausibly correct, but test the wrong results.
1. Smart Code: Code that you honestly have to think about while you're writing it. You write this code slowly and carefully, to make sure that it does what you need it to do.
2. Dumb Code: This is trivial code, like adding a button to a screen. This is code you really don't have to think about, because you already know exactly how to implement it. At that point your biggest obstacle is how fast your fingers can type on a keyboard.
For me Github Copilot is useless for "Smart Code" but a godsend when writing "Dumb Code". I want to focus more on writing and figuring out the "Smart Code", if I need to throw a form together in HTML or make a trivial helper function, I will gladly let AI take over and do that work for me.
UX is probably the most important aspect of most software products. Every software product is either "smart code" or "smart ux". No one pays much for "dumb code with bad UX" except in dysfunctional markets.
Adding a button to a screen should be trivial, and if it's not you need better tools. (As in "a not-horribly-misdesigned language and framework", not as in "giant transformer".)
Deciding where to add the button, its shape, its size, what happens when it's clicked, the text on the button, ... is anything but trivial.
Imho, it is just an argument for making better languages and libraries. (These libraries will also make it easier to use with copilot.)
Once we spot a tedious common pattern, we should be finding ways to DRY it up. Configs, libraries, frameworks, DSL, tools, and languages are all great ways to do that. Copy-pasting and machine-generating code are short-term thinking in two ways: they focus on the initial creation of the code at the expense of maintenance, and they give up on increasing abstraction, lock the system into a productivity plateau.
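A minimal example of the point (names hypothetical): instead of letting a tool stamp out the same range-check boilerplate in every handler, extract it once and each call site shrinks to a single line.

```python
# One helper replaces a repeated if/raise block that would otherwise be
# copy-pasted (or machine-generated) into every setter.
def require_in_range(name, value, lo, hi):
    if not (lo <= value <= hi):
        raise ValueError(f"{name} must be between {lo} and {hi}, got {value}")
    return value

def set_volume(v):
    return require_in_range("volume", v, 0, 100)

def set_brightness(b):
    return require_in_range("brightness", b, 0, 255)
```

The abstraction, unlike generated duplicates, has one place to fix bugs and one place to extend behavior later.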
It is what the action of that button is where the real fun comes in.
I once had a project that was a yes/no dialog. Two buttons and some text. I had the dialog up and running in under an hour. The action that happened when you pressed yes took 3 months to finish.
If not, why should copilot?
The article shows that I can't trust GitHub Copilot, so I don't think the name fits. It's less a copilot than a servant.
If people can copy-paste the most insecure code from Stack Overflow or random tutorials, they will absolutely use Copilot to "write" code, and it will become the default, especially since it's so incredibly easy to use. Also, it's just the first-generation tool of its kind; imagine what similar products will accomplish in 20 years.
This is a product by a well-known company (GitHub) which is owned by an even more well-known company (Microsoft). GitHub is going to be trusted a lot more than a random poster on Stack Overflow or someone's blog online. And GitHub is explicitly telling new coders to use Copilot to learn a new language:
> Whether you’re working in a new language or framework, or just learning to code, GitHub Copilot can help you find your way. Tackle a bug, or learn how to use a new framework without spending most of your time spelunking through the docs or searching the web.
This is what differentiates Copilot from Stack Overflow or random tutorials. GitHub has a brand that's trusted more than random content creators on the internet. And it's telling new coders to use Copilot to learn things and not check elsewhere.
That's a problem. Doesn't matter what generation of the program it is. It creates unsafe code after using its brand reputation and recognition to convince new coders to not check elsewhere.
> GitHub has a brand that's trusted more
Consider Google Translate, right? Google is a well-known brand that is trusted (outside of a relatively small group of people that doesn't trust Google on principle). Yet every professional translator knows that the text produced by Google Translate is a result of machine translation, Google or no Google. They may marvel at the occasional accuracy, yet expect serious blunders in the text, and would therefore not just trust that translation before submitting it to their clients. They will check. Or at least they should.
Same with programmers.
People said the same thing about UML and similar tools so I'm not holding my breath.
Copilot is something different. Code is suggested automatically and, most importantly, suggested by an authority: hey, this is GitHub, the largest code repository on the planet, owned by Microsoft, one of the most successful companies ever. Why should you not trust the code they are suggesting?
And that's just for starters, before malicious parties begin creating intentionally broken code solely to hack systems built with it, or greedy lawyers chase down some innocent code snippet and demand payment for its use.
I wonder if there was any sort of filter for Copilot's input — only repositories with more than a certain number of stars/forks, only repositories committed to recently etc.
That's the whole point, and the rest is moot because of it. If I choose to let Copilot write code for me, I am responsible for its output, full stop. This is the same as letting a more junior engineer submit code to prod, yet there aren't blog posts about not letting juniors work or not trusting them with code.
It feels like it wrote the whole line you were going to write, exactly as it should be. But that's all it does. And it seems like Copilot is the same, just at a much larger scale and online.