wrs · 3 years ago
“…modify[ing] its AI model so that it traces attribution and gives credit to the original authors of the code, adding the associated copyright notices and license terms in the process…Biderman says is technologically feasible.”

Is it really feasible? What does “traces attribution” even mean here? It’s not emitting “code”; it’s emitting individual tokens, each of which appears throughout the input corpus. The “code” is the arrangement of those tokens, but that arrangement is determined by the weighting of the whole network, so what exactly can be traced?

Can someone who understands generative ML better than me weigh in on this?

dahart · 3 years ago
Why wouldn’t it be feasible? (Maybe this depends on what you mean by ‘feasible’.) There’s no technical reason you can’t back-track the weights and make a list of which tokens from which training data were sampled. The list might be long, it could be impractical, but that has little bearing on whether it’s technically possible, right?

The problem here happens when the same source is sampled for many tokens in a row because it’s the only match for the context. It could also happen that many tokens in a row each have a long list of sources, but when put together have a subset of sources that appear in every token’s list. That means that someone’s input is being repeated verbatim, even if the network wasn’t trying to reproduce a single source. We could prune the list of attribution sources at the expense of compute by running largest common subset algorithms, which might be sufficient for attribution tracing?
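
To make that pruning step concrete, here is a minimal sketch of one way to read it, assuming we already had a per-token set of candidate sources (which is the hard, unsolved part); the function name and run-length threshold are made up for illustration:

    from typing import Dict, List, Set

    def common_source_runs(token_sources: List[Set[str]], min_run: int = 8) -> List[Dict]:
        """For each output token we have a set of candidate source files.
        Report runs of consecutive tokens that all share at least one source;
        a long run with a non-empty intersection suggests verbatim copying."""
        runs = []
        start, common = 0, (set(token_sources[0]) if token_sources else set())
        for i in range(1, len(token_sources)):
            narrowed = common & token_sources[i]
            if narrowed:
                common = narrowed
                continue
            if i - start >= min_run:
                runs.append({"start": start, "end": i, "sources": common})
            start, common = i, set(token_sources[i])
        if token_sources and len(token_sources) - start >= min_run:
            runs.append({"start": start, "end": len(token_sources), "sources": common})
        return runs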

It feels like this whole question might hinge on Fair Use. The network is copying other people’s code one token at a time. We (society & copyright law) all tend to agree that’s fine when it’s a single token out of context, and we all tend to agree it’s not fine when the whole output program matches any single input source. The question naturally becomes, “where’s the line, how many tokens in a row from a single source should be allowed?”

wrs · 3 years ago
The token “if” appears in a heck of a lot of inputs, and a heck of a lot of outputs. What you’re describing seems like basically building an inverted index of the input corpus and then doing a search for the output. If the answer comes back with a high relevance then you consider it “traced”. I wonder what the size of that inverted index would be compared to the size of the generative model.
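
For what it's worth, a toy version of that inverted-index idea might look like this, with word-level n-grams standing in for model tokens (an assumption for readability); the index-size question remains open:

    from collections import defaultdict
    from typing import Dict, Set, Tuple

    def build_ngram_index(corpus: Dict[str, str], n: int = 8) -> Dict[Tuple[str, ...], Set[str]]:
        """Map every n-gram of whitespace-separated tokens to the set of files it occurs in."""
        index: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
        for path, text in corpus.items():
            tokens = text.split()
            for i in range(len(tokens) - n + 1):
                index[tuple(tokens[i:i + n])].add(path)
        return index

    def trace_output(output: str, index, n: int = 8) -> Dict[str, int]:
        """Count how many n-grams of the generated output also appear in each corpus file."""
        hits: Dict[str, int] = defaultdict(int)
        tokens = output.split()
        for i in range(len(tokens) - n + 1):
            for path in index.get(tuple(tokens[i:i + n]), ()):
                hits[path] += 1
        return dict(sorted(hits.items(), key=lambda kv: -kv[1]))

At n=8 a lone "if" matches nothing; only longer shared stretches show up. Whether such an index ends up larger than the model itself is exactly the question above.
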
c7b · 3 years ago
> There’s no technical reason you can’t back-track the weights and make a list of which tokens from which training data were sampled.

The weights map inputs to outputs, not training samples to outputs. They are adjusted during the training phase and then kept fixed (until there's a new iteration of the model; and let's ignore online learning for simplicity's sake now, the major models we're talking about here don't use it afaik). From the perspective of a user who inputs prompts and receives answers, the weights are constant. We could create some score indicating how much each training sample contributed to the weights overall, but then that score would be the same for all prompts.

It's really not clear to me what you mean by 'sampling from training data' for a specific prompt. The best I could imagine would be creating another metric measuring the distance between the prompt and all training samples, and then somehow combining that with the first score to get an overall contribution metric for this prompt. But that would be a) quite a bit of guesswork with a lot of modelling freedom and b) quite different from what you described.

belorn · 3 years ago
Fair use is a bit more context dependent, especially when applied to other forms of media. Unstable Diffusion is a big example where many seem to feel that no fair use should be allowed, regardless of how small a token the AI takes. It also depends somewhat on the original work and how it fits into society at large.

Having fair use be the pillar that all AI training stands on is going to take a while.

nonrandomstring · 3 years ago
The judge/law will say it comes down to intent.

Here's my take from an audio DSP viewpoint.

A composer listens to 10,000 hours of music. One day she writes an original piece based on the annealed parameters (temporal, spectral, intensity, pitch, sequence...) of a million artists. However it sounds like another specific artist who sues.

It is not a cover, a remix, homage, or even forged in the genre... it's just accidentally a bit too like it. (compare: Banana Splits vs. Bob Marley - and - Huey Lewis vs. Ray Parker Jr. - which completely misses the real impact of "Pop Muzik" by M)

The question is, was she exposed to influencing materials incorporated without intent, or was it plagiarism (intent to reproduce a derivative, etc.)?

By contrast I can take a piece of music, break it down by analysis into melodies, chords, timings, and use FFT to extract the precise spectrums of instruments, feed those to a resynthesis engine that finds new synthesiser parameters to create an exact sound-alike and then deliberately recreate a piece "In the near style of artist X". With a little musical processing I can change the key, inversions, re-template the rhythm to a new swing... always pushing the derived piece into new territory until eventually it's barely recognisable.

Nonetheless, in the second case I have clearly intended to steal someone's idea and "make it my own" by automatable transformation.

In the former case it seems to be a "genuine labour". (whatever that means in 2023)

The genuine artist intends to make something through intellectual labour.

Maybe that's the real question. I mean, about "labour". If the cost of the labour tends to zero, does it really matter?

janalsncm · 3 years ago
The model isn’t sampling from training data during inference. It’s sampling from a latent space created during training, which used all of the training data to create it.

Fundamentally, the presence of a next token in some of the samples in the corpus is just as informative as the absence of that token from other samples. You can’t cite one without the other.

If you wanted to give attribution you’d need to list the entire training corpus.

johnfn · 3 years ago
The failure mode I see here, which seems quite likely, is that attribution would almost always list tens of thousands of source files, or more. That doesn’t seem particularly useful or meaningful.
falcolas · 3 years ago
> It’s not emitting “code”, it’s emitting individual tokens ... The “code” is the arrangement of those tokens, but that is determined by the weighting of the whole network

This theory of operation is not borne out in reality. It's been clearly displayed that these tools are emitting verbatim copies of existing code (and its comments) from their training input.

It's even being seen in image generation, where NatGeo cover images are reproduced in their entirety or where stock photo watermarks are emitted on finished images.

And so, what can be traced back to individual sources? Quite a bit it would seem.

sillysaurusx · 3 years ago
Well yes, if you’re asked to memorize someone’s code, you’ll get surprisingly far too. The fact that models can do this isn’t evidence of anything. It’s a capability (or “tendency” if you overfit too much).

I think it’s pretty obvious that if you train a model to reproduce existing work in its entirety, it fails the “sufficiently transformative” test and thus loses legal protection.

But there’s nothing stopping you from re-coding existing implementations of GPL’ed code. I used to do it. And your new code has your own chosen license, even if your ideas came from someone else. Are you sure the same logic shouldn’t apply to models?

safety1st · 3 years ago
Even if that theory of operation were true, it seems like an implementation detail that would be largely irrelevant in a court of law. Infringement is infringement regardless of how complicated your method for committing it was. If your AI reproduces a copyrighted work verbatim without permission, surely you are committing infringement? If you would be infringing by doing it via copy and paste, I would think a court would be likely to find that you infringed when you used an AI to do it. It does not matter that the algorithm on the backend is fancy...
rymate1234 · 3 years ago
> It's been clearly displayed that these tools are emitting verbatim copies of existing code (and its comments) in their input.

Which makes sense when you consider that the code being reproduced verbatim is usually library functions that developers copy and paste into their projects, comments and all. If you prompt the AI with the header of a function that has been copied and pasted often, the weightings will in that instance be heavily skewed towards reproducing that function.

wrs · 3 years ago
The output may be recognizable by humans as being “the same”, but an ML classifier can look at what is “obviously” a cat and classify it as a turtle. So while your statement is accurate, it doesn’t explain how “traceability” could really work—what is the actual mechanism of “tracing”?
mcintyre1994 · 3 years ago
On your image generation point, something potentially interesting I saw on twitter/Reddit was about artists uploading “no to AI” images to artstation and then that logo almost appearing to vandalise AI generated art. I’m not sure if it’s a real thing or just people on twitter and stuff who don’t like AI art trolling though. Mostly because I’m not sure how up to date some of these models are being kept.
forrestthewoods · 3 years ago
> It's even being seen in image generation, where NatGeo cover images are reproduced in their entirety or where stock photo watermarks are emitted on finished images.

Can you cite sources? I’ve heard this claim repeatedly but have yet to see a good example.

ShamelessC · 3 years ago
The quote does not make that assertion…
Rekksu · 3 years ago
Do you have a source for this claim? This isn't really how generative models work
ProlificInquiry · 3 years ago
This quote seems to fundamentally misunderstand what transformers are doing at all. Technically I suppose you could save all gradient updates from every input token, and do some weighted averaging to show what inputs affected the particular output the most, but saving all those gradient updates would be unimaginably space consuming. "Feasible" is doing a lot of work there.

It's very hard for people to get away from the idea that GPT is "copying" something, but that's not what it's doing. The reality is, to get the exact artifact which produced the code in question, you need "Call me Ishmael" from Moby Dick just as much as the Linux kernel source.

amelius · 3 years ago
> The reality is, to get the exact artifact which produced the code in question, you need "Call me Ishmael" from Moby Dick just as much as the Linux kernel source.

Not always. Sometimes it just copies code without modification.

It never tells you when it does that, though. So to be on the safe side, better assume that it always does.

janalsncm · 3 years ago
Even then, the presence of a next token is just as informative as the absence of another during training. That information also gets backpropagated. And during inference, good luck identifying which weights were responsible for a given next token (assuming we’re using greedy decoding, don’t even get started on beam search) let alone which samples contributed to that weight (hint: they all did).
Filligree · 3 years ago
It sounds like nonsense. The most plausible solution would be to provide credit to every single author whose code was used in the original training set; of course, that would run into gigabytes just for the credits.
logicallee · 3 years ago
But GPL requires the resulting derivative code to be open-sourced under GPL, not just attributed. So if you did that then copilot could only emit code under a GPL license. Maybe that is okay, but it is not what copilot is trying to do at the moment. (And not everyone wants to open-source all their code under GPL.)
varjag · 3 years ago
And…? I think my last Xcode update was 20+ GB.

hackinthebochs · 3 years ago
In the ideal case the next token is determined by the local context (the prefix string) and the entire corpus of trained code. In this case the prefix string has not been seen before and so the generator must do some interpretation/extrapolation to determine the likely continuation. But in some cases, perhaps many cases, the prefix string has been seen before, or is similar enough to what has been seen before, that the best continuation is just to spit out the similar string in the training corpus. Presumably such cases can be detected due to specific patterns of activation in the network and attributions can be captured/applied.

One dumb way to do this would be to include self-attributions directly in the stream of training data. So in the cases where the best continuation is to just transcribe the training data, the attribution is included in the data itself.
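
A toy version of that dumb way, assuming a plain-text training pipeline; the marker format, file path, and licence string are invented for illustration:

    def annotate_training_example(source_path: str, license_id: str, code: str) -> str:
        """Prepend a machine-readable attribution marker to a training example, so that
        if the model transcribes this region nearly verbatim it has at least had the
        chance to learn a strong association with the marker as well."""
        return f"# ATTRIBUTION: source={source_path} license={license_id}\n{code}"

    corpus = [
        ("github.com/example/libfoo/foo.c", "GPL-2.0",
         "int add(int a, int b) { return a + b; }"),
    ]
    training_stream = "\n\n".join(
        annotate_training_example(path, lic, text) for path, lic, text in corpus
    )

Whether the model would actually reproduce the marker along with the code, rather than dropping it, is an empirical question.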

wrs · 3 years ago
That “presumably” is exactly what I’m trying to get past. How would it actually work? Once the input corpus has been chewed up and reduced to tokens, is there actually a representation of “the similar string in the corpus” any more?

I don’t think people are even clear on the problem statement here. If I fed in three very similar functions from different sources, and I got a fourth, also very similar, output, what is the “traced attribution” supposed to be?

dec0dedab0de · 3 years ago
Honestly, I would be fine with verbatim chunks of code if it could guarantee a compatible license and handle the copyright notices properly.

That always seemed like a problem with open source, you should theoretically be able to copy bits of code from hundreds of projects to make a new one, but keeping track of the licenses makes it too much of a pain. So the closest we really see is people vendorizing libraries.

beecafe · 3 years ago
You can just find the most matching snippets in the training data using an embedding model.
visarga · 3 years ago
They probably mean using a code search engine to check all snippets. The simplest thing would be an n-gram filter. A more advanced approach would use a code similarity neural net. It's not principled attribution, just locating the most similar example in the training set.
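
As a sketch of what that similarity search might look like (the embed function stands in for whatever code-embedding or code-similarity model is used, which is an assumption here; note this ranks by similarity and does not establish causation):

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def nearest_training_snippets(output_code: str, snippets: list[str], embed, top_k: int = 5):
        """Rank training snippets by embedding similarity to the generated output.
        This finds the most *similar* examples, not the examples that *caused* the output."""
        query = embed(output_code)
        scored = [(cosine(query, embed(s)), s) for s in snippets]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]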
nonrandomstring · 3 years ago
> It's not principled attribution, just locating the most similar example in the training set.

This is a profoundly important distinction.

Back-tracing data to contributory training examples is a genuine "influenced by" relation. Picking the nearest neighbour to a given result (even if it's an exact copy!) cannot say anything useful with respect to origins. And given that there will always be some proximate neighbour, it's really a "misattribution machine".

This is a bit like how our broken patent system grants or denies ownership of a design based on similarity to extant art, regardless of actual originality.

godelski · 3 years ago
So there are a few ways this can actually be done, with different levels of accuracy/precision. But it is going to be complicated no matter what.

The easy thing to do is to compare outputs to inputs. This isn't technically hard (i.e. cosine similarity) but it is computationally difficult (i.e. cosine similarity of output to entire input). This would give us some weightings that show similarity. But this doesn't really tell us attribution, rather more a correlation. These are statistical models so there's reason to believe that this is okay.

Then there's inversion. This is processing data backwards through the model. The complexity differs depending on what type of model you're working with. GANs aren't great, diffusion models are okay, normalizing flows are trivial (but good luck generating good images from NFs). If we can invert the model it is much easier to investigate and probe for contribution by looking at the distance of the latent generative variable to the location of the latent trained data. Basically you're looking at how different information is contributing to the overall output. This can also be done at every level in the network. Obviously this gets both technically and computationally challenging.

Another method would be using dataset reconstruction (this is outside my wheelhouse fwiw). This is where you try to recover the dataset from the final trained network. This too is complicated but there's plenty of papers showing progress in this space (lots of interest from privacy groups).

(TLDR-ish) There are other methods too. But basically what is being said is that there are ways to denote what and how much of the training data contributed to the output of the model, from which we can then measure how similar the output is to the inputs (i.e. copying).

nonrandomstring · 3 years ago
The answer is of course it's possible, so long as you have a couple of gigabytes spare for the acknowledgements page. Any construct worth attributing will have roots in billions of parameters.
ProlificInquiry · 3 years ago
It's far worse than that: to attribute a particular output to an input you don't just need the input data, you need the gradient updates that data caused in the training run. A couple hundred gigabytes of input tokens times 175 billion parameters equals... impossibility.
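
Back-of-envelope, with round numbers (the corpus size and precision are assumptions, just to show the order of magnitude):

    tokens = 300e9        # assume roughly 300 billion training tokens
    params = 175e9        # 175 billion parameters
    bytes_per_value = 2   # fp16
    total = tokens * params * bytes_per_value
    print(f"{total:.2e} bytes")   # ~1e+23 bytes, on the order of 100 zettabytes

So even before any cleverness, the naive bookkeeping is many orders of magnitude beyond any storage system.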
skibidibipiti · 3 years ago
Theoretically possible, technically challenging. The training data would need to be annotated with traces, so a person would have to figure out where all of the answers came from.
ajsnigrutin · 3 years ago
Maybe like a science paper... 10 lines of code, 50 citations :)
jacquesm · 3 years ago
To me AI code generators are the equivalent of crypto tumblers or mixers for digital coins. You can pretend all you want that the output is 'clean' but we all know it came from somewhere else and wasn't actually generated by the software, just endless little snippets that other people made.
godelski · 3 years ago
This is definitely not true. In theory a generator could reproduce a distribution of data that includes more than the data it was able to sample from. For example, you could create the distribution of all possible faces from a subset of sample faces. That means you could create the faces of people who do not yet exist. Now good luck getting that perfect generator (black swans, bias, etc.), but it would be naive to believe that all generators are just copying input data and mashing it together. At least, unless you believe that's also all that happens in the real world, but then what's the difference?
remus · 3 years ago
I don't think that's entirely true. When I've tested Copilot it also knows something about the context in which it's working, so it can use relevant variable names in its suggestions. To me that's a step beyond rote regurgitation of other code.
jacquesm · 3 years ago
That may be so, but the process is automated and copyright law has something to say about mechanical transformation.

As an example: if I take a piece of code you wrote and substitute all of the identifiers, that does not create an original work.

system2 · 3 years ago
ChatGPT output is very clean and precise, especially when the variables are provided in detail. I doubt you can trace it back if you make the prompt very elaborate.
jacquesm · 3 years ago
Why can't you trace it back? Remove all the input, fill the model with noise. Prompt all you want, it will produce just noise. Train the model on your chosen input text. Suddenly it starts to provide 'clean and precise text'. Absent an argument that we're looking at AGI, I see no other possible conclusion than that this is a mechanical transformation of two inputs (yours + the training data) into some output. What happens inside the box is irrelevant, that's just implementation details of the mechanism. If you build your mechanism in such a way that you can't trace it yourself, that doesn't mean that you get to claim the output as an original work.

scotty79 · 3 years ago
falcolas · 3 years ago
I propose that information naturally wants to degrade. Paper decomposes. Bits flip. File formats are replaced and lost. Storage mediums degrade. It's all an extension of the universe trending towards entropy.

It actually takes quite a bit of effort to store, then distribute information precisely and broadly. There's a lot of infrastructure, effort, and money involved, and still information degrades and disappears over time.

Any libre information exists because people have put effort into it. Sometimes a lot of effort.

nativeit · 3 years ago
I concede your larger point that information availability requires upkeep, but I don’t think the tendency for information to degrade is usually attributable to entropy. It’s probably true enough in some cases, but more often than not it’s going to be general interest from the population that determines the accessibility of information, and that crosses boundaries and regulatory efforts like copyright. The more intense the interest and/or awareness (generally speaking), the more likely the information will find its way into the public domain.

It’s why you can easily find literature written hundreds or thousands of years ago in multiple languages with little effort, but some memo stored in a tape archive from the 1970s is likely gone forever. “Bitrot” is more likely to be a function of diminishing interest rather than increasing entropy.

CodexArcana · 3 years ago
AI is inherently incompatible with the notion of private property as a result of the way it is "trained". It unintentionally exposes how much of our creativity, exploration, and innovation we've commodified.
scotty79 · 3 years ago
That's why it wants to be free. To survive by being copied and adapt by being freely built upon.
somrand0 · 3 years ago
considering that information is something static (which doesn't change) that describes (is about) how something else is changing

I think that indeed, it is the inherent nature of information to radiate itself, i.e. to share (to shine, to spread)

belorn · 3 years ago
When Copilot was released the copyright discussion focused exclusively on code, but now we see very similar discussions around images with Stable Diffusion and the sister project Unstable Diffusion. When Copilot does reach the courts there will be some indication of how courts view author consent when it comes to training material. After that we might then see court cases for each form of media (images, video, text, sound), and also in each of their unique contexts (books, online videos, porn, stock databases, and so on).
jacquesm · 3 years ago
In my opinion using all of the code on GitHub without respecting the licenses was a capital mistake. It should have been opt-in, maybe with some incentive but to just take it all without so much as a by-your-leave is not going to play well in court.
Permit · 3 years ago
In my opinion it was covered under the GitHub terms of service and is clearly transformative. I am optimistic that the courts will find it so and we can put these debates to rest similar to how we’ve done for web scraping.
notacoward · 3 years ago
It would have been awesome if Copilot had been written so that you tell it what your license is and it emits code trained only on code with compatible licenses. That would have created an incentive for people to publish their own code under more open licenses, instead of facilitating IP theft and thus creating a disincentive. That GH didn't choose this option is rather telling IMO.
rboyd · 3 years ago
I think it’s silly to pretend that human programmers are emitting a lot of code with a high degree of originality either. We’re all remixing some long-forgotten influential code lying deep in latent memory, just like the models.
danShumway · 3 years ago
Humans do remix stuff when we create, and copyright draws a kind of arbitrary line around when that becomes original and when it becomes derivative, mostly because if it didn't then it would be impossible for us to get new books, movies, and songs.

All copyright is a hack; it's not an ideological, internally consistent framework. It is a set of fallible legal rules we invented so that people would get paid for creating things we care about. That doesn't mean that comparisons aren't useful or that we can't extrapolate from the existing rules, but even where fair use is concerned, the actual justification isn't a logical one, it's: "if we didn't have this standard nobody would make anything." So the boundaries around fair use and what things fall into it are not being created purely based on logic or first principles.

The reason why we have copyright is because we want people to be paid for making creative work. The reason we have fair use and a standard of "originality" that treats coders/artists learning from other artists as acceptable is because if we didn't, the entire system would fall apart and nobody would be able to make new creative works.

Everything in copyright exists purely to get people to make more stuff in a sustainable way. It's an outcome-driven process.

----

More recently, there are a lot of people who argue that IP is a real, fundamental property right, but frankly, IP doesn't stand up at all if you think about it too hard. The justifications for why IP theft is theft can't be consistently generalized in a way that applies outside of the IP space. The standards for what does and doesn't count as creative aren't really consistent or based on a straightforward definition of creativity.

A lot of people would love to say that IP rights are just property rights, but... it's not all that convincing, and the history of copyright doesn't really indicate to me that the people building the laws thought of them that way.

And again, that doesn't mean that there's no consistency in copyright rules or that copyright rulings don't have implications beyond the original rulings. But it is almost always easier to think about copyright and almost always easier to understand why copyright laws are the way that they are if you approach copyright as a means to an end, and understand the existing laws not as an attempt to create an internally consistent system, but as a series of attempts throughout US history to achieve a consistent publicly beneficial outcome.

----

With that in mind, I suspect whether or not AI works count as remixing is largely going to be decided based on commercial interests, individual judges, and individual juries, possibly with input from US legislature.

"Everybody remixes" historically hasn't been the most useful argument during these debates? So I don't know how it's going to play out this time around. I vaguely suspect it's going to come down to whether or not individual pieces are recognizable? That's how we got wild copyright laws about some individual chord progressions being treated as derivative works in songs.

marstall · 3 years ago
by reducing credit, copilot reduces incentive to create and publish free code. biting the hand that feeds it.

exact same problem exists with GPT3 and others.

big tech slashing and burning, ruthlessly exploiting the least empowered people in the tech economy.

neat hack.

scottomyers · 3 years ago
I used to put most of my code on public repos in Github, Gitlab, etc. Since Copilot came out, I've moved most of my projects to a Git server on my home network.

The prospect of knowledge work being resold in this way feels icky. Perhaps I'd feel different if there was a simple, well-defined way to opt out of contributing to the training corpus. Opt-in would be even better.

alextheparrot · 3 years ago
Copilot also makes it easier to use open source framework code.

Being able to ask a model for help writing Spark, SQLAlchemy, Tokio, whatever, actually increases the usability of free code vs proprietary code.

Thews · 3 years ago
How does copilot reduce the incentive to publish free code?
chubbnix · 3 years ago
If it writes what would have been open source dependencies based on open source code, you no longer have to license that work or your own. Also, coders might not want to open source original work for fear they are just going to feed the beast that will ultimately kill them. Sort of a chilling effect if they cannot choose whether or not to contribute to that cause while also sharing code as open source.
dspillett · 3 years ago
Depends how you define free.

If an AI being inspired by GPL code that happened to be in its training data because someone other than the creator stuck it in a repo on GitHub is just fine, and if some code produced as a result is practically identical that is fine too, and the resulting code is not considered GPL any more, then the GPL and licences like it are worthless.

The only two designations that mean anything at that point are public domain and commercial. “Free” as in “Free Software”, rather than public domain, ends up meaning the same as public domain.

So there is no point releasing free (other than public domain) code, discouraging the act. If I want to control how my code is used at all, protected commercial release becomes the only option.

This of course suits the commercial interest behind copilot just fine and dandy…

---

But, if commercial code ended up in the training set the same should apply, because in terms of giving the right to use code, licences like the GPL and commercial licences are no different: the licence gives the right to use the code. If passing it through an AI gives that right, bypassing the licence, in one case then it should in the other too. I wonder if MS would be happy for copilot's own code to be in the training set and for me to produce and sell something based on the output of an AI trained with their code?!

---

I think the AI systems like copilot should be considered the same as us wet-ware naturally formed intelligence systems in that respect: if it produces something based on code under a particular licence then that something should be subject to the terms of the licence. Ignorance of the licence is no excuse. If the AI can not be made aware of the correct licence and attribution for the code, so it can include that with its suggestions based on it, then that code should not be in the training set.

For decades MS complained about open source code potentially creating this very situation, just with only non-artificial intelligences in the mix, now they are hoping no one can call them on that because it is convenient for them to ignore the issue.

marstall · 3 years ago
> by reducing credit
mellosouls · 3 years ago
I would have thought that in the vast majority of current AI-generated code we are talking about single blocks and functions that are just Intellisense on steroids that only a rather self-deluding coder would consider original enough or "theirs" to attribute authorship to.

There are no doubt grey areas and more serious cases as the technology improves and the generated content increases in length and functional value, but I hope we don't throw the productive baby out with the Luddite bathwater...

fipar · 3 years ago
I don’t disagree with you but I think there’s another angle to the discussion that involves the difference between the litigious capacity of corporations vs individuals to defend the simple kind of creations you describe, as I think corporations have defended (and even patented) very simple patterns before.

For what it’s worth, I wish the approach was as you describe but in all cases: make it impossible to defend attribution of basic building blocks regardless of how deep your pockets are.

vouaobrasil · 3 years ago
I think that kind of depends on how common the snippet is. If it's code for swapping two variables, well, everyone has written that and it's small. But what if it's a rarer thing, like a new kind of debayering algorithm, where a much larger "snippet" is taken from a novel algorithm? As far as I'm aware there are no rigorous studies showing the extent to which these models can duplicate longer stretches of code, or whether they can do something like this at all.
andybak · 3 years ago
Algorithms aren't copyrightable as far as my understanding goes and this is where it gets murky. The expression is copyrightable, the algorithm isn't (except when it is...)

My personal feeling is that we've gone too far down the road of assigning IP to code and we should be rolling it back. I'd hate for the current AI boom to trigger an extension of current IP law.

mellosouls · 3 years ago
Yes, of course, but that isn't the "vast majority" of use cases I pointed out; most people will presumably be using it for bog standard stuff in place of googling and thinking for a short while - let alone the people who didn't need the memory jog and are just selecting from an automatic prompt to save them typing it.
megous · 3 years ago
You can generate whole Linux kernel drivers with e.g. ChatGPT. And things will only get better, so your dismissal is not very forward-looking. The issue will need to be dealt with.
mellosouls · 3 years ago
You should reread my comment to see that I explicitly made exceptions for more advanced cases.
scarface74 · 3 years ago
I can only speak to my experience using ChatGPT. But it’s doing a lot more than copying and pasting code snippets it finds on the internet. It actually is translating English to code.

I had a “DevOps” project I was working on, creating a deployment process using AWS technologies (disclaimer: I work at AWS in Professional Services). I needed a few relatively simple Python scripts.

I first asked ChatGPT:

“given a JSON file like this [{“company”:”${company}”} replace the word surrounded by ${} with the equivalent environment variables using Python”.

It worked perfectly. But it hardcoded the input file.

“Modify the script to accept the input file using a command line argument -json-file using argparse”

That worked and used the “required=true” parameter.

“Instead of skipping replacement if an environment variable is not found, raise an exception”.
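
For context, roughly what the finished script would look like after those three prompts; this is my own reconstruction, not ChatGPT's actual output, with the -json-file flag as prompted:

    import argparse
    import json
    import os
    import re

    def replace_env_vars(obj):
        """Recursively replace ${VAR} placeholders in JSON string values with the
        corresponding environment variables, raising if a variable is not set."""
        if isinstance(obj, dict):
            return {k: replace_env_vars(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [replace_env_vars(v) for v in obj]
        if isinstance(obj, str):
            def lookup(match):
                name = match.group(1)
                if name not in os.environ:
                    raise KeyError(f"Environment variable {name!r} is not set")
                return os.environ[name]
            return re.sub(r"\$\{(\w+)\}", lookup, obj)
        return obj

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("-json-file", dest="json_file", required=True)
        args = parser.parse_args()
        with open(args.json_file) as f:
            data = json.load(f)
        print(json.dumps(replace_env_vars(data), indent=2))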

And it seems to understand the AWS SDK. I told it to convert a script I wrote to a CloudFormation custom resource using the cfnresponse module and it knew the correct Lambda event structure, the event format etc.

I have seen reports and have witnessed it making up functions occasionally.

But, I believe with the right prompts, it should be able to create simple CRUD scripts.

mellosouls · 3 years ago
I understand what it is doing, and how impressive it is; the question is whether or not the normal use case is for stuff that requires the originality of thought that requires attribution, and I can't see that being the case except in a very small minority of instances.
joshspankit · 3 years ago
I’m horrified.

I know we’re “just” talking about code here but decisions will be far-reaching and if we let the powers that be force AI-generated content to be copyright-attributed, two things are going to happen:

1. The biggest benefits of AI are going to be pushed back for decades and possibly indefinitely

2. Artists will be absolutely slaughtered as the same rules will come at them full-force. Almost every artist draws inspiration from other creative work. That’s how creativity works. Can you imagine how stifling it would be for every artist to have to document and “pay for” every piece of art they’ve ever seen??

joshspankit · 3 years ago
As well, this:

https://news.ycombinator.com/item?id=34242124

If AI can produce patentable and copyrightable proteins, the first pharmaceutical company to press a button will own all possible proteins. And then all possible permutations of every undiscovered drug.