wrs · 3 years ago
“…modify[ing] its AI model so that it traces attribution and gives credit to the original authors of the code, adding the associated copyright notices and license terms in the process…Biderman says is technologically feasible.”

Is it really feasible? What does “traces attribution” even mean here? It’s not emitting “code”; it’s emitting individual tokens, each of which appears throughout the input corpus. The “code” is the arrangement of those tokens, but that arrangement is determined by the weighting of the whole network, so what exactly can be traced?

Can someone who understands generative ML better than me weigh in on this?

dahart · 3 years ago
Why wouldn’t it be feasible? (Maybe this depends on what you mean by ‘feasible’.) There’s no technical reason you can’t back-track the weights and make a list of which tokens from which training data were sampled. The list might be long, it could be impractical, but that has little bearing on whether it’s technically possible, right?

The problem here happens when the same source is sampled for many tokens in a row because it’s the only match for the context. It could also happen that many tokens in a row each have a long list of sources, but when put together have a subset of sources that appear in every token’s list. That means that someone’s input is being repeated verbatim, even if the network wasn’t trying to reproduce a single source. We could prune the list of attribution sources at the expense of compute by running largest common subset algorithms, which might be sufficient for attribution tracing?
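
To make that pruning step concrete, here is a minimal sketch of one way to read it, assuming we already had a per-token set of candidate sources (which is the hard, unsolved part); the function name and run-length threshold are made up for illustration:

    from typing import Dict, List, Set

    def common_source_runs(token_sources: List[Set[str]], min_run: int = 8) -> List[Dict]:
        """For each output token we have a set of candidate source files.
        Report runs of consecutive tokens that all share at least one source;
        a long run with a non-empty intersection suggests verbatim copying."""
        runs = []
        start, common = 0, (set(token_sources[0]) if token_sources else set())
        for i in range(1, len(token_sources)):
            narrowed = common & token_sources[i]
            if narrowed:
                common = narrowed
                continue
            if i - start >= min_run:
                runs.append({"start": start, "end": i, "sources": common})
            start, common = i, set(token_sources[i])
        if token_sources and len(token_sources) - start >= min_run:
            runs.append({"start": start, "end": len(token_sources), "sources": common})
        return runs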

It feels like this whole question might hinge on Fair Use. The network is copying other people’s code one token at a time. We (society & copyright law) all tend to agree that’s fine when it’s a single token out of context, and we all tend to agree it’s not fine when the whole output program matches any single input source. The question naturally becomes, “where’s the line, how many tokens in a row from a single source should be allowed?”

wrs · 3 years ago
The token “if” appears in a heck of a lot of inputs, and a heck of a lot of outputs. What you’re describing seems like basically building an inverted index of the input corpus and then doing a search for the output. If the answer comes back with a high relevance then you consider it “traced”. I wonder what the size of that inverted index would be compared to the size of the generative model.
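
For what it's worth, a toy version of that inverted-index idea might look like this, with word-level n-grams standing in for model tokens (an assumption for readability); the index-size question remains open:

    from collections import defaultdict
    from typing import Dict, Set, Tuple

    def build_ngram_index(corpus: Dict[str, str], n: int = 8) -> Dict[Tuple[str, ...], Set[str]]:
        """Map every n-gram of whitespace-separated tokens to the set of files it occurs in."""
        index: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
        for path, text in corpus.items():
            tokens = text.split()
            for i in range(len(tokens) - n + 1):
                index[tuple(tokens[i:i + n])].add(path)
        return index

    def trace_output(output: str, index, n: int = 8) -> Dict[str, int]:
        """Count how many n-grams of the generated output also appear in each corpus file."""
        hits: Dict[str, int] = defaultdict(int)
        tokens = output.split()
        for i in range(len(tokens) - n + 1):
            for path in index.get(tuple(tokens[i:i + n]), ()):
                hits[path] += 1
        return dict(sorted(hits.items(), key=lambda kv: -kv[1]))

At n=8 a lone "if" matches nothing; only longer shared stretches show up. Whether such an index ends up larger than the model itself is exactly the question above.
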
c7b · 3 years ago
> There’s no technical reason you can’t back-track the weights and make a list of which tokens from which training data were sampled.

The weights map inputs to outputs, not training samples to outputs. They are adjusted during the training phase and then kept fixed (until there's a new iteration of the model; and let's ignore online learning for simplicity's sake now, the major models we're talking about here don't use it afaik). From the perspective of a user who inputs prompts and receives answers, the weights are constant. We could create some score indicating how much each training sample contributed to the weights overall, but then that score would be the same for all prompts.

It's really not clear to me what you mean by 'sampling from training data' for a specific prompt. The best I could imagine would be creating another metric measuring the distance between the prompt and all training samples, and then somehow combining that with the first score to get an overall contribution metric for this prompt. But that would be a) quite a bit of guesswork with a lot of modelling freedom and b) quite different from what you described.

belorn · 3 years ago
Fair use is a bit more context dependent, especially when applied to other forms of media. Unstable Diffusion is a big example where many seem to feel that no fair use should be allowed, regardless of how small a token the AI takes. It also depends somewhat on the original work and how it fits into society at large.

Having fair use be the pillar that all AI training stands on is going to take a while.

nonrandomstring · 3 years ago
The judge/law will say it comes down to intent.

Here's my take from an audio DSP viewpoint.

A composer listens to 10,000 hours of music. One day she writes an original piece based on the annealed parameters (temporal, spectral, intensity, pitch, sequence...) of a million artists. However it sounds like another specific artist who sues.

It is not a cover, a remix, homage, or even forged in the genre... it's just accidentally a bit too like it. (compare: Banana Splits vs. Bob Marley - and - Huey Lewis vs. Ray Parker Jr. - which completely misses the real impact of "Pop Muzik" by M)

The question is, was she exposed to influencing materials incorporated without intent, or was it plagiarism (intent to reproduce a derivative, etc.)?

By contrast I can take a piece of music, break it down by analysis into melodies, chords, timings, and use FFT to extract the precise spectrums of instruments, feed those to a resynthesis engine that finds new synthesiser parameters to create an exact sound-alike and then deliberately recreate a piece "In the near style of artist X". With a little musical processing I can change the key, inversions, re-template the rhythm to a new swing... always pushing the derived piece into new territory until eventually it's barely recognisable.

Nonetheless, in the second case I have clearly intended to steal someone's idea and "make it my own" by automatable transformation.

In the former case it seems to be a "genuine labour". (whatever that means in 2023)

The genuine artist intends to make something through intellectual labour.

Maybe that's the real question. I mean, about "labour". If the cost of the labour tends to zero, does it really matter?

janalsncm · 3 years ago
The model isn’t sampling from training data during inference. It’s sampling from a latent space created during training, which used all of the training data to create it.

Fundamentally, the presence of a next token in some of the samples in the corpus is just as informative as the absence of that token from other samples. You can’t cite one without the other.

If you wanted to give attribution you’d need to list the entire training corpus.

johnfn · 3 years ago
The failure mode I see here, which seems quite likely, is that attribution would almost always list tens of thousands of source files, or more. That doesn’t seem particularly useful or meaningful.
falcolas · 3 years ago
> It’s not emitting “code”, it’s emitting individual tokens ... The “code” is the arrangement of those tokens, but that is determined by the weighting of the whole network

This theory of operation is not borne out in reality. It's been clearly displayed that these tools are emitting verbatim copies of existing code (and its comments) from their training input.

It's even being seen in image generation, where NatGeo cover images are reproduced in their entirety or where stock photo watermarks are emitted on finished images.

And so, what can be traced back to individual sources? Quite a bit it would seem.

sillysaurusx · 3 years ago
Well yes, if you’re asked to memorize someone’s code, you’ll get surprisingly far too. The fact that models can do this isn’t evidence of anything. It’s a capability (or “tendency” if you overfit too much).

I think it’s pretty obvious that if you train a model to reproduce existing work in its entirety, it fails the “sufficiently transformative” test and thus loses legal protection.

But there’s nothing stopping you from re-coding existing implementations of GPL’ed code. I used to do it. And your new code has your own chosen license, even if your ideas came from someone else. Are you sure the same logic shouldn’t apply to models?

safety1st · 3 years ago
Even if that theory of operation were true, it seems like an implementation detail that would be largely irrelevant in a court of law. Infringement is infringement regardless of how complicated your method for committing it was. If your AI reproduces a copyrighted work verbatim without permission, surely you are committing infringement? If you would be infringing by doing it via copy and paste, I would think a court would be likely to find that you infringed when you used an AI to do it. It does not matter that the algorithm on the backend is fancy...
rymate1234 · 3 years ago
> It's been clearly displayed that these tools are emitting verbatim copies of existing code (and its comments) in their input.

Which makes sense when you consider that the code being reproduced verbatim is usually library functions that developers copy and paste into their projects, comments and all. If you prompt the AI with the header of a function that has been copied and pasted often, the weightings will in that instance be heavily skewed towards reproducing that function.

wrs · 3 years ago
The output may be recognizable by humans as being “the same”, but an ML classifier can look at what is “obviously” a cat and classify it as a turtle. So while your statement is accurate, it doesn’t explain how “traceability” could really work—what is the actual mechanism of “tracing”?
mcintyre1994 · 3 years ago
On your image generation point, something potentially interesting I saw on twitter/Reddit was about artists uploading “no to AI” images to artstation and then that logo almost appearing to vandalise AI generated art. I’m not sure if it’s a real thing or just people on twitter and stuff who don’t like AI art trolling though. Mostly because I’m not sure how up to date some of these models are being kept.
forrestthewoods · 3 years ago
> It's even being seen in image generation, where NatGeo cover images are reproduced in their entirety or where stock photo watermarks are emitted on finished images.

Can you cite sources? I’ve heard this claim repeatedly but have yet to see a good example.

ShamelessC · 3 years ago
The quote does not make that assertion…
Rekksu · 3 years ago
Do you have a source for this claim? This isn't really how generative models work
ProlificInquiry · 3 years ago
This quote seems to fundamentally misunderstand what transformers are doing at all. Technically I suppose you could save all gradient updates from every input token, and do some weighted averaging to show what inputs affected the particular output the most, but saving all those gradient updates would be unimaginably space consuming. "Feasible" is doing a lot of work there.

It's very hard for people to get away from the idea that GPT is "copying" something, but that's not what it's doing. The reality is, to get the exact artifact which produced the code in question, you need "Call me Ishmael" from Moby Dick just as much as the Linux kernel source.

amelius · 3 years ago
> The reality is, to get the exact artifact which produced the code in question, you need "Call me Ishmael" from Moby Dick just as much as the Linux kernel source.

Not always. Sometimes it just copies code without modification.

It never tells you when it does that, though. So to be on the safe side, better assume that it always does.

janalsncm · 3 years ago
Even then, the presence of a next token is just as informative as the absence of another during training. That information also gets backpropagated. And during inference, good luck identifying which weights were responsible for a given next token (assuming we’re using greedy decoding, don’t even get started on beam search) let alone which samples contributed to that weight (hint: they all did).
Filligree · 3 years ago
It sounds like nonsense. The most plausible solution would be to provide credit to every single author whose code was used in the original training set; of course, that would run into gigabytes just for the credits.
logicallee · 3 years ago
But GPL requires the resulting derivative code to be open-sourced under GPL, not just attributed. So if you did that then copilot could only emit code under a GPL license. Maybe that is okay, but it is not what copilot is trying to do at the moment. (And not everyone wants to open-source all their code under GPL.)
varjag · 3 years ago
And…? I think my last Xcode update was 20+ GB.

hackinthebochs · 3 years ago
In the ideal case the next token is determined by the local context (the prefix string) and the entire corpus of trained code. In this case the prefix string has not been seen before and so the generator must do some interpretation/extrapolation to determine the likely continuation. But in some cases, perhaps many cases, the prefix string has been seen before, or is similar enough to what has been seen before, that the best continuation is just to spit out the similar string in the training corpus. Presumably such cases can be detected due to specific patterns of activation in the network and attributions can be captured/applied.

One dumb way to do this would be to include self-attributions directly in the stream of training data. So in the cases where the best continuation is to just transcribe the training data, the attribution is included in the data itself.
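
A toy version of that dumb way, assuming a plain-text training pipeline; the marker format, file path, and licence string are invented for illustration:

    def annotate_training_example(source_path: str, license_id: str, code: str) -> str:
        """Prepend a machine-readable attribution marker to a training example, so that
        if the model transcribes this region nearly verbatim it has at least had the
        chance to learn a strong association with the marker as well."""
        return f"# ATTRIBUTION: source={source_path} license={license_id}\n{code}"

    corpus = [
        ("github.com/example/libfoo/foo.c", "GPL-2.0",
         "int add(int a, int b) { return a + b; }"),
    ]
    training_stream = "\n\n".join(
        annotate_training_example(path, lic, text) for path, lic, text in corpus
    )

Whether the model would actually reproduce the marker along with the code, rather than dropping it, is an empirical question.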

wrs · 3 years ago
That “presumably” is exactly what I’m trying to get past. How would it actually work? Once the input corpus has been chewed up and reduced to tokens, is there actually a representation of “the similar string in the corpus” any more?

I don’t think people are even clear on the problem statement here. If I fed in three very similar functions from different sources, and I got a fourth, also very similar, output, what is the “traced attribution” supposed to be?

dec0dedab0de · 3 years ago
Honestly, I would be fine with verbatim chunks of code if it could guarantee a compatible license and handle the copyright notices properly.

That always seemed like a problem with open source, you should theoretically be able to copy bits of code from hundreds of projects to make a new one, but keeping track of the licenses makes it too much of a pain. So the closest we really see is people vendorizing libraries.

beecafe · 3 years ago
You can just find the most matching snippets in the training data using an embedding model.
visarga · 3 years ago
They probably mean using a code search engine to check all snippets. The simplest thing would be an n-gram filter. A more advanced approach would use a code similarity neural net. It's not principled attribution, just locating the most similar example in the training set.
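
As a sketch of what that similarity search might look like (the embed function stands in for whatever code-embedding or code-similarity model is used, which is an assumption here; note this ranks by similarity and does not establish causation):

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def nearest_training_snippets(output_code: str, snippets: list[str], embed, top_k: int = 5):
        """Rank training snippets by embedding similarity to the generated output.
        This finds the most *similar* examples, not the examples that *caused* the output."""
        query = embed(output_code)
        scored = [(cosine(query, embed(s)), s) for s in snippets]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]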
nonrandomstring · 3 years ago
> It's not principled attribution, just locating the most similar example in the training set.

This is a profoundly important distinction.

Back-tracing data to contributory training examples is a genuine "influenced by" relation. Picking the nearest neighbour to a given result (even if it's an exact copy!) cannot say anything useful with respect to origins. And given that there will always be some proximate neighbour, it's really a "misattribution machine".

This is a bit like how our broken patent system grants or denies ownership of a design based on similarity to extant art, regardless of actual originality.

godelski · 3 years ago
So there are a few ways this can actually be done, with different levels of accuracy/precision. But it is going to be complicated no matter what.

The easy thing to do is to compare outputs to inputs. This isn't technically hard (i.e. cosine similarity) but it is computationally difficult (i.e. cosine similarity of output to entire input). This would give us some weightings that show similarity. But this doesn't really tell us attribution, rather more a correlation. These are statistical models so there's reason to believe that this is okay.

Then there's inversion. This is processing data backwards through the model. The complexity differs depending on what type of model you're working with. GANs aren't great, diffusion models are okay, normalizing flows are trivial (but good luck generating good images from NFs). If we can invert the model it is much easier to investigate and probe for contribution by looking at the distance of the latent generative variable to the location of the latent trained data. Basically you're looking at how different information is contributing to the overall output. This can also be done at every level in the network. Obviously this gets both technically and computationally challenging.

Another method would be using dataset reconstruction (this is outside my wheelhouse fwiw). This is where you try to recover the dataset from the final trained network. This too is complicated but there's plenty of papers showing progress in this space (lots of interest from privacy groups).

(TLDR-ish) There are other methods too. But basically what is being said is that there are ways to denote what and how much of the training data contributed to the output of the model, from which we can then measure how similar the output is to the inputs (i.e. copying).

nonrandomstring · 3 years ago
The answer is of course it's possible, so long as you have a couple of gigabytes spare for the acknowledgements page. Any construct worth attributing will have roots in billions of parameters.
ProlificInquiry · 3 years ago
It's far worse than that: to attribute a particular output to an input you don't just need the input data, you need the gradient updates that data caused in the training run. A couple hundred gigabytes of input tokens times 175 billion parameters equals... impossibility.
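
Back-of-envelope, with round numbers (the corpus size and precision are assumptions, just to show the order of magnitude):

    tokens = 300e9        # assume roughly 300 billion training tokens
    params = 175e9        # 175 billion parameters
    bytes_per_value = 2   # fp16
    total = tokens * params * bytes_per_value
    print(f"{total:.2e} bytes")   # ~1e+23 bytes, on the order of 100 zettabytes

So even before any cleverness, the naive bookkeeping is many orders of magnitude beyond any storage system.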
skibidibipiti · 3 years ago
Theoretically possible, technically challenging. The training data would need to be annotated with traces, so a person would have to figure out where all of the answers came from.
ajsnigrutin · 3 years ago
Maybe like a science paper... 10 lines of code, 50 citations :)
jacquesm · 3 years ago
To me AI code generators are the equivalent of crypto tumblers or mixers for digital coins. You can pretend all you want that the output is 'clean' but we all know it came from somewhere else and wasn't actually generated by the software, just endless little snippets that other people made.
godelski · 3 years ago
This is definitely not true. In theory a generator could reproduce a distribution of data that includes more than the data it was able to sample from. For example, you could create the distribution of all possible faces from a subset of sample faces. That means you could create the faces of people who do not yet exist. Now good luck getting that perfect generator (black swans, bias, etc.), but it would be naive to believe that all generators are just copying input data and mashing it together. At least, unless you believe that's also all that happens in the real world, but then what's the difference?
remus · 3 years ago
I don't think that's entirely true. When I've tested Copilot it also knows something about the context in which it's working, so it can use relevant variable names in its suggestions. To me that's a step beyond rote regurgitation of other code.
jacquesm · 3 years ago
That may be so, but the process is automated and copyright law has something to say about mechanical transformation.

As an example: if I take a piece of code you wrote and substitute all of the identifiers, that does not create an original work.

system2 · 3 years ago
ChatGPT output is very clean and precise, especially when the variables are provided in detail. I doubt you can trace it back if you make the prompt very elaborate.
jacquesm · 3 years ago
Why can't you trace it back? Remove all the input, fill the model with noise. Prompt all you want, it will produce just noise. Train the model on your chosen input text. Suddenly it starts to provide 'clean and precise text'. Absent an argument that we're looking at AGI, I see no other possible conclusion than that this is a mechanical transformation of two inputs (yours + the training data) into some output. What happens inside the box is irrelevant, that's just implementation details of the mechanism. If you build your mechanism in such a way that you can't trace it yourself, that doesn't mean that you get to claim the output as an original work.

scotty79 · 3 years ago
falcolas · 3 years ago
I propose that information naturally wants to degrade. Paper decomposes. Bits flip. File formats are replaced and lost. Storage mediums degrade. It's all an extension of the universe trending towards entropy.

It actually takes quite a bit of effort to store, then distribute information precisely and broadly. There's a lot of infrastructure, effort, and money involved, and still information degrades and disappears over time.

Any libre information exists because people have put effort into it. Sometimes a lot of effort.

nativeit · 3 years ago
I concede your larger point that information availability requires upkeep, but I don’t think the tendency for information to degrade is usually attributable to entropy. It’s probably true enough in some cases, but more often than not it’s going to be general interest from the population that determines the accessibility of information, and that crosses boundaries and regulatory efforts like copyright. The more intense the interest and/or awareness (generally speaking), the more likely the information will find its way into the public domain.

It’s why you can easily find literature written hundreds or thousands of years ago in multiple languages with little effort, but some memo stored in a tape archive from the 1970s is likely gone forever. “Bitrot” is more likely to be a function of diminishing interest rather than increasing entropy.

CodexArcana · 3 years ago
AI is inherently incompatible with the notion of private property as a result of the way it is "trained". It unintentionally exposes how much of our creativity, exploration, and innovation we've commodified.
scotty79 · 3 years ago
That's why it wants to be free. To survive by being copied and adapt by being freely built upon.
somrand0 · 3 years ago
considering that information is something static (which doesn't change) that describes (is about) how something else is changing

I think that indeed, it is the inherent nature of information to radiate itself, i.e. to share (to shine, to spread)

belorn · 3 years ago
When Copilot was released the copyright discussion focused exclusively on code, but now we see very similar discussions around images with Stable Diffusion and the sister project Unstable Diffusion. When Copilot does reach the courts there will be some indication of how courts view author consent when it comes to training material. After that we might then see court cases for each form of media (images, video, text, sound), and also in each of their unique contexts (books, online videos, porn, stock databases, and so on).
jacquesm · 3 years ago
In my opinion using all of the code on GitHub without respecting the licenses was a capital mistake. It should have been opt-in, maybe with some incentive but to just take it all without so much as a by-your-leave is not going to play well in court.
Permit · 3 years ago
In my opinion it was covered under the GitHub terms of service and is clearly transformative. I am optimistic that the courts will find it so and we can put these debates to rest similar to how we’ve done for web scraping.
notacoward · 3 years ago
It would have been awesome if Copilot had been written so that you tell it what your license is and it emits code trained only on code with compatible licenses. That would have created an incentive for people to publish their own code under more open licenses, instead of facilitating IP theft and thus creating a disincentive. That GH didn't choose this option is rather telling IMO.
rboyd · 3 years ago
I think it’s silly to pretend that human programmers are emitting a lot of code with a high degree of originality either. We’re all remixing some long-forgotten influential code lying deep in latent memory, just like the models.
danShumway · 3 years ago
Humans do remix stuff when we create, and copyright draws a kind of arbitrary line around when that becomes original and when it becomes derivative, mostly because if it didn't then it would be impossible for us to get new books, movies, and songs.

All copyright is a hack; it's not an ideological, internally consistent framework. It is a set of fallible legal rules we invented so that people would get paid for creating things we care about. That doesn't mean that comparisons aren't useful or that we can't extrapolate from the existing rules, but even where fair use is concerned, the actual justification isn't a logical one, it's: "if we didn't have this standard nobody would make anything." So the boundaries around fair use and what things fall into it are not being created purely based on logic or first principles.

The reason why we have copyright is because we want people to be paid for making creative work. The reason we have fair use and a standard of "originality" that treats coders/artists learning from other artists as acceptable is because if we didn't, the entire system would fall apart and nobody would be able to make new creative works.

Everything in copyright exists purely to get people to make more stuff in a sustainable way. It's an outcome-driven process.

----

More recently, there are a lot of people who argue that IP is a real, fundamental property right, but frankly, IP doesn't stand up at all if you think about it too hard. The justifications for why IP theft is theft can't be consistently generalized in a way that applies outside of the IP space. The standards for what does and doesn't count as creative aren't really consistent or based on a straightforward definition of creativity.

A lot of people would love to say that IP rights are just property rights, but... it's not all that convincing, and the history of copyright doesn't really indicate to me that the people building the laws thought of them that way.

And again, that doesn't mean that there's no consistency in copyright rules or that copyright rulings don't have implications beyond the original rulings. But it is almost always easier to think about copyright and almost always easier to understand why copyright laws are the way that they are if you approach copyright as a means to an end, and understand the existing laws not as an attempt to create an internally consistent system, but as a series of attempts throughout US history to achieve a consistent publicly beneficial outcome.

----

With that in mind, I suspect whether or not AI works count as remixing is largely going to be decided based on commercial interests, individual judges, and individual juries, possibly with input from US legislature.

"Everybody remixes" historically hasn't been the most useful argument during these debates? So I don't know how it's going to play out this time around. I vaguely suspect it's going to come down to whether or not individual pieces are recognizable? That's how we got wild copyright laws about some individual chord progressions being treated as derivative works in songs.

marstall · 3 years ago
by reducing credit, copilot reduces incentive to create and publish free code. biting the hand that feeds it.

exact same problem exists with GPT3 and others.

big tech slashing and burning, ruthlessly exploiting the least empowered people in the tech economy.

neat hack.

scottomyers · 3 years ago
I used to put most of my code on public repos in Github, Gitlab, etc. Since Copilot came out, I've moved most of my projects to a Git server on my home network.

The prospect of knowledge work being resold in this way feels icky. Perhaps I'd feel different if there was a simple, well-defined way to opt out of contributing to the training corpus. Opt-in would be even better.

alextheparrot · 3 years ago
Copilot also makes it easier to use open source framework code.

Being able to ask a model for help writing Spark, SQLAlchemy, Tokio, whatever, actually increases the usability of free code vs proprietary code.

Thews · 3 years ago
How does copilot reduce the incentive to publish free code?
chubbnix · 3 years ago
If it writes what would have been open source dependencies based on open source code, you no longer have to license that work or your own. Also, coders might not want to open source original work for fear they are just going to feed the beast that will ultimately kill them. Sort of a chilling effect if they cannot choose whether or not to contribute to that cause while also sharing code as open source.
dspillett · 3 years ago
Depends how you define free.

If an AI being inspired by GPL code that happened to be in its training data because someone other than the creator stuck it in a repo on GitHub is just fine, and if some code produced as a result is practically identical that is fine too, and the resulting code is not considered GPL any more, then the GPL and licences like it are worthless.

The only two designations that mean anything at that point are public domain and commercial. “Free” as in “Free Software”, rather than public domain, ends up meaning the same as public domain.

So there is no point releasing free (other than public domain) code, discouraging the act. If I want to control how my code is used at all, protected commercial release becomes the only option.

This of course suits the commercial interest behind copilot just fine and dandy…

---

But, if commercial code ended up in the training set the same should apply, because in terms of giving the right to use code, licences like the GPL and commercial licences are no different: the licence gives the right to use the code. If passing it through an AI gives that right, bypassing the licence, in one case then it should in the other too. I wonder if MS would be happy for copilot's own code to be in the training set and for me to produce and sell something based on the output of an AI trained with their code?!

---

I think the AI systems like copilot should be considered the same as us wet-ware naturally formed intelligence systems in that respect: if it produces something based on code under a particular licence then that something should be subject to the terms of the licence. Ignorance of the licence is no excuse. If the AI can not be made aware of the correct licence and attribution for the code, so it can include that with its suggestions based on it, then that code should not be in the training set.

For decades MS complained about open source code potentially creating this very situation, just with only non-artificial intelligences in the mix, now they are hoping no one can call them on that because it is convenient for them to ignore the issue.

marstall · 3 years ago
> by reducing credit
mellosouls · 3 years ago
I would have thought that in the vast majority of current AI-generated code we are talking about single blocks and functions that are just Intellisense on steroids that only a rather self-deluding coder would consider original enough or "theirs" to attribute authorship to.

There are no doubt grey areas and more serious cases as the technology improves and the generated content increases in length and functional value, but I hope we don't throw the productive baby out with the Luddite bathwater...

fipar · 3 years ago
I don’t disagree with you but I think there’s another angle to the discussion that involves the difference between the litigious capacity of corporations vs individuals to defend the simple kind of creations you describe, as I think corporations have defended (and even patented) very simple patterns before.

For what it’s worth, I wish the approach was as you describe but in all cases: make it impossible to defend attribution of basic building blocks regardless of how deep your pockets are.

vouaobrasil · 3 years ago
I think that kind of depends on how common the snippet is. If it's code for swapping two variables, well, everyone has written that and it's small. But what if it's a rarer thing, like a new kind of debayering algorithm, where a much larger "snippet" is taken from a novel algorithm? As far as I'm aware there are no rigorous studies showing the extent to which these models can duplicate longer stretches of code, or whether they can do something like this at all.
andybak · 3 years ago
Algorithms aren't copyrightable as far as my understanding goes and this is where it gets murky. The expression is copyrightable, the algorithm isn't (except when it is...)

My personal feeling is that we've gone too far down the road of assigning IP to code and we should be rolling it back. I'd hate for the current AI boom to trigger an extension of current IP law.

mellosouls · 3 years ago
Yes, of course, but that isn't the "vast majority" of use cases I pointed out; most people will presumably be using it for bog standard stuff in place of googling and thinking for a short while - let alone the people who didn't need the memory jog and are just selecting from an automatic prompt to save them typing it.
megous · 3 years ago
You can generate whole Linux kernel drivers with e.g. ChatGPT. And things will only get better, so your dismissal is not very forward-looking. The issue will need to be dealt with.
mellosouls · 3 years ago
You should reread my comment to see that I explicitly made exceptions for more advanced cases.
scarface74 · 3 years ago
I can only speak to my experience using ChatGPT. But it’s doing a lot more than copying and pasting code snippets it finds on the internet. It actually is translating English to code.

I had a “DevOps” project I was working on, creating a deployment process using AWS technologies (disclaimer: I work at AWS in Professional Services). I needed a few relatively simple Python scripts.

I first asked ChatGPT:

“given a JSON file like this [{“company”:”${company}”} replace the word surrounded by ${} with the equivalent environment variables using Python”.

It worked perfectly. But it hardcoded the input file.

“Modify the script to accept the input file using a command line argument -json-file using argparse”

That worked and used the “required=true” parameter.

“Instead of skipping replacement if an environment variable is not found, raise an exception”.
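
For context, roughly what the finished script would look like after those three prompts; this is my own reconstruction, not ChatGPT's actual output, with the -json-file flag as prompted:

    import argparse
    import json
    import os
    import re

    def replace_env_vars(obj):
        """Recursively replace ${VAR} placeholders in JSON string values with the
        corresponding environment variables, raising if a variable is not set."""
        if isinstance(obj, dict):
            return {k: replace_env_vars(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [replace_env_vars(v) for v in obj]
        if isinstance(obj, str):
            def lookup(match):
                name = match.group(1)
                if name not in os.environ:
                    raise KeyError(f"Environment variable {name!r} is not set")
                return os.environ[name]
            return re.sub(r"\$\{(\w+)\}", lookup, obj)
        return obj

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("-json-file", dest="json_file", required=True)
        args = parser.parse_args()
        with open(args.json_file) as f:
            data = json.load(f)
        print(json.dumps(replace_env_vars(data), indent=2))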

And it seems to understand the AWS SDK. I told it to convert a script I wrote to a CloudFormation custom resource using the cfnresponse module and it knew the correct Lambda event structure, the event format etc.

I have seen reports and have witnessed it making up functions occasionally.

But, I believe with the right prompts, it should be able to create simple CRUD scripts.

mellosouls · 3 years ago
I understand what it is doing, and how impressive it is; the question is whether or not the normal use case is for stuff that requires the originality of thought that requires attribution, and I can't see that being the case except in a very small minority of instances.
joshspankit · 3 years ago
I’m horrified.

I know we’re “just” talking about code here but decisions will be far-reaching and if we let the powers that be force AI-generated content to be copyright-attributed, two things are going to happen:

1. The biggest benefits of AI are going to be pushed back for decades and possibly indefinitely

2. Artists will be absolutely slaughtered as the same rules will come at them full-force. Almost every artist draws inspiration from other creative work. That’s how creativity works. Can you imagine how stifling it would be for every artist to have to document and “pay for” every piece of art they’ve ever seen??

joshspankit · 3 years ago
As well, this:

https://news.ycombinator.com/item?id=34242124

If AI can produce patentable and copyrightable proteins, the first pharmaceutical company to press a button will own all possible proteins. And then all possible permutations of every undiscovered drug.