calny · 7 days ago
The maintainer's response: https://github.com/chardet/chardet/issues/327#issuecomment-4...

The second part here is problematic, but fascinating: "I then started in an empty repository with no access to the old source tree, and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code." The problem: Claude was almost certainly trained on the original LGPL/GPL code, and that code is what it knows as the way to solve this problem. It's dubious whether Claude can ignore whatever imprints the original code left on its weights. If it COULD do that, that would be a pretty cool innovation in explainable AI. But AFAIK LLMs can't even reliably trace which training data influenced the output for a given query (see https://iftenney.github.io/projects/tda/), or fully unlearn a piece of training data.

Is anyone working on this? I'd be very interested to discuss.

Some background - I'm a developer & IP lawyer - my undergrad thesis was "Copyright in the Digital Age" and discussed copyleft & FOSS. I've been litigating in federal court since 2010 and training AI models since 2019, and am working on an AI-for-litigation platform. These are evolving issues in US courts.

BTW if you're on enterprise or a paid API plan, Anthropic indemnifies you if its outputs violate copyright. But if you're on free/pro/max, the terms state that YOU agree to indemnify THEM for copyright violation claims.[0]

[0] https://www.anthropic.com/legal/consumer-terms - see para. 11 ("YOU AGREE TO INDEMNIFY AND HOLD HARMLESS THE ANTHROPIC PARTIES FROM AND AGAINST ANY AND ALL LIABILITIES, CLAIMS, DAMAGES, EXPENSES (INCLUDING REASONABLE ATTORNEYS’ FEES AND COSTS), AND OTHER LOSSES ARISING OUT OF … YOUR ACCESS TO, USE OF, OR ALLEGED USE OF THE SERVICES ….")

DrammBA · 7 days ago
Also the maintainer's ground-up rewrite argument is very flimsy, given that they used chardet's test data and freely admit to:

> I've been the primary maintainer and contributor to this project for >12 years

> I have had extensive exposure to the original codebase: I've been maintaining it for over a decade. A traditional clean-room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here.

> I reviewed, tested, and iterated on every piece of the result using Claude.

> I was deeply involved in designing, reviewing, and iterating on every aspect of it.

Lerc · 7 days ago
There was a paper that proposed a content-based hashing mask for training.

The idea is you have some window size, maybe 32 tokens. Hash the window into a seed for a pseudorandom number generator. Generate a random number in the range 0..1 for each token in the window. Compare each number against a threshold, and don't count the loss for any token whose value is higher than the threshold.

It learns well enough because you still get the gist of something's meaning when the occasional word is missing, especially if you are learning the same thing expressed many ways.

It can't learn verbatim, however. Anything it fills in will be semantically similar, but different enough to push any direct quoting onto another path after just a few words.
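The masking scheme described in the comment can be sketched in a few lines. This is a minimal illustration only — the function name, the SHA-256 hashing, and the default threshold are my own choices, not necessarily the paper's:

```python
import hashlib

def loss_mask(tokens, window=32, threshold=0.75):
    """For each position, hash the trailing window of tokens into a seed,
    derive a pseudorandom value in [0, 1), and drop the loss for tokens
    whose value exceeds the threshold. The mask is deterministic: the same
    text always masks the same positions, so a verbatim passage can never
    be fully memorized no matter how often it recurs in the corpus."""
    mask = []
    for i in range(len(tokens)):
        context = tokens[max(0, i - window + 1): i + 1]
        digest = hashlib.sha256(repr(context).encode()).digest()
        # First 8 bytes of the digest -> roughly uniform value in [0, 1)
        value = int.from_bytes(digest[:8], "big") / 2**64
        mask.append(value <= threshold)  # True = keep this token's loss term
    return mask
```

A training step would then multiply each token's cross-entropy term by its mask entry, so roughly `threshold` of the tokens contribute to the gradient and the rest are skipped.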

fc417fc802 · 7 days ago
> you get the gist of reading the meaning of something when the occasional word is missing,

I think it's more subtle than that. IIUC the tokens were all present for the purpose of computing the output and the score is based on the output. It's only the weight update where some of the tokens get ignored. So the learning is lossy but the inference driving the learning is not.

Rather than a book that's missing words it's more like a person with a minor learning disability that prevents him from recalling anything perfectly.

However it occurs to me that data augmentation could easily break the scheme if care isn't taken.
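The distinction drawn here — every token still feeds the forward pass, only some loss terms are dropped — might look like this in a training loop (illustrative names, not any specific framework's API):

```python
def masked_mean_loss(per_token_losses, keep_mask):
    """Average only the kept tokens' losses. The per-token losses themselves
    are computed from the full, unmasked context (the forward pass sees
    everything); the mask changes only which terms drive the weight update."""
    kept = [loss for loss, keep in zip(per_token_losses, keep_mask) if keep]
    # Guard against a stretch where every token happened to be masked out
    return sum(kept) / len(kept) if kept else 0.0
```

So the learning signal is lossy, but the predictions that generate it are not — which matches the "lossy learning, lossless inference" reading above.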

calny · 7 days ago
Thanks! Appreciate the response and will look into this
oofbey · 7 days ago
The difference in indemnification based on which plan you’re on is super important. Thanks for pointing that out - never would have thought to look.
amelius · 7 days ago
Is this clause even legally valid?

How can the user know if the LLM produces anything that violates copyright?

(Of course they shouldn't have trained it on infringing content in the first place, and perhaps used a different model for enterprise, etc.)

layer8 · 7 days ago
> Is anyone working on this?

There was recently https://news.ycombinator.com/item?id=47131225.

calny · 7 days ago
Thanks! I missed that. The attribution by training data source category (arxiv vs wikipedia vs nemotron etc.) is an interesting approach.
popalchemist · 7 days ago
Copyright does not cover ideas, only specific executions of ideas. So unless it's a line-by-line copy (unlikely), there is no recourse to sue over a re-execution/reimplementation of an idea.
wlonkly · 7 days ago
Where do derivative works fit into your model of copyright?

Deleted Comment

aeon_ai · 7 days ago
You've likely paid attention to the litigation here. Regardless of what remains to be litigated, the training in and of itself has already been deemed fair use (and transformative) by Alsup.

Further, you know that ideas are not protected by copyright. The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.

If it were the case that the LLM ingested the code and regurgitated it (as would be the premise of highlighting the training data provenance), that similarity would be much higher. That is not the case.

calny · 7 days ago
You're right, I've followed the litigation closely. I've advocated for years that "training is fair use" and I'm generally an anti-IP hawk who DEFENDS copyright/trademark cases. Only recently have I started to concede the issue might have more nuance than "all training is fair use, hard stop." And I still think Judge Alsup got it right.

That said, even if model training is fair use, model output can still be infringing. There would be a strong case, for example, if the end user guides the LLM to create works in a way that copies another work or mimics an author or artist's style. This case clearly isn't that. On the similarity at issue here, I haven't personally compared. I hope you're right.

overfeed · 7 days ago
> The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.

Can I use one AI agent to write detailed tests based on disassembled Windows, and another to write code that passes those same function-level tests? If so, I'm about to relicense Windows 11 - eat my shorts, ReactOS!

danlitt · 8 days ago
I am pretty sure this article is predicated on a misunderstanding of what a "clean room" implementation means. It does not mean "as long as you never read the original code, whatever you write is yours". If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy. Traditionally, a human-driven clean room implementation would have a vanishingly small probability of matching the original codebase enough to be considered a copy. With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly). Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

What the chardet maintainers have done here is legally very irresponsible. There is no easy way to guarantee that their code is actually MIT and not LGPL without auditing the entire codebase. Any downstream user of the library is at risk of the license switching from underneath them. Ideally, this would burn their reputation as responsible maintainers, and result in someone else taking over the project. In reality, probably it will remain MIT for a couple of years and then suddenly there will be a "supply chain issue" like there was for mimemagic a few years ago.

femto · 8 days ago
> If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy.

That's not what the law says [1]. If two people happen to independently create the same thing they each have their own copyright.

If it's highly improbable that two works are independent (eg. the gcc code base), the first author would probably go to court claiming copying, but their case would still fail if the second author could show that their work was independent, no matter how improbable.

[1] https://lawhandbook.sa.gov.au/ch11s13.php?lscsa_prod%5Bpage%...

jerf · 7 days ago
It is true that if two people happen to independently create the same thing, they each have their own copyright.

It is also true that in all the cases that I know about where that has occurred the courts have taken a very, very, very close look at the situation and taken extensive evidence to convince the court that there really wasn't any copying. It was anything but a "get out of jail free" card; it in fact was difficult and expensive, in proportion to the size of the works under question, to prove to the court's satisfaction that the two things really were independent. Moreover, in all the cases I know about, they weren't actually identical, just, really really close.

No rational court could possibly ever come to that conclusion if someone claimed a line-by-line copy of gcc was written by them and they must have independently come up with it. The probability of that is one out of ten to the "doesn't even remotely fit in this universe so forget about it". The bar to overcoming that is simply impossibly high, unlike two songs that happen to have similar harmonies and melodies, given the exponentially more constrained space of "simple song" as compared to a compiler suite.

danlitt · 7 days ago
Thank you for providing a reference! I certainly admit that "very similar photographs are not copies" as the reference states. And certainly physical copying qualifies as copying in the sense of copyright. However I still think copying can happen even if you never have access to a copy.

I suppose a different way of stating my position is that some activities that don't look like copying are in fact copying. For instance it would not be required to find a literal copy of the GCC codebase inside of the LLM somehow, in order for the produced work to be a copy. Likewise if I specify that "Harry Potter and the Philosopher's Stone is the text file with hash 165hdm655g7wps576n3mra3880v2yzc5hh5cif1x9mckm2xaf5g4" and then someone else uses a computer to brute force find a hash collision, I suspect this would still be considered a copy.

I think there is a substantial risk that the automatic translation done in this case is, at least in part, copying in the above sense.

brians · 8 days ago
I do not agree with your interpretation of copyright law. It bans copies: there has to be information flow from the original to the new work for it to be a "copy." Spontaneous generation of the same content is often taken by the courts as a sign that it's purely functional, derived from requirements by mathematical laws.

Patent law is different and doesn't rely on information flow in the same way.

kevin_thibedeau · 7 days ago
Derivative works can also run afoul of copyright. An LLM trained on a corpus of copyrighted code is creating derivative works no matter how obscure the process is.
danlitt · 7 days ago
I disagree that information flow is required. Do you have a reference for that? Certainly it is an important consideration. But consider all the real literary works contained in the infinite library of babel.[1] Are they original works just because no copy was used to produce them?

[1]: https://libraryofbabel.info/

BoredPositron · 8 days ago
Well discovery might be a fun exercise to see if the code is in the dataset of the llm.
petercooper · 8 days ago
> The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation

I know you were simplifying, and not to take away from your well-made broader point, but an API-derived implementation can still result in problems, as in Google vs Oracle [1]. The Supreme Court found in favor of Google (6-2) along "fair use" lines, but the case dodged setting any precedent on the nature of API copyrightability. I'm unaware if future cases have set any precedent yet, but it just came to mind.

[1]: https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

lokar · 7 days ago
Yeah, a cleanroom re-write, or even "just" a copy of the API spec is something to raise as a defense during a trial (along with all other evidence), it's not a categorical exemption from the law.

Also, I find it important that here the API is really minimal (compared to the Java std lib), the real value of the library is in the internal detection logic.

danlitt · 7 days ago
This is exactly what I had in mind when I said I was simplifying :) it is a valid point.
zabzonk · 8 days ago
> It does not mean "as long as you never read the original code, whatever you write is yours"

I think there is precedent that says exactly this - for example the BIOS rewrites for the IBM PC from companies like Phoenix. And it would be trivial to instruct an LLM to prefer to use (say, in assembler) register C over register B wherever possible, resulting in different code.

danlitt · 7 days ago
As long as you never read the original code, it is very likely that whatever you write is yours. So I would not be surprised to read judges indicating in this direction. But I would be a little surprised to find out this was an actual part of the test, rather than an indication that the work was considered to have been copied. There are, for instance, lots of ways of reproducing copyrighted work without using a copy directly, but naive methods like generating random pieces of text are very time consuming, so there is not much precedent around them. LLMs are much more efficient at it!
bandrami · 8 days ago
Different but still derivative
wareya · 7 days ago
> If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy.

If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright. You're going to get sued and lose, but platonically, you're in the clear. If it's merely somewhat similar, then you're probably in the clear in practice too: it gets very easy very fast to argue that the similarities are structural consequences of the uncopyrightable parts of the functionality.

> The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly).

This is almost the opposite of correct. A clean room implementation's dirty phase produces a specification that is allowed to include uncopyrightable implementation details. It is NOT defined as producing an API, and if you produce an API spec that matches the original too closely, you might have just dirtied your process by including copyrightable parts of the shape of the API in the spec. Google vs Oracle made this more annoying than it used to be.

> Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

If you follow CRRE, it's not a copy, full stop, even if it's somehow 1:1 identical. It's going to be JUDGED as a copy, because substantial similarity for nontrivial amounts of code means that you almost certainly stepped outside of the clean room process and it no longer functions as a defense, but if you did follow CRRE, then it's platonically not a copy.

> What the chardet maintainers have done here is legally very irresponsible.

I agree with this, but it's probably not as dramatic as you think it is. There was an issue with a free Japanese font/typeface a decade or two ago that was accused of mechanically (rather than manually) copying the outlines of a commercial Japanese font. Typeface outlines aren't copyrightable in the US or Japan, but they are in some parts of Europe, and the exact structure of a given font is copyrightable everywhere (e.g. the vector data or bitmap field for a digital typeface, as opposed to the idea of its shape). What was the outcome of this problem? Distros stopped shipping the font and replaced it with something vaguely compatible. Was the font actually infringing? Probably not, but better safe than sorry.

danlitt · 7 days ago
> If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright.

I don't believe this, and I doubt that the sense of copying in copyright law is so literal. For instance, if I generated the exact text of a novel by looking for hash collisions, or by producing random strings of letters, or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it. There need not have been any copy anywhere near me for this to happen. Whether it is likely or not depends on the technique used - naive techniques make this very unlikely, but techniques can improve.

It is also true that similarity does not imply copying - if you and I take an identical photograph of the same skyline, I have not copied you and you have not copied me, we have just fixed the same intangible scene into a medium. The true subjective test for copying is probably quite nuanced, I am not sure whether it is triggered in this case, but I don't think "clean room LLMs" are a panacea either.

> dirty phase produces a specification ... it is NOT defined as producing an API

This does not really sound like "the opposite of correct". APIs are usually not copyrightable, the truth is of course more complicated, if you are happy to replace "API" with "uncopyrightable specification" then we can probably agree and move on.

> it's probably not as dramatic as you think it is

In reality I am very cynical and think nothing will come of this, even if there are verbatim snippets in the produced code. People don't really care very much, and copyright cases that aren't predicated on millions of dollars do not survive the court system very long.

foooorsyth · 8 days ago
>The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation

This is incorrect and thinking this can get you sued

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

amiga386 · 7 days ago
Whether you get sued is more on the plaintiff than you.

Per your link, the Supreme Court's thinking on "structure, sequence and organization" (Oracle's argument why Google shouldn't even be allowed to faithfully produce a clean-room implementation of an API) has changed since the 1980s out of concern that using it to judge copyright infringement risks handing copyright holders a copyright-length monopoly over how to do a thing:

> enthusiasm for protection of "structure, sequence and organization" peaked in the 1980s [..] This trend [away from "SS&O"] has been driven by fidelity to Section 102(b) and recognition of the danger of conferring a monopoly by copyright over what Congress expressly warned should be conferred only by patent

The Supreme Court specifically recognised Google's need to copy the structure, sequence and organization of Java APIs in order to produce a cleanroom Android runtime library that implemented Java APIs so that that existing Java software could work correctly with it.

Similarly, see Oracle v. Rimini Street (https://cdn.ca9.uscourts.gov/datastore/opinions/2024/12/16/2...) where Rimini Street has been producing updates that work with Oracle's products, and Oracle claimed this made them derivative works. The Court of Appeals decided that no, the fact A is written to interoperate with B does not necessarily make A a derivative work of B.

danlitt · 7 days ago
I did not expect people to take "API" so literally. This point is what I was referring to when I said "I am simplifying slightly". The point is that a clean room impl begins from a specification of what the software does, and that the new implementation is purported to be derived only from this. What I am trying to say is that "not looking at the implementation" is not exactly the point of the test - that is a rule of thumb, which works quite well for avoiding copyright infringement, but only when humans do it.
umvi · 7 days ago
You can be sued for any reason if a company feels threatened (see: Oracle v Google)
thousand_nights · 8 days ago
the whole concept of a "clean room" implementation sounds completely absurd.

a bunch of people get together, rewrite something while making a pinky promise not to look at the original source code

guaranteeing the premise is basically impossible, it sounds like some legal jester dance done to entertain the already absurd existing copyright laws

myrmidon · 7 days ago
> it sounds like some legal jester dance done to entertain [...] copyright laws

Clean room implementations are a jester dance around the judiciary. The whole point is to avoid legal ambiguity.

You are not required to do this by law, you are doing this voluntarily to make potential legal arguments easier.

The alternative is going over the whole codebase in question and arguing basically line by line whether things are derivative or not in front of a judge (which is a lot of work for everyone involved, subjective, and uncertain!).

bandrami · 7 days ago
In the archetypal example IBM (or whoever it was) had to make sure the two engineering teams were never in the cafeteria together at the same time
dudeinhawaii · 8 days ago
It usually refers to situations without access to the source code.

I've always taken "clean room" to be the kind of manufacturing clean room (sealed/etc). You're given a device and told "make our version". You're allowed to look, poke, etc but you don't get the detailed plans/schematics/etc.

In software, you get the app or API and you can choose how to re-implement.

In open source, yes, it seems like a silly thing and hard to prove.

Forgeties79 · 8 days ago
Halt and Catch Fire did a pretty funny rendition of this song and dance
jen20 · 7 days ago
> What the chardet maintainers have done here is legally very irresponsible.

Perhaps the maintainer wants to force the issue?

> Any downstream user of the library is at risk of the license switching from underneath them.

Checking the license of the transitive closure of your dependencies is table stakes for using them.

fc417fc802 · 7 days ago
The problem is that the transitive closure isn't clear here. One of the entries is being claimed to be one thing but might in fact turn out to be another.
danlitt · 7 days ago
> Perhaps the maintainer wants to force the issue?

I doubt it, and I don't see any evidence that's what they're doing. There are probably better ways, if that's what they want.

> Checking the license of the transitive closure of your dependencies is table stakes for using them.

Checking the license of the transitive closure of your dependencies is only feasible when the library authors behave responsibly.

Deleted Comment

pmarreck · 8 days ago
> With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

I beg to differ. Please examine any of my recent codebases on github (same username); I have cleanroom-reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now, right here, so here we go...)

https://github.com/pmarreck?tab=repositories&type=source

I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec

Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.

With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.

ostacke · 8 days ago
But that's not really what danlitt said, right? They did not claim that it's impossible for an LLM to generate something different, merely that it's not a clean room implementation, since the LLM, one must assume, is trained on the code it's re-implementing.
airza · 8 days ago
By what means did you make sure your LLM was not trained with data from the original source code?
danlitt · 7 days ago
I only said the probability is higher, not that the probability is 1!
j45 · 7 days ago
This reminds me of a full rewrite.

When a developer reimplements code from scratch, working from understanding alone, the new implementation should generally be an improvement on the original source code, not merely its equal.

In today's world, letting LLMs replicate anything will produce merely average code as "good enough" and generally create equivalent or greater bloat unless well managed.

StilesCrisis · 7 days ago
The world is chock-full of rewrites that came out disastrously worse than the thing they intended to replace. One of Spolsky's most-quoted articles of all time was about this.

https://www.joelonsoftware.com/2000/04/06/things-you-should-...

> They did it by making the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch.

Dead Comment

dathinab · 8 days ago
the author speaks about code which is syntactically completely different but semantically does the same

i.e. a re-implementation

which can either

- be still derived work, i.e. seen as you just obfuscating a copyright violation

- be a new work doing the same

nothing prevents an AI from producing a spec based on an API, API documentation, and API usage/fuzzing, and then resetting the AI and using that spec to produce a rewrite

I mean, "doing the same" is NOT protected by copyright; you need patent law for that. And even patent law protects innovations/concepts, not the exact implementation details. Which means that even where software patents exist (theoretically(1)), most things done in software wouldn't be patentable (as they are just implementation details, not inventions).

(1): I say theoretically because there is a very long track record of patents being granted that really should never have been granted. This, combined with the high cost of invalidating patents, has caused a ton of economic damage.

jacquesm · 8 days ago
No, that depends on whether or not the AI work product rests on key contributions to its training set without which it would not be able to do the work; see my other comment. In that case it looks like 'a new work doing the same' but it is still a derived work.

Ted Nelson was years ahead of the future where we really needed his Xanadu to keep track of fractional copyright. Likely if we had such a mechanism, and AI authors respected it then we would be able to say that your work is derived from 3000 other original works and that you added 6 lines of new code.

nairboon · 8 days ago
That code is still LGPL, it doesn't matter what some release engineer writes in the release notes on Github. All original authors and copyright holders must have explicitly agreed to relicense under a different license, otherwise the code stays LGPL licensed.

Also, the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different from this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright through that transformation? Imagine the consequences for the US copyright industry if that were actually possible.

pocksuppet · 8 days ago
In the legal system there's no such thing as "code that is LGPL". It's not an xattr attached to the code.

There is an act of copying, and there is whether or not that copying was permitted under copyright law. If the author of the code said you can copy, then you can. If the original author didn't, but the author of a derivative work, who wasn't allowed to create a derivative work, told you you could copy it, then it's complicated.

And none of it's enforced except in lawsuits. If your work was copied without permission, you have to sue the person who did that, or else nothing happens to them.

pavlov · 8 days ago
If anything, the SCOTUS decision would seem to imply that generative AI transformations produce no additional creative contribution and therefore the original copyright holder has all rights to any derived AI works.

(IANAL)

dathinab · 8 days ago
that is a very good formulation of what I have been trying to say

but also probably not fully right

as far as I understand, they avoid deciding whether an AI can produce creative work by saying that neither the AI nor its owner/operator can claim copyright (which makes the output de-facto public domain)

this wouldn't change anything wrt. derived work still having the original authors copyright

but it could change things wrt. parts of the derived work which by themselves are not derived

bandrami · 7 days ago
That's a reasonable theory though it's stuck with the problem that any model will by its training be derivative of codebases that have incompatible licenses, and that in fact every single use of an LLM is therefore illegal (or at least tortious).
dathinab · 8 days ago
If it went through a full clean-room rewrite just using AI, then no, it's de-facto public domain (but it probably didn't go through one).

And if it is a completely new implementation with completely different internals, it could still be non-LGPL even if produced by a person with in-depth knowledge. Copyright only cares whether you "copied" something, not whether you had "knowledge" or whether it "behaves the same". So as long as it's distinct enough, it can still be legally fine. The "full clean room" requirement is about what is guaranteed to hold up in front of a court, not what might pass as non-derivative but with legal risk.

pornel · 8 days ago
Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Even copyright laws with provisions for machine learning were written when that meant tangential things like ranking algorithms or training of task-specific models that couldn't directly compete with all of their source material.

For code it also completely changes where the human-provided value is. Copyright protects specific expressions of an idea, but we can auto-generate the expressions now (and the LLM indirection messes up what "derived work" means). Protecting the ideas that guided the generation process is a much harder problem (we have patents for that and it's a mess).

It's also a strategic problem for GNU. GNU's goal isn't licensing per se, but giving users freedom to control their software. Licensing was just a clever tool that repurposed the copyright law to make the freedoms GNU wanted somewhat legally enforceable. When it's so easy to launder code's license now, it stops being an effective tool.

GNU's licensing strategy also depended on a scarcity of code (contribute to GCC, because writing a whole compiler from scratch is too hard). That hasn't worked well for a while due to permissive OSS already reducing scarcity, but gen AI is the final nail in the coffin.

pocksuppet · 8 days ago
It's not a problem. If you give a work to an AI and say "rewrite this", you created a derivative work. If you don't give a work to an AI and say "write a program that does (whatever the original code does)" then you didn't. During discovery the original author will get to see the rewriter's Claude logs and see which one it is. If the rewriter deleted their Claude logs during the lawsuit they go to jail. If the rewriter deleted their Claude logs before the lawsuit the court interprets which is more likely based on the evidence.
hennell · 7 days ago
But the AI has the work to derive from already. I just went to Gemini and said "make me a picture of a cartoon plumber for a game design". Based on your logic the image it made me of a tubby character with a red cap, blue dungarees, red top and a big bushy mustache is not a derivative work...

(interestingly, when I asked it to make him some friends it gave me more 'original' ideas, but when I asked it to give him a brother I could hear the big N's lawyers writing a letter already...)

buckle8017 · 7 days ago
Except Claude was for sure trained on the original work, and when asked to produce a new product that does the same thing it will just spit out a (near) copy.
empath75 · 7 days ago
> Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Copyright laws are predicated on the idea that valuable content is expensive and time consuming to create.

Ideas are not protected by copyright, expression of ideas is.

You can't legally copy a creative work, but you can describe the idea of the work to an AI and get a new expression of it in a fraction of the time it took for the original creator to express their idea.

The whole premise of copyright is that ideas aren't the hard part, the work of bringing that idea to fruition is, but that may no longer be true!

leecommamichael · 8 days ago
“Changing the equation” by boldly breaking the law.
Majromax · 7 days ago
> “Changing the equation” by boldly breaking the law.

Is it? I think the law is truly undeveloped when it comes to language models and their output.

As a purely human example, suppose I once long ago read through the source code of GCC. Does this mean that every compiler I write henceforth must be GPL-licensed, even if the code looks nothing like GCC code?

There's obviously some sliding scale. If I happen to commit lines that exactly replicate GCC then the presumption will be that I copied the work, even if the copying was unconscious. On the other hand, if I've learned from GCC and code with that knowledge, then there's no copyright-attaching copy going on.

We could analogize this to LLMs: instructions to copy a work would certainly be a copy, but an ostensibly independent replication would be a copy only if the work product had significant similarities to the original beyond the minimum necessary for function.

However, this is intuitively uncomfortable. Mechanical translation of a training corpus to model weights doesn't really feel like "learning," and an LLM can't even pinky-promise to not copy. It might still be the most reasonable legal outcome nonetheless.

mlinhares · 7 days ago
Its only breaking the law if you don't have enough money to pay the politicians.
ajross · 8 days ago
> GNU's goal isn't licensing per se, but giving users freedom to control their software.

I think that's maybe a misunderstanding. GNU wants everyone to be able to use their computers for the purposes they want, and software is the focus because software was the bottleneck. A world where software is free for anyone to create is a GNU utopia, not a problem.

Obviously the bigger problem for GNU isn't software, which was pretty nicely commoditized already by the FOSS-ate-the-world era of two decades ago; it's restricted hardware, something that AI doesn't (yet?) speak to.

satvikpendem · 7 days ago
Honestly, good. Copyright and IP law in general have been so twisted by corporations that only they benefit now, see Mickey Mouse laws by Disney for example, or patenting obvious things like Nintendo or even just patent trolling in general.
hamdingers · 7 days ago
The biggest recording artist in the world right now had to re-record her early albums because she didn't own the copyright, imagine how many artists don't get that big and never have that opportunity.

That individual artists are still defending this system is baffling to me.

Deleted Comment

kshri24 · 8 days ago
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

How would that work? We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not. IANAL but IMHO it is totally illegal as no permission was sought from authors of source code the models were trained on. So there is no way to just release the code created by a machine into public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered in the scope of "reverse engineering" and that is not specific only to humans. You can extend it to machines as well.

EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code. And a licensing model with original authors (all Github users who contributed code in some form) should be setup to be reimbursed by AI companies. In other words, a % of profits must flow back to community as a whole every time code-related tokens are generated. Even if everyone receives pennies it doesn't matter. That is fair. Also should extend to artists whose art was used for training.

kouteiheika · 8 days ago
> I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code.

That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.

There are research models out there which are trained on only permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks when compared to state-of-art.

But I guess the funniest consequence of the "model outputs are a derivative work of their training data" position would be that it'd essentially wipe out (or at the very least force a revert to a pre-AI-era commit) every open source project that may have included any AI-generated or AI-assisted code, which currently means pretty much every major open source project out there. And it would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.

progval · 8 days ago
> There are research models out there which are trained on only permissively licensed data

Models whose authors tried to train only on permissively licensed data.

For example https://huggingface.co/bigcode/starcoder2-15b tried to be a permissively licensed dataset, but it filtered only on repository-level license, not file-level. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was working, you would find it was trained on many files with a GPL header.
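The distinction the comment draws (repository-level vs file-level licensing) can be made concrete. Below is a minimal sketch of the file-level check such a dataset pipeline would need: scanning each file's header for copyleft markers instead of trusting the repository's declared license. The GPL phrase is the one quoted above; the marker list, function name, and sample files are hypothetical illustrations, not the actual StarCoder2 pipeline.

```python
# A repo may be MIT-licensed overall yet contain vendored files whose
# own headers carry a copyleft license. Repo-level filtering misses those.
GPL_MARKERS = [
    "under the terms of the gnu general public license",
    "gnu lesser general public license",
]

def file_level_license_ok(text: str) -> bool:
    """Reject a file whose header mentions a copyleft license,
    regardless of what the repository-level license says."""
    header = text[:2000].lower()  # license headers sit at the top of a file
    return not any(marker in header for marker in GPL_MARKERS)

# Hypothetical repo: declared license is MIT, but one vendored file
# carries a GPL header and gets filtered out at the file level.
files = {
    "util.py": "# MIT License\ndef f(): ...",
    "vendored.py": "# This program is distributed under the terms of the "
                   "GNU General Public License\ndef g(): ...",
}
kept = {name for name, text in files.items() if file_level_license_ok(text)}
```

Running this keeps only `util.py`; a repo-level filter keyed on the declared MIT license would have kept both files, which is exactly the failure mode described above.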

kshri24 · 8 days ago
I agree with your assessment. Which is why I was proposing a middle ground: set up an agreement between the model training company and the collective of developers/artists et al., with a license under which they are rewarded for their original work in perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair not only because companies are using AI-generated output but because developers themselves are also paying for and using AI-generated output trained on other developers' input. I would feel good (in my conscience) that I am not "stealing" someone else's effort and that they are being paid for it.
foota · 8 days ago
I don't know how far it would get, but I imagine that a FAANG will be able to get the farthest here by virtue of having mountains of corporate data that they have complete ownership over.
pocksuppet · 8 days ago
More precisely, "All Rights Reserved" is the explicit lack of any license.
dathinab · 8 days ago
> how does that work

AI can't claim ownership, and humans can't either, as they haven't produced it. If there is guaranteed to be no one who can claim ownership, it is often seen as being in the public domain.(1)

In general it is irrelevant what the copyright of the AI training data is. At least in the US, judges have been relatively clear about that. (Except if the AI reproduces input data close to verbatim. _But in general we aren't speaking about an AI being trained on a code base, rather an AI using/rewriting it_.)

(1): Which isn't the same as no one seems to know who has ownership. It also might be owned by no-one in the sense that no one can grant you can copyright permission (so opposite of public domain), but also no-one can sue (so de-facto public domain).

jacquesm · 8 days ago
Humans can't claim ownership, but they are still liable for the product of their bot. That's why MS was so quick to indemnify their users, they know full well that it is going to be super hard to prove that there is a key link to some original work.

The main analogy is this one: you take a massive pile of copyrighted works, cut them up into small sections and toss the whole thing in a centrifuge, then, when prompted to produce a work, you use a statistical method to pull pieces of those copyrighted works out of the centrifuge. Sometimes you may find that you are pulling pieces out in the order in which they went in, which after a certain number of tokens becomes a copyright violation.

This suggests there are some obvious ways in which AI companies can protect themselves from claims of infringement, but as far as I'm aware not a single one has protections in place to ensure that they do not materially reproduce any fraction of the input texts, other than recognizing prompts that explicitly ask them to do so.

So it won't produce the lyrics of 'Let it be'. But they'll be happy to write you mountains of prose that strongly resembles some of the inputs.

The fact that they are not doing that tells you all you really need to know: they know that everything that their bots spit out is technically derived from copyrighted works. They also have armies of lawyers and technical arguments to claim the opposite.

graemep · 8 days ago
> humans can't either as they haven't produced it. If there is guaranteed no one which can claim ownership it often seen as being in the public domain.

Says who? The US ruling the article refers to does not cover this.

It is different in other countries. Even if US law says it is public domain (which is probably not the case) you had better not distribute it internationally. For example, UK law explicitly says a human is the author of machine generated content: https://news.ycombinator.com/item?id=47260110

m4rtink · 8 days ago
I would be totally fine with all code generated by LLMs being considered to be under GPL v3 unless the model authors can prove without any doubt it was not trained on any GPL v3 code - viral licensing to the max. ;-)
adrianN · 8 days ago
We‘ll have to wait until the technology progresses sufficiently that AI cuts into Disney’s profit.
shevy-java · 8 days ago
"We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not."

I think it will depend on the way HOW the AI arrived to the new code.

If it was using the original source code then it probably is guilty-by-association. But in theory an AI model could also generate a rewrite if being fed intermediary data not based on that project.

dathinab · 8 days ago
> "We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not."

it depends on the country you are in

but overall in the US judges have mostly consistently ruled it as legal

and this is extremely unlikely to change/be effectively interpreted different

but where things are more complex is:

- model containing training data (instead of generic abstractions based on it), determined by whether or not it can be convinced to produce close-to-verbatim output of the training data the discussion is about

- model producing close to verbatim training data

the latter seems to be mostly? always? seen as a copyright violation, with the issue that the person who commits the violation (i.e. uses the produced output) might not know

the former could mean that not just the output but the model itself can count as a form of database containing copyright-violating content. In which case the model provider has to remove it, which is technically impossible(1)... The pain point with that approach is that it will likely kill public models, while privately kept models will in every case put in a filter, _claim_ to have removed it, and likely get away with it. So while IMHO it should conceptually be a violation, it probably is better if it isn't.

But also the case the original article refers to is more about a model interacting with/using a code base than about a model being trained on it.

(1): Impossible to remove from the model weights, that is; it is very much removable from knowledge bases used by LLMs.

amelius · 8 days ago
You should just look at it as a giant computation graph. If some of the inputs in this graph are tainted by copyright and an output depends on these inputs (changing them can change the output) then the output is tainted too.
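The taint model this comment describes can be sketched directly: propagate a "copyright-tainted" flag through a dependency graph, so that any output reachable from a tainted input is itself tainted. The graph below, with its node names, is a hypothetical illustration of the rule, not a claim about any real model.

```python
from collections import deque

# Hypothetical computation graph: edges point from an input to the
# nodes whose values depend on it.
edges = {
    "gpl_code": ["weights"],
    "public_domain_text": ["weights"],
    "weights": ["output_a", "output_b"],
    "prompt": ["output_a"],
}

def tainted_outputs(edges, tainted_inputs):
    """BFS over the dependency graph: everything reachable from a
    tainted input is itself tainted."""
    tainted = set(tainted_inputs)
    queue = deque(tainted_inputs)
    while queue:
        node = queue.popleft()
        for dependent in edges.get(node, []):
            if dependent not in tainted:
                tainted.add(dependent)
                queue.append(dependent)
    return tainted

# Under this rule, anything downstream of the GPL input is tainted:
# the weights, and every output the weights feed into.
result = tainted_outputs(edges, {"gpl_code"})
```

Note how strict this rule is: because every output depends on the weights, a single tainted training input taints all outputs, which is precisely why this interpretation would be so sweeping in practice.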
d1sxeyes · 8 days ago
> We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not.

That horse has bolted. No one knows where all the AI code is any more, and it would no longer be possible to comply with a ruling that no one can use AI-generated code.

There may be some mental and legal gymnastics to make it possible, but it will be made legal because it’s too late to do anything else now.

Deleted Comment

conartist6 · 8 days ago
I hate that this may be true, but I also don't think the law will fix this for us.

I think this is down to the community and the culture to draw our red lines on and enforce them. If we value open source, we will find a way to prevent its complete collapse through model-assisted copyright laundering. If not, OSS will be slowly enshittified as control of projects flows to the most profit-motivated entities.

thedevilslawyer · 8 days ago
That's impractical enough that you might as well wish for UBI and world peace rather than this.
kshri24 · 8 days ago
Why is it impractical? Github already has a sponsor system. Also this can be a form of UBI.
abrookewood · 8 days ago
This seems relevant: "No right to relicense this project (github.com/chardet)" https://news.ycombinator.com/item?id=47259177
shevy-java · 8 days ago
That's another project though, right? In this case I think it is different because that project just seems stolen. The courts can probably verify this too.

I think the main question is when a rewrite is a clean rewrite, via AI. If it is a clean rewrite they can choose any licence.

littlestymaar · 8 days ago
No, TFA is about chardet too:

> chardet, a Python character encoding detector used by requests and many others, has sat in that tension for years: as a port of Mozilla’s C++ code it was bound to the LGPL, making it a gray area for corporate users and a headache for its most famous consumer.

samrus · 8 days ago
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

Im struggling to see where this conclusion came from. To me it sounds like the AI-written work can not be coppywritten, and so its kind of like a copy pasting the original code. Copy pasting the original code doesnt make it public domain. Ai gen code cant be copywritten, or entered into the public domain, or used for purposes outside of the original code's license. Whats the paradox here?

Sharlin · 8 days ago
The point is that even a work written by an AI trained exclusively on liberally licensed or public domain material cannot have copyright (isn’t a "work" in the legal sense) and thus nobody has standing to put it under a license or claim any rights to it.

If I train a limerick generator on the contents of Project Gutenberg, no matter how creative its outputs, they’re not copyrightable under this interpretation. And it’s by far the most reasonable interpretation of the law as both intended and written. Entities that are not legal persons cannot have copyright, but legal persons also cannot claim copyright of something made by a nonperson, unless they are the "creative force" behind the work.

NitpickLawyer · 8 days ago
> To me it sounds like the AI-written work can not be coppywritten

I think we haven't even begun to consider all the implications of this, and while people ran with that one case where someone couldn't copyright a generated image, it's not that easy for code. I think there needs to be way more litigation before we can confidently say it's settled.

If "generated" code is not copyrightable, where do we draw the line on what generated means? Do macros count? Does code that generates other code count? Protobuf?

If it's the tool that generates the code, again where do we draw the line? Is it just using 3rd party tools? Would training your own count? Would a "random" code gen and pick the winners (by whatever means) count? Bruteforce all the space (silly example but hey we're in silly space here) counts?

Is it just "AI" adjacent that isn't copyrightable? If so how do you define AI? Does autocomplete count? Intellisense? Smarter intellisense?

Are we gonna have to have a trial where there's at least one lawyer making silly comparisons between LLMs and power plugs? Or maybe counting abacuses (abaci?)... "But your honour, it's just random numbers / matrix multiplications...

lelanthran · 8 days ago
All of your questions have seemingly trivial answers. Maybe I am missing something, but...

> If "generated" code is not copyrightable, where do draw the line on what generated means? Do macros count?

Does the output of the macro depend on ingesting someone else's code?

> Does code that generates other code count?

Does the output of the code depend on ingesting someone else's code?

> Protobuf?

Does your protobuf implementation depend on ingesting someone else's code?

> If it's the tool that generates the code, again where do we draw the line?

Does the tool depend on ingestion of someone else's code?

> Is it just using 3rd party tools?

Does the 3rd party tool depend on ingestion of someone else's code?

> Would training your own count?

Does the training ingest someone else's code?

> Would a "random" code gen and pick the winners (by whatever means) count?

Does the random codegen depend on ingesting someone else's code?

> Bruteforce all the space (silly example but hey we're in silly space here) counts?

Does the bruteforce algo depend on ingesting someone else's code?

> Is it just "AI" adjacent that isn't copyrightable?

No, it's the "depends on ingesting someone else's code" that makes it not copyrightable.

> If so how do you define AI?

Doesn't matter whether it is AI or not, the question is are you ingesting someone else's code.

> Does autocomplete count?

Does the specific autocomplete in question depend on ingesting someone else's code?

> Intellisense?

Does the specific Intellisense in question depend on ingesting someone else's code?

> Smarter intellisense?

Does the specific Smarter Intellisense in question depend on ingesting someone else's code?

...

Look, I see where you're going with this - reductio ad absurdum and all - but it seems to me that you're trying to muddy the waters by claiming that either all code generation must be allowed or all of it must be disallowed.

Let me clear the waters for all the readers - the complaint is not about code generation, it's about ingesting someone else's code, frequently for profit.

All these questions you are asking seem to me to be irrelevant and designed to shift the focus from the ingestion of other people's work to something that no one is arguing against.

laksjhdlka · 8 days ago
They say "if" it's a new work, then it might not be copyrightable, I guess. You suppose that it's still the original work, and hence it's still got that copyright.

I think they are rhetorically asking if your position is correct.

cxr · 8 days ago
FYI: the concept is "copyright" not "copywrite". It doesn't turn into "copywritten" as an adjective. The adjective is "copyrighted".
antonvs · 7 days ago
Freedom of expression implies equal writes for all!
__alexs · 8 days ago
AI-written work absolutely is copyrightable. There are just some unresolved tensions around where the lines are, and how much and what kind of involvement humans need to have in the process.
jerf · 7 days ago
"Accepting AI-rewriting as relicensing could spell the end of Copyleft"

True, but too weak. It ends copyright entirely. If I can do this to a code base, I can do it to a movie, to an album, to a novel, to anything.

As such, we can rest assured that, for better or worse, this is going to be resolved in favor of rewriting not being enough to strip the copyright off of something, and the chardet/chardet project would be well advised not to stand in front of the copyright legal behemoth and try to defeat it in single combat.

Vegenoid · 6 days ago
No, because the function of code is distinct from the implementation of the code. With software, something that is functionally identical can be created with a different underlying implementation. This is not the case with media.