jedbrown · 2 days ago
Provenance matters. An LLM cannot certify a Developer Certificate of Origin (https://en.wikipedia.org/wiki/Developer_Certificate_of_Origi...) and a developer of integrity cannot certify the DCO for code emitted by an LLM, certainly not an LLM trained on code of unknown provenance. It is well-known that LLMs sometimes produce verbatim or near-verbatim copies of their training data, most of which cannot be used without attribution (and may have more onerous license requirements). It is also well-known that they don't "understand" semantics: they never make changes for the right reason.

We don't yet know how courts will rule on cases like Does v Github (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.

raggi · 2 days ago
For a large LLM I think the science in the end will demonstrate that verbatim reproduction is not coming from verbatim recording, as the structure really isn’t set up that way in the models in question here.

This is similar to the ruling by Alsup in the Anthropic books case that the training is “exceedingly transformative”. I would expect a reinterpretation or disagreement on this front from another case to be both problematic and likely eventually overturned.

I don’t actually think provenance is a problem on the axis you suggest if Alsup’s ruling holds. That said, I don’t think that’s the only copyright issue afoot - the Copyright Office’s writing on the copyrightability of outputs from the machine essentially requires that the output fail the Feist tests for human copyrightability.

More interesting to me is how this might further realign the notion of copyrightability of human works as time goes on, moving from every trivial derivative bit of trash potentially being copyrightable to some stronger notion of, to follow the Feist test, independence and creativity. It also raises a fairly immediate question in an open-source setting: do many individual small patch contributions even pass those tests themselves? They may well not, although the general guidance is to set the bar low - but does a typo fix? There is a long way to go down this rabbit hole.

snickerbockers · 2 days ago
I'd be fine with that if that were the way copyright law had been applied to humans for the last 30+ years, but it's not. Look into the OP's link on clean-room reverse engineering. I come from an RE background, and people are terrified of accidentally absorbing "tainted" information through extremely indirect means because it can potentially be used against them in court.

I swear the ML community is able to rapidly change their mind as to whether "training" an AI is comparable to human cognition based on whichever one is beneficial to them at any given instant.

j4coh · 2 days ago
So if you can get an LLM to produce music lyrics, for example, or sections from a book, those would be considered novel works given the encoding as well?
strogonoff · 2 days ago
In the West you are free to make something that everyone thinks is a “derivative piece of trash” and still call it yours; and sometimes it will turn out to be a hit because, well, it turns out that in real life no one can reliably tell what is and what isn’t trash[0]—if it was possible, art as we know it would not exist. Sometimes what is trash to you is a cult experimental track to me, because people are different.

On that note, I am not sure why creators in so many industries are sitting around while they are being more or less ripped off by massive corporations, when music has got it right.

— Do you want to make a cover song? Go ahead. You can even copyright it! The original composer still gets paid.

— Do you want to make a transformative derivative work (change the composition, really alter the style, edit the lyrics)? Go ahead, you’d just damn well better make sure you license it first. …and you can copyright your derivative work, too. …and the original composer still gets credit in your copyright.

The current wave of LLM-induced AI hype really made the tech crowd tie itself in knots trying to paint this as an unsolvable problem that requires IP abuse, or as not a problem at all because it’s mostly “derivative bits of trash” (at least the bits they don’t like, anyway), argue in courts about how it’s transformative, etc., while the most straightforward solution keeps staring them in the face. The only problem is that this solution does not scale, and if there’s anything that the industry in which “Do Things That Don’t Scale” is the title of a hit essay hates, it’s doing things that don’t scale.

[0] It should be clarified that if art is considered (as I do) fundamentally a mechanism of self-expression then there is, of course, no trash and the whole point is moot.

camgunz · 2 days ago
> For a large LLM I think the science in the end will demonstrate that verbatim reproduction is not coming from verbatim recording

We don't need all this (seemingly pretty good) analysis. We already know what everyone thinks: no relevant AI company has had their codebase or other IP scraped by AI bots they don't control, and there's no way they'd allow that to happen, because they don't want an AI bot they don't control to reproduce their IP without constraint. But they'll turn right around and be like, "for the sake of the future, we have to ingest all data... except no one can ingest our data, of course". :rolleyes:

jojobas · 2 days ago
There are only so many ways to code quite a few things. My classmate and I once got in trouble in high school for having identical code for one of the tasks at a coding competition, down to variable names and indentation. There is no way he could or would steal my code, and I sure didn't steal his.
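
To illustrate (a hypothetical example, not the actual competition task): some routines are so canonical that independent authors converge on nearly identical code, down to the names.

```python
# Euclid's algorithm: most people who know it write essentially this,
# down to the argument names and the swap idiom.
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

print(gcd(48, 36))  # 12
```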
rovr138 · a day ago
This is how SQLite handles it:

> Contributed Code

> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.

Source: https://www.sqlite.org/copyright.html

Aeolun · 2 days ago
Or you know, they just feel like code should be free. Like beer should be free.

We didn't have this whole issue 20 years ago because nobody gave a shit. If your code was public, and on the internet, it was free for everyone to use by definition.

Borealid · 2 days ago
An LLM can be used for a clean room design so long as all (ALL) of its training data is in the clean room (and consequently does not contain the copyrighted work being reverse engineered).

An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This doesn't detract from the core of your point, that LLM output may be copyright-contaminated by LLM training data. Yes, but that doesn't necessarily mean that an LLM output cannot be a valid clean-room reverse-engineering effort.

account42 · 2 days ago
> An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This assumes you are only concerned with a particular work, when you actually need to be sure you are not copying any work that might be copyrighted unless you have a valid license that you are abiding by.

neilv · 3 days ago
There is also IP taint when using "AI". We're just pretending that there's not.

If someone came to you and said "good news: I memorized the code of all the open source projects in this space, and can regurgitate it on command", you would be smart to ban them from working on code at your company.

But with "AI", we make up a bunch of rationalizations. ("I'm doing AI agentic generative AI workflow boilerplate 10x gettin it done AI did I say AI yet!")

And we pretend the person never said that, when they're just loosely laundering GPL and other code in a way that would rightly be existentially toxic to an IP-based company.

ineedasername · 3 days ago
Courts (at least in the US) have already ruled that use of ingested data for training is transformative. There’s lots of details to figure, but the genie is out of the bottle.

Sure it’s a big hill to climb in rethinking IP laws to align with a societal desire that generating IP continue to be a viable economic work product, but that is what’s necessary.

shkkmo · 2 days ago
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative

Yes, the training of the model itself is (or should be) a transformative act so you can train a model on whatever you have legal access to view.

However, that doesn't mean that the output of the model is automatically not infringing. If the model is prompted to create a copy of some copyrighted work, that is (or should be) still a violation.

Just like memorizing a book isn't infringement but reproducing a book from memory is.

eschaton · 3 days ago
Some courts at some levels. It’s by no means settled law.
alfalfasprout · 3 days ago
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative

This is far from settled law. Let's not mischaracterize it.

Even so, an AI regurgitating proprietary code that's licensed in some other way is a very real risk.

eru · 2 days ago
> Sure it’s a big hill to climb in rethinking IP laws to align with a societal desire that generating IP continue to be a viable economic work product, but that is what’s necessary.

Well, AI can perhaps solve the problem it created here: generated IP with AI is much cheaper than with humans, so it will be viable even at lower payoffs.

Less cynical: you can use trade secrets to protect your IP. You can host your software and only let customers interact with it remotely, like what Google (mostly) does.

Of course, this is a very software-centric view. You can't 'protect' eg books or music in this way.

jhanschoo · 2 days ago
An AI model's output can be transformative, but you can be unlucky enough that the LLM memorized the data that it gave you.
bsder · 3 days ago
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative.

If you have code that happens to be identical to someone else's code or implements someone's proprietary algorithm, you're going to lose in court even if you claim an "AI" gave it to you.

AI is training on private Github repos and coughing them up. I've had it regurgitate a very well written piece of code to do a particular computational geometry algorithm. It presented perfect, idiomatic Python with perfect tests that caught all the degenerate cases. That was obviously proprietary code--no amount of searching came up with anything even remotely close (it's why I asked the AI, after all).

slg · 2 days ago
>societal desire that generating IP continue to be a viable economic work product

It is strange that you think the law is settled when I don't think even this "societal desire" is completely settled just yet.

Hamuko · 3 days ago
Training an AI model is not the same as using an AI model.
BobbyTables2 · 2 days ago
I’m curious … So “transformative” is not necessarily “derivative”?

Seems to me the training of AI is not radically different than compression algorithms building up a dictionary and compressing data.

Yet nobody calls JPEG compression “transformative”.

Could one do lossy compression over billions of copyrighted images to “train” a dictionary?
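
The analogy can be made concrete with a rough, stdlib-only sketch (hypothetical names and toy data; a naive "training" step, not how any real codec or model works): derive a preset dictionary from a corpus, then compress and decompress new data against it.

```python
import zlib
from collections import Counter

# A toy corpus standing in for "billions of copyrighted works".
corpus = [
    b"for (int i = 0; i < n; i++) { sum += values[i]; }",
    b"for (int i = 0; i < n; i++) { total += prices[i]; }",
    b"while (head != NULL) { head = head->next; count++; }",
]

# Naive "training": collect the most common 8-byte substrings in the corpus.
counts = Counter()
for doc in corpus:
    for i in range(len(doc) - 8):
        counts[doc[i:i + 8]] += 1
preset = b"".join(chunk for chunk, _ in counts.most_common(64))

# Compress a new document against the corpus-derived preset dictionary.
comp = zlib.compressobj(zdict=preset)
data = comp.compress(b"for (int i = 0; i < n; i++) { acc += weights[i]; }") + comp.flush()

# Decompression only works with the same corpus-derived bytes in hand.
decomp = zlib.decompressobj(zdict=preset)
print(decomp.decompress(data))
```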

luma · 3 days ago
Also ban StackOverflow and nearly any textbook in the field.

The reality is that programmers are going to see other programmers code.

JoshTriplett · 3 days ago
"see" and "copy" are two different things. It's fine to look at StackOverflow to understand the solution to a problem. It's not fine to copy and paste from StackOverflow and ignore its license or attribution.

Content on StackOverflow is under CC-by-sa; the version depends on the date it was submitted: https://stackoverflow.com/help/licensing . (It's really unfortunate that they didn't pick a license compatible with code; at one point they started to move to the MIT license for code, but then didn't follow through on it.)

neilv · 3 days ago
Huge difference, and companies recognized the difference, right up until "AI" hype.
nitwit005 · 2 days ago
> The reality is that programmers are going to see other programmers code.

You're certainly correct. It's also true that companies are going to sue over it. There's no reason to make yourself an easy lawsuit target, if it's trivial to avoid it.

Hamuko · 3 days ago
There's a reason why clean-room reverse engineering exists and it's to ensure that you can't commit copyright infringement by knowing about the code you are trying to reimplement.
timeon · 3 days ago
How is that the same thing?
mhh__ · 2 days ago
That's not really how LLMs are used, unless we're planning on classing gitignores as IP

LLMs are interesting because they can combine things they learn from multiple projects into a new language that doesn't feature in any of them, and pick up details from your request.

Unless you're schizophrenic enough to insist that you never even see other code it's just not a realistic problem

Honestly I've had big arguments about this IP stuff before and unless you actually have a lawyer specifically go after something or very obviously violate the GPL it's just a tactic for people to slow people they don't like down. People find a way to invent HR departments fractally.

lgas · 2 days ago
> If someone came to you and said "good news: I memorized the code of all the open source projects in this space, and can regurgitate it on command", you would be smart to ban them from working on code at your company.

If you find a human that did that send them my way, I'll hire them.

tick_tock_tick · 3 days ago
> There is also IP taint when using "AI". We're just pretending that there's not.

I don't think anyone who's not monetarily incentivized to pretend there are IP/Copyright issues actually thinks there are. Luckily everyone is for the most part just ignoring them, and the legal system is working well and not allowing them an inch to stop progress.

neilv · 3 days ago
> I don't think anyone who's not monetarily incentivized to pretend there are IP/Copyright issues actually thinks there are.

Why do you think that about people who disagree with you? You're responding directly to someone who said they think there are issues, and that they're not pretending. Do you think they're lying? Did you not read what they said?

And AFAICT a lot of other people think similarly to me.

The perverse incentives to rationalize are on the side of the people looking to exploit the confusion, not the people who are saying "wait a minute, what you're actually doing is..."

So a gold rush person claiming opponents must be pretending because of incentives... seems like the category of "every accusation is a confession".

watwut · 2 days ago
There is the same pretension with "hallucinations". If I did the same, it would be called lying or bullshitting.
andruby · 3 days ago
> I try to assist inexperienced contributors and coach them to the finish line, because getting a PR accepted is an achievement to be proud of

I really appreciate this point from mitchellh. Giving thoughtful constructive feedback to help a junior developer improve is a gift. Yet it would be a waste of time if the PR submitter is just going to pass it to an AI without learning from it.

ants_everywhere · 2 days ago
Junior developers are entering a workforce where they will never not be using AI
aleph_minus_one · 2 days ago
> Junior developers are entering a workforce where they will never not be using AI

This remark seems very US-centric to me. In my observation, many people are much more skeptical concerning whether AI is actually useful beyond some gimmicky applications.

tpoacher · 2 days ago
Yes, in the same way junior pilots are entering a workforce where they will never not be using an autopilot.
eru · 2 days ago
I don't think using AI at all is forbidden, he just doesn't want AI to do the whole PR?
hagbarth · 2 days ago
They will still need to learn to recognise if the output from AI is good or not.
makeitdouble · 2 days ago
The rules can be finely adjusted when it actually becomes problematic, they're not trying to pass a law through Congress.
thallavajhula · 3 days ago
I’m loving today. HN’s front page is filled with good sources: no sensationalist nonsense or AI-doom preaching, just more realistic experiences.

I’ve completely turned off AI assist on my personal computer and only use AI assist sparingly on my work computer. It is so bad at compound work. AI assist is great at atomic work. The rest should be handled by humans, who should use AI wisely. It all boils down to human intelligence. AI is only as smart as the human handling it. That’s the bottom line.

tick_tock_tick · 3 days ago
> AI is only as smart as the human handling it.

I think I'm slowly coming around to this viewpoint too. I really just couldn't understand how so many people were having such widely different experiences. AI isn't magic; how could I have expected people I've worked with who struggle to explain things even to team members with near-perfect context to get anything valuable across to an AI?

I was originally pretty optimistic that AI would allow most engineers to operate at a higher level, but it really seems like instead it's going to massively exacerbate the difference between an OK engineer and a great engineer. Not really sure how I feel about that yet, but at least I understand now why some people think the stuff is useless.

btown · 3 days ago
One of my mental models is that the notion of "effective engineer" used to mean "effective software developer" whether or not they were good at system design.

Now, an "effective engineer" can be a less battle-tested software developer, but they must be good at system design.

(And by system design, I don't just mean architecture diagrams: it's a personal culture of constantly questioning and innovating around "let's think critically to see what might go wrong when all these assumptions collide, and if one of them ends up being incorrect." Because AI will only suggest those things for cut-and-dry situations where a bug is apparent from a few files' context, and no ambitious idea is fully that cut-and-dry.)

The set of effective engineers is thus shifting - and it's not at all a valid assumption that every formerly good developer will see their productivity skyrocket.

btucker · 3 days ago
I've been starting to think of it like this:

Great Engineer + AI = Great Engineer++ (Where a great engineer isn't just someone who is a great coder, they also are a great communicator & collaborator, and love to learn)

Good Engineer + AI = Good Engineer

OK Engineer + AI = Mediocre Engineer

jgilias · 3 days ago
This fits my observations as well. With the exception that it’s sometimes the really sharp engineers who can do wonders themselves who aren’t really great at communication. AI really needs you to be verbose, and a lot of people just can’t.
felipeerias · 2 days ago
At the moment, AI tools are particularly useful for people who feel comfortable browsing through large amounts of text, intuitively nudging the machine this way and that until arriving at a valuable outcome.

However, that way of working can be exasperating for those who prefer a more deterministic approach, and who may feel frustrated by the sheer amount of slightly incorrect stuff being generated by the machine.

oblio · 3 days ago
As a summary, any scaling tech greatly exacerbates differences.

Nassim Taleb is the prophet of our times and he doesn't get enough credit.

jerf · 3 days ago
I've been struggling to apply AI on any large scale at work. I was beginning to wonder if it was me.

But then my wife sort of handed me a project that previously I would have just said no to, a particular Android app for the family. I have instances of all the various Android technologies under my belt, that is, I've used GUI toolkits, I've used general-purpose programming languages, I've used databases, etc, but with the possible exception of SQLite (and even that is accessed through an ORM), I don't know any of the specific technologies involved with Android now. I have never used Kotlin; I've got enough experience that I can pretty much piece it together when I'm reading it but I can't write it. Never used the Android UI toolkit, services, permissions, media APIs, ORMs, build system, etc.

I know from many previous experiences that A: I could definitely learn how to do this but B: it would be a many-week project and in the end I wouldn't really be able to leverage any of the Android knowledge I would get for much else.

So I figured this was a good chance to take this stuff for a spin in a really hard way.

I'm about eight hours in and nearly done enough for the family; I need about another 2 hours to hit that mark, maybe 4 to really polish it. Probably another 8-12 hours and I'd have it brushed up to a rough commercial product level for a simple, single-purpose app. It's really impressive.

And I'm now convinced it's not just that I'm too old a fogey to pick it up, which is, you know, a bit of a relief.

It's just that it works really well in some domains, and not so much in others. My current work project is working through decades of organically-grown cruft owned by 5 different teams, most of which don't even have a person on them who understands the cruft in question, and trying to pull it all together into one system where it belongs. I've been able to use AI here and there for some stuff that is still pretty impressive, like translating some stuff into pseudocode for my reference, and AI-powered autocomplete is definitely impressive when it correctly guesses the next 10 lines I was going to type effectively letter-for-letter. But I haven't gotten that large-scale win where I just type a tiny prompt in and see the outsized results from it.

I think that's because I'm working in a domain where the code I'm writing is already roughly the size of the prompt I'd have to give, at least in terms of the "payload" of the work I'm trying to do, because of the level of detail and maturity of the code base. There's no single sentence I can type that an AI can essentially decompress into 250 lines of code, pulling in the correct 4 new libraries, and adding it all to the build system the way that Gemini in Android Studio could decompress "I would like to store user settings with a UI to set the user's name, and then display it on the home page".

I think I recommend this approach to anyone who wants to give this approach a fair shake - try it in a language and environment you know nothing about and so aren't tempted to keep taking the wheel. The AI is almost the only tool I have in that environment, certainly the only one for writing code, so I'm forced to really exercise the AI.

katbyte · 3 days ago
It’s like the difference between someone who can search the internet or a codebase well vs someone who can’t

Using search engines is a skill

smartmic · 3 days ago
Today, I read the following in the concluding section of Frederick P. Brooks' essay “No Silver Bullet, Refired”[0]. I am quoting the passage in full because it is so apt and ends with a truly positive message.

> Net on Bullets - Position Unchanged

> So we come back to fundamentals. Complexity is the business we are in, and complexity is what limits us. R. L. Glass, writing in 1988, accurately summarizes my 1995 views:

>> So what, in retrospect, have Parnas and Brooks said to us? That software development is a conceptually tough business. That magic solutions are not just around the corner. That it is time for the practitioner to examine evolutionary improvements rather than to wait—or hope—for revolutionary ones.

>> Some in the software field find this to be a discouraging picture. They are the ones who still thought breakthroughs were near at hand.

>> But some of us—those of us crusty enough to think that we are realists—see this as a breath of fresh air. At last, we can focus on something a little more viable than pie in the sky. Now, perhaps, we can get on with the incremental improvements to software productivity that are possible, rather than waiting for the breakthroughs that are not likely to ever come.[1]

[0]: Brooks, Frederick P., Jr., The Mythical Man-Month: Essays on Software Engineering (1995), p. 226.

[1]: Glass, R. L., "Glass" (column), System Development (January 1988), pp. 4-5.

WhyNotHugo · 3 days ago
> AI is only as smart as the human handling it.

An interesting stance.

Plenty of posts in the style of "I wrote this cool library with AI in a day" were written by really smart devs who are known for shipping good-quality libraries very quickly.

devmor · 3 days ago
I'm right there with you, and having a similar experience at my day job. We are doing a bit of a "hack week" right now where we allow everyone in the org to experiment in groups with AI tools, especially those that don't regularly use them as part of their work - and we've seen mostly great applications of analytical approaches, guardrails and grounded generation.

It might just be my point of view, but I feel like there's been a sudden paradigm shift back to solid ML from the deluge of chatbot hype nonsense.

rerdavies · 2 days ago
Sure. But a smart person using an AI is way smarter than a smart person not using an AI. Also keep in mind that the IQ of various AIs varies dramatically. The Google Search AI, for example, has an IQ in the 80s (and it shows); whereas capable paid AIs consistently score in the 120 IQ range. Not as smart as me, but close enough. And entirely capable of doing in seconds what would take me hours to accomplish, while applying every single one of my 120+ IQ points to the problem at hand. In my opinion, really smart people delegate.
danenania · 3 days ago
The way I've been thinking about it is that the human makes the key decisions and then the AI connects the dots.

What's a key decision and what's a dot to connect varies by app and by domain, but the upside is that generally most code by volume is dot connecting (and in some cases it's like 80-90% of the code), so if you draw the lines correctly, huge productivity boosts can be found with little downside.

But if you draw the lines wrong, such that AI is making key decisions, you will have a bad time. In that case, you are usually better off deleting everything it produced and starting again rather than spending time to understand and fix its mistakes.

Things that are typically key decisions:

- database table layout and indexes

- core types

- important dependencies (don't let the AI choose dependencies unless it's low consequence)

- system design—caches, queues, etc.

- infrastructure design—VPC layout, networking permissions, secrets management

- what all the UI screens are and what they contain, user flows, etc.

- color scheme, typography, visual hierarchy

- what to test and not to test (AI will overdo it with unnecessary tests and test complexity if you let it)

- code organization: directory layout, component boundaries, when to DRY

Things that are typically dot connecting:

- database access methods for crud

- API handlers

- client-side code to make API requests

- helpers that restructure data, translate between types, etc.

- deploy scripts/CI and CD

- dev environment setup

- test harness

- test implementation (vs. deciding what to test)

- UI component implementation (once client-side types and data model are in place)

- styling code

- one-off scripts for data cleanup, analytics, etc.

That's not exhaustive on either side, but you get the idea.

AI can be helpful for making the key decisions too, in terms of research, ideation, exploring alternatives, poking holes, etc., but imo the human needs to make the final choices and write the code that corresponds to these decisions either manually or with very close supervision.
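
As a minimal sketch of that split (entirely hypothetical names: a `User` type, a `users` table, a `fetch_user` helper): the human commits to the core type, a key decision, and the assistant fills in the routine access code that follows from it.

```python
from dataclasses import dataclass
from typing import Optional
import sqlite3

# Key decision (human): the core type and exactly what it contains.
@dataclass
class User:
    id: int
    email: str
    display_name: str

# Dot connecting (assistant): a routine access helper that follows directly
# from the decision above and is easy to review against it.
def fetch_user(conn: sqlite3.Connection, user_id: int) -> Optional[User]:
    row = conn.execute(
        "SELECT id, email, display_name FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return User(*row) if row else None
```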


king_geedorah · 3 days ago
Re: "What about my autocomplete?" which has shown up twice in this thread so far.

> As a small exception, trivial tab-completion doesn't need to be disclosed, so long as it is limited to single keywords or short phrases.

RTFA (RTFPR in this case)

Lerc · 3 days ago
This seems totally reasonable; the additional context provided is, I think, important to the requirement.

Some of the AI policy statements I have seen come across more as ideology statements. This is much better, saying the reasons for the requirement and offering a path forward. I'd like to see more of this and less "No droids allowed"

Waterluvian · 3 days ago
I’m not a big AI fan but I do see it as just another tool in your toolbox. I wouldn’t really care how someone got to the end result that is a PR.

But I also think that if a maintainer asks you to jump before submitting a PR, you politely ask, “how high?”

cvoss · 3 days ago
It does matter how and where a PR comes from, because reviewers are fallible and finite, so trust enters the equation inevitably. You must ask "Do I trust where this came from?" And to answer that, you need to know where it came from.

If trust didn't matter, there wouldn't have been a need for the Linux Kernel team to ban the University of Minnesota for attempting to intentionally smuggle bugs through the PR process as part of an unauthorized social experiment. As it stands, if you / your PRs can't be trusted, they should not even be admitted to the review process.

RossBencina · 2 days ago
> "Do I trust where this came from?"

In an open source project I think you have to start with a baseline assumption of "trust nobody." Exceptions possibly if you know the contributors personally, or have built up trust over years of collaboration.

I wouldn't reject or decline to review a PR just because I don't trust the contributor.

otterley · 2 days ago
If it comes with good documentation and appropriate tests, does that help?


koolba · 3 days ago
> You must ask "Do I trust where this came from?" And to answer that, you need to know where it came from.

No you don’t. You can’t outsource trust determinations. Especially to the people you claim not to trust!

You make the judgement call by looking at the code and your known history of the contributor.

Nobody cares if contributors use an LLM or a magnetic needle to generate code. They care if bad code gets introduced or bad patches waste reviewers’ time.

dsjoerg · 3 days ago
You haven't addressed the primary stated rationale from the linked content: "I try to assist inexperienced contributors and coach them to the finish line, because getting a PR accepted is an achievement to be proud of. But if it's just an AI on the other side, I don't need to put in this effort, and it's rude to trick me into doing so."
nosignono · 3 days ago
> I wouldn’t really care how someone got to the end result that is a PR.

I can generate 1,000 PRs today against an open source project using AI. I think you do care; you are only thinking about the happy path where someone uses a little AI to draft a well-constructed PR.

There's a lot ways AI can be used to quickly overwhelm a project maintainer.

Waterluvian · 3 days ago
In that case a more correct rule (and probably one that can be automatically enforced) for that issue is a max number of PRs or opened issues per account.
oceanplexian · 3 days ago
> I can generate 1,000 PRs today against an open source project using AI.

Then perhaps the way you contribute, review, and accept code is fundamentally wrong and needs to change with the times.

It may be that technologies like Github PRs and other VCS patterns are literally obsolete. We've done this before throughout many cycles of technology, and these are the questions we need to ask ourselves as engineers, not stick our heads in the sand and pretend it's 2019.

EarlKing · 3 days ago
It's not just about how you got there. At least in the United States according to the Copyright Office... materials produced by artificial intelligence are not eligible for copyright. So, yeah, some people want to know for licensing purposes. I don't think that's the case here, but it is yet another reason to require that kind of disclosure... since if you fail to mention that something was made by AI as part of a compound work you could end up losing copyright over the whole thing. For more details, see [2] (which is part of the larger report on Copyright and AI at [1]).

--

[1] https://www.copyright.gov/ai/

[2] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

ants_everywhere · 2 days ago
> • The use of AI tools to assist rather than stand in for human creativity does not affect the availability of copyright protection for the output.

> • Copyright protects the original expression in a work created by a human author, even if the work also includes AI-generated material

> • Human authors are entitled to copyright in their works of authorship that are perceptible in AI-generated outputs, as well as the creative selection, coordination, or arrangement of material in the outputs, or creative modifications of the outputs.

smitop · 2 days ago
> if you fail to mention that something was made by AI as part of a compound work you could end up losing copyright over the whole thing

The source you linked says the opposite of that: "the inclusion of elements of AI-generated content in a larger human-authored work does not affect the copyrightability of the larger human-authored work as a whole"

wahnfrieden · 3 days ago
You should care. If someone submits a huge PR, you’re going to waste time asking questions and comprehending their intentions if the answer is that they don’t know either. If you know it’s generated and they haven’t reviewed it themselves, you can decide to shove it back into an LLM for next steps rather than expect the contributor to be able to do anything with your review feedback.

Unreviewed generated PRs can still be helpful starting points for further LLM work if they achieve desired results. But close reading with consideration of authorial intent, giving detailed comments, and asking questions from someone who didn't write or read the code is a waste of your time.

That's why we need to know if a contribution was generated or not.

KritVutGu · 3 days ago
You are absolutely right. AI is just a tool to DDoS maintainers.

Any contributor who was shown to post provably untested patches used to lose credibility. And now we're talking about accommodating people who don't even understand how the patch is supposed to work?

nullc · 3 days ago
> is that they don’t know either

It would be nice if they did, in fact, say they didn't know. But more often they just waste your time making their chatbot argue with you. And the chatbots are outrageous gaslighters.

All big OSS projects have had the occasional bullshitter/gaslighter show up. But LLMs have increased the incidence of these sorts of contributors by many orders of magnitude-- I consider it an open question whether open-public-contribution open source is viable in the post-LLM world.

armchairhacker · 3 days ago
Agreed. As someone who uses AI (completion and Claude Code), I'll disclose whenever asked. But I disagree that it's "common courtesy" when not explicitly asked, since many people (including myself) don't mind and probably assume some AI, and it adds distraction (another useless small indicator; vaguely like dependabot, in that it steals my attention but ultimately I don't care).
eschaton · 3 days ago
It’s not just common courtesy to disclose, it’s outright fraud not to disclose.
risyachka · 2 days ago
It should be. You didn’t write generated code, why should I spend my life reading it?

If you want me to put in the effort, you have to put it in first.

Especially considering in 99% of cases even the one who generated it didn’t fully read/understand it.

mtlmtlmtlmtl · 2 days ago
The reason it's common courtesy is out of respect for the reviewer/maintainer's time. You need to let them know to look for the kind of idiotic mistakes LLMs shit out on a routine basis. It's not a "distraction", it's extremely relevant information. At the maintainer's discretion, they may not want to waste their time reviewing it at all, and politely or impolitely ask the contributor to do it again, and use their own brain this time. It also informs them on how seriously to take this contributor in the future, if the work doesn't hold water, or indeed, even if it does, since the next time the contributor runs the LLM lottery the result may be utter bullshit.

Whether it's prose or code, when informed something is entirely or partially AI generated, it completely changes the way I read it. I have to question every part of it now, no matter how intuitive or "no one could get this wrong"ish it might seem. And when I do, I usually find a multitude of minor or major problems. Doesn't matter how "state of the art" the LLM that shat it out was. They're still there. The only thing that ever changed in my experience is that problems become trickier to spot. Because these things are bullshit generators. All they're getting better at is disguising the bullshit.

I'm sure I'll gets lots of responses trying to nitpick my comment apart. "You're holding it wrong", bla bla bla. I really don't care anymore. Don't waste your time. I won't engage with any of it.

I used to think it was undeserved that we programmers called ourselves "engineers" and "architects" even before LLMs. At this point, it's completely farcical.

"Gee, why would I volunteer that my work came from a bullshit generator? How is that relevant to anything?" What a world.

nullc · 2 days ago
FWIW, I can say from direct experience that other people are watching and noting when someone submits AI slop as their own work, and taking note never to hire these people. Beyond the general professional ethics, it makes you harder to distinguish from malicious parties and other incompetent people LARPing as having knowledge they don't.

So fail to disclose at your own peril.

ants_everywhere · 2 days ago
If you don't disclose the use of

- books

- search engines

- stack overflow

- talking to a coworker

then it's not clear why you would have to disclose talking to an AI.

Generally speaking, when someone uses the word "slop" when talking about AI it's a signal to me that they've been sucked into a culture war and to discount what they say about AI.

It's of course the maintainer's right to take part in a culture war, but it's a useful way to filter out who's paying attention vs who's playing for a team. Like when you meet someone at a party and they bring up some politician you've barely heard of but who their team has vilified.

alfalfasprout · 3 days ago
The reality is, as someone who helps maintain several OSS projects, that you vastly underestimate the problem that AI-assisted tooling has created.

On the one hand, it's lowered the barrier to entry for certain types of contributions. But on the other hand getting a vibe-coded 1k LOC diff from someone that has absolutely no idea how the project even works is a serious problem because the iteration cycle of getting feedback + correctly implementing it is far worse in this case.

Also, the types of errors introduced tend to be quite different between humans and AI tools.

It's a small ask but a useful one to disclose how AI was used.

raincole · 3 days ago
When one side has much more "scalability" than the other, the other side has a very strong motivation to match up.

- People use AI to write cover letters. If the companies don't filter them out automatically, they're screwed.

- Companies use AI to interview candidates. No one wants to spend their personal time talking to a robot. So the candidates start using AI to take interviews for them.

etc.

If you don't at least tell yourself that you don't allow AI PRs (even just as a white lie) you'll one day use AI to review PRs.

oceanplexian · 3 days ago
Both sides will use AI and it will ultimately increase economic productivity.

Imagine living before the invention of the printing press, and then lamenting that we should ban them because it makes it "too easy" to distribute information and will enable "low quality" publications to have more reach. Actually, this exact thing happened, but the end result was it massively disrupted the world and economy in extremely positive ways.

bandrami · 2 days ago
Whether the output of AI can be copyrighted remains a legal minefield, so if I were running a project where copyright-based protections are important (say, anything GPL) I would want to know if a PR contained them.
tgsovlerkhgsel · 3 days ago
If a maintainer asks me to jump through too many stupid hoops, I'll just not contribute to the software.

That said, requiring adequate disclosure of AI is just fair. It also suggests that the other side is willing to accept AI-supported contributions (without being willing to review endless AI slop that they could have generated themselves if they had the time to read it).

I would expect such a maintainer to respond fairly to "I first vibecoded it. I then made manual changes, vibecoded a test, cursorily reviewed the code, checked that the tests provide good coverage, ran both existing and new tests, and manually tested the code."

That fair response might be a thorough review, or a request that I do the thorough review before they put in the time, but I'd expect it to be more than a blatant "nope, AI touched this, go away".

bagels · 3 days ago
You have other choices, such as not contributing.
renrutal · 3 days ago
I wouldn't put it as "just another tool". AI introduces a new kind of tool where the ownership of the resulting code is not straightforward.

If, in the dystopian future, a justice court you're subjected to decides that Claude was trained on Oracle's code, and all Claude users are possibly in breach of copyright, it's easier to nuke from orbit all disclosed AI contributions.

Razengan · 3 days ago
> if a maintainer asks you to jump before submitting a PR, you politely ask, “how high?”

or say "fork you."


quotemstr · 3 days ago
As a project maintainer, you shouldn't make unenforceable rules that you and everyone else know people will flout. Doing so makes you seem impotent and diminishes the respect people have for rules in general.

You might argue that by making rules, even futile ones, you at least establish expectations and take a moral stance. Well, you can make a statement without dressing it up as a rule. But you don't get to be sanctimonious that way I guess.

voxl · 3 days ago
Except you can enforce this rule some of the time. People discover that AI was used or suspect it all the time, and people admit to it after some pressure all the time.

Not every time, but sometimes. The threat of being caught isn't meaningless. You can decide not to play in someone else's walled garden if you want but the least you can do is respect their rules, bare minimum of human decency.

natrius · 3 days ago
Unenforceable rules are bad, but if you tweak the rule to always require some sort of authorship statement (e.g. "I wrote this by hand" or "I wrote this with Claude"), then the honor system will mostly achieve the desired goal of calibrating code review effort.
KritVutGu · 3 days ago
> As a project maintainer, you shouldn't make unenforceable rules

Total bullshit. It's totally fine to declare intent.

You are already incapable of verifying / enforcing that a contributor is legally permitted to submit a piece of code as their own creation (Signed-off-by), and do so under the project's license. You won't embark on looking for prior art, for the "actual origin" of the code, whatever. You just make them promise, and then take their word for it.

sheepscreek · 3 days ago
We keep talking about “AI replacing coders,” but the real shift might be that coding itself stops looking like coding. If prompts become the de facto way to create applications and develop systems in the future, maybe programming languages will just be baggage we’ll need to unlearn.

Programming languages were a nice abstraction to accommodate our inability to comprehend complexity - current-day LLMs do not have the same limitations as we do.

The uncomfortable part will be what happens to PRs and other human-in-the-loop checks. It’s worthwhile to consider that not too far into the future, we might not be debugging code anymore - we’ll be debugging the AI itself. That’s a whole different problem space that will need an entirely new class of solutions and tools.

tsimionescu · 3 days ago
This fundamentally misunderstands why programming languages exist. They're not required because "we can't understand complexity". They were invented because we need a way to be very specific about what we want the machine to do. Whether it's the actual physical hardware we're talking to when writing assembly, or it's an abstract machine that will be translated to the hardware like in C or Java, the key point is that we want to be specific.

Natural language can be specific, but it requires far too many words. `map (+ 1) xs` is far shorter to write than "return a list of elements by applying a function that adds one to its argument to each element of xs and collecting the results in a separate list", or similar.
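
The same contrast, rendered in Python for comparison (a trivial sketch):

```python
xs = [1, 2, 3]
# One short, unambiguous expression versus a full English sentence
# describing the same operation.
ys = [x + 1 for x in xs]
print(ys)  # [2, 3, 4]
```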

ryoshu · 3 days ago
All we need to do is prompt an LLM with such specificity that it does exactly what we want the machine to do.
hodgehog11 · 3 days ago
How does this not lead to a situation where no honest person can use any AI in their submissions? Surely pull requests that acknowledge AI tooling will be given significantly less attention, on the grounds that no one wants to read work that they know is written by AI.
skogweb · 3 days ago
I don't think this is the case. Mitchell writes that he himself uses LLMs, so it's not black and white. A PR author who has a deep understanding of their changes and used an LLM for convenience will be able to convey this without losing credibility imo
Workaccount2 · 3 days ago
Make a knowledgeable reply and mention you used chat-gpt - comment immediately buried.

Make a knowledgeable reply and give no reference to the AI you used- comment is celebrated.

We are already barreling full speed down the "hide your AI use" path.

showcaseearth · 3 days ago
I doubt a PR is going to be buried if it's useful, well designed, good code, etc, just because of this disclosure. Articulate how you used AI and I think you've met the author's intent.

If the PR has issues and requires more than superficial re-work to be acceptable, the authors don't want to spend time debugging code spit out by an AI tool. They're more willing to spend a cycle or two if the benefit is you learning (either generally as a dev or becoming more familiar with the project). If you can make clear that you created or understand the code end to end, then they're more likely to be willing to take these extra steps.

Seems pretty straightforward to me and thoughtful by the maintainers here.

vultour · 2 days ago
The last three GitHub issues I ran across when looking something up had people literally copy pasting the entire ChatGPT response as their comment. It feels like I'm living in some crazy dystopia when several _different_ people post a 30+ line message that's 95% the same. I'm sorry but I refuse to interact with people who do this, if I wanted to talk to a computer I'd do it myself.
wmf · 2 days ago
HN works that way but Mitchell said he isn't opposed to AI. You have to know the vibe of your environment.
MerrimanInd · 3 days ago
It just might. But if people develop a bias against AI-generated code because AI can generate massive amounts of vaguely correct-looking yet ultimately bad code, then that seems like an AI problem, not a people problem. Get better, AI coding tools.
alfalfasprout · 3 days ago
No one is saying to not use AI. The intent here is to be honest about AI usage in your PRs.
eschaton · 3 days ago
You ask this as if it’s a problem.
whimsicalism · 3 days ago
i'm happy to read work written by AI and it is often better than a non-assisted PR
andunie · 3 days ago
Isn't that a good thing?
jama211 · 3 days ago
What, building systems where we’re specifically incentivised not to disclose ai use?
hodgehog11 · 3 days ago
It might encourage people to be dishonest, or to not contribute at all. Maybe that's fine for now, but what if the next generation come to rely on these tools?
KritVutGu · 3 days ago
Good point. That's the point exactly. Don't use AI for writing your patch. At all.

Why are you surprised? Do companies want to hire "honest" people whose CVs were written by some LLM?

Octoth0rpe · 3 days ago
> Do companies want to hire "honest" people whose CVs were written by some LLM?

Yes, some companies do want to hire such people, the justification given is something along the lines of "we need devs who are using the latest tools/up to date on the latest trends! They will help bring in those techniques and make all of our current devs more productive!". This isn't a bad set of motivations or assumptions IMO.

Setting aside what companies _want_, they almost certainly are already hiring devs with llm-edited CVs, whether they want it or not. Such CVs/resumes are more likely to make it through HR filters.

hodgehog11 · 3 days ago
I don't know if future generations will agree with this sentiment, in which case we lock ourselves out of future talent (i.e. those that use AI to assist, not to completely generate). The same arguments were made about Photoshop once upon a time.

> Do companies want to hire "honest" people whose CVs were written by some LLM?

Unfortunately yes, they very much seem to. Since many are using LLMs to assess CVs, those which use LLMs to help write their CV have a measured advantage.