Readit News
TomOwens commented on MIT-Human License Proposal   github.com/tautvilas/MIT-... · Posted by u/brisky
JoshTriplett · 2 months ago
> The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

> In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

It does if the software copies portions of itself into the output, which seems close enough to what LLMs do. The neuron weights are essentially derived from all the training data.

> There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.

That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The only way to not violate the license on the training data is to treat all output as potentially derived from all training data.

TomOwens · 2 months ago
> You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output. The same model can be used across multiple software applications for different purposes. If I were to go to https://huggingface.co/deepseek-ai/DeepSeek-V3.2/tree/main (for example) and download those files, I wouldn't be able to reverse-engineer the training data without building more software.

Compare that to a search database, which needs the full text in an indexable format, directly associated with the document it came from. Although you can encrypt the database, at some point, it needs to have the text mapped to documents, which would make it much easier to reconstruct the complete original documents.
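To make the contrast concrete, here is a minimal sketch of an inverted index of the kind a search service might build. The documents and terms are hypothetical; the point is that the index necessarily stores the terms themselves, mapped back to their source documents, so much of the original text is recoverable from it in a way that a model's weights are not.

```python
from collections import defaultdict

# Hypothetical documents a search service might index.
docs = {
    "doc-1": "copyright law protects original expression",
    "doc-2": "fair use permits some copying of protected expression",
}

# Build an inverted index: every term maps back to the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# The terms survive verbatim in the index, keyed to their documents.
print(sorted(index["expression"]))
```

Real search engines also store term positions, which makes reconstruction of the full text even more direct.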

> That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The threshold of originality defines whether something can be protected by copyright. There are plenty of small snippets of code that can't be protected. But there are still questions about these small snippets that were consumed in the context of a larger, protected work, especially when there are only so many ways to express the same concept in a given language. It's definitely easier to reason about in written text than in code.

TomOwens commented on MIT-Human License Proposal   github.com/tautvilas/MIT-... · Posted by u/brisky
josephcsible · 2 months ago
One of these things is true:

1. Training AI on copyrighted works is fair use, so it's allowed no matter what the license says.

2. Training AI on copyrighted works is not fair use, so since pretty much every open source license requires attribution (even ones as lax as MIT do; it's only ones that are pretty much PD-equivalent like CC0, WTFPL, and Unlicense that don't) and AI doesn't give attribution, it's already disallowed by all of them.

So in either case, having a license mention AI explicitly wouldn't do any good, and would only make the license fail to comply with the OSD.

TomOwens · 2 months ago
Point 2 misses the distinction between AI models and their outputs.

Let's assume for a moment that training AI (or, in other words, creating an AI model) is not fair use. That means that all of the license restrictions must be adhered to.

For the MIT license, the requirement is to include the copyright notice and permission notice "in all copies or substantial portions of the Software". If we're going to argue that the model is a substantial portion of the software, then only the model would need to carry the notices. And we've already settled that accessing software over a server doesn't trigger these clauses.

Something like the AGPL is more interesting. Again, if we accept that the model is a derivative work of the content it was trained on, then the AGPL's viral nature would require that the model be released under an appropriate license. However, it still says nothing about the output. In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

So far, though, in the US, it seems courts are beginning to recognize AI model training as fair use. Honestly, I'm not surprised, given that it was seen as fair use to build a searchable database of copyright-protected text. The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

But there is still the ethical question of disclosing the training material. Plagiarism still exists, even for content in the public domain. So attributing the complete set of training material would probably fall into this form of ethical question, rather than the legal questions around intellectual property and licensing agreements. How you go about obtaining the training material is also a relevant discussion, since even fair use doesn't allow you to pirate material, and you must still legally obtain it - fair use only allows you to use it once you've obtained it.

There are still questions for output, but those are, in my opinion, less interesting. If you have a searchable copy of your training material, you can do a fuzzy search of that material to return potential cases where the model returned something close to the original content. GitHub already does something similar with GitHub Copilot and finding public code that matches AI responses, but there are still questions there, too. It's more around matches that may not be in the training data or how much duplicated code needs to be attributed. But once you find the original content, working with licensing becomes easier. There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.
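The fuzzy-search idea above can be sketched in a few lines. This is a toy illustration, not how Copilot's matcher actually works: the corpus, the `find_near_matches` helper, and the 0.8 threshold are all assumptions, and `difflib.SequenceMatcher` is far too slow for a real training corpus, where you'd want n-gram or hash-based indexing instead.

```python
from difflib import SequenceMatcher

# Hypothetical training corpus: snippets of licensed material, keyed by document.
corpus = {
    "doc-1": "def add(a, b):\n    return a + b",
    "doc-2": "The quick brown fox jumps over the lazy dog.",
}

def find_near_matches(output, corpus, threshold=0.8):
    """Return (doc_id, similarity) pairs whose similarity exceeds the threshold."""
    hits = []
    for doc_id, text in corpus.items():
        ratio = SequenceMatcher(None, output, text).ratio()
        if ratio >= threshold:
            hits.append((doc_id, ratio))
    return hits

# A model output that closely echoes doc-2 gets flagged for license review.
print(find_near_matches("The quick brown fox jumped over the lazy dog.", corpus))
```

Once a hit is flagged, a human can pull up the original document and work out the licensing, which is the easier part of the problem.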

TomOwens commented on SoC-2 is table stakes now. Here's what matters for AI products   superagent.sh/blog/soc-2-... · Posted by u/homanp
TomOwens · 3 months ago
The premise of this whole post is incorrect. If an organization is building an AI product or offering an AI service, then a SOC 2 report, or at least a SOC 2 Type 2 report, should answer these questions.

"What happens if someone tries to extract training data?" CC6.7 covers data loss and data transfer restrictions. I've typically included controls related to monitoring data transfer, including flagging and highlighting potential breaches, along with documented procedures for responding to data loss or unauthorized data transfer. These can be reviewed, but may be hard for the auditor to test unless they were executed and there's evidence that they were executed as written.

"Can this agent be manipulated into accessing data it shouldn't? How do you test for adversarial attacks?" I'm struggling to understand the difference between these questions. It seems like part of the answer likely overlaps with controls to address CC6.7 and data loss or data transfer restrictions. CC8.1 discusses testing the product or service.

"How do you prevent prompt injection?" This may be a bit specific for a SOC 2 Type 2 report, since it really gets into requirements, architecture, and design decisions rather than controls over the requirements, architecture, and design. That is, you can essentially not require preventing prompt injection and follow all of your controls related to, for example, CC8.1. CC8.1 talks about managing, authorizing, executing, and documenting changes. You can do all of these things well without that requirement in place.

"What guardrails are in place, and have they been validated?" This is the entire SOC 2 Type 2 report. It lists all evaluated criteria, describes the organization's controls, and provides an audit of those controls. It's up to the organization being audited, however, to think about what controls are necessary for their context. The controls that should be in scope of the audit will differ for an AI product or service than for something else. The recipient of the SOC 2 report can review the controls and ask questions.

Part of the burden is on the organization getting the SOC 2 audit report to think about what controls they need. But there's also a burden on the organization reviewing the audit report not just to see that there are no exceptions, but to review the controls described to make sure the controls are in place for the given product or service. And this detailed information about the controls is what makes something like the SOC 2 audit report a whole lot more useful than something like an ISO 27001 certificate, which says that whatever policies and procedures are in place meet the requirements of the standard and doesn't offer details on how those requirements are met.

TomOwens commented on Ask HN: Security standards requiring fired employees leave immediately    · Posted by u/bryanrasmussen
TomOwens · 2 years ago
I'm looking through what I have access to quickly.

I started with the CSA's Cloud Controls Matrix, just because they trace to a bunch of other standards. They have a control - IAM-07 - that is to "de-provision or respectively modify access of movers / leavers or system identify changes in a timely manner in order to effectively adopt and communicate identity and access management policies". This control points to other sources.

One standard that I have access to, CIS Critical Security Controls v8.1, calls for a process that disables or deletes accounts and revokes access "immediately upon termination, rights revocation, or role change of a user". I believe v8.1 is the latest version.

The Trust Services Criteria mapping is to CC5.3 and CC6.3. This defers to defined organizational policies and procedures and doesn't specify any timelines.

ISO/IEC 27001:2022 and ISO/IEC 27002:2022 mapping is A.5.15 and A.5.18. This is identified as a gap in earlier versions of these standards. I don't have ready access to these standards, so I can't tell you if they give any timelines.

The NIST 800-53 rev 5 mappings are to AC-2 (1, 2, 6, 8), AC-3 (8), AC-6 (7), AU-10 (4), AU-16 (1), and CM-7 (1). All of these appear to defer to organizationally-defined timing and frequency for review.

The NIST CSF v2.0 mapping is to GV.RR-04, GV.SC-10, PR.AA-01, and PR.AA-05. The most relevant ones are the PR.AA controls and they both defer to organizational policies for definitions.

As far as I can tell, most standards simply require that a company define its policies and procedures, and then certification or audit against that standard would only ensure that the documented policies are being followed. If you wanted to implement the "immediately" requirement, one way to do it would be to document that as your process (optionally by adopting the CIS Critical Security Controls, but you may or may not want to adopt the whole set) and then have it covered in a SOC 2 Type 2 audit, where the auditor would sample people who have left the organization and check when their access was revoked.
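The auditor's sampling test described above amounts to joining HR termination records against IAM revocation timestamps and flagging any gap. A minimal sketch, with hypothetical records and field names (real evidence would come from HR and identity-provider exports):

```python
from datetime import date, timedelta

# Hypothetical offboarding evidence: (employee, termination date, access-revoked date).
records = [
    ("alice", date(2023, 3, 1), date(2023, 3, 1)),
    ("bob", date(2023, 4, 10), date(2023, 4, 14)),
]

def exceptions(records, max_delay=timedelta(days=0)):
    """Flag leavers whose access outlived the documented 'immediate' revocation policy."""
    return [name for name, left, revoked in records if revoked - left > max_delay]

# With an 'immediate' policy (zero allowed delay), bob's four-day gap is an exception.
print(exceptions(records))
```

If your documented policy allowed, say, one business day instead, you'd pass a different `max_delay`, and the audit would test against that.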

TomOwens commented on Please don’t upload my code on GitHub   nogithub.codeberg.page/... · Posted by u/modinfo
eloisius · 3 years ago
Copilot was not only trained on permissively licensed code. It’s trained on all public repos, even if the code is copyrighted (which is the default absent a more permissive license)
TomOwens · 3 years ago
If the copyrighted code was uploaded to GitHub by the owner, there's no problem with this. When you upload code to GitHub, one of the rights that you grant to GitHub is the right to use your content for "improving the Service over time". See D.4. License Grant to Us in the GitHub Terms of Service. Once it is up there, you also grant other users certain rights, like viewing public repos and forking repos into their own copies. See D.5. License Grant to Other Users. Even with the most restrictive protections in place, using GitHub requires you to give up certain rights.

A question would be if creating and training Copilot is "improving the Service over time". I would suspect that it would be, though.

There are still some open questions around what happens when Copilot suggests code verbatim, but these are mostly for the users of Copilot. I would hope that GitHub is thinking about offering information to ensure that users understand the source of the code they use, whether it may be protected, and what licenses it may be offered under. There are still some interesting legal questions here, but I don't think that the training of Copilot is one of them.

A more interesting question would be what GitHub does if someone uploads someone else's copyright-protected code to GitHub and it is used for training Copilot before it is removed. If you don't own the copyright, you can't grant GitHub the rights needed to use that code for anything, including improving the service.

TomOwens commented on Massachusetts health notifications app installed without users’ knowledge   play.google.com/store/app... · Posted by u/_v7gu
TomOwens · 5 years ago
Most of the comments on that app as well as here are probably wrong. I'd suspect that everyone who had the app "installed without their permission" opted into the Android COVID-19 Exposure Notification program. This was deployed by Google as part of an update to Google Play Services.

When you go to your phone's settings with this update, there's an option to enable COVID-19 Exposure Notifications. When you turn it on, it prompts you for your location and will download your region's app that uses your phone's new capabilities to connect to the appropriate health authorities.

Massachusetts just opted into this program in the last couple of weeks. I'm honestly not sure why they did it so late - this would have been helpful earlier. Apple iPhones also have this capability, including interoperability with Android phones, and iPhone users in Massachusetts are also able to turn on this setting.

Now, if someone can actually prove that they didn't opt into the COVID-19 Exposure Notifications, then I'd be concerned. But my guess is they opted in when it came out, but there was no app for their region, so nothing was downloaded and the feature did nothing. Then, Massachusetts rolled out the app now and lots of people who configured their phones earlier in the pandemic got a new app. They granted permission for it, perhaps months ago.

TomOwens commented on Software Engineering Body of Knowledge   computer.org/education/bo... · Posted by u/p1necone
veltas · 5 years ago
Not attacking this association, but just wanted to say I really dislike the term "body of knowledge".

Areas of knowledge are certainly some kind of 'body', but the term always sounds to me like something you just keep adding more stuff onto the outside as you go, like a big snowball of ideas. And knowledge can become defunct, irrelevant, or disproven over time.

And there's something about the way it tries to sound almost authoritative, without claiming to be scientific.

TomOwens · 5 years ago
I think this is why the PDF is actually the "Guide to the Software Engineering Body of Knowledge". It's not a representation of the complete body of knowledge itself; it extracts key concepts and terms and provides pointers to the things that are most relevant. If things become irrelevant or are disproven over time, the guide to the body of knowledge would remove those terms, concepts, or references and point to something else.
TomOwens commented on Feasibility of stealthily introducing vulnerabilities in open source software [pdf]   github.com/QiushiWu/qiush... · Posted by u/etxm
TomOwens · 5 years ago
I wonder if the people involved in approving and conducting this research are aware of the ACM's Code of Ethics. I can see pretty clear links to at least two or three of the code's ethical principles. This seems to be a pretty serious breakdown of the researchers' understanding of their ethical responsibilities, but also of the review and approval of research projects.
TomOwens commented on Removed gem breaks Rails ActiveStorage   github.com/rails/rails/is... · Posted by u/ldulcic
Denvercoder9 · 5 years ago
> The absolute worst thing, though, was that changing a license should not be a minor (or a major) version number increase.

The license didn't change. It was always already GPL, due to the usage of GPL-licensed code, regardless of what the metadata said. The change just made the metadata correctly reflect reality.

[EDIT: I should clarify that technically mimemagic wasn't already GPL, but the only legal way to use it was by satisfying your obligations under the GPL, making it effectively GPL. The author did relicense his own code to be GPL instead of MIT.]

To me it seems like making your downstreams aware of that ASAP is pretty important, since this has important legal implications for them as well. Yanking the old versions and releasing an update with an incompatible version number is a way to do that, albeit one that's quite disruptive.

TomOwens · 5 years ago
Yeah. That's a better way of putting it. The author didn't opt to change the license. He corrected a licensing error.

I do agree that making the downstream users aware is important; I just don't agree that immediately yanking is the right solution. Putting out a new version would have been nice. Adding a post-install message to the new versions would have been a good way to start getting the word out. Not sure how far to take it, but opening issues with dependent projects (RubyGems provides this information) would also have been nice, giving the major dependents good notice before yanking.

TomOwens commented on Removed gem breaks Rails ActiveStorage   github.com/rails/rails/is... · Posted by u/ldulcic
freedomben · 5 years ago
After the "left-pad" fiasco, and a similar event on the Ruby side, I started vendoring my dependencies as standard practice. I have not been sorry yet, in fact I feel vindicated in that approach.
TomOwens · 5 years ago
Vendoring is a good first step, too. As long as you have a local copy of all the dependencies, you're better off than needing to pull them from the Internet every time you want them and risking that they're gone - or, potentially worse, that the same version is still there but with modified contents.

u/TomOwens · Karma: 89 · Cake day: March 24, 2021