You cannot have our user's data

On the topic of licenses and LLM’s—of course, we have to applaud sourcehut at least trying to not allow all their code to be ingested by some mechanical license violation service. But, it seems like a hard game. Ultimately the job of their site is to serve code, so they can only be so restrictive.

I wonder if anyone has tried going in the opposite direction? Someone like adding to their license: “by training a machine learning algorithm trained on this source code or including data crawled from this site, you agree that your model is free to use by all, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese). I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.

jsheard · 5 months ago

If you squint at the GPL then you could argue that every LLM is already under it, because it's a viral license and there's almost certainly some GPL code in there somewhere. I'm sure the AI companies would beg to differ though, they want a one-way street where there's zero restrictions on IP going into models, but they can dictate whatever restrictions they like on the resulting model, derived models, and model output.

I hope one of the big proprietary models leaks one day so we get to see OpenAI or Google tie themselves in knots to argue that training on libgen is fine, but distilling a leaked copy of GPT or Gemini warrants death by firing squad.

bastardoperator · 5 months ago

I think the courts were pretty clear, prove damages. I'm not saying I agree in any capacity, but the AI companies went to court, and it appears they've already won.

bee_rider · 5 months ago

The hope is that we can out the onus on them to start suing people or whatever, at least. The US legal system is biased toward whoever has the biggest budget of course, but the defense still gets a little bit of advantage as well.

pabs3 · 5 months ago

There have been copyright office rulings saying that ML model output is not copyrightable, so that last part of the suggested license seems a bit strange, since the rulings could preclude it for code at some point.

Also, it remains to be seen whether copyright law will outright allow model training without a license, or if there will be case law to make it fair use in the USA, or if models will be considered derivative works that require a license to prepare, or what other outcome will happen.

ranger_danger · 5 months ago

> There have been copyright office rulings saying that ML model output is not copyrightable

Source? If you're referring to Thaler v. Perlmutter, I would argue that is an incorrect assessment.

> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler's claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.

If someone had instead argued that they themselves were the author by means of instructing an LLM to generate the output based on a prompt they made up, plus the influence of prior art (the pretrained data), which you could also argue is not only what he probably did anyway (but argued it poorly), but in a way is also how humans make art themselves... that would be a much more interesting decision IMO.

bee_rider · 5 months ago

It seems like AI companies are making a major bet on the idea that the output of their models can be licensed and used in products.

candiddevmike · 5 months ago

How can AI code be added to any kind of open source license, or would it just be that code that isn't covered under the license (since it's effectively public domain?)?

mvdtnz · 5 months ago

What does copyright have to do with it? It's about distribution.

majorchord · 5 months ago

> Someone like adding to their license

I would assume that clause would be unenforceable. They may be able to try to sue for violating the terms of the license, but I'm fairly confident they're not going to get a judge to order them to give their model away for free even if they won. And they would likely still need to show damages in order to win a contract case.

bee_rider · 5 months ago

> I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.

The hope is to flip the script. Sure, the company might not fulfill their full obligation under the terms and conditions they agreed to (in the same way that we all agree to these “continuing to use this site means you agree to our terms and conditions,” they are agreeing to the terms and conditions by continuing to scrape the site). But, at least if a model leaks or some pirates software that was generated by one of their LLMs, they can say, well it was open source.

Yeah, I don't like this.

We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.

In fact I'd say AI is overall a benefit to our project, because we have a large, quite complex platform, and the fact that ChatGPT actually manages to sometimes correctly write scripts for it is quite wonderful. I think it helps new people get started.

In fact in light of the recent Github discussion I'd say I personally see this as a reason to avoid sourcehut. Sorry, but I want all the visibility I can get.

notrealyme123 · 5 months ago

I was surprised to not see the "/s" at the end.

Big-Tech deciding that all our work belongs to them: Good

Small Code hosting platform does not want to be farmed like a Field of Corn: Bad

dale_glass · 5 months ago

Why would you expect a /s?

I understand their standpoint: it's their infrastructure, and their bills.

However my concerns are with my project, not with their infrastructure bills. I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one. I want ChatGPT/copilot/etc to know it exists and to write code for it, just in case that brings in more users.

Blocking abusive behavior? Sure. But I very specifically disagree with the blanket prohibition of "Anything used to feed a machine learning model". I do not see it being in my interest.

maleldil · 5 months ago

This would be fine in an ideal world. However, the one we live in has crawlers that don't care how many resources they use. They're fine with taking the server down or bankrupting the owner as long as they get the data they want.

dale_glass · 5 months ago

And I can understand the abuse argument, however they have a blanket exclusion for AI I do not agree with.

sksxihve · 5 months ago

The code might not be theirs but the service hosting the code is and nothing is stopping you from hosting your code elsewhere. For some people blocking LLMs might be a reason to use sourcehut over github.

mtlynch · 5 months ago

>We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.

That's not what the Apache license says.

According to the Apache license, derivative work needs to provide attribution and a copyright notice, which no major LLMs do.

Deleted Comment

eesmith · 5 months ago

I pay for sourcehut hosting, and I have no problems at all with this decision.

Unlike you with your GitHub-based project, I avoid Microsoft like the plague. I do not want to be complicit in supporting their monopoly power, their human rights abuses, and their environmental destruction.

LtWorf · 5 months ago

You won't get visibility from AI.

I'm curious what your project is. Blockchain?

dale_glass · 5 months ago

VR platform. We're actually opposed to any blockchain tech as an organization. We're into it for the users.

But I see no reason to have any issues with LLMs. ChatGPT/copilot/etc helping new people getting started? That sounds absolutely great to me.

mrweasel · 5 months ago

> Sorry, but I want all the visibility I can get.

I can understand that, but the various AI companies pounding sourcehut into the ground also results in zero visibility.

mrstresser.com. 21600 IN NS sterling.ns.cloudflare.com. silentstress.cc. 21472 IN NS ernest.ns.cloudflare.com. maxstresser.com. 21600 IN NS edna.ns.cloudflare.com. darkvr.su. 21600 IN NS paige.ns.cloudflare.com. stresser.sh. 21600 IN NS luke.ns.cloudflare.com. stresserhub.org. 21600 IN NS fay.ns.cloudflare.com.

simonw · 5 months ago

Blocking aggressive crawlers - whether or not they have anything to do with AI - makes complete sense to me. There are growing numbers of badly implemented crawlers out there that can rack up thousands of dollars in bandwidth expenses for sites like SourceHut.

Framing that as "you cannot have our user's data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.

immibis · 5 months ago

I still maintain that since we already have this system (it's called "looking up your ISP and emailing them") where if you send spam emails, we contact your ISP, and you get kicked off the internet...

And the same system will also you get banned from your ISP if you port scan the Department of Defense...

why are we not doing the same thing against DoS attackers? Why are ISPs not hesitant to cut people off based on spam mail, but they won't do it based on DoS?

diggan · 5 months ago

> why are we not doing the same thing against DoS attackers?

The first D in DDoS stands for "distributed", meaning it comes from multiple different origins, usually hacked devices. If we start throwing off every compromised network, we'd only have a few (secure) networks left. Probably network equipment vendors would quickly have to redo their security so it actually protects people.

So yeah, good question.

thunderfork · 5 months ago

All you need to evade ISP complaints is (e.g.) a botnet of residential IPs making a few requests each, instead of one IP making a ton.

zzo38computer · 5 months ago

I agree; blocking aggressive crawlers that are badly behaved, etc, is what is sense. The files that are public are public and I should expect anyone who wants a copy can get them and do what they want with them.

RadiozRadioz · 5 months ago

From the Anubis docs

> Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards.

This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill, this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.

My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".

Just run the thing single threaded if you have to.

Anon1096 · 5 months ago

>My browser is supported by your website if it implements all the things your website needs.

Well I guess your browser does not support everything needed?Being able to run a multi threaded proof of work is not the same as checking arbitrary user agents, any browser can implement it.

By default, Anubis will not block Lynx and some other browsers that do not implement JavaScripts, but will block scrapers that claim to be Mozilla-based browsers, and many of the badly behaving ones do claim to be Mozilla based browsers, so this helps. (I do not have a browser compatible with the Anubis software, and Anubis does not bother me.)

If necessary, it would also be possible to do what powxy does which is displaying an explanation of how the proof of work is working in case you want to implement your own.

Having alternative access of some files using other protocols, might also help, too.

So any bot author that reads your comment and switches their user agents to curl can now bypass anubis? That doesn't seem very well thought-out to me.

xena · 5 months ago

Patches are welcome!

matt3210 · 5 months ago

Anubis has had great results blocking LLM agents https://anubis.techaro.lol/

ac29 · 5 months ago

That's what sourcehut is using.

As an aside, I saw this for the first time on a kernel.org website a few days ago and actually thought it might have been hacked since I briefly saw something about kH/s (which screams cryptominer to me).

A screenshot for anyone who hasnt seen it: https://i.imgur.com/dHOmHtn.png

(this screen appears only very briefly, so while it is clear what it is from a static screenshot, its very hard to tell in real time)

runjake · 5 months ago

Yes, this is explained and linked in the first sentence of the linked article.

xvilka · 5 months ago

The solution is to make Git fully self-contained and encrypted, just like Fossil[1] - store issues and PRs inside the repository itself, truly distributed system.

[1] https://fossil-scm.org/

rvba-fr · 5 months ago

Looks like git diffs are the new gold for training LLMs : https://carper.ai/diff-models-a-new-way-to-edit-code/

sltr · 5 months ago

> a racketeer like CloudFlare

Could anyone teach me what makes this a fair characterization of Cloudflare?

Not sure exactly what it is referring to, but I could make a guess that it's because Cloudflare sells LLM inference as a service, but also a service that blocks LLMs. A bit like a Anti-DDOS company also selling DDOS services.

For example, https://developers.cloudflare.com/workers-ai/guides/demos-ar... has examples visit websites, then for the people on the other side (who want to protect themselves against those visits) there is https://developers.cloudflare.com/waf/detections/firewall-fo...

Just a guess though, I don't know for sure the authors intentions/meanings.

> A bit like a Anti-DDOS company also selling DDOS services.

That's not far off from what Cloudflare does either, the majority of DDoS-for-hire outfits are hidden behind and protected by Cloudflare.

Whenever this comes up I do a quick survey of the top DDG results for "stresser" to see if anything's changed, and sure enough:

rsync · 5 months ago

"Just a guess though, I don't know for sure the authors intentions/meanings."

I am reminded of this posting from years past:

https://news.ycombinator.com/item?id=38496499

"A lot of hacking groups, terror organizations and other malicious actors have been using cloud flare for a while without them doing shit about it. ... It's their business model. More DDoS means more cloudflare customers, yaaay."

I've not spent much time on this topic but I am very interested in the notion that a well-meaning third party established some kind of looking glass that surveyed cloudflare behavior and that third party was sued ?

I'd like to learn more about that situation ...

mariusor · 5 months ago

I remember that when the first influxes of LLM crawlers have hit Sourcehut, they had some talks with Cloudflare which ended when CF demanded an outrageous amount of money from a company the size of Sourcehut. If I find the source for this, I'll update.

[edit] Here's the source: https://sourcehut.org/blog/2024-01-19-outage-post-mortem/#:~...

Cloudflare has been accused of playing both sides--they host services for known/associated DDoS providers while conveniently offering services to protect DDoS.