Perversely, this submission is essentially blogspam. The article linked in its second paragraph, to which this "1 minute" read adds almost nothing of value, is the main story.
But also ironically, it's almost heartwarming these days to see blogspam that's not machine-generated! A real live human cared enough about an article to write a brief (perhaps only barely substantial, but at least handwritten) take on it!
Now, is driving attention and reputation to their site (in the broadest senses) part of a blogspammer/reblogger's motivation? Absolutely!
But should we be concerned about rewarding their act of curation, as long as there is at least some level of genuine curation intent? A world where that answer is categorically "no" would be antithetical, I think, to the concept of the participatory web.
I don't feel this is blogspam; it's more a quick comment on the situation pointing to the actual article. I don't see anything wrong with writing a short post boosting or commenting on another article. There are no ads, so I don't see this as blogspam, which I associate with financial gain or clout.
All the time I see links on the HN front page to Twitter and Mastodon posts with just as little text to them. Why does it upset you when it is in the medium of blogs, but not microblogs?
Hehe, just participating in POSSE :) Funnily enough the story you're linking to quotes me with pictures of a story I wrote (https://sethmlarson.dev/slop-security-reports) about LLM-generated reports to open source projects.
this is basically what they are doing, but instead of charging actual money they are making visitors' CPUs spin, ideally on a proof-of-work problem, which has the same outcome from the crawler's perspective.
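For the curious, the core mechanism is hashcash-style: the client must find a nonce whose hash meets a difficulty target, which costs the client real CPU time but costs the server a single hash to check. A minimal sketch (illustrative only, not any particular project's implementation):

```python
import hashlib
import itertools

def solve_challenge(seed: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(seed + nonce) has `difficulty` leading hex zeros.

    Expected work grows by 16x per unit of difficulty -- this is the part
    the visitor's browser has to burn CPU on.
    """
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(seed: str, nonce: int, difficulty: int) -> bool:
    """Verification is one hash -- essentially free for the server."""
    digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve_challenge("example-challenge", 4)
assert verify("example-challenge", nonce, 4)
```

The asymmetry is the whole point: a human's browser pays the cost once per visit, while a crawler hammering millions of pages pays it millions of times.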
This is ultimately the answer. If something has value, users should pay for it. We haven't had a good way to do that on the web, so it has resulted in the complete shitshow that most websites are.
There's a real economic problem here: when someone scrapes your site, you're literally paying for them to use your stuff. That's messed up (and not sustainable).
It seems like a good fit for micropayments. They never took off with people but machines may be better suited for them.
Rate limiting is the first step before cutting everything off behind forced logins.
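The classic starting point is a per-client token bucket, which permits short bursts while capping the sustained rate. A minimal sketch (names and parameters are illustrative):

```python
import time

class TokenBucket:
    """Per-client token bucket: allow `rate` requests/sec with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, then spend one if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)  # 5 req/s sustained, burst of 10
results = [bucket.allow() for _ in range(12)]
# the first ~10 succeed (the burst); the rest are rejected until tokens refill
```

The catch, as noted elsewhere in this thread, is that per-IP buckets do little against crawlers that spread their load across thousands of IPs at one request each.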
> This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly
FYI Cloudflare has a very usable free tier that’s easy to set up. It’s not limited to large websites.
I get the feeling that I'm going to read a blog post in a few years telling us that the CDN companies have been selling everything pulled through their cache to the AI companies since 2022
What exactly should be rate-limited, though? See the discussion here -- https://news.ycombinator.com/item?id=43422413 -- the traffic at issue in that case (and in one that I'm dealing with myself) is from a large number of IPs making no more than a single request each.
Linked in the article that this article links to is a project I found interesting for combating this problem: a (non-crypto) proof-of-work challenge for new visitors https://github.com/TecharoHQ/anubis
It doesn't sound like the scrapers are that smart yet, but when they get there, presumably you'd just lower the cookie lifetime until the requests are down to an acceptable level. It takes a split-second in my browser so it shouldn't interfere much for human visitors.
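One way that cookie-lifetime knob could work, as a sketch (a generic HMAC-signed token with an embedded expiry, not Anubis's actual scheme): tightening the lifetime is then just changing one number, forcing bots to re-solve the challenge more often.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical; keep out of source control

def issue_pass(lifetime_s: int) -> str:
    """Issue a signed 'challenge solved' cookie valid for lifetime_s seconds."""
    expiry = str(int(time.time()) + lifetime_s)
    sig = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
    return f"{expiry}.{sig}"

def check_pass(token: str) -> bool:
    """Valid iff the signature matches and the expiry is still in the future."""
    try:
        expiry, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expiry) > time.time()
```

Because the expiry is inside the signed payload, a client can't extend its own pass; the server trades a shorter lifetime for more frequent proof-of-work on both humans and bots.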
Good bots: search engine crawlers that help users find relevant information. These bots have been around since the early days of the internet and generally follow established best practices like robots.txt and rate limits. AI agents like OpenAI's Operator or Anthropic's Computer Use probably also fit into that bucket, as they offer useful automation without negative side effects.
Bad bots: bots that negatively affect website owners by causing higher costs, spam, or downtime (automated account creation, ad fraud, or DDoS). AI crawlers fit into that bucket as they disregard robots.txt and spoof user agents. They create a lot of headaches for developers responsible for maintaining heavily crawled sites. AI companies don't seem to care about any of the crawling best practices the industry has developed over the past two decades.
So the actual question is how good bots and humans can coexist on the web while we protect websites against abusive AI crawlers. It currently feels like an arms race without a winner.
Identifying search engine bots is pretty straightforward: the big names provide bulletproof methods to validate whether a client claiming to be their bot really is their bot. It'll be an uphill battle for new search engines if everyone only trusts Googlebot and Bingbot, though.
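The documented method (Google's, and Bing's works the same way) is a double DNS check: reverse-resolve the requesting IP, confirm the hostname belongs to the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch, with resolver functions injectable so it can be exercised without network access:

```python
import socket

def is_verified_googlebot(
    ip: str,
    reverse_dns=lambda ip: socket.gethostbyaddr(ip)[0],
    forward_dns=socket.gethostbyname,
) -> bool:
    """Reverse-DNS the IP, check the domain, then forward-confirm it.

    Note: a production check should handle hosts with multiple A records;
    this sketch compares against a single forward-resolved address.
    """
    try:
        host = reverse_dns(ip)
    except OSError:
        return False  # no PTR record -- not a verifiable bot
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False  # spoofed user agent from an unrelated host
    try:
        return forward_dns(host) == ip
    except OSError:
        return False
```

The forward-confirmation step is what defeats spoofing: anyone can set a PTR record claiming to be `googlebot.com`, but only Google controls the forward zone that resolves those hostnames back to their IPs.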
> How long until scrapers start hammering Mastodon servers?
Mastodon has AUTHORIZED_FETCH and DISALLOW_UNAUTHENTICATED_API_ACCESS which would at least stop these very naive scrapers from getting any data. Smarter scrapers could actually pretend to speak enough ActivityPub to scrape servers, though.
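For reference, both of those are environment variables in Mastodon's `.env.production` (exact availability depends on your Mastodon version):

```shell
# .env.production -- require signed HTTP requests for ActivityPub fetches
AUTHORIZED_FETCH=true
# also lock down the unauthenticated REST API
DISALLOW_UNAUTHENTICATED_API_ACCESS=true
```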
The argument the AI companies are making is that training for LLMs is fair use which means a copyright statement means fuck all from their point of view. (Even if it does, assuming you're in the US, unless you register the copyright with the US copyright office, you can only sue for actual damages, which means the cost of filing a lawsuit against them--not even litigating, just the court fee for saying "I have a lawsuit"--would be more expensive than anything you could recover. Even if you did register and sued for statutory damages, the cost of litigation would probably exceed the recovery you could expect.)
Of course, the big AI companies are already trying to get the government to codify AI training as fair use and sidestep the litigation which doesn't seem to be going entirely their way on this matter (cf. https://arstechnica.com/google/2025/03/google-agrees-with-op...).
In addition, we need to start paying attention to the growing body of legislation and case law about AI and copyright. There was an article on HN I think this week (or last) where a judge ruled AI cannot own copyright on its generated materials.
IANAL, but I do wonder how this ruling will be used as a point of reference whenever we finally ask the question "Does material produced by GenAI violate copyright laws?" Specifically if it cannot claim ownership, a right that we've awarded to trees and monkeys, how does it operate within ownership laws?
And don't even get me ranting about HUMAN digital rights or Personified AIs.
Fair use requires transformation. LLM is as transformative as it gets. If I'm on the jury, you're going to have to make new copyright law for me to convict.
I am personally happy to have everyone, people and LLM alike, learn from my wisdom.
Copyright is for topics like redistribution of the source material. You can’t add arbitrary terms to a copyright claim that go beyond what copyright law supports.
I think you’re confusing copyright with a EULA. You would need users to agree to the EULA terms before viewing the material. You can’t hide contractual obligations in the footer of your website and call it copyright.
What if my index page says "These are the EULA terms; by clicking "Next" or "Enter", you are accepting them", and an LLM scraper "clicks" Next to fetch the rest of the content?
It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license. This is what the https://githubcopilotlitigation.com/ class action (from 2022) is about, and it's still making its way through the courts.
This prediction market has it at 12% likely to succeed, suggesting that courts will not agree with you: https://manifold.markets/JeffKaufman/will-the-github-copilot...
> It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license.
I would say it's not reasonably likely that LLM training is fair use. Because I've read the most recent SCOTUS decision on fair use (Warhol), and enough other decisions on fair use, to understand that the primary (and nearly only, in practice) factor is the effect on the market for the original. And AI companies seem to be going out of their way to emphasize that LLM training is only going to destroy the market for the originals, which weighs against fair use. Not to mention the existence of deals licensing content for LLM training which... basically concedes the point.
Of the various options, a ruling that LLM training is fair use I find the least likely. More likely is either that LLM training is not fair use, that LLM training is not infringing in the first place, or that the plaintiffs can't prove that the LLM infringed their work.
> This prediction market has it at 12% likely to succeed
Randos on the internet with a betting addiction are distinctly different from a court of law. I wish people would stop talking about prediction markets as if they mattered.
this isn't about copyright but about computer access. the CFAA is extremely broad; if you ban LLM companies from access on grounds of purpose, you have every legal right to do so.
in theory that legislation has teeth, too. they are not allowed to access your system if you say they are not; authentication is irrelevant.
every GET request to a system that doesn't permit access for training data is a felony
The only reason copyright is so strong in the US is that there are big players (Disney, Elsevier) who benefit from it. But big tech is much bigger, and LLMs have created a situation where big tech has a vested interest in eroding copyright law. Both sides are gearing up for a war in the court systems, and it's definitely not a given who will win. But if you try to enter the fray as an individual or small company, you definitely aren't going to win.
The reality is that a lot of these small websites have very permissive licenses. I really hope we don't get to the point where we must all make our licenses stricter.
The reality is that none of these LLM scrapers give a damn about copyright, because the entire AI industry is built on flagrant copyright violation, and the premise that they can be stopped by a magic string is laughable.
You could sue, if you can afford it, meanwhile all of your data is already training their models.
Sure, because Meta certainly followed copyright law to the letter when they torrented thousands of copyrighted books from hundreds of published and known authors to train Llama. Forgive me if I doubt a text disclaimer on the page will slow them down.
Unfortunately copyright is no limit to these companies.
Meta is arguing in court that knowingly downloading pirated content is perfectly fine (ref: https://news.ycombinator.com/item?id=43125840), so they for one would have absolutely no issue completely ignoring your copyright notice and stated licensing costs. Good luck affording a legal team to try to force them to pay attention.
Copyright is something for them to beat us with, not the other way around, apparently.
<https://thelibre.news/foss-infrastructure-is-under-attack-by...>
Related HN discussion (645 points, 394 comments): <https://news.ycombinator.com/item?id=43422413>
It's reminiscent, perhaps, of the feel and motivation for Tumblr reblogs - and Tumblr continues to be vibrant by virtue of this culture: https://www.tumblr.com/engineering/189455858864/how-reblogs-... (2019)
"L402" is an interesting proposal. Paying a fraction of a penny per request. https://github.com/l402-protocol/l402
"Hey, we'd happily give these companies clean data if they just paid us instead of building these scrapers."
I think there is a psychological aspect that made micropayments never work for humans, but machines may be better suited for them.
L402 can help here.
https://l402.org
I think the paying approach is superior (after all, you make money from people using your service), but Cloudflare is a more straightforward/simpler one.
[0] https://news.ycombinator.com/item?id=42953508
A user running an online casino claimed that Cloudflare abruptly terminated their service after they refused to upgrade to a $10,000/month enterprise plan. The user alleged that Cloudflare failed to communicate the reasons clearly and deleted their account without warning.
Quote: "Cloudflare wanted them to use the BYOIP features of the enterprise plan, and did not want them on Cloudflare's IPs. The solution was to aggressively sell the Enterprise plan, and in a stunning failure of corporate communication, not tell the customer what the problem was at all."
——
Tell HN: Don't Use Cloudflare: https://news.ycombinator.com/item?id=31336515
Summary: A user shared their experience of being forced to upgrade to a $3,000/month plan after using 200-300TB of bandwidth on Cloudflare's business plan. They criticized Cloudflare's lack of transparency regarding bandwidth limits and aggressive sales tactics.
Quote: "A lot of this stuff wasn't communicated when we signed up for the business plan. There was no mention of limits, nor any contracts nor fineprint."
——
Tell HN: Impassable Cloudflare challenges are ruining my browsing experience: https://news.ycombinator.com/item?id=42577076
Summary: A user expressed frustration with Cloudflare's bot protection challenges, which made it difficult for them to unsubscribe from emails or access websites. They highlighted how these challenges disproportionately affect privacy-conscious users with non-standard browser configurations.
Quote: "The 'unsubscribe' button in Indeed's job notification emails leads me to an impassable Cloudflare challenge. That's a CAN-SPAM act violation."
Looks like the GNOME Gitlab instance implements it: https://gitlab.gnome.org/GNOME
1. headless browser
2. get cookie
3. use cookie on subsequent plain requests
https://developers.google.com/search/docs/crawling-indexing/...
https://www.bing.com/webmasters/help/verifying-that-bingbot-...
Sad that things are getting to this point. Maybe I should add this to my site :)
(c) Copyright (my email), if used for any form of LLM processing, you must contact me and pay 1000USD per word from my site for each use.