Perversely, this submission is essentially blogspam. The article linked in its second paragraph, to which this "1 minute" read adds almost nothing of value, is the main story.
But also ironically, it's almost heartwarming these days to see blogspam that's not machine-generated! A real live human cared enough about an article to write a brief (perhaps only barely substantial, but at least handwritten) take on it!
Now, is driving attention and reputation to their site (in the broadest senses) part of a blogspammer/reblogger's motivation? Absolutely!
But should we be concerned about rewarding their act of curation, as long as there is at least some level of genuine curation intent? A world where that answer is categorically "no" would be antithetical, I think, to the concept of the participatory web.
I don't feel this is blogspam; it's more a quick comment on the situation pointing to the actual article. I don't see anything wrong with writing a short post boosting or commenting on another article. There are no ads, so I don't see this as blogspam, which I associate with financial gain or clout.
All the time I see links on the HN front page to Twitter and Mastodon posts with just as little text to them. Why does it upset you when it is in the medium of blogs, but not microblogs?
Hehe, just participating in POSSE :) Funnily enough the story you're linking to quotes me with pictures of a story I wrote (https://sethmlarson.dev/slop-security-reports) about LLM-generated reports to open source projects.
this is basically what they are doing, but instead of charging actual money they are making visitors' CPUs spin, ideally on a proof-of-work problem, which has the same outcome from the crawler's perspective.
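For the curious, the core mechanism is hashcash-style: the client must find a nonce whose hash meets a difficulty target, which costs the client real CPU time but costs the server a single hash to check. A minimal sketch (illustrative only, not any particular project's implementation):

```python
import hashlib
import itertools

def solve_challenge(seed: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(seed + nonce) has `difficulty` leading hex zeros.

    Expected work grows by 16x per unit of difficulty -- this is the part
    the visitor's browser has to burn CPU on.
    """
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(seed: str, nonce: int, difficulty: int) -> bool:
    """Verification is one hash -- essentially free for the server."""
    digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve_challenge("example-challenge", 4)
assert verify("example-challenge", nonce, 4)
```

The asymmetry is the whole point: a human's browser pays the cost once per visit, while a crawler hammering millions of pages pays it millions of times.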
This is ultimately the answer. If something has value, users should pay for it. We haven't had a good way to do that on the web, so it has resulted in the complete shitshow that most websites are.
There's a real economic problem here: when someone scrapes your site, you're literally paying for them to use your stuff. That's messed up (and not sustainable).
It seems like a good fit for micropayments. They never took off with people but machines may be better suited for them.
Rate limiting is the first step before cutting everything off behind forced logins.
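The classic starting point is a per-client token bucket, which permits short bursts while capping the sustained rate. A minimal sketch (names and parameters are illustrative):

```python
import time

class TokenBucket:
    """Per-client token bucket: allow `rate` requests/sec with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, then spend one if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)  # 5 req/s sustained, burst of 10
results = [bucket.allow() for _ in range(12)]
# the first ~10 succeed (the burst); the rest are rejected until tokens refill
```

The catch, as noted elsewhere in this thread, is that per-IP buckets do little against crawlers that spread their load across thousands of IPs at one request each.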
> This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly
FYI Cloudflare has a very usable free tier that’s easy to set up. It’s not limited to large websites.
I get the feeling that I'm going to read a blog post in a few years telling us that the CDN companies have been selling everything pulled through their cache to the AI companies since 2022
What exactly should be rate-limited, though? See the discussion here -- https://news.ycombinator.com/item?id=43422413 -- the traffic at issue in that case (and in one that I'm dealing with myself) is from a large number of IPs making no more than a single request each.
Linked in the article that this article links to is a project I found interesting for combating this problem: a (non-crypto) proof-of-work challenge for new visitors https://github.com/TecharoHQ/anubis
It doesn't sound like the scrapers are that smart yet, but when they get there, presumably you'd just lower the cookie lifetime until the requests are down to an acceptable level. It takes a split-second in my browser so it shouldn't interfere much for human visitors.
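One way that cookie-lifetime knob could work, as a sketch (a generic HMAC-signed token with an embedded expiry, not Anubis's actual scheme): tightening the lifetime is then just changing one number, forcing bots to re-solve the challenge more often.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical; keep out of source control

def issue_pass(lifetime_s: int) -> str:
    """Issue a signed 'challenge solved' cookie valid for lifetime_s seconds."""
    expiry = str(int(time.time()) + lifetime_s)
    sig = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
    return f"{expiry}.{sig}"

def check_pass(token: str) -> bool:
    """Valid iff the signature matches and the expiry is still in the future."""
    try:
        expiry, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expiry) > time.time()
```

Because the expiry is inside the signed payload, a client can't extend its own pass; the server trades a shorter lifetime for more frequent proof-of-work on both humans and bots.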
Good bots: search engine crawlers that help users find relevant information. These bots have been around since the early days of the internet and generally follow established best practices like robots.txt and rate limits. AI agents like OpenAI's Operator or Anthropic's Computer Use probably also fit into that bucket, as they offer useful automation without negative side effects.
Bad bots: bots that negatively affect website owners by causing higher costs, spam, or downtime (automated account creation, ad fraud, or DDoS). AI crawlers fit into that bucket as they disregard robots.txt and spoof user agents. They create a lot of headaches for developers responsible for maintaining heavily crawled sites. AI companies don't seem to care about any of the crawling best practices the industry has developed over the past two decades.
So the actual question is how good bots and humans can coexist on the web while we protect websites against abusive AI crawlers. It currently feels like an arms race without a winner.
Identifying search engine bots is pretty straightforward: the big names provide bulletproof methods to validate whether a client claiming to be their bot really is their bot. It'll be an uphill battle for new search engines if everyone only trusts Googlebot and Bingbot, though.
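The documented method (Google's, and Bing's works the same way) is a double DNS check: reverse-resolve the requesting IP, confirm the hostname belongs to the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch, with resolver functions injectable so it can be exercised without network access:

```python
import socket

def is_verified_googlebot(
    ip: str,
    reverse_dns=lambda ip: socket.gethostbyaddr(ip)[0],
    forward_dns=socket.gethostbyname,
) -> bool:
    """Reverse-DNS the IP, check the domain, then forward-confirm it.

    Note: a production check should handle hosts with multiple A records;
    this sketch compares against a single forward-resolved address.
    """
    try:
        host = reverse_dns(ip)
    except OSError:
        return False  # no PTR record -- not a verifiable bot
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False  # spoofed user agent from an unrelated host
    try:
        return forward_dns(host) == ip
    except OSError:
        return False
```

The forward-confirmation step is what defeats spoofing: anyone can set a PTR record claiming to be `googlebot.com`, but only Google controls the forward zone that resolves those hostnames back to their IPs.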
> How long until scrapers start hammering Mastodon servers?
Mastodon has AUTHORIZED_FETCH and DISALLOW_UNAUTHENTICATED_API_ACCESS which would at least stop these very naive scrapers from getting any data. Smarter scrapers could actually pretend to speak enough ActivityPub to scrape servers, though.
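For reference, both of those are environment variables in Mastodon's `.env.production` (exact availability depends on your Mastodon version):

```shell
# .env.production -- require signed HTTP requests for ActivityPub fetches
AUTHORIZED_FETCH=true
# also lock down the unauthenticated REST API
DISALLOW_UNAUTHENTICATED_API_ACCESS=true
```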
The argument the AI companies are making is that training for LLMs is fair use which means a copyright statement means fuck all from their point of view. (Even if it does, assuming you're in the US, unless you register the copyright with the US copyright office, you can only sue for actual damages, which means the cost of filing a lawsuit against them--not even litigating, just the court fee for saying "I have a lawsuit"--would be more expensive than anything you could recover. Even if you did register and sued for statutory damages, the cost of litigation would probably exceed the recovery you could expect.)
Of course, the big AI companies are already trying to get the government to codify AI training as fair use and sidestep the litigation which doesn't seem to be going entirely their way on this matter (cf. https://arstechnica.com/google/2025/03/google-agrees-with-op...).
In addition, we need to start paying attention to the growing body of legislation and case law about AI and copyright. There was an article on HN I think this week (or last) where a judge ruled AI cannot own copyright on its generated materials.
IANAL, but I do wonder how this ruling will be used as a point of reference whenever we finally ask the question "Does material produced by GenAI violate copyright laws?" Specifically if it cannot claim ownership, a right that we've awarded to trees and monkeys, how does it operate within ownership laws?
And don't even get me ranting about HUMAN digital rights or Personified AIs.
Fair use requires transformation. LLM is as transformative as it gets. If I'm on the jury, you're going to have to make new copyright law for me to convict.
I am personally happy to have everyone, people and LLM alike, learn from my wisdom.
Copyright is for topics like redistribution of the source material. You can’t add arbitrary terms to a copyright claim that go beyond what copyright law supports.
I think you’re confusing copyright with a EULA. You would need users to agree to the EULA terms before viewing the material. You can’t hide contractual obligations in the footer of your website and call it copyright.
What if my index page says "These are the EULA terms; by clicking "Next" or "Enter", you are accepting them", and an LLM scraper "clicks" Next to fetch the rest of the content?
It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license. This is what the https://githubcopilotlitigation.com/ class action (from 2022) is about, and it's still making its way through the courts.
This prediction market has it at 12% likely to succeed, suggesting that courts will not agree with you: https://manifold.markets/JeffKaufman/will-the-github-copilot...
> It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license.
I would say it's not reasonably likely that LLM training is fair use. Because I've read the most recent SCOTUS decision on fair use (Warhol), and enough other decisions on fair use, to understand that the primary (and nearly only, in practice) factor is the effect on the market for the original. And AI companies seem to be going out of their way to emphasize that LLM training is only going to destroy the market for the originals, which weighs against fair use. Not to mention the existence of deals licensing content for LLM training which... basically concedes the point.
Of the various options, a ruling that LLM training is fair use I find the least likely. More likely is either that LLM training is not fair use, that LLM training is not infringing in the first place, or that the plaintiffs can't prove that the LLM infringed their work.
> This prediction market has it at 12% likely to succeed
Randos on the internet with a betting addiction are distinctly different from a court of law. I wish people would stop talking about prediction markets as if they mattered.
this isn't about copyright but about computer access. the CFAA is extremely broad; if you ban LLM companies from access on grounds of purpose, you have every legal right to do so.
in theory that legislation has teeth, too. they are not allowed to access your system if you say they are not; authentication is irrelevant.
every GET request to a system that doesn't permit access for training data is a felony
The only reason copyright is so strong in the US is that there are big players (Disney, Elsevier) who benefit from it. But big tech is much bigger, and LLMs have created a situation where big tech has a vested interest in eroding copyright law. Both sides are gearing up for a war in the court systems, and it's definitely not a given who will win. But if you try to enter the fray as an individual or small company, you definitely aren't going to win.
The reality is that a lot of these small websites have very permissive licenses. I really hope we don't get to the point where we must all make our licenses stricter.
The reality is that none of these LLM scrapers give a damn about copyright, because the entire AI industry is built on flagrant copyright violation, and the premise that they can be stopped by a magic string is laughable.
You could sue, if you can afford it, meanwhile all of your data is already training their models.
Sure, because Meta certainly followed copyright law to the letter when they torrented thousands of copyrighted books from hundreds of published and known authors to train Llama. Forgive me if I doubt a text disclaimer on the page will slow them down.
Unfortunately copyright is no limit to these companies.
Meta is arguing in court that knowingly downloading pirated content is perfectly fine (ref: https://news.ycombinator.com/item?id=43125840), so they for one would have absolutely no issue completely ignoring your copyright notice and stated licensing costs. Good luck affording a legal team to try to force them to pay attention.
Copyright is something for them to beat us with, not the other way around, apparently.
<https://thelibre.news/foss-infrastructure-is-under-attack-by...>
Related HN discussion (645 points, 394 comments): <https://news.ycombinator.com/item?id=43422413>
It's reminiscent, perhaps, of the feel and motivation for Tumblr reblogs - and Tumblr continues to be vibrant by virtue of this culture: https://www.tumblr.com/engineering/189455858864/how-reblogs-... (2019)
"L402" is an interesting proposal. Paying a fraction of a penny per request. https://github.com/l402-protocol/l402
"Hey, we'd happily give these companies clean data if they just paid us instead of building these scrapers."
I think there is a psychological aspect that made micropayments never work for humans, but machines may be better suited for them.
L402 can help here.
https://l402.org
I think the paying approach is superior (after all, you make money from people using your service), but Cloudflare is a more straightforward/simpler one.
[0] https://news.ycombinator.com/item?id=42953508
A user running an online casino claimed that Cloudflare abruptly terminated their service after they refused to upgrade to a $10,000/month enterprise plan. The user alleged that Cloudflare failed to communicate the reasons clearly and deleted their account without warning.
Quote: "Cloudflare wanted them to use the BYOIP features of the enterprise plan, and did not want them on Cloudflare's IPs. The solution was to aggressively sell the Enterprise plan, and in a stunning failure of corporate communication, not tell the customer what the problem was at all."
——
Tell HN: Don't Use Cloudflare: https://news.ycombinator.com/item?id=31336515
Summary: A user shared their experience of being forced to upgrade to a $3,000/month plan after using 200-300TB of bandwidth on Cloudflare's business plan. They criticized Cloudflare's lack of transparency regarding bandwidth limits and aggressive sales tactics.
Quote: "A lot of this stuff wasn't communicated when we signed up for the business plan. There was no mention of limits, nor any contracts nor fineprint."
——
Tell HN: Impassable Cloudflare challenges are ruining my browsing experience: https://news.ycombinator.com/item?id=42577076
Summary: A user expressed frustration with Cloudflare's bot protection challenges, which made it difficult for them to unsubscribe from emails or access websites. They highlighted how these challenges disproportionately affect privacy-conscious users with non-standard browser configurations.
Quote: "The 'unsubscribe' button in Indeed's job notification emails leads me to an impassable Cloudflare challenge. That's a CAN-SPAM act violation."
Looks like the GNOME Gitlab instance implements it: https://gitlab.gnome.org/GNOME
1. headless browser
2. get cookie
3. use cookie on subsequent plain requests
https://developers.google.com/search/docs/crawling-indexing/...
https://www.bing.com/webmasters/help/verifying-that-bingbot-...
Sad that things are getting to this point. Maybe I should add this to my site :)
(c) Copyright (my email), if used for any form of LLM processing, you must contact me and pay 1000USD per word from my site for each use.