A lot of less scrupulous crawlers just seem to imitate the big ones. I feel like a lot of people make assumptions based on the user agent, because the user agent has to be truthful, right?
My fave method is still just to have a bait entry in robots.txt that serves a gzip bomb and autoblocks all further requests from whoever fetches it. It was real easy to configure in Caddy and tends to catch the worst offenders.
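Not the commenter's Caddy config, just a minimal Python sketch of the gzip-bomb half of that idea, assuming you pre-build the payload once and have your web server return it with a `Content-Encoding: gzip` header for the bait URL (the file name and size here are made up):

```python
# Hypothetical sketch: pre-build a small gzip payload that inflates to a huge
# response on the client, to be served only to clients that hit the bait URL.
import gzip

def build_gzip_bomb(decompressed_mb: int = 100) -> bytes:
    """Return a gzip payload that expands to `decompressed_mb` MB of zeros."""
    return gzip.compress(b"\0" * (decompressed_mb * 1024 * 1024), compresslevel=9)

if __name__ == "__main__":
    bomb = build_gzip_bomb()
    # Zeros compress extremely well, so the on-wire payload stays tiny.
    with open("bomb.gz", "wb") as f:
        f.write(bomb)
    print(f"{len(bomb)} bytes on disk, 100 MB when the client decompresses it")
```

Whether the crawler actually decompresses it is, of course, up to the crawler.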
Not excusing the bot behaviour, but if a few bots can blindly take down your site, an intentionally malicious offender would have a field day.
Maybe I'm just a different generation than the folks writing these blog posts, but I really don't understand the fixation on such low resource usage.
It's like watching a grandparent freak out over not turning off an LED light or seeing them drive 15 miles to save 5c/gallon on gas.
20 requests per second is just... nothing.
Even if you're dynamically generating them all (and seriously, why? The time would have been far better spent adding some caching than on this effort), it's just not much demand.
I get that the "fuck the bots" style posts are popular in the Zeitgeist at the moment, but this is hardly novel.
There are a lot more productive ways to handle this that waste a lot less of your time.
2. Not all requests are created equal. 20 requests a second for the same static HTML file? No problem. But if you have, say, a release page for an open source project with binary download links for all past versions across multiple platforms, each one a multi-megabyte blob, and a scraper starts hitting those links, you will run into bandwidth problems very quickly, unless you live in a utopia where bandwidth is free.
3. You are underestimating the difficulty of caching dynamic pages. Cache invalidation is hard, they say. One notably problematic example is Git blame. So far I am not aware of any existing solution for caching blames, and jury-rigging your own will likely not be any easier than the “solution” explored in TFA.
A friend of mine had over 1000 requests/sec on his Gitea at peaks. Also, you aren't taking into account that some of us don't have a "server", just some shitbox computer in the basement.
This isn't about a mere dozen requests. It gets pretty bad, and it disrupts his life, too.
I sympathize with the general gist of your post, but I've seen many a bot generate more traffic than legitimate users on our site.
Never had an actual performance issue, but I can see why a site that generally expects very low traffic might freak out. Could they better optimize their sites? Probably; I know ours sucks big time. But in the era of autoscaling workloads on someone else's computer, a misconfigured site could rack up a big-ass bill.
It's not "fuck the bots", it's "fuck the bot owners" for using websites however they want without, at minimum, asking. Like, 'hey, cool if I use this tool to interact with your site for this and that reason?'
No, they just do it, so they can scrape data. AI at this point has hit the cap on what existing knowledge it can consume, so live updates and new information are what's most valuable to them.
So they find tricky, evil ways to hammer resources that we as site operators own, using our site data for their profit, their success, their benefit, all while blatantly saying 'screw you' as they ignore robots.txt or pretend to be legitimate users.
There's a digital battlefield out there. Clients come in looking like real users, using residential IP lists like the ones from https://infatica.io/
A writeup posted to HN about it:
https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-...
A system and site operator has every right to build the tools they want to protect their systems and data and to provide a user experience that benefits their audience.
Your points are valid and make sense, but it's not about that. It's about valuing authentic work and intellectual property: some dweeb who wants to steal it doesn't get to just run their bots against other people's resources, to others' detriment and their own benefit.
Some of us have little money or have optimized for something else. I spent a good chunk of this year and last with hardly any groceries, so even $30 a month in hosting and CDN costs was a lot.
Another situation is an expensive resource: bandwidth hogs, CPU-heavy pages, or databases with per-CPU licensing. Some people's sites or services don't scale well, or they hit their budget limits fast.
In a high-security setup, those boxes usually have limited performance. That comes from the runtime checks, context switches, or embedded/soft processors. If you want no timing channels, you might have to disable shared caches, too.
Those systems run slow enough that whatever is in front usually needs to throttle the traffic. We'd want no wasted traffic, given costs ranging from $2,000 per chip (FPGA) to six digits per system (e.g. XTS-500 w/ STOP OS). One could say the same if it were a custom or open-source chip, like Plasma MIPS.
Many people might be in the poor category. A significant number are in the low-scalability category. The others are rare but significant.
He said in the article there were requests generating a tarball of the entire repository for each SHA in the Git tree. No matter how you slice it, that's pretty ugly.
Sure, you could require any number of (user-hostile) interactions (logins, captchas, etc.) to do expensive operations like that, but now the usability is compromised, which sucks.
Unless you're running MediaWiki.
Are there easy settings I should be messing with for that?
Can you elaborate further on this robots.txt trick? I was under the impression most AI crawlers just completely ignore robots.txt, so you may just be hitting the ones that are at least attempting to obey it?
I'm not against the idea like others here seem to be, I'm more curious about implementing it without harming good actors.
If your robots.txt has a line specifying, for example
Disallow: /private/beware.zip
and you have no links to that file from anywhere else on the site, then any request for that URL means someone or something read the robots.txt and explicitly violated it, and you can send it a zip bomb or ban the source IP or whatever.
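A minimal sketch of that trap, not anything from the thread: it assumes a common/combined-format access log at a hypothetical path and a hypothetical bait path, tails the log, and prints each IP that requests the disallowed URL so you can feed it to a firewall or deny list.

```python
# Hypothetical sketch: watch an access log for hits on the robots.txt bait path
# and emit the offending IPs (e.g. to pipe into ipset/nftables/fail2ban).
import re
import time

BAIT_PATH = "/private/beware.zip"          # the path listed under Disallow:
LOG_FILE = "/var/log/access.log"           # assumed combined-format access log
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def follow(path):
    """Yield new lines appended to the log file, tail -f style."""
    with open(path) as f:
        f.seek(0, 2)                        # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

banned = set()
for line in follow(LOG_FILE):
    m = LINE_RE.match(line)
    if m and m.group(2).startswith(BAIT_PATH):
        ip = m.group(1)
        if ip not in banned:
            banned.add(ip)
            print(f"ban {ip}")              # hand off to your actual blocking tool
```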
But in my experience the flagrant robots.txt violations aren't the real problem (half of those requests are probably humans who were curious what you're hiding, and most bots written specifically for LLMs don't even check robots.txt). The real abuse is the crawler that hits an expensive and frequently-changing URL more often than is reasonable, and the card-testers hitting payment endpoints, sometimes with excessive chargebacks. And port-scanners, but those are a minor annoyance if your network setup is decent. And email spoofers, who drag your server's reputation down if you don't set things up correctly early on and whenever you change hosts.
I run one of the largest wikis in my niche, and convincing the other people on my dev team to use gzip bombs as a defensive measure has been impossible; they are convinced it is a dangerous liability (EU-brained) and isn't worth pursuing.
Do you guys really use these things on real public-facing websites?
You might as well say gfy to anyone using chat bots, search engines and price comparison tools. They're the ones who financially incentivize the scrapers.
Giving someone a "financial incentive" to do something (by *gasp* using a search engine, or comparing prices) does not make that thing ethical or cool to do in and of itself.
I wonder where you ever got the idea that it does.
I consider the disk space issue a bug in Gitea. When someone downloads a zip, it should be able to stream the zip to the client, but instead it builds the zip in temporary space, serves it to the client, and doesn't delete it.
I solved it by marking that directory read-only. Zip downloads, obviously, won't work. If someone really wants one, they can check out the repository and make it themselves.
If I really cared, of course I'd fix the bug or make sure there's a way to disable the feature properly or only enable it for logged-in users.
Also, I server-side redirect certain user agents to https://git.immibis.com/gptblock.html . This isn't because they waste resources any more, but just because I don't like them, because what they're doing is worthless anyway, and because I can. If they really want the data in the Git repository, they can clone the Git repository instead of scraping it in a stupid way. That was always allowed.
8 requests per second isn't that much unless each request triggers intensive processing and indeed it wasn't a load on my little VPS other than the disk space bug. I blocked them because they're stupid, not because they're a DoS.
Of these, I certainly wouldn't ban Google, and probably not the others, if I wanted others to see it and talk about it.
Even if your content were being scraped for some rando's AI bot, why have a public site if you don't expect your site to be used?
Turning the lights off on the motel sign when you want people to find it is not a good way to invite people in.
Semrush misbehaved so badly for such a long time with various levels of incompetence that I have special notes in my robots.txt files going back at least 8 years and eventually got someone in legal to pay attention. I see zero value to me or any potential users of mine in letting an 'SEO' firm lumpishly trample all over my site barging out real visitors. And some of Semrush's competitors were just as bad.
The current round of AI nonsense has also been very poor. Again I had to send legal notes to investor relations and PR departments in at least one well-known case, on top of all the technical measures, to restore some sort of decorum.
My experience with them has been the same. On one of my employer's websites, the top three bots were Google, Bytedance and Semrush. It is a small website for a niche audience, not even in English, and it changes very infrequently (like once or twice a quarter). That did not stop these three bots from hammering the site every second.
Because the bot requests are consuming significant amounts of bandwidth, memory, CPU, and disk space. Like the intro says, it's just rude, and there's no reason to serve traffic to harvesters like that.
Google also runs an AI scraper, which might be what you saw represented there?
From the article, it sure seems like people across the internet are just now realizing what happens when you no longer have just 3-4 search engines responsible for crawling for data. When data becomes truly democratized, access to it increases dramatically, and we can either adjust or shelter ourselves while the world moves on without us.
Did Google never ever scrape individual commits from Gitea?
Especially when this happens?
Google is using AI to censor independent websites like mine
https://news.ycombinator.com/item?id=44124820
Sure sounds like we've reached the point where it's more of a liability!
There are also bad actors who pretend to be the Google scraper. Google once upon a time had a reputation for scraping respectfully, but if he's getting the traffic he needs with or without Googlebot, why should he care?
I've turned off logging on my servers precisely because the logs were growing too quickly due to these bots. They're that relentless; they would fill every form and even access APIs otherwise reachable only by clicking around the site. Anthropic, OpenAI and Facebook are still scraping to this day.
They're using the sign-up and sign-in forms, and also the search, and then clicking on those search results. I thought some bad actor was masquerading as AI scrapers to enumerate accounts, but the behavior is consistent with a scraper.
How else?
And to clarify: it's a part of the UI or something and only a human should be pressing it, and there's no other way to access that API or something?
AI agents exist now; there is virtually no way to distinguish between a real user and a bot if it mimics human patterns.
It is nice that the AI crawler bots honestly fill out the `User-Agent` header; I'm shocked that they were the source of that much traffic, though. 99% of all websites do not change often enough to warrant this much traffic, let alone a dev blog.
However, I've also seen reports that after getting blocked one way or another, they start crawling with browser user-agents from residential IPs. But it might also be someone else misrepresenting their crawlers as OpenAI/Amazon/Facebook/whatever to begin with.
All the reports I've heard from organizations dealing with AI crawler bots say they are not honest about their user agent and do not respect robots.txt.
"It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more." https://xeiaso.net/notes/2025/amazon-crawler/
We ended up writing rules similar to the article's, just based on frequency.
While we were rate limiting bots based on UA, we ended up also having to apply wider rules because traffic started spiking from other places.
I can't say if it's the traffic shifting, but there's definitely a large amount of automated traffic not identifying itself properly.
Across all your web properties, look at historic traffic to calculate <hits per IP> in <time period>. Then look at the new data and see how it's shifting. You should be able to separate the real traffic from the automated traffic very quickly.
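Not anyone's production tooling, just a rough sketch of that baseline comparison in Python, assuming common/combined-format access logs at hypothetical paths and treating each log file as one time period; the 10x threshold is an arbitrary illustrative choice:

```python
# Hypothetical sketch: compare <hits per IP> between a historic baseline log
# and a recent log to spot clients whose request rate has shifted sharply.
from collections import Counter

def hits_per_ip(log_path: str) -> Counter:
    """Count requests per client IP, assuming the IP is the first field
    of each access-log line (common/combined log format)."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            counts[line.split(" ", 1)[0]] += 1
    return counts

baseline = hits_per_ip("access.log.lastmonth")   # assumed historic sample
current = hits_per_ip("access.log.today")        # assumed recent sample

# Flag IPs doing 10x (or more) the busiest client in the baseline period.
baseline_max = max(baseline.values(), default=1)
for ip, n in current.most_common(20):
    if n >= 10 * baseline_max:
        print(f"{ip}\t{n} hits (baseline max was {baseline_max})")
```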
While our platform is primarily designed for live, logged-in users, it also works well for bot detection and blocking.
We anonymize IP addresses by replacing the last octet with an asterisk, effectively grouping the same subnet under a single account. You can then use the built-in rule engine to automatically generate blacklists based on specific conditions, such as excessive 500 or 404 errors, brute-force login attempts, or traffic from data center IPs.
Finally, you can integrate the tirreno blacklist API into your application logic to redirect unwanted traffic to an error page.
Bonus: a dashboard [2] is available to help you monitor activity and fine-tune the blacklist to avoid blocking legitimate users.
[1] https://github.com/tirrenotechnologies/tirreno
[2] https://play.tirreno.com/login (admin/tirreno)
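Not tirreno's actual implementation, just a minimal Python sketch of the last-octet masking described above, so that one /24 neighbourhood maps to a single key:

```python
# Hypothetical sketch of the grouping described above: mask the final octet
# of an IPv4 address so nearby addresses share one anonymized identifier.
def anonymize_ipv4(ip: str) -> str:
    """Replace the final octet with '*', e.g. '203.0.113.57' -> '203.0.113.*'."""
    prefix, _, _ = ip.rpartition(".")
    return prefix + ".*"

assert anonymize_ipv4("203.0.113.57") == "203.0.113.*"
```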
> We anonymize IP addresses by replacing the last octet with an asterisk, effectively grouping the same subnet under a single account
So as a user, not only do I have to suffer your blockwall's false positives based on "data center IPs" (i.e. most things that aren't naively browsing from the information-leaking address of a last-mile connection like some cyber-bumpkin), but if I do manage to find something that isn't blocked a priori (or manage to click through 87 squares of traffic lights), I still get lumped in with completely unrelated address-neighbors to assuage your conscience that you're not building a user surveillance system based on nonconsensually processing personal information.
Just please make sure you have enough of a feedback process that your customers can see that they are losing real customers with real dollars.
You're right, blanket blocking based on IP ranges (like Tor or DC) often creates false positives and punishes privacy-conscious users. Therefore, unlike the traditional approach of blocking an IP just because it is from a data center, tirreno suggests using a risk-based system that takes dozens or hundreds of rules into account.
As in my example, if the IP is from a data center and creates a lot of 404 errors, send it to a manual review queue or to automatic blocking (not recommended).
Personally, I prefer to manually review even bot activity, and tirreno, even if it's not directly designed for bot management, works great for this, especially in cases where bad bots hide behind legitimate bot UAs.
Increasing amounts of traffic are flowing through Google VPN if you’re on a Pixel, and on Apple, there is iCloud Private Relay. I’d have thought the address-neighbors issue would be especially likely to catch out these situations?
Overly simplistic solutions like this absolutely will actively cost you real customers and real revenue.
> While our platform is primarily designed for live, logged-in users, it also works well for bot detection and blocking.
tirreno is designed to work with logged-in users, so all actions are tied to usernames rather than IP addresses. From this perspective, we strongly avoid [1] making any decisions or taking actions based solely on IP address information.
[1] https://www.tirreno.com/bat/?post=2025-05-25
My sites get 10-20k requests a day, mostly from AI scrapers. One thing I noticed is that many look for specific PHP pages. If you don't use PHP, you might be able to autoblock any IP requesting PHP pages. If you do have PHP, block those requesting pages you don't have.
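A rough sketch of that heuristic, assuming a common/combined-format access log at a hypothetical path and a site that serves no PHP at all, so any request for a .php URL marks the client as a probe:

```python
# Hypothetical sketch: scan an access log for requests to .php URLs on a
# site with no PHP, and print the probing IPs for blocking.
from collections import Counter

LOG_FILE = "/var/log/access.log"   # assumed common/combined-format access log

offenders = Counter()
with open(LOG_FILE) as f:
    for line in f:
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split(" ")           # e.g. ['GET', '/wp-login.php', 'HTTP/1.1']
        path = request[1] if len(request) > 1 else ""
        if ".php" in path.lower():              # we serve no PHP, so any hit is a probe
            offenders[line.split(" ", 1)[0]] += 1

for ip, hits in offenders.most_common():
    print(f"{ip}\t{hits} PHP probes")           # feed into your firewall/deny list
```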
Some of us are happy to train AIs but want to block overload. For instance, I'm glad they're scraping pages about the Gospel and Biblical theology. It might help to put anything large that you don't want scraped into specific directories. Then, upon detecting a bot, block the IP from accessing those.
In my case, I also have a baseline strategy to deal with a large number of requests: text only, plain HTML/CSS presentation, other stuff externally hosted, and BunnyCDN with Perma-Cache ($10/mo + 1 penny/GB). The BunnyCDN requests go to $5/mo VMs on Digital Ocean. I didn't even notice the AI scrapers at first since (a) they didn't affect performance and (b) a month of them changed my balance from $30 to $29.99.
(Note to DO and Bunny team members that may be here: Thanks for your excellent services.)