AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who are then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public. That same public also gets to pay higher electricity bills, because keeping humans warm is not as profitable as a machine that converts electricity directly into stock-price rises.
I'm as far from being an AI enthusiast as anyone can be, but this issue has nothing to do with AI specifically. It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the established conventions (respecting robots.txt, using a proper UA string, rate limiting, whatever). This situation could easily have happened before the AI boom, for different reasons.
I strongly believe that AI companies are running a DDoS attack on the open web. Making websites go down aligns with their interests: it removes training data that competitors could use, and it removes sources for humans to browse, making us even more reliant on chatbots to find anything.
If it was crap coding, then the bots wouldn't have so many mechanisms to circumvent blocks. Once you block the OpenAI IP ranges, they start using residential proxies. Once you block their UA strings, they start impersonating other crawlers or browsers.
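For illustration, the first step of that arms race, blocking published IP ranges, is trivial on the server side, which is why the escalation to residential proxies is telling. A minimal sketch using Python's `ipaddress` module (the CIDR blocks below are reserved documentation prefixes, not OpenAI's actual ranges):

```python
import ipaddress

# Illustrative CIDR ranges only (RFC 5737 documentation prefixes),
# not any vendor's real published crawler ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(ip: str) -> bool:
    """True if the client address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

The point of the comment stands: this check is cheap, so crawlers that still get through are choosing to get through.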
"It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the enstablished [sic] conventions (respecting robots.txt, using proper UA string, rate limiting, whatever)."
How does a "proper UA string" solve this "blowing up websites" problem?
The only thing that matters with respect to the "blowing up websites" problem is rate-limiting, i.e., behaviour.
"Shitty crawlers" are a nuisance because of their behaviour, i.e., their request rate, not because of whatever UA string they send; the behaviour is what is "shitty", not the UA string. The two are not necessarily correlated, and any heuristic that naively assumes they are is inviting failure.
"Spoofed" UA strings have been facilitated and expected since the earliest web browsers.
To borrow the parent's phrasing, the "blowing up websites" problem has nothing to do with the UA string specifically.
It may have something to do with website operators' reluctance to set up rate-limiting, though; this despite the widespread implementation of "web APIs" that use rate-limiting.
NB. I'm not suggesting rate-limiting is a silver bullet. I'm suggesting that without rate-limiting, the UA string as a means of addressing the "blowing up websites" problem is inviting failure.
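To make the behaviour-based point concrete, here is a rough sketch of the kind of rate-limiting being described: a per-IP token bucket that ignores the UA string entirely. All names and thresholds are illustrative, not from any particular server:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-IP token bucket: allow bursts up to `capacity`, refill at `rate`/sec."""

    def __init__(self, rate=5.0, capacity=10.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = defaultdict(lambda: capacity)
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip):
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill proportionally to the time since this client's last request.
        self.tokens[client_ip] = min(
            self.capacity, self.tokens[client_ip] + elapsed * self.rate
        )
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False  # over the limit; the server would answer 429
```

A crawler that lies about its UA but behaves politely passes this check; a crawler with an honest UA that hammers the site does not, which is exactly the distinction the comment is drawing.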
This isn't AI damaging anything. This is corporations damaging things. Same as it ever was. No need for sci-fi non-human persons when legal corporate persons exist. They latch on to whatever big new thing in tech comes along that people don't understand, brand themselves with it, and cause damage trying to make money, even if they mostly fail at it. And most actual humans only ever see or interact with the scammy corporate versions of $techthing, and so come to believe $techthing = corporate behavior.
And as for denying service and preventing human people from visiting websites: Cloudflare does more of that damage in a single day than all these "AI"-associated corporations and their crappy crawlers have in years.
> This isn't AI damaging anything. This is corporations damaging things.
This is corporations damaging things because of AI. Corporations will damage things for other reasons too, but the only reason they are breaking the internet in this way, at this time, is AI.
I think the "AI doesn't kill websites, corporations kill websites" argument is as flawed as the "Guns don't kill people, people kill people" argument.
Cloudflare exists because people can't be good stewards of the internet.
> This isn't AI damaging anything. This is corporations damaging things
This is the guns don't kill people, people kill people argument. The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.
If you don't want to receive data, don't. If you don't want to send data, don't. No one is asking you to receive traffic from my IPs or send to my IPs. You've just configured your server one way.
Or to use a common HN aphorism “your business model is not my problem”. Disconnect from me if you don’t want my traffic.
I don't know if I want your traffic until I see what your traffic is.
You want to look at one of our git commits? Sure! That's what our web-fronted git repo is for. Go right ahead! Be our guest!
Oh ... I see. You want to download every commit in our repository. One by one, when you could have used git clone. Hmm, yeah, I don't want your traffic.
But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.
I can't put it more politely than this: fuck off. Do not pass go. Do not collect stock options. Go to hell, and stay there.
> Disconnect from me if you don’t want my traffic.
The problem is precisely that that is not possible. It is very well known that these scrapers aren’t respecting the wishes of website owners and even circumvent blocks any way they can. If these companies respected the website owners’ desires for them to disconnect, we wouldn’t be having this conversation.
My worst offender for scraping one of my sites was Anthropic. I deployed an AI tar pit (https://news.ycombinator.com/item?id=42725147) to see what it would do with it, and Anthropic's crawler kept scraping it for weeks. Going by the logs, I think I wasted nearly a year of their time in total, because they were crawling in parallel. Other scrapers weren't so persistent.
For me it was OpenAI. GPTBot hammered my honeypot with 0.87 requests per second for about 5 weeks. Other crawlers made up only 2% of the traffic. 1.8 million requests, 4 GiB of traffic. Then it just abruptly stopped for whatever reason.
My book discovery website shepherd.com is getting hammered every day by AI crawlers (and crashing often)... my security lists in Cloudflare are ridiculous and the bots are getting smarter.
Put a honeypot link in your site that only robots will hit, because it's hidden. Make sure it's not in robots.txt, or explicitly disallow it there if you can. Set up a rule so that any IP that hits that link gets a one-day ban via fail2ban or the like.
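A rough sketch of the detection half of that suggestion: scan an access log for requests to the hidden path and collect offending IPs. The path, log format, and function names here are all made up for illustration; in practice fail2ban's own failregex does this job directly:

```python
import re

# Hypothetical hidden path; any client requesting it is assumed to be a bot,
# since no human-visible link points at it.
HONEYPOT_PATH = "/hidden-trap/"

# Leading IP plus request line of a common-log-format entry.
LINE_RE = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}) .*"(?:GET|POST|HEAD) (\S+)')

def ips_to_ban(log_lines):
    """Collect client IPs that requested the honeypot path."""
    offenders = set()
    for line in log_lines:
        m = LINE_RE.match(line)
        if m and m.group(2).startswith(HONEYPOT_PATH):
            offenders.add(m.group(1))
    return offenders
```

The resulting set would then be fed to whatever ban mechanism is in place (fail2ban, a firewall rule, a Cloudflare list).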
If the publicly accessible part of the database isn't updated often, see whether you can put a caching strategy in front and let Cloudflare take the hit.
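Concretely, "letting Cloudflare take the hit" mostly comes down to sending cacheable response headers so the CDN edge absorbs repeat requests. A minimal sketch, with TTL values that are pure guesses rather than recommendations:

```python
def cache_headers(changes_often: bool) -> dict:
    """Pick a Cache-Control header so a CDN edge absorbs repeat requests."""
    if changes_often:
        # Short edge TTL, with stale-while-revalidate to smooth over refreshes.
        return {"Cache-Control": "public, max-age=60, stale-while-revalidate=300"}
    # Rarely-updated pages can sit at the edge for a day.
    return {"Cache-Control": "public, max-age=86400"}
```

With headers like these, even a crawler requesting the same pages thousands of times a minute mostly hits the CDN rather than the origin server.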
Yep, all but one page type is heavily cached at multiple levels. We are working to get the rest and improve it further... it's just annoying that they don't even respect limits.
At this point I'd take a thermostat that can tell when my dashboard starts getting heated (it's always the same culprits causing these server spikes) and flicks attack mode on in Cloudflare... it's so ridiculous trying to run anything that isn't a WordPress site these days.
The site is about a particular type of pipeline cleaning (think water/oil pipelines). I am certain that nobody was asking about this particular site, or even the industry it's in, 15,000 times a minute, 24 hours a day.
It's much more likely that their crawler is just garbage and got stuck into some kind of loop requesting my domain.
I suppose they just keep referring to the website in their chats, probably with the search function selected, so the crawler hits the website before every reply.
> "I don't know what this actually gives people, but our industry takes great pride in doing this"
> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."
A bit off-topic but wtf is this preview image of a spider in the eye?
It’s even worse than the clickbait title of this post.
I think this should be considered bad practice.
I fully agree, and speaking as someone with macroinsectophobia (fear of large or numerous insect-like creatures), seeing it really makes me uncomfortable. It isn't enough to send me into panic mode or anything, but damn if it doesn't freak me out.
At the same time, it's so practical to ask a question and have it open 25 pages to search and summarize the answer. That's more or less what I was trying to do by hand before. Maybe not 25 websites, because thanks to crap SEO the top 10 results contain BS content, so I curated the list, but the idea is the same, no?
My personal experience is that OpenAI's crawler was hitting a very, very low-traffic website I manage tens of thousands of times a minute, non-stop. I had to block it via Cloudflare.
I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.
Even if it is generating 39k req/minute, I would expect most of the pages to already be locally cached by Meta, or served statically by their respective hosts. We have been working hard on caching websites, and it has been more or less a solved problem for the last decade.
> "Spoofed" UA strings have been facilitated and expected since the earliest web browsers

For example, see the NCSA Mosaic source: https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
> The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.

The same incentives to do this already existed for search engine operators.
> But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.

10/10. No notes.
I wish there were a better way to solve this. I have to make sure legit bots don't get hit, as a huge percentage of the traffic that helps the project stay active comes from Google, etc.
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."

<3 <3
e.g. they'll take an entire sentence the user said and put it in quotes for no reason.
Thankfully search engines started ignoring quotes years ago, so it balances out...