AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who are then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public. That same public also gets to pay higher electricity bills, because keeping humans warm is not as profitable as a machine that converts electricity directly into stock-price rises.
I'm as far from being an AI enthusiast as anyone can be, but this issue has nothing to do with AI specifically. It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the established conventions (respecting robots.txt, using a proper UA string, rate limiting, whatever). This situation could easily have happened before the AI boom, for different reasons.
I strongly believe that AI companies are running a DDoS attack on the open web. Making websites go down aligns with their interests: it removes training data that competitors could use, and it removes sources for humans to browse, making us even more reliant on chatbots to find anything.
If it was crap coding, then the bots wouldn't have so many mechanisms to circumvent blocks. Once you block the OpenAI IP ranges, they start using residential proxies. Once you block their UA strings, they start impersonating other crawlers or browsers.
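For illustration, the first step of that arms race, blocking published IP ranges, is trivial on the server side, which is why the escalation to residential proxies is telling. A minimal sketch using Python's `ipaddress` module (the CIDR blocks below are reserved documentation prefixes, not OpenAI's actual ranges):

```python
import ipaddress

# Illustrative CIDR ranges only (RFC 5737 documentation prefixes),
# not any vendor's real published crawler ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(ip: str) -> bool:
    """True if the client address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

The point of the comment stands: this check is cheap, so crawlers that still get through are choosing to get through.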
"It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the enstablished [sic] conventions (respecting robots.txt, using proper UA string, rate limiting, whatever)."
How does a "proper UA string" solve this "blowing up websites" problem?
The only thing that matters with respect to the "blowing up websites" problem is rate-limiting, i.e., behaviour.
"Shitty crawlers" are a nuisance because of their behaviour, i.e., their request rate, not because of whatever UA string they send; the behaviour is what is "shitty", not the UA string. The two are not necessarily correlated, and any heuristic that naively assumes they are is inviting failure.
"Spoofed" UA strings have been facilitated and expected since the earliest web browsers.
To borrow the parent's phrasing, the "blowing up websites" problem has nothing to do with the UA string specifically.
It may have something to do with website operators' reluctance to set up rate-limiting, though; this despite the widespread implementation of "web APIs" that use rate-limiting.
NB. I'm not suggesting rate-limiting is a silver bullet. I'm suggesting that without rate-limiting, the UA string as a means of addressing the "blowing up websites" problem is inviting failure.
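To make the behaviour-based point concrete, here is a rough sketch of the kind of rate-limiting being described: a per-IP token bucket that ignores the UA string entirely. All names and thresholds are illustrative, not from any particular server:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-IP token bucket: allow bursts up to `capacity`, refill at `rate`/sec."""

    def __init__(self, rate=5.0, capacity=10.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = defaultdict(lambda: capacity)
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip):
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill proportionally to the time since this client's last request.
        self.tokens[client_ip] = min(
            self.capacity, self.tokens[client_ip] + elapsed * self.rate
        )
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False  # over the limit; the server would answer 429
```

A crawler that lies about its UA but behaves politely passes this check; a crawler with an honest UA that hammers the site does not, which is exactly the distinction the comment is drawing.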
This isn't AI damaging anything. This is corporations damaging things. Same as it ever was. No need for sci-fi non-human persons when legal corporate persons exist. They latch on to whatever big new thing in tech comes along that people don't understand, brand themselves with it, and cause damage trying to make money, even if they mostly fail at it. And most actual humans only ever see or interact with the scammy corporate versions of $techthing, and so come to believe $techthing = corporate behavior.
And as for denying service and preventing human people from visiting websites: Cloudflare does more of that damage in a single day than all these "AI"-associated corporations and their crappy crawlers have in years.
> This isn't AI damaging anything. This is corporations damaging things.
This is corporations damaging things because of AI. Corporations will damage things for other reasons too, but the only reason they are breaking the internet in this way, at this time, is AI.
I think the "AI doesn't kill websites, corporations kill websites" argument is as flawed as the "Guns don't kill people, people kill people" argument.
Cloudflare exists because people can't be good stewards of the internet.
> This isn't AI damaging anything. This is corporations damaging things
This is the guns don't kill people, people kill people argument. The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.
If you don't want to receive data, don't. If you don't want to send data, don't. No one is asking you to receive traffic from my IPs or send to my IPs. You've just configured your server one way.
Or to use a common HN aphorism “your business model is not my problem”. Disconnect from me if you don’t want my traffic.
I don't know if I want your traffic until I see what your traffic is.
You want to look at one of our git commits? Sure! That's what our web-fronted git repo is for. Go right ahead! Be our guest!
Oh ... I see. You want to download every commit in our repository. One by one, when you could have used git clone. Hmm, yeah, I don't want your traffic.
But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.
I can't put it more politely than this: fuck off. Do not pass go. Do not collect stock options. Go to hell, and stay there.
> Disconnect from me if you don’t want my traffic.
The problem is precisely that that is not possible. It is very well known that these scrapers aren’t respecting the wishes of website owners and even circumvent blocks any way they can. If these companies respected the website owners’ desires for them to disconnect, we wouldn’t be having this conversation.
My worst offender for scraping one of my sites was Anthropic. I deployed an AI tar pit (https://news.ycombinator.com/item?id=42725147) to see what it would do with it, and Anthropic's crawler kept scraping it for weeks. Going by the logs, I think I wasted nearly a year of their time in total, because they were crawling in parallel. Other scrapers weren't so persistent.
For me it was OpenAI. GPTBot hammered my honeypot with 0.87 requests per second for about 5 weeks. Other crawlers made up only 2% of the traffic. 1.8 million requests, 4 GiB of traffic. Then it just abruptly stopped for whatever reason.
My book discovery website shepherd.com is getting hammered every day by AI crawlers (and crashing often)... my security lists in Cloudflare are ridiculous and the bots are getting smarter.
Put a honeypot link in your site that only robots will hit, because it's hidden. Make sure it's not in robots.txt, or explicitly disallow it there if you can. Set up a rule so that any IP that hits that link gets a one-day ban via fail2ban or the like.
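A rough sketch of the detection half of that suggestion: scan an access log for requests to the hidden path and collect offending IPs. The path, log format, and function names here are all made up for illustration; in practice fail2ban's own failregex does this job directly:

```python
import re

# Hypothetical hidden path; any client requesting it is assumed to be a bot,
# since no human-visible link points at it.
HONEYPOT_PATH = "/hidden-trap/"

# Leading IP plus request line of a common-log-format entry.
LINE_RE = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}) .*"(?:GET|POST|HEAD) (\S+)')

def ips_to_ban(log_lines):
    """Collect client IPs that requested the honeypot path."""
    offenders = set()
    for line in log_lines:
        m = LINE_RE.match(line)
        if m and m.group(2).startswith(HONEYPOT_PATH):
            offenders.add(m.group(1))
    return offenders
```

The resulting set would then be fed to whatever ban mechanism is in place (fail2ban, a firewall rule, a Cloudflare list).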
If the publicly accessible part of the database isn't updated often, see whether you can put a caching strategy in front and let Cloudflare take the hit.
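Concretely, "letting Cloudflare take the hit" mostly comes down to sending cacheable response headers so the CDN edge absorbs repeat requests. A minimal sketch, with TTL values that are pure guesses rather than recommendations:

```python
def cache_headers(changes_often: bool) -> dict:
    """Pick a Cache-Control header so a CDN edge absorbs repeat requests."""
    if changes_often:
        # Short edge TTL, with stale-while-revalidate to smooth over refreshes.
        return {"Cache-Control": "public, max-age=60, stale-while-revalidate=300"}
    # Rarely-updated pages can sit at the edge for a day.
    return {"Cache-Control": "public, max-age=86400"}
```

With headers like these, even a crawler requesting the same pages thousands of times a minute mostly hits the CDN rather than the origin server.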
Yep, all but one page type is heavily cached at multiple levels. We are working to get the rest and improve it further... it's just annoying that they don't even respect limits.
At this point I'd take a thermostat that can tell when my dashboard starts getting heated (it's always the same culprits causing these server spikes) and flicks attack mode on in Cloudflare... it's so ridiculous trying to run anything that isn't a WordPress site these days.
The site is about a particular type of pipeline cleaning (think water/oil pipelines). I am certain that nobody was asking about this particular site, or even the industry it's in, 15,000 times a minute, 24 hours a day.
It's much more likely that their crawler is just garbage and got stuck into some kind of loop requesting my domain.
I suppose they just keep referring to the website in their chats, probably with the search function selected, so the crawler hits the website before every reply.
> "I don't know what this actually gives people, but our industry takes great pride in doing this"
> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."
A bit off-topic but wtf is this preview image of a spider in the eye?
It’s even worse than the clickbait title of this post.
I think this should be considered bad practice.
I fully agree, and speaking as someone with macroinsectophobia (fear of large or numerous insect-like creatures), seeing it really makes me uncomfortable. It isn't enough to send me into panic mode or anything, but damn if it doesn't freak me out.
At the same time, it's so practical to ask a question and have it open 25 pages to search and summarize the answer. That's more or less what I was trying to do by hand before. Maybe not 25 websites, because thanks to crap SEO the top 10 results contain BS content, so I curated the list, but the idea is the same, no?
My personal experience is that OpenAI's crawler was hitting a very, very low-traffic website I manage tens of thousands of times a minute, non-stop. I had to block it via Cloudflare.
I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.
Even if it is generating 39k req/minute, I would expect most of the pages to already be locally cached by Meta, or served statically by their respective hosts. We have been working hard on caching websites, and it has been more or less a solved problem for the last decade.
> "Spoofed" UA strings have been facilitated and expected since the earliest web browsers

For example, see the NCSA Mosaic source: https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
> The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.

The same incentives to do this already existed for search engine operators.
> But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.

10/10. No notes.
I wish there were a better way to solve this. I have to make sure legit bots don't get hit, as a huge percentage of the traffic that helps the project stay active comes from Google, etc.
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."

<3 <3
e.g. they'll take an entire sentence the user said and put it in quotes for no reason.
Thankfully search engines started ignoring quotes years ago, so it balances out...