Most web scrapers, even the illegal ones, exist for business reasons: they scrape Amazon, online shops, and the like. So most unwanted traffic comes from big tech, or from bad actors probing for vulnerabilities.
I know a thing or two about web scraping.
Some sites return a 404 status code as protection, so that you skip them; my crawler responds by hammering through several of the faster crawling methods (curl_cffi).
Zip bombs are also not a problem for me. Reading the Content-Length header is enough to decide not to read the page/file, and I apply a byte limit to check that the response is not too big for me. For the other cases a read timeout is enough.
Oh, and did you know that the timeout in Python's requests is not really a timeout for reading the whole page? The read timeout only applies between bytes from the server, so a server can spoon-feed you bytes one after another and the timeout will never fire.
That is why I created my own crawling system to mitigate these problems and to have one consistent means of running Selenium:
https://github.com/rumca-js/crawler-buddy
It is based on the library
https://github.com/rumca-js/webtoolkit
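To make the byte-limit and drip-feed points concrete, here is a minimal sketch of that kind of guard using requests with stream=True; the limits and the function name are illustrative and not taken from crawler-buddy:

    import time
    import requests

    MAX_BYTES = 5_000_000    # illustrative cap, not a real crawler-buddy setting
    MAX_TOTAL_SECONDS = 30   # wall-clock budget for the whole read

    def fetch_capped(url):
        # timeout=(connect, read): the read timeout only applies *between*
        # chunks from the server, so a drip-feeding server never trips it.
        with requests.get(url, stream=True, timeout=(5, 10)) as resp:
            declared = resp.headers.get("Content-Length")
            if declared is not None and declared.isdigit() and int(declared) > MAX_BYTES:
                return None  # refuse obviously oversized (or zip-bomb) responses
            body = bytearray()
            deadline = time.monotonic() + MAX_TOTAL_SECONDS
            for chunk in resp.iter_content(chunk_size=16_384):
                body.extend(chunk)
                if len(body) > MAX_BYTES or time.monotonic() > deadline:
                    return None  # too big or too slow: give up on this page
            return bytes(body)

The wall-clock deadline is what closes the spoon-feeding hole: the library's own read timeout resets every time a byte arrives.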
If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine articles spelt out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers with high utilization across domains without suffocating any single one. If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time.
1. The most successful scrapers avoid standing out in any way: https://news.ycombinator.com/item?id=45529587
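A rough sketch of what per-domain scheduling can look like, as opposed to a single global FIFO; the class and parameter names here are invented for illustration:

    import threading
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    # Toy per-domain scheduler: each host gets its own queue plus a politeness
    # delay, so worker threads interleave domains instead of draining one host
    # at a time the way a plain URL FIFO would.
    class DomainScheduler:
        def __init__(self, delay_per_domain=2.0):
            self.delay = delay_per_domain
            self.queues = defaultdict(deque)   # host -> pending URLs
            self.next_ok = defaultdict(float)  # host -> earliest next fetch time
            self.lock = threading.Lock()

        def add(self, url):
            with self.lock:
                self.queues[urlparse(url).netloc].append(url)

        def get(self):
            # Hand out any URL whose host is past its politeness delay.
            with self.lock:
                now = time.monotonic()
                for host, queue in self.queues.items():
                    if queue and now >= self.next_ok[host]:
                        self.next_ok[host] = now + self.delay
                        return queue.popleft()
            return None  # nothing is ready yet; the caller should sleep briefly

Worker threads that pull from get() stay busy across many hosts while no single host sees more than one request per delay window.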
Not a new idea. For years now, on the occasions I’ve needed to scrape, I’ve used a set of ViolentMonkey scripts. I’ve even considered creating an extension, but have never really needed it enough to do the extra work.
But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.
Not "probably": searching through plaintext (which they seem to be doing) vs. iterating over the DOM involve such vastly different amounts of work, in terms of resources used and performance, that "probably" is way underselling the difference :)
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?
The regex approach is certainly easier to implement, but honestly static DOM parsing is pretty cheap, just quite fiddly to get right. You're probably gonna be limited by network congestion (or ephemeral ports) before you run out of CPU time doing this type of crawling.
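To make the trade-off concrete, a tiny side-by-side sketch; the HTML fragment, the selector, and the regex are made up, and bs4 is just one convenient parser choice:

    import re
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    html = '<div class="price">$19.99</div>'  # made-up page fragment

    # Regex grab: cheap and tolerant of surrounding markup moving around,
    # but brittle if attribute order, quoting, or nesting changes.
    m = re.search(r'class="price"[^>]*>([^<]+)<', html)
    price_via_regex = m.group(1) if m else None

    # DOM walk: more work per page, but you query real structure.
    node = BeautifulSoup(html, "html.parser").select_one("div.price")
    price_via_dom = node.get_text(strip=True) if node else None

    print(price_via_regex, price_via_dom)  # $19.99 $19.99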
The article mentions using this as a means of detecting bots, not as a complaint that it's abusive.
EDIT: I was chastised; here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".[1]
[1] https://news.ycombinator.com/newsguidelines.html
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine, and it's more robust against things moving around on the page.
The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.
Now that's a paid subscription I can get behind, especially if it suggests that Meta should cut Rob Schneider a check for $200,000,000,000 to make more movies.
A scraper sinkhole of randomly generated, inter-linked files filled with AI poison could work. No human would click that link, so it leads to the "exclusive club".
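A hypothetical sketch of such a sinkhole, here as a tiny Flask app whose pages are generated on the fly and link only to more generated pages; every route and name here is invented:

    import random
    import string
    from flask import Flask  # third-party: pip install flask

    app = Flask(__name__)
    WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # stand-in poison text

    def rand_slug(length=8):
        return "".join(random.choices(string.ascii_lowercase, k=length))

    # Every page is generated on request and links only to more generated
    # pages, so a crawler that follows links never finds its way back out.
    @app.route("/trap/<slug>")
    def trap(slug):
        text = " ".join(random.choices(WORDS, k=200))
        links = " ".join(
            '<a href="/trap/{}">more</a>'.format(rand_slug()) for _ in range(10)
        )
        return "<html><body><p>{}</p>{}</body></html>".format(text, links)

    if __name__ == "__main__":
        app.run()

Hiding the single entry link from humans (and disallowing it in robots.txt) keeps real visitors and well-behaved crawlers out of the club.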
Two thoughts here when it comes to poisoning the training data gathered by unwanted LLM scraper traffic:
1) A coordinated effort among different sites will have a much greater chance of poisoning a model's data, so long as they can avoid any post-scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of the poisoning here. Perhaps if the poisoned content is something that has already been aggressively litigated over, the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk, but they would have to have the means and the will to litigate.
As for 1, it would be great to have this as a plugin for WordPress etc. that anyone could simply install and enable. Pre-processing images to dynamically poison them on each request should be fun, and also protect against a deduplication defense. I'd certainly install that.
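As a rough illustration of per-request image poisoning (not an existing plugin; Pillow is just one convenient tool), something like this could perturb a few pixels differently on every request so that byte-level deduplication never sees the same file twice:

    import random
    from io import BytesIO
    from PIL import Image  # third-party: pip install Pillow

    def poison_image(path, n_pixels=200):
        # Tweak a handful of random pixels by a tiny random amount so every
        # served copy differs at the byte level while looking identical.
        img = Image.open(path).convert("RGB")
        px = img.load()
        width, height = img.size
        for _ in range(n_pixels):
            x, y = random.randrange(width), random.randrange(height)
            r, g, b = px[x, y]
            px[x, y] = (
                min(255, max(0, r + random.randint(-2, 2))),
                min(255, max(0, g + random.randint(-2, 2))),
                min(255, max(0, b + random.randint(-2, 2))),
            )
        buf = BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()  # bytes for the web server to send per request

Of course, flipping a few pixels only defeats exact or hash-based deduplication; perceptual hashing or genuine model-level poisoning would need much stronger perturbations.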
> A few of these came from user-agents that were obviously malicious:
(I love the idea that they consider any python or go request to be a malicious scraper...)
> Last Sunday I discovered some abusive bot behaviour [...]
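Riffing on the user-agent comment above: default library user-agent strings are trivial to match, which is presumably why the operator lumps any python or go request in with the malicious ones. A minimal sketch of such a filter; the pattern list is a generic set of library defaults, not the article's actual list:

    import re

    # Common default user-agents of HTTP client libraries; real browsers don't send these.
    DEFAULT_LIBRARY_UA = re.compile(
        r"python-requests|Go-http-client|curl/|libwww-perl|Java/|okhttp",
        re.IGNORECASE,
    )

    def looks_like_library_bot(user_agent: str) -> bool:
        # An empty UA string is also a strong bot signal.
        return not user_agent or bool(DEFAULT_LIBRARY_UA.search(user_agent))

    print(looks_like_library_bot("python-requests/2.32.0"))              # True
    print(looks_like_library_bot("Mozilla/5.0 (X11; Linux x86_64) ..."))  # False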