renegat0x0 · 2 months ago
Most web scrapers, even the illegal ones, are for... business. So they scrape Amazon, or shops. So yeah. Most unwanted traffic is from big tech, or from bad actors trying to sniff out vulnerabilities.

I know a thing or two about web scraping.

Some sites return a 404 status code as protection, so that you skip them, so my crawler, as a hammer, tries several faster crawling methods (curl_cffi).
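
A minimal sketch of that kind of fallback, assuming the curl_cffi package (my illustration, not the commenter's actual code): retry with a browser-impersonating client when a plain request gets served a fake 404/403.

    # Sketch only: fall back to curl_cffi browser impersonation when a
    # plain request is blocked. Requires `pip install requests curl_cffi`.
    import requests
    from curl_cffi import requests as cffi_requests

    def fetch(url, timeout=20):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code == 200:
                return resp.content
        except requests.RequestException:
            pass
        # Retry with a TLS/HTTP fingerprint that looks like a real browser.
        resp = cffi_requests.get(url, impersonate="chrome", timeout=timeout)
        return resp.content if resp.status_code == 200 else None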

Zip bombs are also not a problem for me. Reading the Content-Length header is enough to decide not to read the page/file. I also set a byte limit to check whether the response is too big for me. For other cases a read timeout is enough.

Oh, and did you know that the requests timeout is not really a timeout for reading the whole page? The server can spoonfeed you bytes, one after another, and the timeout will never trigger.
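
A minimal sketch of the workaround, assuming the requests library (the cap and deadline values are placeholders, not the commenter's): stream the response and enforce both an overall deadline and a byte limit yourself, since requests' timeout only applies to individual socket reads.

    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024   # hypothetical size cap
    MAX_SECONDS = 30              # hypothetical overall deadline

    def bounded_get(url):
        start = time.monotonic()
        body = b""
        # timeout=(connect, per-read) does NOT bound the whole download.
        with requests.get(url, stream=True, timeout=(5, 10)) as resp:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                body += chunk
                if len(body) > MAX_BYTES:
                    raise ValueError("byte limit exceeded")
                if time.monotonic() - start > MAX_SECONDS:
                    raise TimeoutError("overall read deadline exceeded")
        return body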

That is why I created my own crawling system to mitigate these problems and have one consistent means of running Selenium.

https://github.com/rumca-js/crawler-buddy

Based on this library:

https://github.com/rumca-js/webtoolkit

hnav · 2 months ago
content-length is computed after content-encoding
ahoka · 2 months ago
If it’s present at all.
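
A small illustration of the consequence (my own sketch, not from the thread): because Content-Length, when present, describes the encoded body, a zip bomb can advertise a few kilobytes yet decompress into gigabytes, so decompress with an explicit output cap instead of trusting the header.

    import zlib

    DECOMPRESSED_CAP = 10 * 1024 * 1024  # hypothetical cap

    def bounded_gunzip(compressed):
        # 16 + MAX_WBITS tells zlib to expect a gzip wrapper.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        out = d.decompress(compressed, DECOMPRESSED_CAP)
        if d.unconsumed_tail:
            # More output was available than the cap allows: likely a bomb.
            raise ValueError("decompressed size exceeds cap")
        return out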

1vuio0pswjnm7 · 2 months ago
Is there a difference between "scraping" and "crawling"?
Mars008 · 2 months ago
Looks like it's time for in-browser scrapers. They will be indistinguishable from the server side. With an AI driver they can pass even human tests.
overfeed · 2 months ago
> Looks like it's time for in-browser scrapers.

If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine article spelled out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers that keep high utilization across domains without suffocating any single one. If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time (see the frontier sketch below).

1. The most successful scrapers avoid standing out in any way
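
A minimal sketch of that per-domain frontier (my illustration; the delay value is a placeholder): keep one queue per domain and rotate across domains, so no single site absorbs the full request rate.

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    POLITENESS_DELAY = 5.0  # hypothetical seconds between hits to one domain

    class Frontier:
        def __init__(self):
            self.queues = defaultdict(deque)   # domain -> pending URLs
            self.next_ok = defaultdict(float)  # domain -> earliest next fetch

        def add(self, url):
            self.queues[urlparse(url).netloc].append(url)

        def next_url(self):
            now = time.monotonic()
            # Serve whichever domain's politeness delay has already elapsed.
            for domain, queue in self.queues.items():
                if queue and self.next_ok[domain] <= now:
                    self.next_ok[domain] = now + POLITENESS_DELAY
                    return queue.popleft()
            return None  # nothing ready; caller should sleep briefly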

bartread · 2 months ago
Not a new idea. For years now, on the occasions I’ve needed to scrape, I’ve used a set of ViolentMonkey scripts. I’ve even considered creating an extension, but have never really needed it enough to do the extra work.

But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.

eur0pa · 2 months ago
you mean OpenAI Atlas?
rokkamokka · 2 months ago
I'm not overly surprised, it's probably faster to search the text for http/https than parse the DOM
embedding-shape · 2 months ago
Not probably: searching through plaintext (which they seem to be doing) vs. iterating over the DOM involve vastly different amounts of work in terms of resources used and performance, so "probably" is way underselling the difference :)
franktankbank · 2 months ago
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?
marginalia_nu · 2 months ago
The regex approach is certainly easier to implement, but honestly static DOM parsing is pretty cheap too, just quite fiddly to get right. You're probably gonna be limited by network congestion (or ephemeral ports) before you run out of CPU time doing this type of crawling.
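
A small illustration of why the two approaches see different things (my own example, not from the article): a regex over the raw HTML happily picks up a URL inside an HTML comment, while a parser that only walks real elements never sees it.

    import re
    from html.parser import HTMLParser

    HTML = ('<a href="https://example.com/real">x</a>'
            '<!-- <a href="https://example.com/hidden"> -->')

    # Naive regex pass over the raw text: finds both URLs.
    regex_urls = re.findall(r'https?://[^\s"\'<>]+', HTML)

    # Parser pass: only hrefs from actual <a> tags; comment nodes are skipped.
    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.urls = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.urls += [v for k, v in attrs if k == "href"]

    extractor = LinkExtractor()
    extractor.feed(HTML)
    print(regex_urls)       # includes the commented-out URL
    print(extractor.urls)   # only the real link
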
sharkjacobs · 2 months ago
Fun to see practical applications of interesting research[1]

[1] https://news.ycombinator.com/item?id=45529587

Noumenon72 · 2 months ago
It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".
mostlysimilar · 2 months ago
The article mentions using this as a means of detecting bots, not as a complaint that it's abusive.

EDIT: I was chastised, here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.

ang_cire · 2 months ago
They call the scrapers "malicious", so they are definitely complaining about them.

> A few of these came from user-agents that were obviously malicious:

(I love the idea that they consider any python or go request to be a malicious scraper...)

pseudalopex · 2 months ago
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".[1]

[1] https://news.ycombinator.com/newsguidelines.html

woodrowbarlow · 2 months ago
the first few words of the article are:

> Last Sunday I discovered some abusive bot behaviour [...]

michael1999 · 2 months ago
Crawlers ignoring robots.txt is abusive. That they then start scanning all docs for commented-out URLs just adds to the pile of scummy behaviour.
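
For reference, honoring robots.txt takes only a few lines with the Python standard library (a sketch with a hypothetical user agent and URL):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the file
    if rp.can_fetch("ExampleBot/1.0", "https://example.com/some/page"):
        pass  # polite crawlers only fetch when this returns True
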
tveyben · 2 months ago
Human behavior is interesting - me, me, me…
stevage · 2 months ago
The title is confusing, should be "commented-out".
pimlottc · 2 months ago
Agree, I thought maybe this was going to be a script to block AI scrapers or something like that.
zahlman · 2 months ago
I thought it was going to be AI scraper operators getting annoyed that they have to run reasoning models on the scraped data to make use of it.
latenightcoding · 2 months ago
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented-out URLs would have been added to my queue.
rightbyte · 2 months ago
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine and is more robust against things moving around on the page.
chaps · 2 months ago
Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks but it is what it is.
horseradish7k · 2 months ago
but not when crawling. you don't know the page format in advance - you don't even know what the page contains!
bigbuppo · 2 months ago
Sounds like you should give the bots exactly what they want... a 512MB file of random data.
kelseyfrog · 2 months ago
That's leaving a lot of opportunity on the table.

The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.

bigbuppo · 2 months ago
Now that's a paid subscription I can get behind, especially if it suggests that Meta should cut Rob Schneider a check for $200,000,000,000 to make more movies.
aDyslecticCrow · 2 months ago
Scraper sinkhole of randomly generated inter-linked files filled with AI poison could work. No human would click that link, so it leads to the "exclusive club".
oytis · 2 months ago
Outbound traffic normally costs more than inbound, so the asymmetry is set up wrong here. Data poisoning is probably the way.
kelnos · 2 months ago
Most people have to pay for their bandwidth, though. That's a lot of data to send out over and over.
jcheng · 2 months ago
512MB file of incredibly compressible data, then?
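
A sketch of how cheap that is to produce (my illustration; the filename is arbitrary): 512 MB of zeros gzip-compresses to well under a megabyte, so it costs almost nothing to store or serve pre-compressed.

    import gzip

    SIZE = 512 * 1024 * 1024   # decompressed size
    CHUNK = 1024 * 1024        # write in 1 MB chunks to keep memory flat

    with gzip.open("decoy.html.gz", "wb", compresslevel=9) as f:
        for _ in range(SIZE // CHUNK):
            f.write(b"\0" * CHUNK)
    # Serve the result with "Content-Encoding: gzip" so the client's
    # HTTP layer inflates it on receipt.
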
AlienRobot · 2 months ago
512 MB of saying your service is the best service.
mikeiz404 · 2 months ago
Two thoughts here when it comes to poisoning unwanted LLM training-data traffic:

1) A coordinated effort among different sites will have a much greater chance of poisoning the data of a model so long as they can avoid any post scraping deduplication or filtering.

2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.

Anamon · 2 months ago
As for 1, it would be great to have this as a plugin for WordPress etc. that anyone could simply install and enable. Pre-processing images to dynamically poison them on each request should be fun, and also protect against a deduplication defense. I'd certainly install that.