Most web scrapers, even the illegal ones, exist for business reasons: they scrape Amazon, online shops, and the like. So most unwanted traffic comes from big tech, or from bad actors probing for vulnerabilities.
I know a thing or two about web scraping.
Some sites return a 404 status code as protection, so that you skip them; my crawler responds by hammering through several of the faster crawling methods (curl_cffi).
Zip bombs are also not a problem for me. Reading the Content-Length header is enough to decide not to read the page/file, and I apply a byte limit to check that the response is not too big for me. For the other cases a read timeout is enough.
Oh, and did you know that the timeout in Python's requests is not really a timeout for reading the whole page? The read timeout only applies between bytes from the server, so a server can spoon-feed you bytes one after another and the timeout will never fire.
That is why I created my own crawling system to mitigate these problems and to have one consistent means of running Selenium:
https://github.com/rumca-js/crawler-buddy
It is based on the library
https://github.com/rumca-js/webtoolkit
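To make the byte-limit and drip-feed points concrete, here is a minimal sketch of that kind of guard using requests with stream=True; the limits and the function name are illustrative and not taken from crawler-buddy:

    import time
    import requests

    MAX_BYTES = 5_000_000    # illustrative cap, not a real crawler-buddy setting
    MAX_TOTAL_SECONDS = 30   # wall-clock budget for the whole read

    def fetch_capped(url):
        # timeout=(connect, read): the read timeout only applies *between*
        # chunks from the server, so a drip-feeding server never trips it.
        with requests.get(url, stream=True, timeout=(5, 10)) as resp:
            declared = resp.headers.get("Content-Length")
            if declared is not None and declared.isdigit() and int(declared) > MAX_BYTES:
                return None  # refuse obviously oversized (or zip-bomb) responses
            body = bytearray()
            deadline = time.monotonic() + MAX_TOTAL_SECONDS
            for chunk in resp.iter_content(chunk_size=16_384):
                body.extend(chunk)
                if len(body) > MAX_BYTES or time.monotonic() > deadline:
                    return None  # too big or too slow: give up on this page
            return bytes(body)

The wall-clock deadline is what closes the spoon-feeding hole: the library's own read timeout resets every time a byte arrives.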
If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine articles spelt out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers with high utilization across domains without suffocating any single one. If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time.
1. The most successful scrapers avoid standing out in any way: https://news.ycombinator.com/item?id=45529587
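A rough sketch of what per-domain scheduling can look like, as opposed to a single global FIFO; the class and parameter names here are invented for illustration:

    import threading
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    # Toy per-domain scheduler: each host gets its own queue plus a politeness
    # delay, so worker threads interleave domains instead of draining one host
    # at a time the way a plain URL FIFO would.
    class DomainScheduler:
        def __init__(self, delay_per_domain=2.0):
            self.delay = delay_per_domain
            self.queues = defaultdict(deque)   # host -> pending URLs
            self.next_ok = defaultdict(float)  # host -> earliest next fetch time
            self.lock = threading.Lock()

        def add(self, url):
            with self.lock:
                self.queues[urlparse(url).netloc].append(url)

        def get(self):
            # Hand out any URL whose host is past its politeness delay.
            with self.lock:
                now = time.monotonic()
                for host, queue in self.queues.items():
                    if queue and now >= self.next_ok[host]:
                        self.next_ok[host] = now + self.delay
                        return queue.popleft()
            return None  # nothing is ready yet; the caller should sleep briefly

Worker threads that pull from get() stay busy across many hosts while no single host sees more than one request per delay window.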
Not a new idea. For years now, on the occasions I’ve needed to scrape, I’ve used a set of ViolentMonkey scripts. I’ve even considered creating an extension, but have never really needed it enough to do the extra work.
But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.
Not "probably": searching through plaintext (which they seem to be doing) vs. iterating over the DOM involve such vastly different amounts of work, in terms of resources used and performance, that "probably" is way underselling the difference :)
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?
The regex approach is certainly easier to implement, but honestly static DOM parsing is pretty cheap, just quite fiddly to get right. You're probably gonna be limited by network congestion (or ephemeral ports) before you run out of CPU time doing this type of crawling.
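To make the trade-off concrete, a tiny side-by-side sketch; the HTML fragment, the selector, and the regex are made up, and bs4 is just one convenient parser choice:

    import re
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    html = '<div class="price">$19.99</div>'  # made-up page fragment

    # Regex grab: cheap and tolerant of surrounding markup moving around,
    # but brittle if attribute order, quoting, or nesting changes.
    m = re.search(r'class="price"[^>]*>([^<]+)<', html)
    price_via_regex = m.group(1) if m else None

    # DOM walk: more work per page, but you query real structure.
    node = BeautifulSoup(html, "html.parser").select_one("div.price")
    price_via_dom = node.get_text(strip=True) if node else None

    print(price_via_regex, price_via_dom)  # $19.99 $19.99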
The article mentions using this as a means of detecting bots, not as a complaint that it's abusive.
EDIT: I was chastised; here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".[1]
[1] https://news.ycombinator.com/newsguidelines.html
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine, and it's more robust against things moving around on the page.
The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.
Now that's a paid subscription I can get behind, especially if it suggests that Meta should cut Rob Schneider a check for $200,000,000,000 to make more movies.
A scraper sinkhole of randomly generated, inter-linked files filled with AI poison could work. No human would click that link, so it leads to the "exclusive club".
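A hypothetical sketch of such a sinkhole, here as a tiny Flask app whose pages are generated on the fly and link only to more generated pages; every route and name here is invented:

    import random
    import string
    from flask import Flask  # third-party: pip install flask

    app = Flask(__name__)
    WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # stand-in poison text

    def rand_slug(length=8):
        return "".join(random.choices(string.ascii_lowercase, k=length))

    # Every page is generated on request and links only to more generated
    # pages, so a crawler that follows links never finds its way back out.
    @app.route("/trap/<slug>")
    def trap(slug):
        text = " ".join(random.choices(WORDS, k=200))
        links = " ".join(
            '<a href="/trap/{}">more</a>'.format(rand_slug()) for _ in range(10)
        )
        return "<html><body><p>{}</p>{}</body></html>".format(text, links)

    if __name__ == "__main__":
        app.run()

Hiding the single entry link from humans (and disallowing it in robots.txt) keeps real visitors and well-behaved crawlers out of the club.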
Two thoughts here when it comes to poisoning the training data gathered by unwanted LLM scraper traffic:
1) A coordinated effort among different sites will have a much greater chance of poisoning a model's data, so long as they can avoid any post-scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of the poisoning here. Perhaps if the poisoned content is something that has already been aggressively litigated over, the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk, but they would have to have the means and the will to litigate.
As for 1, it would be great to have this as a plugin for WordPress etc. that anyone could simply install and enable. Pre-processing images to dynamically poison them on each request should be fun, and also protect against a deduplication defense. I'd certainly install that.
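As a rough illustration of per-request image poisoning (not an existing plugin; Pillow is just one convenient tool), something like this could perturb a few pixels differently on every request so that byte-level deduplication never sees the same file twice:

    import random
    from io import BytesIO
    from PIL import Image  # third-party: pip install Pillow

    def poison_image(path, n_pixels=200):
        # Tweak a handful of random pixels by a tiny random amount so every
        # served copy differs at the byte level while looking identical.
        img = Image.open(path).convert("RGB")
        px = img.load()
        width, height = img.size
        for _ in range(n_pixels):
            x, y = random.randrange(width), random.randrange(height)
            r, g, b = px[x, y]
            px[x, y] = (
                min(255, max(0, r + random.randint(-2, 2))),
                min(255, max(0, g + random.randint(-2, 2))),
                min(255, max(0, b + random.randint(-2, 2))),
            )
        buf = BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()  # bytes for the web server to send per request

Of course, flipping a few pixels only defeats exact or hash-based deduplication; perceptual hashing or genuine model-level poisoning would need much stronger perturbations.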
> A few of these came from user-agents that were obviously malicious:
(I love the idea that they consider any python or go request to be a malicious scraper...)
> Last Sunday I discovered some abusive bot behaviour [...]
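Riffing on the user-agent comment above: default library user-agent strings are trivial to match, which is presumably why the operator lumps any python or go request in with the malicious ones. A minimal sketch of such a filter; the pattern list is a generic set of library defaults, not the article's actual list:

    import re

    # Common default user-agents of HTTP client libraries; real browsers don't send these.
    DEFAULT_LIBRARY_UA = re.compile(
        r"python-requests|Go-http-client|curl/|libwww-perl|Java/|okhttp",
        re.IGNORECASE,
    )

    def looks_like_library_bot(user_agent: str) -> bool:
        # An empty UA string is also a strong bot signal.
        return not user_agent or bool(DEFAULT_LIBRARY_UA.search(user_agent))

    print(looks_like_library_bot("python-requests/2.32.0"))              # True
    print(looks_like_library_bot("Mozilla/5.0 (X11; Linux x86_64) ..."))  # False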