Just about two years ago, a long-dormant project surged back to life and became one of the best crawlers out there. Zimit was originally made to scrape MediaWiki sites, but it can now crawl practically anything into an offline archive file. I have been trying to grab the data I am guessing will soon be put under much stricter anti-scraping controls, and I am not the only one hoarding data. The winds are blowing towards a closed internet faster than I have ever seen in my life. Get it while you can.
The winds may be changing, but those who don't fear resorting to piracy will always be sailing smoothly. We will just have more walled gardens and more illegal offerings, leaving normal people stranded and thirsty.
Unfortunately, things like a 500 GB dump of the sacred-texts website are not often found on torrent trackers or other 'warez' sites. Anna's Archive is pretty great for written material that has been published offline, but even the Wayback Machine and archive.org have limited full scrapes aside from the ones published by the Kiwix team.
Curious what people think is an appropriate request rate for crawling a website. I have seen many examples where the author will spin up N machines with M threads and just hammer a server until it starts returning more than a certain failure rate.
I have never done anything serious, but have always tried to keep my hit rate fairly modest.
I don't do much crawling or scraping either, but when I have, I go to the opposite extreme. There's no reason why I need the data right that second, so I set it up to pause for a random number of seconds between requests. *Except YouTube, I'll yt-dl an entire playlist, no probs.*
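A minimal sketch of that kind of throttled fetch loop, assuming Python's standard library; the URL list and the 5-30 second range are made up:

```python
import random
import time
import urllib.request

# Made-up list of pages to archive; swap in whatever you are actually crawling.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    print(f"fetched {url}: {len(body)} bytes")
    # Be polite: sleep a random 5-30 seconds before the next request.
    time.sleep(random.uniform(5, 30))
```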
One of the best use cases for "serverless" functions like AWS Lambda is easily proxying web-crawling requests from the comfort of your codebase. To the developer, it's just a function in a file, but it runs in isolation from a random IP address on each invocation - independent of app state. Like Puppeteer, one of Big Tech's little gifts to indie hackers.
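A rough sketch of what such a function can look like, assuming the Python Lambda runtime; the "url" event field and the user agent string are illustrative, not any particular framework's convention:

```python
import urllib.request

def lambda_handler(event, context):
    # "url" is an illustrative event field; wire it up however your invoker sends it.
    url = event["url"]
    req = urllib.request.Request(url, headers={"User-Agent": "my-crawler/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return {
            "statusCode": resp.status,
            "body": resp.read().decode("utf-8", errors="replace"),
        }
```

How often the egress IP actually changes depends on how Lambda reuses warm execution environments, so treat the fresh-IP-per-invocation part as a tendency rather than a guarantee.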
I think one other side effect of this is the increasing restrictions on VPN usage for accessing big websites, pushing users towards logging in or using (mobile) apps instead. Recent examples include X, Reddit, and more recently YouTube, which has started blocking VPNs.
I'm also concerned that Free and open APIs might become a thing of the past as more "AI-driven" web scrapers/crawlers begin to overwhelm them.
I think it is helpful for public mirrors to be made, in case the original files are lost or the original server is temporarily down, or if you want to be able to access something without needing to hit the same servers all the time (including if you have no internet connection at all but do have a local copy); you can make your own copies from mirrors, and from those too, etc. Cryptographic hashing can be used as well, in order to check that a copy matches by a much shorter code than the entire file. However, mirrors should not make an excessive number of requests, so I do block everything in robots.txt (but it is OK if someone wants to use curl to download files, or clone a repository, and then wants to mirror them; I also mirror some of my own stuff on GitHub). What I do not want others to do is to claim that I wrote something that I actually did not write, or to claim copyright on something that I wrote in a way that restricts its use by others.
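A small sketch of the hash-checking part, in Python; the file path and the expected digest are placeholders:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash the file in chunks so large mirror dumps need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder digest, as published by the original site or a trusted mirror.
expected = "0000000000000000000000000000000000000000000000000000000000000000"
print(sha256_of("mirror/archive.zim") == expected)
```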
Nonetheless, dark actors are still going to get through and sell this data later.
There are probably tons of those crawlers operating with the intent to later launder that data, probably using LLMs to change the content just enough to fall outside the copyright zone.
Another way of stating it: now that previously worthless text and images can be made worth money, people no longer want to have public websites and are trying to change the longstanding culture of an open, remixable web to fit their new paranoia over not getting their cut.
I welcome all user agents and have been delighted to see OpenAI, Anthropic, Microsoft, and other AI-related crawlers mirroring my public website(s), just the same as when I see Googlebot or Bingbot or Firefox-using humans. For information that I don't want to be public, I simply don't put it on public websites.
Another way of stating it: how dare all those greedy web publishers want to eat! /s
The first couple of decades of the web were built on an implicit promise: publishers could put their content out on freely accessible websites, and search engines would direct traffic to those sites. It was a mutually beneficial arrangement: publishers got the benefit of traffic that they could monetize in different ways if they saw fit (even hosting ads from the search engines a la AdSense), and Google etc. could earn mountains in AdWords revenue. I'm not saying it was always a fair tradeoff on both sides, but both sides had an incentive to share content openly.
AI breaks that model. Publishers create all the content, but then the big search engines and AI companies can answer user questions without giving any reference at all to the original sites (companies like Google are providing source citations, but I guarantee the click-through rates go WAY down). This breakdown in the web's economic model has been happening for a while even before AI (e.g. with Google hosting more and more "informational content" directly on SERP pages - see https://news.ycombinator.com/item?id=24105465), but with AI the breakdown is really complete.
So no, I don't fault people at all who don't want the fruits of their labor to be sucked up by trillion-dollar megacorps.
That’s their expectation but not the original design. HTTP is designed so that, if you run a web server, you are willing to send content to anyone who sends you an HTTP request. Anyone putting content online with a system like that should expect the data to go out to as many people as request it.
Alternative systems, like paywalls with prominent terms and conditions, exist to protect copyrighted content. Publishers intentionally avoid them. So, they’re using a system designed for unrestricted distribution to publish content they hope people will use in restricted ways. I think the default, reasonable expectation should be that anyone can scrape public data for at least personal use.
I activated the Block AI Scrapers and Crawlers feature on Cloudflare months ago. When I checked the event logs, I was surprised by the persistent attempts from GPTBot, PetalBot, Amazonbot, and PerplexityBot to crawl my site. They were making multiple requests per hour.
Considering my blog's niche focus on A/V production with Linux, I can only imagine how much more frequent their crawling would be on more popular websites.
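The same kind of tally can be done without Cloudflare by scanning ordinary access logs; a rough sketch, where the log path, the combined log format, and the bot list are assumptions:

```python
from collections import Counter

# User-agent substrings for the crawlers mentioned above.
AI_BOTS = ["GPTBot", "PetalBot", "Amazonbot", "PerplexityBot"]

hits = Counter()
with open("/var/log/nginx/access.log") as log:  # assumed path and log format
    for line in log:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
```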
I feel like talking about robots.txt in this context is kind of a pointless enterprise given that there's no guarantee it will be followed by crawlers (and TFA fully acknowledges this). Before AI, there was a mutually beneficial (not necessarily equal, but mutual) economic arrangement where websites published open content freely and search engines indexed that content. That arrangement fundamentally no longer exists, and we can't pretend it's coming back. The end game of this is more and stronger paywalls (and not ones easily bypassed by incognito mode), and I think that's inevitable.
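For what it's worth, a crawler that does choose to honor robots.txt only needs a few lines of Python's standard library; whether the big crawlers bother is exactly the open question:

```python
import urllib.robotparser

# Check what example.com's robots.txt says about a given user agent and URL.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("GPTBot", "https://example.com/some/page"))
```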
To be clear, I am not looking for ways of circumventing DDoS protections; I'm asking what people consider a reasonable rate that doesn't put undue burden on a host.
Is there a list of IP addresses, user agents, or other identifying info known to be used by AI crawlers?
Ultimately pointless but just curious.
I wonder if there are crowd-sourced iptables lists instead, similar to how ad filters are maintained.
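If such a crowd-sourced list existed (say, one CIDR range per line), turning it into firewall rules would be the easy part; a sketch, with the file name purely hypothetical:

```python
import ipaddress

# "ai-crawler-ranges.txt" is a hypothetical crowd-sourced block list, one CIDR per line.
with open("ai-crawler-ranges.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cidr = ipaddress.ip_network(line, strict=False)  # validate before emitting a rule
        print(f"iptables -A INPUT -s {cidr} -j DROP")
```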
And there is no middle ground between not publishing at all and welcoming every crawler? What if I want others to be able to read it, but not to sell it?