kelsey98765431 · a year ago
Just about 2 years ago, a long-dormant project surged back to life, becoming one of the best crawlers out there. Zimit was originally made to scrape MediaWiki sites, but it can now crawl literally anything into an offline archive file. I have been trying to grab the data I'm guessing will soon be put under much stricter anti-scraping controls, and I am not the only one hoarding data. The winds are blowing towards a closed internet faster than I have ever seen in my life. Get it while you can.
sigmoid10 · a year ago
The winds may be changing, but those who don't fear resorting to piracy will always be sailing smoothly. We will just have more walled gardens and more illegal offerings, leaving normal people stranded and thirsty.
kelsey98765431 · a year ago
Unfortunately, things like a 500 GB dump of the sacredtexts website are not often found on torrent trackers or other 'warez' sites. Anna's Archive is pretty great for written material that has been published offline, but even the Wayback Machine and archive.org have limited full scrapes aside from the ones published by the Kiwix team.
mewpmewp2 · a year ago
What is the size of your storage?
kelsey98765431 · a year ago
About 50 TB on my LAN.
0cf8612b2e1e · a year ago
Curious what people think is an appropriate request rate for crawling a website. I have seen many examples where the author will spin up N machines with M threads and just hammer a server until it starts returning more than a certain failure rate.

I have never done anything serious, but have always tried to keep my hit rate fairly modest.

Cordiali · a year ago
I don't do much crawling or scraping either, but when I have, I've gone to the opposite extreme. There's no reason why I need the data right that second, so I set it up to pause for a random number of seconds between requests.
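
Roughly like this (just a sketch; the URLs and delay bounds are placeholders):

```python
import random
import time
import urllib.request

# Placeholder URL list; in practice this would come from a sitemap or a queue.
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = resp.read()
    # ... process or save `data` here ...
    # Pause a random number of seconds before hitting the server again.
    time.sleep(random.uniform(2, 10))
```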

*Except YouTube, I'll yt-dl an entire playlist, no probs*
bschmidt1 · a year ago
One of the best use cases for "serverless" functions like AWS Lambda is proxying web-crawling requests from the comfort of your codebase. To the developer, it's just a function in a file, but each invocation runs in isolation from a random IP address, independent of app state. Like Puppeteer, one of Big Tech's little gifts to indie hackers.
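
Something like this, roughly (a Python sketch; the event shape and user agent string are placeholders, not any particular production setup):

```python
import urllib.request

def lambda_handler(event, context):
    # Hypothetical event shape: {"url": "https://example.com/page"}
    url = event["url"]
    req = urllib.request.Request(url, headers={"User-Agent": "example-crawler/0.1"})
    # Each invocation goes out from whatever IP the Lambda environment happens to use.
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {
            "statusCode": resp.status,
            "body": resp.read().decode("utf-8", errors="replace"),
        }
```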
0cf8612b2e1e · a year ago
That’s a technical solution to being a bad actor.

I am not looking for ways of circumventing DDoS protections; rather, what do people consider a reasonable rate that doesn't put undue burden on a host?

mfiro · a year ago
I think one other side effect of this is the increasing restrictions on VPN usage for accessing big websites, pushing users towards logging in or using (mobile) apps instead. Examples include X and Reddit, and more recently YouTube, which has started blocking VPNs.

I'm also concerned that free and open APIs might become a thing of the past as more "AI-driven" web scrapers/crawlers begin to overwhelm them.

zzo38computer · a year ago
I think it is helpful for public mirrors to be made, in case the original files are lost, the original server is temporarily down, or you want to access the content without hitting the same servers every time (including when you have no internet connection at all but do have a local copy); you can then make further copies from mirrors, and from those copies, and so on. Cryptographic hashing can be used too, to check that a copy matches by means of a code much shorter than the entire file. However, mirrors should not make an excessive number of access attempts, so I block everything in robots.txt (though it is fine if someone wants to use curl to download files, or clone a repository, and then mirror them; I also mirror some of my own stuff on GitHub). What I do not want is for others to claim that I wrote something I did not actually write, or to claim copyright on something I wrote in a way that restricts its use by others.
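
For the hash-checking part, a minimal sketch (the file name and expected digest are placeholders):

```python
import hashlib

def sha256_of(path: str) -> str:
    # Hash the file in chunks so large mirrored files don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder digest, as it might be published alongside the original file.
expected = "0123abcd..."
if sha256_of("mirrored-file.tar.gz") == expected:
    print("local copy matches the original")
```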
pton_xd · a year ago
Is there an agreed-upon, best-effort robots.txt that I can reference to cut out crawlers?

Or a list of IP addresses or user agents or other identifying info known to be used by AI crawlers?

Ultimately pointless but just curious.
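
There is no single agreed-upon list, but a minimal best-effort robots.txt just names the AI user agents one by one; the entries below cover only the bots mentioned elsewhere in this thread, and there is no guarantee any of them honor it:

```
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PetalBot
Disallow: /
```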

alentred · a year ago
> Ultimately pointless

I wonder if there are crowd-sourced iptables blocklists instead, similar to how ad filters are maintained.

mewpmewp2 · a year ago
Nonetheless, bad actors are still going to get through and sell this data later.

There are probably tons of those crawlers operating with the intent to launder that data later, probably using LLMs to change the content just enough to fall outside the reach of copyright.

superkuh · a year ago
Another way of stating it: now that worthless text and images can be made to be worth money, people no longer want to have public websites and are trying to change the longstanding culture of an open, remixable web to fit their new paranoia about not getting their cut.

I welcome all user agents and have been delighted to see OpenAI, Anthropic, Microsoft, and other AI-related crawlers mirroring my public website(s). Just the same as when I see Googlebot, Bingbot, or Firefox-using humans. For information that I don't want to be public, I simply don't put it on public websites.

croes · a year ago
> For information that I don't want to be public, I simply don't put it on public websites.

And there is no middle ground?

I want others to read it but not to sell it?

hn_throwaway_99 · a year ago
Another way of stating it: how dare all those greedy web publishers want to eat! /s

The first couple of decades of the web were built on an implicit promise: publishers could put their content out on freely accessible websites, and search engines would direct traffic to those sites. It was a mutually beneficial arrangement: publishers got the benefit of traffic that they could monetize in different ways if they saw fit (even hosting ads from the search engines à la AdSense), and Google etc. could earn mountains in AdWords revenue. I'm not saying it was always a fair tradeoff on both sides, but both sides had an incentive to share content openly.

AI breaks that model. Publishers create all the content, but then the big search engines and AI companies can answer user questions without giving any reference at all to the original sites (companies like Google are providing source citations, but I guarantee the click-through rates go WAY down). This breakdown in the web's economic model has been happening for a while even before AI (e.g. with Google hosting more and more "informational content" directly on SERP pages - see https://news.ycombinator.com/item?id=24105465), but with AI the breakdown is really complete.

So no, I don't fault people at all who don't want the fruits of their labor to be sucked up by trillion-dollar megacorps.

nickpsecurity · a year ago
That’s their expectation, but not the original design. The design of HTTP is that, if you run a web server, you are willing to send content to anyone who sends you an HTTP request. Anyone putting content online with a system like that should expect the data to go out to as many people as request it.

Alternative systems, like paywalls with prominent terms and conditions, exist to protect copyrighted content, and publishers intentionally avoid them. So they’re using a system designed for unrestricted distribution to publish content they hope people will use only in restricted ways. I think the default, reasonable expectation should be that anyone can scrape public data, at least for personal use.

Venn1 · a year ago
I activated the Block AI Scrapers and Crawlers feature on Cloudflare months ago. When I checked the event logs, I was surprised by the persistent attempts from GPTBot, PetalBot, Amazonbot, and PerplexityBot to crawl my site. They were making multiple requests per hour.

Considering my blog's niche focus on A/V production with Linux, I can only imagine how much more frequent their crawling would be on more popular websites.

hn_throwaway_99 · a year ago
I feel like talking about robots.txt in this context is kind of a pointless enterprise given that there's no guarantee crawlers will follow it (and TFA fully acknowledges this). Before AI, there was a mutually beneficial (not necessarily equal, but mutual) economic arrangement where websites published open content freely, and search engines indexed that content. That arrangement fundamentally no longer exists, and we can't pretend it's coming back. The end game of this is more and stronger paywalls (and not ones easily bypassed by incognito mode), and I think that's inevitable.