I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. The bandwith and storage are the smallest cost factor.
Even though, in my case, users add their own domains, it's still took me quite a bit of time to reach 99% chance to crawl a website — with a mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries, otherwise I would get 403 immediately with no HTML being served.
Yes, in this day and age, I could definitely see web pages being harder to crawl by search engines (and SEO companies and other users of automated web crawling technologies (AI agents?)) than they were in the early days of the Internet due to many possible causes -- many of which you've excellently described!
In other words, there's more to be aware of for anyone writing a search engine (or search-engine-like piece of software -- SEO, AI Agent, etc., etc.) than there was in the early days of the Internet, where everything was straight unencrypted http and most URLs were easily accessible without having to jump through additional hoops...
Which leads me to wonder... on the one hand, a website owner may not want bots and other automated software agents spidering their site (we have ROBOTS.TXT for this), but on the flip side, most business owners DO want publicity and easy accessibility for sales and marketing purposes, thus, they'd never want to issue a 403 (or other error code) for any public-facing product webpage...
Thus there may be a market for testing public facing business/product websites against faulty "I can't give you that web page for whatever reason" error codes from a wide variety of clients, from a wide variety of locations around the world.
That market is related to the market for testing if a website is up and functioning properly (the "uptime market"), again, from a wide variety of locations around the world, using a wide variety of browsers...
So, a very interesting post!
Also (for future historians!) compare all of the restrictive factors which may prevent access to a public-facing web page today Vs. Tim Berners-Lee original vision for the web, which was basically to let scientists (and other academic types!) SHARE their data PUBLICLY with one another!
(Things have changed... a bit! :-) )
Anyway, a very interesting post, and a very interesting article -- for both present and future Search Engine programmers!
I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this whereas if I'd start something like this in the EU, I don't think I could.
In Italy it’s a crime punishable up to 12 years to access any protected computer system without authorization, especially if it causes a DoS to the owner
Consider the case of selfhosting a web service on a low performance server and the abusive crawling goes on loop fetching data (which was happening when I was self hosting gitlab!)
I'm in a similar boat and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy", in the worst case it's a department far away and it has to go through 3 layers of reviews for someone to adapt some Cloudflare / Akamai rules.
And then you better make sure your IP is stable and a cloud provider isn't changing any IP assignments in the future, where you'll then have to contact all your clients again with that ask.
They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth
Am I missing something here? Even Optane is an order of magnitude slower than RAM.
Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.
Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.
In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.
>for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU
That's not why. It's because RAM has a narrower bus than VRAM. If it was a matter of distance it'd just have greater latency, but that would still give you tons of bandwidth to play with.
I can't edit my comment, but to the people responding here, thank you for adding all this information. It really helped elucidate why VRAM vs RAM is a distinction and also prevents my somewhat naive interpretation from being the only thing people see. Thanks!
Nice work, but I feel like it's not required to use AWS for this. There are small hosting companies with specialized servers (50gbit shared medium for under 10$), you could probably do this under 100$ with some optimization.
I did some crawling on hetzner back in the day. They monitor traffic and make sure you don't automate publically available data retrieval. They send you an email telling you that they are concerned because you got the ip blacklisted. Funny thing is: They own the blacklist that they refer to.
This. I tried to run a very slow DHT scraper I was writing on a Hetzner server and within minutes they were on my ass. I don't want to make an enemy of them so I killed it immediately, but they are clearly very sensitive to anything outside of "normal".
If Hetzner actually puts their own customers on their blacklist then that list becomes more trustworthy.
They were right to blacklist you, they were right to complain to you, and they were right not to assume malice and kick you off their platform/shut down your server.
I was able to get 35k req/sec on a single node with Rust (custom http stack + custom html parser, custom queue, custom kv database) with obsessive optimization.
It's possible to scrape Bing size index (say 100B docs) each month with only 10 nodes, under 15k$.
Thought about making it public but probably no one would use it.
Well the most important part seems to be glossed over and that’s the IP addresses. Many websites simply block /want to block anything that’s not google and is not a “real user”.
If anyone actually needs such a dataset, look into CommonCrawl first. I feel using something that already exists will be more cooperative and considerate than everyone overloading every website with their spider. https://commoncrawl.org/overview
Even though, in my case, users add their own domains, it's still took me quite a bit of time to reach 99% chance to crawl a website — with a mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries, otherwise I would get 403 immediately with no HTML being served.
Yes, in this day and age, I could definitely see web pages being harder to crawl by search engines (and SEO companies and other users of automated web crawling technologies (AI agents?)) than they were in the early days of the Internet due to many possible causes -- many of which you've excellently described!
In other words, there's more to be aware of for anyone writing a search engine (or search-engine-like piece of software -- SEO, AI Agent, etc., etc.) than there was in the early days of the Internet, where everything was straight unencrypted http and most URLs were easily accessible without having to jump through additional hoops...
Which leads me to wonder... on the one hand, a website owner may not want bots and other automated software agents spidering their site (we have ROBOTS.TXT for this), but on the flip side, most business owners DO want publicity and easy accessibility for sales and marketing purposes, thus, they'd never want to issue a 403 (or other error code) for any public-facing product webpage...
Thus there may be a market for testing public facing business/product websites against faulty "I can't give you that web page for whatever reason" error codes from a wide variety of clients, from a wide variety of locations around the world.
That market is related to the market for testing if a website is up and functioning properly (the "uptime market"), again, from a wide variety of locations around the world, using a wide variety of browsers...
So, a very interesting post!
Also (for future historians!) compare all of the restrictive factors which may prevent access to a public-facing web page today Vs. Tim Berners-Lee original vision for the web, which was basically to let scientists (and other academic types!) SHARE their data PUBLICLY with one another!
(Things have changed... a bit! :-) )
Anyway, a very interesting post, and a very interesting article -- for both present and future Search Engine programmers!
Consider the case of selfhosting a web service on a low performance server and the abusive crawling goes on loop fetching data (which was happening when I was self hosting gitlab!)
https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...
And then you better make sure your IP is stable and a cloud provider isn't changing any IP assignments in the future, where you'll then have to contact all your clients again with that ask.
Dead Comment
Seems like they're only scraping websites their clients specifically ask them to
Am I missing something here? Even Optane is an order of magnitude slower than RAM.
Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.
Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.
In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.
That's not why. It's because RAM has a narrower bus than VRAM. If it was a matter of distance it'd just have greater latency, but that would still give you tons of bandwidth to play with.
They were right to blacklist you, they were right to complain to you, and they were right not to assume malice and kick you off their platform/shut down your server.
Thought about making it public but probably no one would use it.
Suspicious. I don’t think I’ve ever read anything that says redis taps out below tens of thousands of ops…
If anyone actually needs such a dataset, look into CommonCrawl first. I feel using something that already exists will be more cooperative and considerate than everyone overloading every website with their spider. https://commoncrawl.org/overview