lucb1e · 4 months ago
Funny, I saw this HN headline just after banning another scraper's IP range

You're welcome to scrape my sites, but please do it ethically. Idk exactly how to define that, but here are some examples of things I consider not cool (a sketch of the polite version follows the list):

- Scraping without a contact method, or at least some unique identifier (like your project's codename), in the user agent string.

This is common practice; see e.g. <https://en.wikipedia.org/wiki/User-Agent_header#Format_for_a...>. Many sites' public API guidelines ask you to include an email address so you can be contacted in case of problems. If you don't include this and you're causing trouble, all I can do is ban your IP address altogether (or entire ranges: if you hop between several IPs, I have to assume you have access to the whole range). Nobody likes IP bans: you have to get a new IP, your provider has a burned IP address, the next customer runs into issues... don't be this person; include an identifier.

- Timing out the request after a few seconds.

Some pages on my site involve number crunching and take 20 seconds to load. I could add complexity to compute this async instead, but by keeping it live, regular users get the latest info, they know to just wait a few seconds, and everybody is happy. Even the scrapers can get the info; I'm fine computing those pages for you. But if you ask me to do work and then walk away, that's just rude. It shows up in my logs as HTTP status 499, and I'll ban scrapers that I notice doing this regularly.

- Ignoring robots.txt.

I have exactly one entry in there, and that's a caching proxy for another site that is struggling with load. If you ignore the robots file and just crawl the thing from A to Z at a high rate, that causes a lot of requests to the upstream site to refresh stale caches. You can obviously expect a ban, because that is again just a waste of resources.
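To make it concrete, here is a minimal sketch of what I'd consider a polite scraper (the bot name, URLs, and pacing are illustrative, not anything my site specifically requires):

```python
# A polite scraper: identify yourself, wait for slow pages, honor robots.txt.
import time
import requests
from urllib.robotparser import RobotFileParser

BOT_NAME = "examplebot"  # hypothetical project codename
UA = f"{BOT_NAME}/1.0 (+https://example.org/bot.html; bot-admin@example.org)"

robots = RobotFileParser()
robots.set_url("https://example.org/robots.txt")
robots.read()

def fetch(url: str) -> str | None:
    if not robots.can_fetch(BOT_NAME, url):
        return None  # disallowed: skip it instead of crawling anyway
    # generous timeout so slow, number-crunching pages aren't abandoned mid-render
    resp = requests.get(url, headers={"User-Agent": UA}, timeout=60)
    resp.raise_for_status()
    time.sleep(1)  # pace requests instead of hammering the host
    return resp.text
```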

whazor · 4 months ago
I find it unethical for a website's robots.txt to allow-list particular search engines and ban all others. Essentially you are colluding with the established search providers.
Loic · 4 months ago
Not necessarily. I have a website where 95% (maybe even more) of the traffic is generated by crawlers. If some of them behave badly, it is fair to exclude them with my robots.txt.

But of course, the ones behaving badly tend to not respect the robots.txt, so you end up banning the IP or IP block.

And I'm being nice here: a crawler must really be a piece of crap before I start blocking it.

ToucanLoucan · 4 months ago
This rather bluntly runs up against the fact that permitting crawling is an expense the web operator takes on; ergo, receiving that content is by definition a privilege, not a right.
lucb1e · 4 months ago
I don't know if that's a reply to me or a general remark, but yes, I never understood why you'd allow a few big names and ban all the rest, for example. That just screams anticompetitiveness. I don't know if my mention of robots.txt sounded like I do this, but I do not.
datatrashfire · 4 months ago
> - Scraping without a contact method, or at least some unique identifier (like your project's codename), in the user agent string.

This is a very effective way to make sure you won't get any scraping done!

lucb1e · 4 months ago
Tell that to Googlebot, Bingbot, Petalbot, SemrushBot, MJ12bot, MojeekBot, DotBot, YandexBot, SeznamBot, Barkrowler, AhrefsBot, DuckDuckBot, AcademicBotRTU, Bytespider, Applebot, ZoominfoBot, TelegramBot, TwitterBot, SemanticScholarBot, redditbot, Pinterestbot... From a quick peek at my access log, all of them include either a link (most) or an email address (zoom, tiktok/bytedance, dotbot, and that academic bot).

Very few individual bots fail to follow this good practice. Most of the IP ranges of violating bots are owned by Huawei (a few are Huawei Cloud, so those could be anyone, but the majority seems to be Huawei themselves), and the remainder is all small beans as far as I remember (a few thousand accesses in a day and then they disappear forever, for example).

Fokamul · 4 months ago
Who cares, IP ranges are cheap. You're just banning datacenters.
edoceo · 4 months ago
What do you have for log analytics and ban automation? Could you say more about how to identify these bad-bots?
lucb1e · 4 months ago
There is no automation; I use `tail -f access.log`.

I just look at what's happening on my server every now and then, sometimes not for months. But since I set up a project like that caching proxy, I'm currently keeping a more regular eye on it to make sure crawlers aren't bothering the upstream via me. Most respect the robots policy, and most of the ones that don't set a user agent string that includes the word 'bot', so I know not to refresh the cache based on those requests. So far it has mostly been Huawei who pretend to be a regular user but request millions of pages (from 12 separate IP ranges so far, some of them bigger than /16, some of them a handful of /24s).

> Could you say more about how to identify these bad-bots?

Many requests per day to random pages, coming either from the same IP address (or range), or from ranges owned by the same corporation.
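For the curious, the manual version boils down to something like this (a sketch assuming a common/combined log format where the client IP is the first field, IPv4 only for brevity):

```python
# Count requests per /24; a big count from one range spread over
# random pages is the tell-tale sign of a misbehaving crawler.
from collections import Counter

counts = Counter()
with open("access.log") as f:
    for line in f:
        ip = line.split(" ", 1)[0]             # first field = client IP
        prefix = ".".join(ip.split(".")[:3])   # collapse to the /24
        counts[prefix] += 1

for prefix, n in counts.most_common(20):
    print(f"{prefix}.0/24  {n}")
```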

reconnecting · 4 months ago
We use tirreno [1] to manually and automatically analyze traffic and block unwanted bots. Although bot management is not currently listed as an official feature, it works well and is particularly helpful in complex bot hunting.

[1] https://github.com/TirrenoTechnologies/tirreno

VladVladikoff · 4 months ago
What sort of pages require 20 seconds to generate? That is extremely slow by most web standards, and even your users would be frustrated by it. It sounds like poorly designed database queries with unindexed joins.

Google will also abandon page loads that take too long, and will demote rankings for that page (or the entire site!)

lucb1e · 4 months ago
> It sounds like poorly designed database queries with unindexed joins

Neither of those assumptions is correct. As an example, one page needs to look through 2.5 million records to find where the world record holder changed, because it provides stats on who held the most records, who held them for the greatest cumulative time, etc. The only thing to do would be introducing caching layers for parts of the computation, but for the number of users this system has, it's just not worth spending more development time than I already have. Also keep in mind it's a free web service and I don't run ads or anything; it's just a fan project for a game.

> Google will ... demote rankings for that page (or the entire site!)

Google employs anticompetitive practices to maintain its search monopoly. We need more diversity in search engines, and I don't know how else to encourage people to use something instead of, or at least in addition to, Google besides making Google Search just not competitive anymore. Google's crawler cannot access my site in the first place (but their other crawlers can; I'm pretty selective about this). My sites never show up in Google searches, on purpose.

It's also not the whole site that's slow; it's a handful of specific pages. If that makes those pages not appear in search results, that's fine. Besides, it's not my loss: it's not like any other site has the info, so people will find their way to the main page and click on what they want to see.

selcuka · 4 months ago
> It sounds like poorly designed database queries with unindexed joins.

I find it amusing that you think every database operation imaginable can be performed in less than 20 seconds if we throw in a few indexes. Some things are slow no matter how much you optimise them.

The GP could have implemented them as async endpoints, or callbacks, but obviously they've already considered those options.

beatthatflight · 4 months ago
So what about flight searches, where we have to query several 3rd-party providers and it can take 45 seconds to get results from all of them (out of my control)? I can dynamically update the page (and do), but a scraper would have to wait 20-45 seconds to get the 'cheapest' flight from my site. I can make the queries async and have the fastest pipes, but if the upstream providers take their time (they need to query their GDSs as well), there's not much you can do.
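For reference, the fan-out itself is simple enough (a sketch with asyncio and httpx; the provider URLs are hypothetical); the latency is all upstream:

```python
# Query every provider concurrently; the slowest upstream still
# bounds how long the full result set takes.
import asyncio
import httpx

PROVIDERS = ["https://gds-a.example/search", "https://gds-b.example/search"]

async def search_all(params: dict) -> list[dict]:
    async with httpx.AsyncClient(timeout=45) as client:
        tasks = [client.get(url, params=params) for url in PROVIDERS]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
    # keep successful responses, drop providers that timed out or errored
    return [r.json() for r in responses
            if isinstance(r, httpx.Response) and r.status_code == 200]

results = asyncio.run(search_all({"from": "LHR", "to": "JFK"}))
```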
andrethegiant · 4 months ago
Shameless plug: prefix any URL with https://pure.md/ to get the pure markdown of that page. Useful for direct piping into an LLM. Has bot detection avoidance, proxy rotation, and headless JS rendering built in.
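e.g. from Python (the target URL is just an example):

```python
import requests

# prefix the target URL with pure.md to get that page back as markdown
md = requests.get("https://pure.md/https://example.com").text
print(md)
```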
yoble · 4 months ago
Love the easter egg when going to https://pure.md/https://pure.md
matt-p · 4 months ago
That's excellent pricing from a structural perspective.
fredoliveira · 4 months ago
that looks fantastic - well done!
smartmic · 4 months ago
My preferred "self-hosted" webscraper is a local, single binary called xidel [1]. The feature I really like is that it can also follow links.

[1] https://github.com/benibela/xidel

darkwater · 4 months ago
Wow, it's written in Pascal! That surely takes me down memory lane.
DocTomoe · 4 months ago
With Pascal being my first "adult" language, though one I haven't used in 20 years... it is surprising how readable that code is. Makes me wish for those simpler times.
renegat0x0 · 4 months ago
Not a web scraper, but web crawler software [1]. It lets you specify the crawling method (Selenium and others) and returns data as JSON (status code, text contents, etc.).

[1] https://github.com/rumca-js/crawler-buddy

TheTaytay · 4 months ago
Does anyone know of a scraper that uses LLMs/natural language to build a deterministic, robust script that I can use to scrape the same site in the future? All of the natural language extractors I’ve seen so far need an LLM every time, but that seems unnecessary…
throwup238 · 4 months ago
llm-scraper [1] does a decent job but it's still a bit fragile. The biggest problem I have is all the React CSS-in-JS libraries that use hashes in their class names, which the LLM isn't smart enough to ignore.

[1] https://github.com/mishushakov/llm-scraper
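One preprocessing workaround (my own sketch with BeautifulSoup, not something llm-scraper does for you): strip class attributes before handing the HTML to the model, so the hashed names can't distract it.

```python
from bs4 import BeautifulSoup

def strip_classes(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):       # True matches every tag
        tag.attrs.pop("class", None)      # drop hashed CSS-in-JS class names
    return str(soup)
```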

cdolan · 4 months ago
What have you had success doing with this? Curious to test it
TheTaytay · 4 months ago
Nice! Thanks!
cdolan · 4 months ago
We’ve built one internally using browser-use to generate playwright code

Works ok. Not as automated as I’d like

nicman23 · 4 months ago
they are all quite bad
nomilk · 4 months ago
I used to scrape back in the day when it was easy (literally just make a request and parse the HTML). Cloudflare checkboxes / human verification seem very commonplace nowadays. Curious how (or if) web scrapers get around those?
welanes · 4 months ago
1. Clicking the box programmatically – possible but inconsistent

2. Outsourcing the task to one of the many CAPTCHA-solving services (2Captcha etc) – better

3. Using a pool of reliable IP addresses so you don't encounter checkboxes or turnstiles – best

I run a web scraping startup (https://simplescraper.io) and this is usually the approach[0]. It has become more difficult, and I think a lot of the AI crawlers are peeing in the pool with aggressive scraping, which is making the web a little bit worse for everyone.

[0] Worth mentioning that once you're "in" past the captcha, a smart scraper will try to use fetch to access more pages on the same domain so you only need to solve a fraction of possible captchas.
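Roughly, the fetch trick looks like this (a Playwright sketch; URLs illustrative): pass the challenge once in a real browser context, then reuse that context for the rest.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/page-1")  # pass the challenge here once
    # subsequent pages via in-page fetch: same cookies, no new navigations
    html = page.evaluate(
        "url => fetch(url).then(r => r.text())",
        "https://example.com/page-2",
    )
    browser.close()
```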

nomilk · 4 months ago
That's awesome. Thanks for sharing.

First time hearing of the fetch() approach! If I understand correctly, regular browser automation typically makes a separate GET request (a full navigation) for each page, whereas the fetch() strategy makes a GET for the first page (just as with regular browser automation) and then, after satisfying Cloudflare, uses fetch(<url>) to retrieve the rest of the pages you're after.

This approach is less noisy, has less impact on the server, and is therefore less likely to get noticed by bot detection.

This is fascinating stuff. (I'd previously used very little JavaScript in scrapes, preferring Ruby, R, or Python, but this may tilt my tooling preferences toward more JS.)

cess11 · 4 months ago
A low-effort baseline would be https://seleniumbase.io/, which drives a preconfigured web browser that looks relatively human to the network service. Typically it just clicks through the one-click captchas.

If that's not good enough, you'll likely have to fiddle with your own web driver, and possibly a computer-vision rig, to click through the 'find the motorcycle' kind of challenges. Paying a click farm to do it for you is probably cheaper in the short run.

An important hurdle is getting reputable IPv4 addresses to do it from, if you're going to do it a lot. Having or renting a botnet could help, but might be too illegal for your use case.
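That baseline looks something like this (a sketch using SeleniumBase's UC mode; the URL is illustrative):

```python
from seleniumbase import SB

# UC ("undetected-chromedriver") mode presents a more human-looking browser
with SB(uc=True) as sb:
    sb.uc_open_with_reconnect("https://example.com", reconnect_time=4)
    print(sb.get_page_title())
```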

gruez · 4 months ago
>Seems cloudflare checkboxes / human verification are very commonplace nowdays. Curious how(/if) web scrapers get around those?

You can get a real browser[1] to check the box for you, then use the cookies in your "dumb" scraper.

[1] https://github.com/FlareSolverr/FlareSolverr
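Something like this (a sketch against FlareSolverr's local API; assumes it's running on the default port):

```python
import requests

# ask FlareSolverr to pass the challenge, then reuse its cookies and UA
resp = requests.post("http://localhost:8191/v1", json={
    "cmd": "request.get",
    "url": "https://example.com",
    "maxTimeout": 60000,
}).json()

solution = resp["solution"]
cookies = {c["name"]: c["value"] for c in solution["cookies"]}
headers = {"User-Agent": solution["userAgent"]}

# the "dumb" scraper, now carrying the solved session
page = requests.get("https://example.com/other-page",
                    cookies=cookies, headers=headers, timeout=30)
```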

ricardo81 · 4 months ago
Some CDNs go to the length of fingerprinting the TLS and HTTP/2 handshakes to see if you're a bot. As others have mentioned, using an automated browser tends to be the broadest solution.
nicman23 · 4 months ago
I usually use a real browser, the same one I use day to day, profile and all.
anxman · 4 months ago
By clicking the box
monkeydust · 4 months ago
Practical use-case.

I am looking for a way to throw an address at a planning authority (UK) and download the associated documents for that property. Could this or another tool help?

e.g.

https://publicaccess.barnet.gov.uk/online-applications/appli...

As a purely random example.

A property can have multiple planning applications, and under each, many documents.

What I have found useful (it saved me time and potentially lost £££) is to take the documents, combine them into a single PDF, provide it to Gemini 2.5 Pro, and then ask it to validate the estate agent's specification for the property.

Over the weekend I found a place that was advertising a feature of the house that was explicitly prohibited by the planning decision notice.

I called the agent up on it; they claimed no knowledge but said this would have come up through solicitor checks. Which it would have, much later in the process, with more of my money spent and considerable time lost.

Of course, all of this is possible without LLMs; they just make it easier/cheaper to check at scale.

cess11 · 4 months ago
You could just cut out the href values with grep and sed, or a bit of scripting; '.pdf' seems to occur only in those links.

I'd keep it simple like that until I need to do periodic comparisons, i.e. until I actually need scrapers and am prepared to build what's needed to automatically watch and process the directories where the scrapers put the files.
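The scripting version is a few lines (a sketch; the page URL is a stand-in, and the regex assumes ordinary double-quoted hrefs):

```python
import re
import requests

PAGE_URL = "https://example.gov.uk/planning/documents"  # stand-in for the documents page

html = requests.get(PAGE_URL, timeout=30).text
# pull every href that points at a PDF
for link in re.findall(r'href="([^"]+\.pdf[^"]*)"', html):
    print(link)
```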

3abiton · 4 months ago
> extract data from websites with precision using XPath selectors.

I've used XPath for crawling with Selenium, and it used to be my favorite approach, but it turned out to be quite unreliable unless you combine it with other selectors, as certain websites are really badly designed and have no good patterns. So what's the added value over pure Selenium?
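What I mean by combining (a Selenium sketch; the locators are illustrative): anchor on visible text or stable attributes rather than generated class names, with a CSS fallback.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

# brittle: depends on a generated class name that changes between builds
# driver.find_element(By.XPATH, "//div[@class='css-1x2y3z4']")

# sturdier: anchor on visible text, with a CSS-attribute fallback
el = driver.find_element(By.XPATH, "//a[contains(normalize-space(.), 'Download')]")
fallback = driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")
```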

cess11 · 4 months ago
Check whether the site is actually server-side rendered, because if it's a browser client that talks JSON to the backend, you can just do the same.
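i.e. find the request in the browser's network tab and replay it (a sketch; the endpoint and response shape are hypothetical):

```python
import requests

# if the "page" is really a JS client calling a JSON API, skip the HTML entirely
resp = requests.get(
    "https://example.com/api/v1/items",  # hypothetical endpoint from the network tab
    params={"page": 1},
    headers={"Accept": "application/json"},
    timeout=30,
)
for item in resp.json()["items"]:  # response shape assumed for illustration
    print(item)
```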