pfooti · 2 years ago
I run a system at my employer that occasionally gets scraped by malicious users. It can be used to infer the purchasability of a specific domain, which is a moderately-interesting API endpoint, since that requires talking to domain registries. For a while, nobody cared enough about it to abuse the endpoint. But then we started getting about 40 QPS of traffic. We normally get less than 1.

I was keeping an eye on it, because we are hard-capped at 100 QPS to our provider; beyond that they start dropping our traffic (it's an outside provider, bundling domain registries like Verisign and such), which breaks regular users if their traffic gets unlucky.

Anyway, after a week of 40 QPS, they start spiking to 200+, and we pull the plug on the whole thing: now each request to our endpoint requires a recaptcha token. This is not great (more friction for legit users = more churn) but it is successful. If they had only kept their QPS low, nobody would have cared. I wanted to send some kind of response code like "nearing quota".

FTR before people ask: it was quite difficult to stop this particular attack, since it worked like a DDOS, smeared across a _large_ number of ipv4 and ipv6 requesters. 50 QPS just isn't enough quota to do stuff like reactively banning IP numbers if the attacker has millions of IPs available.

rapnie · 2 years ago
Maybe mCaptcha [0] is worth a look. It applies a Proof-of-Work-like algorithm (not blockchain-related) which makes it very expensive for scrapers to get data in bulk, but poses the least amount of friction to individual users. The project is implemented in Rust and received NGI.eu/NLnet funding. I don't know its state of production-readiness, but Codeberg.org is considering using it (this choice is informed by higher respect for privacy and improved a11y compared to hCaptcha).

[0] https://mcaptcha.org/

emporas · 2 years ago
In my opinion, PoW is the only reliable way to avoid DDoS attacks. Scraping too much, too quickly is a light form of DDoS, although not an intentional one. PoW was invented in the late 1990s (Hashcash) for exactly that purpose.

The genius of bitcoin (not BTC) is that it provides an organized and practical way for PoW to be used by everyone on the planet. Some people find it strange, because there is an imaginary token created out of pure nothing which represents the PoW. It would work fine without that imaginary token, i.e. bitcoin, just with dollars or euros, but it wouldn't be internet native. The point is always to just send a minimal PoW with every HTTP or TCP/IP request.

toastal · 2 years ago
If it doesn’t involve my free labor training machine learning models like reCAPTCHA and hCAPTCHA and I can still visit sites from a non-Western IP, then it’s already an improvement.
8organicbits · 2 years ago
Really interesting! How efficiently does a web browser compute the PoW? I'm concerned that a bot would use an efficient GPU implementation while real users would run an inefficient JS/webcrypto version.
tambourine_man · 2 years ago
I got interested in mCaptcha, followed your link, but couldn’t find anywhere an example of what the end user would deal with. What kind of PoW are we talking about?
ranger_danger · 2 years ago
There is also this proof-of-work solution https://gitgud.io/fatchan/haproxy-protection/
RobotToaster · 2 years ago
There's also (the imaginatively named) powcaptcha https://git.sequentialread.com/forest/pow-captcha
Zuiii · 2 years ago
> It applies a Proof-of-Work like algorithm (not blockchain-related)

If you're going to waste my energy to do proof-of-work anyway, I'd rather you use it for something useful (even mining crypto-currency to pay for server costs) rather than let it go to waste.

hackernewds · 2 years ago
How does it work? Perhaps it's proprietary and the authors don't want to disclose it, but I did not see anything referenced on the site. I wouldn't plug it into mine without knowing.
RockRobotRock · 2 years ago
hCaptcha's accessibility model seems quite good to me: https://www.hcaptcha.com/accessibility
Szpadel · 2 years ago
I had a similar issue for one of our clients. My strategy to not affect legit users was to enable mitigations only if global traffic crossed some threshold.

E.g. in your case this could mean that if traffic is above, say, 75 QPS then the captcha is enabled, and below that it's disabled.

I don't know what tech stack you are using, but a nice trick that I figured out was to abuse rate limiting to detect global traffic (an if branch on a rate limiter fed a constant as the client id), as sketched below.
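A minimal single-process sketch of that idea in Python (the verify_captcha and serve hooks are hypothetical; a real setup would use the proxy's own rate-limit counters rather than in-app state):

  import time
  from collections import deque

  WINDOW_SECONDS = 5
  GLOBAL_QPS_THRESHOLD = 75   # mitigation kicks in below the 100 QPS hard cap

  _hits = deque()  # timestamps of recent requests, shared across all clients

  def global_traffic_elevated() -> bool:
      """Record one hit and report whether global traffic crossed the threshold."""
      now = time.monotonic()
      _hits.append(now)
      # drop timestamps that fell out of the window
      while _hits and now - _hits[0] > WINDOW_SECONDS:
          _hits.popleft()
      return len(_hits) > GLOBAL_QPS_THRESHOLD * WINDOW_SECONDS

  def handle(request, verify_captcha, serve):
      # only demand a captcha while the site as a whole is under pressure
      if global_traffic_elevated() and not verify_captcha(request):
          return 429, "captcha required while traffic is elevated"
      return serve(request)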

pfooti · 2 years ago
That is more or less what I landed on myself. (Not quite, but similar reactive configs based on traffic thresholds).

For a while, we had to just set off pagers when global traffic exceeded a threshold and manually toggle the extra hardening, but eventually it became a lot more reactive.

pierat · 2 years ago
QPS seems to be "queries per second".

I first thought it was something like "quadrillion bytes per second" or some newer over-the-top data measurement :)

vaylian · 2 years ago
I haven't heard of that abbreviation before. But there is a stub on Wikipedia if anyone cares about improving that article: https://en.wikipedia.org/wiki/Queries_per_second
sbierwagen · 2 years ago
Temporarily add 500ms of latency to all ipv6 users, backoff timers for ipv4 addresses. Since there's only 4 billion v4 addresses, it's easy enough to just track them all in a sqlite db.
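A rough sketch of that bookkeeping with just the standard library (schema, thresholds and the backoff curve are made up; counts are never decayed here, which a real version would want to handle):

  import sqlite3
  import time

  db = sqlite3.connect("ip_backoff.db")
  db.execute("""CREATE TABLE IF NOT EXISTS hits
                (ip TEXT PRIMARY KEY, count INTEGER NOT NULL, blocked_until REAL NOT NULL)""")

  def allow(ip: str) -> bool:
      """Return False while the address is inside its backoff window."""
      now = time.time()
      row = db.execute("SELECT count, blocked_until FROM hits WHERE ip = ?", (ip,)).fetchone()
      if row and now < row[1]:
          return False
      count = (row[0] if row else 0) + 1
      # after a small free allowance, each further hit doubles the cool-down, capped at an hour
      backoff = min(2 ** (count - 5), 3600) if count > 5 else 0
      db.execute("INSERT INTO hits VALUES (?, ?, ?) ON CONFLICT(ip) DO UPDATE "
                 "SET count = excluded.count, blocked_until = excluded.blocked_until",
                 (ip, count, now + backoff))
      db.commit()
      return True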
pfooti · 2 years ago
Mostly works, but there do exist a couple of botnets that contain 1 million compromised machines. If each makes one request before hitting backoff, spread evenly throughout the day, that's about 10 QPS alone before they use an IP number twice. But they tend to not actually level out their usage (which is a bummer - if they did they could have kept using it). Instead they hit with a lot of parallel queries all at once.

There's only so much you can really do when your underlying resource is so limited. Luckily the value of the query is lower than the cost of a recaptcha solve, so the attackers moved on to some other target.

Ironically I could now turn off the endpoint protection (or have it responsive to traffic load), until the attackers return. I shall not go into too many details, no need to give people a map.

Taek · 2 years ago
I don't think that works if your attacker has millions of IPs and is only using 200 per second.
scotty79 · 2 years ago
I think you could make your requests two step.

If somebody wants to access the endpoint, you might send him a challenge first: some random text. The client must append some other text of his choosing, so that when you calculate sha256 of the concatenated text, the first byte or two of it will be zeros.

To access your actual endpoint, the client needs to send that generated text, and you can check whether it results in the required number of zeros. You might demand more zeros in times of heavier load. Each additional bit that you require to be zero increases the number of random attempts needed to find the text by a factor of two.

To make things easier for yourself, the challenge text, instead of being random, might be a hash of the client's request parameters and some secret salt. Then you don't have to remember the text or any context at all between client requests; you can regenerate it when the client sends the second request with the answer.
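A minimal sketch of that scheme in Python (the difficulty, parameter handling and function names are illustrative only):

  import hashlib
  import hmac
  import os

  SECRET_SALT = os.urandom(32)  # per-deployment secret
  DIFFICULTY_BITS = 16          # raise under load; each extra bit doubles client work

  def make_challenge(request_params: str) -> str:
      # stateless: derived from the request plus a secret, so nothing is stored server-side
      return hmac.new(SECRET_SALT, request_params.encode(), hashlib.sha256).hexdigest()

  def leading_zero_bits(digest: bytes) -> int:
      bits = 0
      for byte in digest:
          if byte:
              return bits + 8 - byte.bit_length()
          bits += 8
      return bits

  def solve(challenge: str) -> str:
      """Client side: brute-force a suffix until the hash has enough leading zero bits."""
      nonce = 0
      while True:
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
          if leading_zero_bits(digest) >= DIFFICULTY_BITS:
              return str(nonce)
          nonce += 1

  def verify(request_params: str, nonce: str) -> bool:
      """Server side: regenerate the challenge and check the client's answer."""
      digest = hashlib.sha256(f"{make_challenge(request_params)}{nonce}".encode()).digest()
      return leading_zero_bits(digest) >= DIFFICULTY_BITS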

Honestly I don't know why this isn't a standard option in frameworks for building public facing apis that are expected to be used reasonably.

hombre_fatal · 2 years ago
Because it doesn’t accomplish anything. Things take longer for honest users while botnet abusers don’t even notice that the rented hardware is burning more CPU. Nor does it matter because each request still goes through.
Pmop · 2 years ago
There's a demand, why not supply it, and make money while you are at it?

This reminds me of the RMT-driven botting problem in WoW (World of Warcraft). Instead of fighting the never-ending game of cat and mouse against botters, Blizzard just decided to supply the long-reprimanded demand for in-game currency by creating the WoW token, and they make money while they're at it.

emodendroket · 2 years ago
If such a simple mitigation as adding captchas worked then it seems like the value is pretty marginal, plus it sounds like the limit is upstream.
WA · 2 years ago
I usually make it a two step process:

  host example.com
If this returns an IP address, there's no need to talk to a registrar. Only if there is no IP address do I go with

  whois example.com

schoen · 2 years ago
You could also do an in-between step of

  host -t soa example.com
which should give you domains that have any DNS record at all, not just an A record.
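Roughly, the whole cascade might look like this (a sketch that shells out to the same tools; it assumes `host` and `whois` are installed, treats exit codes loosely, and whois output parsing varies a lot by TLD):

  import subprocess

  def looks_registered(domain: str) -> bool:
      # step 1: a default lookup (A/AAAA/MX) answering means it's clearly registered
      if subprocess.run(["host", domain], capture_output=True).returncode == 0:
          return True
      # step 2: an SOA record catches registered domains with no address records
      if subprocess.run(["host", "-t", "soa", domain], capture_output=True).returncode == 0:
          return True
      # step 3: only now pay for a whois lookup against the registry
      out = subprocess.run(["whois", domain], capture_output=True, text=True).stdout.lower()
      return not ("no match" in out or "not found" in out)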

jb_gericke · 2 years ago
Why not put auth on the endpoint and enforce quotas and rate limiting? An API gateway like Kong could handle this for you.
pfooti · 2 years ago
The endpoint was invoked in our signup funnel, so there was a bootstrapping problem for quota enforcement; the attackers weren't making a whole signup, just getting to the point where the domain search ran.
ricardo81 · 2 years ago
Understandable. I've seen services that offer residential IP address proxies for as low as $1/GB. FWIW the particular service in mind actually pays the IP owners who opt into it.

I guess your tool is asking the registries if domains are registered.

elliotto · 2 years ago
It's always interesting to see these prisoner's dilemma / tragedy of the commons show up in networking. If they hadn't abused the commons it would have worked out better for everyone.
boulos · 2 years ago
Rate limit and return a 429?
pfooti · 2 years ago
See sibling, but the endpoint was part of a signup funnel, so short of rearchitecting it completely to put that check after customer creation, there's no real persistent key to rate limit on. Any one IP ended up getting rate limited to 5 requests per hour on that API, but the attack was incoming from what looked like a botnet, so it was tricky.
fragmede · 2 years ago
Did you ever look inside those queries? Were they the same, just repeated ad nauseam, polling in case their status had changed?
tyingq · 2 years ago
If you can detect it reliably, poisoning them with bad answers might be enough to cause them to find an easier target.
dspillett · 2 years ago
Be careful with that. A friend once did something similar to an active scraper, who then went off and hit his site with a full-blown DDoS.

He knew that was the reason for the intentional DDoS due to messages (along the lines of “think it is funny to poison my information do you?”) in the query string of the bulk requests. Like unpleasant fools making a big noise in shops because they aren't served immediately after cutting in line or some such, the entitled can be quite petty when actively taken to task for the inconvenience they cause.

Passive defences are safer in that regard, assuming you don't get into an arms race with the scrapers, though are unfortunately more likely to mildly inconvenience your good users.

hattmall · 2 years ago
Captcha seems like overkill. Were you not able to implement a JS fingerprinting / bot check before captcha?
reaperman · 2 years ago
Most of the time, the new CAPTCHAs do that and then never show an actual interactive element. The interactive gimmicks are a fallback in cases of high uncertainty.

The webmaster doesn’t need to worry about it, the anti-bot services handle who gets what difficulty of challenge. But the webmaster can specify whether they’d like to be more or less strict/difficult than usual.

pfooti · 2 years ago
Yeah, "recaptcha" in this case is shorthand for a lot of stuff we did to harden the endpoint that ultimately represents a minor shift in balance from no friction to some friction for otherwise legitimate users, but a pretty significant drop in illegitimate traffic.

The main idea here is that if they had just left the scrape running in a way that didn't overwhelm my backend, I wouldn't have cared enough to do any of that. But now they get nothing. Even if you're borrowing bandwidth and not paying, you should be a good neighbor, is all.

Roark66 · 2 years ago
I just wonder, is archive.org getting any government grant money? If they aren't, they should be. And I'm not even talking about just the US. All sorts of countries (US, UK, Germany, to name a few) and international organisations like the EU pour hundreds of millions into "cultural projects" of very questionable value.

How about they actually fund something really worthy of preservation? Of course it is archive.org's role to reach out first. For example, the EU could fund an archive.org mirror in the EU (with certain throughput requirements, etc.).

Of course, opponents of public/government funding have a very good point in that many organisations, when they get public money, find a way to burn through all of it far less efficiently. This can be mitigated by attaching concrete conditions to the grants; one example is a mirror in a specific location.

brylie · 2 years ago
The Internet Archive lists some of their major donors on the About IA page:

https://archive.org/about/

Since they are a U.S. 501(c)(3), they also publish annual reports, which can be downloaded e.g., at the ProPublica Nonprofit Explorer

https://projects.propublica.org/nonprofits/organizations/943...

tkgally · 2 years ago
The Internet Archive has many smaller donors, too. I use the Archive frequently for a variety of purposes and derive immense value from it, so I donate to it every year.
kodah · 2 years ago
Public money would also mean regulation, and I doubt the IA wants to be regulated. They already set themselves apart from things like web standards, given their approach to robots.txt [1]

1: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

mellosouls · 2 years ago
There are various underfunded digital archives in the EU and elsewhere that have legal duties to preserve online content relating to their specific countries.

They may be "of very questionable value" to you, but your solution of removing funding from them to channel it to a much wealthier organisation in a wealthier country is neither ethical, legal, nor practical.

hutzlibu · 2 years ago
I think he or she meant other cultural projects rather than digital archives.

There are indeed lots of cultural projects that get funding that I consider not that important in comparison, but of course the people involved would think differently. (Opera, for example, is heavily subsidized.)

hgsgm · 2 years ago
This is almost completely unresponsive to what parent actually wrote.
jpswade · 2 years ago
Government money would risk making them dependent on it, which would reduce their ability to be self sustainable and independent, which risks becoming political. Best avoided if possible.

Deleted Comment

x-complexity · 2 years ago
Unpopular opinion: Severely rate limit retrieving the files from the website / HTTP endpoint, and loudly point towards downloading the files via torrents.

The torrent protocol was designed with exactly this kind of server load in mind.

userbinator · 2 years ago
> and loudly point towards downloading the files via torrents.

By "loudly", perhaps actually redirect to the .torrent instead, for those who trip the rate limit?

remram · 2 years ago
A Link header would be more appropriate (maybe with rel="alternate"?)
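Something like this, perhaps, once a client trips the limit (a sketch; the item URL is made up):

  from http.server import BaseHTTPRequestHandler, HTTPServer

  class RateLimited(BaseHTTPRequestHandler):
      def do_GET(self):
          # over quota: refuse, but point at a torrent of the same item
          self.send_response(429)
          self.send_header("Retry-After", "3600")
          self.send_header("Link",
                           '<https://archive.org/download/some-item/some-item_archive.torrent>; '
                           'rel="alternate"')
          self.end_headers()

  HTTPServer(("", 8080), RateLimited).serve_forever()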
x-complexity · 2 years ago
...that can work.
maksimur · 2 years ago
Torrents have the habit of disappearing when no users keep them alive. It has happened to me enough times to be wary of such a solution. If there's a way to keep them alive regardless of interest, I'm all for it.
5e92cb50239222b · 2 years ago
How is that worse than serving files via HTTP? With HTTP archive.org is the only peer that can serve files to you, with BitTorrent it will be one of the (hopefully) many peers, and will degrade to the same level of redundancy as HTTP if all other peers leave.

BitTorrent also supports web seeds and they don't even really have to keep a full client running, just embed an HTTP link into the .torrent file.

chronogram · 2 years ago
Have you tried an Archive.org torrent yet? It's backed by the servers, but has the advantages of selecting which parts of the archive you want and being verifiable and being able to have more bandwidth on popular archives. The Archive.org servers show up on the "HTTP sources" tab next to the "Peers" tab for me.
jeroenhd · 2 years ago
The torrents the archive provides are backed by web seeds served by the internet archive web servers.

This approach allows the internet archive to effectively rate limit hosts without making content unavailable entirely, and allows others to help carry the load.

maksimur · 2 years ago
Thanks everyone for your replies. They made me reconsider my opinion. The downvotes are unnecessary, though. Use them when they're actually necessary, not as gatekeeping or as a "don't agree" button or whatever frivolous thing is passing through your mind.
shepherdjerred · 2 years ago
That's no worse than just relying on Archive.org to be the sole provider.
ElongatedMusket · 2 years ago
See https://webtorrent.io/faq for some ideas of doing what you're wishing
nextaccountic · 2 years ago
The very same machines that serve http can also seed torrents. If those machines go offline, they also take down the http endpoint
npteljes · 2 years ago
Torrent needs seeders the same as URLs need their original servers. There's a way to upgrade http to torrent though, like this: https://en.wikipedia.org/wiki/BitTorrent#Web_seeding
happywolf · 2 years ago
Seeding peers or machines are meant to mitigate this issue.
jlokier · 2 years ago
This is what "proof of storage" or "proof of data availability" blockchain networks are for. They use economic incentives to continuously pay nodes a small amount to store some data and keep it available, and the cryptographic sampling mechanism ensures that less popular data must remain in the available dataset for nodes to be paid, even if it is rarely requested in full.
slim · 2 years ago
consider seeding the torrents you download
ranger_danger · 2 years ago
Depending on the type of content you download, the chances of anyone else already seeding that might actually be zero.
andruby · 2 years ago
Archive.org would also act as a seed, so at worst it’s similar to regular download traffic. At best there’s a few more people downloading and seeding.

Dead Comment

tastysandwich · 2 years ago
I have a side project that scrapes thousands and thousands of pages of a single website.

So as not to piss them off (and so they don't try to block me), my script will take about 6 hours. Between each page fetch it sleeps for a small, random amount of time. It's been working like that for years.
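The core of the pattern is just a randomized pause between sequential fetches, something like this (delays and the contact address are made up):

  import random
  import time

  import requests  # any HTTP client works

  def polite_fetch(urls):
      session = requests.Session()
      session.headers["User-Agent"] = "my-side-project (contact: me@example.com)"
      for url in urls:
          yield session.get(url, timeout=30)
          # spread thousands of pages out over hours instead of minutes
          time.sleep(random.uniform(2, 10))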

bazmattaz · 2 years ago
Yes, this is the way. When I was purchasing a car last year I scraped a popular used car website for cars I liked to keep track of deals. I added a random sleep in between each page so the entire script took a few hours to run.
that_guy_iain · 2 years ago
I built a system that scraped GitHub. Even though GitHub can clearly handle lots of traffic, I still rate limited the hell out of it. The only time I've ever seen scrapers get banned is when they go super fast. Unless it's LinkedIn or Instagram, who guard their product, aka their data, as much as possible.
mdaniel · 2 years ago
> I still rate limited the hell out of it

Merely for your consideration, they actually do a great job of indicating in the response how many more requests per "window" the current authentication is allowed, and a header containing the epoch at which time the window will reset: https://docs.github.com/en/rest/overview/resources-in-the-re...

I would suspect, all things being equal, that politely spaced requests are better than "as fast as computers can go", but I was just trying to point out that one need not bend oppressively in the other direction when the site is kind enough to tell the caller about the boundaries in a machine-readable format.
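Honoring them takes only a few lines, e.g. (a sketch using the documented X-RateLimit-* headers; error handling omitted):

  import time

  import requests

  def github_get(url: str, token: str) -> requests.Response:
      resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
      remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
      reset_epoch = int(resp.headers.get("X-RateLimit-Reset", "0"))
      if remaining == 0:
          # sleep until the window resets instead of hammering until banned
          time.sleep(max(0, reset_epoch - time.time()) + 1)
      return resp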

thangngoc89 · 2 years ago
Yes. I did the same. My side project takes 4-5 days for the whole routine because there are random waits in between requests and only 2 active requests at any given moment.
perryprog · 2 years ago
What’s the benefit of waiting a random amount of time between requests?
Lalabadie · 2 years ago
Some zealous systems will infer that a very regular request rate comes from an automated service and block it, no matter how gentle the rate.
galleywest200 · 2 years ago
I recall back when I was less experienced I wrote a scraper for downloading recipes from Taste of Home and let Go firebomb them with as many goroutines as it could handle.

I am sorry, admins! I won't do it next time.

Alifatisk · 2 years ago
Wish more people did like you!
BlueTemplar · 2 years ago
It's not just about "wishing": (D)DoSing is a punishable offence in many jurisdictions, and considering the damage that taking down such a popular and essential website causes, I hope the IA will report them to law enforcement if it happens a third time.
is_true · 2 years ago
Have you tried getting in touch to ask if they have an API?
crazygringo · 2 years ago
Is there an open-source rate limiter that works well for sites large and small?

It just strikes me as surprising that sites are still dealing with problems like this in 2023.

The idea that a site is still manually having to identify a set of IP addresses and block them with human intervention seems absolutely archaic by this point.

And I don't think different sites have particularly different needs here... basic pattern-matching heuristics can identify IP addresses/blocks (plus things like HTTP headers) that are suddenly ramping up requests, and use CAPTCHAs to allow legitimate users through when one IP address is doing NAT for many users behind it. Really the main choice is just whether to block spidering as much as possible, always allow it as long as it's at a reasonable speed, or limit it to a standard whitelist of known orgs (Google, Bing, Facebook, Internet Archive, etc.).

It just strikes me as odd that when you follow a basic tutorial for installing Apache on Linux on Digital Ocean, rate limiting isn't part of it. It seems like it should be almost as basic as an HTTPS certificate by this point.

thayne · 2 years ago
It isn't that hard to set up naive rate limiting per IP address. It's a few lines in haproxy, and there is documentation on how to do it. There are a couple of problems that make it more complicated, though. The first is that with NATs you can have a lot of users behind a single IP address, which can result in legitimate requests getting blocked. The second is that, while it can help against a DoS, it doesn't help that much against a DDoS, because the requests are coming from a large number of distinct IP addresses.
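For comparison, the naive per-IP version really is only a handful of lines even in application code (constants are arbitrary); the NAT and botnet caveats above are exactly where it falls short:

  import time
  from collections import defaultdict

  RATE = 5.0    # requests per second allowed per IP
  BURST = 20.0  # short bursts tolerated

  _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

  def allow(ip: str) -> bool:
      """Token bucket keyed by client IP."""
      b = _buckets[ip]
      now = time.monotonic()
      b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
      b["last"] = now
      if b["tokens"] >= 1.0:
          b["tokens"] -= 1.0
          return True
      return False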
crazygringo · 2 years ago
But it's precisely those complicated parts that seem like they should be solved by now. The solution to both of those problems is that when traffic gets unexpectedly high, either behind a NAT or globally, everybody gets a CAPTCHA, which gives you a token, and then you continue to rate-limit each token. (If users log in, then each login already does this, no CAPTCHA needed.) This strategy is basically what e.g. CloudFlare does with its CAPTCHAs.

Obviously if someone is attempting a large-scale DDoS your servers can't handle it and you'll be using CloudFlare for its scale. But otherwise, for basic protection against greedy spiders who are even trying to evade detection across a ton of VPN/cloud IP's, this strategy works fine. It's exactly the kind of thing that I would expect any large website to implement.

If there isn't an open-source tool that does this, I wonder why not. Or if there is, I wonder why IA isn't using something like it. But heck, IA wasn't even using a simple version -- it was just 64 IP addresses where basic rate-limiting would have worked fine.

nektro · 2 years ago
IPv6 would solve these issues because it removes the need for NATs.
klysm · 2 years ago
I think rate limiting is the wrong approach. If there’s too much load you need to manage the queue
remram · 2 years ago
What does "manage the queue" mean? And how would that be better than dropping requests from the source of the abuse?

Deleted Comment

radq · 2 years ago
Sounds like they don't have rate limiting implemented? That seems odd, and it's also surprising it isn't talked about in the post.
p-e-w · 2 years ago
Indeed. Tens of thousands of requests per second from just 64 hosts? So they allow individual hosts to make hundreds of requests per second, sustained? That sounds crazy. Even for a burst limit hundreds per second would be extremely high.
akiselev · 2 years ago
Archive.org is a core utility for the web, to the point where Wikipedia and many other sites would collapse without it, in the sense that many if not most of their outbound links would be dead forever. I'm pretty sure it would even impact the US justice system [1].

Obviously judges aren’t going to have to worry about reasonable rate limits but if these DDoSes are rare, I’d much rather they dealt with them on a case by case basis. Without some complex dynamic rate limit that scales based on available compute and demand, rate limiting would be a blunt solution that will necessarily generate false positives.

[1] https://www.theregister.com/2018/09/04/wayback_machine_legit...

hackernewds · 2 years ago
Precisely. What kinds of legitimate processes would require this kind of utility from Archive.org?

Deleted Comment

alentred · 2 years ago
Just yesterday there was a comment here on HN [1] about https://jsonip.com, which is essentially supported by a single person (all operational costs included) and gets abused in a somewhat similar manner. I am not even sure what to think: do the folks not understand what they do, or are they just bluntly ignorant of it?

[1] https://news.ycombinator.com/item?id=36092417

marginalia_nu · 2 years ago
Yeah, I'm getting this shit too with Marginalia Search. I'm getting about 2-2.5M queries per day that are definitely from bots and that would 100% sink my server if they somehow got through the bot mitigation. It peaks at hundreds of search queries per second. To be clear, these are search queries, and each search query typically triggers disk reads of about 10-20 MB.

I get about 20,000 queries per day that may be human.

Terretta · 2 years ago
You could update the line:

> The search engine is currently serving about 36 queries/minute.

To:

> The search engine is currently serving about 36 real queries/minute, and deflecting 1806 bot queries/minute (please don't).

lagniappe · 2 years ago
Does your server have its own ISP connection, or is that your home connection?
raverbashing · 2 years ago
> do the folks not understand what they do, or are they just bluntly ignorant of it?

They don't care. And yes I've heard stories of people "finding a service" that does something basic and just dumping their whole traffic onto it

(of course they play the victim once they're found out)

BlueTemplar · 2 years ago
The beatings (via judge) shall continue until they learn netiquette.

Dead Comment

yyyk · 2 years ago
Archive.org is a bit of a special case: you need to call them repeatedly to archive a website. They do have a rate limit there; it's pretty aggressive*, to the point that you could trip it by manually using the site. They must have forgotten to limit the OCR file downloads.

* If they had a better API (a simple asynchronous API would be enough; one where we could send a list of URLs would be even better), one could make a lot fewer calls.

Retr0id · 2 years ago
Last time I wanted to bulk-archive a bunch of urls, I asked about it, and sent a txt file full of URLs to someone and they put it in the archival queue.
llui85 · 2 years ago
They have a Google Sheets "API" which I've used and works reasonably well:

https://archive.org/services/wayback-gsheets/

krackers · 2 years ago
This has been broken for the past month (just stuck on waiting for workers for several days), did they fix it?
abbe98 · 2 years ago
I believe you can upload WARC files to IA and ask them to index the content. It saves them the need to do the archiving, and you won't be rate limited on their end.
account42 · 2 years ago
So they just trust you that your archives are not manipulated?