strogonoff · 8 months ago
A friend of mine co-runs a semi-popular, semi-niche news site (for more than a decade now), and complains that traffic has recently risen due to bots masquerading as humans.

How would they know? Well, because Google, in its omniscience, started to downrank them for faking views with bots (which they do not do): Google shows the bot percentage in traffic stats, and it skyrocketed relative to non-bot traffic (now less than 50%) as they started to fall from the front page, feeding the vicious circle. Presumably, Google does not know or care whether it is a bot when it serves ads, but correlates it later with the metrics it has from other sites that use GA or ads.

Or, perhaps, Google spots the same anomalies that my friend (an old-school sysadmin who pays attention to logs) did, such as the increase in traffic along with never-before-seen popularity among iPhone users (who are so tech-savvy that they apparently do not require CSS), or users from Dallas who famously love their QQBrowser. I'm not going to list all the telltale signs, as the crowd here is too hyped on LLMs (which is our going theory so far; it is very timely), but my friend hopes Google learns them quickly.

These newcomers usually fake the UA, use inconspicuous Western IPs (requests from Baidu/Tencent data center ranges do sign themselves as bots in the UA), ignore robots.txt, and load many pages very quickly.

I would assume the bot traffic increase also applies to feeds, since they are just as useful for LLM training purposes.

My friend does not actually engage in stringent filtering like Rachel does, but I wonder how soon it becomes actually infeasible to operate a website with actual original content (which my friend co-writes) without either that or resorting to Cloudflare or the like for protection, because of the domination of these creepy-crawlies.

Edit: Google already downranked them, not merely threatened to. Also, overall traffic rose but did not skyrocket; it is the relative amount of bot traffic that skyrocketed. (Presumably, without the downranking, traffic would actually have skyrocketed.)

afandian · 8 months ago
Are you saying that Google down-ranked them in search engine rankings for user behaviour in AdWords? Isn't that an abuse of monopoly? It still surprises me a little bit.
malfist · 8 months ago
Who's going to call them on it if it is?
EdwardDiego · 8 months ago
Yeah, but then who is going to stop them acting monopolistic?

New administration is going to be monopoly friendly.

I was honestly pleased that Gaetz was nominated for AG solely because he's big on antitrust. Or has been.

m3047 · 8 months ago
It's not that hard to dominate bots. I do it for fun, I do it for profit. Block datacenters. Run bot motels. Poison them. Lie to them. Make them have really really bad luck. Change the cost equation so that it costs them more than it costs you.

You're thinking of it wrong; the seeds of the thinking error are here: "I wonder how soon it becomes actually infeasible to operate a website with actual original content".

Bots want original content, no? So what's the problem with giving it to them? But that's the issue, isn't it? Clearly, contextually, what you should be saying is "I wonder how soon it becomes actually infeasible to operate a website for actual organic users" or something like that. But phrased that way, I'm not sure a CDN helps (I'm not sure they don't suffer false positives which interfere with organic traffic when they intermediate; it's more security theater, because hangings and executions look good: look at the numbers of enemy dead).

Take measures that any damn fool (or at least your desired audience) can recognize.

Reading for comprehension, I think Rachel understands this.

throaway89 · 8 months ago
what is a bot motel and how do you run one?
blfr · 8 months ago
QQBrowser users from Dallas are more likely to be Chinese using a VPN than bots, I would guess.
strogonoff · 8 months ago
That much is clear, yeah. The VPN they use may not be a service advertised to the public and featured in lists, however.

Some of the new traffic did come directly from Tencent data center IP ranges, and reportedly those bots did sign themselves in the UA. I can't say whether they respect robots.txt, because I am told their ranges were banned around the same time robots.txt was tightened. However, the US-IP bots that remain unblocked and fake the UA naturally ignore robots rules.

m3047 · 8 months ago
I'm seeing some address ranges in the US clearly serving what must be VPN traffic from Asia, and I'm also seeing an uptick in TOR traffic looking for feeds as well as WP infra.
BadHumans · 8 months ago
At my company we have seen a massive increase in bot traffic since LLMs have become mainstream. Blocking known OpenAI and Anthropic crawlers has decreased traffic somewhat so I agree with your theory.
nicbou · 8 months ago
I don’t think it’s a bot thing. Traffic is down for everyone and especially smaller independent websites. This year has been really rough for some websites.
wkat4242 · 8 months ago
I think it's also because a lot of sites have started paywalling. So users walk away.
is_true · 8 months ago
I too found an extremely unlikely percentage of iPhone users when checking access logs.
wiseowise · 8 months ago
> who are so tech savvy that they apparently do not require CSS

Lmao!

m3047 · 8 months ago
Here's Crime^H^H^H^H^(ahem)Cloudflare requesting assets from one of my servers. I don't use Cloudflare; they have no business doing this.

  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /apple-touch-icon-precomposed.png HTTP/1.1" 404 980 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /favicon.ico HTTP/1.1" 200 302 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /dubai-letters/balkanized-internet.html HTTP/1.1" 200 16370 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /apple-touch-icon.png HTTP/1.1" 404 980 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"

  # dig -x 104.28.42.8

  ; <<>> DiG 9.12.3-P1 <<>> -x 104.28.42.8
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 35228
  ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 1280
  ; COOKIE: 6b82e88bcaf538fc7ab9d44467685e82becd47ff4492b1be (good)
  ;; QUESTION SECTION:
  ;8.42.28.104.in-addr.arpa.      IN      PTR

  ;; AUTHORITY SECTION:
  28.104.in-addr.arpa.    3600    IN      SOA     cruz.ns.cloudflare.com. dns.cloudflare.com. 2288625504 10000 2400 604800 3600

  ;; Query time: 212 msec
  ;; SERVER: 127.0.0.1#53(127.0.0.1)
  ;; WHEN: Sun Dec 22 10:46:26 PST 2024
  ;; MSG SIZE  rcvd: 176
Further OSINT left as an exercise for the reader.

Crosseye_Jack · 8 months ago
104.28.42.0/25 is one of the IP ranges used by Apple's Private Relay (via Cloudflare).

https://github.com/hroost/icloud-private-relay-iplist/blob/m...

(There is also a list of ranges on Apple's site, but I forget where…)

Edit: found it https://mask-api.icloud.com/egress-ip-ranges.csv

shadowgovt · 8 months ago
What is the issue with this request?
Apreche · 8 months ago
Feed readers should be sending the If-Modified-Since header, and websites should properly recognize it and send the 304 Not Modified response. This isn't new tech.
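A minimal sketch of what a well-behaved client looks like (Python with the requests library; the feed URL and the in-memory persistence are placeholders, not anyone's actual setup):

  import requests

  FEED_URL = "https://example.com/feed.xml"  # placeholder feed, not a real endpoint

  # In a real reader these validators would be persisted between runs.
  last_modified = None
  etag = None

  def poll_feed():
      global last_modified, etag
      headers = {}
      if last_modified:
          headers["If-Modified-Since"] = last_modified
      if etag:
          headers["If-None-Match"] = etag
      resp = requests.get(FEED_URL, headers=headers, timeout=30)
      if resp.status_code == 304:
          return None  # nothing changed; the server sent no body at all
      resp.raise_for_status()
      # Remember the validators the server handed back for the next poll.
      last_modified = resp.headers.get("Last-Modified", last_modified)
      etag = resp.headers.get("ETag", etag)
      return resp.text  # the full feed, only when it actually changed

Two request headers and the whole "resend the entire feed every 20 minutes" problem goes away.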
graemep · 8 months ago
That is exactly what the article says.
smallerize · 8 months ago
The article implies this but doesn't actually say it. It's nice to have the extra detail.
shkkmo · 8 months ago
The people who already know that a "conditional request" means a request with an If-Modified-Since header aren't the ones who need to learn this information.
dartos · 8 months ago
If only people knew of the standards
righthand · 8 months ago
Yeah but my LLM won’t generate that code.
Havoc · 8 months ago
Blocked for 2 hits in 20 minutes on a light protocol like RSS?

That seems hilariously aggressive to me, but her server, her rules, I guess.

II2II · 8 months ago
If your feed reader is refreshing every 20 minutes for a blog that is updated daily, nearly 99% of the data sent is identical. It looks like Rachel's blog is updated (roughly) weekly, so that jumps to 99.8%. It's not the least efficient thing in the world of computers, but it is definitely incurring unnecessary costs.
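Back-of-the-envelope: polling every 20 minutes is 72 requests a day, so with one new post a day 71 of 72 responses (about 98.6%) carry nothing new, and with one post a week it's 503 of 504 (about 99.8%).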
elashri · 8 months ago
I opened the XML file she provides on the blog and it seems very long, but okay. Then I decided it is a good blog to subscribe to, so I went and tried to add it to my self-hosted FreshRSS instance (same IP, obviously) and I couldn't, because I got blocked/rate-limited. So yes, it is aggressive, for different reasons.
wakawaka28 · 8 months ago
It should be a timeboxed block if anything. Most RSS users are actual readers and expecting them to spend lots of time figuring out why clicking "refresh" twice on their RSS app got them blocked is totally unreasonable. I've got my feeds set up to refresh every hour. Considering the small number of people still using RSS and how lightweight it is, it's not bad enough to freak out over. At some point all Rachel's complaining and investigating will be more work than her simply interacting directly with the makers of the various readers that cause the most traffic.
sccxy · 8 months ago
Her RSS feed is the last 100 posts with full content.

So that means 30 months of blog post content in a single request.

Sending 0.5MB in a single RSS request is more of a crime than those 2 hits in 20 minutes.

horsawlarway · 8 months ago
I generally agree here.

There are a lot of very valid use cases where defaulting to deny for an entire 24-hour cycle after a single request is incredibly frustrating for your downstream users (a shared IP at my university means I will never get a non-429 response... and God help me if I'm testing new RSS readers...).

It's her server, so do as you please, I guess. But it's a hilariously hostile response compared to just returning less data.

aidenn0 · 8 months ago
If there were a widely supported standard for pagination in RSS, then it would make sense to limit the number of posts. As there isn't, sending 500kB seems eminently reasonable, and RSS readers that send conditional requests are fine.
EdwardDiego · 8 months ago
Did you actually write 500KB as 0.5MB to make it sound BIGGER?

Clever.

wakawaka28 · 8 months ago
Yes that's right. Most blogs that are popular enough to have this problem send you the last 10 post titles and links or something. THAT is why people refresh every hour, so they don't miss out.
BonoboIO · 8 months ago
Complains about traffic, sends 0.5mb of everything.

That’s my kind of humor.

xyzsparetimexyz · 8 months ago
Sigh. Feed readers set the If-Modified-Since header so that the feed is only re-sent when there are new items.
cesarb · 8 months ago
> Blocked for 2 hits in 20 minutes on a light protocol like rss?

I might be getting old, but 500KB in a single response doesn't feel "light" to me.

sccxy · 8 months ago
Yes, this is a very poorly designed RSS feed.

500KB is horrible for RSS.

garfij · 8 months ago
I believe if you read carefully, it's not blocked, it's rate limited to once daily, with very clear remediation steps included in the response.
that_guy_iain · 8 months ago
If you understand what rate limiting is, you know it means blocking them for a period of time. Let's stop being pedantic here.

72 requests per day is nothing, and acting like it's mayhem is a bit silly. For a lot of people it would just result in them getting news more slowly. Sure, OP won't publish that often, but their rate limiting is an edge case and should be treated as such. If readers are blocked until the next day and nothing gets updated, then the only person harmed is OP, for being overly bothered by their HTTP logs.

Sure, it's their server and they can do whatever they want. But all this does is hurt the people trying to reach their blog.

mrweasel · 8 months ago
But it's not a "light" protocol when you're serving 36MB per day, when 500KB would suffice. RSS/Atom is lightweight if clients play by the rules. This could also have been a news website; imagine how much traffic would be dedicated to pointless transfers of unchanged data. Traffic isn't free.

A similar problem arises from the increase in AI scraper activity. Talking to other SREs, the problem seems pretty widespread. AI companies will just hoover up data, but revisit so frequently and aggressively that it's starting to affect the transit feeds for popular websites. Frequently the user agent isn't set to something unique, or is deliberately hidden, and the traffic originates from AWS, making it hard to target individual bad actors. Fair enough that you're scraping websites, that's part of the game when you're online, but when your industry starts to affect transit feeds, then we need to talk compensation.

yladiz · 8 months ago
That's a bit disingenuous. 429s aren't "blocking"; they're telling the requester that they've made too many requests and should try again later (with a value in the Retry-After header). I assume the author configured this because they know how often the site typically changes. That the web server eventually stops responding if the client keeps ignoring the 429s isn't that surprising, but I doubt that part was configured directly.
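For what it's worth, honoring that hint is only a few lines on the client side. A rough sketch (Python with requests; the URL and the one-hour fallback are made up, and Retry-After can also be an HTTP date, which this skips):

  import time
  import requests

  FEED_URL = "https://example.com/feed.xml"  # placeholder

  def fetch_respecting_429():
      while True:
          resp = requests.get(FEED_URL, timeout=30)
          if resp.status_code != 429:
              resp.raise_for_status()
              return resp.text
          # The server says we're asking too often; back off for as long as it tells us to.
          retry_after = int(resp.headers.get("Retry-After", 3600))
          time.sleep(retry_after)
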
Havoc · 8 months ago
Semantics. 429 is an error code. Rate limiting... blocking... too many requests... ignoring... call it whatever you like, but it amounts to the same thing, namely that the server isn't serving the requested content.
luckylion · 8 months ago
> 429s aren’t “blocking”

Like how "unlimited traffic, but will slow down to 1bps if you use more than 100gb in a month" is technically "unlimited traffic".

But for all intents and purposes, it's limited. And 429s are blocking. They include a hint about why you are blocked and when the block might expire (Retry-After doesn't promise that you'll be successful if you wait), but besides that, what's the difference compared to a 403?

that_guy_iain · 8 months ago
I would say it's disingenuous to claim that sending an HTTP status and body the client did not ask for, for a period of time, is not blocking them for that period of time. You can be pedantic and claim "but they can still access the server", but in reality that client is blocked for a period of time.
jannes · 8 months ago
The HTTP protocol is a lost art. These days people don't even look at the status code; they expect some mumbo-jumbo JSON payload explaining the error.
klntsky · 8 months ago
I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps but are not app-specific. They are effectively part of every API automatically, without any consideration of whether they are needed.

People often implement error handling using constructs like regexp matching on status codes, while with domain-specific errors it would be obvious exactly what the range of possible errors is.

Moreover, when people do implement domain errors, they just have to write more code to handle two nested levels of branching.

throw0101b · 8 months ago
> I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps, but are not app-specific.

Perhaps put the app-specific part in the body of the reply. In the RFC they give a human-readable reply to (presumably) be displayed in the browser:

   HTTP/1.1 429 Too Many Requests
   Content-Type: text/html
   Retry-After: 3600

   <html>
      <head>
         <title>Too Many Requests</title>
      </head>
      <body>
         <h1>Too Many Requests</h1>
         <p>I only allow 50 requests per hour to this Web site per
            logged in user.  Try again soon.</p>
      </body>
   </html>
* https://datatracker.ietf.org/doc/html/rfc6585#section-4

* https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429

But if the URL is specific to an API, you can document that you will/may give further debugging details (in text, JSON, XML, whatever).
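As a sketch of that pattern (Python/Flask chosen arbitrarily; the error fields and limits are invented, not from any particular API):

  from flask import Flask, jsonify

  app = Flask(__name__)

  @app.route("/api/feed")
  def feed():
      # ... actual rate-limit bookkeeping elided ...
      resp = jsonify({
          "error": "rate_limited",            # app-specific error code for machines
          "detail": "50 requests per hour allowed per user",
          "retry_after_seconds": 3600,
      })
      resp.status_code = 429                  # the standard part stays in the status line
      resp.headers["Retry-After"] = "3600"
      return resp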

marcosdumay · 8 months ago
> because they are intended to be consumed by apps, but are not app-specific

Well, good luck designing any standard app-independent protocol that works and doesn't do that.

And yes, you must handle two nested levels of branching. That's how it works.

The only improvement possible to make it clearer is having codes for API-specific errors... which is not exactly what 400 and 500 are. But then, that doesn't gain you much.

est · 8 months ago
> error handling using constructs like regexp matching on status codes

Oh, the horror. I would assume the practice is encouraged by "RESTful" people?

KomoD · 8 months ago
That's because a lot of people refuse to use status codes properly, like just using 200 everywhere.
kstrauser · 8 months ago
A colleague who should’ve known better argued that a 404 response to an API call was confusing because we were, in fact, successfully returning a response to the client. We had a long talk about that afterward.
AznHisoka · 8 months ago
I don't look at the code because it's wrong sometimes. Some pages return a 200 yet display an error on the page.
DaSHacka · 8 months ago
Nothing more annoying than a 200 response when the server 'successfully' serves a 404 page

Deleted Comment

shepherdjerred · 8 months ago
I like Rachel's writing, but I don't understand this recent crusade against RSS readers. Sure, they should work properly and optimizations can be made to reduce bandwidth and processing power.

But... why not throw a CDN in front of your site and focus your energy somewhere else? I guess every problem has to be solved by someone, but this just seems like a very strange hill to die on.

EdwardDiego · 8 months ago
Because she's an old-school sysadmin, mate; she likes running her own stuff her own way, fair enough.

And she posts about it a lot because she has a bunch of RSS clients pointed at her writing, because she's rather popular.

And she'd rather people writing this stuff just learn HTTP properly, at least out of professionalism, if not courtesy.

Hey, you might not, I might not, but we all choose our hills to die on.

My personal hill is "It's lollies and biscuits, not candy and cookies".

rollcat · 8 months ago
> why not throw a CDN in front of your site [...]

Because this is how the open web dies - one website at a time. It's already near-dead on the client side - web browsers are not really "user" agents, but agents of oligopolist corporations, that have a stake in abusing you[1].

It's been attempted before with WAP[2], then AMP. But effectively, we're almost there.

[1]: https://www.5snb.club/posts/2023/do-not-stab/

[2]: https://news.ycombinator.com/item?id=42479172

est · 8 months ago
> But... why not throw a CDN in front of your site and focus your energy somewhere else?

Yes, it's been invented before; it was known as Feedburner, which was acquired and abandoned by Google.

generationP · 8 months ago
Rejecting every unconditional GET after the first? That sounds a bit excessive. What if the reader crashed after the first and lost the data?
brookst · 8 months ago
It's an RSS feed. In that case, wait until the specified time, try again, and any missed article will appear then. If it is constantly crashing so articles never get loaded, fix that.
aleph_minus_one · 8 months ago
> If it is constantly crashing so articles never get loaded, fix that.

This often requires running lots of tests against the endpoint, which the server prohibits.

XCabbage · 8 months ago
For that matter, what if it's a different device (or an entirely different human being) on the same IP address?
bombcar · 8 months ago
At some point, instead of a 429, it should return a feed with this post always as the newest item.
cpeterso · 8 months ago
That's a great point: the client software isn't listening to the server, so the server software should break the loop by escalating to the human reader. The response message should probably be even more direct, with a call to action about their feed reader (naming it, if possible) causing server problems.
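A sketch of what that escalation could look like, served in place of the usual newest entry once a client has blown past the limit repeatedly (the wording, link, and guid are invented for illustration):

  from datetime import datetime, timezone

  WARNING_ITEM = """\
  <item>
    <title>Your feed reader is hammering this site</title>
    <link>https://example.com/feed-abuse.html</link>
    <guid isPermaLink="false">feed-abuse-warning</guid>
    <pubDate>{now}</pubDate>
    <description>This reader polled far more often than the feed changes and sent no
    conditional requests. Please fix or reconfigure it; see the link for details.</description>
  </item>"""

  def warning_item():
      # RFC 822 date format, as RSS expects in pubDate.
      now = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
      return WARNING_ITEM.format(now=now)
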
0xDEAFBEAD · 8 months ago
Or a feed with only this post
PaulHoule · 8 months ago
This is why RSS is for the birds.

My RSS reader YOShInOn subscribes to 110 RSS feeds through Superfeedr, which absolves me of the responsibility of being on the other side of Rachel's problem.

With RSS you are always polling too fast or too slow; if you are polling too slow you might even miss items.

When a blog post gets published, Superfeedr hits an AWS Lambda function that stores the entry in SQS so my RSS reader can update itself at its own pace. The only trouble is that Superfeedr costs 10 cents per feed per month, which is a good deal for an active feed such as Hacker News comments or articles from The Guardian, but is not affordable for subscribing to 2000+ indie blogs, which YOShInOn could handle just fine.
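The glue is tiny; roughly something like this (the queue URL is a placeholder, and I'm guessing at the exact shape of the Superfeedr JSON payload):

  import json
  import boto3

  sqs = boto3.client("sqs")
  QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-items"  # placeholder

  def lambda_handler(event, context):
      # Superfeedr POSTs new entries as JSON; API Gateway passes the raw body through.
      payload = json.loads(event["body"])
      for entry in payload.get("items", []):
          sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(entry))
      # The reader drains the queue later, at its own pace.
      return {"statusCode": 200, "body": "ok"}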

I might yet write my own RSS head end, but there is something to be said for protocols like ActivityPub and the AT Protocol.

rakoo · 8 months ago
That's why WebSub (formerly PubSubHubbub) was created, and it should be the proper solution, not a proprietary middleware.
PaulHoule · 8 months ago
Superfeedr is PubSubHubbub.