Using HTTP meta-headers is actually something we seem to have forgotten how to do.
The one that annoys me most is the Accept-Language header, which is almost entirely ignored in favour of GeoIP lookups to figure out regionality... which I find super odd, as if people are walking around using a browser in a language they don't speak (or an operating system configured for a language they don't speak).
ETags, though, are a bit fraught: if you're a company, a security scan will fire if an ETag is detected, because you might be able to figure out the inode on the filesystem from it... which, I don't know why that's a security problem either way[0], and it's common for there to be false positives[1]... which makes people not respect the header.
Last-Modified should work, though; I love the idea of checking headers and not content.
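For what it's worth, honouring these validators on the client side is only a few lines. A minimal sketch of a conditional fetch in Python; the feed URL is a placeholder, and a real reader would persist the validators between runs instead of keeping them in memory:

    import urllib.request
    import urllib.error

    FEED_URL = "https://example.com/atom.xml"  # placeholder

    def fetch(etag=None, last_modified=None):
        req = urllib.request.Request(FEED_URL)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:  # Not Modified: nothing to download
                return None, etag, last_modified
            raise
        # Remember the validators the server sent for the next poll.
        return (resp.read(),
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"))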
I think people don't care to imagine the computer doing as little as possible to get the job done, and instead use the near-unlimited computing power to just avoid thinking about consequences.
[0]: https://www.pentestpartners.com/security-blog/vulnerabilitie...
[1]: https://github.com/sullo/nikto/issues/469
To add to the language problem: when I travelled to Europe, some websites (like YouTube) switched to the regional language of wherever I was, despite me being logged in and Google knowing full well which languages I speak. Even the ads changed language, as if advertising in a language I don't speak will help anyone.
Almost all of my spam is in French, which is an assumption on the part of the spammers based on the email username. Almost all my Gmail is spam, because I have directed most real email elsewhere. Therefore, almost all the mail I receive at Gmail is in French. This has led to Google blocking things (like voter registration confirmation!) that are in English because they're "not in your normal language."
IIUC, accept-language is mostly ignored because the tooling to configure it on the user agent is really poor for most user agents. So users log into a site, they get the site in the wrong language, and because only the site is visible they blame the site, not their UA.
It's the "Your site's broken if IE won't load it" problem.
Can someone attest that this is actually the issue?
FWIW, Outlook does respect the "Accept-Language" header, and I don't think anyone is saying that Outlook is wrong for doing that or claiming it to be broken?
Are you totally sure that this isn't a backwards myth?
I think the most likely situation is that locale information for English-speaking countries would be incorrect if the default (en_US) was used to install the operating system, which happens on occasion.
Growing up in Belgium I feel your pain about GeoIPs and accept-language.
I lived in Flanders, with my accept-language set to en-US, en.
Ads would pop up in Dutch, Flemish, French and sometimes German. When you think about it, from a brick-and-mortar point of view, it makes sense. I'm more likely to buy <physical product advertised> at the <local chain grocery store> vs buying it anywhere in the USA, based on my IP.
Next to that, imagine you're browsing Reuters.com with a Berlin IP and accept-language set to en-US, en.
What SHOULD they show you? Local German news, auto-translated into English? Local news in German? Or redirect you to the US page?
Locality is different from language. In your example, it would have to show you the local German news, as that's local to you, and it would have to show it to you in the first supported language in your accept-language header.
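Picking "the first supported language in your accept-language header" is not much code, either. A rough sketch in Python, with a made-up list of supported languages (a real parser should follow the HTTP quality-value rules more carefully):

    # Sketch: pick the best supported language from an Accept-Language
    # header. SUPPORTED is a made-up example list.
    SUPPORTED = ["de", "en", "fr"]

    def negotiate(accept_language, default="en"):
        # Header looks like: "en-US,en;q=0.9,de;q=0.5"
        prefs = []
        for part in accept_language.split(","):
            lang, _, qpart = part.strip().partition(";")
            q = 1.0
            if qpart.strip().startswith("q="):
                try:
                    q = float(qpart.strip()[2:])
                except ValueError:
                    pass
            prefs.append((q, lang.strip().lower()))
        # Highest q first; match on the primary subtag ("en-US" -> "en").
        for _, lang in sorted(prefs, reverse=True):
            primary = lang.split("-")[0]
            if primary in SUPPORTED:
                return primary
        return default

    print(negotiate("en-US,en;q=0.9"))  # -> "en", even from a Berlin IP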
Personally I would prefer, for example, Reuters.com to be a "hub", and all the regional variants on de.reuters.com. Then just let the user choose what they want.
Even when ETags have nothing to do with the filesystem, they can still be a security vector. Some APIs use ETags to identify what has changed since the last time you called a particular API. This means the ETag values are probably stored in a database, which means the API server needs to protect against SQL injection in the request headers.
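The defence is the usual one: treat the header as untrusted input and keep it out of the SQL text. A sketch with sqlite3; the table and column names here are invented for illustration:

    import sqlite3

    def changed_since(conn, client_etag):
        # Parameterized query: the If-None-Match value never becomes
        # part of the SQL string, so it can't inject anything.
        row = conn.execute(
            "SELECT 1 FROM resources WHERE etag = ?", (client_etag,)
        ).fetchone()
        # Unknown ETag -> treat as changed and send the full response.
        return row is None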
> as if people are walking around using a browser in a language they don't speak (or an operating system configured for a language they don't speak)
Well, yes, they are! Computers translated into my native language sound dumb. That's how a whole generation where I come from learned better English than native speakers, ffs!
Half the time it's just translated wrong. You think anyone has any incentive to translate any technology into a language with a couple million speakers, all of whom are obligate pirates?
And it seems like you might be surprised to hear that people speak more than one language. Then where's my global setting to tell the browser what languages I speak, so it'd know what header to send? Same place that lets me configure what ads I'm actually interested in. Nowhere.
>I think people don't care to imagine the computer doing as little as possible to get the job done, and instead use the near-unlimited computing power to just avoid thinking about consequences.
This, friend, is what computers are for in the 21st century. "Bicycle for the mind", ha...
Seems like a reasonable case for disregarding the client preference. If you're able to speak TLS then you're able to load up a public domain (de)compression library.
I always appreciate Rachel's writings. I don't know much about her, but my takeaway is that she has worked some of the hardest sysadmin jobs of the past few decades and writes from that experience super well.
Especially as the cost to serve this content approaches zero.
I find the take in the blog to be relatively hostile. It's a "technically correct" rant. Not wrong, but mostly missing the point, and being a bit of a dick in the process.
Sure - block the readers that make a request every 10 seconds. It's perfectly reasonable to block clients if they hit a limit like 20 to 50 requests in a day.
It's damn hostile to block for 24 hours after a single request. If the 10MB of traffic for 20 requests is going to break the bank... maybe don't host an Atom or RSS feed at all?
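For scale: a daily cap like the 20-to-50 figure above is a few lines of state, nothing like a 24-hour ban after one request. A hedged sketch, in-memory only; a real server would persist this or lean on its cache layer:

    import time

    DAILY_LIMIT = 50
    _counts = {}  # ip -> (day_number, request_count)

    def allowed(ip):
        day = int(time.time() // 86400)
        last_day, count = _counts.get(ip, (day, 0))
        if last_day != day:
            count = 0  # new day, reset the counter
        if count >= DAILY_LIMIT:
            return False  # over the cap: refuse (or serve a 429)
        _counts[ip] = (day, count + 1)
        return True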
---
That said - weirdos can weird on their own sites as they like. It's not a public service.
But I bucket this into the same category of weird as posting a whole bunch of threatening "no trespassing", "beware of dog", "homeowner is armed", "Solicitors not welcome", etc style signs all over their property.
Like - point out on the doll where the RSS client hurt you. Because something's up.
Rachel makes an excellent point here about feed change frequency.
Seems like it'd be straightforward to implement, in most readers, a backoff strategy based on how frequently the feed content changes. For a regular, periodic fetch, if the content has proven it doesn't update frequently, just back off the polling period for that endpoint.
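Something like this, say (a sketch only; the bounds and growth factors are made up, and a real reader would persist the per-feed interval):

    MIN_INTERVAL = 15 * 60       # 15 minutes
    MAX_INTERVAL = 24 * 60 * 60  # 1 day

    def next_interval(current, feed_changed):
        if feed_changed:
            # Content moved: poll a bit more eagerly next time.
            return max(MIN_INTERVAL, current / 2)
        # Nothing new: stretch the period, up to once a day.
        return min(MAX_INTERVAL, current * 1.5)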
If-Modified-Since and ETag are nice and everyone should implement them, but IME the implementation status is much better on the reader side than on the feed side. Trim your (main) feed to only recent posts and use Atom's pagination to link to the rest for new subscribers, and the difference in data transferred becomes much smaller.
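The feed-side half isn't much code either. A sketch, assuming the feed's validators are available from wherever it's generated (feed_etag and feed_mtime are placeholders, with feed_mtime a timezone-aware datetime):

    from email.utils import parsedate_to_datetime

    def conditional_status(headers, feed_etag, feed_mtime):
        # If-None-Match wins when the client sent an ETag.
        if headers.get("If-None-Match") == feed_etag:
            return 304
        ims = headers.get("If-Modified-Since")
        if ims:
            try:
                if parsedate_to_datetime(ims) >= feed_mtime:
                    return 304
            except (TypeError, ValueError):
                pass  # unparseable date: fall through, send the body
        return 200  # send the full feed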
> Besides that, a well-behaved feed will have the same content as what you will get on the actual web site. The HTML might be slightly different to account for any number of failings in stupid feed readers in order to save the people using those programs from themselves, but the actual content should be the same. Given that, there's an important thing to take away from this: there is no reason to request every single $(&^$(&^@#* post that's mentioned in the feed.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
Unfortunately there are too many feeds that don't include the full content for this to work. And a reader won't know whether the feed has the full content before fetching the HTML page. This can also change from post to post, so it can't just be determined once when subscribing.
> Then there are the user-agents who lie about who they are or where they are coming from because they think it's going to get them special treatment somehow.
These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.
> Sending referrers which make no sense is just bad manners.
HTTP Referer should not exist. It has been abused by spammers for ages.
> These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.
That's a niche. It's about 1 million percent more likely a fake request is coming from an overzealous AI scraper nowadays. I have blocked hundreds of them and I'm on the verge of giving up and handing over money to Cloudflare just for their AI scraping protection.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!
People probably do this because some sites only give you a preview in the feed, to force you to go to the site and view the ads.
So if you want the full post in the feed reader, you need to pull the post as well.
This. My feed reader pulls a "reader" view so I don't have to leave the app. I normally wouldn't mind going to the website, except that doing so would mean waiting for it to fully load, dealing with JavaScript popups, and often bad scrolljacking.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Ac...
In Chrome: chrome://settings/languages
In Firefox: https://support.mozilla.org/en-US/kb/choose-display-language...
Chrome insists that the first language be the UI language, and Safari insists that the first language be the _system_ language.
Look again. Or switch browser. It is a basic feature and the issue is indeed websites ignoring it.
https://rachelbythebay.com/w/atom.xml
This person isn't thinking as a user.