Hello everyone, it's the author here. I initially created 13ft as a proof of concept, simply to test whether the idea would work. I never anticipated it would gain this much traction or become as popular as it has. I'm thrilled that so many of you have found it useful, and I'm truly grateful for all the support.
Regarding the limitations of this approach, I'm fully aware that it isn't perfect, and it was never intended to be. It was just a quick experiment to see if the concept was feasible—and it seems that, at least sometimes, it is. Thank you all for the continued support.
Thanks for sharing the project with the internet!
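For anyone curious what the experiment boils down to: the trick is simply to request the page with Googlebot's user agent string instead of the browser's. A minimal sketch of that idea in Python (hypothetical, not the project's actual code; the UA string is the one from the repo, quoted later in the thread):

    import requests  # third-party: pip install requests

    # User agent string from the repo (see the about:config comment below)
    GOOGLEBOT_UA = (
        "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 "
        "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    )

    def fetch_as_googlebot(url: str) -> str:
        """Fetch a page while claiming to be Googlebot via the User-Agent header."""
        resp = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA})
        resp.raise_for_status()
        return resp.text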
Running a server just to set the User-Agent header to Googlebot's for some requests feels a bit heavyweight.
But perhaps it’s necessary, as it seems Firefox no longer has an about:config option to override the user agent…am I missing it somewhere?
Edit: The about:config option general.useragent.override can be created and will be used for all requests (I just tested). I was confused because that config key doesn’t exist in a fresh install of Firefox. The user agent header string from this repo is: "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
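For a persistent version of that override, the same pref can also be set in a user.js file in the Firefox profile directory (a standard Firefox mechanism; the value is the string from the repo):

    user_pref("general.useragent.override", "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");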
Also, how effective is this really? Don't the big news sites check the IP address of the user agents that claim to be GoogleBot?
If this is all it's doing then you could also just use this extension: https://requestly.com/
Create a rule to replace user agent with "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I just tried it and it seems to work.
It used to be a good extension. Now it is crapware tied to web services. I don't want any web services, I don't want to see ads about paid features; I want a free extension that works entirely locally and doesn't phone home.
It says it blocks ads and other things, too. I imagine the use case is someone wanting this for multiple devices/people so they don't have to set up an extension for every platform/device individually. I have no idea how effective it is
This piece of crap is, unfortunately, unfit.
https://addons.mozilla.org/en-US/firefox/addon/random_user_a...
You can make a search-related function in FF by right-clicking on that box and choosing 'Add a Keyword for this Search': https://i.imgur.com/AkMxqIj.png
and then in your browser just type the letter you assign it to: for example, I have 'i' == the search box for IMDB, so I type 'i [movie]' in my URL bar and it brings up the IMDB search for that movie. https://i.imgur.com/dXdwsbA.png
So you can just assign 'a' to that search box and type 'a [URL]' in your address bar and it will submit it to your little thing.
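The same works with a plain keyword bookmark, no search box required: Firefox substitutes %s in the bookmark's location with whatever you type after the keyword. A sketch, assuming a local 13ft instance on port 5000 that accepts a url query parameter (both the port and the parameter name are assumptions here; check your instance):

    Bookmark location: http://localhost:5000/?url=%s
    Keyword:           a

Then typing 'a https://example.com/article' in the address bar submits that URL to the server.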
It seems to me that google should not allow a site to serve different content to their bot than they serve to their users. If the content is unavailable to me, it should not be in the search results.
It obviously doesn't seem that way to Google, or to the sites providing the content.
They are doing what works for them without ethical constraints (Google definitely, and many content providers, e.g. NYT). Is it fair game to do what works for you (e.g. 13ft)?!
> It seems to me that google should not allow a site to serve different content to their bot than they serve to their users.
That would be the fair thing to do and was Google's policy for many years, and still is for all I know. But modern Google stopped caring about fairness and similar concerns many years ago.
The policy was that if a user lands on the page from the Google search results page, then they should be shown the full content, same as Googlebot (“First Click Free”)[0]. But that policy was abandoned in 2017:
https://www.theguardian.com/technology/2017/oct/02/google-to...
[0] https://en.wikipedia.org/wiki/Cloaking
I disagree, I think the current approach actually makes for a better and more open web long term. The fact is that either you pay for content, or the content has to pay for itself (which means it's either sponsored content or full of ads). Real journalism costs money, there's no way around that. So we're left with a few options:
Option a) NYT and other news sites make their news open to everyone without a paywall. To finance themselves they will become full of ads and disgusting.
Option b) NYT and other news sites become fully walled gardens, letting no one in (including Google's bots). They won't be indexed by Google and search sites, and we won't be able to find their content freely. It's like a Discord server or Facebook groups: there's a treasure trove of information out there, but you won't be able to find it when you need it.
Option c) NYT and other news sites let Google and search sites index their content, but ask the user to pay to access it.
> Real journalism costs money, there's no way around that
I agree, but journals should allow paying X amount of money to read a single article, where X is much, much lower than the usual amount Y they charge for a subscription. Example: X could be 0.10 USD, while Y is usually around 5–20 USD.
And in this day and age there are ways to make this kind of micropayment work, for example Lightning. An example of a website built around this idea: http://stacker.news
I don't think this will work reliably as other commenters pointed out. A better solution could be to pass the URL through an archiver, such as archive.today:
https://archive.is/20240719082825/https://www.nytimes.com/20...
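A one-liner sketch of that rewrite in Python, assuming archive.today's newest/ lookup path (which redirects to the most recent snapshot of a URL; the example article URL is hypothetical):

    from urllib.parse import quote

    def to_archive_today(url: str) -> str:
        """Rewrite a paywalled URL to archive.today's newest-snapshot lookup."""
        return "https://archive.today/newest/" + quote(url, safe=":/?&=")

    # hypothetical article URL, for illustration only
    print(to_archive_today("https://www.nytimes.com/2024/07/19/example.html"))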
Nice effort, but after one successful NYT session, it fails and treats the access as though it were an end user. But don't take my word for it: try it. One access succeeds. Two or more... fail.
The reason is the staff at the NYT appear to be very well versed in the technical tricks people use to gain access.
They probably asynchronously verify that the IP address actually belongs to googlebot, then ban the IP when it fails.
Synchronously verifying it would probably be too slow.
You can verify googlebot authenticity by doing a reverse dns lookup, then checking that reverse dns name resolves correctly to the expected IP address[0].
[0]: https://developers.google.com/search/docs/crawling-indexing/...
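A minimal sketch of that reverse-then-forward check in Python, using only the standard library (the accepted domain suffixes follow Google's documentation, but treat the details as assumptions to verify):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        """Reverse-resolve the IP, check the domain, then forward-confirm it."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS: IP -> hostname
        except socket.herror:
            return False
        # Genuine Googlebot hosts resolve under googlebot.com or google.com
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addrs = socket.gethostbyname_ex(host)  # forward DNS: hostname -> IPs
        except socket.gaierror:
            return False
        return ip in addrs  # the name must resolve back to the original IP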
Why would it be slow? There is a JSON document that lists all IP ranges on the same page you linked to:
https://developers.google.com/static/search/apis/ipranges/go...
Which leads to the possibility of triggering a self-inflicted DoS. I am behind a CGNAT right now. You reckon that if I set myself to Googlebot and loaded NYT, they'd ban the entire o2 mobile network in Germany? (or possibly shared infrastructure with Deutsche Telekom - not sure)
Not to mention the possibility of just filling up the banned IP table.
There are easily installable databases of IP block info, super easy to do it synchronously, especially if it’s stored in memory. I run a small group of servers that each have to do it thousands of times per second.
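For instance, Google's published googlebot.json (linked above; URL reconstructed from Google's docs, so verify it) can be loaded once and kept in memory. A sketch, assuming the file keeps its documented {"prefixes": [{"ipv4Prefix": ...}]} shape:

    import ipaddress
    import json
    import urllib.request

    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_googlebot_networks():
        """Download Google's published Googlebot IP ranges once at startup."""
        with urllib.request.urlopen(RANGES_URL) as resp:
            data = json.load(resp)
        return [ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
                for p in data["prefixes"]]

    NETWORKS = load_googlebot_networks()  # held in memory; refresh periodically

    def is_googlebot_ip(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        # a v4 address is simply not contained in a v6 network, and vice versa
        return any(addr in net for net in NETWORKS)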
We need a P2P swarm for content. Just like Bittorrent back in the day. Pop in your favorite news article (or paraphrase it yourself), and everyone gets it.
With recommender systems, attention graph modeling, etc. it'd probably be a perfect information ingestion and curation engine. And nobody else could modify the algorithm on my behalf.
I don't know about paraphrased versions, but it would need to handle content revisions by the publisher somehow.
"The reason is the staff at the NYT appear to be very well versed in the technical tricks people use to gain access."
It appears anyone can read any new NYT article in the Internet Archive. I use a text-only browser. I am not using JavaScript. I do not send a User-Agent header. Don't take my word for it. Here is an example:
If I am not mistaken, the NYT recently had their entire private GitHub repository, a very large one, made public without their consent. This despite the staff at the NYT being "well-versed" in whatever it is the HN commenter thinks they are well versed in.
https://web.archive.org/web/20240603124838if_/https://www.ny...
Which browser is this?
I had to go learn more about this; it sounds like a pretty bad breach. But I'm still pretty impressed by NYT's technical staff for the most part for the things they do accomplish, like the interactive web design of some very complicated data visualizations.
I would hope that they’re in IA on purpose. It would be exceptionally lame if NYT didn’t let their articles be archived by the Internet’s best and most definitive archive. It would be scary to me if they had only been allowing IA archiving because they were too stupid to know about it.
FWIW, if you happen to be based in the U.S., you might find that your local public library provides 3-day NYT subscriptions free of charge, which, whilst annoying, is probably easier than fighting the paywall. Of course this only applies to the NYT.
In the Netherlands the library provides free access to thousands of newspapers for 5 days after visiting, including The Economist and WSJ, which actually have paywalls that aren't trivial to bypass.
https://www.pressreader.com/