Hello everyone, it's the author here. I initially created 13ft as a proof of concept, simply to test whether the idea would work. I never anticipated it would gain this much traction or become as popular as it has. I'm thrilled that so many of you have found it useful, and I'm truly grateful for all the support.
Regarding the limitations of this approach, I'm fully aware that it isn't perfect, and it was never intended to be. It was just a quick experiment to see if the concept was feasible—and it seems that, at least sometimes, it is. Thank you all for the continued support.
Thanks for sharing the project with the internet!
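For anyone curious what the experiment boils down to: the trick is simply to request the page with Googlebot's user agent string instead of the browser's. A minimal sketch of that idea in Python (hypothetical, not the project's actual code; the UA string is the one from the repo, quoted later in the thread):

    import requests  # third-party: pip install requests

    # User agent string from the repo (see the about:config comment below)
    GOOGLEBOT_UA = (
        "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 "
        "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    )

    def fetch_as_googlebot(url: str) -> str:
        """Fetch a page while claiming to be Googlebot via the User-Agent header."""
        resp = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA})
        resp.raise_for_status()
        return resp.text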
Running a server just to set the User-Agent header to Googlebot's for some requests feels a bit heavyweight.
But perhaps it’s necessary, as it seems Firefox no longer has an about:config option to override the user agent…am I missing it somewhere?
Edit: The about:config option general.useragent.override can be created and will be used for all requests (I just tested). I was confused because that config key doesn’t exist in a fresh install of Firefox. The user agent header string from this repo is: "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
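For a persistent version of that override, the same pref can also be set in a user.js file in the Firefox profile directory (a standard Firefox mechanism; the value is the string from the repo):

    user_pref("general.useragent.override", "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");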
Also, how effective is this really? Don't the big news sites check the IP address of the user agents that claim to be GoogleBot?
If this is all it's doing then you could also just use this extension: https://requestly.com/
Create a rule to replace user agent with "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I just tried it and it seems to work.
It used to be a good extension. Now it is crapware tied to web services. I don't want any web services, I don't want to see ads about paid features; I want a free extension that works entirely locally and doesn't phone home.
It says it blocks ads and other things, too. I imagine the use case is someone wanting this for multiple devices/people so they don't have to set up an extension for every platform/device individually. I have no idea how effective it is
This piece of crap is, unfortunately, unfit.
https://addons.mozilla.org/en-US/firefox/addon/random_user_a...
You can make a search-related function in FF by right-clicking on that box and choosing 'Add a Keyword for this Search': https://i.imgur.com/AkMxqIj.png
and then in your browser just type the letter you assign it to: for example, I have 'i' == the search box for IMDB, so I type 'i [movie]' in my URL bar and it brings up the IMDB search for that movie. https://i.imgur.com/dXdwsbA.png
So you can just assign 'a' to that search box and type 'a [URL]' in your address bar and it will submit it to your little thing.
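The same works with a plain keyword bookmark, no search box required: Firefox substitutes %s in the bookmark's location with whatever you type after the keyword. A sketch, assuming a local 13ft instance on port 5000 that accepts a url query parameter (both the port and the parameter name are assumptions here; check your instance):

    Bookmark location: http://localhost:5000/?url=%s
    Keyword:           a

Then typing 'a https://example.com/article' in the address bar submits that URL to the server.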
It seems to me that google should not allow a site to serve different content to their bot than they serve to their users. If the content is unavailable to me, it should not be in the search results.
It obviously doesn't seem that way to Google, or to the sites providing the content.
They are doing what works for them without ethical constraints (Google definitely, and many content providers, e.g. NYT). Is it fair game to do what works for you (e.g. 13ft)?!
> It seems to me that google should not allow a site to serve different content to their bot than they serve to their users.
That would be the fair thing to do and was Google's policy for many years, and still is for all I know. But modern Google stopped caring about fairness and similar concerns many years ago.
The policy was that if a user lands on the page from the Google search results page, then they should be shown the full content, same as Googlebot (“First Click Free”)[0]. But that policy was abandoned in 2017:
https://www.theguardian.com/technology/2017/oct/02/google-to...
[0] https://en.wikipedia.org/wiki/Cloaking
I disagree, I think the current approach actually makes for a better and more open web long term. The fact is that either you pay for content, or the content has to pay for itself (which means it's either sponsored content or full of ads). Real journalism costs money, there's no way around that. So we're left with a few options:
Option a) NYT and other news sites make their news open to everyone without a paywall. To finance themselves they will become full of ads and disgusting.
Option b) NYT and other news sites become fully walled gardens, letting no one in (including Google's bots). They won't be indexed by Google and search sites, and we won't be able to find their content freely. It's like a Discord server or Facebook groups: there's a treasure trove of information out there, but you won't be able to find it when you need it.
Option c) NYT and other news sites let Google and search sites index their content, but ask the user to pay to access it.
> Real journalism costs money, there's no way around that
I agree, but journals should allow paying X amount of money to read a single article, where X is much, much lower than the usual amount Y they charge for a subscription. Example: X could be 0.10 USD, while Y is usually around 5–20 USD.
And in this day and age there are ways to make this kind of micropayment work, for example Lightning. An example of a website built around this idea: http://stacker.news
I don't think this will work reliably as other commenters pointed out. A better solution could be to pass the URL through an archiver, such as archive.today:
https://archive.is/20240719082825/https://www.nytimes.com/20...
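A one-liner sketch of that rewrite in Python, assuming archive.today's newest/ lookup path (which redirects to the most recent snapshot of a URL; the example article URL is hypothetical):

    from urllib.parse import quote

    def to_archive_today(url: str) -> str:
        """Rewrite a paywalled URL to archive.today's newest-snapshot lookup."""
        return "https://archive.today/newest/" + quote(url, safe=":/?&=")

    # hypothetical article URL, for illustration only
    print(to_archive_today("https://www.nytimes.com/2024/07/19/example.html"))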
Nice effort, but after one successful NYT session, it fails and treats the access as though it were an end user. But don't take my word for it: try it. One access succeeds. Two or more... fail.
The reason is the staff at the NYT appear to be very well versed in the technical tricks people use to gain access.
They probably asynchronously verify that the IP address actually belongs to googlebot, then ban the IP when it fails.
Synchronously verifying it would probably be too slow.
You can verify googlebot authenticity by doing a reverse dns lookup, then checking that reverse dns name resolves correctly to the expected IP address[0].
[0]: https://developers.google.com/search/docs/crawling-indexing/...
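A minimal sketch of that reverse-then-forward check in Python, using only the standard library (the accepted domain suffixes follow Google's documentation, but treat the details as assumptions to verify):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        """Reverse-resolve the IP, check the domain, then forward-confirm it."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS: IP -> hostname
        except socket.herror:
            return False
        # Genuine Googlebot hosts resolve under googlebot.com or google.com
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addrs = socket.gethostbyname_ex(host)  # forward DNS: hostname -> IPs
        except socket.gaierror:
            return False
        return ip in addrs  # the name must resolve back to the original IP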
Why would it be slow? There is a JSON document that lists all IP ranges on the same page you linked to:
https://developers.google.com/static/search/apis/ipranges/go...
Which leads to the possibility of triggering a self-inflicted DoS. I am behind a CGNAT right now. You reckon that if I set myself to Googlebot and loaded NYT, they'd ban the entire o2 mobile network in Germany? (or possibly shared infrastructure with Deutsche Telekom - not sure)
Not to mention the possibility of just filling up the banned IP table.
There are easily installable databases of IP block info, super easy to do it synchronously, especially if it’s stored in memory. I run a small group of servers that each have to do it thousands of times per second.
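For instance, Google's published googlebot.json (linked above; URL reconstructed from Google's docs, so verify it) can be loaded once and kept in memory. A sketch, assuming the file keeps its documented {"prefixes": [{"ipv4Prefix": ...}]} shape:

    import ipaddress
    import json
    import urllib.request

    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_googlebot_networks():
        """Download Google's published Googlebot IP ranges once at startup."""
        with urllib.request.urlopen(RANGES_URL) as resp:
            data = json.load(resp)
        return [ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
                for p in data["prefixes"]]

    NETWORKS = load_googlebot_networks()  # held in memory; refresh periodically

    def is_googlebot_ip(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        # a v4 address is simply not contained in a v6 network, and vice versa
        return any(addr in net for net in NETWORKS)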
We need a P2P swarm for content. Just like Bittorrent back in the day. Pop in your favorite news article (or paraphrase it yourself), and everyone gets it.
With recommender systems, attention graph modeling, etc. it'd probably be a perfect information ingestion and curation engine. And nobody else could modify the algorithm on my behalf.
I don't know about paraphrased versions, but it would need to handle content revisions by the publisher somehow.
"The reason is the staff at the NYT appear to be very well versed in the technical tricks people use to gain access."
It appears anyone can read any new NYT article in the Internet Archive. I use a text-only browser. I am not using JavaScript. I do not send a User-Agent header. Don't take my word for it. Here is an example:
If I am not mistaken, the NYT recently had their entire private GitHub repository, a very large one, made public without their consent. This despite the staff at the NYT being "well-versed" in whatever it is the HN commenter thinks they are well versed in.
https://web.archive.org/web/20240603124838if_/https://www.ny...
Which browser is this?
I had to go learn more about this; it sounds like a pretty bad breach. But I'm still pretty impressed by NYT's technical staff for the most part for the things they do accomplish, like the interactive web design of some very complicated data visualizations.
I would hope that they’re in IA on purpose. It would be exceptionally lame if NYT didn’t let their articles be archived by the Internet’s best and most definitive archive. It would be scary to me if they had only been allowing IA archiving because they were too stupid to know about it.
FWIW, if you happen to be based in the U.S., you might find that your local public library provides 3-day NYT subscriptions free of charge, which, whilst annoying, is probably easier than fighting the paywall. Of course this only applies to the NYT.
In the Netherlands the library provides free access to thousands of newspapers for 5 days after visiting, including The Economist and WSJ, which actually have paywalls that aren't trivial to bypass.
https://www.pressreader.com/