popcalc · a year ago

  # Welcome to Reddit's robots.txt
  # Reddit believes in an open internet, but not the misuse of public content.
  # See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
  # See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
  # policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy

  User-agent: *
  Disallow: /
Source: https://www.reddit.com/robots.txt
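A quick way to sanity-check what those two lines mean in practice, using Python's stdlib parser (the bot name is made up; `Disallow: /` under `User-agent: *` blocks every path for every agent):

```python
from urllib import robotparser

# Parse the quoted rules directly; no network fetch needed.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Any user agent, any path: denied.
print(rp.can_fetch("ExampleBot", "https://www.reddit.com/"))           # False
print(rp.can_fetch("ExampleBot", "https://www.reddit.com/r/python/"))  # False
```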

sunaookami · a year ago
They serve a different robots.txt to Google: https://merj.com/blog/investigating-reddits-robots-txt-cloak...

You can see it here: https://search.google.com/test/rich-results/result?id=_mYogl... (click on "View Tested Page")

dogleash · a year ago
> # Reddit believes in an open internet, but not the misuse of public content.

Calling it "public" content in the very act of exercising their ownership over it. The balls on whoever wrote that.

shit_game · a year ago
Their license/EULA clearly states that Reddit has a perpetual whatever-license to content posted on Reddit, but relying solely on the DMCA for "stolen" content _yet again_ feels like a terrible way to deal with non-original content. Part of me hopes that Reddit gets hit with some new precedent-setting lawsuits regarding non-original content that require useful attribution, but I doubt that will ever happen.
pas · a year ago
it's even worse. it's not theirs (it's the users'), they are merely hosting it and using it (ToS gives them a fancy irrevocable license I guess).

so they can do whatever they want with it, and the actual owners/authors have no real chance to influence Reddit at all, e.g. to keep it crawlable. (the GDPR-like data takeout is nice, but ... completely useless in these cases, where the value is in the composition and aggregation with other users' content.)


Khelavaster · a year ago
The Fake News police should shut down this sort of messaging
immibis · a year ago
Nobody who wants to be successful obeys robots.txt. And I do mean nobody.
chippiewill · a year ago
They changed it to disallow so that scrapers can't just claim the robots.txt gave them permission.
latexr · a year ago
That’s a weird statement to be absolutist about. The majority of individuals and companies who want to be successful do not do so by scraping websites, and thus have no reason to disobey robots.txt. Most people in the world, ambitious or not, wouldn’t even understand what your sentence refers to.
maxnevermind · a year ago
Hasn't the NYT tried to sue OpenAI for ignoring robots.txt? Or do you mean it's impossible to prove and/or still more profitable to just ignore robots.txt?
JohnFen · a year ago
Sadly true. That's why I gave up on robots.txt years ago and started blocking crawlers outright in .htaccess

Of course, that became unsustainable so now I have everything behind a login wall.
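For reference, the .htaccess approach being described usually looks something like this sketch (the bot names are illustrative, not an actual blocklist; keeping such a list current is exactly the unsustainable part):

```apache
# Return 403 Forbidden to a blocklist of crawler user-agents;
# anything not matched falls through untouched.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider) [NC]
RewriteRule ^ - [F,L]
```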


Zuiii · a year ago
> We believe in something that we will now proceed to violate.

I will never take a statement given by a company that blatantly lies like this at face value going forward. What a bunch of clowns.

raverbashing · a year ago
With the amount of crap on Reddit, cleaning it must be a very non-trivial problem. (I mean, it never is, but in the case of Reddit it's probably extra complicated.)
arnaudsm · a year ago
I understand the AI context, but this is dangerously anticompetitive for other search engines.

This is a dangerous precedent for the internet. Business conglomerates have been controlling most of the web, but refusing basic interoperability is even worse.

zooq_ai · a year ago
There is nothing preventing search companies paying the same $60 Million to license content.

If Reddit had an exclusive agreement, it would be anti-competitive.

This is classic HN anti-Google tirade (and downvoting facts, logic and concepts of free market)

not_wyoming · a year ago
> There is nothing preventing search companies paying the same $60 Million to license content.

Yes, actually, there is - having $60m to throw around.

"Barriers to entry often cause or aid the existence of monopolies and oligopolies" [0]. Monopolies and oligopolies are definitionally the opposite of free market forces. This is quite literally Econ 101.

[0] - https://en.wikipedia.org/wiki/Barriers_to_entry

not_wyoming · a year ago
Juiciest update I’ve ever gotten to share: https://www.nytimes.com/2024/08/05/technology/google-antitru...
pluc · a year ago
Paying 60 million to every site you want to index is also a bad precedent to set. Why can Reddit get paid and XYZ can't?
onlyrealcuzzo · a year ago
This is an interesting development.

How many other sites might have leverage to charge to be indexed?

I don't want to live in a world where you have to use X search engine to get answers from Y site - but this seems like the beginning of that world.

From an efficiency perspective, it's obviously better for websites to just lease their data to search engines than for both sides to pay tons of bandwidth and compute to get that data onto search engines.

Realistically, there are only 2 search engines now.

This seems very bad for Kagi - but possibly could lead to the old, cool, hobbyist & un-monetized web being reinvented?

ColinHayhurst · a year ago
Kagi uses at least Google and Mojeek

edit:

> Realistically, there are only 2 search engines now.

https://seirdy.one/posts/2021/03/10/search-engines-with-own-...

WarOnPrivacy · a year ago
> Realistically, there are only 2 search engines now.

From the article:

> Many alternatives to GBY [Google, Bing, and Yandex] exist, but almost none of them have their own results;

This seems to assert that ~0 other search providers do any crawling at all. Ever. Are we sure that's the case?

(They could crawl but never return those results, which would be even odder.)

Yawrehto · a year ago
Doesn't it list three major ones, Google, Bing, and Yandex, plus Mojeek and a few other small ones? That's a bit more than two.
McDyver · a year ago
That seems like the business model for streaming: you subscribe to X provider to watch Y series. So, as with streaming, I suppose a Pirate Bay-style search engine will come up.
toomuchtodo · a year ago
Pirate Bay is probably not the best analogy; it's more like Anna's Archive imho [1]: scrape runs of individual web properties compressed into packages, maybe served via torrents like this Academic Torrents example [2].

A rough pipeline sketch: scraper engine -> validation/processing/cleanup -> object storage -> index + torrent serving.

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... ("HN Search: annas archive")

[2] https://academictorrents.com/details/9c263fc85366c1ef8f5bb9d... ("AcademicTorrents: Reddit comments/submissions 2005-06 to 2023-12 [2.52TB]")

gtirloni · a year ago
> but this seems like the beginning of that world.

It's not the beginning, it's mere continuation.

Walled gardens have existed since the AOL days. They deteriorate over time but it doesn't prevent companies from trying (each time, in bigger attempts).

aAaaArrRgH · a year ago
> but possibly could lead to the old, cool, hobbyist & un-monetized web being reinvented?

It still exists. It just isn't that popular.

splwjs · a year ago
idk man i bet you five bucks and a handshake it's just going to play out like the existing startup grift.

There's an established player with institutional protections; then a scrappy upstart takes a bunch of VC money, converts it into runway, gives away the product for free, gradually replaces the incumbent and becomes the standard, then puts out an S-1 document saying "we don't make money and we never have, want to invest?", and then they start to enjoy all the institutional protections. Or they don't. Either way you pay yourself handsomely from the runway money, so who cares.

The upstart gets indexed and has an API, the established player doesn't.

The upstart is more easily found and modular but the institutional player can refuse to be indexed to own their data and they can block their API to prevent ai slop from getting in and dominating their content.

StrauXX · a year ago
IANAL, but as far as I understand the current legal status (in the US), a change in robots.txt or terms and conditions is not binding for web scrapers, since the data is publicly accessible. Neither does displaying a banner "By using this site you accept our terms and conditions" change anything about that. The only thing that can make these kinds of terms binding is if the data is only accessible after proactively accepting the terms, for instance by restricting the website until one has created an account. LinkedIn lost a case against a startup scraping and indexing their data because of that a few years ago.
qingcharles · a year ago
At the federal level, yes; but states have their own laws. For instance, violating a web site's ToS can get you 5 years in prison in Illinois.

https://www.ilga.gov/legislation/ilcs/ilcs4.asp?DocName=0720...

redcobra762 · a year ago
Has anyone ever successfully been prosecuted for violating this statute?
jpalomaki · a year ago
Quite sure they are also enforcing these with some technical measures to limit scraping.
renlo · a year ago
As was LinkedIn, which was forced to stop rate limiting / IP-banning scrapers for public pages.
wtf242 · a year ago
This problem is only going to get worse. For my thegreatestbooks.org site I used to just get indexed/scraped by Google and Bing. Now it's like 50+ AI bots scraping my entire site just so they can train an LLM to answer the questions my site answers, without a user ever visiting my site. I just checked Cloudflare, and in the past 24 hours I've had 1.2 million bot/automated requests.
sct202 · a year ago
There's a new setting in Cloudflare to block AI/scraper bots. https://blog.cloudflare.com/declaring-your-aindependence-blo...
graeme · a year ago
Anyone have any experience with this? Is there nothing but upside in blocking these bots?
jedberg · a year ago
They changed robots.txt a month or so ago. For the first 19 years of its life, reddit had a very permissive robots.txt. We allowed all by default and then only restricted certain poorly behaved agents (and Bender's Shiny Metal Ass(tm)).

But I can understand why they made the change they did. The data was being abused.

My guess is that this was an oversight -- that they will do an audit and reopen it for search engines after those engines agree not to use the data for training, because let's face it, reddit is a for profit business and they have to protect their income streams.
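The old permissive style described above looks roughly like this ("BadBot" is a stand-in for any poorly behaved agent, not an actual entry from Reddit's file); an empty `Disallow:` means nothing is restricted:

```
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
```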

Closi · a year ago
> But I can understand why they made the change they did. The data was being abused.

Depends how you see it - if you see it as 'their' data (legally true) or if you see it as user content (how their users would likely see it).

If you see it as 'user content', they are actually selling the data to be abused by one company, rather than stopping it being abused at all.

From a commercial 'let's sell user data and make a profit' perspective I get it, although it does seem short-sighted to effectively de-list yourself from alternative search engines (I guess they just got enough cash to make it worth their while).

Ajedi32 · a year ago
> if you see it as 'their' data (legally true)

Is that actually true? Reddit may indeed have a license to use that data (derived from their ToS), but I very much doubt they actually own the copyright to it. If I write a comment on Reddit, then copy-paste it somewhere else, can Reddit sue me for copyright infringement?

passwordoops · a year ago
Enough cash or enough data on hand to show the majority of traffic comes from the search monopoly
ColinHayhurst · a year ago
Person extensively quoted in the article here. They are welcome to reach out. But not a single person from any level did that, nor replied to my polite requests to explain and engage. We first contacted them in early June and by 13th June, I had escalated to Steve Huffman @spez.
toomuchtodo · a year ago
An acquaintance investigating Reddit's moderation machinery inquired how a major subreddit was moderated after an Associated Press post was auto-removed by automod. They were banned from that sub. When they asked why, they noted up front, for transparency, that any responses would be shared with a journalism org. They were muted by the mods for 28 days and "told off" in a very poor manner (per the screenshots I've seen) by the anonymous mod who replied. They were then banned from Reddit for 3 days for "harassment" after an appeal; when they requested more detail about what was considered harassment, they were ignored. Ergo, asking how the mods of a major sub are automodding non-biased journalism sources (the AP, in this case) without any transparency appears to be considered harassment by Reddit. The interaction was submitted to the FTC through their complaint system to contribute towards their existing antitrust investigation of Reddit.

Shared because it is unlikely Reddit responds except when required by law, so I recommend engaging regulators (FTC, and DOJ at the bare minimum) and legislators (primarily those focused on Section 230 reforms) whenever possible with regards to this entity. They're the only folks worth escalating to, as Reddit's incentives are to gate content, keep ad buyers happy, and keep the user base in check while they struggle to break even, sharing as little information publicly as possible along the way [1] [2].

[1] https://www.bloomberg.com/news/articles/2024-05-09/reddit-la... | https://archive.today/wQuKM

[2] https://www.sec.gov/edgar/browse/?CIK=1713445

JohnMakin · a year ago
One company's (in this case, two companies') profit incentive should never take priority over the usability and well-being of the internet as a whole; that is exactly why we are where we are now. This is an absolutely terrible precedent.
BeetleB · a year ago
I know people will hate to hear this, but Reddit is not important to the well-being of the Internet.
jedberg · a year ago
I agree with you in theory, but in practice someone has to pay for all this magic.
ColinHayhurst · a year ago
The blocks for MojeekBot, a Cloudflare-verified and respectful bot for 20 years, started before the robots.txt changes. We first noticed in early June.

We thought it was an oversight too at first. It usually is. Large publishers have blocked us when they have not considered the details, but then reinstated us when we got in touch and explained.

ekidd · a year ago
I personally feel that this kind of "exclusive search only by Google deal" should result in an anti-trust case against Google. This is the kind of abuse of monopoly power that caused anti-trust laws to be passed in the 1890s.
eddd-ddde · a year ago
If I create a vacuum cleaner and decide to only sell it at Walmart, you can't get mad at me for not wanting to sell it at Costco.

You can always buy a competitor's or make your own vacuum cleaner if you hate buying at Walmart.

Maybe what you are really mad about is Reddit monopolising content.

fredgrott · a year ago
The article quotes Reddit's policy change: Reddit considers search and ads commercial activities, and thus subject to robots.txt block and exclusion.
EasyMark · a year ago
How was it being abused? You still clicked on the information and saw the Reddit ads. Now they won't get any of that from "rival" sites to Google. I guess they figured the $60 million was more than that ad revenue. Seems greedy, but I don't think it's illegal like others are suggesting.
account42 · a year ago
Ah so when reddit uses user content for monetization it's ok but when others do it then it isn't? Reddit may want that double standard but I think the only thing they are going to achieve with this stunt is more people ignoring robots.txt.
ykonstant · a year ago
It's ironic, because Reddit is the only search engine that works on Google now, thanks to the shittening.
maxwell · a year ago
They're both running on fumes at this point.
riiii · a year ago
Also sniffing them.
QVVRP4nYz · a year ago
For years Reddit's built-in search was broken (or at least felt broken), and people were forced to use third parties like Google, so we came full circle.