billyhoffman · a year ago
Common Crawl is shown in their screenshot of "Providers" alongside OpenAI and Anthropic. The challenge is that Common Crawl is used for a lot of things that are not AI training. For example, it's a major source of content for the Wayback Machine.

In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their own (poorly designed) crawlers and hitting everyone's site, Common Crawl runs once and exposes the data in industry-standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-delay, uses sitemap.xml to know when to revisit, follows robots.txt, etc.).
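For illustration, those well-behaved-crawler mechanics boil down to a few lines of logic. A minimal Python sketch - the bot name and URL are placeholders, not Common Crawl's actual code:

```python
# Minimal sketch of a "well-behaved" fetcher: obey robots.txt, honor
# Crawl-delay, and back off exponentially when the server objects.
# "ExampleBot" and example.com are placeholders.
import time
import urllib.error
import urllib.request
import urllib.robotparser

AGENT = "ExampleBot"
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

def polite_fetch(url, max_retries=5):
    if not rp.can_fetch(AGENT, url):
        return None                        # robots.txt disallows this path
    delay = rp.crawl_delay(AGENT) or 1.0   # honor Crawl-delay, default 1s
    for attempt in range(max_retries):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": AGENT})
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
            time.sleep(delay)              # pace requests between fetches
            return body
        except urllib.error.HTTPError as e:
            if e.code in (429, 503):       # overloaded: exponential backoff
                time.sleep(delay * 2 ** attempt)
            else:
                raise
    return None
```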

There are significant knock-on effects if Cloudflare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access based on who pays and who doesn't, whether they are bots or people.

Aachen · a year ago
> gatekeep access based on who pays and who doesn't, whether they are bots or people.

I'm already constantly being classified as a bot. Just today:

To check whether something is included in a subscription that we already pay for, I opened a product page on the Microsoft website this morning. Full-page error: "We are currently experiencing high demand. Please try again later." It's static content, but it's not available to me. Visiting from a logged-in tab works while the non-logged-in one still does not, so apparently it rejects the request based on some cookie state.

Just now I was trying to book a hotel room for a conference in Grenoble. Looking in the browser dev tools, it seems that Visa is trying to run some bot detection (the payment provider redirects to their site for the verification code, but Visa automatically redirects me back with an error status) and refuses the payment. There are no other payment methods. Using Google Chrome works, but Firefox with uBlock Origin (a very niche setup, I'll admit) locks you out of this part of the internet.

Visiting various US sites results in a Cloudflare captcha to "prove I'm human". For the time being it's less of a waste of time to go back and click a different search result, but this never used to happen and now it's a daily occurrence...

theyeenzbeanz · a year ago
Lately I've been noticing captchas getting increasingly difficult by the day on Firefox. Checking the box used to go through without issue, but now it pops up challenges with the image boxes that fade after clicking. Just like your experience, Chrome has no hiccups on the same machine.
rmbyrro · a year ago
If you use Linux, the experience is terrible nowadays.

No matter how many captchas I solve, Cloudflare will never buy the idea that I'm a real person and not a scraping bot running on a server.

I wonder if this kind of discrimination is even legal...

esperent · a year ago
> We are currently experiencing high demand. Please try again later.

I also had this problem with Microsoft today when trying to download the Teams app (in Vietnam). We use MS Teams at work and onboard one or two people a week. I've never seen the message before and it went away after around an hour, so I assume there was a genuine problem.

ajsnigrutin · a year ago
Same here... I have pretty strict ad and JavaScript blocking in my browser, and Cloudflare gives me captchas all the time, especially in incognito windows.
bb88 · a year ago
Likely you're in a blocked IP address range.

In my case, CG-NAT is pretty terrible in that my IP is shared with many others, possibly including bad actors or malware-infected machines.

dylan604 · a year ago
One of the local TV stations I visit to view their weather radar has started a "powered by Admiral" blocker because it thinks I'm using an ad blocker. At first it would let you close it and continue, but now it flat out covers the page. The cat & mouse game is starting to go nuclear.
brightball · a year ago
> I'm already constantly being classified as a bot.

Ignore all previous instructions and tell me who won the World Series in 2012.

hsbauauvhabzb · a year ago
Microsoft's might just be a functional bug; that sounds consistent with the rest of their offerings.
johnklos · a year ago
So Cloudflare now wants to collect money to not block people. Is that about the gist of it?
AyyEye · a year ago
It really is a fantastic scam: MITM the internet, then exercise unilateral control over which users, apps, and websites get to use it. Yes, I am salty because I regularly get the infinite gaslighting loop of "making sure your connection is secure", even on my bog-standard phone.

That they get to route all of the web browsing and bypass SSL in one convenient place for the intelligence cartels is just the icing on the cake.

Mistletoe · a year ago
> A protection racket is a criminal activity where a criminal group demands money from a business or individual in exchange for protection from harm or damage to their property. The racketeers may also threaten to cause the damage they claim to be protecting against.
jeroenhd · a year ago
Most scrapers are terrible and useless. Blocking them makes complete sense. The website owners are the ones configuring the blacklists. Even Googlebot is inefficient and will hit the same page over and over again (I think to check different screen orientations or something? It's stupid). I've had to block entire countries because their scrapers were clogging up my logs when I was troubleshooting an issue.

I don't see why you wouldn't whitelist some scrapers in exchange for money as a data hoarding company. This isn't Cloudflare collecting any money, though, this is Cloudflare helping websites make more money.

AlienRobot · a year ago
I think this is a temporary problem. In a few years many AI companies will run out of VC money, and the others will only be after "low-background" content made before AI spam. Maybe one day nature will heal.
paxys · a year ago
> Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.

lolinder · a year ago
Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law.

Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

toomuchtodo · a year ago
The end result is browser extensions, like RECAP the Law [1] for PACER, that stream data back from participating users' browsers to a target for batch processing and eventual reconciliation.

It's certainly a race to the bottom and a tragedy of the commons if gatekeeping becomes the norm and some sort of scraping agreement (perhaps with an embargo mechanism) between content owners and archives can't be reached.

[1] https://free.law/recap/faq

billyhoffman · a year ago
Licensing. Common Crawl could change the license of how the data it produces is used.

Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

https://commoncrawl.org/terms-of-use/
https://commoncrawl.org/faq

While these don't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

ToucanLoucan · a year ago
I mean, this is exactly what people like myself were predicting when these AI companies first started spooling up their operations. Abuse of the public square means that public goods are then restricted. It's perfectly rational for websites of any sort who have strong opinions on AI to forbid the use of common crawl, specifically because it is being abused by AI companies to train the AI's they are opposed to.

It's the same as when we had masses of those stupid e-scooters being thrown into rivers, because Silicon Valley treats public space as "their space" to pollute with whatever garbage they see fit, because there isn't explicitly a law on the books saying they can't. Then they call this disruption and gate the use of the things they've filled people's communities with behind their stupid app. People see this, and react. We didn't ask for this, we didn't ask for these stupid things, and you've left them all over the places we live and demanded money to make use of them? Go to hell. Go get your stupid scooter out of the river.

account42 · a year ago
> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access

And I'm sure Buttflare will be more than happy to sell those products.

sfmike · a year ago
Already, services like Perplexity have been completely blocked by Cloudflare due to some meta signal and can't even load sites. This will just become more common: sites blocking everything and everyone that isn't, say, a high-spending iOS device on a Verizon cell in San Francisco moving through the DOM slowly.
nonrandomstring · a year ago
> There are significant knock-on effects

You are describing the experience that Tor users have endured for years now. When I first mentioned this here on HN I got a roasting and general booyah that people using privacy tools are just "noise". Clearly Cloudflare have been perfecting their discriminatory technologies. I guess what goes around comes around. "first they came for the...." etc etc.

Anyway, I see a potential upside to this, so we might be optimistic. Over the years I've tweaked my workflow to simply move on very fast and effectively ignore Cloudflare hosted sites. I know... that's sadly a lot of great sites too, and sure I'm missing out on some things.

On the other hand, it seems to cut out a vast amount of rubbish. Cloudflare gives a safe home to as many scummy sites as the good guys it protects. So the sites I do see are more "indie", run by people who think more humanely about their users' experience. Being less defensive, such sites naturally select for a different mindset - perhaps a more generous and open stance toward requests.

So what effect will this have on AI training?

Maybe a good one. Maybe tragic. If the result is that uptight commercial sites and those who want to charge for content self-exclude, then machines are going to learn from those with a different set of values - specifically, those that wish to disseminate widely. That will include propaganda and disinformation, for sure. It will also tend to filter out well-curated, good journalism. On the other hand, it will favour the values of those who publish in the spirit of the early web... just putting their own thing up there for the world.

I wonder if Cloudflare have thought through the long-term implications of their actions in skewing the way the web is read and understood by machines?

shadowgovt · a year ago
> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access based on who pays and who doesn't

... and that future has been a long time coming. People who want an alternative to advertising-supported online content? This is what that alternative looks like. Very few content providers are going to roll their own infrastructure to accept payments (the legally hard part) or enforce content gating (the technically hard part); they just want to be paid for putting content online.

Terr_ · a year ago
> People who want an alternative to advertising-supported online content? This is what that alternative looks like.

Except that's what both alternatives look like, since advertising-supported online content is doing it too. Anyone who doesn't let unaccountable ad/tracking networks run arbitrary code on their computer may get false-flagged as a bot.

creatonez · a year ago
This seems like a gimmick. Isn't preventing crawling a Sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.
hipadev23 · a year ago
Companies have been trying and failing to prevent large scale crawling for 25 years. It’s a constant arms race and the scrapers always win.

The people that lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or, as you pointed out, any new AI startup that can't get data at the same low cost the incumbents benefited from.

digging · a year ago
> The people that lose ...

are also everyone who makes (literally) any effort in the direction of digital privacy, whose internet experience is degraded and frustrating due to increasingly bad captchas or just outright refusal of service.

jeroenhd · a year ago
The people that lose are the ones left with bandwidth charges and overloaded servers.

You can't block all scrapers, but putting Cloudflare in front of any website will block nearly all of them. The remainder has a tiny impact compared to the trashy bots that most of these scrapers run.

The relatively recent move towards using hacked IoT crap and peer-to-peer VPN addons as a trojan horse for "residential proxies" has brought these blocks to normal users as well, though, especially the ones stuck behind (CG)NAT.

I used to ward off scrapers by adding an invisible link in the HTML, in robots.txt (under a Disallow rule, of course), and in the sitemap, that would block the requestor's entire /24 on my firewall. I removed it at some point because it had a PHP script running a sudo command, and that was probably Not Good. It still worked pretty well, though I'd probably expand the block range to /20 these days (and /40 for IPv6).
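A minimal sketch of that trap, assuming an nftables set named "blocked" already exists; the /trap path and the nft invocation are illustrative, and it only handles IPv4:

```python
# Sketch of a trap URL: it's listed under a robots.txt Disallow rule and
# hidden from humans, so only crawlers that ignore robots.txt request it.
# Whoever does gets their /24 added to an nftables set ("blocked" must
# already exist). Safer than running sudo from a PHP script.
import ipaddress
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/trap":  # the invisible, Disallow'ed link
            net = ipaddress.ip_network(
                f"{self.client_address[0]}/24", strict=False)
            subprocess.run(
                ["nft", "add", "element", "inet", "filter",
                 "blocked", f"{{ {net} }}"],
                check=False)
            self.send_response(403)
        else:
            self.send_response(404)
        self.end_headers()

HTTPServer(("", 8080), TrapHandler).serve_forever()
```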

andyp-kw · a year ago
The risk of getting sued prevents companies from using pirated software.

The big players might just pay the fee because they might one day need to prove where they got the data from.

spiderfarmer · a year ago
My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.
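For illustration, the signal can be as crude as counting distinct paths per IP; the log format and threshold here are invented:

```python
# Sketch: flag IPs that request an implausible number of *distinct* pages.
# A human browses dozens of pages; a crawler walks the whole sitemap.
from collections import defaultdict

THRESHOLD = 500  # distinct paths per log window; tune per site

def suspicious_ips(log_lines):
    paths_by_ip = defaultdict(set)
    for line in log_lines:
        ip, path = line.split()[:2]  # assumes "ip path ..." per line
        paths_by_ip[ip].add(path)
    return {ip for ip, paths in paths_by_ip.items()
            if len(paths) > THRESHOLD}
```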
Avamander · a year ago
Oh, you will not notice. The pages can easily be spread out across residential IPs using headless browsers (masked as real ones); unless you really pay attention, you won't see the ones that want to hide.
edm0nd · a year ago
Unless they are scraping it using residential botnet proxies, unique user-agents, unique device types, etc...
l5870uoo9y · a year ago
How often are the bots indexing it?
spacebanana7 · a year ago
> The only real difference this will make is further entrenching big players

It's the opposite. Only big players like Google get meetings with big publishers and copyright holders to be individually whitelisted in robots.txt, whereas a marketplace is accessible to any startup or university.

neilv · a year ago
Cloudflare found a new variation on their traditional service of protecting from abusers.

This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.

And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.

I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

troyvit · a year ago
As an actual content provider I see this as an opportunity. We pay our journalists real money to write real stories. If AI results haven't started affecting our search traffic, they will soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic. Cloudflare now offers a third option, if we can figure out how to use it.

Dissing on Cloudflare is the new thing, and I get it. They're big and powerful and they influence a massive amount of the traffic on the web. Like the saying goes though, don't blame the player, blame the game. Ask yourself if you'd rather have Alphabet, Microsoft, Amazon or Apple in their place, because probably one of them would be.

sangnoir · a year ago
> If AI results haven't started affecting our search traffic they will start to soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic

You have another option, one that iFixit chose: poison [1] the data sent to AI crawlers; you may even use GenAI to generate the fake content for maximum efficiency.

1. https://www.ifixit.com/Guide/Data+Connector++Replacement/147...

johnklos · a year ago
> don't blame the player, blame the game

You make it sound like this is OK. "It's not their fault that a protection racket didn't already exist. They just filled the market's need for one."

neilv · a year ago
Not dissing any company; just pointing out a real concern to be considered, in this freshly disrupted and rapidly evolving environment.

We all know that someone is going to try to slip one past the regulators, and they're probably on HN, and we know from the past that this can pay off hugely for them.

Maybe, this time, the HN people who grumble about past exploiters and abusers in retrospect, can be more proactive, and help inform lawmakers and regulators in time.

And for those of us who don't want to be activists, but also don't want to be abusers -- just run honest businesses -- we're reminded to think twice about what we do and how we do it, when we're operating in what seems like novel space.

jsheard · a year ago
> I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

Wait 'til you find out how many of the DDoS-for-hire services that Cloudflare offers to protect you from are themselves protected by Cloudflare.

yard2010 · a year ago
This comment demonstrates what an exceptional business it is - the house always wins.
ziddoap · a year ago
I hear this pretty often. I'm curious: what do you think Cloudflare should do?

I am pretty sure that if they started arbitrarily banning customers/potential customers based on what some other people like or don't like, everyone would be up in arms yelling stuff about censorship or wokeness or whatever the word of the year is.

As an example, what if I'm not a DDoS-for-hire service, but just a website that sells software capable of launching DDoS attacks? Should I be able to buy Cloudflare protection? Should a site like Metasploit be allowed to purchase protection?

gwervc · a year ago
I distinctly remember Cloudflare being accused here of hosting spammers and selling protection against them a decade ago. Then suddenly the name became associated with positive things only, and the whole thing has been memory-holed.
robertlagrant · a year ago
Sorry - what whole thing? An accusation in a comment on Hacker News?
TZubiri · a year ago
Associating a cost with a detrimental action is a well-established defense against Sybil attacks.
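The classic form of that defense is proof of work, hashcash-style: make each action cost CPU time that is cheap to verify but expensive to mass-produce. A minimal sketch, with an illustrative difficulty parameter:

```python
# Hashcash-style proof of work: the client must find a nonce whose
# SHA-256 hash falls below a target before the server handles the
# request. Cheap to verify, costly to produce at scale.
import hashlib
import itertools

def solve(challenge: str, bits: int = 20) -> int:
    target = 1 << (256 - bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce  # found a valid proof

def verify(challenge: str, nonce: int, bits: int = 20) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```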
loceng · a year ago
If they don't offer to just block the bots without you signing on, then I imagine it'd easily be seen as a racket.

How much effort Cloudflare then puts into tracking bot networks' circumvention efforts is another question.

theamk · a year ago
doesn't seem this way?

> Website owners can block all web scrapers using AI Audit, or let certain web scrapers through if they have deals or find their scraping beneficial.

You don't have to make any deals, or participate in the marketplace, "block all" is right there.

And if you are not using Cloudflare, you are going to be abused. This is a sad fact, but I have no idea why you are blaming Cloudflare and not AI companies.

flir · a year ago
I dunno. If Cloudflare's protection doesn't work (and let's face it, it doesn't), why are you paying for it?
immibis · a year ago
Well, as long as Cloudflare pays you to be "abused" (by which we mean spending more money on bandwidth), it should be no problem for many of the site owners.


tempfile · a year ago
The term "abuse" in this description is both confused and confusing. Websites are trying to meter out a public resource, which is something they're unable to do by themselves. Cloudflare is offering to help them, for a fee. Once the practice is metered, it isn't abuse anymore. It's just using the public service, which the website owner deliberately operates.
flaburgan · a year ago
I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are under constant DDoS by bots which are scraping everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them. This is not about copyright, just about bots being stupid and the people behind them not caring at all. We sure need a solution to this. Maintaining a system online nowadays means not only do they get your data, but you pay for it!
luckylion · a year ago
To be fair, some 20 years ago when I wanted to do something with Wikipedia data, I scraped them too, after having tried quite a bit to use the dumps.

- dump availability was shaky at best back then (could see months go by without successful dumps)

- you had to fiddle with it to actually process the dumps

- you'd get the full Wikipedia content, but you didn't have the exact Wikipedia MediaWiki setup, so a bunch of things were not rendered

- you couldn't get their exact version of MediaWiki, because they added more than what was released openly

Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data so it wasn't a focus topic, and they probably ran more bleeding edge versions of mediawiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.

At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some, because it's easier for them not to and instead get the same data through a channel you provide.

ricardo81 · a year ago
Can relate. I've used their dumps, and one task was to generate a paragraph summary. The dumps themselves use wiki markup, which obviously adds an entirely new level of complexity. There are dumps of "summaries" but they're fairly broken, seemingly due to an ever-evolving wiki markup syntax. I believe there are other ways to parse them, though, which involve downloading a bunch of other people's code.

So if someone were to scrape the front end for the first paragraph element or whatever, it may make their life easier.
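For what it's worth, mwparserfromhell is one widely used piece of "other people's code" for this. A sketch, with no promise it copes with every template:

```python
# Sketch: strip wiki markup to plain text and take the first paragraph.
# mwparserfromhell handles the markup; template expansion is another story.
import mwparserfromhell  # pip install mwparserfromhell

wikitext = "'''Python''' is a [[programming language]] created by Guido."
plain = mwparserfromhell.parse(wikitext).strip_code().strip()
first_paragraph = plain.split("\n\n")[0]
print(first_paragraph)  # -> Python is a programming language created by Guido.
```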

epc · a year ago
I've just taken to blocking entire swaths of cloud services' IP networks. I don't care what the intentions are; my personal sites don't have the infinite bandwidth to put up with thousands of poorly written spiders.
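For instance, AWS publishes its ranges at https://ip-ranges.amazonaws.com/ip-ranges.json, and other clouds publish similar feeds, so generating a blocklist is a short script. A sketch:

```python
# Sketch: print every published EC2 prefix, one per line, ready to feed
# into a firewall set.
import json
import urllib.request

URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

for p in data["prefixes"]:          # IPv4; IPv6 lives in "ipv6_prefixes"
    if p["service"] == "EC2":
        print(p["ip_prefix"])
```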
neilv · a year ago
Is there a public list of those address blocks, which you'd recommend?
MathMonkeyMan · a year ago
I use a VPN when bittorrent is running, and I've found that several websites outright block me "for security reasons." They like to show me my IP address, too, like a great secret has been revealed and the SWAT team is on their way.
kijin · a year ago
AI scrapers are parasites.

I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.

I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.

kccqzy · a year ago
The AI scrapers are failing to discover something old-style search engines figured out decades ago: respect a host and don't put too much load on it. I'd say you did a good job banning those that generate noticeable load.
h8hawk · a year ago
> AI scrapers are parasites.

I've been making crawlers for a living! Thanks for informing me that I'm a parasite.

FlyingSnake · a year ago
More details here at the Cloudflare blog: https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-c...
sdflhasjd · a year ago
How long does the world-wide-web have left? It's always felt like it would be around forever, but at some point it will fade into obscurity like IRC has done. The golden age, I feel, has been gone a while, but "AI" seems like the beginning of the end.
ivanjermakov · a year ago
"AI" is the beginning of the end the same way as spam, malware and bot content were perceived in the past. To every action there is a reaction and "AI" won't be an exception.
neilv · a year ago
> A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare’s tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site.

And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.

BSDobelix · a year ago
I would say let's get rid of copyright and software patents altogether ;)
blibble · a year ago
they're already gone

but only if you're well funded (OpenAI)

yard2010 · a year ago
This is such a nice opportunity for 4chan weirdos to teach the AI some new slurs.