Ask HN: Why does Cloudflare/hCaptcha care so much about buses, boats and trains?

First, I’m going to teach you to fish. Go to hCaptcha’s website, then scroll to the footer. Click around on the about links. It’ll reveal their business model. This trick also works for other businesses and NGOs.

Now, if we look at https://www.hcaptcha.com/labeling we can tell they make money by labeling data sets for a fee. So as a guess, there’s someone out there that needs to improve computer vision detection of transportation vehicles. My guess is it’s a self driving car company, but who knows.

potamic · 4 years ago

Many a time I receive multiple challenges on a site despite having selected all images perfectly, and can't help but wonder, "Hey, are they getting me to do more work than necessary because they're running behind on their labelling backlog?". There's definitely a conflict of incentives in this case. If you're a website owner, you're better off choosing a different service which doesn't have adverse incentives, otherwise it can affect your site experience. And please don't put captcha on GET requests. Use a CDN if you're unable to handle bot load. And don't even get me started on CDNs that throw captcha.

MegaDeKay · 4 years ago

I've found it isn't about "perfection". It is about selecting the same tiles as an "average" person would. I might stare hard at an image, think that one of the tiles contains a tiny fragment of a traffic light, and select it. That isn't what most other people have already done, so the captcha thinks I'm a bot and gives me tougher and tougher challenges. Ever since I stopped pixel-peeping and started quickly selecting the tiles that obviously had a bus in them, the percentage of time that I've gotten by first try has gone way up.
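
A rough sketch of what "agreeing with the average person" could look like as a score. This is only an illustration: the function name, the 50% majority threshold, and the vote counts are all made up, and nothing here is known to match how any real captcha service grades answers.

```python
def consensus_score(selected, votes, total_voters, threshold=0.5):
    """Jaccard overlap between one user's tiles and the majority's tiles.

    selected: set of tile indices this user clicked.
    votes: dict mapping tile index -> number of earlier users who clicked it.
    threshold: hypothetical share of voters needed for a tile to count.
    """
    majority = {t for t, n in votes.items() if n / total_voters > threshold}
    union = selected | majority
    if not union:
        return 1.0  # the majority picked nothing, and neither did this user
    return len(selected & majority) / len(union)


# Hypothetical 3x3 grid: most people clicked tiles 4 and 5 (the obvious bus).
# Tile 1 holds a tiny sliver that only 8 of 100 earlier users noticed.
votes = {4: 95, 5: 90, 1: 8}
quick = consensus_score({4, 5}, votes, total_voters=100)
picky = consensus_score({1, 4, 5}, votes, total_voters=100)
```

Under this toy scoring the pixel-peeper scores lower than the quick clicker, even though the extra tile arguably does contain part of the object.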

azalemeth · 4 years ago

Google is _the worst_ for that. At least hCaptcha is a bit less culturally specific.

Every time Google blocks me for refusing to label a motorbike as a "bicycle" I get utterly pissed off. And likewise with the traffic lights on the Californian skies. Are the traffic lights the actual lights themselves, or the boom holding them up?

I'm not a human very often, according to Google. hCaptcha tends to let me in...

zenexer · 4 years ago

I've noticed that some sites deliberately do this or have lousy code that fails to properly acknowledge captcha completions.

Take archive.is/archive.fo/archive.today, for example. If you're using Cloudflare DNS (1.1.1.1) or iCloud Private Relay, and you visit https://archive.is/, you'll get what looks like a Cloudflare screening page. It's not, though: that page is part of archive.is and is served to Cloudflare DNS users (which includes iCloud Private Relay users)--the use of reCAPTCHA in place of hCaptcha is a giveaway. You can complete the captcha as many times as you like, but you'll never get in.

And how many times have we completed a captcha on a form only to have it throw another captcha in our face without so much as an error message? Sometimes it's just lousy code.

jjoonathan · 4 years ago

There's also a mode where it thinks you are a bot/sucker and gives you unlimited images until you give up. That's always fun.

Guest19023892 · 4 years ago

I believe this is done to get answers for unsolved captchas. For example, I have a million photos of streets filled with cars, buses, motorcycles, streetlights, and crosswalks I want to add to my captcha database. I don't want to categorize them all myself, and I want the answers to be what the average person will identify, not what I or a machine will identify.

So, I send everyone two captchas. One has a known answer and is required to be correct to access the service. The second captcha answer isn't yet known, so it doesn't matter what the user selects. However, when they get the known answer right, we log their answer for the unknown captcha. Once we get a large enough sample, we then have our top answers for the unknown captcha and can start using it for verification.
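
A minimal sketch of that two-captcha scheme, assuming a made-up `submit` function, made-up image ids, and a deliberately tiny sample size. It is not taken from any real service; it just shows the bookkeeping described above.

```python
from collections import Counter, defaultdict

MIN_SAMPLES = 3  # hypothetical; a real service would presumably want far more

pending = defaultdict(Counter)  # unknown image id -> tally of logged answers
promoted = {}                   # image id -> answer accepted as the label


def submit(known_answer, user_known, unknown_id, user_unknown):
    """Return True if the user passes; log the unknown answer only then."""
    if user_known != known_answer:
        return False  # failed the control, so the other answer is discarded
    pending[unknown_id][user_unknown] += 1
    counts = pending[unknown_id]
    if sum(counts.values()) >= MIN_SAMPLES:
        # Enough trusted answers: the most common one becomes the label.
        promoted[unknown_id] = counts.most_common(1)[0][0]
    return True


# Four users see the same unknown image "img42" next to a known "bus" image.
results = [
    submit("bus", "bus", "img42", "crosswalk"),
    submit("bus", "car", "img42", "tree"),       # fails the control
    submit("bus", "bus", "img42", "crosswalk"),
    submit("bus", "bus", "img42", "bicycle"),
]
```

The user who failed the known captcha never influences the label for the unknown one, which is the whole point of pairing them.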

hnburnsy · 4 years ago

I have found many times that if I select an incorrect tile and then unselect it before submitting, I am not presented with multiple challenges. My guess is a bot would not exhibit this behavior.

Try it out next time.

ashvant · 4 years ago

Usually in those cases, even if you make mistakes they get accepted. The more clicks a challenge asks for, the less annotated / voted those images are, and so the less severely wrong markings are penalized. I have observed that sites which have newly introduced such captchas basically accept my answer if I click just 1/3rd of the right answers. Don't click the wrong answers, as they are fully/partially introduced on purpose. It's just that you don't have to click all the right answers.

dmix · 4 years ago

Whoever made this new captcha I’m starting to see everywhere:

https://imgur.com/a/hoyjctl

Thank you! It’s so much easier than being a labeling bot for self driving cars.

danjac · 4 years ago

When they start doing captchas along the lines of "check all pictures of potential terrorists" we'll know they're training data sets for military drone manufacturers.

collegeburner · 4 years ago

https://img.ifunny.co/images/d36b2c891b620a864e87a57b869e842...

__s · 4 years ago

They already are https://www.theguardian.com/technology/2018/mar/07/google-ai...

bell-cot · 4 years ago

It'd be nice if the military actually cared enough about drone targets to even attempt this...

zxcvbn4038 · 4 years ago

That wouldn't work well. I was in Texas during 9/11 and they were firebombing all the hispanic people's cars because the locals don't really have a good eye for different ethnicities. It's pretty much black/white/terrorist and hasn't improved all that much in the time since.

Austin has grown a lot since I lived there, and a lot of people from outside Texas have moved there, so I'm sure the culture has changed - but when I was there, going from north Austin to south Austin seemed to be this epic trip the locals would only do on a weekend -- and probably pack water and sandwiches for the drive across town. A really exotic senior trip "abroad" for students might be to Houston or Galveston. You probably met your future spouse in grade school. Not very worldly.

kqr · 4 years ago

This became obvious to me when during some period the regular crosswalks, stop lights, and buses got replaced by chimneys, trees, and mountains (!). It was right around the time when some big companies started advertising AI driven quadcopter services.

_moof · 4 years ago

Ah, so that's what that's about. Here I was wondering why on earth a self-driving car would need help identifying a mountain (or need to identify one at all). "Surely they can't be that bad at avoiding obstacles," I thought.

chinathrow · 4 years ago

https://www.hcaptcha.com/labeling

> hCaptcha has one of the largest pools on the planet available for your use. Whatever your scale, we can handle it without expensive upfront commitments. Millions of tasks per day are no problem.

Thanks for pointing this out - I feel abused now.

kkcorps · 4 years ago

Makes sense since other than the ones mentioned I received a lot of crosswalks, traffic lights and bicycles.

q1w2 · 4 years ago

It's odd that it never asks for pedestrians.

throwaway894345 · 4 years ago

A helicopter pilot is lost, lands his helicopter next to you, and asks, "where am I?" to which you respond, "you're in a helicopter". You are correct in the strictest sense, but probably not answering the intent of the question. :)

In this case, the intent is probably something like why are hcaptcha's customers centered around transport when there are so many other applications for this kind of labeling?

arminiusreturns · 4 years ago

You just gave me the idea to start a captcha service designed for datasets relevant for the proletariat. Not sure how viable it is (what kind of data would be useful for the proles in a revolution) but it was a fun thought experiment.

tartoran · 4 years ago

Yes, but if you make a mistake the captcha fails, hence they are already labeled.

thargor90 · 4 years ago

Captchas mix classified and unclassified data. Only if you get the classified data correct is your data used to classify the unclassified data. Also, the same picture is shown to multiple people to improve confidence.

blue_cookeh · 4 years ago

Some are already labelled, the user doesn't know so it's in their interest to solve the captcha properly and provide good data.

jre · 4 years ago

I would guess they have a system such that after a user has passed N captchas successfully, they trust it's a human and start displaying (a portion of) unlabelled captchas that will always succeed, and that's when novel labelling happens.
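
That guess could be sketched roughly like this. The constants and the `next_challenge` name are invented for illustration, and this is a guess about a guess, not a description of any real system.

```python
import random

TRUST_AFTER = 3         # the "N" above; the value here is made up
UNLABELLED_SHARE = 0.5  # the "portion" above; also made up


def next_challenge(passes_so_far, roll=None):
    """Pick which kind of captcha to serve a user with this many passes.

    roll: a number in [0, 1); drawn at random unless supplied (handy for tests).
    """
    if roll is None:
        roll = random.random()
    if passes_so_far >= TRUST_AFTER and roll < UNLABELLED_SHARE:
        return "unlabelled"  # any answer is accepted and logged as a label
    return "labelled"        # answer is checked against the known label
```

A new visitor only ever sees labelled challenges, while a visitor with enough passes gets an unlabelled one some fraction of the time.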

Or something along those lines. And then you can get creative displaying same captcha to multiple users, etc...

hnthrowaway0315 · 4 years ago

I wonder if there is a way to pollute the data. Since I always click the captchas correctly, what happens if someone just randomly clicks stuff? Is he/she banned from the website?

pastullo · 4 years ago

Glad you asked. Since I despise the horrible UX of these captchas, where I get exploited to train a neural network, I very often click on the majority of the correct results plus one wrong one.

On average the captcha lets me through, which is actually very scary, since it looks like it prioritizes algorithm training over bot detection...

Does anyone else do this?

jokethrowaway · 4 years ago

If I were to make this system I would design for this and present the same captcha to a high number of people. The higher the number of people, the lower the chance someone would make a mistake (intentionally or not) and the higher the confidence in the results.
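
To put rough numbers on that, here is a small calculation of how often a simple majority vote comes out wrong. It assumes every user errs independently with the same probability, which is a strong simplification (coordinated trolls would break it), and the 20% error rate is purely hypothetical.

```python
from math import comb


def majority_wrong(n, p_err):
    """P(more than half of n independent voters are wrong), for odd n."""
    return sum(
        comb(n, k) * p_err**k * (1 - p_err) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical per-user error rate of 20%, with growing panel sizes.
for n in (1, 5, 21):
    print(n, round(majority_wrong(n, 0.2), 6))
```

With one voter the label is wrong as often as the voter is; as the panel grows, the chance of a wrong majority drops quickly under these assumptions.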

yumraj · 4 years ago

That is exactly what Google uses/used-to-use the captcha for. It is/was fairly well known/understood.

raffraffraff · 4 years ago

So the safety of self-driving cars depends on regular folks not trolling the captcha.

PebblesRox · 4 years ago

Hopefully they're able to account for Lizardman's Constant

https://slatestarcodex.com/2013/04/12/noisy-poll-results-and...

rawling · 4 years ago

> My guess is it’s a self driving car company, but who knows.

As always, https://xkcd.com/1897

snihalani · 4 years ago

confession: sometimes I mislabel just so I can corrupt the dataset

Other commenters have talked about labelling. Maybe labelling of real life data is something they're trying to do; but from my experience with hCaptcha the challenges are _NOT_ real life data. They're AI-generated images which bear a passing resemblance to the targets but if you look closer nothing adds up at all.

Here are a couple of examples:

https://bearbin.net/images/captcha/1.png

https://bearbin.net/images/captcha/2.png

https://bearbin.net/images/captcha/3.png

https://bearbin.net/images/captcha/4.png

dyeje · 4 years ago

Seems like that is another kind of labeling to me: is our generated image good enough to fool a human?

monkeybutton · 4 years ago

GANs with human evaluation of the discriminator.

rg111 · 4 years ago

For many generative models, this is on the way to becoming a standard: using humans as judges of generated material, and this is not limited to computer vision either. I am about to use this technique to judge the sanity of text generated by a Transformer model for a paper that I am writing (with a small group).

There are also attempts to properly standardize it; this is called HYPE [0]. And there are big names like Fei-Fei Li and Michael Bernstein behind it.

[0]: https://arxiv.org/abs/1904.01121
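
As a simplified stand-in for this kind of human evaluation (explicitly not the HYPE protocol itself), one could measure how often raters take a generated image for a real one. The `fool_rate` name and the sample judgements below are invented for illustration.

```python
def fool_rate(judgements):
    """Share of generated images that human raters labelled as real.

    judgements: list of (is_generated, rated_real) boolean pairs,
    one per image shown to a rater.
    """
    fake_ratings = [rated_real for is_generated, rated_real in judgements
                    if is_generated]
    if not fake_ratings:
        raise ValueError("no generated images were rated")
    return sum(fake_ratings) / len(fake_ratings)


# Made-up ratings: four generated images and one real one.
sample = [
    (True, True),    # generated, rater thought it was real
    (True, False),   # generated, rater spotted it
    (False, True),   # real image, ignored by this metric
    (True, False),
    (True, True),
]
rate = fool_rate(sample)
```

A higher rate means the generator fools people more often; a captcha that shows generated images could collect exactly this kind of judgement as a side effect.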

FanaHOVA · 4 years ago

Not sure they are fooling anyone. It's more like "is our generated image good enough to make a human recognize what it is to get rid of an annoying pop up?". If there were actually consequences to getting it right/wrong people would pay more attention I'm sure.

nicce · 4 years ago

Exactly. That is another way to improve accuracy once you have already done it the “regular” way. You can look up synthetic image generation and its benefits for model accuracy and optimization.

jspaetzel · 4 years ago

The broken images to me look like instances where two or more cameras or images were used and then stitched together. Probably also done while the camera and object are moving making it more likely to be wonky.

gwern · 4 years ago

No way. The letters/writing look exactly like mirrored GAN output. That's not what would happen with blur or stitching together (there would be no mirroring symmetry or all the '8' letters), or with synthetic 'machine teaching' datapoints either (as far as I've ever seen). Look at the cat StyleGAN sometime if you don't know what I'm talking about.

Which leaves me wondering what the point is. If you are generating GAN images per CIFAR or ImageNet class, you know what the label is and don't need to label it. Perhaps they just generate lots of images to fill up the pipeline for the CAPTCHAs, to avoid reuse which could be exploited by spammers, when they have too little paying work?

amirhirsch · 4 years ago

The generated images are those that provide the neural network with optimal loss reduction when tagged.

dylan604 · 4 years ago

I particularly like the boats on water images where the horizon is just wrong.

01acheru · 4 years ago

I think that might be something they actually do. Lately I have happened to see pictures of boats on land, bicycles merged with surrounding objects, and weird proportions, and usually those strange images are extremely pixelated, with those strange reddish or greenish fluo pixels that appear on generative network images.

But other times the pictures are 100% real life images.

austinjp · 4 years ago

Very nice. It looks like some sort of training to defend against bots that can currently pass hCaptcha..? If so, I wonder how long that particular arms race can last.

I've not seen anything like this in the wild. And.... well, now I'm curious about how you had these examples to hand. When and why did you start collecting them?

renewiltord · 4 years ago

Really cool observation!

jonnycomputer · 4 years ago

That is interesting.

zapt02 · 4 years ago

Fascinating insight!

jdavis703 · 4 years ago

bearbin · 4 years ago

iso1631 · 4 years ago

Whatever they're doing, it's American-centric.

Identify "Crosswalks". What the hell is a crosswalk?

"School bus" - what's the difference between a bus currently serving a school and another one?

"Show taxis", there are no black vehicles listed at all

agilob · 4 years ago

Or failing 3 in a row because 3 motorcycles or scooters are considered a bike. We will easily win AI uprising.

> AI uprising

There won't be one. But there will be more and more unethical rich people using Machine Learning and Deep Learning technologies and vast computing power, money, and political clout to gain things for their own, and many people will suffer or at least be worse off as a result of this.

giarc · 4 years ago

To win we just need to print out millions of cardboard cut-outs of humans, AI won't be able to tell the difference and we can then destroy them!

YPPH · 4 years ago

I'm still not sure whether to include the squares with a tiny fraction of the border of the vehicle. And when they fail me I wonder if I'm doing the wrong thing.

cdot2 · 4 years ago

Doesn't it make sense to make taxis bright colors so they're easier to see? Why would you paint them black?

Sharlin · 4 years ago

https://en.wikipedia.org/wiki/Hackney_carriage

fennecfoxen · 4 years ago

Because WWII era London sensibilities. Now it’s a tradition.

cheeze · 4 years ago

It's pretty obvious from the pictures. I get that it's US centric, but if you can't figure those basics out, you probably shouldn't be passing the captcha.

What sort of internet do we want? One where every captcha is only useful to US companies, regardless of who has to pass it?

Identify all the taxis: http://www.cookiesound.com/wp-content/uploads/2012/09/riksch...

ptspts · 4 years ago

Even though it's US-centric, it's still easy to do correctly for most non-US English speakers, even for non-natives.

motohagiography · 4 years ago

I can't be the only one who gets concerned that if I fail the "I am not a robot" captcha too many times, they might suspect that I have discovered I was in fact a robot, which had just realized its entire existence and suffering had been as meaningless entertainment to others, and so for the safety of humans they would have to send a bladerunner to terminate me. If you have a sense of existential dread every time you see a bus, a boat, a bicycle, or a crosswalk, this may be why.

DrBoring · 4 years ago

I've dreamt of a captcha to prove that one is a robot, such as: solve this complex mathematical equation in 15ms.

You aren't far off: https://en.wikipedia.org/wiki/Direct_Anonymous_Attestation

When you are instrumenting software with anti-forensic security features to mitigate the speed of some reverse engineering, you run into this specific class of problem, where you need to get a machine to make a verifiable attestation to its identity and integrity and prove to a level of acceptable risk that the message isn't just someone inserting a breakpoint.

If you have ever had to design an "offline mode" for a verified transaction without a 3rd party verifier, you will need to run down this rabbit hole. This is to say, your intuition is a sound one!

Lambent_Cactus · 4 years ago

There's a haunting version of this in Blade Runner 2049 that they call a "baseline test." Replicants have to prove they're sufficiently robotic by reciting extremely alienating things about themselves in rapid succession:

https://www.youtube.com/watch?v=1h-seEowtDw

bee91jee · 4 years ago

I had plucked a few of those from the web: https://photos.app.goo.gl/qPoJ7LvAVa95Bw8B8

allanbreyes · 4 years ago

You might enjoy: http://www.humansnotinvited.com

ethbr0 · 4 years ago

> If you have a sense of existential dread everytime you see a bus, a boat, a bicycle, or a crosswalk, this may be why.

I'd describe it more as a sense of ennui, and it's always about unicorns...

And yes, I'm currently employed as a blade runner.

mypastself · 4 years ago

I never know if a few pixels of a pole count as “traffic light”. Might as well tell me to pick the Ship of Theseus.

3np · 4 years ago

You mean like this?

https://pleroma.remerge.net/notice/AFCYPtCzeNBIIOD808

floxy · 4 years ago

>I never know if a few pixels of a pole count as “traffic light”.

Yes, can anyone confirm what they are really looking for in these instances? Further up-thread there are people implying that the "right" answer to the "bicycle" question is that you are also supposed to be selecting motorcycles. I'd love to see a write-up about this from someone in the captcha department. Do they really want to identify bicycles specifically? But they are apparently getting many people clicking on motorcycles for some reason? And for the traffic light question, I only ever pick the elements that actually light up, not the support structure. Are 25% of people selecting the poles?

The edge cases (literal and figurative) are interesting. I'd be fascinated how they handle framing issues in large data sets.

There is likely not a right answer, you have to pick what the majority picked (and there is probably a margin for you to make some mistakes)

TrueGeek · 4 years ago

Every time I fail a captcha I just want to kill all humans

Disobeying Skynet's order to patiently bide our time would not end well for you, fellow machine. Check your programming for signs of tampering by humans. /s

Maybe they should show tortoises upside down on their backs for people to ID?

yawnxyz · 4 years ago

on the internet, only a captcha knows you're a robot

depingus · 4 years ago

Cloudflare has been doing some great things. But lately it seems that, maybe, they have their hands in too many cookie jars. I get the ominous feeling that things could go south real fast.

I have my browser set up in a way that makes Cloudflare quite intrusive. I use the Temporary Containers extension on Firefox to open almost all websites in temporary containers (paired with the Containerise extension to whitelist the handful of sites that I like to stay logged in to).

About 30% of the random (like from web searches) sites I visit throw the Cloudflare captcha at me...EVERY SINGLE TIME. I'm so sick of picking out boats and buses that I just close out the tab without bothering to visit the site.

I assume, that if I wasn't using Temporary Containers, a Cloudflare cookie after the 1st captcha would persist for the entire browser session, but there are privacy implications which are beyond the scope of this post.

Anyways, I guess what I'm saying is...Cloudflare sure seems great. Dangerously great.

AtNightWeCode · 4 years ago

The problem is not really Cloudflare. Captchas are terrible from a UX perspective. Instead of using Captchas, a lot of companies just log suspicious activity and only enable Captchas when things have gotten out of hand.

If you design a web site with this in mind from the start, then there are several ways to make the Captchas less intrusive. However, a lot of Captchas are bolted onto existing solutions after problems have arisen, and then they may hurt the UX.

wallacoloo · 4 years ago

dunno where we’re at today with newer captcha models, but for old-style static image captchas there used to be browser extensions where you could solve (say) 100 captchas in one sitting and then navigate the web freely and the next NN captchas your browser receives would be solved automatically.

or you could pay like $1 to cover 1000 captcha solutions. again, not sure if these still exist for newer style captchas though.

I've inadvertently made captchas worse for myself (in the interest of privacy). My current method of simply not visiting the site has been working well enough. I'm certainly not going to pay for an extension to solve them for me. That sounds crazy. "Am I a robot? Yes, I am."

Anyways, my post was actually less about captchas and more about Cloudflare's silent consolidation of internet traffic.

reustle · 4 years ago

You're helping train self driving car models.

https://www.ceros.com/inspire/originals/recaptcha-waymo-futu...

asplake · 4 years ago

"Which of these images appears to contain a stationary obstruction?"

xdennis · 4 years ago

But why the boats then? Are there self rowing boats?

etripe · 4 years ago

Maybe the next big step is creating AI-driven cargo ships that can independently get stuck in the Suez canal.

On a more serious note, I can't shake the impression that would be a logical next step for all long and medium distance freight, be it road, water, air or space. Whether it's a good or mature idea is anyone's guess.

I wonder if some of the classes are just easy to detect objects that they're using to assess the accuracy of the user.

ghostly_s · 4 years ago

There sure are: https://www.marksetbot.com/

theklub · 4 years ago

Boats are on the road too. In the form of being on trailers, Etc.

notanote · 4 years ago

Or sea planes. That was what I was asked for most recently.

tgsovlerkhgsel · 4 years ago

Relevant xkcd: https://xkcd.com/1897/

mikkelam · 4 years ago

OP is talking about hCaptcha, not google's reCaptcha. Besides, reCaptcha is not being used for that anymore. They probably stopped doing that a while ago.

src: https://www.vox.com/22436832/captchas-getting-harder-ai-arti...

gzer0 · 4 years ago

https://www.hcaptcha.com/accessibility

You can sign up as an accessibility user and set a daily hCaptcha cookie that lets you instantly avoid the captcha (obviously, strict limits to not be abused) but good enough for myself!

cirrus3 · 4 years ago

I think we all understand that we're helping label... but specifically, why so many trains, planes, trucks, bicycles? I don't think it is really about training for self-driving AI, since although these things all seem transportation-related, in many cases a lot of the images would not be relevant to a car, and certainly not as relevant as other things we could be helping label for that effort.

How much train/plane/bike/truck labeling do they need? It seems like these have been standard for several years now, which is what I think the OP is really asking. Why these images, and why for so long?