lich_king · 9 days ago
I don't understand the metric they're using. Which is maybe to be expected of an article that looks LLM-written. But they started with ~250 URLs; that's a weirdly small sample. I'm sure there are tens of thousands of malicious websites cropping up monthly. And I bet that Safe Browsing flags more than 16% of those?

So how did they narrow it down to that small number? Why these sites specifically?... what's the false positive / negative rate of both approaches? What's even going on?

john_strinlai · 9 days ago
>what's the false positive / negative rate of both approaches

the false positive rate is 100%. they just say everything is phishing:

"When we ran the full dataset through the deep scan, it caught every single confirmed phishing site with zero false negatives. The tradeoff is that it flagged all 9 of the legitimate sites in our dataset as suspicious, which is worth it when you're actively investigating a link you don't trust."

eurleif · 8 days ago
A very long time ago, I had the idea to set up a joke site advertising "SpamZero, the world's best spam filter", with a bunch of hype about how it never, ever misses spam. When you clicked the download link, the joke would be revealed: you would get a file consisting of `function isSpam(msg) { return true; }`.

Apparently that's not a joke anymore?!

lorenzoguerra · 9 days ago
it's 100% for what they call the "deep scan", and 66.7% for the "automatic scan". Practically unusable either way
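The rates quoted in this thread fall straight out of the numbers in the article excerpt above. A minimal sketch of the arithmetic, assuming the dataset is the 9 legitimate sites mentioned, all 9 flagged by the deep scan; the 6-of-9 split for the automatic scan is an assumption inferred from the 66.7% figure, not a count quoted in the article:

```javascript
// False positive rate = legitimate sites flagged / total legitimate sites.
// Counts from the article as quoted above (9 legit sites); the 6/9 split
// for the automatic scan is a hypothetical reconstruction of 66.7%.
function falsePositiveRate(flaggedLegit, totalLegit) {
  return flaggedLegit / totalLegit;
}

console.log(falsePositiveRate(9, 9)); // deep scan: 1, i.e. 100%
console.log(falsePositiveRate(6, 9)); // automatic scan: ~0.667, i.e. 66.7%
```

Note that with only 9 legitimate sites in the sample, each flagged site moves the rate by more than 11 points, which is part of why the small dataset draws criticism upthread.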
jdup7 · 9 days ago
Probably could have been a bit more descriptive around the dataset. Our tooling pulls in a lot more than 250 URLs, but since we manually confirm them, that means a smaller dataset. In other words, out of the URLs we pulled in, these 250 were confirmed (by a human) as phishing. We did not do any selection beyond that. As for the article, LLMs were used to help with the graphs and grammatical checks, but that's it. This was our first month of going through this exercise, and we definitely want to have larger datasets going forward as we expand capacity for review.

As for Safe Browsing catching more than 16%: it depends on the timeline. At the time these attacks are launched, it's likely Safe Browsing catches closer to 0%, but as time goes on that number definitely climbs.

mholt · 9 days ago
I never loved the idea of GSB or centralized blocklists in general due to the consequences of being wrong, or the implications for censorship.

So for my master's thesis about 6-7 years ago now (sheesh) I proposed some alternative, privacy-preserving methods to help keep users safe with their web browsers: https://scholarsarchive.byu.edu/etd/7403/

I think Chrome adopted one or two of the ideas. Nowadays the methods might need to be updated especially in a world of LLMs, but regardless, my hope was/is that the industry will refine some of these approaches and ship them.

notepad0x90 · 9 days ago
Block lists will always be used for one reason or another. In this case these are verified malicious sites; there is no subjective analysis element in the equation that could be misconstrued as censorship. But even if there were, censorship implies a right to speech, and Google has the right to restrict the speech of its users if it so wishes. As a matter of fact, through extensions, there are many that do censor their users using Chrome.
like_any_other · 8 days ago
> censorship implies a right to speech, in this case Google has the right to restrict the speech of its users

I don't follow. Even if Google does have the legal right [1], that does not make the censorship less problematic, or morally right. And even if it's hard to make a legislative fix ("You want to ban companies from trying to protect their users from phishing?") [2], that doesn't undo the problems of the current state, or mean we should be silent about it.

[1] This is far from certain, as it could be argued to be tortious interference, abuse of market power, defamation if they call something phishing when it's not.. Then there's the question of jurisdiction..

[2] It's a very common debating tactic to assert that a solution is difficult, to avoid admitting a problem exists.

rstupek · 9 days ago
I know for a fact that GSB contains non-malicious sites in its dataset.
dvh · 9 days ago
Just yesterday I marked another Gmail phishing scam. This wouldn't be worth mentioning but they are using Google's own service for it. It has to be intentional, there is no other explanation. https://news.ycombinator.com/item?id=46665414
inemesitaffia · 8 days ago
Seen similar with CloudFlare
obblekk · 9 days ago
Maybe I’m an outlier but I’d rather this than accidentally block legit sites.

Otherwise this becomes just another tool for Google to wall in the subset of the internet they like.

timnetworks · 9 days ago
The most dangerous links recently have been from sharepoint.com, dropbox.com, etc. and nobody is going to block those.


Anamon · 6 days ago
Forget about checking Safe Browsing. Try reporting phishing sites to Google if you really want a reason to hate them.

I report all (manually verified) phishing sites I get spam for to Safe Browsing, then automatically check the listing status daily, and re-report every week if it's still not blacklisted. Usually, takedown by hosters is faster. Phishing sites that stay up long enough often go for 3-4 months before Google finally deems it apt to block them. I think the entire Safe Browsing team consists of one overworked, underpaid intern.

Compare this to Microsoft Defender. The time it takes for a site I report there to get blocked is usually measured in hours, not weeks. This is one thing MS actually seems to be running decently.
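The daily listing-status check described above can be scripted against Google's Safe Browsing Lookup API (the v4 `threatMatches:find` endpoint). A minimal sketch, assuming you have a Safe Browsing API key; the `phish-report-checker` client ID is a hypothetical name, and an empty response body from the API means the URL is not currently listed:

```javascript
// Build the v4 Lookup API request body for a single URL.
function buildLookupRequest(url) {
  return {
    client: { clientId: "phish-report-checker", clientVersion: "0.1" },
    threatInfo: {
      threatTypes: ["SOCIAL_ENGINEERING", "MALWARE"],
      platformTypes: ["ANY_PLATFORM"],
      threatEntryTypes: ["URL"],
      threatEntries: [{ url }],
    },
  };
}

// Query the Lookup API; an empty JSON object means "not listed".
async function isListed(url, apiKey) {
  const res = await fetch(
    `https://safebrowsing.googleapis.com/v4/threatMatches:find?key=${apiKey}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(buildLookupRequest(url)),
    }
  );
  const data = await res.json();
  return Array.isArray(data.matches) && data.matches.length > 0;
}
```

Run on a schedule (cron or similar), this covers the "check daily, re-report if still unlisted" half of the workflow; the reporting itself still goes through Google's manual report form.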

7777777phil · 9 days ago
Blocklists assume you can separate malicious infrastructure from legitimate infrastructure. Once phishing moves to Google Sites and Weebly that model just doesn't work.
nickphx · 9 days ago
So I tested out the extension.. First the extension spammed me with "login required".. So I click the notification to be taken to a login page.. Great? Now I have to create an account and verify a link.. Now I can test how great this is against a "fresh" facebook phishing page being actively promoted via Facebook Ads..

hxxps://r7ouhcqzdgae76-fsc0fydmbecefrap.z03.azurefd.net/new2/?utm_medium=paid&utm_source=fb&utm_id=6900429311725&utm_content=6900429312725&utm_term=6900429314125&utm_campaign=6900429311725

The "extension" did a "scan". {"url":"https://r7ouhcqzdgae76-fsc0fydmbecefrap.z03.azurefd.net/new2..."}

response: {"classification":"clean"}

great work?

If I click "Deep scan".. I see a screenshot blob being sent over.. response: { "classification": "phish", "reasons": [ "Our system has previously flagged this webpage as malicious." ] }

So if the site were already flagged, why does the "light" scan not show that?