Posted by u/pketh 4 years ago
Ask HN: Why do search engines not let you blacklist spam domains?
Whenever I'm searching for anything even mildly off the beaten path, it's not uncommon for the top results to be SEO stuffed spam websites, or maybe even real websites that I can't access (like paywalls or requiring adblocker exceptions to proceed). Usually pages from the same domains are top-ranked for other related searches too.

As a user I'd love to be able to tell my search engine "never show me results from this domain" (similar to blocking an account on Twitter) – but as far as I can tell there is no way to do this in either Google or DuckDuckGo search.

This seems like such low-hanging fruit to me that I'm wondering if other people have ever wanted this, and if there's actually a reason not to do it.

rbinv · 4 years ago
Google used to have that feature: https://www.ghacks.net/2011/03/10/google-adds-block-all-doma...

In my opinion, back then, they needed the data as a training set for spammy domain detection. Now that SERP spam is no longer a serious issue (in Google's eyes anyway), why bother. Google always knows what's best for you.

Arnavion · 4 years ago
>Now that SERP spam is no longer a serious issue (in Google's eyes anyway)

Given the number of clones of StackOverflow and GitHub that show up on the front page, even at the cost of replacing the original SO and GH links that they copied from, I can only assume either Google's eyes are blind or the search engine devs are so good at their job that they never need to Google anything themselves.

tetsusaiga · 4 years ago
Another fun one: next time a public figure dies, especially a slightly more obscure one, just google them/their death.

Whole first page is just poorly written Fiverr-style articles, all by people for whom English is clearly a second language, at best.

Eventually, legit stuff gets on the first page too, but there's a few instances where weeks later the first result is still one of these garbage SEO-torture sites.

Has really made me lose my faith in Google lately.

dreamcompiler · 4 years ago
I'd be surprised if Googlers don't have an internal tool to filter out the crap that only they can use.
ncann · 4 years ago
Wow, I'm surprised that I never remembered that feature existed. I wonder why they removed it, just like why they removed the insanely helpful "Cached" feature.
Blackthorn · 4 years ago
Cached likely went away for the same reason image search was made to suck: legal issues.
ColinHayhurst · 4 years ago
Because Google and Bing (therefore Duck) are increasingly answer engines, keeping you on their page, supporting the SEO ecosystem and most importantly their ad revenues and network customers.

We and the other few independent search engines have not made enough of a dent in the market to suffer SEO spam. We'll have a way to deal with it (watch this space). Right now you'll certainly get results "off the beaten path" and with one click you can try out 8 other search options [0].

[0] https://blog.mojeek.com/2022/02/search-choices-enable-freedo...

mdasen · 4 years ago
Not only are they becoming answer engines, in Google's case there's a decent chance that they're making money off the spammy pages.

If the top results are "SEO stuffed spam websites", they're probably also loaded with ads. If they're loaded with ads, there's a good chance that they're using Google's ads.

If the top results are "SEO stuffed spam websites", there's a good chance they're chock-full of affiliate links. If a search for "best baby formula" is going to end up costing Amazon an affiliate fee via an "SEO stuffed spam website", Amazon might as well just buy a Google ad for that keyword and cut out the "SEO stuffed spam website". If the results go to pages that are just going to cost a seller money anyway, it gives the seller more incentive to buy ads since they aren't getting free traffic from the search engine anyway.

While being an answer engine keeps you on their page longer, feeding you SEO spam also keeps you coming back to their page; feeding you SEO spam signals to potential advertisers that they won't be getting free traffic from the search engine so they might as well pay for the traffic via ads.

I'm not suggesting that it's a conspiracy to send you bad results, but it does seem likely that as long as they aren't losing traffic to competitors, it might not be something that becomes a priority.

ColinHayhurst · 4 years ago
> If the top results are "SEO stuffed spam websites", they're probably also loaded with ads. If they're loaded with ads, there's a good chance that they're using Google's ads.

Bingo. If you can't get clicks on AdWords get them on AdSense.

csmeder · 4 years ago
My go-to test search term is: "grass-fed beef restaurant in sf"

You do much better than Google (Google will always include "Top 10 Best Grass Fed Organic Steak in San Francisco, CA" from Yelp and then link to places that don't have Grass-fed beef options.)

However, currently, your first ranking option is Pinterest spam FYI: https://www.mojeek.com/search?q=grass+fed+beef+restaurant+in...

Your second option is the correct kind of result (a blog from a local that actually answers the question): https://www.grassfedgirl.com/paleo-friendly-restaurants-in-s...

Most of the results on Google, by contrast, are not correct. They are mostly articles about steaks (and a few actual restaurants that serve grass-fed beef, so that is good). Actually, FYI your results don't include these restaurants, e.g. it would be nice if this Google result showed up in your results:

"SF / SOMA - Belcampo We source grass-fed and finished, pasture-raised meats directly from our own climate positive CA farms and seasonal vegetables from local farms."

ColinHayhurst · 4 years ago
Thanks, useful feedback.

As you point out #1 organic link on G is Yelp. They currently block all but G, Bing and Yahoo! - we'll get in touch with them again. https://www.yelp.com/robots.txt

Organic link #3 on G is Belcampo; we have some of that indexed so we'll take a look: https://www.mojeek.com/search?q=food+site%3Abelcampo.com

shrikant · 4 years ago
For the same reason that streaming services don't really give you the ability to filter by cast/crew or hide stuff you've already seen: to gently guide you into avenues that are more profitable for them, regardless of what you say you really want.
Joeri · 4 years ago
I’m not convinced they are targeting profitability, because that is hard to connect back to individual user action. They target what is easy to measure at the user level, probably engagement, and operate on the assumption that higher engagement means higher profit.
x3ro · 4 years ago
Why do you think it's difficult for Google to measure which clicks to which websites are profitable for them? Maybe I'm misunderstanding, but I would expect that this would be one of the primary things Google tries to infer about you.
btgeekboy · 4 years ago
I just tried searching Netflix, HBO, and Prime for Bruce Willis; all showed me his movies. Is that not the filtering you’re expecting?
lostcolony · 4 years ago
That's a fuzzy search. If I search on Netflix for "Bruce Willis", the first ten or so items are movies with him. Then there are miscellaneous thrillers and whatnot; unless I look at the details I don't know that they don't have Bruce Willis in them.

A filter gives me a binary outcome; it either has Bruce Willis, or it doesn't.

All of which is sort of the OP's point, I think; the search engine is fuzzy because it wants to show you many things for profit reasons. I don't think it's geared toward maximizing profitability at the expense of what you're looking for, exactly, so much as engagement as a proxy for profitability. The more you use the service, the stickier you likely are as a paid subscriber, so they'll happily shovel things that don't match the search (but are similar!...and that quickly jump the shark into being quite dissimilar, but hey, maybe!) to try and keep your eyeballs.

A better demonstration of this handling is searching for movie titles they don't have. "The Princess Bride" on Netflix, for instance, doesn't even tell you "We don't have that", but instead "Explore titles related to: The Princess Bride". And while the first suggestion, The Neverending Story, is kinda a fair suggestion, some slightly later ones, like Top Gun and Zoolander, feel like a stretch.

klibertp · 4 years ago
I suspect the GP thought about something more complex, like "a movie directed by X or Y between year XXXX and YYYY starring Z but not V in the genre ... with tags ... but excluding ones tagged with ..." You can probably do this on IMDB or something similar, but then you can't watch it there.

I'm largely guessing - I don't watch movies, but I read a lot of manga online, and the only site that allows similar queries and still has UX a bit better than the late '90s (i.e. mangaupdates.com) is a fan-made (i.e. pirated) one with porn/hentai...

rg111 · 4 years ago
Prime Video has an option where I can hide specific movies and seasons of series.

I use this feature a lot, and it works.

shrikant · 4 years ago
Wow TIL -- I'm going to try this out and see how well it works between a browser and my Roku, thanks!
krono · 4 years ago
uBlock Origin static filters to the rescue!

Block results from specific domains on Google or DDG:

    google.*##.g:has(a[href*="thetopsites.com"])
    duckduckgo.*##.results > div:has(a[href*="thetopsites.com"])
And it's even possible to target element content with regex with the `:has-text(/regex/)` selector.

    google.*##.g:has(*:has-text(/bye topic of noninterest/i))
    duckduckgo.*##.results > div:has(*:has-text(/bye topic of noninterest/i))
Bonus content: Ever tried getting rid of Medium's obnoxious cookie notification? Just nuke it from orbit on all domains:

    *##body>div:has(div:has-text(/To make Medium work.*Privacy Policy.*Cookie Policy/i))

samcrawford · 4 years ago
Filtering out the spam results is only half the problem. In my experience, a legitimate site's content is cloned by a spam site, and that one appears in a Google search and the legitimate one does not. The example that keeps hitting me is GitHub Issues.

Filtering out the spam only removes the clones; it doesn't get the good results back in.

3np · 4 years ago
Host a personal (potentially shared with friends) searx (for multi-engine) or whoogle (google only) instance. Filter out some domains completely, rewrite others. The rewrite part is what allows you to substitute spam clone sites for the real deal. At least searx does dedupe already.

The time spent (including maintenance) will be paid back faster than you might expect.

Optionally rewrite some sites to altfronts like nitter/scribe/piped. If you care about spending time on privacy and decoupling searches from visits, you can set up arbitrary proxying rules.

One benefit among others over browser extensions is that it's a one-time setup for all your devices and clients. All you need to do on reinstall is to change the default search engine.
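For what it's worth, the block/rewrite half of this setup is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not actual searx or whoogle configuration; the domain names and the rewrite targets are hypothetical examples:

```python
from urllib.parse import urlparse

# Hypothetical rules: drop spam clones outright, rewrite other
# hosts to alternative frontends (e.g. twitter.com -> nitter).
BLOCKED = {"thetopsites.com"}
REWRITES = {"twitter.com": "nitter.net", "medium.com": "scribe.rip"}

def filter_results(urls):
    """Drop blocked domains and rewrite others, preserving order."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host in BLOCKED:
            continue  # spam clone: remove the result entirely
        if host in REWRITES:
            url = url.replace(host, REWRITES[host], 1)  # swap in the altfront
        kept.append(url)
    return kept
```

A self-hosted meta-search instance does essentially this as a post-processing step on every results page, which is why it works across all your devices at once.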

jccalhoun · 4 years ago
It isn't even just 'clones', because so many sites will just summarize an article from somewhere else and give a link to it. Sometimes it is a game of telephone, with one site summarizing a 2nd site which is a summary of a 3rd, and so on. I want a search engine to show me the original source, not the one with the best SEO.
krono · 4 years ago
Sure, but at least it prevents you from accidentally clicking those unwanted results - something I kept doing all the time.

Either way, OP's ask was for a way to blacklist results, and I'm providing a method to accomplish exactly that. Edit: The rest is up to Google.

james2doyle · 4 years ago
Nice! I've used HTML attributes as targets as well. They aren't random so they are also easy to target. I use this one on Twitter:

    twitter.*##div[aria-label="Timeline: Trending now"]
This will hide the trending tweets box. You can see it's targeting the `aria-label` attribute on the `div` element.

Inviz · 4 years ago
well that leaves blanks, right?
krono · 4 years ago
No, these completely remove any matched results
sharps1 · 4 years ago
If I had to guess, it's a lot of data they don't want to store.

Current solution is either ublacklist, or add filters to uBlock Origin. Both are linked in the thread https://news.ycombinator.com/item?id=29546433

https://github.com/h-matsuo/uBlacklist-subscription-for-deve...

ncann · 4 years ago
Data storage can't be the reason. No matter how you implement it, the storage required is peanuts compared to things like YouTube/Drive/Google Photos.
jolmg · 4 years ago
Adding -site:baddomain.com still seems to work in both Google and DuckDuckGo. You should be able to include that in the URL template used for search so it gets added to all queries. You can build your blacklist that way. E.g.

  https://duckduckgo.com/?t=hk&iar=images&q=-site:pinterest.com+-site:flickr.com+%s
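If you want to maintain the blacklist somewhere more readable than a raw URL template, generating the query string is trivial. A minimal sketch in Python; the blacklist entries and base URL are just examples:

```python
from urllib.parse import quote_plus

# Example blacklist; each entry becomes a -site: exclusion
# prepended to every search query.
BLACKLIST = ["pinterest.com", "flickr.com"]

def build_search_url(query, base="https://duckduckgo.com/?q="):
    """Build a search URL with -site: exclusions for each blacklisted domain."""
    exclusions = " ".join(f"-site:{d}" for d in BLACKLIST)
    return base + quote_plus(f"{exclusions} {query}")
```

The output of `build_search_url("cast iron skillet")` is the same kind of URL as the template above, so it can be pasted straight into a browser's custom search engine setting (with `%s` in place of the query).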
As an aside, I haven't used Google in a while, and I find it interesting how the first page shows only like 5 results, and those at the very end. The rest of the page is widgets like "top stories" and "people also ask".

elmerfud · 4 years ago
It's a simple answer: you don't pay them for that. They run a balancing act of providing you useful results while also surreptitiously shoving trash that they actually make money off of at you, and hoping you won't notice or complain.
dharmab · 4 years ago
Kagi will be paid when it exits beta and has this feature!
jbverschoor · 4 years ago
Which means you can never use it in an incognito session without logging in
PaulHoule · 4 years ago
If the search results were perfect you'd never look at an ad.
lijogdfljk · 4 years ago
A good argument for some distributed FOSS search then, I guess. I wonder if it could ever be made to return search results reasonably fast from a large distributed index?
beamatronic · 4 years ago
Sometimes I am searching for information, and I want facts. Sometimes I am researching a purchase, and I want the sellers to compete for my purchase. Perhaps it makes sense to give search engines more context, such as, whether your intent is to spend money or not.
throw_m239339 · 4 years ago
Because ultimately there is a conflict of interest between you as a search user and Google's actual customers: advertisers, who are often spammers themselves. Someone needs to pay for Google search, and since it isn't you...

Google might have been better in the past, but since there is absolutely no serious competition whatsoever from a market perspective, Google technically doesn't really need to care about the quality of its search results anymore, only maximizing profits.