Posted by u/sanketpatrikar 4 years ago
Ask HN: Let's build an HN uBlacklist to improve our Google search results?
For the unaware, uBlacklist [0] is a browser extension that lets you blacklist sites from the google search results page. It lets you blacklist sites right from the results page, by regex, or by linking lists hosted somewhere.

The low quality of results has been a problem for a while now and has become worse lately thanks to all those StackOverflow and GitHub clones. So I was wondering if we could come together and contribute to a single blacklist hosted somewhere and then import it into each of our browsers. Who knows? We might end up improving the quality of the results we all get.

Lists to get rid of the StackOverflow and GitHub clones already exist. [1]

I would love to contribute to a project like this, but won't be able to be a maintainer due to time constraints. Would greatly appreciate it if someone could host this. A simple txt file on github would do.

What do you say, HN?

[0]: https://github.com/iorate/ublacklist [1]: https://github.com/rjaus/awesome-ublacklist
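For reference, a uBlacklist subscription is just a plain-text file with one rule per line; match patterns and slash-delimited regexes both work. A sketch of what such a file could look like (the domains here are invented examples, not real sites):

```
*://*.example-so-mirror.com/*
*://*.example-gh-clone.net/*
/^https?:\/\/(www\.)?some-snippet-farm\.(com|io)\//
```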

tyingq · 4 years ago
>become worse lately thanks to all those StackOverflow and Github clones

A google search showing some of these leech type sites:

https://www.google.com/search?q=%22code+that+protects+users+...

For me, "farath.com" is outranking stackoverflow.

Siira · 4 years ago
> farath.com was first indexed by Google more than 10 years ago

This seems pretty suspicious? Is it reporting the first time Google crawled the main domain farath.com? How is that relevant information?

judge2020 · 4 years ago
This is the first time it crawled the domain at all. It's been a website since at least 2008[0], but was recently re-registered in 2020[1].

0: https://web.archive.org/web/20080607010730/http://www.farath...

1: https://who.is/whois/farath.com

nebula8804 · 4 years ago
That's weird. I noticed no ads on this farath.com site. Are they going to monetize the email subscriptions somehow? How are they making money off of this?
endisneigh · 4 years ago
This is a great example of why "Google sucks!!11" is mainly FUD. Let's say you're looking for the SO link, which is #2 for Google. Let's compare:

Google ("code that protects users from accidentally invoking the script when they didn't intend to")

Link: https://www.google.com/search?q=%22code+that+protects+users+...

SO - #2

Bing ("code that protects users from accidentally invoking the script when they didn't intend to")

Link: https://www.bing.com/search?q=%22code+that+protects+users+fr...

SO - #2

Brave Search

Link: https://search.brave.com/search?q=%22code+that+protects+user...

SO - Not on page

You.com

Link: https://you.com/search?q=%22code%20that%20protects%20users%2...

SO - Doesn't load

DuckDuckGo:

Link: https://duckduckgo.com/?q=%22code+that+protects+users+from+a...

SO - #2 (seems to depend on refresh)

Basically they're all the same. Google is faster, but the order of the results is identical.

If you did a large scale analysis in this manner I doubt Google would lose.

tyingq · 4 years ago
I'm not sure it's a good example, really. It's an "exact phrase search" with quotes, which doesn't happen much in real life.

It was helpful solely to show what some of these leech sites are.

Searching for (without quotes): What does if __name__ == "__main__": do?

Is probably a better test of which search engine has better results for the real-life query. Google might still win, but it should do a better job of screening out the spammy sites. It used to be better at this.

Terry_Roll · 4 years ago
I have noticed that Google and Bing seem to present results which link to sites like stackoverflow.com where the questions and solutions are absolute FUD.

I think someone, or some entity, has been engaged in a concerted effort to manipulate the results, if it's not something more nefarious within Google and Bing's own domain.

Very few entities have the resources to do this either; it's not something a ragtag band of goat herders could pull off, that's for sure!

code2life · 4 years ago
In my experience, the you.com apps and overall search results aren't affected by SEO the same way that some of the other engines are, which is why I think their results work for me


tut-urut-utut · 4 years ago
Just tried your search in both Google and Duck Duck Go. On Google first page spam copies are ~80% of the links, on DDG maybe 40%. Not good, but much better than Google.
ahurmazda · 4 years ago
I tried you.com[1]. The first few results seem quite relevant. Best part is that you can actually personalize the weights to assign to your search (your very own bubble)

https://you.com/search?q=code%20that%20protects%20users%20fr...

tobyjsullivan · 4 years ago
This isn't the same search. The parent post had quotes around the phrase. You.com returns identical copy-cat results if you do the same search.

To be fair, not sure what other results we'd expect if we're going to search for a specific, plagiarized phrase.

Edit: actually, upon review, you.com does indeed give one extra useful result within the top three. So one point to Gryffindor.

ffhhj · 4 years ago
I saw you.com displays some Code Complete snippets, but the lines are too short and don't get language highlighting, which makes them harder to read. Nice try anyway.
darekkay · 4 years ago
uBlock Origin supports blocking search results, so I don't require an additional browser extension. I maintain a blocklist for myself, targeting Google and DuckDuckGo [1]. Feel free to contribute more websites or use this list as a template for your own repository.

[1] https://github.com/darekkay/config-files/blob/master/adblock...
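For anyone curious how that works: uBlock Origin can hide search results with procedural cosmetic filters that match a result container by the link it holds. A rough sketch (the `.g` class matches Google's result markup at the time of writing and tends to change; the DuckDuckGo `article` selector and the blocked domain are illustrative assumptions):

```
google.*##.g:has(a[href*="example-spam-site.com"])
duckduckgo.com##article:has(a[href*="example-spam-site.com"])
```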

dorianmariefr · 4 years ago
Blocking w3schools: I wasn't sure at first, but I think you are right, MDN is just much better.
jhchabran · 4 years ago
That's an ambitious goal; I'm not sure how that would be maintainable in the long run.

On a much smaller scale, if anyone is interested, I maintain a blacklist focused on those code-snippet content farms that get in the way when you're searching for some error message or particular function: https://github.com/jhchabran/code-search-blacklist.

nixcraft · 4 years ago
May I know why my domain (cyberciti.biz) was added to that list? I created my site back in 2000, when there was no StackOverflow or anything. So much for creating original content and then getting labelled as a spammer. In fact, some of the top answers on StackOverflow were copied from my work without giving any credit to me. Some people do give credit tho. But go ahead, block a site that actual humans have maintained for 20+ years. Also check my About[1] and Twitter[2] pages. There is no scraping or spamming on my part.

[1]https://www.cyberciti.biz/tips/about-us [2]https://twitter.com/nixcraft

mikevin · 4 years ago
Interesting, I have your site on my mental blocklist as one of those scrape and rehost sites.

I'll be honest, I don't remember how I came to that conclusion but I suspect I encountered an unsatisfactory answer to a question I was looking to answer, saw the .biz and drew my conclusions.

The noise to signal ratio for most of my queries is so high that I have to start judging a book by its title, not even its cover.

Karsteski · 4 years ago
I've noticed cyberciti.biz showing up in my DDG search results but I've always ignored it because of the initial captcha. I will try it now that I've seen your post here!

The .biz definitely does not help, since it hints to me that it's just another one of those worthless reposting sites, as someone else commented below.

travisporter · 4 years ago
Not OP but dot biz is associated with spam in my head for what it’s worth
zxexz · 4 years ago
cyberciti.biz is one of the few sites that come up in Google search results for anything code/linux related that has valuable content. I do wonder why someone would block it.
pbowyer · 4 years ago
I wanted to stop by and say thanks for cyberciti.biz! I've been using it since 2001-2002 when I got my first Verio Freebsd VPS and had to figure out what was going on.

When I see your site pop up in my search results I know the content is going to be more reliable than most of the others. Thanks for the effort you've put into it.

burnished · 4 years ago
At first scan your site looks like one of those automated scrape and republish sites. I'm curious what got you on that blacklist (misspelling? bad first impression? automated tool gone awry?) though.

Glad you said something though, I wouldn't have looked at it twice without a human attestation.

jodrellblank · 4 years ago
I'm always happy to see your site in search results, it's one I recognise and trust for CentOS/Linux related information for years. Thank you!
endisneigh · 4 years ago
Your comment is exactly why spam prevention is difficult. Sorry for that.
PinkSheep · 4 years ago
As a user I think I've put your website on my mental "avoid it" list for its design. I've opened a page now and I feel like I'm instantly in a tunnel vision mode. For UX: it's not a pleasure to scroll up & down; maybe there's also a psychological element about the main content area being so slim in width.

The other comment made me remember there was captcha too, right? I had been using my own rented server as a VPN for all my internet access. But I'd have never blocked it for a public list - I've read the 'about me' page.

ffhhj · 4 years ago
> some of the top answers on StackOverflow were copied from my work without giving any credit to me

That's really frustrating. I'm building a faster search engine for programming queries and just added your site cyberciti.biz as a recommended and curated source of Unix/Linux material. Hope more devs get aware of your work and you (and your collaborators) receive the credits deserved. Thanks for your work of many years.

nitrogen · 4 years ago
What CDN do you use? I was immediately asked to solve a captcha from my phone.
sanketpatrikar · 4 years ago
It's worth a try! Also, thanks for maintaining those lists!
anigbrowl · 4 years ago
Well there's only one way to find out
ZeroGravitas · 4 years ago
Isn't this Google's job? Are developers a small but lucrative target and so the suits at Google don't see the benefit of improving that experience by cleaning up the spam?

Can we just nudge them to do so under the threat of an influential minority leaving due to their use case being affected?

asdfasgasdgasdg · 4 years ago
Is there a word for this tendency to say, "it's someone else's job" as a justification for doing nothing at all to help or improve one's own circumstances? I see more and more of it in the public discourse over the last years and it kind of bothers me. I see it a lot in conversations related to poverty or climate change, but it is as we see here by no means exclusive to those topics.

To the original replyer: you could wait for Google to do something, but if they were going to fix the listicle issue, and it were fixable on their end, they'd probably have done it by now. I'm disappointed in the situation too but if there is a workable solution on our end it would be silly to ignore it because fixing the problem is someone else's job.

To the OP: I worry that the number of domains pumping out crap might be far greater than we know, and that might hamper the effectiveness of this. If the collaborative block list ever got big enough you might also have to deal with spam. But I think it would be a great thing to try. This is one of those issues that annoys me, but it's just below my action potential threshold. My biggest objection right now is the spammy recipe websites.

sanketpatrikar · 4 years ago
I suggest this because there can only be so many websites that use SEO to game their way to the top and bury the good results beneath them.

If we manage to block them, we might be able to get a results page with good sites upfront and the other meaningless content below it. I assume Google will also surface good content along with the bad, so our blacklist might enable the good stuff to reach the top.

The spam problem I'm not sure about yet, but we might either be able to block enough of it to be satisfied, or it won't pose a problem for most of the searches that currently give bad results.

ZeroGravitas · 4 years ago
I like to think of it as solving the problem in the right place.

It’s often possible to work around issues in lower layers, but it's usually at least worth raising it upstream to get it fixed 'properly'.

It'll help me when I don't have a blocklist active, and it'll help new programmers who aren't familiar with the problem. It'll reward good sites with extra traffic and discourage new spammers from entering the market.

In the worst case, if Google really can't or won't address the issue, understanding the upstream problem more fully can help make a better workaround.

andyjohnson0 · 4 years ago
> I worry that the number of domains pumping out crap might be far greater than we know, and that might hamper the effectiveness of this.

I'm sure you're right about the number of spam domains, but Pareto suggests that blocking even a small percentage of them might provide a large gain.

https://en.wikipedia.org/wiki/Pareto_principle

sanketpatrikar · 4 years ago
I ended up creating a repo with blacklist.txt myself and will add to it for my own usage. I don't see anyone else who'd maintain this. Feel free to use it / contribute to it.

https://github.com/sanketpatrikar/hn-search-blacklist
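If anyone wants to fold a personal list into this one, merging is trivial. A minimal Python sketch (the file contents are inlined here for illustration; lines starting with "#" are treated as comments, which is this script's own convention):

```python
# Merge several blocklist files into one deduplicated list,
# keeping first-seen order so hand-curated ordering survives.

def merge_blocklists(*texts):
    """Combine blocklist file contents, dropping blank lines,
    '#' comments, and duplicate rules."""
    seen = set()
    merged = []
    for text in texts:
        for line in text.splitlines():
            rule = line.strip()
            if not rule or rule.startswith("#"):
                continue  # skip blanks and comments
            if rule not in seen:
                seen.add(rule)
                merged.append(rule)
    return merged

if __name__ == "__main__":
    mine = "*://*.clone-a.com/*\n\n# mirrors\n*://*.clone-b.net/*\n"
    theirs = "*://*.clone-b.net/*\n*://*.clone-c.org/*\n"
    print("\n".join(merge_blocklists(mine, theirs)))
```

Running it over the two inlined lists prints three rules: clone-a, clone-b, and clone-c, with the duplicate clone-b entry dropped.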

renewiltord · 4 years ago
Internal v External Loci of Control perhaps?
germandiago · 4 years ago
> Is there a word for this tendency to say, "it's someone else's job"

At the risk of sounding pretentious, I call it "socialist", since such people spend their lives telling others what to do, or what is good for the rest of us, but rarely do anything about it themselves. Surprisingly, this is the group that is really worried about poverty and climate change, yet they do about as much as I do for it, with the difference that I do it myself, the few times I do it, without requiring the rest to do it.

It is always someone else who will do it. Though the other day I had a conversation with a non-socialist person who had that same "others should do it" attitude. I really dislike that attitude, no matter where it comes from.

Point at hand: when I want or promote something, I am the first one to do it, whether others do it or not. The rest, no matter the ideology, is all b*llshit.

As imperfect as I am, I try to do what I think is good (and sometimes my imperfection prevents me from doing it), but I do not spend my life telling other people why they are worse than me and what they should or should not do. The most I have for someone is good suggestions, never requirements.

nottorp · 4 years ago
> Isn't this Google's job?

Have you searched anything on Google lately? The answer is "no". Their new job seems to be to stuff your results with anything even remotely related (and sometimes related in a way that only machine learning can see) so you have things to click on.

Edit: with the lone exception of "find me this business nearby".

mikevin · 4 years ago
It's very obvious Google is no longer the equivalent of grepping the web. There's some ML/NLP interpretation layer that's rewarded for picking whichever substring interpretation returns the most, and highest-ranking, results.

It's very noticeable if your search contains a short keyword that has to be interpreted in the context of the other keywords. As an example, if I search for 'ARM assembly' plus another keyword (macro, syntax, etc.), it will see that 'ARM assembly' without the extra keyword has way more high-ranking results, and happily show me how much it knows about armchairs that don't require assembly, ignoring the fact that the extra keywords are there specifically to limit the search results.

It's tiring. A lot of the time I previously spent browsing the limited but valuable results, I now have to spend mangling the keywords enough to outsmart their ML/NLP interpretation and get it to admit I am actually asking for the thing I am asking for, so I can finally get to the part where I solve the modern captcha: click all the results that are:

1. Not stolen/rehosted

2. Not a "Hello World" level Medium blog

3. Written by an actual human

goodlinks · 4 years ago
If I search for a business name, it's normally not the first result any more. Usually there's an advert for another company in the same space that I don't want to use (basically an offensive/scammy result from the user's perspective), and then also the standard "buy search term on Amazon".
ziggus · 4 years ago
I think you're wildly overestimating the influence of the minority within HN (or other similar communities) that actually care enough to switch to another search engine.

This reminds me of the Linux gamers who claim that they can influence game development companies by purchasing games with Linux ports, but wind up being less than 0.5% of sales of most games with Linux ports, which leads manufacturers to ignore that customer base almost completely.

fileeditview · 4 years ago
Not disagreeing with you: big companies do mostly ignore Linux, but there are more than a few indie devs who support Linux as a platform. And I tend to play only indie games these days anyway, because all the big commercial games have been reduced to some kind of click-and-succeed or free-to-play-and-milk-some-whales crap.

I personally am kinda happy where Linux gaming has come to be. Sure it could always be better but I remember times where there were only like 3 games for Linux and you had to compile them yourself..

yellowsir · 4 years ago
Game companies might have ignored us, but in the end it created a space for Valve and CodeWeavers to fill.
alangibson · 4 years ago
> cleaning up the spam

I'll be happy to be proven wrong, but I think Google is now fully in the 'optimize for engagement' camp. If that's what they're doing, it's by definition not spam (from their point of view) if people are clicking on it more than the non-spam results.

Again, only my guess as to what's going on. I don't see another good explanation for them only serving cloned Stackoverflow and top X lists for basically everything now.

onionisafruit · 4 years ago
From a user point of view, a search engine’s job is to link you away from the search engine, so how does a search engine measure engagement? Is it time on page or maybe number of searches performed? When you don’t have viable competitors both of those are improved by worse search results. Even number of ads clicked would be improved with worse search results because ads don’t have as much competition for your attention when there are no relevant results on the page.
sanketpatrikar · 4 years ago
It is Google's job, but they either aren't doing it or are failing at it. We could do something about it at least until a better alternative or a solution appears.

> Can we just nudge them to do so under the threat of an influential minority leaving due to their use case being affected?

Many influential people have tried and nothing seems to have transpired from it.

Google.com is the most popular website. I don't think the leaving of any minority group we manage to create would even matter to Google, let alone force them to fix the issue. Not that I discourage using alternatives.

anigbrowl · 4 years ago
Can we just nudge them to do so under the threat of an influential minority leaving

No. This is a classic mistake of intellectual types, who are impressed by each other's cogent arguments. But there is a much wider pool of people who are not, and among whom the intellectual types actually have very little influence, due to being boring and hard to understand (plus, it has to be said, kind of snobbish about how smart they are).

Now, you might reason that Google is full of smart people who should care about cogent arguments. But that assumes, as an unspoken premise, that Google's internal goal is to maximize the quality of the service and profit from being The Best. They passed that goal years ago and are now so awash in money that it's cheaper to just squash competition than to innovate. They can be moved by threats to advertising revenue got up by angry crowds on social media (a market where they have little direct power), but Google would probably be delighted if grumpy nerds wandered off somewhere else. If they need talent or access to some compelling technology, they can just throw a pile of cash at the problem.

BuyMyBitcoins · 4 years ago
>” Can we just nudge them to do so under the threat of an influential minority leaving due to their use case being affected?”

I sense Google is too big to cater to us like this. Despite a steady decline in quality, Google is still the dominant search engine and the competition isn’t even close to its market share. Not only would they not notice many of “us” leaving, the amount of change they would have to implement in order to satisfy our desires would end up changing the product for the rest of the market. On some level, the product managers must be satisfied with the metrics as they stand since Google is continuing with their current course.

tjpnz · 4 years ago
>Isn't this Google's job?

Or more fundamentally perhaps this is just the system working as Google intended?

ineedasername · 4 years ago
Google's goal isn't to create the best possible search engine. It's to have a search engine that is good enough that people won't actively seek an alternative, while putting as much ad content in it as possible, again stopping just short of the point where people would seek an alternative.

I doubt many advertisers like the status quo very much either. They basically have to pay for ad placement to ensure the first results for their product aren't ads for competing products. On mobile when I search for Boox the first result linking to them is an ad. Same for Kobo. In other instances I'll search for company or product and a competitor ad is the first to show. So vendors get stuck paying for ads when their own site should probably be the first organic result, above the ads.

reaperducer · 4 years ago
Are developers a small but lucrative target and so the suits at Google don't see the benefit of improving that experience by cleaning up the spam?

Google doesn't make money from people finding what they're searching for. Google makes money by keeping people searching.

tut-urut-utut · 4 years ago
Instead of spending energy to change Google, why not just leave them for good?

Start by changing your default search engine to DuckDuckGo or something else, install uBlock Origin and Privacy Badger to disable tracking, and gradually reduce your use of every Google service and application, starting with Chrome.

Be the change you want to see.

sanketpatrikar · 4 years ago
I relate to this opinion. There are two reasons why my suggestion might still be useful:

1. DuckDuckGo too is affected by these SEO-gaming sites, so maintaining a blacklist will help us make that experience better too.

2. There are times when only Google can find us what we're looking for, so this will prove useful when we go back to it.

moneywoes · 4 years ago
No, google wants more clicks so they would prefer poor results that keep users searching
PragmaticPulp · 4 years ago
I think the disconnect comes from people expecting perfect search results as curated by humans, whereas Google necessarily must optimize for automated results. Automated results will never be perfect.
omnicognate · 4 years ago
Are you paying them to do it?
MarcelOlsz · 4 years ago
Worst case scenario if Google drops the ball I just go back to the library.
jstx1 · 4 years ago
I'm sure they have great books on stackoverflow answers, reddit reviews of products and opening times of local stores.
beepbooptheory · 4 years ago
Google has a fiduciary responsibility to shareholders, which is so much work as it is! Why are you trying to ask them to do more?
hooande · 4 years ago
The problem with this is illustrated in another comment where nixcraft's site, cyberciti.biz, was added to a personal block list. The content on the site does seem to be original and productive. I'd guess it was added based on the criteria of "I haven't heard of this site and the domain looks suspicious". I have a feeling that this will be true for other domains on this proposed master list. And the owners of those domains will have no recourse.

Specifically blocking github clones seems doable. Adding anything else needs equally specific criteria or it will quickly become subjective and unfair.

littlecranky67 · 4 years ago
I wonder why Apple is not starting its own search engine. Yes, they get >$1Bn per year for making Google the default on iOS+macOS, but they have plenty of cash, so they don't need it. They would immediately get ~10% market share at launch, just because it would be made the default on their devices. From there they just need to present better search results than Google (which shouldn't be that hard right now) and can only grow further.

As another commenter here said "Google does not make money by helping you find what you are searching, it makes money by keeping you searching". That only works when there is no competition. But once Apple would be in the game, people would use what presents them with the better results. Right now, I don't feel there is real competition.

ericbarrett · 4 years ago
Apple is allegedly paid lots of money to not do this: https://www.macrumors.com/2021/08/27/google-could-pay-apple-...
littlecranky67 · 4 years ago
Wow, $20Bn per year. This smells a lot like anti-competitive behavior; I wonder what happened to that lawsuit.
lgats · 4 years ago
Apple already has its own search engine. The crawler is known as AppleBot, and the results power Siri search suggestions.

It's limited to popular queries, so for many searches you may get "no results, search the web (Google)".

I made a somewhat buggy web front end for Siri search so I could better play around with the results: https://luke.lol/search/

achtung82 · 4 years ago
But what would be their incentive to do so? Normally they launch products and make it exclusive to their devices so more people will buy iPhones, but that is difficult to do with a search engine. Otherwise they would have to get into the ad business like Google.
littlecranky67 · 4 years ago
Apple is a publicly traded company, and every company needs to grow into new markets to make more revenue. And they also maintain their own browser Safari, even though on macOS they could just withdraw from market and leave the field to Chrome and Firefox. Even amongst macOS users Safari usage is very low and doesn't make Apple any money.

On the other hand you can see how Google is using its dominance in Search to push its browser and mobile OS - once you login to Google in Chrome on your phone, suddenly they can track you when you use their mobile Apps etc. And Apple is trying hard to grow in the "Services" field, i.e. through Apple Music and Apple TV - both available to Windows and Android users too. Just as they made a buttload of money with iTunes and the iPod because they also targeted Windows users.

paxys · 4 years ago
Running a search engine is a massive money sink, regardless of its popularity. It's the surrounding ad network which makes money. Competing with Google and Facebook in that regard is an impossible battle, and something Apple has already failed at a couple times now. They have since pivoted into creating a privacy friendly image, so emulating Google simply does not make sense for them.
LinuxBender · 4 years ago
This is just my own personal preference, but I manage my own list of what is blocked or allowed on my systems. I would be concerned that a group-contributed list for this category of blocking could quickly devolve into group-think censorship dominated by whoever is most devoted to blocking and extending echo bubbles into people's browsers.
hayesall · 4 years ago
Seeing "how other people configure their tools" can be interesting. I love seeing how people configure their .bashrc with custom commands.

I don't think I'd want to download a list of the most blocked sites and plug it into one of my tools though, for some of the reasons you mentioned.

throwawayboise · 4 years ago
That, or it would be gamed by the SEO folks like they do every other thing that was once good.
fsflover · 4 years ago
This looks like a big, time-consuming project that would rely on a private Google API that can change at any time. I don't think it's worth investing your effort in it. I wish more people would instead help improve the FLOSS, peer-to-peer search engine YaCy: https://yacy.net.
TrueDuality · 4 years ago
This improves other search engines as well, not just the Google universe. I'm sure even an open-source, peer-to-peer search engine will have similar problems with content-farm pages and gamed rankings if it grows large enough to compare with search engines like DuckDuckGo.

On the other hand, it is absolutely ridiculous to conflate the difficulty of occasionally adding a domain to a local filter with helping to build a random unproven search engine. People volunteer their development effort for projects they personally find interesting or challenging. If you want more developers, advocate for the project; don't try to scold people for wanting to spend a small amount of their time refining a solution that works for them.

upbeat_general · 4 years ago
I’m not sure why you think that a domain blocklist would be harder than custom search engine development.

Plus there’s no private Google API here, just an extension that removes search results from the page. I suppose you could say the extension APIs are from Google (Chromium) but they’re certainly not private and are commonly used.

fsflover · 4 years ago
Doesn't this extension depend on how exactly the ads are presented on the page? Can't this be changed by Google easily?

> I’m not sure why you think that a domain blocklist would be harder than custom search engine development.

I didn't say this. The custom search engine has already been created; helping its development is much easier now. AFAIK its main problem is the lack of hosted servers.