solso (u/solso) - Readit News

solso commented on Brave Search Goggles: Alter search rankings with rules and filters github.com/brave/goggles-... · Posted by u/llevert

freediver · 4 years ago

Hello Josep ! (Kagi founder here)

For a bit of historic accuracy if it ever matters for future readers:

Kagi was founded in 2019 and we operated for years in private beta with thousands of users before the public beta release this June.

Goggles were not inspired by Kagi's lenses, and I can confirm seeing the whitepaper before we got the lens feature out last year.

Kagi's Lenses were inspired by Blekko's "slashtags", which is probably the original "prior art" for this kind of feature.

Looks like we arrived at a similar idea, but with different execution. Kagi's Lens feature is a simple-to-create filter for the web that anyone can make with a few clicks, plus a bunch of powerful built-in lenses like "noncommercial" or "discussions" search.

solso · 4 years ago

Hi Vladimir! Good to see we are in sync. Glad to see Kagi doing well as well.

For the record, when I said "long before Kagi was even announced to the public." I meant exactly what I wrote, not that Kagi did not exist, it did.

solso commented on Brave Search Goggles: Alter search rankings with rules and filters github.com/brave/goggles-... · Posted by u/llevert

tacotacotaco · 4 years ago

Sounds a lot like Kagi's block/boost and lenses features. I haven't used either. Just pointing out prior art.

https://blog.kagi.com/kagi-features

solso · 4 years ago

Disclaimer: I work at Brave, previously at Tailcat.

Goggles white-paper was released more than a year ago, long before Kagi was even announced to the public.

Additionally, before Brave acquired Tailcat (Jan 2021) I had the pleasure to share the draft of the paper with Kagi's founder.

So no, there is no prior art.

Let me add that I do not claim that Goggles is prior art of Lenses either.

One of the key features of Goggles design is that the instructions, rules and filters are open and URL accessible.

A Goggle is not so much a personal preference configuration as a way to collaboratively come up with shareable and expandable search re-rankers.

Very different goals if you ask me. Of course, Goggles can be used for personal preferences exclusively, but that's not the use case we had in mind.
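To make the idea of open, shareable re-ranking rules concrete, here is a minimal sketch. The rule format (`boost`, `downrank`, `discard` followed by a domain) and the scoring factors are hypothetical and chosen only for illustration; this is not the actual Goggles syntax or ranking logic.

```python
from urllib.parse import urlparse

# Hypothetical rule format, NOT the real Goggles syntax:
# one "<action> <domain>" rule per line; lines starting with "#" are comments.
# In practice such a text file could live at a public URL so anyone can read or fork it.
RULES_TEXT = """\
# shared re-ranking rules
boost example.org
downrank example.com
discard spam.example
"""

def parse_rules(text):
    """Parse the rule text into a {domain: action} mapping."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        action, domain = line.split()
        rules[domain] = action
    return rules

def rerank(results, rules):
    """Re-rank (url, score) pairs; the multipliers are assumed values."""
    factors = {"boost": 2.0, "downrank": 0.5}
    adjusted = []
    for url, score in results:
        action = rules.get(urlparse(url).hostname)
        if action == "discard":
            continue  # drop the result entirely
        adjusted.append((score * factors.get(action, 1.0), url))
    return [url for _, url in sorted(adjusted, reverse=True)]

results = [
    ("https://example.com/a", 1.0),
    ("https://example.org/b", 0.8),
    ("https://spam.example/c", 0.9),
    ("https://example.net/d", 0.7),
]
ranked = rerank(results, parse_rules(RULES_TEXT))
```

Because the rules are plain text, sharing or extending a re-ranker amounts to copying the file and editing lines, which is the collaborative aspect described above.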

solso commented on Brave Search Goggles: Alter search rankings with rules and filters github.com/brave/goggles-... · Posted by u/llevert

agloeregrets · 4 years ago

The polarization bit is like watching a person who hit your parked car explain why the crash is really your own fault. Explanations that 'the web is too broad' or 'it takes an active choice to enable the goggle' mean nothing. It takes an active choice to watch polarizing news and those sources will tell you that you really need to only get your info from them. I would not be shocked to see major sites use this to control how people view the world.

"Use our brave search and escape the leftist google agenda!" or such.

solso · 4 years ago

An active choice is better than a passive one, if only because it requires an effort; in that respect, the explicitness is an advantage over typical personalization.

The article also mentions that Goggles will not stop polarization; it suffices that they do not exacerbate it.

No technology or system in any period of time has been able to suppress it, censorship included.

Disclaimer: I work at Brave search

solso commented on Bing contract prohibits DuckDuckGo from completely blocking Microsoft tracking twitter.com/shivan_kaul/s... · Posted by u/etamponi

ColinHayhurst · 4 years ago

Mojeek - totally independent, no-tracking; but I'm biased.

Independent take: https://seirdy.one/2021/03/10/search-engines-with-own-indexe...

solso · 4 years ago

Mojeek is one of the very few players that are building their own index; my respects.

solso commented on Bing contract prohibits DuckDuckGo from completely blocking Microsoft tracking twitter.com/shivan_kaul/s... · Posted by u/etamponi

thewebcount · 4 years ago

> The question is whether DDG would be able to operate if Bing were to shut DDG down tomorrow.

No, that doesn't appear to be the question at all. The original post appears to be an attempt to smear DDG by posting misleading information that you know will confuse users into thinking that their search engine sends PII to Microsoft when you know it doesn't. The original tweet doesn't appear to mention Bing shutting down at all. Here's the entirety of the tweet:

"This is shocking. DuckDuckGo has a search deal with Microsoft which prevents them from blocking MS trackers. And they can't talk about it! This is why privacy products that are beholden to giant corporations can never deliver true privacy; the business model just doesn't work."

I see nothing in there questioning whether DuckDuckGo will still be around if Bing goes under. I also see nothing in yegg's response above that has anything to do with this irrelevant question you mention.

solso · 4 years ago

There are plenty of comments discussing the provenance of DDG results, including the one from Gabriel himself, in the thread we have both participated in:

"it is misleading to say our results just come from Bing."

Discussing how many sources one can bring together is a distraction from discussing the degree of dependency between DDG and Bing. More so when claiming that others suffer from the same, which is factually incorrect for Brave search.

solso commented on Bing contract prohibits DuckDuckGo from completely blocking Microsoft tracking twitter.com/shivan_kaul/s... · Posted by u/etamponi

yegg · 4 years ago

First, it is misleading to say our results just come from Bing. That's far from the case in actuality. Please see https://news.ycombinator.com/item?id=31490994 for a more detailed explanation on that.

On other search engines, they all rely somewhat on either Google's or Bing's web crawling: Qwant on Bing, and Brave on Google (and on Bing for images). This is easy to see as a webmaster since you don't see their crawlers much (if at all). Only Google and Bing are doing full scale web crawls. However, search is a lot more than traditional web links -- in fact it is about half now from instant answers that can come from dozens of sources and indexes (which the above comment gets into).

solso · 4 years ago

This is Josep M. Pujol from Brave Search.

I'd like to correct some factually incorrect information regarding Brave Search.

Brave search crawls the web through the Web Discovery Project and has its own crawler, which fetches a bit more than 100M pages daily.

Brave search uses the Bing API and Google fallback for about 8% of the results shown to users; the remaining 92% are served from our own index. When we launched almost a year ago, the share of results from third parties was 13%.

There is no need to mention "multiple sources" when a number can be given. The underlying theme here is not whether DDG provides value on top of Bing; it does, and no one is questioning that. The question is whether DDG would be able to operate if Bing were to shut DDG down tomorrow.

If Bing and Google were to disappear tomorrow, for whatever reason, Brave search would continue to operate. That's the independence Brave search is building.

solso commented on Brave Search beta search.brave.com/... · Posted by u/vmullin

Seirdy · 5 years ago

That makes a bit more sense; I just read the blog posts. I'm concerned about the effects of optimizing against Google (namely, the extremely similar results); I don't think I understand the point of an alternative if it tries to replicate a competitor to this degree. The whole idea I was going for in that article was a diversity of information sources: if one engine isn't giving the results you want, try another.

Right now, users who want Google results and privacy can use a Searx instance or Startpage.

I updated the article to fix the inaccuracy. Diff: https://git.sr.ht/~seirdy/seirdy.one/commit/ddeeb36248ce5318...

Any other fact-checks are welcome.

solso · 5 years ago

You bring up a very good point on the diversity of information sources, which is something we plan to attack in the near future with open ranking [0].

In my opinion having similar results to Google will facilitate adoption. After all, Google is pretty good for many types of queries (not all), and people in general have strong habits.

The fact that we are similar with our own index is great. It means that we have the power of deviating from it when needed, as we mature/evolve.

Allow me to repurpose your statement on why not use startpage if you want Google-like results: if tomorrow Google disappears (or for some reason becomes unusable), brave search will continue to operate as normal (similar to old Google). What will happen to searx or startpage? What will happen to ddg or swisscows if the provider turning bad is Microsoft? IMHO, no matter how much reranking or how many nice features you put on top, unless you control the search results themselves, diversity can only be superficial.

Sorry for the "rant". Thanks a lot for the inputs and for updating the doc, appreciate it.

[0] https://brave.com/wp-content/uploads/2021/03/goggles.pdf

solso commented on Brave Search beta search.brave.com/... · Posted by u/vmullin

Seirdy · 5 years ago

This has not been my experience. Comparing results with Google, Startpage, and a Searx instance with only Google enabled reveals that the results are almost always from Google. Sometimes they merge multiple results that share a domain.

I decided to add them to the "Semi-Independent" category of my collection of indexing search engines: https://seirdy.one/2021/03/10/search-engines-with-own-indexe...

solso · 5 years ago

Mixing with Google results can only happen after opt-in and only in the Brave browser. You can see whether a single query has been mixed by clicking on `Info`, or check the independence metrics on the `Settings` tab.

The fact that you see results similar to Google for popular queries is a by-product of the fact that our ranking is trained using an anonymous query log. There are plenty of references to the methodology (https://0x65.dev/).

The fact that we are similar to Google on certain types of queries is good (at least from the perspective of human assessment). It's easy to find other types of queries for which we are not similar to Google. It would be rather stupid if we were to "use google" on easy-to-solve queries but not on the complicated ones, don't you think? In any case, very nice article besides a couple of misconceptions (like this one); will bookmark.

Disclaimer: work at Brave search, used to work at Cliqz

solso commented on Brave buys a search engine, promises no tracking, no profiling theregister.com/2021/03/0... · Posted by u/samizdis

ThePhysicist · 5 years ago

I'm not saying it's not anonymous, just that it's impossible to assert the anonymity.

Also, I saw a lot of "anonymous" clickstream data offered by other companies, which was often trivial to de-anonymize. We did a DEF CON 25 talk about it, just google "Dark Data DEF CON 25". Robustly anonymizing high-dimensional data like user clickstreams is practically impossible, and often knowing a combination of 4-7 websites a user regularly visits is enough to identify him/her in a pool of millions of users (see the talk for details), so I'm highly doubtful about any company that claims it can robustly anonymize such data. If you're confident your data is anonymous why not release a large sample and have researchers look at it?

So while I'm not saying Ghostery is also doing that I don't have a lot of good faith in these data collection practices in general (also, I think before Cliqz acquired Ghostery it collected a lot of data like cookies from the users). Again, it's a smart way to collect data but I wouldn't call it very privacy-friendly.

solso · 5 years ago

It is trivial to de-anonymize if records are linkable, which is the case you mention in the Dark Data DEF CON 25 talk. Another famous case was the de-anonymization of the Netflix data set.

However, you are assuming that HumanWeb data collection is record-linkable, which is not the case, precisely to avoid this attack.

If what is being collected is linkable, e.g. (user_id, url_1), ... (user_id, url_n), then no matter how you anonymize user_id, it will eventually leak. A single url containing personally identifiable information, e.g. a username, will compromise the whole session, no matter how sophisticated the user_id generation is. The real problem, privacy-wise, is the fact that records can be linked to the same origin. An attacker (or the collector) has the ability to know if two records have the same origin.

The anonymization of HumanWeb, however, ensures that linkability across data points is not present. Hence, an attacker cannot know if two records come from the same origin. As a consequence, even if one url gives away user data, for instance a username, it would not compromise all the urls sent by that person.
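The linkability argument can be illustrated with a small sketch. This is not how HumanWeb is implemented; the record layout, the pseudonym `a91f`, and the URLs are made up to contrast records that share an origin identifier with records that carry none.

```python
from collections import defaultdict

def exposed_urls(records, marker):
    """Return every URL an attacker can attribute to the person identified by `marker`.

    `records` is a list of (origin_id, url) pairs; origin_id is None when the
    record carries nothing that links it to other records.
    """
    sessions = defaultdict(list)
    exposed = []
    for origin_id, url in records:
        if origin_id is None:
            # Unlinkable record: it stands alone, so only a URL that itself
            # contains the identifying marker is exposed.
            if marker in url:
                exposed.append(url)
        else:
            sessions[origin_id].append(url)
    # Linkable records: one identifying URL compromises the whole session,
    # however opaque the pseudonymous origin_id is.
    for urls in sessions.values():
        if any(marker in u for u in urls):
            exposed.extend(urls)
    return sorted(exposed)

urls = [
    "https://example.com/profile/alice",  # leaks a username
    "https://example.org/clinic",
    "https://example.net/jobs",
]
linkable = [("a91f", u) for u in urls]    # same pseudonym on every record
unlinkable = [(None, u) for u in urls]    # no shared identifier

leak_linkable = exposed_urls(linkable, "alice")      # all three URLs
leak_unlinkable = exposed_urls(unlinkable, "alice")  # only the profile URL
```

In the linkable case the single profile URL drags the other two along with it; in the unlinkable case the damage is confined to the one record that contained the username.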

If you are interested in more details I recommend this article: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

[Disclaimer: I'm one of the authors]

solso commented on Brave buys a search engine, promises no tracking, no profiling theregister.com/2021/03/0... · Posted by u/samizdis

a254613e · 5 years ago

https://blog.mozilla.org/press-uk/2017/10/06/testing-cliqz-i...

> Users who receive a version of Firefox with Cliqz will have their browsing activity sent to Cliqz servers, including the URLs of pages they visit.

solso · 5 years ago

The chosen excerpt omits the fact that it is predicated on the HumanWeb. In the technical papers above there is a more precise description of what was collected and how. There was no user tracking, session or history being sent, as all data points are anonymous and record-unlinkable by the receiver. The vague language, required for a general audience journal, certainly does not help.