If we were to design a brand new internet for today's world, could we design it in such a way that:
1- Finding information is trivial
2- You don't need services indexing billions of pages to find any relevant document
On our current internet, we need a big brother like Google or Bing to find relevant information effectively, in exchange for sharing our search history, browsing habits, etc. Can we design a hypothetical alternate internet where search engines are not required?
Indexing isn't the source of problems. You can index in an objective manner. A new architecture for the web doesn't need to eliminate indexing.
Ranking is where it gets controversial. When you rank, you pick winners and losers. Hopefully based on some useful metric, but the devil is in the details on that.
The thing is, I don't think you can eliminate ranking. Whatever kind of site(s) you're seeking, you are starting with some information that identifies the set of sites that might be what you're looking for. That set might contain 10,000 sites, so you need a way to push the "best" ones to the top of the list.
Even if you go with a different model than keywords, you still need ranking. Suppose you create a browsable hierarchy of categories instead. Within each category, there are still going to be multiple sites.
So it seems to me the key issue isn't ranking and indexing, it's who controls the ranking and how it's defined. Any improved system is going to need an answer for how to do it.
* Indexing is expensive. If there's a shared public index, that'd make it a lot easier for people to try new ranking algorithms. Maybe the index can be built into the way the new internet works, like DNS or routing, so the cost is shared.
* How fast a ranking algorithm is depends on how the indexing is done. Is there some common set of features we could agree on that we'd want to build the shared index on? Any ranking that wants something not in the public index would need either a private index or a slow sequential crawl. Sometimes you could do a rough search using the public index and then re-rank by crawling the top N, so maybe the public index just needs to be good enough that some ranker can get the best result within the top 1000.
* Maybe the indexing servers execute the ranking algorithm? (An equation or SQL-like thing, not something written in a Turing-complete language; see the sketch after this list.) Then they might be able to examine the query to figure out where else in the network to look, or where to give up because the score will be too low.
* Maybe the way things are organized and indexed is influenced by the ranking algorithms used. If indexing servers are constantly receiving queries that split a certain way, they can cache / index / shard on that. This might make deciding what goes into a shared index easier.
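To make that concrete, here is a toy sketch of what a shared index accepting a declarative, non-Turing-complete ranking spec might look like. The feature names, the in-memory "index", and the API are all invented for illustration, not a proposal for the actual protocol.

```python
# Toy sketch: a shared index that accepts a declarative ranking spec
# (just feature weights, not arbitrary code). All names are hypothetical.

SHARED_INDEX = {
    # url -> precomputed, publicly agreed-upon features
    "example.org/a": {"term_match": 0.9, "inbound_links": 120, "page_speed": 0.8},
    "example.org/b": {"term_match": 0.7, "inbound_links": 4500, "page_speed": 0.3},
}

def rank(query_weights: dict[str, float], top_n: int = 10) -> list[str]:
    """Score every indexed page with a client-supplied weighted sum."""
    def score(features: dict[str, float]) -> float:
        return sum(weight * features.get(name, 0.0)
                   for name, weight in query_weights.items())
    return sorted(SHARED_INDEX, key=lambda url: score(SHARED_INDEX[url]),
                  reverse=True)[:top_n]

# A client that distrusts raw popularity could send weights like this:
print(rank({"term_match": 1.0, "page_speed": 0.5, "inbound_links": 0.0001}))
```

Because the spec is just weights over agreed features, the index server can inspect it, cache partial results, and refuse queries it can't serve, without running arbitrary client code.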
But what are you storing in your index? The content considered by your ranking will vary wildly with your ranking method. (For example, early indexes cared only about the presence of words. Then we started to care about word counts, then about the relationships between words and their context, then about whether a site was scammy or slow.)
The only way to store an index of all content (to cover all the options) is to...store the internet.
I'm not trying to be negative - I feel very poorly served by the rankings that are out there, as on 99% of issues I'm in the long tail rather than what they target. But I can't see how a "shared index" would be practical for all the kinds of ranking algorithms, both present and future.
I want to rank my results by what is most popular with my friends (on Facebook or otherwise), so I just look for a search engine extension that lets me do that. This could get complex, but it can also stay simple if novices just use the most popular ranking algorithms.
One thing I haven't seen much in these recent threads on search is the ability to create your own Google Custom Search Engine based on domains you trust - https://cse.google.com/cse/all
Also, not many people have mentioned the use of search operators, which let you control the results returned, such as "Paul Graham inurl:interview -site:ycombinator.com -site:techcrunch.com"
===Edit=== I meant to say that you, as the user, would gain control over the ranking sources; the company operating this search service would perform the aggregation and effectively operate a marketplace of ranking providers. ===end edit===
For example, one could be an index of "canonical" sites for a given search term, such that it would return an extremely high ranking for "news.ycombinator.com" if someone searches for "hacker news". Layer on a "fraud" ranking built from lists of sites and pages known for fraud, a basic old-school PageRank (simply order by link credit), and some other filters. You could compose the global ranking dynamically from weighted averages of the different ranked sets, and drill down to see what individual ones recommended.
Seems hard to crunch in real time, but not sure. It'd certainly be nicer to have different orgs competing to maintain focused lists, rather than a gargantuan behemoth that doesn't have to respond to anyone.
Maybe you could even channel ad or subscription revenue from the aggregator to the ranking agencies based on which results the user appeared to find the best.
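As a rough illustration of composing a global ranking from independent providers, something like the following might do it; the provider names, scores, and weights are all made up.

```python
# Hypothetical sketch: compose a global ranking from several independent
# ranking providers, each returning its own score per URL, combined with
# user-chosen weights.

providers = {
    "canonical": {"news.ycombinator.com": 1.0},
    "fraud":     {"totally-legit-prizes.example": -1.0},
    "linkrank":  {"news.ycombinator.com": 0.8, "example.com": 0.4},
}

weights = {"canonical": 3.0, "fraud": 5.0, "linkrank": 1.0}

def compose(candidates: list[str]) -> list[str]:
    """Order candidate URLs by the weighted sum of provider scores."""
    def total(url: str) -> float:
        return sum(weights[name] * scores.get(url, 0.0)
                   for name, scores in providers.items())
    return sorted(candidates, key=total, reverse=True)

print(compose(["example.com", "news.ycombinator.com",
               "totally-legit-prizes.example"]))
```

The interesting part is that the weights live with the user (or their chosen aggregator), so swapping out a ranking agency is just a config change rather than switching search engines.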
...To which people responded with various schemes for fair ranking systems.
...To which people observed that someone will always try to game the ranking systems.
Yep! So long as somebody stands to benefit (profit) from artificially high rankings, they'll aim for that, and try to break the system. Those with more resources will be better able to game the system, and gain more resources... ad nauseam. We'd end up right where we are.
The only way to break that [feedback loop](https://duckduckgo.com/?q=thinking+in+systems+meadows) is to disassociate profit from rank.
Say it with me: we need a global, non-commercial network of networks--an internet, if you will. (Insert Al Gore reference here.)
(Note: I don't have time to read all the comments on this page before my `noprocrast` times out, so please pardon me if somebody already said this.)
---
Unrelated: I searched for "penguin exhibits in Michigan", of which we have several. It reports 880,000 results, but I can only go to page 12 (after telling it to show omitted results). Interesting...
https://www.google.com/search?q=penguin+exhibits+in+michigan
I'm old enough to remember sorting sites by new to see what new URLs were being created, and getting to the bottom of that list within a few minutes. Google and search were a natural response to that problem as the number of sites added to the internet grew exponentially... meaning we need search.
The Web is too big for a single large directory - but a network of small directories seems promising. (Supported by link-sharing sites like Pinboard and HN.)
https://en.wikipedia.org/wiki/List_of_lists_of_lists
It was wonderful to have things so carefully organized, but it took months for them to add sites. Their backlog was enormous.
Their failure to keep up is basically what pushed people to an automated approach, i.e. the search engine.
Either you find a way to make information findable in a library without an index (how?!?) or you find a novel way to make a neutral search engine - one that provides as much value as Google but whose costs are paid in a different way, so that it does not have Google's incentives.
- identify the book's theme
- measure the quality of the information
- determine authenticity / malicious content
- remember the position of the book in the colossal stacks
Then the librarian can start to refer people to books. This problem was actually present in libraries before the revolutionary Dewey Decimal System [1]. Libraries found that the disorganization caused too much reliance on librarians and made it hard to train replacements if anything happened.
The Internet just solved the problem by building a better librarian rather than building a better library. Personally I welcome any attempts to build a more organized internet. I don't think the communal book pile approach is scaling very well.
[1]: https://en.wikipedia.org/wiki/Dewey_Decimal_Classification
Let me know if I've misunderstood your comment, but to me this has already been tried.
Yahoo's founders originally tried to "organize" the internet like a good librarian. Yahoo in 1994 was originally called "Jerry and David's Guide to the World Wide Web"[0], with hierarchical directories of curated links.
However, Jerry & David noticed that Google's search results were more useful to web surfers and Yahoo was losing traffic. Therefore, in 2000 they licensed Google's search engine. Google's approach was more scaleable than Yahoo's.
I often see suggestions that the alternative to Google is curated directories, but I can't tell whether people are unaware of the early internet's history and don't know that such an idea was already tried and ultimately failed.
[0] http://static3.businessinsider.com/image/57977a3188e4a714088...
A "better library" can't be permissionless and unfiltered; Dewey Decimal System relies on the metadata being truthful, and the internet is anything but.
You can't rely on information provided by content creators; manual curation is an option but doesn't scale (see the other answer re: early Yahoo and Google).
Any attempt to create a decentralized index will need to tackle the quality metric problem.
We're talking about billions of pages; if they aren't ranked (authority is a good heuristic), filtered (de-ranked), etc., then good luck finding valuable information, because everyone is gaming the system to improve their ranking.
I think this is part of the reason you get a lot of fake news on social media. It's a constant stream of information (basically, a time dimension has been added to the ranking) that needs to be ranked, and even with humans in the loop there's no way to do this easily without filtering for noise and outright malicious content.
Maybe a personal whitelist/blacklist for domains and authors could improve things. Sort of a "web of trust", but done properly.
Not completely without search engines, but, for example, if every website were responsible for maintaining its own index, we could effectively run our own search engines after initialising "base" trusted website lists. Let's say I'm new to this "new internet" and I ask around for good websites for the information I'm interested in. My friend tells me Wikipedia is good for general information, WebMD for health queries, Stack Overflow for programming questions, and so on. I add wikipedia.org/searchindex, webmd.com/searchindex and stackoverflow.com/searchindex to my personal search engine instance, and every time I search something, these three are queried. This could be improved with a local cache, synonyms, etc. As you carry on using it, you expand your "library". Of course it would increase the workload of individual resources, but it has the potential to give that web 1.0 feel once again.
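A minimal sketch of what that personal search engine could look like, assuming each trusted site exposed a hypothetical /searchindex endpoint returning JSON results with a url and score (none of which exists today):

```python
# Sketch of the "personal search engine" described above: query a handful
# of trusted sites' own (hypothetical) /searchindex endpoints and merge
# the results locally. The endpoint and its JSON shape are assumptions.

import json
from urllib.parse import quote
from urllib.request import urlopen

TRUSTED_SITES = [
    "https://en.wikipedia.org/searchindex",
    "https://www.webmd.com/searchindex",
    "https://stackoverflow.com/searchindex",
]

def personal_search(query: str) -> list[dict]:
    results = []
    for endpoint in TRUSTED_SITES:
        try:
            with urlopen(f"{endpoint}?q={quote(query)}", timeout=5) as resp:
                # Assume each site returns [{"url": ..., "score": ...}, ...]
                results.extend(json.load(resp))
        except (OSError, ValueError):
            continue  # skip sites that are down or don't expose an index
    # Merge by each site's self-reported score; a real client would
    # normalise scores across sites and cache responses locally.
    return sorted(results, key=lambda r: r.get("score", 0), reverse=True)
```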
The problem isn't solvable without a good AI content scraper.
The scraper/indexer either has to be centralised - an international resource run independently of countries, corporations, and paid interest groups - or it has to be an impossible-to-game distributed resource.
The former is hugely challenging politically, because the org would effectively have editorial control over online content, and there would be huge fights over neutrality and censorship.
(This is more or less where we are now with Google. Ironically, given the cognitive distortions built into corporate capitalism, users today are more likely to trust a giant corporation with an agenda than a not-for-profit trying to run independently and operate as objectively as possible.)
Distributed content analysis and indexing - let's call it a kind of auto-DNS-for-content - is even harder, because you have to create an un-hackable un-gameable network protocol to handle it.
If it isn't un-gameable, it becomes a battle of cycles, with interests that have access to more cycles able to out-index those with fewer - which would be another way to editorialise and control the results.
Short answer - yes, it's possible, but probably not with current technology, and certainly not with current politics.
Despite their incentive to make money, Google has actually been trying for years to stop people from gaming the system. It's impressive how far they've been able to come, but their efforts are thwarted at every turn thanks to the big budgets employed to get traffic to commercial websites.
Neutral in that sense only means "not serving the agenda or judgement of another", at the obvious cost of labor - and not just as a one-off thing, since the searched content often attempts to optimize for views. It isn't like a library of passive books to sort through, but a Harry Potter wizard portrait gallery full of jealous media vying for attention.
And, pedantically, it isn't truly neutral - it serves your agenda to the best of its ability. A "true neutral" would serve everyone to the best of its ability.
Besides, neutrality in a search engine is, on a literal level, oxymoronic and self-defeating - its whole function is to prioritize content in the first place.
So it's easier to have 2-4 aggregators where all the information you desire resides, even if each of them contains different forums.
A unified entry point helps adoption.
Read a cool blog post? Nobody around you will ever give a shit, because in order to do so, they'd have to read it too. Shared a photo from a vacation? It might start a conversation or two with people around you, while you receive dozens or hundreds of affirmations (in the form of likes).
I don't like to use social networks, but that's what I fall back on when I have a few minutes to spare. I rarely look at my list of articles I've saved for later — who has time for that?
b) mainstream culture > closely-knit communities (facebook > forums)
c) big-player takeovers (facebook for groups, google for search) over previously somewhat niche areas and, actually, internet infrastructure
d) if you're not a big player, you don't exist... and back to c)
If you'd like to see an experimental discovery interface for a library that goes deeper into book contents, check out https://books.archivelab.org/dateviz/ -- sorry, not very mobile friendly.
Not surprisingly, this book thingie is a big centralized service, like a web search engine.
The canonical example to me of something to exclude would be the expertsexchange site. After Stack Overflow, EE was worse than useless, and even before that it was just annoying. There are lots of sites with paywalls and other obfuscations of content, and IMHO these are the sites that should be dropped or low-ranked.
But there's no autocomplete for "Hillary Clinton is|has" (though "Donald Trump is" is also filtered). Yes, it's been heavily gamed. It's also had active meddling. And their control over YouTube seems to be even worse, with disclosed documents/video indicating they're willing to go so far as outright election manipulation - with every indication that Facebook, Pinterest, and others are going the same route.
Just because nobody's said it in this thread yet: blockchain? I never bought into the whole bitcoin buzz, but using a blockchain as an internet index could be interesting.
Maybe I misunderstand your proposal, but to me this is not technically possible. We can think of a modern search engine as a process that reduces a raw dataset of exabytes[0] into a comprehensible result of ~5000 bytes (i.e., the first page of search results rendered as HTML).
Yes, one can take a version of the movie & TV data on IMDB.com and put it on a phone (e.g. like copying the old Microsoft Cinemania CDs to smartphone storage and having a locally installed app search them), but that's not possible for a generalized dataset representing the gigantic internet.
If you don't intend for the exabytes of the search index to be stored on your smartphone, what exactly is the "on-device search agent" doing? How is it iterating through the vast dataset over a slow cellular connection?
[0] https://www.google.com/search?q="trillion"+web+pages+exabyte...
We already have the means to execute arbitrary code (JS) or specific database queries (SQL) on remote hosts. It's not inconceivable, to me, that my device "knowing me" could consist of building up a local database of the types of things that I want to see, and when I ask it to do a new search, it can assemble a small program which it sends to a distributed system (which hosts the actual index), runs a sophisticated and customized query program there, securely and anonymously (I hope), and then sends back the results.
Google's index isn't architected to be used that way, but I would love it if someone did build such a system.
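As a sketch of that idea, and only under the assumption that such a distributed index existed and accepted declarative queries at some endpoint - everything here, from the endpoint URL to the query schema, is invented:

```python
# Sketch: the device keeps a local preference profile and compiles each
# search into a small declarative query that it sends to a hypothetical
# remote index, so the profile itself never leaves the device except as
# an anonymous, one-off query. Endpoint and schema are made up.

import json
from urllib.request import Request, urlopen

LOCAL_PROFILE = {
    "preferred_langs": ["en"],
    "boost_domains": ["*.edu"],
    "block_domains": ["pinterest.com"],
}

def compile_query(terms: str) -> dict:
    """Turn search terms plus the local profile into a declarative query."""
    return {
        "match": terms,
        "filter": {"lang": LOCAL_PROFILE["preferred_langs"],
                   "exclude_domains": LOCAL_PROFILE["block_domains"]},
        "boost": {"domains": LOCAL_PROFILE["boost_domains"], "weight": 2.0},
        "limit": 50,
    }

def search(terms: str, index_host: str = "https://index.example/query"):
    req = Request(index_host,
                  data=json.dumps(compile_query(terms)).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)
```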
I'd love to be able to configure rules like:
+2 weight for clean HTML sites with minimal Javascript
+5 weight for .edu sites
-10 weight for documents longer than 2 pages
-5 weight for wordy documents
I'd also like to increase the weight for hits on a list of known high-quality sites, either a list I maintain myself or one from an independent third party (a sketch of such rules as a local re-ranker follows below).
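A rough sketch of how those rules might run as a purely local re-ranker over whatever results an index returns; the result fields and the heuristics for "clean HTML" and "wordy" are crude stand-ins:

```python
# Local re-ranking with user-configured weight rules. The result dict
# fields (script_bytes, pages, words, sentences, domain) are assumed to
# come from some index; the heuristics are deliberately crude.

TRUSTED_DOMAINS = {"en.wikipedia.org", "arxiv.org"}  # my own curated list

def score(result: dict) -> float:
    s = 0.0
    if result.get("script_bytes", 0) < 50_000:          # minimal JavaScript
        s += 2
    if result["url"].endswith(".edu") or ".edu/" in result["url"]:
        s += 5                                           # .edu sites
    if result.get("pages", 1) > 2:                       # long documents
        s -= 10
    words, sentences = result.get("words", 0), max(result.get("sentences", 1), 1)
    if words / sentences > 30:                           # wordy documents
        s -= 5
    if result.get("domain") in TRUSTED_DOMAINS:          # known high quality
        s += 3
    return s

def rerank(results: list[dict]) -> list[dict]:
    return sorted(results, key=score, reverse=True)
```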
Once upon a time I tried to use Google's custom search engine builder with only hand-curated, high-quality sites as my main search engine. It was too much trouble to be practical, but I think that could change with an actual tool.
Apple uses local differential privacy to help protect the privacy of user activity in a given time period, while still gaining insight that improves the intelligence and usability of such features as:
- QuickType suggestions
- Emoji suggestions
- Lookup Hints
- Safari Energy Draining Domains
- Safari Autoplay Intent Detection (macOS High Sierra)
- Safari Crashing Domains (iOS 11)
- Health Type Usage (iOS 10.2)
Found via Google...
What if this new internet, instead of using URIs based on ownership (domains that belong to someone), relied on topics?
For example:
netv2://speakers/reviews/BW
netv2://news/anti-trump
netv2://news/pro-trump
netv2://computer/engineering/react/i-like-it
netv2://computer/engineering/electron/i-dont-like-it
A publisher of a webpage (same HTML/HTTP) would push their content to these new domains(?), and people could easily access a list of resources (pub/sub-like). Advertisements drive the internet nowadays, so to keep everyone happy, what if netv2 were neutral but web browsers were not (which is the case now anyway)? You can imagine that some browsers would prioritise certain entries in a given topic, while others would be neutral, making it harder to retrieve the data you want.
Second thought: Guess what, I'm reinventing NNTP :)
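For illustration only, a toy in-memory version of that topic-based pub/sub idea (the netv2:// scheme is of course hypothetical):

```python
# Toy sketch of the netv2 idea: content is published to topic paths rather
# than to owned domains, and readers subscribe to topic subtrees. Purely
# an in-memory illustration; "netv2://" is the commenter's hypothetical scheme.

from collections import defaultdict

TOPICS: dict[str, list[dict]] = defaultdict(list)

def publish(topic: str, url: str, title: str) -> None:
    """A publisher pushes a pointer to its (ordinary HTML) page to a topic."""
    TOPICS[topic].append({"url": url, "title": title})

def subscribe(topic_prefix: str) -> list[dict]:
    """A reader lists everything under a topic subtree."""
    return [entry for topic, entries in TOPICS.items()
            if topic.startswith(topic_prefix) for entry in entries]

publish("netv2://speakers/reviews/BW", "https://example.com/bw-review",
        "B&W 606 review")
print(subscribe("netv2://speakers"))
```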
The Internet has become synonymous with the web/HTTP protocol. The web alternatives to NNTP won out over newer versions of Usenet. New versions of IRC, UUCP, S/FTP, SMTP, etc., instead of webifying everything, would be nice. But those services are still there and fill an important niche for those not interested in seeing everything eternal-Septembered.
What if we implemented a DNS-like protocol for searching? Think of recursive DNS. Do you have "articles about pistachio-coloured USB-C chargers"? The home router says nope, the ISP says nope, Cloudflare says nope, let's scan A to Z. Eventually someone gives an answer. This of course can (must?) be cached, just like DNS. And just like DNS, it can be influenced by your not-so-neutral browser or ISP.
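A toy sketch of that recursive, cached resolution chain; the resolvers and what they know are invented:

```python
# DNS-style recursive search resolution with per-hop caching. The chain of
# resolvers (home router, ISP, public resolver) is hypothetical.

class Resolver:
    """One hop in a DNS-like chain of search resolvers."""
    def __init__(self, name, local_answers, upstream=None):
        self.name = name
        self.local_answers = local_answers  # queries this hop can answer itself
        self.upstream = upstream            # next resolver to ask
        self.cache = {}                     # cached answers, just like DNS

    def resolve(self, query):
        if query in self.cache:
            return self.cache[query]
        answer = self.local_answers.get(query)
        if answer is None and self.upstream is not None:
            answer = self.upstream.resolve(query)
        answer = answer or []               # nobody knew; a real system might scan A to Z
        self.cache[query] = answer
        return answer

public = Resolver("public", {"pistachio usb-c chargers": ["https://example.com/review"]})
isp = Resolver("isp", {}, upstream=public)
router = Resolver("home-router", {}, upstream=isp)
print(router.resolve("pistachio usb-c chargers"))
```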
For example, if a publisher has a particular pro-Trump article, they would likely want (for obvious financial reasons) to push it to both netv2://news/anti-trump and netv2://news/pro-trump. What would prevent them from doing that?
Also, a publisher of "GET RICH QUICK NOW!!!" article would want to push it to both netv2://news/anti-trump and netv2://computer/engineering/electron/i-dont-like-it topics.
You can't simply have topics; you can have communities like news/pro-trump that are willing to spend the labor required for moderation, i.e. something like Reddit. But not all content has communities willing and able to do that well.
The idea of moving to a pub/sub-like system is a good one. It makes a lot of sense for what the internet has become. It's more than simple document retrieval today.
The problem is that the amount of content and the size of the potential user base are so large that it is impossible to offer search as a free service, i.e. it has to be funded in some way. Perhaps instead of free, advertising-driven search, there would be space for a subscription-based model? Subscription-based (and advert-free) models seem to be working in other areas, e.g. TV/films and music.
Another problem though is that more and more content seems to be becoming unsearchable, e.g. behind walled gardens or inside apps.
Maybe we'll see the advent of specialised paid search engine SaaSes with authentic and independent content authors, like professional blogs.
Maybe in 2009. Today there are businesses that exist solely on Instagram, Facebook, Amazon, etc.
Almost all of my customers find me through classified advertising websites. Organic and paid search visitors to my site tend to be window shoppers.
The early Web wrestled with this; early on it was going to be directories and meta keywords. But that quickly broke down (information isn't hierarchical, and meta keywords can be gamed). Google rose up because they use a sort of reputation-based index. In between, there was a company called RealNames that tried to replace domains and search with their authoritative naming of things, but that is obviously too centralized.
But back to Google: they now promote using schema.org descriptions of pages over page text, as do other major search engines. This has tremendous implications for precise content definition (a page that is "not about fish" won't show up in a search result for fish). Google layers it with their reputation system, but these schemas are an important, open feature available to anyone to map the web more accurately.

Schema.org is based on Linked Data, whose principle is that each piece of data can be precisely "followed." Each schema definition is crafted with participation from industry and interest groups to reflect its domain. This open-world model is much more suitable to the Web than the closed world of a particular database (though some companies, like Amazon and Facebook, don't adhere to it, apparently because they would rather control their own worlds; witness Facebook's Open Graph degenerating into something purely self-serving).
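For readers who haven't seen it, a schema.org description is typically embedded in the page as JSON-LD; here's an illustrative example written as a Python dict (the property choices are just an example, not taken from any particular page):

```python
# Illustrative schema.org-style description of a page, written as a Python
# dict mirroring the JSON-LD that would be embedded in the page itself.

import json

page_description = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Keeping tropical fish",
    "about": {"@type": "Thing", "name": "Fishkeeping"},
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2019-06-01",
}

# Embedded in the page as <script type="application/ld+json">...</script>
print(json.dumps(page_description, indent=2))
```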
If we could kill advertisement permanently, we could have an internet as described in the question. It would almost be an emergent feature of the internet.
- ranking content from users you have upvoted higher
- ranking content from users with similar upvote behaviour higher
While there is a risk of upvote bubbles, it should make it easier for niche content to spread to interested people and let products and services spread through peer trust rather than cold shouting (a rough sketch of the idea follows below).
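A minimal sketch of that kind of upvote-similarity ranking, with invented vote data; a real system would still need to handle scale, decentralised storage of votes, and gaming:

```python
# Rank unseen items by weighting other users' upvotes with how closely
# their past voting overlaps with mine (a simple collaborative-filtering
# flavour). All vote data here is made up.

import math

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_items(my_votes: dict[str, int],
               others: dict[str, dict[str, int]]) -> list[str]:
    scores: dict[str, float] = {}
    for _user, votes in others.items():
        sim = cosine(my_votes, votes)
        for item, vote in votes.items():
            if item not in my_votes:          # only rank content I haven't seen
                scores[item] = scores.get(item, 0.0) + sim * vote
    return sorted(scores, key=scores.get, reverse=True)

me = {"post1": 1, "post2": 1}
others = {"alice": {"post1": 1, "post3": 1}, "bob": {"post4": 1}}
print(rank_items(me, others))
```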
This is what Reddit originally tried to do before they pivoted.
https://www.reddit.com/r/self/comments/11fiab/are_memes_maki...
I was also wondering what would be good options to store votes/upvotes in a decentralized way.
That's how you make echo chambers.
But I think the combination of advertising+search engines is particularly bad, so paying for search would be a great first step.
https://hackernoon.com/wealth-a-new-era-of-economics-ce8acd7...
For the remaining free sites, you will see advertising in different forms (self-promotion blogs, the upsell, t-shirt stores on every site, spam bait).
Advertising saved the internet.
Now tracking.. for advertising or other purposes is the real problem.
By and large, people don't seem to be willing to pay for content on the web. Hence, advertising became the dominant business model for content on the web.
Find another way for someone to pay for relevant content and you can do away with advertising. It's as simple as that.
I don't think the causality is right here. People might not be willing to pay for content on the web because advertising enables competitors to offer content for free. If you removed that option, if people had no choice but to pay, it might just turn out that people would pay.
Not so simple. What is relevant for me may be irrelevant for you.
There's a saying in sales: "people hate to be sold, but they love to buy"... which is akin to what you are saying here. Advertising isn't the problem... the problem is that the reasons why people are promoting aren't novel enough... (rent seeking... which creates noise)
Until then, you're going to have demand for ferrying information between sellers and buyers, and vice versa, because of information asymmetry. You may disagree with some of the mediums currently used, finding them annoying, but advertising is always evolving to solve this problem, as is evident in the last three decades.