Show HN: Building a web search engine from scratch with 3B neural embeddings

This is really, really cool. I had wanted to run all of my searches on it, and though that seems possible, I feel it would sadly cost me a bit more time per search. Still, I'll try running some of my searches against it and share my thoughts afterwards. It is hit or miss, but it almost always lands you near the right spot, if not exactly on it.

For example, I searched "lemmy" hoping to find the fediverse project, and it gave me their Liberapay page instead.

Please do follow up on that Common Crawl promise, and maybe archive.org or other sources too. People are spending billions in this AI industry; I just hope that you can, whether through funding or community crowdwork, actually succeed in creating such an alternative. People are honestly fed up with the current search engine near-monopoly.

Wasn't Ecosia trying to roll out their own search engine? They should definitely take your help or have you on their team.

I just want a decentralized search engine, man. I understand that you want to make it sustainable and that's why you haven't open sourced it, but there is honestly so much money going down potholes, doing nothing but making our society worse, and this project almost works well enough and has insane potential...

Please open source it, and let's hope the community figures out some way of monetization/crowdfunding to actually make it sustainable.

Still, I haven't read the blog post in its entirety, since I was so excited that I just started using the search engine. But the article feels super in-depth, and this idea can definitely help others create their own proofs of concept, or actually create a decent open source search engine once and for all.

Not going to lie, this feels like a little magic and I am all for it. The more I think about it, the more excited I get; I haven't been this excited about a project in months!

I know open source is tough, and I come from a third-world country, but this is so cool that I will donate to ya as much as I can afford right now. It's not much, around $50, but this is coming from a guy who has not spent a single penny online. Please, I beg ya to open source it and use that Common Crawl data. I wish you all the best in your life and career, man.

Very cool project!

Just out of interest, I sent a query I've had difficulties getting good results for with major engines: "what are some good options for high-resolution ultrawide monitors?".

The response in this engine for this query at this point seems to have the same fallacy as I've seen in other engines. Meta-pages "specialising" in broad rankings are preferred above specialist data about the specific sought-after item. It seems that the desire for a ranking weighs the most.

If I were to manually try to answer this query, I would start by looking at hardware forums and geeky blogs, pick N candidates, then try to find the specifications and quirks for all products.

Of course, it is difficult to generically answer if a given website has performed this analysis. It can be favourable to rank sites citing specific data higher in these circumstances.

As a user, I would prefer to be presented with the initial sources used for assembling this analysis. Of course, this doesn't happen because engines don't perform this kind of bottom-to-top evaluation.

ricardobeat · 21 days ago

You could argue that it is not really a search query. There is not a particular page that answers the question “correctly”, it requires collating multiple sources and reasoning. That is not a search problem.

creesch · 20 days ago

This argument almost feels disingenuous to me. Of course, there isn't going to be one resource that will completely answer the question. However, there are going to be resources that are much likelier to contain correct parts of the answer, and there are resources that are much likelier to contain just SEO fluff.

The whole premise of what makes a good search engine has been based on the idea of surfacing those results that most likely contain good information. If that was not the case Google would not have risen to such dominance in the first place.

jacobr1 · 20 days ago

And yet ... that is exactly the kind of problem average people want to have solved for them by search engines, and Google kept trying to solve. It is probably one reason why AI chat with web search is going to beat plain search.

lysecret · 21 days ago

Just wow. My greatest respect! Also an incredible write-up. I like the takeaway that an essential ingredient of a search engine is curated and well-filtered data (garbage in, garbage out). I feel like this has been a big learning from LLM training too: better to work with less, but much higher quality, data. I'm curious how a search engine would perform where all content has been judged by an LLM.

throwawaylaptop · 21 days ago

I'm currently trying to get a friend's small business website to rank. I have a decent understanding of SEO, did the technically correct things, and wrote a decent amount of content by hand, specific to the local areas and services provided.

Two months in, Bing still hasn't crawled the favicon. Google finally did after a month. I'm still getting outranked by tangentially related services, garbage national lead collection sites, Yelp top 10 blog spam, and even exact service providers from 300 miles away that definitely don't serve the area.

Something is definitely wrong with PageRank and crawling in general.

mv4 · 21 days ago

Sadly, that ship has sailed. The web is dead. SEO should be called SEM (Search Engine Manipulation).

what · 21 days ago

>something is wrong with pagerank

Do you have any backlinks? If not, it’s working as intended?
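To illustrate the backlink point, here is a minimal sketch of textbook PageRank by power iteration. It is not a claim about how Google or Bing rank pages today; the toy graph, the page names, and the `pagerank` helper are all made up for the example. The takeaway is that a page nobody links to ends up with the baseline score, however good its content is.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its damped rank equally among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: the new local site links out, but nothing links back to it.
toy_links = {
    "yelp_top10": ["national_leads"],
    "national_leads": ["yelp_top10"],
    "forum": ["yelp_top10", "national_leads"],
    "new_local_site": ["yelp_top10"],
}
scores = pagerank(toy_links)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page:16s} {score:.4f}")
```

In this toy graph the new site only receives the `(1 - damping) / n` baseline, which is why backlinks matter so much under a purely link-based score.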

johnthescott · 21 days ago

i recall from years ago a site index url could be submitted to google. creating that index took some work.

ccgreg · 21 days ago

At the end, the author thinks about adding Common Crawl data. Our ranking information, generated from our web graph, would probably be a big help in picking which pages to crawl.

I love seeing the worked out example at scale -- I'm surprised at how cost effective the vector database was.

Imustaskforhelp · 21 days ago

demarq · 20 days ago

The title should be “10x engineer creates Google in their spare time”

But seriously what an amazing write up, plus animations, analysis etc etc. Bravo.

It was also ironic to see AWS failing quite a few use cases here. Stuff to think about.

Also, looking into the AWS limits:

> SQS had very low concurrent rate limits that could not keep up with the throughput of thousands of workers across the pipeline.

I could not find this. Perhaps the author meant Lambda limits?

> services like S3 have quite low rate limits — there are hard limits, but also dynamic per-account/bucket quotas

You have virtually unlimited throughput with prefix partitions
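For context, AWS documents S3 request rates per prefix (on the order of 3,500 writes and 5,500 reads per second per partitioned prefix), so spreading keys over several prefixes is the usual way to raise aggregate throughput. A minimal sketch of that idea follows; the `sharded_key` helper and the shard count of 16 are assumptions for illustration, not anything from the article.

```python
import hashlib
from collections import Counter


def sharded_key(object_id: str, shards: int = 16) -> str:
    """Prepend a stable hash-derived shard prefix to an object key."""
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    shard = int(digest[:8], 16) % shards
    # Two hex digits keep the prefix fixed-width for up to 256 shards.
    return f"{shard:02x}/{object_id}"


# Hypothetical object ids; count how many keys land under each prefix.
keys = [sharded_key(f"page-{i}") for i in range(1000)]
per_prefix = Counter(key.split("/", 1)[0] for key in keys)
print(len(per_prefix), "prefixes in use")
```

As far as I understand, S3 partitions prefixes automatically and gradually, so a sudden burst can still hit 503 Slow Down responses while it scales, which may be what the author ran into.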

wilsonzlin · 20 days ago

I'm not sure what the exact limits were, but I definitely recall running into server errors with S3 and the OCI equivalent service (not technically 429s, but enough to essentially limit throughput). SQS had 429s, I believe due to the number of requests rather than messages, but it only supports batches of at most 10.
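A rough sketch of why the batch cap matters: SQS `SendMessageBatch` takes at most 10 entries per call, so batching cuts the request count by at most 10x and no further. The helper names below are hypothetical, and the code only does the chunking and the arithmetic; it does not call AWS.

```python
from typing import Iterator, List, Sequence

SQS_MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per request


def batches(messages: Sequence[str], size: int = SQS_MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` messages."""
    for start in range(0, len(messages), size):
        yield list(messages[start:start + size])


def requests_needed(message_count: int, size: int = SQS_MAX_BATCH) -> int:
    """Number of batch requests needed to send `message_count` messages."""
    return -(-message_count // size)  # ceiling division


# Hypothetical workload: 25 crawl URLs to enqueue.
urls = [f"https://example.com/page/{i}" for i in range(25)]
print([len(batch) for batch in batches(urls)])
print(requests_needed(1_000_000))
```

So a pipeline pushing a million messages still needs at least 100,000 requests, and if the throttle is on requests, thousands of workers can reach it quickly.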

I definitely wanted these to "just work" out of the box (and maybe I could've worked more with AWS/OCI given more time), as I wanted to focus on the actual search.

jll29 · 20 days ago

Not sure where you are based, but if you were in the EU and had no commercial intentions, you might want to consider adding the crawls from OpenWebSearch.eu, an EU-funded research project to provide an open crawl of a substantial part of the Web (they also collaborate with Common Crawl), its plain text and an index:

  https://openwebsearch.eu/

It would be fantastic if someone could provide a not-for-profit decent quality Web search engine.

bjornsing · 20 days ago

Why the hell do all those “renowned institutions” need this (admittedly brilliant) guy to turn their crawl into a usable search engine? What’s wrong with this continent…?

dcreater · 21 days ago

This wasn't even in the realm of what I thought is possible for a single person to do. Incredible work!

It doesn't seem that far in distance from a commercial search engine? Maybe even Google?

50k to run is a comically small number. I'm tempted to just give you that money to seed.

poly2it · 21 days ago

divineg · 21 days ago

It's incredible. I can't believe it but it actually works quite nicely.

If 10K $5 subscriptions can cover its cost, maybe a community run search engine funded through donations isn't that insane?

noosphr · 21 days ago

It's been clear to anyone familiar with encoder only LLMs that Google is effectively dead. The only reason why it still lives is that it takes a while to crawl the whole web and keep the index up to date.

If someone like common crawl, or even a paid service, solves the crawling of the web in real time then the moat Google had for the last 25 years is dead and search is commoditized.

The team that runs the Common Crawl Foundation is well aware of how to crawl and index the web in real time. It's expensive, and it's not our mission. There are multiple companies that are using our crawl data and our web graph metadata to build up-to-date indexes of the web.

nickpsecurity · 20 days ago

It's not dead but will take a huge hit. I still use DuckDuckGo since I get good answers, good discovery, taken right to the sources (whom I can cite), and the search indexes are legal vs all the copyright infringement in AI training.

If AI training becomes totally legal, I will definitely start using them more in place of or to supplement search. Right now, I don't even use the AI answers.

kiririn · 20 days ago

You can see their panic - in my country they are running TV ads for Google search, showing it answering LLM-prompt-like queries. They are desperately trying to win back that mind share, and if they lose traditional keyword search too they’re cooked

gunalx · 20 days ago

Kagi seems to partially be that. Yes, really corpo, but way better vibes than Google. SearXNG is a bit different but also a thing.

echelon · 21 days ago

I think even more spectacularly, we may be witnessing the feature to feature obsolescence of big tech.

Models make it cheap to replicate and perform what tech companies do. Their insurmountable moats are lowering as we speak.

Yep, seems the big guys are running out of ideas, to some degree.