A link list or domain list isn't worth much without any rating or metadata. I lead a hobby project, and I'm no expert, so I provide ratings based on what kind of data pages provide (title, social, description), plus my own manual voting system. It's not ideal, but it's something. I also provide tags, so it's easy to see what a domain offers, and domains can be filtered by tag.
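To make the idea concrete, here is a minimal sketch of that kind of per-domain record: a rating derived from which metadata fields a page exposes, plus tag-based filtering. All field names here (`title`, `description`, `social`, `tags`, `votes`) are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch: rate a domain by how much metadata its pages
# provide, then filter the list by tag, best-rated first.

def rate(entry):
    """Score a domain: one point per metadata field, plus manual votes."""
    score = 0
    if entry.get("title"):
        score += 1
    if entry.get("description"):
        score += 1
    if entry.get("social"):  # e.g. og:/twitter: tags present
        score += 1
    return score + entry.get("votes", 0)

def filter_by_tag(entries, tag):
    """Return entries carrying the given tag, highest rating first."""
    hits = [e for e in entries if tag in e.get("tags", [])]
    return sorted(hits, key=rate, reverse=True)

domains = [
    {"domain": "example.org", "title": "Example", "description": "demo",
     "tags": ["tech"], "votes": 1},
    {"domain": "example.net", "tags": ["tech"]},
    {"domain": "example.com", "title": "Shop", "tags": ["shopping"]},
]
print([e["domain"] for e in filter_by_tag(domains, "tech")])
# → ['example.org', 'example.net']
```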
I know that you cannot count and visit every domain, so the list will never be finished, but I am happy with the results.
You can, though you must provide a reason compelling enough to the person maintaining access (I provided a few sentences and was approved by most, though maybe 20% of registrars declined my request):
I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching, though; it's (as others here have pointed out) building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.
I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...
You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as the repository for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and Anubis and Cloudflare tests every time I try to look for a recipe online. Why send AI scrapers to crawl literally everything when you can get the data for free?
I'll add it to the mile-long list of things that should exist and be online public goods.
I'm the creator of searcha.page and seek.ninja; those are the basis of my index. The biggest problem with ONLY using that is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless, and I would say Common Crawl should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more, but I had to scrub them out due to resources. My native crawling is much more targeted, and I'd be lucky to pull 100k, but as long as my heuristics choose the right targets, those will be very high-value pulls.
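The backfill idea above reduces to a simple merge: a huge but possibly stale Common Crawl-derived baseline, overlaid by a small, fresh, targeted native crawl. A hedged sketch (the record structures are made up for illustration):

```python
# Sketch of index backfill: native-crawl entries (url -> record) win over
# the Common Crawl baseline, since the native copy is fresher.

def merge_index(common_crawl, native):
    merged = dict(common_crawl)  # big, possibly stale baseline
    merged.update(native)        # small, fresh, targeted overlay
    return merged

cc = {"a.com/x": {"fetched": 2023}, "b.com/y": {"fetched": 2023}}
fresh = {"a.com/x": {"fetched": 2025}}
index = merge_index(cc, fresh)
print(index["a.com/x"]["fetched"], len(index))  # → 2025 2
```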
The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?
The IP thing is interesting. I was trying to make a CS:GO bot one time to scrape Steam's prices, and there are proxy services out there you can rent; I tried at least one and it was blocked by Steam. So I wonder if people buy real IPs.
Yep, my usage increased 20x week over week. It was actually the context expansion that was my bottleneck, not the search itself. My usage graph looks almost vertical. Not sure if this counts as a good week or a bad week.
It claims I reached the article limit. The last time I saw a Fast Company link must have been a decade ago! I was nostalgically looking forward to reading another article of theirs. Alas...
> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding
> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.
> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech
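The "keyword expansion" step described in the first quote can be illustrated without the LLM; here a hand-written synonym table stands in for the model. In the real system an LLM would presumably supply the related terms, which are then fed into ordinary keyword search.

```python
# Toy query expansion: the synonym table is a stand-in for an LLM's
# suggestions. The expanded term list feeds a traditional keyword index.

SYNONYMS = {"laptop": ["notebook"], "cheap": ["budget", "affordable"]}

def expand(query):
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded += SYNONYMS.get(t, [])
    return expanded

print(expand("cheap laptop"))
# → ['cheap', 'laptop', 'budget', 'affordable', 'notebook']
```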
And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532 with half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage".
I absolutely devoured Wilson Lin's articles recently... they are very high quality and informative for any amateur interested in search engines and LLMs! - https://blog.wilsonl.in/search-engine/
"The beefy CPU running this setup, a 32-core AMD EPYC 7532, underlines just how fast technology moves. At the time of its release in 2020, the processor alone would have cost more than $3,000. It can now be had on eBay for less than $200"
why do I never get deals like that when I am shopping for the homelab on eBay?
Has eBay fixed their "and then they ship you a box of rocks" problem?
I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.
I searched "AMD EPYC 7532" and there are a ton of listings for $150-$200. Are you just regretful that it wasn't like this when you were shopping parts for your homelab?
TheServerStore.com often has good deals. I actually bought a brand new 64-core EPYC 7702 server with 256 GB RAM and 8TB NVMe storage for about $3K fully assembled earlier this year.
Get a QC type chip and roll the dice, that's how I got mine. The biggest cost for me is disk and to a lesser extent ram, the chip itself was relatively cheap.
This is a cool project, and I hope he has fun with it.
I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.
Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.
We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. TikTok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. eBay has ALL the collectibles.
None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?
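Part of that friendliness is declared up front in robots.txt, which many hubs use to allow Googlebot while shutting out unknown crawlers. Python's standard library can evaluate those rules; the sketch below parses a made-up inline robots.txt rather than fetching a live one.

```python
# Check whether a given crawler user-agent may fetch a URL, per robots.txt.
# The rules below are a fabricated example of the common "Googlebot yes,
# everyone else no" pattern.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://example.com/page"))      # → True
print(rp.can_fetch("MyTinyCrawler", "https://example.com/page"))  # → False
```

A polite crawler would run this check (and honor crawl-delay hints) before every fetch.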
Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?
Well, I can't respond to everyone; I am the one running the search engine. And yes, it did crash today from load. Usage increased 20x this week vs. last, and I was totally unprepared. I don't know if that counts as a good launch or a bad one. For some reason I imagined usage would be some slow, steady ramp.
Thank you to those who tried it, and I'm sorry if you were one of the people it didn't perform for. As far as load goes, this was the first day it truly had a "trial by fire".
'Google rival' is quite a stretch. Surely 'search engine' is not just more accurate but also clearer, given all that Google does today, as if that's anything new.
I have 1,542,766 domains. Might not be much, but it's honest work.
It's available as a GitHub repo, so anybody who wants to start crawling has some initial data to kick off with.
Links
https://github.com/rumca-js/Internet-Places-Database
FYI there's a broken link in your readme:
https://czds.icann.org/home
also be prepared for thousands of emails about status changes to your access.
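Once access is granted, CZDS hands you standard DNS zone files, and the registered names can be pulled from the NS records. A hedged sketch, assuming the usual `name TTL IN NS target` line format (the sample lines are made up):

```python
# Extract unique registered domains from DNS zone file lines by
# collecting the owner names of NS records.

def domains_from_zone(lines):
    seen = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 4 and parts[2].upper() == "IN" and parts[3].upper() == "NS":
            seen.add(parts[0].rstrip(".").lower())
    return sorted(seen)

zone = [
    "example.com.\t172800\tin\tns\tns1.registrar.net.",
    "example.com.\t172800\tin\tns\tns2.registrar.net.",
    "other.com.\t172800\tin\tns\tns1.host.org.",
]
print(domains_from_zone(zone))  # → ['example.com', 'other.com']
```

Real zone files also carry DS and glue records and can run to gigabytes, so a production version would stream the file rather than hold it in memory.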
This is why we have computer variants of Library Science and Archeology, Forensic Science, and a bunch of other advanced knowledge (not AI, mind you).
While the index is currently not open source, it should be at some point, maybe when they get out of the beta stage (?). Details are still unclear.
https://commoncrawl.org
"An error has occurred building the search results."
He can then exhaust the remaining server heat through the dryer vent stack.
https://archive.is/HA7y4
Some bits and pieces:
> his new search engine, the robust Search-a-Page <https://searcha.page>, which has a privacy-focused variant called Seek Ninja <https://seek.ninja>
I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.