Readit News
renegat0x0 · 6 months ago
Well, I created my own domain index. I have not crawled every page inside each domain, but that is not my goal.

I have 1,542,766 domains. Might not be much, but it's honest work.

It is available as a GitHub repo, so anybody who wants to start crawling has some initial data to kick off.

Links

https://github.com/rumca-js/Internet-Places-Database

raybb · 6 months ago
What a nice project. What inspired this initially?

FYI there's a broken link in your readme:

    https://rumca-js.github.io/internet full internet search

renegat0x0 · 6 months ago
thanks, I replaced it with another demo link
hobs · 6 months ago
Can't you just request ICANN's zone files and have the canonical list of the day?
renegat0x0 · 6 months ago
Any link list or domain list is not worth much without ratings or metadata. This is a hobby project and I am not an expert, so I provide ratings based on what kind of data pages expose (title, social, description), plus my own manual voting system. It is not ideal, but it is something. I also provide tags, so it is easy to see what a domain offers, and domains can be filtered by tag.

I know that you cannot count and visit every domain, so the list will never be finished, but I am happy with the results.
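To make the rating-and-tags idea concrete, here is a minimal sketch. The field names, weights, and tag filter are my own illustration and are not the repo's actual scheme:

```python
# Illustrative sketch: score a domain by which metadata its front page exposes,
# and filter a domain list by tags. Weights are arbitrary, not the project's.

def rate_domain(meta: dict) -> int:
    """Crude quality score based on which metadata fields are present."""
    score = 0
    if meta.get("title"):
        score += 2                        # a title is the bare minimum
    if meta.get("description"):
        score += 3                        # a description suggests a maintained site
    if meta.get("og:title") or meta.get("og:description"):
        score += 2                        # OpenGraph tags imply social awareness
    score += meta.get("manual_votes", 0)  # manual voting tops it up
    return score

def filter_by_tags(domains: list, wanted: set) -> list:
    """Keep only entries carrying at least one of the wanted tags."""
    return [d for d in domains if wanted & set(d.get("tags", []))]
```

A list scored this way can never be "correct", but it at least lets consumers sort the haystack before crawling.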

beaugunderson · 6 months ago
you can, though you must provide a reason compelling enough for the person maintaining access (I provided a few sentences and was approved by most, though maybe 20% of registrars declined my request):

https://czds.icann.org/home

also be prepared for thousands of emails about status changes to your access.
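For anyone wondering what consuming a CZDS zone file actually involves: the files are huge gzipped DNS zone files, one resource record per line. A rough sketch of pulling the unique names out (the exact record layout varies by TLD; this only shows the common `name TTL class type rdata` shape):

```python
# Hedged sketch: extract unique domain names from CZDS-style zone file lines.
# Real files are gzipped and run to gigabytes; stream them, don't slurp them.

def domains_from_zone(lines) -> list:
    """Collect unique, lowercased domain names from DNS zone file lines."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):   # skip blanks and comments
            continue
        name = line.split()[0].rstrip(".").lower()
        seen.add(name)
    return sorted(seen)
```

In practice you would wrap this around `gzip.open(path, "rt")` and dedupe incrementally rather than holding everything in one set.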

egberts1 · 6 months ago
Avoiding GIGO (Garbage In, Garbage Out).

This is why we have computer-variants of Library Science and Archeology, Forensic Science and a bunch of other advanced knowledge (not AI, mind you).

didip · 6 months ago
This is amazing. Thanks for sharing!

luizfelberti · 6 months ago
I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching, though; it is (as others here have pointed out) building your index and crawling the extremely adversarial internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.

I hope this guy succeeds and becomes another reference in the community like the Marginalia guy. This makes me want to give my project another go...
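Even from a single home IP, two cheap politeness measures go a long way: honoring robots.txt and rate-limiting per host. A minimal sketch (the user-agent string and delay are placeholders, and a real crawler needs retries, backoff, and much more):

```python
import time
from urllib.robotparser import RobotFileParser

def make_checker(robots_txt: str, agent: str = "hobby-crawler"):
    """Build a can_fetch(url) predicate from raw robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(agent, url)

class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""
    def __init__(self, delay: float = 2.0):
        self.delay, self.last = delay, {}

    def wait(self, host: str) -> None:
        elapsed = time.monotonic() - self.last.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last[host] = time.monotonic()
```

None of this defeats active bot-blocking, but it keeps a single-server crawler from looking (and behaving) like abuse.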

mhitza · 6 months ago
You might want to bookmark https://openwebsearch.eu/open-webindex/

While the index is currently not open source, it should be at some point. Maybe when they get out of the beta stage? Details are still unclear.

3RTB297 · 6 months ago
You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and Anubis and Cloudflare tests every time I try to look for a recipe online. Why send AI scrapers to crawl literally everything when you could get the data for free?

I'll add it to the mile-long list of things that should exist and be online public goods.

moduspol · 6 months ago
Is the common crawl usable for something like this?

https://commoncrawl.org

chiefsearchaco · 6 months ago
I'm the creator of searcha.page and seek.ninja, and Common Crawl is the basis of my index. The biggest problem with ONLY using it is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless, and I would say it should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more, but I had to scrub them out due to resources. My native crawling is much more targeted and I'd be lucky to pull 100k pages, but as long as my heuristics choose the right targets, they will be very high-value pulls.
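For anyone curious how backfilling from Common Crawl starts in practice: each crawl publishes a queryable CDX index that returns one JSON record per captured page. A sketch of building and parsing such a query (the crawl label below is an assumption; check index.commoncrawl.org for the current list):

```python
import json
from urllib.parse import urlencode

# Assumed crawl label for illustration; real labels look like "CC-MAIN-YYYY-WW".
CDX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def cdx_query_url(domain: str, limit: int = 50) -> str:
    """Build a CDX API URL listing captures for every page under a domain."""
    return CDX + "?" + urlencode({"url": f"{domain}/*", "output": "json", "limit": limit})

def parse_cdx(body: str) -> list:
    """The API returns one JSON object per line; keep url, timestamp, status."""
    records = []
    for line in body.splitlines():
        if line.strip():
            rec = json.loads(line)
            records.append({k: rec.get(k) for k in ("url", "timestamp", "status")})
    return records
```

The records point into the crawl's WARC archives, which is where the actual page bodies come from.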
giancarlostoro · 6 months ago
Most likely it is, the issue then becomes being able to store and afford the storage for all the files.
wordpad · 6 months ago
Why can't crawling be crowd-sourced? It would solve IP rotation and spread the load
Poomba · 6 months ago
That’s how residential proxies work, in a perverse way

chiefsearchaco · 6 months ago
Common crawl sort of serves this function. I use it. It's a really good foundation.
6510 · 6 months ago
The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?
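The ordering problem is exactly what classical relevance scoring tries to answer. A toy TF-IDF ranker shows the basic shape; real engines layer many more signals (links, freshness, site quality) on top of anything this simple:

```python
import math
from collections import Counter

# Toy illustration of result ordering: score documents by summed TF-IDF
# weight of the query terms, then sort descending.

def rank(query: str, docs: dict) -> list:
    """Return doc ids ordered by TF-IDF relevance to the query."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    # document frequency: in how many docs does each term appear?
    df = Counter(t for toks in tokenized.values() for t in set(toks))

    def score(toks):
        tf = Counter(toks)
        return sum(tf[w] * math.log(n / df[w])
                   for w in query.lower().split() if df[w])

    return sorted(tokenized, key=lambda d: score(tokenized[d]), reverse=True)
```

Even this tiny example makes the "page 200" question concrete: everything gets *a* score, but far down the tail the scores stop meaning much.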
ge96 · 6 months ago
The IP thing is interesting. I was trying to make a CSGO bot one time to scrape Steam's prices, and there are proxy services out there you can rent; I tried at least one and it was blocked by Steam. So I wonder if people buy real IPs.
kccqzy · 6 months ago
Yeah, people buy residential IPs on the black market. They are essentially infected home PCs and botnets.
cheema33 · 6 months ago
I tried the search site at https://searcha.page/ by searching for something random and got the following message:

"An error has occurred building the search results."

authnopuz · 6 months ago
hug of death? I fear the temperature will get very high in his laundry room
DannyBee · 6 months ago
I'm sure it depends on how much laundry he is doing - his dryer is probably heated entirely by servers.

He can then exhaust the remaining server heat through the dryer vent stack.

robofanatic · 6 months ago
Might not even need a dryer :-)
ape4 · 6 months ago
Change it to a sauna?
chiefsearchaco · 6 months ago
Yep, my usage increased 20x week over week. It was actually the context expansion that was my bottleneck, not the search itself. My usage graph looks almost vertical. Not sure if this counts as a good week or a bad week.
eschulz · 6 months ago
Before this happened to me, my first search returned an impressive SERP.
lucb1e · 6 months ago
It claims I reached the article limit. The last time I saw a Fast Company link must have been a decade ago! I was nostalgically looking forward to reading another article of theirs. Alas...

https://archive.is/HA7y4

Some bits and pieces:

> his new search engine, the robust Search-a-Page <https://searcha.page>, which has a privacy-focused variant called Seek Ninja <https://seek.ninja>

> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding.”

> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.

> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech

And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532 with half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage"
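The keyword-expansion step Pearce describes is easy to picture in isolation. In the sketch below a static synonym table stands in for the LLM call (which would generate these expansions per query); the retrieval side stays entirely classical:

```python
# Sketch of LLM-assisted keyword expansion. The SYNONYMS table is a stand-in
# for model output; nothing here is Pearce's actual implementation.

SYNONYMS = {
    "laptop": ["notebook"],
    "cheap": ["budget", "affordable"],
}

def expand_query(query: str) -> list:
    """Return the original terms plus any expansions, deduplicated in order."""
    terms = []
    for word in query.lower().split():
        for candidate in [word] + SYNONYMS.get(word, []):
            if candidate not in terms:
                terms.append(candidate)
    return terms
```

The expanded term list is then fed to an ordinary inverted-index lookup, which is why the approach can still run on one machine: the model only touches the query, never the index.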

udkl · 6 months ago
I absolutely devoured Wilson Lin's articles recently... they are very high quality and informative for any amateur interested in search engines and LLMs! - https://blog.wilsonl.in/search-engine/
wvenable · 6 months ago
Reader mode in Firefox (plus sometimes a page refresh) gets me past most paywalls -- including this article.
ofrzeta · 6 months ago
"The beefy CPU running this setup, a 32-core AMD EPYC 7532, underlines just how fast technology moves. At the time of its release in 2020, the processor alone would have cost more than $3,000. It can now be had on eBay for less than $200"

Why do I never get deals like that when I'm shopping for the homelab on eBay?

progval · 6 months ago
You need to spend a lot of time looking through badly labeled offers, and be willing to buy from sellers with no reputation.
_fat_santa · 6 months ago
Not for a CPU but earlier this year I bought a Thinkpad workstation off eBay for $500. It's a machine from 2020 and when it was new cost $5,700.

I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.

saalweachter · 6 months ago
Has eBay fixed their "and then they ship you a box of rocks" problem?

I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.

robrtsql · 6 months ago
I searched "AMD EPYC 7532" and there are a ton of listings for $150-$200. Are you just regretful that it wasn't like this when you were shopping parts for your homelab?
throwawayffffas · 6 months ago
I got a 7551p plus motherboard and ram for about 600 bucks from China this January. I may have overpaid but it works great, and gets the job done.
Gormo · 6 months ago
TheServerStore.com often has good deals. I actually bought a brand new 64-core EPYC 7702 server with 256 GB RAM and 8TB NVMe storage for about $3K fully assembled earlier this year.
ThatMedicIsASpy · 6 months ago
EPYC 7000-series + motherboard + 256-512GB RAM (from China) usually starts at 800 euros plus import tax
chiefsearchaco · 6 months ago
Get a QC type chip and roll the dice, that's how I got mine. The biggest cost for me is disk and to a lesser extent ram, the chip itself was relatively cheap.

renewiltord · 6 months ago
AliExpress, broseph. You'll get it in no time; that's where I got mine. Go QS if you have some risk tolerance, and ES if you also have time tolerance.
phendrenad2 · 6 months ago
This is a cool project, and I hope he has fun with it.

I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.

Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.

We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. Tiktok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. Ebay has ALL the collectables.

None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?

Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?

chiefsearchaco · 6 months ago
Well I can't respond to everyone - I am the one running the search engine. And yes, it did crash today from load. Usage increased 20x this week vs last and I was totally unprepared. I don't know if that counts as a good launch or a bad one. For some reason in my head I imagined usage would be some slow steady ramp.

Thank you for those who tried it, and I'm sorry if you were one of the people it didn't perform for. As far as load goes this was the first day it truly had a "trial by fire".

OJFord · 6 months ago
'Google rival' is quite a stretch. Surely 'search engine' is not just more accurate, but clearer too, given all that Google does today beyond search.