I've been thinking about building a home-local "mini-Google" that indexes maybe 1,000 websites. In practice, I rarely need more than a handful of sites for my searches, so it seems like overkill to rely on full-scale search engines for my use case.
My rough idea for the architecture:
- Crawler: A lightweight scraper that visits each site periodically.
- Indexer: Convert pages into text and create an inverted index for fast keyword search. Could use something like Whoosh.
- Storage: Store raw HTML and text locally, maybe compress older snapshots.
- Search Layer: Simple query parser to score results by relevance, maybe using TF-IDF or embeddings (see the sketch right after this list).
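To make that concrete, here's a rough, untested sketch of the crawl → snapshot → index → search loop I have in mind. It assumes requests, BeautifulSoup, and Whoosh; SEED_URLS, the indexdir/snapshots directories, and the helper names are placeholders I made up for illustration.

```python
import gzip
import os

import requests
from bs4 import BeautifulSoup
from whoosh import index
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import QueryParser

# Placeholder seed list; the real ~1,000 sites would live in a config file.
SEED_URLS = [
    "https://example.com/",
    "https://example.org/",
]

INDEX_DIR = "indexdir"      # Whoosh inverted index
SNAPSHOT_DIR = "snapshots"  # gzipped raw HTML, one file per page

schema = Schema(
    url=ID(stored=True, unique=True),
    title=TEXT(stored=True),
    content=TEXT,
)


def open_index():
    # Create the index on the first run, reuse it afterwards.
    if not os.path.exists(INDEX_DIR):
        os.makedirs(INDEX_DIR)
        return index.create_in(INDEX_DIR, schema)
    return index.open_dir(INDEX_DIR)


def fetch_and_snapshot(url):
    # Fetch one page and keep a compressed raw copy for later re-processing.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    os.makedirs(SNAPSHOT_DIR, exist_ok=True)
    # Crude filename scheme, fine for a sketch.
    fname = os.path.join(SNAPSHOT_DIR, url.replace("/", "_") + ".html.gz")
    with gzip.open(fname, "wt", encoding="utf-8") as f:
        f.write(resp.text)
    return resp.text


def extract_text(html):
    # Strip tags; good enough for keyword search as a first pass.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title and soup.title.string else ""
    return title, soup.get_text(separator=" ", strip=True)


def crawl_and_index():
    ix = open_index()
    writer = ix.writer()
    for url in SEED_URLS:
        try:
            html = fetch_and_snapshot(url)
        except requests.RequestException as err:
            print(f"skipping {url}: {err}")
            continue
        title, text = extract_text(html)
        # update_document replaces any existing entry with the same unique url.
        writer.update_document(url=url, title=title, content=text)
    writer.commit()


def search(query_string, limit=10):
    ix = open_index()
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(query_string)
        for hit in searcher.search(query, limit=limit):
            print(hit["url"], "-", hit["title"])


if __name__ == "__main__":
    crawl_and_index()
    search("inverted index")
```

Whoosh scores with BM25F out of the box, which is close enough to TF-IDF for a first pass; embeddings could come later as a re-ranking step on top of the keyword hits.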
I'd refresh the index periodically and build a small web UI for browsing.
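For the UI, I'm picturing something as small as a single Flask route over the same index. Again just a sketch under the same assumptions (Flask as the web framework, the indexdir built by the crawler sketch above, a deliberately bare-bones inline template):

```python
from flask import Flask, render_template_string, request
from whoosh import index
from whoosh.qparser import QueryParser

app = Flask(__name__)
ix = index.open_dir("indexdir")  # assumes the crawler sketch above has already built it

PAGE = """
<form><input name="q" value="{{ q }}"><button>Search</button></form>
<ul>
{% for r in results %}
  <li><a href="{{ r.url }}">{{ r.title or r.url }}</a></li>
{% endfor %}
</ul>
"""


@app.route("/")
def home():
    q = request.args.get("q", "")
    results = []
    if q:
        with ix.searcher() as searcher:
            query = QueryParser("content", ix.schema).parse(q)
            results = [{"url": hit["url"], "title": hit["title"]}
                       for hit in searcher.search(query, limit=20)]
    return render_template_string(PAGE, q=q, results=results)


if __name__ == "__main__":
    app.run(debug=True)
```

Periodic updates could then just be a nightly cron entry that re-runs the crawler script.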
Has anyone tried this, or are there similar projects?
I didn't get the feeling that LessWrong is the right place for such a post. Is it? What tag should I use? Anything else I should know when posting this kind of content there?