I built mine on top of an RSS feed I generate from Hacker News, which filters out any posts linking to the top 1 million domains [1] and creates a readable version of the content. I use it to surface articles on smaller blogs and personal websites, and it's become my main content source. It's generated via GitHub Actions every 4 hours and stored in a detached branch on GitHub (~2 GB of data from the past 4 years). Here's an example for posts with >= 10 upvotes [2].
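To make the filtering step concrete, here is a minimal sketch of how such a check could look. It is not the actual feed code: the function names, the post dictionaries, and the choice to match parent domains are all assumptions made for illustration, and `top_domains` stands in for a set loaded from a list like [1].

```python
from __future__ import annotations

from urllib.parse import urlsplit


def candidate_domains(url: str) -> list[str]:
    """Return the URL's hostname plus each parent domain, longest first."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if not host:
        return []
    labels = host.split(".")
    # "www.blog.example.com" -> itself, "blog.example.com", "example.com"
    return [".".join(labels[i:]) for i in range(max(len(labels) - 1, 1))]


def is_small_site(url: str, top_domains: set[str]) -> bool:
    """True when neither the host nor any parent domain is on the top list."""
    candidates = candidate_domains(url)
    # URLs without a hostname (e.g. text-only posts) are never "small sites".
    return bool(candidates) and top_domains.isdisjoint(candidates)


def filter_posts(posts: list[dict], top_domains: set[str], min_points: int = 10) -> list[dict]:
    """Keep posts with enough upvotes that link outside the top domains.

    Each post is assumed to be a dict with "url" and "points" keys
    (a hypothetical shape, not the real feed schema).
    """
    return [
        post
        for post in posts
        if post["points"] >= min_points and is_small_site(post["url"], top_domains)
    ]
```

One consequence of matching parent domains: a personal blog hosted on a subdomain of a large platform is dropped whenever the platform's domain is on the list. Whether that is desirable depends on the feed's goals.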
It only took several hours to build the semantic search on top. And that included time for me to try out and learn several different vector DBs, embedding models, data pipelines, and UI frameworks! The current state of AI tooling is wonderfully simple.
In the end I landed on (selected in haste optimizing for developer ergonomics, so only a partial endorsement):
- BAAI/bge-small-en as an embedding model
- Python with:
  - HuggingFaceBgeEmbeddings from langchain_community for creating embeddings
  - SentenceSplitter from llama_index for chunking documents
  - ChromaDB as a vector DB + chroma-ops to prune the DB
  - sqlite3 for metadata
  - FastAPI, Pydantic, Jinja2, Tailwind for API and server-rendered webpages
- jsdom and mozilla-readability for article extraction
I generated the index locally on my M2 Mac, which ripped through the ~70k articles in ~12 hours to generate all the embeddings.
I run the search site with Podman on a VM from Hetzner (along with other projects) for ~$8 / month. All requests are handled on CPU without calls to external AI providers. Query times are <200 ms, which includes embedding generation → vector DB lookup → metadata retrieval → page rendering. The server source code is here [3].
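For readers who want to see the shape of that request path, here is a rough, stdlib-only sketch of the first three stages (page rendering is omitted). It is not the server code from [3]. `toy_embed` is a hashed bag-of-words stand-in for BAAI/bge-small-en, the in-memory dict with brute-force cosine similarity stands in for ChromaDB, and the `articles` table schema is hypothetical.

```python
import hashlib
import math
import sqlite3

DIM = 256  # toy dimensionality; chosen arbitrarily for this sketch


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def top_k(query_vec: list[float], index: dict, k: int = 3) -> list[tuple]:
    """Stand-in for the vector DB lookup: brute-force cosine similarity.

    Vectors are already unit length, so the dot product is the cosine.
    """
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


def search(query: str, index: dict, conn: sqlite3.Connection, k: int = 3) -> list[dict]:
    """Embed the query, find the nearest documents, then join in metadata."""
    results = []
    for doc_id, score in top_k(toy_embed(query), index, k):
        row = conn.execute(
            "SELECT title, url FROM articles WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is not None:
            results.append({"title": row[0], "url": row[1], "score": round(score, 3)})
    return results
```

In this layout the vector index only needs to hold document IDs and vectors, while titles and URLs stay in SQLite and are fetched once the top hits are known.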
Nice work @jnnnthnn! What you built is fast, the rankings were solid, and the summaries are convenient.
[1] https://majestic.com/reports/majestic-million
[2] https://github.com/awendland/hacker-news-small-sites/blob/ge...
[3] https://github.com/awendland/hacker-news-small-sites-website...
https://en.wikipedia.org/wiki/R%2FK_selection_theory?wprov=s...
> The theory was popular in the 1970s and 1980s, when it was used as a heuristic device, but lost importance in the early 1990s, when it was criticized by several empirical studies.[5][6] A life-history paradigm has replaced the r/K selection paradigm, but continues to incorporate its important themes as a subset of life history theory.[7] Some scientists now prefer to use the terms fast versus slow life history as a replacement for, respectively, r versus K reproductive strategy.[8]
So far, I’ve been dealing with a tradeoff between latency and error handling in my API endpoints. I’ll either 1.) embed content and upsert it into the vector DB inside a transaction block for my main DB in the handler, which kills latency, or 2.) kick off the embedding work separately from the main handler work, which risks the data desynchronizing.
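A common middle ground between those two options is an outbox table: the handler writes the row and a "needs embedding" marker in one transaction, and a separate worker does the slow embedding later. The sketch below illustrates that pattern with sqlite3. It is not how Retake works, and the table names, function names, and the `embed_and_upsert` callback are hypothetical.

```python
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the main table and the outbox that tracks pending embeddings."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            body TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS embedding_outbox (
            doc_id INTEGER PRIMARY KEY REFERENCES documents(id)
        );
        """
    )


def save_document(conn: sqlite3.Connection, body: str) -> int:
    """Handler path: insert the row and queue it for embedding atomically."""
    with conn:  # commits both inserts together, or rolls both back
        cur = conn.execute("INSERT INTO documents (body) VALUES (?)", (body,))
        doc_id = cur.lastrowid
        conn.execute("INSERT INTO embedding_outbox (doc_id) VALUES (?)", (doc_id,))
    return doc_id


def drain_outbox(
    conn: sqlite3.Connection,
    embed_and_upsert: Callable[[int, str], None],
    batch_size: int = 100,
) -> int:
    """Worker path: embed queued rows, removing each only after it succeeds.

    embed_and_upsert should be idempotent, since a crash between the upsert
    and the DELETE means the same row is processed again on the next run.
    """
    rows = conn.execute(
        "SELECT o.doc_id, d.body FROM embedding_outbox AS o "
        "JOIN documents AS d ON d.id = o.doc_id "
        "ORDER BY o.doc_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    processed = 0
    for doc_id, body in rows:
        embed_and_upsert(doc_id, body)  # an exception here leaves the row queued
        with conn:
            conn.execute("DELETE FROM embedding_outbox WHERE doc_id = ?", (doc_id,))
        processed += 1
    return processed
```

The handler stays fast because it never calls the embedding model, and a worker that was offline catches up by draining whatever accumulated. The number of rows left in `embedding_outbox` also gives a rough lag figure to monitor.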
I’d much prefer a set-it-and-forget-it approach like Retake.
A few questions:
* If the “real-time server” goes offline temporarily, will it catch up on any newly added rows in the interim?
* Do you intend to emit any OpenTelemetry metrics? I’d like to monitor lag in production.
* Will I be able to deploy this as a single container on ECS/Kubernetes?
I'd be interested to hear how people use this outside of an IoT context.
I use Node-Red for a few scheduled activities: archiving Reddit posts or tweets I upvote and pulling information from real estate websites that match criteria I’m interested in.
I like Node-Red vs. cron-managed shell/Python scripts for several reasons:
- the admin/editor UI is accessible on any device with a web browser (no git, ssh, etc. tooling required)
- the node-based visual flow is easy to reason about and debug (so even after years of ignoring my scripts I can quickly come back to them and grok what’s going on)
- the barrier to entry continues to be low (I can pop in and create a new flow in <1 hr)
I prefer it over Zapier or IFTTT since it’s more flexible. I’ve authored arbitrary JavaScript and request logic to retrieve and filter data in ways these pre-packaged tools can’t.
I run it on an AWS Lightsail server for ~$4 per month. I use Ansible to manage Ubuntu with podman + systemd running the Node-Red docker image and TLS provided by Caddy. It took roughly 4 hours to set up from scratch, and it's something I return to once every ~18 months to update or tweak with minimal issues.
To sum it up, I appreciate the grok-ability + flexibility + accessibility. It just works and it scales in complexity as I need it to!
For a quick preview of each library on a random sample of 16 articles posted to HN, see https://github.com/awendland/readable-web-extractor-comparis... (you’ll need to expand a row to see its results).