awendland (u/awendland)

awendland commented on Launch HN: Metriport (YC S22) – Open-source API for healthcare data exchange · Posted by u/dgoncharov

awendland · 2 years ago

What percentage, or how many millions, of patients are accessible on the network today?

awendland commented on Show HN: Hacker Search – A semantic search engine for Hacker News hackersearch.net... · Posted by u/jnnnthnn

awendland · 2 years ago

Following @isoprohplex, I'll be the fourth comment to say I also built a variant of this: https://hnss.alexwendland.com/

I built mine on top of an RSS feed I generate from Hacker News, which filters out any posts linking to the top 1 million domains [1] and creates a readable version of the content. I use it to surface articles on smaller blogs/personal websites, and it's become my main content source. It's generated via GitHub Actions every 4 hours and stored in a detached branch on GitHub (~2 GB of data from the past 4 years). Here's an example for posts with >= 10 upvotes [2].
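As a rough illustration of the filtering step described above (not the actual feed generator code), here is a minimal Python sketch. It assumes the top-domain list has already been loaded into a set, and it treats a post as "small site" only when neither its host nor any parent domain appears in that set. The function names and the parent-domain matching rule are assumptions made for this example.

```python
from urllib.parse import urlparse


def normalized_host(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' removed (a simplification)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_small_site(url: str, top_domains: set[str]) -> bool:
    """True when neither the host nor any parent domain is in top_domains.

    Assumption: subdomains of a listed domain are also excluded, so
    gist.github.com is dropped when github.com is listed.
    """
    host = normalized_host(url)
    if not host:
        return False  # unparseable URL: do not treat it as a small site
    parts = host.split(".")
    # blog.example.com -> {"blog.example.com", "example.com", "com"}
    candidates = {".".join(parts[i:]) for i in range(len(parts))}
    return candidates.isdisjoint(top_domains)


if __name__ == "__main__":
    top = {"nytimes.com", "github.com"}  # stand-in for the full top-million list
    for link in ["https://example-blog.dev/post", "https://www.nytimes.com/a"]:
        print(link, is_small_site(link, top))
```

Holding the list as a set keeps each membership check constant-time, which matters little at this scale but keeps the filter simple.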

It only took several hours to build the semantic search on top. And that included time for me to try out and learn several different vector DBs, embedding models, data pipelines, and UI frameworks! The current state of AI tooling is wonderfully simple.

In the end I landed on the following stack (selected in haste while optimizing for developer ergonomics, so only a partial endorsement):

  - BAAI/bge-small-en as an embedding model
  - Python with
    - HuggingFaceBgeEmbeddings from langchain_community for creating embeddings
    - SentenceSplitter from llama_index for chunking documents
    - ChromaDB as a vector DB + chroma-ops to prune the DB
    - sqlite3 for metadata
    - FastAPI, Pydantic, Jinja2, Tailwind for API and server-rendered webpages
  - jsdom and mozilla-readability for article extraction

I generated the index locally on my M2 Mac, which ripped through the ~70k articles in ~12 hours to produce all the embeddings.

I run the search site with Podman on a VM from Hetzner (along with other projects) for ~$8 / month. All requests are handled on CPU without calls to external AI providers. Query times are <200 ms, which includes embedding generation → vector DB lookup → metadata retrieval → page rendering. The server source code is here [3].
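The query path described above (embed the query, look up the nearest vectors, fetch metadata) can be sketched with the Python standard library alone. This is a toy under stated assumptions, not the site's code: `toy_embed` is a hypothetical hashed bag-of-words stand-in for BAAI/bge-small-en, a brute-force cosine scan stands in for ChromaDB, and the `articles` table schema is invented for illustration.

```python
import hashlib
import math
import sqlite3

DIM = 64  # toy dimensionality chosen for this sketch


def toy_embed(text: str) -> list[float]:
    """Hash each word into a bucket and L2-normalize. NOT a semantic embedding."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs):
    """docs: iterable of (id, title, text). Returns (sqlite connection, vectors by id)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    vectors = {}
    for doc_id, title, text in docs:
        db.execute("INSERT INTO articles VALUES (?, ?)", (doc_id, title))
        vectors[doc_id] = toy_embed(text)
    db.commit()
    return db, vectors


def search(query, db, vectors, k=3):
    """Embed the query, rank stored vectors by cosine similarity, join metadata."""
    q = toy_embed(query)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = sorted(
        ((sum(a * b for a, b in zip(q, v)), doc_id) for doc_id, v in vectors.items()),
        reverse=True,
    )[:k]
    results = []
    for score, doc_id in scored:
        (title,) = db.execute(
            "SELECT title FROM articles WHERE id = ?", (doc_id,)
        ).fetchone()
        results.append((title, score))
    return results
```

A real deployment would swap `toy_embed` for an actual embedding model and the linear scan for a vector index; the three-stage shape of the request stays the same.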

Nice work @jnnnthnn! What you built is fast, the rankings were solid, and the summaries are convenient.

[1] https://majestic.com/reports/majestic-million

[2] https://github.com/awendland/hacker-news-small-sites/blob/ge...

[3] https://github.com/awendland/hacker-news-small-sites-website...

awendland commented on Sociosexual orientations are not reflective of life trajectories sciencedirect.com/science... · Posted by u/ZunarJ5

solardev · 2 years ago

Is the fast/slow continuum another way of talking about r/K strategies, or are they different somehow?

https://en.wikipedia.org/wiki/R%2FK_selection_theory?wprov=s...

awendland · 2 years ago

I was also curious. Wikipedia addresses the question:

> The theory was popular in the 1970s and 1980s, when it was used as a heuristic device, but lost importance in the early 1990s, when it was criticized by several empirical studies.[5][6] A life-history paradigm has replaced the r/K selection paradigm, but continues to incorporate its important themes as a subset of life history theory.[7] Some scientists now prefer to use the terms fast versus slow life history as a replacement for, respectively, r versus K reproductive strategy.[8]

awendland commented on Show HN: Open-Source Infrastructure for Vector Data Streams github.com/getretake/reta... · Posted by u/pnoel

awendland · 2 years ago

I’ve been looking for something like this: eventually consistent syncing of DB content -> embeddings in a vector DB.

So far, I’ve been dealing with a tradeoff between latency and error handling in my API endpoints. I’ll either 1.) embed content + upsert into the vector DB inside a transaction block for my main DB in the handler, which kills latency, or 2.) kick off the embedding work separately from the main handler work, which risks the data desynchronizing.
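For context, one generic way to get eventual consistency between these two options is an outbox table: the handler writes the row and a "needs embedding" marker in one local transaction, and a separate worker drains the markers later. The sketch below uses sqlite3 as a stand-in for the main DB and is only an illustration of that pattern. It makes no claim about how Retake works, and the table names and the `upsert_embedding` callback are hypothetical.

```python
import sqlite3


def init(db):
    db.executescript("""
        CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE embed_outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            doc_id INTEGER NOT NULL
        );
    """)


def save_doc(db, doc_id, body):
    """Handler path: one local transaction, no embedding call."""
    with db:  # commits on success, rolls back on exception
        db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (doc_id, body))
        db.execute("INSERT INTO embed_outbox (doc_id) VALUES (?)", (doc_id,))


def drain_outbox(db, upsert_embedding):
    """Worker path: sync pending docs; a failed upsert stays queued for retry."""
    pending = db.execute(
        "SELECT o.seq, o.doc_id, d.body FROM embed_outbox o "
        "JOIN docs d ON d.id = o.doc_id ORDER BY o.seq"
    ).fetchall()
    synced = 0
    for seq, doc_id, body in pending:
        try:
            upsert_embedding(doc_id, body)  # hypothetical: embed + write to vector DB
        except Exception:
            continue  # leave the marker in place; the next run retries it
        with db:
            # Delete only this marker, so a newer write to the same doc stays queued.
            db.execute("DELETE FROM embed_outbox WHERE seq = ?", (seq,))
        synced += 1
    return synced
```

The handler's latency then depends only on the local transaction, and the count of rows left in `embed_outbox` doubles as a simple lag metric. The upsert should be idempotent, since a document can be processed more than once.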

I’d much prefer a set-it-and-forget-it approach like Retake.

A few questions:

* If the “real-time server” goes offline temporarily, will it catch up on any newly added rows in the interim?

* Do you intend to emit any OpenTelemetry metrics? I’d like to monitor lag in production.

* Will I be able to deploy this as a single container on ECS/Kubernetes?

awendland commented on Node-Red 3.0 Released nodered.org/blog/2022/07/... · Posted by u/rcarmo

viraptor · 3 years ago

NodeRed is one of those projects that look so cool I'd like to have a reason to use it. So far nothing, but maybe I'll be automating my home one day and it will come in useful.

I'd be interested to hear how people use this outside of an IoT context.

awendland · 3 years ago

This is a little lengthy, but I wanted to share the tactical details of my use case to give you a full picture:

I use Node-Red for a few scheduled activities: archiving Reddit posts or tweets I upvote and pulling information from real estate websites that match criteria I’m interested in.

I like Node-Red vs. cron-managed shell/Python scripts for several reasons:

  - the admin/editor UI is accessible on any device with a web browser (no git, ssh, etc. tooling required)
  - the node-based visual flow is easy to reason about and debug (so even after years of ignoring my scripts I can quickly come back to them and grok what’s going on)
  - the barrier to entry continues to be low (I can pop in and create a new flow in <1 hr)

I prefer it over Zapier or IFTTT since it’s more flexible. I’ve authored arbitrary JavaScript and request logic to retrieve and filter data in ways these pre-packaged tools can’t.

I run it on an AWS Lightsail server for ~$4 per month. I use Ansible to manage Ubuntu with podman + systemd running the Node-Red Docker image and TLS provided by Caddy. It took roughly 4 hours to set up from scratch, and it's something I return to once every ~18 months to update/tweak with minimal issues.

To sum it up, I appreciate the grok-ability + flexibility + accessibility. It just works and it scales in complexity as I need it to!

awendland commented on Rdrview – Firefox Reader View as a Linux command line tool github.com/eafer/rdrview... · Posted by u/ashitlerferad

awendland · 5 years ago

I needed a reader view library for a side project and decided to compare the most popular options (repo at https://github.com/awendland/readable-web-extractor-comparis...). Among cleanview, metascraper, @postlight/mercury-parser, and mozilla/readability, I thought that mozilla/readability performed the best because of its consistent extraction of the primary content and minimal mangling of the semantic structure.

For a quick preview of each library on a random sample of 16 articles posted to HN, see https://github.com/awendland/readable-web-extractor-comparis... (you’ll need to expand a row to see its results).