Readit News
jamesgresql commented on Elasticsearch was never a database   paradedb.com/blog/elastic... · Posted by u/jamesgresql
jamesgresql · a month ago
I know it sounds obvious, but some people are pretty determined to use it that way!
jamesgresql commented on Teaching Postgres to Facet Like Elasticsearch   paradedb.com/blog/facetin... · Posted by u/jamesgresql
jamesgresql · 2 months ago
Hey HN! Author here. We added faceted search capabilities to our `pg_search` extension for Postgres, which is built on Tantivy (Rust's answer to Lucene). This brings Elasticsearch-style faceting directly into Postgres, with a 14x performance improvement over a CTE-based approach, by performing facet aggregations in a single BM25 index pass and making use of our columnar store.

You get the same faceting features you'd expect from a dedicated search engine while maintaining full ACID compliance. Happy to answer technical questions about the implementation!
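
For anyone new to faceting, here's a minimal sketch in Python of what a facet aggregation computes (per-value match counts for fields like "brand" or "color") using made-up documents. It's an illustration of the concept, not the pg_search API, which does this work inside the BM25 index pass and our columnar store.

    # Conceptual illustration only; documents and field names are made up.
    # A facet aggregation counts matching documents per distinct value of a field,
    # so a UI can render e.g. "brand: Acme (2)" next to the search results.
    from collections import Counter

    matches = [  # pretend these matched a BM25 query for "running shoes"
        {"id": 1, "brand": "Acme",   "color": "red"},
        {"id": 2, "brand": "Acme",   "color": "blue"},
        {"id": 3, "brand": "Zenith", "color": "red"},
    ]

    def facet_counts(docs, field):
        """Count matching documents per distinct value of `field`."""
        return Counter(doc[field] for doc in docs)

    print(facet_counts(matches, "brand"))  # Counter({'Acme': 2, 'Zenith': 1})
    print(facet_counts(matches, "color"))  # Counter({'red': 2, 'blue': 1})

Doing this inside a single index pass is what avoids the repeated scans that a CTE-per-facet approach ends up paying for.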

jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
flakiness · 2 months ago
Oh, it's good old tokenization vs. for-LLM tokenizers like SentencePiece or tiktoken. We shouldn't forget there are simple, non-ML approaches like this one that don't ask you to buy more GPUs.
jamesgresql · 2 months ago
Haha, I like “good old tokenization”
jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
nawazgafar · 2 months ago
You beat me to the punch. I wrote a blog post[1] with the exact same title last week! Though I went into a bit more detail with regard to embedding layers, so maybe my title is not accurate.

1. https://gafar.org/blog/text-to-tokens

jamesgresql · 2 months ago
Amazing, will have a read!
jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
tgv · 2 months ago
It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with richer morphology or compounding. It's a very "English" approach.
jamesgresql · 2 months ago
Chinese, Japanese, Korean, etc. don’t work like this either.

However, even though the approach is “old-fashioned”, it’s still widely used for English. I’m not sure there is a universal approach that semantic search could use that would be both fast and accurate.

At the end of the day people choose a tokenizer that matches their language.

I will update the article to make all this clearer though!
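
To make “old-fashioned” concrete, here's a toy sketch of the kind of English-oriented pipeline in question: lowercase, split on non-alphanumeric runs, drop stop words (stemming would normally follow). The stop-word list is made up for illustration; real analyzers are far more complete.

    import re

    # Toy "classic" pipeline: lowercase -> split -> stop-word filter.
    # The stop-word list is illustrative; stemming is omitted for brevity.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

    def tokenize(text):
        """Lowercase, split on non-alphanumeric runs, and drop stop words."""
        words = re.split(r"[^a-z0-9]+", text.lower())
        return [w for w in words if w and w not in STOP_WORDS]

    print(tokenize("The Quick Brown Fox jumps over the lazy dog!"))
    # ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']

The whole approach leans on whitespace and punctuation marking word boundaries, which is exactly why Chinese, Japanese, and Korean need dictionary- or n-gram-based segmenters instead.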

jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
wongarsu · 2 months ago
Notably, this is tokenization for traditional search. LLMs use very different tokenization, with very different goals.
jamesgresql · 2 months ago
100%, maybe we should do a follow-up on other types of tokenization.
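
For contrast, here's a quick sketch of what LLM-style subword (BPE) tokenization produces, via OpenAI's tiktoken library (the encoding name below is just one common choice):

    # Requires: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # one widely used BPE vocabulary

    text = "Tokenization pipelines differ."
    token_ids = enc.encode(text)                 # integer IDs for a model vocabulary
    pieces = [enc.decode([t]) for t in token_ids]

    print(token_ids)  # a short list of integers, not words
    print(pieces)     # subword pieces, often fragments rather than whole words

Search tokenizers aim at terms you can look up in an inverted index; BPE tokenizers aim at covering arbitrary text with a fixed-size vocabulary, so the two aren't interchangeable.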

u/jamesgresql

Karma: 100 · Cake day: May 17, 2022