Readit News
jamesgresql commented on Elasticsearch was never a database   paradedb.com/blog/elastic... · Posted by u/jamesgresql
jamesgresql · a month ago
I know it sounds obvious, but some people are pretty determined to use it that way!
jamesgresql commented on Teaching Postgres to Facet Like Elasticsearch   paradedb.com/blog/facetin... · Posted by u/jamesgresql
jamesgresql · 2 months ago
Hey HN! Author here. We added faceted search capabilities to our `pg_search` extension for Postgres, which is built on Tantivy (Rust's answer to Lucene). This brings Elasticsearch-style faceting directly into Postgres, with a 14x performance improvement over a CTE-based approach, by performing facet aggregations in a single BM25 index pass and making use of our columnar store.

You get the same faceting features you'd expect from a dedicated search engine while maintaining full ACID compliance. Happy to answer technical questions about the implementation!
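
For anyone new to faceting, here's a minimal sketch in Python of what a facet aggregation computes (per-value match counts for fields like "brand" or "color") using made-up documents. It's an illustration of the concept, not the pg_search API, which does this work inside the BM25 index pass and our columnar store.

    # Conceptual illustration only; documents and field names are made up.
    # A facet aggregation counts matching documents per distinct value of a field,
    # so a UI can render e.g. "brand: Acme (2)" next to the search results.
    from collections import Counter

    matches = [  # pretend these matched a BM25 query for "running shoes"
        {"id": 1, "brand": "Acme",   "color": "red"},
        {"id": 2, "brand": "Acme",   "color": "blue"},
        {"id": 3, "brand": "Zenith", "color": "red"},
    ]

    def facet_counts(docs, field):
        """Count matching documents per distinct value of `field`."""
        return Counter(doc[field] for doc in docs)

    print(facet_counts(matches, "brand"))  # Counter({'Acme': 2, 'Zenith': 1})
    print(facet_counts(matches, "color"))  # Counter({'red': 2, 'blue': 1})

Doing this inside a single index pass is what avoids the repeated scans that a CTE-per-facet approach ends up paying for.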

jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
flakiness · 2 months ago
Oh, it's good old tokenization vs. for-LLM tokenizers like SentencePiece or tiktoken. We shouldn't forget there are simple, non-ML approaches like this one that don't ask you to buy more GPUs.
jamesgresql · 2 months ago
Haha, I like “good old tokenization”
jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
nawazgafar · 2 months ago
You beat me to the punch. I wrote a blog post[1] with the exact same title last week! Though I went into a bit more detail with regard to embedding layers, so maybe my title is not accurate.

1. https://gafar.org/blog/text-to-tokens

jamesgresql · 2 months ago
Amazing, will have a read!
jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
tgv · 2 months ago
It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with richer morphology or compounding. It's a very "English" approach.
jamesgresql · 2 months ago
Chinese, Japanese, Korean, etc. don’t work like this either.

However, even though the approach is “old-fashioned”, it’s still widely used for English. I’m not sure there is a universal approach that semantic search could use that would be both fast and accurate.

At the end of the day people choose a tokenizer that matches their language.

I will update the article to make all this clearer though!
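
To make “old-fashioned” concrete, here's a toy sketch of the kind of English-oriented pipeline in question: lowercase, split on non-alphanumeric runs, drop stop words (stemming would normally follow). The stop-word list is made up for illustration; real analyzers are far more complete.

    import re

    # Toy "classic" pipeline: lowercase -> split -> stop-word filter.
    # The stop-word list is illustrative; stemming is omitted for brevity.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

    def tokenize(text):
        """Lowercase, split on non-alphanumeric runs, and drop stop words."""
        words = re.split(r"[^a-z0-9]+", text.lower())
        return [w for w in words if w and w not in STOP_WORDS]

    print(tokenize("The Quick Brown Fox jumps over the lazy dog!"))
    # ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']

The whole approach leans on whitespace and punctuation marking word boundaries, which is exactly why Chinese, Japanese, and Korean need dictionary- or n-gram-based segmenters instead.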

jamesgresql commented on From text to token: How tokenization pipelines work   paradedb.com/blog/when-to... · Posted by u/philippemnoel
wongarsu · 2 months ago
Notably, this is tokenization for traditional search. LLMs use very different tokenization, with very different goals.
jamesgresql · 2 months ago
100%, maybe we should do a follow-up on other types of tokenization.
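
For contrast, here's a quick sketch of what LLM-style subword (BPE) tokenization produces, via OpenAI's tiktoken library (the encoding name below is just one common choice):

    # Requires: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # one widely used BPE vocabulary

    text = "Tokenization pipelines differ."
    token_ids = enc.encode(text)                 # integer IDs for a model vocabulary
    pieces = [enc.decode([t]) for t in token_ids]

    print(token_ids)  # a short list of integers, not words
    print(pieces)     # subword pieces, often fragments rather than whole words

Search tokenizers aim at terms you can look up in an inverted index; BPE tokenizers aim at covering arbitrary text with a fixed-size vocabulary, so the two aren't interchangeable.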

u/jamesgresql

Karma: 100 · Cake day: May 17, 2022