I've had more luck getting it to output XML, as (1) you can imbue XML with actual language/meaning (which LLMs adore) and (2) parsers can be made more forgiving. I get why people want JSON, but to me it's a bit like trying to get a cat to swim - you might eventually succeed, but it's not their natural inclination.
I've had the same experience as well. I suspect it's due to the large presence of HTML in the training data, as part of codebases and online content
- statistical tools (including LDA and variants) define topics as coherent latent clusters of words/embeddings. These correspond to a mixture of real-world concepts, including events, topics, issues, etc. So when you apply BERTopic, you often get clusters that represent very different kinds of things on a conceptual level
- the end-to-end pipeline is very nice, especially when adding things like LLM-based cluster labeling on top. But we should not forget that this stacks many steps, each with its own implicit errors, on top of each other. It is not easy to get a transparent and robust story for why one clustering solution is better than another.
- one of the implicit choices is the default UMAP + HDBSCAN combo, which tends to find very coherent clusters but "throws out" many (up to ~50%) documents into an outlier cluster (-1). Sometimes that's not what we want, and then tuning is needed (e.g. swapping in k-means, which assigns every document to a cluster).
- random footnote: cuML for really fast BERTopic is great, but it seems to produce inferior solutions. Better to test that before putting it into production.
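The k-means point above can be sketched without pulling in the whole pipeline (in BERTopic itself you'd pass the `KMeans` instance via the `hdbscan_model` parameter). This just demonstrates the clustering step on toy stand-in data for the reduced embeddings:

```python
# Sketch: k-means as the clustering step, so no document lands in a
# -1 "noise" cluster the way it can with HDBSCAN. The toy data is a
# made-up stand-in for UMAP-reduced document embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 5))  # 200 "documents" in 5D

# KMeans assigns *every* point to one of k clusters; there is no
# outlier label, unlike HDBSCAN's -1.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)

print(sorted(set(labels)))  # all labels are >= 0
```

The trade-off is the usual one: you must pick k up front, and genuinely noisy documents get forced into some topic instead of being set aside.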
With all that said, I love that we can now use this tool and debate its merits on this level, rather than everyone implementing their own homegrown and probably bug-ridden version of it.