I've had more luck getting it to output XML, as (1) you can imbue XML with actual language/meaning (which LLMs adore) and (2) parsers can be made more forgiving. I get why people want JSON, but to me it's a bit like trying to get a cat to swim - you might eventually succeed, but it's not their natural inclination.
I've had the same experience as well. I suspect it's due to the large presence of HTML in the training data, as part of codebases and online content
- statistical tools (including LDA and variants) define topics as coherent latent clusters of words/embeddings. These correspond to a mixture of real-world concepts, including events, topics, issues, etc. So when you apply BERTopic, you often get clusters that represent very different kinds of things on a conceptual level
- the end-to-end pipeline is very nice, especially when adding things like LLM-based cluster labeling on top. But we should not forget that this stacks many steps, each with its own implicit errors, on top of each other. It is not easy to get a transparent and robust story for why one clustering solution is better than another.
- one of the implicit choices is the default UMAP + HDBSCAN combo, which tends to find very coherent clusters but "throws out" many (up to ~50%) documents into an outlier cluster (-1). Sometimes that's not what we want, and then tuning is needed (e.g. swapping in k-means, which assigns every document to a cluster).
- random footnote: cuML for really fast BERTopic is great, but it seems to produce inferior solutions. Better to test that before putting it into production.
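The k-means point above can be sketched without pulling in the whole pipeline (in BERTopic itself you'd pass the `KMeans` instance via the `hdbscan_model` parameter). This just demonstrates the clustering step on toy stand-in data for the reduced embeddings:

```python
# Sketch: k-means as the clustering step, so no document lands in a
# -1 "noise" cluster the way it can with HDBSCAN. The toy data is a
# made-up stand-in for UMAP-reduced document embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 5))  # 200 "documents" in 5D

# KMeans assigns *every* point to one of k clusters; there is no
# outlier label, unlike HDBSCAN's -1.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)

print(sorted(set(labels)))  # all labels are >= 0
```

The trade-off is the usual one: you must pick k up front, and genuinely noisy documents get forced into some topic instead of being set aside.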
With all that said, I love that we can now use this tool and debate its merits on this level, rather than everyone implementing their own homegrown and probably bug-ridden version of it.