- Embed all the text documents.
- Project to 2D using UMAP which also creates its own emergent "clusters".
- Use k-means clustering with a high cluster count depending on dataset size.
- Feed the ChatGPT API ~10 examples from each cluster and ask it to provide a concise label for the cluster.
- Bonus: Use DBSCAN to identify arbitrary subclusters within each cluster.
It is extremely effective and I have a theoetical implementation of a more practical use case to use said UMAP dimensionality reduction for better inference. There is evidence that current popular text embedding models (e.g. OpenAI ada, which outputs 1536D embeddings) are way too big for most use cases and could be giving poorly specified results for embedding similarity as a result, in addition to higher costs for the entire pipeline.
The problem is language nerds write languages for other language nerds.
They all want it to be whatever the current sexiness is in language design and want it to be self-hosting and be able to write fast multithreaded webservers in it and then it becomes conceptually complicated.
What we need is like a "Logo" for systems engineers / devops which is a simple toy language that can be described entirely in a book the size of the original K&R C book. It probably needs to be dynamically typed, have control structures that you can learn in a weekend, not have any threading or concurrency, not be object oriented or have inheritance and be functional/modular in design. And have a very easy to use FFI model so it can call out to / be called from other languages and frameworks.
The problem is that language nerds can't control themselves and would add stuff that would grow the language to be more complex, and then they'd use that in core libraries and style guides so that newbies would have to learn it all. I myself would tend towards adding "each/map" kinds of functions on arrays/hashmaps instead of just using for loops and having first class functions and closures, which might be mistakes. There's that immutable FP language for configuration which already exists (i can't google this morning yet) which is exactly the kind of language which will never gain any traction because >95% of the people using templated YAML don't want to learn to program that way.