Am I missing something?
Once knowledge is distilled, you can easily build on top of it, for example by merging concepts.
So no secret here.
The voice sounds robotic and plain, most likely because the training data contains a lot of audiobooks and little conversational speech. And dropping diffusion was not a great idea: the voice is no longer crystal clear, it sounds more like a telephony recording.
If this becomes the year when high-quality open-source TTS and ASR models appear that can run in real time on an Nvidia RTX 40x0 or 30x0, that would be great. On CPU, even better.
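On the ASR side, OpenAI's open-source Whisper (linked further down) already runs locally on consumer hardware. A minimal sketch, assuming the openai-whisper Python package and ffmpeg are installed; the model size and file path are placeholders of my own:

    # pip install openai-whisper
    import whisper

    model = whisper.load_model("base")      # "base" is small enough for a 30x0/40x0 GPU or even CPU
    result = model.transcribe("audio.wav")  # uses CUDA if available, otherwise falls back to CPU
    print(result["text"])

Whether it is truly real-time depends on the model size and hardware, but the smaller checkpoints get close on a recent GPU.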
Also note the Ethical Statement on BASE TTS:
> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.
There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.
https://github.com/openai/whisper/blob/main/language-breakdo...
https://github.com/SesameAILabs/csm/issues/80