Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with existing TTS APIs, but the results did not sound like human conversation.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch, from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open source contributions are especially welcome. Please feel free to check out the code and share any thoughts or suggestions with us.
And you are definitely right about 懶音. They are both explained in the same section not because they are the same thing but because they are both changes that occur in how sounds are pronounced.
Thank you for creating this! But I'm afraid there is a misunderstanding here: words like san1 cing2 申請 are very much everyday words, even though the reading of the character is deemed literary. You should think of characters like 請 and 聽 as just having multiple in-context pronunciations, some of which you should learn, some of which you probably don't need to.
As a Cantonese speaker, I love the effort here! However, the above isn't correct. This is an example of vernacular vs. literary pronunciation, and 請 has both pronunciations, depending on context. For instance, 請 is ceng2 when used as the verb "to invite", but cing2 in compounds like jiu1 cing2 邀請.
It shouldn't be conflated with the phenomenon later in that same paragraph about 懶音 "lazy pronunciation".
But someone of obvious Asian descent and accent who introduces themselves as “Simon Cartwright” and has vague tales of growing up in London… again, it’s possible, and we should treat each individual with respect and assumption of good intent, but that might make me dig a little deeper.
I suspect it will only be useful for emergencies as latency will be terrible, though.
Of course they could pronounce the words in any modern Chinese language, but why not pick the largest and most standardized one?