Show HN: Dia, an open-weights TTS model for generating realistic dialogue

[S1] It really sounds as if they've started using NPR to source TTS models [S2] Yeah... yeah... it's kind of disturbing (laughs dejectedly). [S3] I really wish, that they would just Stop with this.

This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.

I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with

mclau157 · 4 months ago

Realistic voice acting for audio books, realistic images for each page, realistic videos for each page, oh wait I just created a movie, maybe I can change the plot? Oh wait I just created a video game

hleszek · 4 months ago

Now do it in VR and make it fully interactive.

azinman2 · 4 months ago

Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?

Versipelle · 4 months ago

> Wouldn’t it be more desirable to hear an actual human on an audiobook? Ideally the author?

Of course, but it's not always available.

For example, I would love an audiobook for Stanisław Lem's "The Invincible," as I just finished its video game adaptation, yet it simply doesn't exist in my native language.

It's quite seldom that the author narrates the audiobooks I listen to, and sometimes the narrator does a horrible job, butchering the characters with exaggerated tones.

satvikpendem · 4 months ago

Why a human? There are many cases where I like a book but dislike the audiobook speaker, so I essentially can't listen to that book anymore. With a machine, I can tweak the voice to my heart's content.

ks2048 · 4 months ago

With 1M+ new books every year, that’s not possible for all but the few most popular.

senordevnyc · 4 months ago

Honestly, I’d say that’s true only for the author. Anyone else is just going to be interpreting the words to understand how to best convey the character / emotion / situation / etc., just like an AI will have to do. If an AI can do that more effectively than a human, why not?

The author could be better, because they at least have other info beyond the text to rely on, they can go off-script or add little details, etc.

fennecfoxy · 4 months ago

It'd be nice if there were mainstream releases on GBC/GBA/PSP again too! But apparently if there's no money in something then people don't really wanna do it.

cchance · 4 months ago

You really think people writing these papers actually have good speaking voices? LOL, theirs a reason not everyone could be an audio book maker or podcaster, a lot of peoples voices suck for audiobooks

Hey HN! We’re Toby and Jay, creators of Dia. Dia is 1.6B parameter open-weights model that generates dialogue directly from a transcript.

Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.

It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.

Demo page comparing it to ElevenLabs and Sesame-1B https://yummy-fir-7a4.notion.site/dia

We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast-feel with APIs but it did not sound like human conversations.

So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch — from large-scale training, to audio tokenization. It took us a bit over 3 months.

Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.

We’d love to hear what you think! We are a tiny team, so open source contributions are extra-welcomed. Please feel free to check out the code, and share any thoughts or suggestions with us.

dangoodmanUT · 4 months ago

I know it’s taboo to ask, but I must: where’s the dataset from? Very eager to play around with audio models myself, but I find existing datasets limiting

zelphirkalt · 4 months ago

Why would that be a taboo question to ask? It should be the question we always ask, when presented with a model and in some cases we should probably reject the model, based on that information.

xdfgh1112 · 4 months ago

I suspect podcasts, as you have a huge amount of transcribed data with good diction and mic quality. The voices sound like podcast voices to me.

gfaure · 4 months ago

Amazing that you developed this over the course of three months! Can you drop any insight into how you pulled together the audio data?

isoprophlex · 4 months ago

+1 to this, amazing how you managed to deliver this, and iff you're willing to share i'd be most interested in learning what you did in terms of train data..!

heystefan · 4 months ago

Could one usecase be generating an audiobook with this from existing books? I wonder if I could fine-tune the "characters" that speak these lines since you said it's a single pass whole the whole convo. Wonder if that's a limitation for this kind of a usecase (where speed is not imperative).

toebee · 4 months ago

Yes! But you would need to put together a LLM system that created scripts from the book content. There is an open source project called OpenNotebookLM (https://github.com/gabrielchua/open-notebooklm) that does something similar. If you hook the Dia model to that kind of system, it will be very possible :) Thanks for the interest!

smusamashah · 4 months ago

Hi! This is awesome for size and quality. I want to see a book reading example or try it myself.

This is a tangent point but it would have been nicer if it wasn't a notion site. You could put the same page on github pages and it will be much lighter to open, navigate and link (like people trying to link some audio)

toebee · 4 months ago

Thanks for the kind words! You can try it now on https://huggingface.co/spaces/nari-labs/Dia-1.6B Also, we'll try to update the Demo Page to something lighter when we have time. Thanks for the feedback :))

karimf · 4 months ago

This is super awesome. Several questions.

1. What GPU did you use to train the model? I'd love to train a model like this, but currently, I only have a 16GB MacBook. Thinking about buying a 5090 if it's worth.

2. Is it possible to use this for real time audio generation, similar to the demo on the Sesame website?

llm_nerd · 4 months ago

This is a pretty incredible three month creation for a couple of people who had no experience with speech models.

toebee · 4 months ago

Thanks for the kind words! We're just following our interests and staying upwind.

cchance · 4 months ago

Its really amazing cant wait to play with it some, the samples are great... but oddly all seem... really fast, like they'd be perfect but they feel like they're playing at 1.2x speed or is that just me?

claiir · 4 months ago

It’s not just you. The speedup is an artefact of the CFG (Classifier-Free Guidance) the model uses. The other problem is the speedup isn’t constant—it actually accelerates as the generation progresses. The Parakeet paper [1] (which OP lifted their model architecture almost directly from [2]) gives a fairly robust treatment to the matter:

> When we apply CFG to Parakeet sampling, quality is significantly improved. However, on inspecting generations, there tends to be a dramatic speed-up over the duration of the sample (i.e. the rate of speaking increases significantly over time). Our intuition for this problem is as follows: Say that is our model is (at some level) predicting phonemes and the ground truth distribution for the next phoneme occuring is 25% at a given timestep. Our conditional model may predict 20%, but because our uncondtional model cannot see the text transcription, its prediction for the correct next phoneme will be much lower, say 5%. With a reasonable level of CFG, because [the logit delta] will be large for the correct next phoneme, we’ll obtain a much higher final probability, say 50%, which biases our generation towards faster speech. [emphasis mine]

Parakeet details a solution to this, though this was not adopted (yet?) by Dia:

> To address this, we introduce CFG-filter, a modification to CFG that mitigates the speed drift. The idea is to first apply the CFG calculation to obtain a new set of logits as before, but rather than use these logits to sample, we use these logits to obtain a top-k mask to apply to our original conditional logits. Intuitively, this serves to constrict the space of possible “phonemes” to text-aligned phonemes without heavily biasing the relative probabilities of these phonemes (or for example, start next word vs pause more). [emphasis mine]

The paper contains audio samples with ablations you can listen to.

[1]: https://jordandarefsky.com/blog/2024/parakeet/#classifier-fr...

[2]: https://news.ycombinator.com/item?id=43758686

new_user_final · 4 months ago

Easily 10 times better than recent OpenAI voice model. I don't like robotic voices.

Example voices seems like over loud, over excitement like Andrew Tate, Speed or advertisement. It's lacking calm, normal conversation or normal podcast like interaction.

toebee · 4 months ago

Thank you! You can add audio prompts of calm voices to make them a bit smoother. https://huggingface.co/spaces/nari-labs/Dia-1.6B you can try it here!

nickthegreek · 4 months ago

Are there any examples of the audio differences between the this and the larger model?

toebee · 4 months ago

We're still experimenting, so do not have samples yet from the larger model. All we have is Dia-1.6B at the moment.

amp-lifier · 4 months ago

In terms of guiding voice and expression, audio prompts are promising but I believe text instructions serve different experiences as well. Will there be support for that as well?

bzuker · 4 months ago

hey, this looks (or rather, sounds) amazing! Does it work with different languages or is it English only?

toebee · 4 months ago

Thank you!! Works for English only unfortunately :((

sebstefan · 4 months ago

I inserted the non-verbal command "(pauses)" in the middle of a sentence and I think I caused it to have an aneurysm.

https://i.horizon.pics/4sEVXh8GpI (27s)

It starts with an intro, too. Really strange

antiraza · 4 months ago

That was... amazing.

abrookewood · 4 months ago

That's certainly unusual ...

[pauses] i think i heard demons

throwaway-alpha · 4 months ago

I have a hunch they're pulling data from radio shows to give it that "high quality" vibe. Tried running it through this script and hit some weird bugs too:

https://i.horizon.pics/Tx2PrPTRM3

yahoozoo · 4 months ago

The “Yeah…” followed by an uncomfortably long pause then a second “Yeah…” killed me.

degosuke · 4 months ago

It even added an extra f-word at the end. Still veeery impressive

Just noticed he says dejectedly too

bt1a · 4 months ago

You hittin them balloons again, mate ?

hemloc_io · 4 months ago

Very cool!

Insane how much low hanging fruit there is for Audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding

miki123211 · 4 months ago

Yeah, Eleven Labs must be raking it in.

You can get hours of audio out of it for free with Eleven Reader, which suggests that their inference costs aren't that high. Meanwhile, those same few hours of audio, at the exact same quality, would cost something like $100 when generated through their website or API, a lot more than any other provider out there. Their pricing (and especially API pricing) makes no sense, not unless it's just price discrimination.

Somebody with slightly deeper pockets than academics or one guy in a garage needs to start competing with them and drive costs down.

Open TTS models don't even seem to utilize audiobooks or data scraped off the internet, most are still Librivox / LJ Speech. That's like training an LLM on just Wikipedia and expecting great results. That may have worked in 2018, but even in 2020 we knew better, not to mention 2025.

TTS models never had their "Stable Diffusion moment", it's time we get one. I think all it would take is somebody doing open-weight models applying the lessons we learned from LLMs and image generation to TTS models, namely more data, more scraping, more GPUs, less qualms and less safety. Eleven Labs already did, and they're profiting from it handsomely.

pzo · 4 months ago

Kokoro gives great results especially when speaking english. Model is small enough to run even on smartphone ~3x faster than realtime.

Dead Comment

Thank you for the kind words <3

kreelman · 4 months ago

This is amazing. Is it possible to build in a chosen voice, a bit like Eleven Labs does? ...This may be on the git summary, being lazy and asking anyway :=) Thanks for your work.

JonathanFly · 4 months ago

Yes, see: https://github.com/nari-labs/dia/blob/main/example/voice_clo...

tyrauber · 4 months ago

Hey, do yourself a favor and listen to the fun example:

> [S1] Oh fire! Oh my goodness! What's the procedure? What to we do people? The smoke could be coming through an air duct!

Seriously impressive. Wish I could direct link the audio.

Kudos to the Dia team.

jinay · 4 months ago

For anyone who wants to listen, it's on this page: https://yummy-fir-7a4.notion.site/dia

mrandish · 4 months ago

Wow. Thanks for posting the direct link to examples. Those sound incredibly good and would be impressive for a frontier lab. For two people over a few months, it's spectacular.

DoctorOW · 4 months ago

A little overacted, it reminds me of the voice acting in those flash cartoons you'd see in the early days of YouTube. That's not to say it isn't good work, it still sounds remarkably human. Just silly humans :)

dostick · 4 months ago

This is an instant classic. Sesame comparison examples all sound like clueless rich people from The White Lotus.

intalentive · 4 months ago

Sounds great. One of the female examples has convincing uptalk. There must be a way to manipulate the latent space to control uptalk, vocal fry, smoker’s voice, lispiness, etc.

Thank you!! Indeed the script was inspired from a scene in the Office.

3abiton · 4 months ago

This is oddly reminiscent of the office. I wonder if tv shows were part of its training data!

nojs · 4 months ago

This is so good. Reminds me of The Office. I love how bad the other examples are.

fwip · 4 months ago

The text is lifted from a scene in The Office: https://youtu.be/gO8N3L_aERg?si=y7PggNrKlVQm0qyX&t=82

hombre_fatal · 4 months ago

Yeah, that example is insane.

Is there some sort of system prompt or hint at how it should be voiced, or does it interpret it from the text?

Because it would be hilarious if it just derived it from the text and it did this sort of voice acting when you didn't want it to, like reading a matter-of-fact warning label.

notdian · 4 months ago

made a small change and got it running on M2 Pro 16GB Macbook pro, the quality is amazing.

https://github.com/nari-labs/dia/pull/4

rahimnathwani · 4 months ago

Thank you for this! My desktop GPU has only 8GB VRAM, but my MacBook has plenty of unified RAM.

emmelaich · 4 months ago

Thanks, works well but slowly on a Mac Air M3 with 24gb. Will have to try it again after freeing up more ram as it was doing a bit of swapping with Chrome running too.

(later). It did nicely for the default example text but just made weird sounds for a "hello all" prompt. And took longer?!

Thank you for the contribution! We'll be merging PRs and cleaning code up very soon :)

noiv · 4 months ago

Can confirm, runs straight forward on 15.4.1@M4, THX.

rustc · 4 months ago

Is this Apache licensed or a custom one? The README contains this:

> This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

> This project offers a high-fidelity speech generation model *intended solely for research and educational use*. The following uses are strictly forbidden:

> Identity Misuse: Do not produce audio resembling real individuals without permission.

> ...

Specifically the phrase "intended solely for research and educational use".

Sorry for the confusion. the license is plain Apache 2.0, and we changed the wording to "intended for research and educational use." The point was, users are free to use it for their use cases, just don't do shady stuff with it.

Thanks for the feedback :)

crooked-v · 4 months ago

So is that actually part of the license (making it non-Apache 2.0), or not?

montroser · 4 months ago

Hmm, the "strictly forbidden" part seems more important than whatever are their stated intentions... Either way, it seems like it needs clarifying.

Deleted Comment

moritonal · 4 months ago

Isn't it weird how "We don't have a full list of non-verbal [commands]". Like, I can imagine why, but it's wild we're at a point where we don't know what our code can do.

kevmo314 · 4 months ago

I have a sneaking suspicion it's because they lifted the model architecture almost directly from Parakeet: https://jordandarefsky.com/blog/2024/parakeet/

Parakeet references WhisperD which is at https://huggingface.co/jordand/whisper-d-v1a and doesn't include a full list of non-speech events that it's been trained with, except "(coughs)" and "(laughs)".

Not saying the authors didn't do anything interesting here. They put in the work to reproduce the blog post and open source it, a praiseworthy achievement in itself, and they even credit Parakeet. But they might not have the list of commands for more straightforward reasons.

You're absolutely right. We used Jordan's Whisper-D, and he was generous enough to offer some guidance along the way.

It's also a valid criticism that we haven’t yet audited the dataset for existing list of tags. That’s something we’ll be improving soon.

As for Dia’s architecture, we largely followed existing models to build the 1.6B version. Since we only started learning about speech AI three months ago, we chose not to innovate too aggressively early on. That said, we're planning to introduce MoE and Sliding Window Attention in our larger models, so we're excited to push the frontier in future iterations.