Qwen3-TTS family is now open sourced: Voice design, clone, and generation

If you want to try out the voice cloning yourself you can do that on this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then paste in other text and have it generate a version of that read using your voice.

I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/

javier123454321 · 19 days ago

This is terrifying. With this and z-image-turbo, we've crossed a chasm. And a very deep one. We are currently protected by screens, we can, and should assume everything behind a screen is fake unless rigorously (and systematically, i.e. cryptographically) proven otherwise. We're sleepwalking into this, not enough people know about it.

rdtsc · 19 days ago

That was my thought too. You’d have “loved ones” calling with their faces and voices asking for money in some emergency. But you’d also have plausible deniability as anything digital can be brushed off as “that’s not evidence, it could be AI generated”.

u8080 · 18 days ago

https://www.youtube.com/watch?v=diboERFAjkE pretty much this

oceanplexian · 19 days ago

> This is terrifying.

Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle and a bit ironic that the Chinese would be the ones to release a plethora of capable open source models, instead of the scraps like we've seen from Google, Meta, OpenAI, etc.

razster · 18 days ago

I'd be a bit more worried once Z-Image Edit/Base is released. Flux.2 Klein is out and it's on par with Zit, and with some fine-tuning can just about hit Flux.2. Adding on top of that is Qwen Image Edit 2511 for additional refinement. Anything is possible. Those folks at r/StableDiffusion are falling over the possible release of Z-Image-Omni-Base, a hold-me-over until the actual base is out. I've heard it's equal to Flux.2. Crazy time.

TacticalCoder · 18 days ago

> With this and z-image-turbo, we've crossed a chasm.

And most of all: they're both local models. The cat is out of the box and it's never going back in. There's no censoring of this. No company that can pull the plug. Anyone with a semi-modern GPU can use these models.

fridder · 18 days ago

Admittedly I have not dove into it much, but I wonder if we might finally have a use case for NFTs and web3? We need some sort of way to denote that items are person-generated, not AI. Would certainly be easier than trying to determine if something is AI generated.

echelon · 19 days ago

We're going to be okay.

There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music.

Nothing was more scary than the invention of the nuclear weapon. And we're all still here.

Life will go on. And there will be incredible benefits that come out of this.

magicalhippo · 19 days ago

The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it failed at reproducing the variation in intonation, so it sounds like a monotonous reading of a boring text.

I presume this is due to using the base model, and not the one tuned for more expressiveness.

edit: Or more likely, the demo not exposing the expressiveness controls.

The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.

Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.

Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but a fair number, and this one is certainly one of the better ones in terms of voice cloning quality I've heard.
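For anyone comparing speeds, here is a minimal sketch of how a figure like "0.3X realtime" can be computed and what it implies. It assumes the figure means seconds of generated audio per second of wall-clock time; the function names and the sample numbers are illustrative and are not taken from the demo code.

```python
def realtime_speed(audio_seconds: float, wall_seconds: float) -> float:
    """Generated-audio duration divided by wall-clock generation time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def wall_time_for(audio_seconds: float, speed: float) -> float:
    """Estimated wall-clock seconds to generate a clip at a given speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return audio_seconds / speed


# Illustrative numbers only: a 10 second clip at 0.3x realtime.
print(f"{wall_time_for(10.0, 0.3):.1f} s of compute for 10 s of audio")
```

Under that reading, anything below 1.0 is too slow for live playback without buffering.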

thedangler · 18 days ago

How did you do this locally? Tools? Language?

dsrtslnd23 · 18 days ago

Any idea on the VRAM footprint for the 1.7B model? I guess it fits on consumer cards but I am wondering if it works on edge devices.
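A rough lower bound can be worked out from the parameter count alone. The sketch below is back-of-envelope arithmetic, not a measurement: it assumes the nominal sizes in the model names (1.7B and 0.6B parameters) and counts only the weights, ignoring activations, caches, the speech tokenizer and framework overhead, so real usage will be higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Memory for the weights alone, in GiB (assumption: no other overhead)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Nominal parameter counts taken from the model names (an assumption).
for name, params in [("1.7B", 1.7e9), ("0.6B", 0.6e9)]:
    print(f"{name}: ~{weight_memory_gib(params):.1f} GiB of weights at bf16")
```

By this estimate the weights of either model fit comfortably on consumer cards; whether the total footprint suits a given edge device would need to be measured.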

pseudosavant · 19 days ago

Remarkable tech that is now accessible to almost anyone. My cloned voice sounded exactly like me. The uses for this will range from good to bad and everywhere in between. A deceased grandmother reading "Goodnight Moon" to grandkids, scamming people, the ability to create podcasts with your own voice from just prompts.

_kb · 18 days ago

It's a good thing governments (https://www.ato.gov.au/online-services/voice-authentication) and banks (https://www.anz.com.au/security/how-we-protect-you/voice-id/) haven't gone all in on using voice as an authentication mechanism.

parentheses · 18 days ago

I got some errors trying to run this on my MBP. Claude was able to one-shot a fix.

```
Loaded speech tokenizer from ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e/speech_tokenizer
Fetching 11 files: 0%| | 0/11 [00:00<?, ?it/s]
Fetching 11 files: 100%|| 11/11 [00:00<00:00, 125033.45it/s]
The tokenizer you are loading from '!/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instr.... This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
```

cristoperb · 18 days ago

I cloned my voice and had it generate audio for a paragraph from something I wrote. It definitely kind of sounds like me, but I like it much better than listening to my real voice. Some kind of uncanny peak.

viraptor · 18 days ago

That weirdly makes it a canny peak though :)

bsenftner · 18 days ago

You do realize that you don't hear your real voice normally? An individual has to record their voice to hear how others hear it. What you hear when you speak includes your skull resonating, which others do not hear.

mohsen1 · 19 days ago

> The requested GPU duration (180s) is larger than the maximum allowed

What am I doing wrong?

gregsadetsky · 19 days ago

you need to login

KolmogorovComp · 18 days ago

Hello, the recording you posted does not tell much about the cloning capability without an example from your real voice.

simonw · 18 days ago

Given how easy voice cloning is with this thing I chickened out of sharing the training audio I recorded!

That's not really rational considering the internet is full of examples of my voice that anyone could use though. Here's a recent podcast clip: https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3006s

kingstnap · 18 days ago

It was fun to try out. I wonder if at some point, given a few minutes of me talking, I could make myself read an entire book to myself.

itsTyrion · 17 days ago

well that isnt concerning at all

uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play
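If you want to drive that command from a script (say, to generate several clips from one reference recording), a small wrapper is enough. This is only a sketch: it uses just the flags that appear in the command above, the function names are hypothetical, and it assumes `python -m mlx_audio.tts.generate` resolves in the environment you run it from.

```python
import shlex
import subprocess

# Model name copied from the command above.
DEFAULT_MODEL = "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16"


def build_generate_command(text, ref_audio, ref_text,
                           model=DEFAULT_MODEL, play=True):
    """Assemble the argument list for the mlx_audio TTS command shown above."""
    cmd = [
        "python", "-m", "mlx_audio.tts.generate",
        "--model", model,
        "--text", text,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
    ]
    if play:
        cmd.append("--play")
    return cmd


def generate(text, ref_audio, ref_text, **kwargs):
    """Run the command; raises CalledProcessError if the tool exits non-zero."""
    cmd = build_generate_command(text, ref_audio, ref_text, **kwargs)
    subprocess.run(cmd, check=True)


# Print the command rather than running it, so this is safe to try anywhere.
print(shlex.join(build_generate_command(
    "Hello, this is a test.",
    "path_to_audio.wav",
    "Transcript of the reference audio.",
)))
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or punctuation.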

Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5.

Although I like the model, I don't like the leadership of that company and how closed it is, how divisive they're in terms of politics.

mortsnort · 19 days ago

They were just waiting for someone in the comments to ask!

zeppelin101 · 18 days ago

Someone has to take the first step. Let's be grateful to the brave anon HN poster for stepping up.

mhuffman · 19 days ago

It really is the best way to incentivize politeness!

stuckkeys · 19 days ago

I loled hard at this. Thank you kind stranger.

pseudony · 19 days ago

Same issue (I am Danish).

Have you tested alternatives? I grabbed Open Code and a Minimax m2.1 subscription, even just the 10usd/mo one to test with.

Result? We designed a spec for a slight variation of a tool for which I wrote a spec with Claude - same problem (process supervisor tool), from scratch.

Honestly, it worked great, I have played a little further with generating code (this time golang), again, I am happy.

Beyond that, Glm4.7 should also be great.

See https://dev.to/kilocode/open-weight-models-are-getting-serio...

It is a recent case story of vibing a smaller tool with kilo code, comparing output from minimax m2.1 and Glm4.7

Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.

nunodonato · 19 days ago

I've been using GLM 4.7 with Claude Code. best of both worlds. Canceled my Anthropic subscription due to the US politics as well. Already started my "withdrawal" in Jan 2025, Anthropic was one of the few that was left

TylerLives · 19 days ago

>how divisive they're in terms of politics

What do you mean by this?

throwaw12 · 19 days ago

Dario said not nice words about China and open models in general:

https://www.bloomberg.com/news/articles/2026-01-20/anthropic...

Balinares · 19 days ago

They're supporters of the Trump administration's military, a view which is not universally lauded.

mohsen1 · 19 days ago

With a good harness I am getting similar results with GLM 4.7. I am paying for TWO! max accounts and my agents are running 24/7.

I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews but at this point GLM 4.7 usually can do the same code reviews.

If cost is an issue (for me it is, I pay out of pocket) go with GLM 4.7

imiric · 18 days ago

Your GitHub profile is... disturbing. 1,354 commits and 464 pull requests in January so far.

Regardless of how productive those numbers may seem, that amount of code being published so quickly is concerning, to say the least. It couldn't have possibly been reviewed by a human or properly tested.

If this is the future of software development, society is cooked.

amrrs · 19 days ago

Have you tried the new GLM 4.7?

davely · 19 days ago

I've been using GLM 4.7 alongside Opus 4.5 and I can't believe how bad it is. Seriously.

I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.

It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"

It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).

throwaw12 · 19 days ago

yes I did, not on par with Opus 4.5.

I use Opus 4.5 for planning, and when I reach my usage limits I fall back to GLM 4.7 only for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller model and the heavier model in Claude Code.

WarmWash · 19 days ago

The Chinese labs distill the SOTA models to boost the performance of theirs. They are a trailer hooked up (with a 3-6 month long chain) to the trucks pushing the technology forwards. I've yet to see a trailer overtake its truck.

China would need an architectural breakthrough to leap American labs given the huge compute disparity.

miklosz · 19 days ago

I have seen indeed a trailer overtake its truck. Not a beautiful view.

overfeed · 19 days ago

Care to explain how the volume of AI research papers authored by Chinese researchers[1] has exceeded US-published ones? Time-traveling plagiarism perhaps, since you believe the US is destined to lead always.

1. Chinese researchers in China, to be more specific.

aaa_aaa · 19 days ago

No, all they need is time. I am awaiting the downfall of the AI hegemony and hype with popcorn at hand.

mhuffman · 19 days ago

I would be happy with an openweight 3 month old Claude

genewitch · 18 days ago

can you point me at another free voice cloning / tts model with this fidelity and, i guess prompt adherence?

because i've been on youtube and insta, and believe me, no one else even compares, yet.

Onavo · 19 days ago

Well DeepSeek V4 is rumored to be in that range and will be released in 3 weeks.

aussieguy1234 · 18 days ago

I could say the same about grok (although given there are better models for my use cases I don't use it). What part of divisive politics are you talking about here?

sampton · 19 days ago

Every time Dario opens his mouth it's something weird.