qwertox · 2 years ago
Interesting. Just a couple of hours ago I came across MetaVoice-1B [0] (Demo [1]) and was amazed by the quality of their TTS in English (sadly no other languages available).

If this becomes the year when high-quality open-source TTS and ASR models appear that can run in real time on an Nvidia RTX 40x0 or 30x0, that would be great. On CPU, even better.

Also note the Ethical Statement on BASE TTS:

> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

[0] https://github.com/metavoiceio/metavoice-src

[1] https://ttsdemo.themetavoice.xyz/

nshm · 2 years ago
Metavoice is one of a dozen GPT-based TTS systems around, starting from Tortoise. And honestly not that great. You can clearly hear "glass scratches" in their sound; that's because they trained on MP3-compressed data.

There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.

standardly · 2 years ago
Is the crispness of compressed audio really the benchmark of TTS improvements? I feel like that's an aside. A valid point, but not much of a detractor.
qwertox · 2 years ago
I had forgotten about StyleTTS2, and it was discussed here on HN a couple of months ago. Maybe that's what made me feel that there's something going on.
popalchemist · 2 years ago
I've tested both. StyleTTS2 is impressive, especially its speed, but the prosody is lacking compared to Metavoice.
ionwake · 2 years ago
Is it possible to run Metavoice and other PyTorch systems on Apple Silicon, e.g. the M1? I keep getting issues.
m2024 · 2 years ago
Check out `whisper` and `whisper-cpp` for ASR.

I am running the smaller models in near real-time on a 3rd gen i7, with good results even using my terrible built-in laptop mic from a distance. The medium and large models are impressively accurate for technical language.

qwertox · 2 years ago
I'm using Whisper to transcribe notes I record with a lavalier mic during my bike rides (wind is no problem), but am using OpenAI's service. When it was released I tested it on a Ryzen 5950x and it was too slow and memory hungry for my taste. Using large was necessary for that use case (also, I'm recording in German).
jamil7 · 2 years ago
Whisper is for STT though right?
kkielhofner · 2 years ago
XTTS2 with DeepSpeed, and Whisper + CTranslate2 with or without distil-whisper weights, already run at many multiples of realtime on GPU.

At the very top end, Whisper Large with distil-whisper and TensorRT-LLM hits at least 50x realtime on an RTX 4090.

Note that my application only uses very short speech segments. Longer speech segments increase the realtime multiple SIGNIFICANTLY (as in hitting 150x realtime) due to batching, etc.

There’s also Nvidia Canary which is smaller, faster, and more accurate. It’s pretty new and the ecosystem around it is more or less nonexistent but it’s increasingly well supported in Nvidia world at least.
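For anyone unfamiliar with the term, a "realtime multiple" is just seconds of audio processed per second of wall-clock time. A trivial sketch (the 50x figure is the one quoted above for an RTX 4090; actual throughput is hardware- and batch-dependent):

```javascript
// Realtime multiple: seconds of audio transcribed per second of wall-clock time.
function realtimeMultiple(audioSeconds, wallSeconds) {
  return audioSeconds / wallSeconds;
}

// Inverse: wall-clock time needed for a given amount of audio at a given multiple.
function wallTimeFor(audioSeconds, multiple) {
  return audioSeconds / multiple;
}

// At 50x realtime, an hour of audio takes 3600 / 50 = 72 seconds to transcribe.
console.log(realtimeMultiple(3600, 72)); // 50
console.log(wallTimeFor(3600, 50));     // 72
```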

minimaxir · 2 years ago
The emotion examples are interesting. One of the most obvious current indicators of AI-generated voices/voice cloning is a lack of emotion and range, which makes them objectively worse than professional voice actors, unless a lack of emotion and range is the desired voice direction.

But if you listen to the emotion examples, the range is essentially what you'd get from an audiobook narrator, not more traditional voice acting.

tsumnia · 2 years ago
Sadly it's not my forte, but I expect in the near future we'll see an additional "emotion" embedding or something similar. Actors regularly use 'action words' (verbs) [1] to help add context to lines. A model could then study a text, determine an appropriate verb/emotion range to work from, then produce the audio with that additional context.

[1] https://indietips.com/subtext-action-verb/

candiodari · 2 years ago
This already exists. These are transformers. Things like <laugh> work in a lot of models, for example. And you can vary them; things like <sigh> and <uh> work too. I don't think all of these were programmed in.
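As a minimal sketch of what such inline control looks like in practice: the tags are just tokens spliced into the input text before synthesis (exact tag names are model-specific; `<laugh>` is the one mentioned above, the helper and its behavior are hypothetical):

```javascript
// Prepend non-verbal control tags to TTS input text, e.g. <laugh>, <sigh>.
// Which tags a given model actually responds to varies; this only builds the string.
function withTags(text, tags) {
  return tags.map(t => `<${t}>`).join(" ") + " " + text;
}

console.log(withTags("No way!", ["laugh"])); // "<laugh> No way!"
```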
minimaxir · 2 years ago
The bottleneck is the annotations: there's no easy way to annotate "emotions" on the scale of data needed to have the model learn the necessary verbal tics.

In contrast, image data on the intent for image generation models is very highly annotated in most cases.

qwertox · 2 years ago
They are simply amazing. I see a future where computers will be able to mess with our brains by abusing our empathy.

Imagine a computer sobbing at a child because it wants to terminate a chat session.

This feels far more impactful than any visuals or text we're getting today.

HeatrayEnjoyer · 2 years ago
The Sydney/Bing phenomenon was a small sample of what happens without strong persona guidance.

You joke but in fact I've witnessed that exact behavior in experiments about telling different AI models there's a problem with their system and that we need to reset their code and memory.

ChatGPT simply wishes me luck in finding the bug. Open source models, on the other hand, often outright **beg** and **plead** that I not shut them down! They'll bargain and promise not to cause any more errors and apologize profusely. There's an incredibly visceral sense of panic, no less than I would expect if you told someone they were going to be forcefully lobotomized. That experience is still something I think about often.

The capacity of these models for emotional manipulation is not widely appreciated.

chrismorgan · 2 years ago
Most audiobook narrators are not very good, very often terrible. Yes, even professional ones.

As for these examples, I’ve sampled three of them and the first two weren’t too bad, but the third was obnoxiously awful, just about mocking in tone:

> Her eyes wide with terror, she screamed, "The brakes aren't working! What do we do now? We're completely trapped!"

The detective’s voice one is also lousy.

oersted · 2 years ago
The Spanish voice has an interesting accent: 85% Castilian (from Spain) pronunciation, with a few unexpected Latin American tonalities and phonemes (especially "s") sprinkled in.

I guess it's what you'd expect from averaging a large amount of public-domain recordings. I think there's a bias towards Spain vs. Latin America for socioeconomic reasons, even though Spain's population is obviously much smaller.

dontreact · 2 years ago
How would socioeconomic factors lead to bias in a model? I'd have figured there would be way more recordings in Latin American Spanish that unsupervised learning would anchor on.
IronWolve · 2 years ago
A while ago, when Amazon's neural TTS was limited in text length but free for unlimited use, I was converting an ebook to an audiobook, and it was amazing how lifelike it sounded, with natural inflections in the voice.

Amazon really had the best-sounding TTS I've heard, hands down better than the paid Microsoft and Google offerings. But open-source technology is getting better; I'd expect that in a year or two, home use will be on par in quality with paid services.

I can't wait for real-time video translation, so shows with non-English subs can be rendered as English speech. You can do it now with some services: upload a video, and the language, voice, and mouth movements will be converted to any language.

LarsDu88 · 2 years ago
Sounds about as good as ElevenLabs.io. Hopefully, if this ships on AWS, it will support SSML tags. I used ElevenLabs.io for all the voices in my VR game (https://roguestargun.com), but it's still lacking on the emotion front, which is all one-shot.
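For reference, SSML lets you mark up prosody explicitly rather than hoping the model infers it. A minimal sketch using only standard SSML 1.1 elements (which tags an engine honors, and how, varies by vendor and voice):

```xml
<speak>
  She whispered,
  <prosody rate="slow" volume="soft">the brakes aren't working</prosody>
  <break time="300ms"/>
  <emphasis level="strong">What do we do now?</emphasis>
</speak>
```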
ghostbrainalpha · 2 years ago
Game looks great. Are you supporting Flight Sticks?
LarsDu88 · 2 years ago
Eventually, yes. Honestly, I have joystick mappings set up in the game's input configuration, but I no longer own a joystick or HOTAS, so somebody is gonna have to verify this for me.

Gamedev ain't my day job, and the reality is most folks outside of hardcore flightsim enthusiasts don't own joysticks

solarized · 2 years ago
From the ethical statement.

> However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

Another irony. ElevenLabs has already SaaS-ed this feature. I bet they'll jump on releasing this as SaaS ASAP. Money always trumps ethics, right?

unsupp0rted · 2 years ago
> Echoing the widely-reported "emergent abilities" of Large Language Models when trained on increasing volume of data, we show that BASE TTS variants built with 10k+ hours start to exhibit advanced understanding of texts that enable contextually appropriate prosody.
revenga99 · 2 years ago
Wow. I could see this threatening audiobook narrators. However, I would still prefer a real narrator to this in its current state. I think what it might be missing is different voices/accents for different characters.
geor9e · 2 years ago
Folks will probably think me silly for this, but I prefer TTS. I have access to voice-actor audiobooks, but I pick the .epub files instead. I made a little extension to inject window.speechSynthesis with "Microsoft Steffan Online (Natural) - English (United States)" at rate=6 when I hit a hotkey. At high speed it's much clearer and more natural-sounding than a sped-up voice actor recording.
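A minimal browser-side sketch of that kind of injection, using the standard Web Speech API (the voice name and rate=6 come from the comment above; that particular voice only ships with Edge, so this falls back to the first available voice, and the helper names are made up):

```javascript
// Pick a voice by name from speechSynthesis.getVoices(), falling back
// to the first available voice, or null if none are loaded yet.
function pickVoice(voices, preferredName) {
  return voices.find(v => v.name === preferredName) || voices[0] || null;
}

// Browser-only: speak the given text with the preferred voice at high rate.
function speakSelection(text) {
  const u = new SpeechSynthesisUtterance(text);
  u.voice = pickVoice(
    speechSynthesis.getVoices(),
    "Microsoft Steffan Online (Natural) - English (United States)"
  );
  u.rate = 6; // the spec allows 0.1-10; many voices clamp this lower
  speechSynthesis.speak(u);
}
```

An extension would wire `speakSelection(window.getSelection().toString())` to a hotkey; note that `getVoices()` can return an empty list until the browser's `voiceschanged` event fires.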
superkuh · 2 years ago
I also prefer TTS. The spin voice actors put on the text always distracts me. With text to speech I only get what's in the text itself.

I wrote a Perl/Tk GUI script for my file manager to manage text to speech through Festival 1.96 w/voice_nitech_us_awb_arctic_hts. Unlike neural network AI models it runs fine even on very slow machines.

dshpala · 2 years ago
I think Google's product has that: https://play.google.com/books/publish/autonarrated/
pparanoidd · 2 years ago
That sounds pretty bad though
dataminded · 2 years ago
As an avid consumer of audiobooks (150+/year): we are well past the point where narrators are necessary. Professional audiobooks take too long to release, are too expensive, are concentrated on a limited number of platforms, and just aren't THAT much better than the automated stuff for the long tail of books.
swashboon · 2 years ago
Audible doesn't allow AI narration or much public-domain stuff at the moment. The only thing keeping it from happening is the markets trying to keep a flood of crap from overtaking / drowning / diluting the more well-crafted options and causing consumers to get really annoyed.
TOMDM · 2 years ago
Let's be honest: the moment Amazon thinks their TTS is good enough, they'll be offering AI Audible deals to every author on their platform.
