Deep-learning text-to-speech tool for generating voices of various characters

From the about section:

> How much does maintaining the servers cost?
>
> It depends on the amount of traffic, but the minimum baseline is around several thousands of US dollars every month. This is expected as inference is very GPU intensive and a sufficient number of instances need to be spun up to handle thousands of requests coming in every minute. Everything is paid out of pocket.

Wow, impressive commitment for something that's free.

calebkaiser · 5 years ago

The price of GPU inference can be brutal, but there's a lot you can do on the infra side to improve it:

- Spot instances

- Aggressive autoscaling

- Micro batching

Together, these can reduce inference compute spend by huge amounts (90% is not uncommon). ML, especially anything involving realtime inference, is an area where effective platform engineering makes a ridiculous difference even in the earliest days.

Source: I help maintain open source ML infra for GPU inference and think about compute spend way too much https://github.com/cortexlabs/cortex
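To make the micro batching item above concrete, here is a minimal sketch of the idea: wait briefly for several requests to accumulate, then run the model once on the whole batch instead of once per request. The names `collect_batch` and `fake_model` are hypothetical, the batch size and wait window are arbitrary, and this is not how Cortex or any particular serving framework implements it.

```python
import queue
import time
from typing import List


def collect_batch(requests: "queue.Queue[str]", max_batch_size: int, max_wait_s: float) -> List[str]:
    """Block for the first request, then gather more until the batch is
    full or the wait window expires."""
    batch = [requests.get()]  # wait until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # window expired with no new request
    return batch


def fake_model(batch: List[str]) -> List[str]:
    """Stand-in for a single GPU forward pass over the whole batch (assumption)."""
    return [text.upper() for text in batch]


if __name__ == "__main__":
    pending: "queue.Queue[str]" = queue.Queue()
    for i in range(5):
        pending.put(f"request {i}")
    while not pending.empty():
        batch = collect_batch(pending, max_batch_size=4, max_wait_s=0.01)
        print(len(batch), fake_model(batch))
```

The trade-off is a small amount of added latency (at most `max_wait_s`) in exchange for fewer, larger model calls, which is usually where GPUs are most efficient.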

vsupalov · 5 years ago

Yeah, running anything related to AI involves GPU instances. An alternative is to point people to using Google Colab where you can get access to a GPU for free, but that's not a smooth end user experience for most folks.

aisofteng · 5 years ago

> running anything related to AI involves GPU instances

This is not true. A _lot_ of AI applications use algorithms such as logistic regression or random forests and don’t need GPUs - partly, of course, because GPUs are so expensive and these approaches are good enough (or more than good enough) for many applications.

Nican · 5 years ago

Out of curiosity, as I have no visibility into the infra actually required: at that cost, would it not be easier to just have a machine under a desk somewhere?

calebkaiser · 5 years ago

Not for the kind of inference running here, I'd imagine.

There are a few key reasons why most realtime inference is done on the cloud:

- Scale. Deep learning models tend to have poor latency, especially as they grow in size. As a result, you need to scale up replicas to meet demand at a way lower level of traffic than you do for a normal web app. At one point, AI Dungeon needed over 700 servers to support just thousands of concurrent players.

- Cost. Related to the above, GPUs are really expensive to buy. A g4dn.xlarge instance (the most popular AWS EC2 instance for GPU inference) is $0.526/hour on demand. To hit $3,000 per month in spend, you'd need to be running ~8 of them 24/7. Prices vary with purchasing GPUs, but you could expect 8 NVIDIA T4s to run around $20,000 at minimum, plus the cost of other components and maintenance. To be clear, that's very conservative--it's unlikely you'll get consistent traffic. What's more likely is you'll have some periods of very little traffic where you need one or two GPUs, and other high load periods where you'll need 10+.

- Chip access. Less universal of an issue, but the cloud gives you much better access to chips at lower switching costs. If NVIDIA releases a new GPU that's even better for inference, switching to it (once it's available on your cloud) will be a tweak in your YAML. If you ever switch to ASICs like AWS's Inferentia or GCP's TPUs, which in many cases give way better performance and economics than GPUs, you'll also naturally have to be on their cloud.
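The cost arithmetic above can be checked with a quick back-of-the-envelope script. It uses only the figures quoted in the comment ($0.526/hour, $3,000/month, $20,000 for 8 T4s) and assumes a 30-day month; the break-even figure ignores power, other components, maintenance, and the uneven traffic mentioned above, so treat it as a rough sketch rather than a real comparison.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, instances running 24/7


def monthly_cost(hourly_rate: float, instances: int, hours: float = HOURS_PER_MONTH) -> float:
    """On-demand cost of running `instances` machines for a full month."""
    return hourly_rate * hours * instances


def instances_for_budget(budget: float, hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """How many always-on instances a monthly budget pays for."""
    return budget / (hourly_rate * hours)


def break_even_months(hardware_cost: float, monthly_cloud_cost: float) -> float:
    """Months of cloud spend that equal the up-front hardware price
    (ignores power, other components, and maintenance)."""
    return hardware_cost / monthly_cloud_cost


if __name__ == "__main__":
    rate = 0.526  # g4dn.xlarge on-demand price quoted in the comment
    print(f"one instance:      ${monthly_cost(rate, 1):,.2f}/month")
    print(f"instances for $3k: {instances_for_budget(3000, rate):.1f}")
    eight = monthly_cost(rate, 8)
    print(f"eight instances:   ${eight:,.2f}/month")
    print(f"break-even vs $20k of T4s: {break_even_months(20000, eight):.1f} months")
```

Under these assumptions one instance comes to roughly $379 per month, so $3,000 buys a little under 8 of them, and $20,000 of GPUs corresponds to about six and a half months of that spend.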

However, there is a lot that can be done to lower the cost of inference even in the cloud. I listed some things in a comment higher up, but basically, there are some assumptions you can make with inference that allow you to optimize pretty hard on instance price and autoscaling behavior.

mickof · 5 years ago

You just sort of assume that this is correct? The person[1] running this comes across as a severely unstable character, that number is probably hyperbole.

[1] https://twitter.com/fifteenai

15ai · 5 years ago

Not a hyperbole – I can provide proof if you'd like.

nmfisher · 5 years ago

I’ve worked with deep learning models enough to know the cost of running GPU inference, and if the live queue stats published on the website are accurate, then thousands of dollars per month is certainly plausible.

I have no reason to disbelieve it.

hooloovoo_zoo · 5 years ago

It seems like one could get to those numbers pretty easily given the prices for GPU instances on AWS. Even just one decent-sized instance would be thousands of dollars per month.

I don't usually expect much from demos like this, but I'm kind of surprised how impressive the results currently are. They're definitely not perfect, you're definitely getting some odd clipping and noise, but this shows a large amount of promise.

Being able to generate voices for games would enable a lot of interesting indie projects. IMO people should be paying more attention to the market implications of products like this than to the social implications. There are a lot of projects that just aren't really feasible right now that could be if this kind of technology was more polished and generally available for commercial/self-hosted use. And in those cases, you don't even need to do inference, makers will likely be willing to mark up their scripts themselves.

Anyway I digress. Congrats, this is really cool!

Pfhreak · 5 years ago

> people should be paying more attention to the market implications of products like this than to the social implications.

People will absolutely suffer harm from this tech, but hey, think about the dollars that could be made! No, we should absolutely be paying more attention to the social implications.

danShumway · 5 years ago

Eh, this technology currently falls very squarely into the category of "almost good enough that I could use it for a creative project, but not nearly good enough that you're going to be able to convince me that the results aren't generated."

I'm not primarily interested in the dollars, I'm interested in allowing communities to do creative things. I think people are looking at this tech like it's only going to be used for deepfakes, and they're underestimating the extent it's going to be used to create voice-acted game mods, animations, anonymization tools, and other creative/helpful projects.

If you're really worried about this stuff though, you can take some comfort in the fact that by far the worst examples on the site are of real-world voices. This is currently technology that as far as I can see is far more suited for generating new voices or voicing cartoon characters with well-defined patterns/inflections than it is for imitating the president.

C19is20 · 5 years ago

Musicians Union?

Ajedi32 · 5 years ago

I wonder if there are any legal concerns with using the voices of well known characters/actors like this in a commercial context.

danShumway · 5 years ago

I don't think a voice can be copyrighted, but IANAL so you shouldn't bank on that.

If a voice could be copyrighted, or if this was a trademark issue or something, I strongly suspect that this site would not fall under fair use regardless of whether or not it was commercial. But again, IANAL, so I don't feel confident making any kind of strong claim about that either.

The results are really impressive. At the moment I'm considering spending a low 3-figure amount for a professionally spoken intro for a new podcast. Some of the lines I generated are in my top 5 easily, human speakers don't have a lot of edge for short generic blurbs of text anymore it seems.

kebman · 5 years ago

Pretty cool! I tried it with this small dialogue, and then edited together two voices in Reaper from the downloads:

Bob: “Hello, John.”

John: “Oh, hello there, Bob.”

Bob: “Yes, hello. It's what I said. Why do you keep repeating what I say, John?”

John: “I didn't repeat you! I merely said hello, you dimwit!”

Bob: “There you go, being condescending again. Fuck you!”

John: “What? You're the one who started it!”

Try it yourself, or write something different. Either way, good fun!

demonictoaster · 5 years ago

The security implications of this kind of tech are scary. Going forward it will become really easy to reproduce the voice of anyone! It seems not a lot of training data is required to achieve reasonable results (e.g. SpongeBob is just 27 min of voice, Half-Life Black Mesa Announcer is just 1.9 min!!). This stuff could be easily leveraged for scams and deep fakes (along with deep learning models that could also tweak lip movements to match the voice for example). Thankfully, there is also a very active area of research that leverages similar tech to detect deep fakes.

dschooh · 5 years ago

These kinds of discussions are common with articles about deep fake video and audio. While I do not disagree with your point, here are two quick thoughts:

- We have had perfect image manipulation capabilities for quite some time now. We have had written text manipulation capabilities for hundreds of years.

- People will continue to believe what they believe, whether there is deep fake video and audio or not.

Agree with you. Hopefully people are more and more aware that they cannot trust anything out there. We are soon reaching a point where we can make anyone say anything we want, including in audio and video format.

spyder · 5 years ago

It's already happening:

A Voice Deepfake Was Used To Scam A CEO Out Of $243,000:

https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice...

Baeocystin · 5 years ago

The fact that you included Chell as a voice choice (and 'generated' a null audio clip to boot) earns a chuckle. The quality of the voices across the board earns wide eyes and an eyebrow raise. Thanks for sharing this, it's remarkable work.

high_byte · 5 years ago

GLaDOS hahahaha this is just... perfect. Stanley Parable Narrator funny you should mention this.

SommaRaikkonen · 5 years ago

Welp, after messing around with a few voices I was completely impressed with GLaDOS's. This is really cool because I have no idea how the character's voice was synthesized, but apparently ML can do it for me so props to that.

smrq · 5 years ago

I'm pretty sure the real GLaDOS voice effect is mostly pitch correction and formant shifting. You can do it with Melodyne at least (which, to be fair, is also computer magic-- just a different kind than this one!)

I just found a video on YT with an example of recreating this in Melodyne: https://youtu.be/1oQn66gvwKA

giantrobot · 5 years ago

GLaDOS was voiced by a real person [0]. Her voice had some effects added but mostly just her trying to sound like a computer.

[0] http://ellenmclain.net/

aksss · 5 years ago

My favorite is Carl Brutananadilewski, but I just ended up making him say actual phrases from ATHF in the end. Was hoping to see Meatwad as a character option.

SV_BubbleTime · 5 years ago

Is the author being cute putting Chell from Portal and Freeman from Half-life in there, and then there is no audio? It would be a weird oversight if not intentional because the author is clearly familiar with Valve games.