Sounds like Mort from Family Guy.
Lol
Such hardware is not general-purpose, and upgrading the model would not be possible, but there are plenty of use cases where this is reasonable.
It doesn't sound that good. Excellent technical achievement, and it may well keep improving! But for now I can't use it for consumer-facing applications.
Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX
Performance Results:
Initial Latency: ~315ms for short text
Audio Generation Speed (seconds of audio per second of processing):
- Short text (12 chars): 3.35x realtime
- Medium text (100 chars): 5.34x realtime
- Long text (225 chars): 5.46x realtime
- Very Long text (306 chars): 5.50x realtime
Findings:
- Model loads in ~710ms
- Generates audio at ~5x realtime speed (excluding initial latency)
- Performance is consistent across different voices (4.63x - 5.28x realtime)
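For anyone who wants to reproduce numbers like these: the "x realtime" figure is just seconds of audio produced divided by seconds of wall-clock processing. Below is a rough sketch of that measurement, not the script used for the results above. `realtime_factor`, `fake_synthesize`, and the 24 kHz `SAMPLE_RATE` are all assumptions made up for illustration; swap in the real TTS call and the model's actual sample rate.

```python
import time
from typing import Callable, Sequence

# Assumed output sample rate in Hz; replace with the real model's value.
SAMPLE_RATE = 24_000


def realtime_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int = SAMPLE_RATE,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return seconds of audio generated per second of processing."""
    start = clock()
    samples = synthesize(text)  # hypothetical TTS call returning raw samples
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    audio_seconds = len(samples) / sample_rate
    return audio_seconds / elapsed


def fake_synthesize(text: str) -> list:
    """Stand-in for a real model: returns silence, about 0.06 s per character."""
    return [0.0] * int(len(text) * 0.06 * SAMPLE_RATE)


if __name__ == "__main__":
    for label, text in [("short", "Hello world!"), ("medium", "x" * 100)]:
        rtf = realtime_factor(fake_synthesize, text)
        print(f"{label} ({len(text)} chars): {rtf:.2f}x realtime")
```

Since this times the whole call, any fixed startup cost is included, which would be one explanation for short texts scoring lower than long ones.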
I use TTS on my phone regularly and recently tried a new project on F-Droid called SherpaTTS, which grabs some models from Hugging Face. They're super heavy (the phone suspends other apps to disk while it runs) and sound good, but the first news article already had one or two mispronunciations, because the model guesses how to say uncommon or new words instead of turning text into speech with fixed, logical rules.
Google and Samsung each have a TTS engine pre-installed on my device, and those sound and work fine. A tad monotonous, but they seem to always pronounce things the same way, so you can always work out what the text said.
Espeak (or espeak-ng) sounds the absolute worst, but after 30 seconds of listening closely you get used to it and can understand everything fine. I don't know if it's the best open source option (there are probably others I should be trying), but it's at least the most reliable: you can always follow what is being said, and you can install it on any device without licensing issues.