This release supports English text-to-speech applications in eight voices: four male and four female. The model is quantized to int8 + fp16 and uses ONNX for its runtime. It is designed to run literally anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required!
We're releasing this to give early users a sense of the latency and voices that will be available in our next release (hopefully next week). We'd love your feedback! Just FYI, this model is an early checkpoint trained on less than 10% of our total data.
We started working on this because existing expressive OSS models require big GPUs to run them on-device and the cloud alternatives are too expensive for high frequency use. We think there's a need for frontier open-source models that are tiny enough to run on edge devices!
https://www.youtube.com/watch?v=60Dy3zKBGQg
It doesn't sound so good. Excellent technical achievement and it may just improve more and more! But for now I can't use it for consumer facing applications.
The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.
I use TTS on my phone regularly and recently also tried a new project on F-Droid called SherpaTTS, which grabs some models from Huggingface. They're super heavy (the phone suspends other apps to disk while this runs) and sound good, but in the first news article there were already one or two mispronunciations, because it guesses how to say uncommon or new words rather than turning text into speech with logical rules.
Google and Samsung each have a TTS engine pre-installed on my device, and those sound and work fine. A tad monotonous, but they seem to always pronounce things the same way, so you can always work out what the text said.
Espeak (or -ng) is the absolute worst, but after 30 seconds of listening closely you get used to it and can understand everything fine. I don't know if it's the best open source option (probably there are others that I should be trying) but it's at least the most reliable where you'll always get what is happening and you can install it on any device without licensing issues
Such hardware is not general-purpose, and upgrading the model would not be possible, but there's plenty of use-cases where this is reasonable.
Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX
It sounds ok, but impressive for the size.
We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:
https://en.wikipedia.org/wiki/Software_Automatic_Mouth
https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)
I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)
Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec
"This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."
It takes a while until it starts generating sound on my i7 cores but it kind of works.
This also works:
"blah. bleh. blih. bloh. blyh. bluh."
So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
(seems reverted now)
Doesn't seem to work with Thai.
Plus, Python software is dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)
Have you seen the code[1] in the repo? It uses phonemizer[2] which is GPL-3.0 licensed. In its current state, it's effectively GPL licensed.
[1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...
[2]: https://github.com/bootphon/phonemizer
Edit: It looks like I replied to an LLM generated comment.
And it isn't something you can fix, because the model was trained on bad phonemes (everyone uses Whisper + then phonemizes the text transcript).
If my MIT-licensed one-line Python library has this line of code…
…I’m not suddenly subject to bash’s licensing. For anyone wanting to run my stuff though, they’re going to need to make sure they themselves have bash installed. (But, to argue against my own point, if an OS vendor ships my library alongside a copy of bash, do they have to now relicense my library as GPL?)
[0]: https://www.gnu.org/licenses/license-list.html#apache2
Morals may stop you but other than that? IMHO all open source code is public domain code if anyone is willing to spend some AI tokens.
eSpeak NG's data files take about 12 MB (multi-lingual).
I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.
$ ls -lh /usr/bin/flite
Listed as 27K last I checked.
I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3
What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?
I am curious how fast this is on CPU only.
On another machine the Python version is too new, and the package and its dependencies don't want to install.
https://github.com/KittenML/KittenTTS/pull/21
https://github.com/KittenML/KittenTTS/pull/24
https://github.com/KittenML/KittenTTS/pull/25
If you have `uv` installed, you can try my merged ref that has all of these PRs (and #22, a fix for short generation being trimmed unnecessarily) with
I found the TTS a bit slow so I piped the output into ffplay with 1.2x speedup to make it sound a bit better
https://docs.astral.sh/uv/guides/tools/
uv installation:
https://docs.astral.sh/uv/getting-started/installation/
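For anyone wanting to reproduce the ffplay speed-up mentioned above, here is a minimal Python sketch. It assumes ffplay (from FFmpeg) is on your PATH and that you already have the synthesized audio as WAV bytes from whatever TTS call you use; the helper names `ffplay_command` and `play_wav_bytes` are made up for illustration and are not part of KittenTTS. It only uses the `atempo` audio filter, which changes speed without changing pitch.

```python
import subprocess


def ffplay_command(tempo: float = 1.2) -> list[str]:
    """Build an ffplay argv that reads audio from stdin and plays it faster.

    The 0.5-2.0 bound is a conservative range for the atempo filter
    (an assumption that is safe for older FFmpeg builds).
    """
    if not 0.5 <= tempo <= 2.0:
        raise ValueError("tempo must be between 0.5 and 2.0")
    return [
        "ffplay",
        "-nodisp",              # no video window, audio only
        "-autoexit",            # quit when playback finishes
        "-loglevel", "error",   # keep the terminal quiet
        "-af", f"atempo={tempo}",
        "-i", "-",              # read the input from stdin
    ]


def play_wav_bytes(wav_bytes: bytes, tempo: float = 1.2) -> None:
    """Pipe already-synthesized WAV data into ffplay (hypothetical helper)."""
    subprocess.run(ffplay_command(tempo), input=wav_bytes, check=True)
```

Calling `play_wav_bytes(data)` with the bytes of a generated WAV file should play it back at 1.2x; pass a different `tempo` to taste.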
This package is the epitome of dependency hell.
Seriously, stick with piper-tts.
Easy to install, 50MB gives you excellent results and 100MB gives you good results with hundreds of voices.
With no other language are you expected to maintain several entirely different versions of the language, each of which is a relatively large installation. Can you imagine if we all had five different llvms or gccs just to compile five different modern C projects?
I'm going to get downvoted to oblivion, but it doesn't change the reality that Python in 2025 is unnecessarily fragile.
> if we all had five different llvms or gccs
Oof, those are poor examples. Most compilers using LLVM other than clang do ship with their own LLVM patches, and cross-compiling with GCC does require installing a toolchain for each target.
Yes, because all I have to do is look at the real world.
Anyway, I think I'll stick with Festival 1.96 for TTS. It's super fast even on my core2duo and I have exactly zero chance of getting this Python 3'ish script to run on any machine with an OS older than a handful of years.
I send you a 500kb Windows .exe file and claim it runs literally everywhere.
Would it be ignorant to say anything against it because of its size?
Now, RISC architectures are much more common, so instead of the rare 68K Apple/Amiga/etc computer that existed at the time, it's super common to want to run software on an ARM or occasionally RISC-V processor, so writing in x86 assembly language would require emulation, making for worse performance than a compiled language.
To make the setup easier and add a few features people are asking for here (like GPU support and long text handling), I built a self-hosted server for this model: https://github.com/devnen/Kitten-TTS-Server
The goal was a setup that "just works" using a standard Python virtual environment to avoid dependency conflicts.
The setup is just the standard git clone, pip install in a venv, and python server.py.
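To illustrate what "long text handling" usually means for a small TTS model, here is a rough Python sketch of sentence-based chunking. This is an assumption about the general technique, not the code used in Kitten-TTS-Server; the function name `split_into_chunks` and the 300-character default are hypothetical. The idea is to split on sentence-ending punctuation and pack sentences into chunks that stay under a size limit, so each chunk can be synthesized separately and the audio concatenated afterwards.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence (a simplification for this sketch).
    """
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then go through the model on its own; how the server actually joins the audio is not shown here.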
ONNX runtime is a single library, with C#'s package being ~115MB compressed.
Not tiny, but usually only a few lines to actually run and only a single dependency.
Which is completely reasonable imho, but obviously comes with tradeoffs.
Aside: Are there any models for understanding voice to text, fully offline, without training?
I will be very impressed when we are able to have a conversation with an AI at a natural rate and not "probe, space, response".
My mid-range AMD CPU is multiple times faster than realtime with parakeet.
OpenAI's whisper is a few years old and pretty solid.
https://github.com/openai/whisper
[0]: https://github.com/openai/whisper/discussions/679
[1]: https://github.com/openai/whisper/discussions/928
[2]: https://github.com/openai/whisper/discussions/2608
Average duration per generation: 1.28 seconds
Characters processed per second: 30.35
--
"Um"
Average duration per generation: 0.22 seconds
Characters processed per second: 9.23
--
"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."
Average duration per generation: 2.25 seconds
Characters processed per second: 35.04
--
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 80
model name : AMD Ryzen 7 5800H with Radeon Graphics
stepping : 0
microcode : 0xa50000c
cpu MHz : 1397.397
cache size : 512 KB
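If you want comparable numbers on your own CPU, figures like "average duration per generation" and "characters processed per second" can be measured with a small harness along these lines. This is a sketch, not the script that produced the results above: `synthesize` is a placeholder for whatever callable turns text into audio in your setup, and the run count is arbitrary.

```python
import time
from typing import Callable


def chars_per_second(text: str, avg_seconds: float) -> float:
    """Characters of input text processed per second of generation time."""
    if avg_seconds <= 0:
        return float("inf")
    return len(text) / avg_seconds


def benchmark(synthesize: Callable[[str], object], text: str, runs: int = 5) -> tuple[float, float]:
    """Time `runs` generations and return (average seconds, chars per second)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)  # placeholder: your actual TTS call goes here
        durations.append(time.perf_counter() - start)
    avg = sum(durations) / len(durations)
    return avg, chars_per_second(text, avg)
```

Note that very short inputs such as "Um" will show a low characters-per-second figure because fixed per-call overhead dominates, which matches the pattern in the numbers above.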
I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.