Posted by u/divamgupta 24 days ago
Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model (github.com/KittenML/Kitte...)
Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. We are excited to launch a preview of our smallest model, which is less than 25 MB. This model has 15M parameters.

This release supports English text-to-speech applications in eight voices: four male and four female. The model is quantized to int8 + fp16, and it uses ONNX for runtime. The model is designed to run literally anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required!

We're releasing this to give early users a sense of the latency and voices that will be available in our next release (hopefully next week). We'd love your feedback! Just FYI, this model is an early checkpoint trained on less than 10% of our total data.

We started working on this because existing expressive OSS models require big GPUs to run them on-device and the cloud alternatives are too expensive for high frequency use. We think there's a need for frontier open-source models that are tiny enough to run on edge devices!

mlboss · 24 days ago
seligman99 · 24 days ago
And a quick video with all of the different voices:

https://www.youtube.com/watch?v=60Dy3zKBGQg

a96 · 22 days ago
Thanks. I really would not want to listen to any of these regularly.
tracker1 · 23 days ago
Cool, thanks... aside: the last male voice sounds high/drunk.


Eduard · 23 days ago
thank you!
smusamashah · 24 days ago
The reddit video is awesome. I don't understand how people are calling it an OK model. Under 25MB and cpu only for this quality is amazing.
soasme · 10 days ago
Just made a TTS tool based on Kitten TTS, fully browser based, no Python server backend: https://quickeditvideo.com/tts/ A tts model of this size should be industry standard!
Retr0id · 24 days ago
The people calling it "OK" probably tried it for themselves. Whatever model is being demoed in that video is not the same as the 25MB model they released.
sergiotapia · 24 days ago
https://vocaroo.com/1njz1UwwVHCF

It doesn't sound so good. Excellent technical achievement and it may just improve more and more! But for now I can't use it for consumer facing applications.


Zardoz84 · 24 days ago
Sounds very clear. For a non native english speaker like me, it's easy to understand.
tapper · 24 days ago
Sounds slow and like something from an anime
ricardobeat · 24 days ago
Speech speed is always a tunable parameter and not something intrinsic to the model.

The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.

numpad0 · 24 days ago
The only real questions are which Chinese gacha game they ripped data from and whether they used Claude Code or Gemini CLI for Python code. I bet one can get a formant match from output this much overfit to whatever data. This isn't going to stay up for long.


KaiserPro · 24 days ago
was it cross trained on futurama voices?
junon · 24 days ago
That would be a feature!
archon810 · 24 days ago
Sounds like Mort from Family Guy.
divamgupta · 23 days ago
It was not
Aachen · 24 days ago
Impressive technical achievement, but in terms of whether I'd use it: oof, that male voice is like one of these fake-excited newsreaders. Like they're always at the edge of their breath. The female one is better but still someone reading out an advertisement for a product they were told they must act extra excited for. I assume this is what the majority of training data was like and not an intentional setting for the demo. Unsure whether I could get used to that

I use TTS on my phone regularly and recently also tried this new project on F-Droid called SherpaTTS, which grabs some models from Huggingface. They're super heavy (the phone suspends other apps to disk while this runs) and sound good, but in the first news article there were already one or two mispronunciations because it's guessing how to say uncommon or new words and it's not based on logical rules anymore to turn text into speech

Google and Samsung have each a TTS engine pre-installed on my device and those sound and work fine. A tad monotonous but it seems to always pronounce things the same way so you can always work out what the text said

Espeak (or -ng) is the absolute worst, but after 30 seconds of listening closely you get used to it and can understand everything fine. I don't know if it's the best open source option (probably there are others that I should be trying) but it's at least the most reliable where you'll always get what is happening and you can install it on any device without licensing issues

willwade · 24 days ago
If anyone else wants to try sherpaOnnx, you can try this: https://github.com/willwade/tts-wrapper We recently added the Kokoro models, which should sound a lot better. There are a LOT of models to choose from. I have a feeling the Droid app isn't handling cold starts very well.
divamgupta · 23 days ago
Thanks a lot for the detailed feedback. We are working on some models which do not use a phonemizer
bornfreddy · 23 days ago
RHvoice is pretty good, imho.
nine_k · 24 days ago
I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.
WhyNotHugo · 24 days ago
Dedicated single-purpose hardware with models would be even less energy-intensive. It's theoretically possible to design chips which run neural networks and alike using just resistors (rather than transistors).

Such hardware is not general-purpose, and upgrading the model would not be possible, but there's plenty of use-cases where this is reasonable.

amelius · 24 days ago
But resistors are, even in theory, heat dissipating devices. Unlike transistors, which can in theory be perfectly on or off (in both cases not dissipating heat).
regularfry · 23 days ago
It's theoretically possible, but physical "neurons" are a terrible idea. The number of connections between two layers of an FF net is the product of the number of neurons in each, so routing makes every other problem a rounding error.
divamgupta · 23 days ago
The thing is that the new models keep coming every day. So it’s economically not feasible to make chips for a single model
theshrike79 · 24 days ago
This is what Apple is envisioning with their SLMs, like having a model specifically for managing calendar events. It doesn't need to have the full knowledge of all humanity in it - just what it needs to manage the calendar.
koolala · 24 days ago
Issue is they're envisioning everyone only using Apple products.
throwaway28733 · 24 days ago
Apple's hardware is notoriously overpriced, so I don't think they're envisioning that at all.
depingus · 24 days ago
Hmm. A pay once (or not at all) model that can run on anything? Or a subscription model that locks you in, and requires hardware that only the richest megacorps can afford? I wonder which one will win out.
tracker1 · 23 days ago
The popular one.
divamgupta · 24 days ago
That is our vision too!
divamgupta · 23 days ago
This is our goal too.
rohan_joshi · 24 days ago
yeah totally. the quality of these tiny models is only going to go up.
peanut_merchant · 24 days ago
I ran some quick benchmarks.

Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX

  Performance Results:

  Initial Latency: ~315ms for short text

  Audio Generation Speed (seconds of audio per second of processing):
  - Short text (12 chars): 3.35x realtime
  - Medium text (100 chars): 5.34x realtime
  - Long text (225 chars): 5.46x realtime
  - Very Long text (306 chars): 5.50x realtime

  Findings:
  - Model loads in ~710ms
  - Generates audio at ~5x realtime speed (excluding initial latency)
  - Performance is consistent across different voices (4.63x - 5.28x realtime)
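
For anyone who wants to compare their own numbers against these, here is a minimal sketch of the arithmetic behind an "x realtime" figure. The function names and the 10-second clip are my own illustration, not part of the benchmark above.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio produced per second of processing (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, factor: float,
                    initial_latency: float = 0.0) -> float:
    """Estimated wall-clock time to synthesize a clip at a given realtime factor."""
    return initial_latency + audio_seconds / factor


# Hypothetical clip: 10 s of audio at roughly 5x realtime plus ~315 ms startup.
print(round(processing_time(10.0, 5.0, initial_latency=0.315), 3))  # 2.315
```

A factor of 1x means the machine only just keeps up with playback, so anything below that cannot stream without buffering first.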

divamgupta · 23 days ago
Thanks for running the benchmarks. Currently the models are not optimized yet. We will optimize loading etc when we release an SDK meant for production :)
don-bright · 23 days ago
on my Intel(R) Celeron(R) N4020 CPU @ 1.10GHz it takes 6 seconds to import/load and text generation is roughly 1x realtime on various lengths of text.
Jotalea · 23 days ago
thanks for testing on the same hardware as mine, before me.
blopker · 24 days ago
Web version: https://clowerweb.github.io/kitten-tts-web-demo/

It sounds ok, but impressive for the size.

nine_k · 24 days ago
Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough if the speech is easy to recognize.
userbinator · 24 days ago
> A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations[...]it'd be enough if the speech is easy to recognize.

We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:

https://en.wikipedia.org/wiki/Software_Automatic_Mouth

https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)

roywiggins · 24 days ago
This one is at least an interesting idea: https://genderlessvoice.com/
mfro · 24 days ago
In the Culture novels, Iain Banks imagines that we would become uncomfortable with the uncanny realism of transmitted voices / holograms, and intentionally include some level of distortion to indicate you're speaking to an image
incone123 · 24 days ago
Depends on the movie. Ash and Bishop in the Alien franchise sound human until there's a dramatic reason to sound more 'robotic'.

I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)

Twirrim · 24 days ago
> I don't expect a smart toaster to talk like a BBC host;

Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec

looperhacks · 24 days ago
I remember that the novelization of The Fifth Element describes that the cops are taught to speak as robotically as possible when using speakers for some reason. Always found the idea weird that someone would _want_ that
addandsubtract · 24 days ago
If you're on a Mac, you can type "say [thing to say]" into your terminal.
msgodel · 24 days ago
I personally prefer the older synthetic voices for TTS when the text is coming from software or a language model.
bkyan · 24 days ago
I got an error when I tried the demo with 6 sentences, but it worked great when I reduced the text to 3 sentences. Is the length limit due to the model or just a limitation for the demo?
divamgupta · 24 days ago
Currently we don't have chunking enabled yet. We will add it soon. That will remove the length limitations.
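
For illustration only, here is a rough sketch of what sentence-level chunking could look like on the caller's side until then. This is not the project's implementation; `chunk_text` and the 200-character limit are made-up names and values, and a single sentence longer than the limit is passed through whole.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("blah. bleh. blih. bloh. blyh. bluh.", max_chars=12))
# ['blah. bleh.', 'blih. bloh.', 'blyh. bluh.']
```

Each chunk would then be synthesized separately and the audio concatenated, which also lets playback start before the whole text is processed.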
cess11 · 24 days ago
Perhaps a length limit? I tried this:

"This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."

It takes a while until it starts generating sound on my i7 cores but it kind of works.

This also works:

"blah. bleh. blih. bloh. blyh. bluh."

So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.


Retr0id · 24 days ago
I tried to replicate their demo text but it doesn't sound as good for some reason.

If anyone else wants to try:

> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.

cortesoft · 24 days ago
Is the demo using the not smallest model?
quantummagic · 24 days ago
Doesn't work here. Backend module returns 404 :

https://clowerweb.github.io/node_modules/onnxruntime-web/dis...

Retr0id · 24 days ago
Looks like this commit 15 minutes ago broke it https://github.com/clowerweb/kitten-tts-web-demo/commit/6b5c...

(seems reverted now)

itake · 24 days ago
> Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape

Doesn't seem to work with Thai.

nxnsxnbx · 24 days ago
Thanks, I was looking for that. While the reddit demo sounds ok, even though on a level we reached a couple of years ago, all TTS samples I tried were barely understandable at all
divamgupta · 24 days ago
This is just an early checkpoint. We hope that the quality will improve in the future.
Aardwolf · 24 days ago
On PC it's a python dependency hell but someone managed to package it in self contained JS code that works offline once it loaded the model? How is that done?
a2128 · 24 days ago
ONNXRuntime makes it fairly easy, you just need to provide a path to the ONNX file, give it inputs in the correct format, and use the outputs. The ONNXRuntime library handles the rest. You can see this in the main.js file: https://github.com/clowerweb/kitten-tts-web-demo/blob/main/m...
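
The same three steps look roughly like this with the Python `onnxruntime` package (the web demo uses the JavaScript build instead). This is a sketch, not the demo's code: the helper names are mine, and since the thread does not spell out the model's input names, the sketch asks the session for them rather than assuming any.

```python
def load_session(model_path: str):
    """Create a CPU-only inference session for an ONNX file."""
    import onnxruntime as ort  # third-party: pip install onnxruntime
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def input_names(session) -> list[str]:
    """Ask the model which input tensors it expects instead of guessing."""
    return [inp.name for inp in session.get_inputs()]


def run_model(session, feeds: dict):
    """Run one forward pass; passing None requests every output the model defines."""
    missing = set(input_names(session)) - set(feeds)
    if missing:
        raise KeyError(f"missing model inputs: {sorted(missing)}")
    return session.run(None, feeds)
```

Usage would be `session = load_session("model.onnx")` with a placeholder path, then `run_model(session, feeds)` where `feeds` maps each name from `input_names(session)` to a NumPy array of the right shape and dtype.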

Plus, Python software is dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)

scotty79 · 24 days ago
It feels like it doesn't handle punctuation well. I don't hear sentence boundaries and commas. It sounds like a continuous stream of words.
rohan_joshi · 24 days ago
yeah, this is just a preview model from an early checkpoint. the full model release will be next week which includes a 15M model and an 80M model, both of which will have much higher quality than this preview.
rldjbpin · 17 days ago
besides issues with webgpu (it is in beta fwiw), it'd be nice to increase voice speed through the setting without affecting the voice pitch.
Jotalea · 23 days ago
Using male voice 2 at 48kHz at 0.5x speed sounds a lot like Madeline's voice lines in Celeste. Seemed funny to me.
belchiorb · 24 days ago
This doesn’t seem to work on Safari. Works great on Chrome, though
divamgupta · 24 days ago
Hmm, we will look into it.


MutedEstate45 · 24 days ago
The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns voice everywhere from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.
rohan_joshi · 24 days ago
yeah, we are super excited to build tiny ai models that are super high quality. local voice interfaces are inevitable and we want to power those in the future. btw, this model is just a preview, and the full release next week will be of much higher quality, along w another ~80M model ;)
woadwarrior01 · 24 days ago
> It’s that KittenTTS is Apache-2.0

Have you seen the code[1] in the repo? It uses phonemizer[2] which is GPL-3.0 licensed. In its current state, it's effectively GPL licensed.

[1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...

[2]: https://github.com/bootphon/phonemizer

Edit: It looks like I replied to an LLM generated comment.

oezi · 24 days ago
The issue is even bigger: phonemizer is using espeak-ng, which isn't very good at turning graphemes into phonemes. In other TTS which rely on phonemes (e.g. Zonos) it turned out to be one of the key issues which cause bad generations.

And it isn't something you can fix, because the model was trained on bad phonemes (everyone uses Whisper + then phonemizes the text transcript).

jacereda · 24 days ago
gorgoiler · 24 days ago
This would only apply if they were distributing the GPL licensed code alongside their own code.

If my MIT-licensed one-line Python library has this line of code…

  run(["bash", "-c", "echo hello"])  # run is subprocess.run

…I’m not suddenly subject to bash’s licensing. For anyone wanting to run my stuff though, they’re going to need to make sure they themselves have bash installed.

(But, to argue against my own point, if an OS vendor ships my library alongside a copy of bash, do they have to now relicense my library as GPL?)

Hackbraten · 24 days ago
Given that the FSF considers Apache-2.0 to be compatible with GPL-3.0 [0], how could the fact that phonemizer is GPL-3.0 possibly be an issue?

[0]: https://www.gnu.org/licenses/license-list.html#apache2

keyKeeper · 24 days ago
Okay, what's stopping you from feeding the code into an LLM, rewriting it, and making it yours? You can even add extra steps, like making it analyze the code block by block and then supervising it as it rewrites it. Bam. AI age IP freedom.

Morals may stop you but other than that? IMHO all open source code is public domain code if anyone is willing to spend some AI tokens.

defanor · 24 days ago
Festival's English model, festvox-kallpc16k, is about 6 MB, and it is a large model; festvox-kallpc8k is about 3.5 MB.

eSpeak NG's data files take about 12 MB (multi-lingual).

I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.

Joel_Mckay · 24 days ago
Custom voices could be added, but the speed was more important to some users.

$ ls -lh /usr/bin/flite

Listed as 27K last I checked.

I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3

pjc50 · 24 days ago
> KittenTTS is Apache-2.0

What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?

entropie · 24 days ago
I'm playing around with an NVIDIA Jetson Orin Nano Super right now and it's actually pretty usable with gemma3:4b and quite fast; even image processing is done in about 10-20 seconds, but this is with GPU support. When something is not working and ollama is not using the GPU, these calls take ages because the CPU is just bad.

I am curious how fast this is with CPU only.

phh · 24 days ago
It depends on espeak-ng which is GPLv3
ethan_smith · 23 days ago
This opens up voice interfaces for medical devices, offline language learning tools, and accessibility gadgets for the visually impaired - all markets where cloud dependency and proprietary licenses were showstoppers.
Narishma · 23 days ago
But Pi Zero has a GPU, so why not make use of it?
a96 · 22 days ago
Because then you're stuck on that device only.
CyberDildonics · 24 days ago
The github just has a few KB of python that looks like an install script. How is this used from C++ ?


antisol · 24 days ago

  System Requirements
  Works literally everywhere
Haha, on one of my machines my python version is too old, and the package/dependencies don't want to install.

On another machine the python version is too new, and the package/dependencies don't want to install.

akx · 24 days ago
I opened a couple of PRs to fix this situation:

https://github.com/KittenML/KittenTTS/pull/21
https://github.com/KittenML/KittenTTS/pull/24
https://github.com/KittenML/KittenTTS/pull/25

If you have `uv` installed, you can try my merged ref that has all of these PRs (and #22, a fix for short generation being trimmed unnecessarily) with

    uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --output output.wav --text "This high quality TTS model works without a GPU"

tetris11 · 24 days ago
Thanks for the quick intro into UV, it looks like docker layers for python

I found the TTS a bit slow so I piped the output into ffplay with 1.2x speedup to make it sound a bit better

   uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --text "I serve 12 different beers at my restaurant for over 1000000 customers" --voice expr-voice-3-m --output - | ffplay -af "atempo=1.2" -f wav -

VagabundoP · 24 days ago
Install it with uvx that should solve the python issues.

https://docs.astral.sh/uv/guides/tools/

uv installation:

https://docs.astral.sh/uv/getting-started/installation/

IshKebab · 24 days ago
Yeah some people have a problem and think "I'll use Python". Now they have like fifty problems.
77pt77 · 23 days ago
I had the "too new" problem.

This package is the epitome of dependency hell.

Seriously, stick with piper-tts.

Easy to install, 50MB gives you excellent results and 100MB gives you good results with hundreds of voices.

xena · 24 days ago
It doesn't work on Fedora because g++ isn't the right version.
trostaft · 24 days ago
Not sure if they've fixed between then and now, but I just had it working locally on Fedora.

  > g++ --version
  g++ (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
  Copyright (C) 2025 Free Software Foundation, Inc.

divamgupta · 24 days ago
We are working to fix that. Thanks
pjc50 · 24 days ago
"Fixing python packaging" is somewhat harder than AGI.
raybb · 24 days ago
Have you considered offering a uvx command to run to get people going quickly?
flanked-evergl · 24 days ago
Just point people to uv/uvx.
hahn-kev · 24 days ago
Python man
baobun · 24 days ago

    man python
There you go.

turnsout · 24 days ago
You're getting a lot of comments along the lines of "Why don't you just ____," which only shows how Stockholmed the entire Python community is.

With no other language are you expected to maintain several entirely different versions of the language, each of which is a relatively large installation. Can you imagine if we all had five different llvms or gccs just to compile five different modern C projects?

I'm going to get downvoted to oblivion, but it doesn't change the reality that Python in 2025 is unnecessarily fragile.

jhurliman · 24 days ago
That’s exactly what I have. The C++ codebases I work on build against a specific pinned version of LLVM with many warnings (as errors) enabled, and building with a different version entails a nonzero amount of effort. Ubuntu will happily install several versions of LLVM side by side or compilation can be done in a Docker container with the correct compiler. Similarly, the TypeScript codebases I work with test against specific versions of node.js in CI and the engine field in package.json is specified. The different versions are managed via nvm. Python is the same via uv and pyproject.toml.
debugnik · 23 days ago
I agree with your point, but

> if we all had five different llvms or gccs

Oof, those are poor examples. Most compilers using LLVM other than clang do ship with their own LLVM patches, and cross-compiling with GCC does require installing a toolchain for each target.

77pt77 · 23 days ago
> Can you imagine if we all had five different llvms or gccs just to compile five different modern C projects?

Yes, because all I have to do is look at the real world.

sigmoid10 · 24 days ago
There are still people who use machine wide python installs instead of environments? Python dependency hell was already bad years ago, but today it's completely impractical to do it this way. Even on raspberries.
superkuh · 24 days ago
Yep. Python stopped being Python a decade ago. Now there are just innumerable Pythons. Perl... on the other hand, you can still run any Perl script from any time on any system Perl interpreter and it works! Granted, Perl is unpopular and not getting constant new features re: hardcore math/computation libs.

Anyway, I think I'll stick with Festival 1.96 for TTS. It's super fast even on my core2duo and I have exactly zero chance of getting this Python 3'ish script to run on any machine with an OS older than a handful of years.

lynx97 · 24 days ago
Debian pretty much "solved" this by making pip refuse to install packages if you are not in a venv.
yjftsjthsd-h · 23 days ago
Using venv won't save you from having the wrong version of the actual Python interpreter installed.
dzogchen · 24 days ago
Such an ignorant thing to say for something that requires 25MB RAM.
Bilal_io · 24 days ago
Not sure what the size has to do with anything.

I send you a 500kb Windows .exe file and claim it runs literally everywhere.

Would it be ignorant to say anything against it because of its size?

dlcarrier · 24 days ago
It reminds me of the costs and benefits of RollerCoaster Tycoon being written in assembly language. Because it was so light on resources, it could run on any privately owned computer, or at least anything x86, which was pretty much everything at the time.

Now, RISC architectures are much more common, so instead of the rare 68K Apple/Amiga/etc computer that existed at the time, it's super common to want to run software on an ARM or occasionally RISC-V processor, so writing in x86 assembly language would require emulation, making for worse performance than a compiled language.

exe34 · 24 days ago
system python is for system applications that are known to work together. If you need a python install for something else, there's venv or conda and then pip install stuff.


miellaby · 24 days ago
You're supposed to use venv for everything but the python scripts distributed with your os
klipklop · 24 days ago
I tried it. Not bad for the size (of the model) and speed. Once you install the massive number of libraries and things needed, we are a far cry from 25MB though. Cool project nonetheless.
devnen · 24 days ago
That's a great point about the dependencies.

To make the setup easier and add a few features people are asking for here (like GPU support and long text handling), I built a self-hosted server for this model: https://github.com/devnen/Kitten-TTS-Server

The goal was a setup that "just works" using a standard Python virtual environment to avoid dependency conflicts.

The setup is just the standard git clone, pip install in a venv, and python server.py.

k4rnaj1k · 24 days ago
Oh wow, really impressive. How long did this take you to make?
Dayshine · 24 days ago
It mentions ONNX, so I imagine an ONNX model is or will be available.

ONNX runtime is a single library, with C#'s package being ~115MB compressed.

Not tiny, but usually only a few lines to actually run and only a single dependency.

wongarsu · 24 days ago
The repository already runs an ONNX model. But the ONNX model doesn't get English text as input, it gets tokenized phonemes. The preprocessing for that is where most of the dependencies come from.

Which is completely reasonable imho, but obviously comes with tradeoffs.
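
To make "tokenized phonemes" concrete, here is a toy sketch of the last preprocessing step: mapping a phoneme string (as a phonemizer would produce) to integer IDs. The symbol table, the padding ID of 0, and the function name are all hypothetical; the real model ships its own vocabulary.

```python
# Hypothetical symbol table, for illustration only.
PAD = "$"
SYMBOLS = [PAD] + list(" .,!?") + list("abdefhijklmnoprstuvwzæðŋɑɔəɛɪʃʊʌʒθ")
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def tokenize_phonemes(phonemes: str) -> list[int]:
    """Map each known phoneme symbol to its ID, dropping unknown symbols."""
    ids = [SYMBOL_TO_ID[ch] for ch in phonemes if ch in SYMBOL_TO_ID]
    # Assumption: the sequence is wrapped in padding tokens at both ends.
    return [SYMBOL_TO_ID[PAD], *ids, SYMBOL_TO_ID[PAD]]


print(tokenize_phonemes("həloʊ"))
```

The lookup itself is tiny; the heavy dependencies sit one step earlier, in turning English text into that phoneme string.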

divamgupta · 24 days ago
We will try to get rid of dependencies.
WhyNotHugo · 24 days ago
Usually pulling in lots of libraries helps develop/iterate faster. Then can be removed later once the whole thing starts to take shape.
zelphirkalt · 24 days ago
This case might be different, but ... usually that "later" never happens.
keyle · 24 days ago
I don't mind so much the size in MB, the fact that it's pure CPU and the quality, what I do mind however is the latency. I hope it's fast.

Aside: Are there any models for understanding voice to text, fully offline, without training?

I will be very impressed when we are able to have a conversation with an AI at a natural rate and not "probe, space, response"

Dayshine · 24 days ago
Nvidia's parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for english: 10x faster than Whisper.

My mid-range AMD CPU is multiple times faster than realtime with parakeet.

colechristensen · 24 days ago
>Aside: Are there any models for understanding voice to text, fully offline, without training?

OpenAI's whisper is a few years old and pretty solid.

https://github.com/openai/whisper

Hackbraten · 24 days ago
Whisper tends to fill silence with random garbage from its training set. [0] [1] [2]

[0]: https://github.com/openai/whisper/discussions/679 [1]: https://github.com/openai/whisper/discussions/928 [2]: https://github.com/openai/whisper/discussions/2608

jiehong · 24 days ago
Voice to text fully offline can be done with whisper. A few apps offer it for dictation or transcription.
blensor · 24 days ago
"The brown fox jumps over the lazy dog.."

Average duration per generation: 1.28 seconds

Characters processed per second: 30.35

--

"Um"

Average duration per generation: 0.22 seconds

Characters processed per second: 9.23

--

"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."

Average duration per generation: 2.25 seconds

Characters processed per second: 35.04

--

processor : 0

vendor_id : AuthenticAMD

cpu family : 25

model : 80

model name : AMD Ryzen 7 5800H with Radeon Graphics

stepping : 0

microcode : 0xa50000c

cpu MHz : 1397.397

cache size : 512 KB

moffkalast · 24 days ago
Hmm, that actually seems extremely slow. Piper can crank out a sentence almost instantly on a Pi 4, which is like a sloth compared to that Ryzen, and the speech quality seems about the same at first glance.

I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.

keyle · 24 days ago
assuming most answers will be more than a sentence, 2.25 seconds is already long enough if you factor the token generation in between... and imagine with reasoning!... We're not there yet.
Teever · 24 days ago
Any idea what factors play into latency in TTS models?
divamgupta · 24 days ago
Mostly model size, and input size. Some models which use attention are O(N^2)
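
As a back-of-the-envelope illustration of the O(N^2) part (my own sketch, with a made-up function name): full self-attention compares every input token with every other token, so doubling the input length quadruples the number of pairwise scores per head.

```python
def attention_pairs(num_tokens: int) -> int:
    """Pairwise scores computed by one full self-attention head."""
    return num_tokens * num_tokens


for n in (50, 100, 200):
    print(n, attention_pairs(n))
# 50 2500
# 100 10000
# 200 40000
```

This is one reason splitting long text into shorter pieces can reduce latency for attention-based models.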