Readit News
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
wkat4242 · 4 months ago
Thank you!

I didn't realise that you custom-made that voice. Would you have some links to other out-of-the-box voices for Coqui? I'm having some trouble finding them. From the demo page, the idea seems to be that you clone someone else's voice with that engine, because I don't see any voices listed. I've never seen it before.

And yes, I switched to Kokoro now. I thought it was the default already, but then I saw there were 3 lines configuring the same thing. So that's working. Kokoro isn't quite as good as Coqui, though, which is why I'm wondering about that. I also used Kokoro on Open WebUI and I wasn't very happy with it there either. It's fast, but some pronunciation is weird. Also, it would be amazing to have bilingual TTS (English/Spanish in my case), and it looks like Coqui might be able to do that.

koljab · 4 months ago
I haven't found many Coqui finetunes so far either. I have David Attenborough and Snoop Dogg finetunes on my Hugging Face; quality is medium.

Coqui can do 17 languages. The problem with the RealtimeVoiceChat repo is turn detection: the model I use to determine whether a partial sentence indicates a turn change is trained on an English-only corpus.

koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
sabellito · 4 months ago
Every time I see these things, they look cool as hell, I get excited, then I try to get them working on my gaming PC (the one with the GPU), spend 1-2 hours fighting with Python, and give up.

Today's issue is that my Python version is 3.12 instead of <3.12,>=3.9. Installing Python 3.11 from the official website does nothing, so I give up. It's a shame that the amazing work done by people like the OP gets underused because of this mess outside of their control.

"Just use docker". Have you tried using docker on windows? There's a reason I never do dev work on windows.

I spent most of my career in the JVM and Node, and despite the issues, never had to deal with this level of incompatibility.

koljab · 4 months ago
Yes, you're absolutely right. I'll provide uv and conda support soon, especially for Windows. I'm still using Python 3.10, maybe that's the issue. You can always mail me your current problem or log an issue; I really care about opening up the repo for as many people as possible.
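As an aside, a tiny guard like the following would make the >=3.9,<3.12 constraint fail fast with a readable check. This is my own sketch, not something in the repo:

```python
import sys

# The project's constraint as described in the thread: >=3.9, <3.12
REQUIRED = ((3, 9), (3, 12))

def version_ok(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if the given (major, minor) satisfies >=3.9,<3.12."""
    low, high = REQUIRED
    return low <= version < high
```

A setup script could call this at the top and print a clear message instead of letting pip's resolver produce a cryptic error later.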
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
echelon · 4 months ago
That's excellent. Really amazing bringing all of these together like this.

Hopefully we get an open weights version of Sesame [1] soon. Keep watching for it, because that'd make a killer addition to your app.

[1] https://www.sesame.com/

koljab · 4 months ago
That would be absolutely awesome. But I doubt it, since they released a shitty version of that amazing thing they put online. I don't feel they're planning to give us their top model anytime soon.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
nardi · 4 months ago
I don't know what I'm talking about, but could you use distillation techniques?
koljab · 4 months ago
It might be possible; I didn't look into that much for Coqui XTTS. What I do know is that the quantized versions of Orpheus sound noticeably worse. I feel audio models are quite sensitive to quantization.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
karimf · 4 months ago
Do you have any information on how long each step takes? Like how many ms on each step of the pipeline?

I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?

koljab · 4 months ago
LLM and TTS latency gets determined and logged at the start. It's around 220 ms for the LLM to return the first synthesizable sentence fragment (depending on the length of the fragment, which is usually between 3 and 10 words). Then around 80 ms of TTS until the first audio chunk is delivered. STT with base.en is negligible at under 5 ms, same for VAD. The turn detection model adds around 20 ms. I have zero clue if and how fast this runs on a Mac.
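The per-stage numbers above can be summed into a rough time-to-first-audio budget. The values below are the ballpark figures from the comment, not measurements taken from the code:

```python
# Approximate per-stage latencies (ms) as described in the thread.
PIPELINE_MS = {
    "stt_base_en": 5,            # Whisper base.en transcription (negligible)
    "vad": 5,                    # voice activity detection (negligible)
    "turn_detection": 20,        # turn-change classifier
    "llm_first_fragment": 220,   # first synthesizable sentence fragment
    "tts_first_chunk": 80,       # first audio chunk from TTS
}

total = sum(PIPELINE_MS.values())
print(f"Estimated time-to-first-audio: ~{total} ms")
```

That subtotal leaves headroom for network and audio buffering within the ~500 ms end-to-end figure in the post title.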
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
peterldowns · 4 months ago
Can you explain more about the "Coqui XTTS Lasinya" models that the code is using? What are these, and how were they trained/finetuned? I'm assuming you're the one who uploaded them to huggingface, but there's no model card or README https://huggingface.co/KoljaB/XTTS_Models

In case it's not clear, I'm talking about the models referenced here. https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...

koljab · 4 months ago
The Lasinya voice is an XTTS 2.0.2 finetune I made with a self-created, synthesized dataset. I used https://github.com/daswer123/xtts-finetune-webui for training.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
tmaly · 4 months ago
What is the minimum VRAM needed on the GPU to run this? I didn't see that on the GitHub page.
koljab · 4 months ago
With the current 24B LLM it's 24 GB. I have no clue how far down you can go with smaller models; you can set the model in server.py. I'm quite sure 16 GB will work, but at some point it will probably fail.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
wkat4242 · 4 months ago
Yeah, I really dislike the whisperiness of this voice, "Lasinya". It sounds too much like an erotic phone service. I wonder if there's an alternative voice? I don't see Lasinya even mentioned in the public Coqui models: https://github.com/coqui-ai/STT-models/releases . But I don't see a list of other model names I could use either.

I tried to select Kokoro in the Python module, but it says in the logs that only Coqui is available. I do have to say the Coqui models sound really good; it's just the type of voice that puts me off.

The default prompt is also way too "girlfriendy" but that was easily fixed. But for the voice, I simply don't know what the other options are for this engine.

PS: Forgive my criticism of the default voice but I'm really impressed with the responsiveness of this. It really responds so fast. Thanks for making this!

koljab · 4 months ago
Yeah, I know the voice polarizes; I trained it for myself, so it's not an official release. You can change the voice here:

https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...

Create a subfolder in the app container: ./models/some_folder_name
Copy the files from your desired voice into that folder: config.json, model.pth, vocab.json and speakers_xtts.pth (you can copy speakers_xtts.pth from Lasinya, it's the same for every voice).

Then change the specific_model="Lasinya" line in audio_module.py to specific_model="some_folder_name".
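The copy steps above could be scripted roughly like this. install_voice is a hypothetical helper of mine, not part of the repo; the folder layout and file names follow the comment:

```python
import shutil
from pathlib import Path

# File set an XTTS voice folder needs, per the comment above.
REQUIRED_FILES = ["config.json", "model.pth", "vocab.json", "speakers_xtts.pth"]

def install_voice(src_dir: str, name: str, models_dir: str = "./models") -> Path:
    """Copy a finetuned XTTS voice into ./models/<name> (hypothetical helper)."""
    dest = Path(models_dir) / name
    dest.mkdir(parents=True, exist_ok=True)
    for fname in REQUIRED_FILES:
        src = Path(src_dir) / fname
        if not src.exists():
            raise FileNotFoundError(f"missing {fname} in {src_dir}")
        shutil.copy2(src, dest / fname)  # preserves timestamps/metadata
    # After copying, set specific_model="<name>" in audio_module.py by hand.
    return dest
```
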

If you change TTS_START_ENGINE to "kokoro" in server.py, it's supposed to work. What happens then? Can you post the log message?

koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
dummydummy1234 · 4 months ago
Have you looked at pipecat? It seems similar, trying to do standardized backend/WebRTC turn-detection pipelines.
koljab · 4 months ago
I haven't looked into that one. It looks quite good, I'll try it soon.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
pzo · 4 months ago
This looks great, I'll definitely have a look. I'm just wondering if you tested FastRTC from Hugging Face? I haven't, and I'm curious about the speed of this vs. FastRTC vs. pipecat.
koljab · 4 months ago
Yes, I tested it. I'm not that sure what they created there. It adds some noticeable latency compared to using raw websockets. Imho it's not supposed to, but it did nevertheless in my tests.

u/koljab