Readit News
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
wkat4242 · 4 months ago
Thank you!

I didn't realise that you custom-made that voice. Would you have some links to other out-of-the-box voices for Coqui? I'm having some trouble finding them. From the demo page, the idea seems to be that you clone someone else's voice with that engine, because I don't see any voices listed. I've never seen it before.

And yes, I switched to Kokoro now. I thought it was the default already, but then I saw there were 3 lines configuring the same thing. So that's working. Kokoro isn't quite as good as Coqui, though, which is why I'm wondering about that. I also used Kokoro on Open WebUI and I wasn't very happy with it there either. It's fast, but some pronunciation is weird. Also, it would be amazing to have bilingual TTS (English/Spanish in my case), and it looks like Coqui might be able to do that.

koljab · 4 months ago
I haven't found many Coqui finetunes so far either. I have David Attenborough and Snoop Dogg finetunes on my Hugging Face; quality is medium.

Coqui can do 17 languages. The problem with the RealtimeVoiceChat repo is turn detection: the model I use to determine whether a partial sentence indicates a turn change is trained on an English-only corpus.

koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
sabellito · 4 months ago
Every time I see these things, they look cool as hell, I get excited, then I try to get them working on my gaming PC (the one with the GPU), spend 1-2 hours fighting with Python, and give up.

Today's issue is that my Python version is 3.12 instead of <3.12,>=3.9. Installing Python 3.11 from the official website does nothing, so I give up. It's a shame that the amazing work done by people like the OP gets underused because of this mess outside of their control.

"Just use docker". Have you tried using docker on windows? There's a reason I never do dev work on windows.

I spent most of my career in the JVM and Node, and despite the issues, never had to deal with this level of incompatibility.

koljab · 4 months ago
Yes, you're absolutely right. I'll provide uv and conda support soon, especially for Windows. I'm still using Python 3.10, maybe that's the issue. You can always mail me your current problem or log an issue; I really care about opening up the repo for as many people as possible.
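As an aside, a tiny guard like the following would make the >=3.9,<3.12 constraint fail fast with a readable check. This is my own sketch, not something in the repo:

```python
import sys

# The project's constraint as described in the thread: >=3.9, <3.12
REQUIRED = ((3, 9), (3, 12))

def version_ok(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if the given (major, minor) satisfies >=3.9,<3.12."""
    low, high = REQUIRED
    return low <= version < high
```

A setup script could call this at the top and print a clear message instead of letting pip's resolver produce a cryptic error later.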
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
echelon · 4 months ago
That's excellent. Really amazing bringing all of these together like this.

Hopefully we get an open weights version of Sesame [1] soon. Keep watching for it, because that'd make a killer addition to your app.

[1] https://www.sesame.com/

koljab · 4 months ago
That would be absolutely awesome. But I doubt it, since they released a shitty version of that amazing thing they put online. I don't feel they're planning to give us their top model anytime soon.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
nardi · 4 months ago
I don't know what I'm talking about, but could you use distillation techniques?
koljab · 4 months ago
It might be possible; I didn't look into that much for Coqui XTTS. What I do know is that the quantized versions of Orpheus sound noticeably worse. I feel audio models are quite sensitive to quantization.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
karimf · 4 months ago
Do you have any information on how long each step takes? Like how many ms on each step of the pipeline?

I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?

koljab · 4 months ago
LLM and TTS latency gets determined and logged at the start. It's around 220 ms for the LLM to return the first synthesizable sentence fragment (depending on the length of the fragment, which is usually between 3 and 10 words). Then around 80 ms of TTS until the first audio chunk is delivered. STT with base.en is negligible at under 5 ms, same for VAD. The turn detection model adds around 20 ms. I have zero clue if and how fast this runs on a Mac.
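The per-stage numbers above can be summed into a rough time-to-first-audio budget. The values below are the ballpark figures from the comment, not measurements taken from the code:

```python
# Approximate per-stage latencies (ms) as described in the thread.
PIPELINE_MS = {
    "stt_base_en": 5,            # Whisper base.en transcription (negligible)
    "vad": 5,                    # voice activity detection (negligible)
    "turn_detection": 20,        # turn-change classifier
    "llm_first_fragment": 220,   # first synthesizable sentence fragment
    "tts_first_chunk": 80,       # first audio chunk from TTS
}

total = sum(PIPELINE_MS.values())
print(f"Estimated time-to-first-audio: ~{total} ms")
```

That subtotal leaves headroom for network and audio buffering within the ~500 ms end-to-end figure in the post title.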
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
peterldowns · 4 months ago
Can you explain more about the "Coqui XTTS Lasinya" models that the code is using? What are these, and how were they trained/finetuned? I'm assuming you're the one who uploaded them to huggingface, but there's no model card or README https://huggingface.co/KoljaB/XTTS_Models

In case it's not clear, I'm talking about the models referenced here. https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...

koljab · 4 months ago
The Lasinya voice is an XTTS 2.0.2 finetune I made with a self-created, synthesized dataset. I used https://github.com/daswer123/xtts-finetune-webui for training.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
tmaly · 4 months ago
What is the minimum VRAM needed on the GPU to run this? I didn't see that on the GitHub page.
koljab · 4 months ago
With the current 24B LLM it's 24 GB. I have no clue how far down you can go with smaller models; you can set the model in server.py. I'm quite sure 16 GB will work, but at some point it will probably fail.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
wkat4242 · 4 months ago
Yeah, I really dislike the whisperiness of this voice, "Lasinya". It sounds too much like an erotic phone service. I wonder if there's an alternative voice? I don't see Lasinya even mentioned in the public Coqui models: https://github.com/coqui-ai/STT-models/releases . But I don't see a list of other model names I could use either.

I tried to select Kokoro in the Python module, but it says in the logs that only Coqui is available. I do have to say the Coqui models sound really good; it's just the type of voice that puts me off.

The default prompt is also way too "girlfriendy" but that was easily fixed. But for the voice, I simply don't know what the other options are for this engine.

PS: Forgive my criticism of the default voice but I'm really impressed with the responsiveness of this. It really responds so fast. Thanks for making this!

koljab · 4 months ago
Yeah, I know the voice polarizes; I trained it for myself, so it's not an official release. You can change the voice here:

https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...

Create a subfolder in the app container: ./models/some_folder_name
Copy the files from your desired voice into that folder: config.json, model.pth, vocab.json and speakers_xtts.pth (you can copy speakers_xtts.pth from Lasinya, it's the same for every voice).

Then change the specific_model="Lasinya" line in audio_module.py to specific_model="some_folder_name".
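The copy steps above could be scripted roughly like this. install_voice is a hypothetical helper of mine, not part of the repo; the folder layout and file names follow the comment:

```python
import shutil
from pathlib import Path

# File set an XTTS voice folder needs, per the comment above.
REQUIRED_FILES = ["config.json", "model.pth", "vocab.json", "speakers_xtts.pth"]

def install_voice(src_dir: str, name: str, models_dir: str = "./models") -> Path:
    """Copy a finetuned XTTS voice into ./models/<name> (hypothetical helper)."""
    dest = Path(models_dir) / name
    dest.mkdir(parents=True, exist_ok=True)
    for fname in REQUIRED_FILES:
        src = Path(src_dir) / fname
        if not src.exists():
            raise FileNotFoundError(f"missing {fname} in {src_dir}")
        shutil.copy2(src, dest / fname)  # preserves timestamps/metadata
    # After copying, set specific_model="<name>" in audio_module.py by hand.
    return dest
```
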

If you change TTS_START_ENGINE to "kokoro" in server.py, it's supposed to work. What happens then? Can you post the log message?

koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
dummydummy1234 · 4 months ago
Have you looked at pipecat? It seems similar, trying to do standardized backend/WebRTC turn-detection pipelines.
koljab · 4 months ago
I haven't looked into that one. It looks quite good, I'll try it soon.
koljab commented on Show HN: Real-time AI Voice Chat at ~500ms Latency   github.com/KoljaB/Realtim... · Posted by u/koljab
pzo · 4 months ago
This looks great, I'll definitely have a look. I'm just wondering if you tested FastRTC from Hugging Face? I haven't, and I'm curious about the speed of this vs. FastRTC vs. pipecat.
koljab · 4 months ago
Yes, I tested it. I'm not that sure what they created there. It adds some noticeable latency compared to using raw websockets. Imho it's not supposed to, but it did nevertheless in my tests.

u/koljab