Quick Demo Video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8
The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2/Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.
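To make the chunk-streaming idea concrete, here's a minimal sketch of a client that captures mic audio and pushes small PCM chunks over a WebSocket. This is my own illustration, not the repo's actual protocol: the endpoint URL, sample rate, and chunk size are all assumptions, so check the code for the real interface.

```python
# Minimal sketch: stream raw PCM chunks to a server over a WebSocket.
# The ws://localhost:8000/audio endpoint, 16 kHz mono format, and 30 ms
# chunk size are illustrative assumptions, not the project's protocol.
import asyncio
import sounddevice as sd
import websockets

SAMPLE_RATE = 16000                           # 16 kHz mono is typical for Whisper-based STT
CHUNK_MS = 30                                 # small chunks keep end-to-end latency low
CHUNK_FRAMES = SAMPLE_RATE * CHUNK_MS // 1000

async def stream_mic(url: str = "ws://localhost:8000/audio"):
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def on_audio(indata, frames, time_info, status):
        # Runs on the audio driver's thread; hand the bytes to the event loop.
        loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

    async with websockets.connect(url) as ws:
        with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1,
                               dtype="int16", blocksize=CHUNK_FRAMES,
                               callback=on_audio):
            while True:
                await ws.send(await queue.get())  # forward each chunk as it arrives

asyncio.run(stream_mic())
```

Keeping chunks this small is what lets STT start transcribing while the user is still speaking, which is where most of the latency savings come from.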
Key aspects: Designed for local LLMs (Ollama primarily, OpenAI connector included). Interruptible conversation. Smart turn detection to avoid cutting the user off mid-thought. Dockerized setup available for easier dependency management.
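For anyone curious what "smart turn detection" can mean in practice, here's a toy heuristic of my own (not the project's implementation): require only a short silence when the partial transcript already looks like a finished sentence, and a longer one mid-thought.

```python
# Toy turn-detection heuristic (illustration only, not the repo's logic):
# the silence needed before responding shrinks when the partial transcript
# already looks like a complete utterance.
import time

SILENCE_COMPLETE = 0.4    # seconds of silence if the sentence looks finished
SILENCE_INCOMPLETE = 1.2  # wait longer mid-thought to avoid cutting the user off

def looks_finished(text: str) -> bool:
    # Crude proxy for semantic completeness; real systems use a model here.
    return text.rstrip().endswith((".", "!", "?"))

class TurnDetector:
    def __init__(self):
        self.last_voice = time.monotonic()

    def on_voice(self):
        # Call whenever the VAD reports speech activity.
        self.last_voice = time.monotonic()

    def should_respond(self, partial_transcript: str) -> bool:
        silence = time.monotonic() - self.last_voice
        needed = SILENCE_COMPLETE if looks_finished(partial_transcript) else SILENCE_INCOMPLETE
        return silence >= needed
```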
It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.
Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.
The code is here: https://github.com/KoljaB/RealtimeVoiceChat
I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?
> data provided by data-labeling services and paid contractors
Someone in my circle is interested in finding out how people participate in these exercises, and whether there are any "service providers" that do the heavy lifting of recruiting and managing this workforce for the many AI/LLM labs globally, or even regionally.
They are interested in remote work opportunities that could leverage their (post-graduate level) education.
Appreciate any pointers here - thanks!