This is really cool. FWIW, existing open-source TTS engines are really bad in comparison to what you have here: I know this is voice-to-voice, but I think there'd be a lot of appetite to get this to also be multimodal and accept text (essentially making it a really good TTS model, in addition to a great voice-to-voice model).
I suppose someone could hack their way around the problem by finetuning it to essentially replay Piper (or whatever) output, only with more natural prosody and intonation. And then have the text LLM pipe to Piper, and Piper pipe to Hertz-dev. But it would be pretty useful to have it accept text natively!
Eh, that depends. A small model that's voice-and-text is probably more useful to most people than scaling up a voice-only model: the large voice-only model will have to compete on intelligence with e.g. Qwen and Llama, since it can't be used in conjunction with them; whereas a small voice+text model can be used as a cheap frontend hiding a larger, smarter, but more expensive text-only model behind it. This is an 8b model: running it is nearly free, and it fits on a 4090 with room to spare.
On the one hand, a small team focused on voice-to-voice could probably do a lot better at voice-to-voice than a small team focused on voice-to-voice+text. But a small team focused on making the most useful model would probably do better at that goal by focusing on voice+text rather than voice-only.
They say Hertz is first of its kind but Moshi is another duplex audio model from earlier this year that seems to perform similarly (and it runs on a MacBook):
https://github.com/kyutai-labs/moshi
Moshi never released the base model, only two conversationally finetuned models. They also never released training code except for the codec. Though I don't see any training code for Hertz either, just 3 inference notebooks, and model code full of no_grad. No paper either to help me understand how this was trained and what the architecture is like. So I'm not too sure about researcher-friendliness unless I'm missing something.
- LLaMA-Omni https://github.com/ictnlp/LLaMA-Omni a speech-language model built on Llama-3.1-8B-Instruct for simultaneous generation of text and speech
- Ichigo https://github.com/homebrewltd/ichigo open research project extending a text-based LLM to have native listening ability, using an early fusion technique
Moshi is a good model to build chat applications on, this is designed to be more of a proper base model with all the quirkiness, naturalness, and researcher-friendliness of base modeling.
Tesla’s approach to pure vision-based autonomous driving—temporarily setting aside lidar and other sensors—seems designed to make this technology more accessible and scalable. By focusing on a vision-only model, they can accelerate adoption and gather large datasets for quicker iterations. Once the vision-based system reaches a mature stage, I imagine Tesla might reintegrate additional sensor data, like lidar or radar, to refine their autonomous driving suite, making it even more robust and closer to perfection.
Additionally, I’ve been exploring an idea about voice interaction systems. Currently, most voice interactions are processed by converting voice input into text, generating a text-based response, and then turning this text back into audio. But what if we could train the system to respond directly in voice, without involving text at all? If developed to maturity, this model could produce responses that feel more natural and spontaneous, possibly diverging from traditional text-to-speech outputs. Natural speech has unique syntax and rhythm, not to mention dialect and tone variations, which could make a purely voice-trained system fascinating and more human-like.
Could you let me know if your current voice interaction model follows the standard speech-to-text-to-speech process, or if there is exploration in voice-to-voice processing?
So essentially this is voice input to voice output? Can you change gender/age/accent? Does it track prosodic information? I've been waiting for something like this.
Oh, you've completed the project I was planning. Currently, where do you think the main difficulty in improving the model lies: voice data, computing power, or algorithm optimization?
I personally think that if you want the best possible results, you don't need to remove the background sound from the original audio. Outputting audio mixed with background sound as new audio may produce background music in the responses. If you use completely unprocessed speech data (including YouTube speech with background music), I think the ceiling is higher, but the compute requirements are much greater. If you don't have the money for more GPUs, apply voice noise reduction as a preprocessing step first.
That's really cool.
I'm currently exploring VUI (Voice User Interface) and this might come in handy.
I might be a bit biased (did my PhD exploring how VUI can persuade humans), but I think VUI is "the future" of computer interaction.
If it's not the future, then at least it adds a new group of people (kids and elderly people) as potential users.
If the authors or anyone else that works on a voice model are in here, do you ever get creeped out or feel the sounds you’re getting from the system have a physiological effect on you?
> Base models are uniquely valuable as a research product because they accurately model the distribution of the data that they were trained on, as opposed to models that have had substantial RL tuning done to collapse their generation distributions. This makes base models the best starting point to fine-tune for a large number of different tasks.
Is this idea (‘collapse of their generation distributions’) a researched topic? If so, under what name?
Sounds interesting and maybe related to the whole continual learning / how to finetune properly line of work
It may not be _them_ doing it, though.
- moshi https://github.com/kyutai-labs/moshi speech-text foundation model using Mimi, a SOTA streaming neural audio codec
- Mini-Omni https://github.com/gpt-omni/mini-omni multimodal LLM based on Qwen2 offering speech input and output