lxe · 3 years ago
For "actually serverless" voice chat, check out https://whisper.ggerganov.com/
jancsika · 3 years ago
This link is the most worthwhile hijack of an HN thread that I've ever seen.

How is tiny.en so damned accurate?

How much faster does this run natively?

Do the NEON instructions work for ARM devices like an RPi, or is it just tuned for Apple?

With this and Alpaca 13B you could probably replace an entire window manager.

Edit: it seems like it was less than a year ago that local speech recognition was a slow slog through a swamp of crufty, complicated research projects, with most of the high-quality training data hidden behind walled gardens. Now I can stutter-step over a word and a demo in my browser correctly transcribes what I meant, sans stutter. What happened?

lxe · 3 years ago
SIMD support in wasm is a bit weird. You'll need to compile for multiple targets, and many compile flags are not supported: https://emscripten.org/docs/porting/simd.html
anotherhue · 3 years ago
Yes, NEON works on recent Pis. RPi CPU tuning: https://github.com/ggerganov/whisper.cpp/blob/70567eff232773...
arthurcolle · 3 years ago
This is so good. I tried whisper.cpp last week and got absolutely terrible performance. I played around with the "step size" as well as "n" (not sure what these do; I should probably read the docs seriously).

I tried a lot of tweaks but this one is definitely the best I've seen.

How did it know how to spell my name correctly when I just spoke into my microphone??? The two L's usually trip up the transcription models. What????

kurisufag · 3 years ago
>How did it know how to spell my name correctly when I just spoke into my microphone???

if the variations are pronounced the same? luck, probably.

syntaxing · 3 years ago
Hmm, curious what the differences are; it's the same person who wrote whisper.cpp that hosts that website.
dom96 · 3 years ago
wow, you weren't kidding. I tried the tiny model and it got what I said perfectly. Super impressive.
antman · 3 years ago
Which of the models did you use?
lxe · 3 years ago
This isn't me; this is ggerganov's work. His demo uses GPT-2 and Whisper models (https://ggml.ggerganov.com/).
teacpde · 3 years ago
Really cool. I was able to play with it for a bit, but now it seems like the HN hug of death has kicked in.
lxe · 3 years ago
It's hard to death-hug these wasm demos, since they're all static files that can easily be served via CDNs.
EGreg · 3 years ago
What's the point, when almost all web browsers already support it?

https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...

easrng · 3 years ago
In Chrome that's using Google's servers to transcribe the audio (last I checked) and it doesn't work on Firefox.
hombre_fatal · 3 years ago
When you want much better, state-of-the-art quality.

Compare it yourself.

lxe · 3 years ago
Engineering curiosity and fun
IronWolve · 3 years ago
Been using textgen and downloading tons of models; the models are all over the place. Accuracy and short-term memory are major issues that people are trying to implement workarounds for.

Check out textgen: it has voice in/out, graphics in/out, a memory plugin, an API, plugins, etc., all running locally.

https://github.com/oobabooga/text-generation-webui

bioemerl · 3 years ago
> it has voice in/out,

Do you know how to get this working? I looked through the README and didn't see any options for it.

IronWolve · 3 years ago
https://github.com/oobabooga/text-generation-webui/blob/main...

You need to enable the extensions.

I only did voice out locally with silero_tts; it also supports voice out with the ElevenLabs API.

Voice input is via Whisper (speech-to-text).
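
For reference, launching it looks roughly like this (the exact extension names are from memory and may have changed; check the repo's extensions folder):

    python server.py --model <your-model> --extensions silero_tts whisper_stt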

CrazyStat · 3 years ago
Any models and/or plugins you recommend?
wongarsu · 3 years ago
That's a pretty cool showcase of Modal [1]. From a marketing perspective I have to congratulate you; this is a really well-done way to get people to check out your platform.

1: https://modal.com/

bredren · 3 years ago
I also clicked through, but immediately abandoned it, wondering:

What's the state of container-based ML deployments?

Can I take a container orchestration of these services, put them on a VPS with a GPU, and run this?

Is there secret sauce, or just special sauce, in ML infra?

turnsout · 3 years ago
Is it basically Vercel for ML?
fredliu · 3 years ago
+1, modal.com is the first thing I checked after reading the README.
npace12 · 3 years ago
+1, props to modal.com; well done, and their site is nice.
forgingahead · 3 years ago
Nice to see Tortoise being used - I still think it's the best TTS system out there now. Generation time is slow, but quality is incredible. I wonder if the code can be optimised to speed up the generation, but I don't think the author is maintaining it any longer.[0]

[0]: https://github.com/neonbjb/tortoise-tts
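
For anyone who wants to kick the tires, here's a minimal usage sketch adapted from the repo's README (treat the exact API and the "tom" voice as assumptions; things may have drifted):

    # Minimal Tortoise sketch adapted from the project's README examples; untested here.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()
    # "tom" is one of the bundled reference voices; swap in your own clips to clone a voice.
    voice_samples, conditioning_latents = load_voice("tom")
    gen = tts.tts_with_preset(
        "Generation is slow, but the quality is worth the wait.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="fast",  # presets trade quality for speed
    )
    torchaudio.save("tortoise_out.wav", gen.squeeze(0).cpu(), 24000)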

Tepix · 3 years ago
Coqui also looks interesting.

https://github.com/coqui-ai/TTS

Support for it was recently added to vocode:

https://github.com/vocodedev/vocode-python/pull/56
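
The Coqui API is pleasantly small; a quick sketch (the model name is just one of their published English models, so adjust to taste):

    # Coqui TTS sketch; downloads the model on first run.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Hello from Coqui TTS.", file_path="coqui_out.wav")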

bondsynth · 3 years ago
If you're generally interested in TTS, check out Bondsynth AI; it's a product I've been working on for long-form text-to-speech (think ebook to audiobook, or website to audio). It's still in beta, so it's free, but I'm looking for feedback.

Demo at: https://www.youtube.com/watch?v=OmQup3kst5s

Signup at: https://signup.bondsynth.ai

dontwearitout · 3 years ago
Very impressive demo! FYI https://signup.bondsynth.ai doesn't work, but https://bondsynth.ai/signup.html does
turnsout · 3 years ago
Tortoise looks really nice! The output is very "polished" and audiobook-like. It's a contrast to Bark[0] which is far more expressive but unpredictable.

[0]: https://github.com/suno-ai/bark

discardedrefuse · 3 years ago
You might want to check out Mimic 3 by the Mycroft team. It's also open source and runs offline.

https://mycroft.ai/mimic-3/

mdaniel · 3 years ago
It took quite a bit of digging to find the repo link (https://github.com/MycroftAI/mimic3#readme), and it's AGPL-3, for those interested in such things.
generalizations · 3 years ago
There are a lot of use cases just waiting for a good system that can generate at least at live speed.
tasty_freeze · 3 years ago
I pitched this on a recent thread, but it was 12+ hours after it was posted, so I'll try again here.

What I really want is a program to waste the time of people making unsolicited sales calls.

It would do speech-to-text, run a simple language model to generate responses, then synthesize speech back. The model doesn't need to be sophisticated; not much more than the classic "Eliza" program. A few years back someone did this with a canned loop of vague responses, and it fooled the salespeople for a surprisingly long time:

https://www.youtube.com/watch?v=XSoOrlh5i1k

It seems like it could all run locally for low latency. Probably the most important part to get right would be a TTS system that isn't immediately pegged as a robot.
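
Roughly the loop I have in mind, with transcribe/speak as placeholders for whatever local STT and TTS you wire in (the stock responses are just for illustration):

    # Toy sketch of the call-wasting loop; Eliza-level "intelligence" is plenty.
    import random

    STALLS = [
        "Sorry, could you say that again? The line is terrible.",
        "Hang on, someone's at the door.",
        "Hmm, and how much did you say that costs?",
        "Wait, which company did you say you're calling from?",
    ]

    def transcribe(chunk) -> str:
        # Placeholder: swap in a local STT engine (e.g. a whisper.cpp binding).
        raise NotImplementedError

    def speak(text: str) -> None:
        # Placeholder: swap in a local TTS engine and play it back onto the call.
        raise NotImplementedError

    def respond(transcript: str) -> str:
        # Vague, open-ended stalls keep the caller talking.
        if "price" in transcript.lower() or "$" in transcript:
            return "That seems like a lot. Can you go over it one more time?"
        return random.choice(STALLS)

    def handle_call(audio_chunks):
        for chunk in audio_chunks:  # e.g. fixed-length audio chunks off the line
            transcript = transcribe(chunk)
            if transcript.strip():
                speak(respond(transcript))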

Zetaphor · 3 years ago
Have you seen Lenny? It's a much lower tech solution that seems to work quite well.

https://lennytroll.com/

tasty_freeze · 3 years ago
Yes, that was the YouTube video I linked to. If that can be successful with some telemarketers, I'm sure a ChatGPT-aided program could be even more successful at frustrating them.

Deleted Comment

mdaniel · 3 years ago
Just in case someone finds that YT video entertaining: you'll really love https://www.youtube.com/@KitbogaShow
sramam · 3 years ago
Very cool - the demo was simple, functional and clear.

It was a bit laggy, but for a free demo from an open source project, I should be the one being shamed!

Well done.

juliennakache · 3 years ago
Interesting. How do local development and remote debugging work if the entire production toolchain is abstracted away behind proprietary software?
thundergolfer · 3 years ago
This docs guide has some answers: https://modal.com/docs/guide/developing-debugging
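
The short version is that your code stays ordinary Python you can run and debug locally, and the decorators decide what runs in the cloud. A rough sketch of what an app looked like around the time of this thread (names may have changed since; the whisper bits are just an example workload):

    # Rough sketch of a Modal app; check the docs for the current API.
    import modal

    stub = modal.Stub("whisper-transcribe")
    image = (
        modal.Image.debian_slim()
        .apt_install("ffmpeg")
        .pip_install("openai-whisper")
    )

    @stub.function(image=image, gpu="any")
    def transcribe(audio_bytes: bytes) -> str:
        import whisper  # available inside the container image
        with open("/tmp/audio.wav", "wb") as f:
            f.write(audio_bytes)
        model = whisper.load_model("tiny.en")
        return model.transcribe("/tmp/audio.wav")["text"]

    if __name__ == "__main__":
        # This script runs locally, but the decorated function executes remotely.
        with stub.run():
            print(transcribe.call(open("sample.wav", "rb").read()))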

Deleted Comment