lxe · 3 years ago
For "actually serverless" voice chat, check out https://whisper.ggerganov.com/
jancsika · 3 years ago
This link is the most worthwhile hijack of an HN thread that I've ever seen.

How is tiny.en so damned accurate?

How much faster does this run natively?

Do the NEON instructions work for ARM devices like an RPi, or is it just tuned for Apple?

With this and Alpaca 13B you could probably replace an entire window manager.

Edit: it seems like it was less than a year ago that local speech recognition was a slow slog through a swamp of crufty, complicated research projects, with most of the high-quality training data hidden behind walled gardens. Now I can stutter-step over a word and a demo in my browser correctly transcribes what I meant, sans stutter. What happened?

lxe · 3 years ago
SIMD support in wasm is a bit weird. You'll need to compile for multiple targets, and many compile flags are not supported: https://emscripten.org/docs/porting/simd.html
anotherhue · 3 years ago
Yes, NEON works on recent Pis. RPi CPU tuning: https://github.com/ggerganov/whisper.cpp/blob/70567eff232773...
arthurcolle · 3 years ago
This is so good. I tried whisper.cpp last week and got absolutely terrible performance. I played around with the "step size" as well as "n" (not sure what these do; I should probably read the docs seriously).

I tried a lot of tweaks but this one is definitely the best I've seen.

How did it know how to spell my name correctly when I just spoke into my microphone??? The two L's usually trip up the transcription models. What????

kurisufag · 3 years ago
>How did it know how to spell my name correctly when I just spoke into my microphone???

if the variations are pronounced the same? luck, probably.

syntaxing · 3 years ago
Hmm, curious what the differences are; it's the same person who wrote whisper.cpp that hosts that website.
dom96 · 3 years ago
wow, you weren't kidding. I tried the tiny model and it got what I said perfectly. Super impressive.
antman · 3 years ago
Which of the models did you use?
lxe · 3 years ago
This isn't me; this is ggerganov's work. His demo uses GPT-2 and Whisper models (https://ggml.ggerganov.com/).
teacpde · 3 years ago
Really cool. I was able to play with it for a bit, but now it seems like the HN hug of death has kicked in.
lxe · 3 years ago
It's hard to death-hug these wasm demos, since they're all static files that can easily be served via CDNs.
EGreg · 3 years ago
What's the point, when almost all web browsers already support it?

https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...

easrng · 3 years ago
In Chrome that's using Google's servers to transcribe the audio (last I checked) and it doesn't work on Firefox.
hombre_fatal · 3 years ago
When you want much better, state-of-the-art quality.

Compare it yourself.

lxe · 3 years ago
Engineering curiosity and fun
IronWolve · 3 years ago
Been using textgen and downloading tons of models; the models are all over the place. Accuracy and short-term memory are major issues that people are trying to implement workarounds for.

Check out textgen: it has voice in/out, graphics in/out, a memory plugin, an API, plugins, etc., all running locally.

https://github.com/oobabooga/text-generation-webui

bioemerl · 3 years ago
> it has voice in/out,

Do you know how to get this working? I looked through the README and didn't see any options for it.

IronWolve · 3 years ago
https://github.com/oobabooga/text-generation-webui/blob/main...

You need to enable the extensions.

I only did voice out locally with silero_tts; it also supports voice out with the ElevenLabs API.

Voice input is via Whisper (speech-to-text).
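
For reference, launching it looks roughly like this (the exact extension names are from memory and may have changed; check the repo's extensions folder):

    python server.py --model <your-model> --extensions silero_tts whisper_stt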

CrazyStat · 3 years ago
Any models and/or plugins you recommend?
wongarsu · 3 years ago
That's a pretty cool showcase of Modal [1]. From a marketing perspective I have to congratulate you; this is a really well-done way to get people to check out your platform.

1: https://modal.com/

bredren · 3 years ago
I also clicked through, but immediately abandoned it, wondering:

What's the state of container-based ML deployments?

Can I take a container orchestration of these services, put them on a VPS with a GPU, and run this?

Is there secret sauce, or just special sauce, in ML infra?

turnsout · 3 years ago
Is it basically Vercel for ML?
fredliu · 3 years ago
+1, modal.com is the first thing I checked after reading the README.
npace12 · 3 years ago
+1, props to modal.com; well done, and their site is nice.
forgingahead · 3 years ago
Nice to see Tortoise being used - I still think it's the best TTS system out there now. Generation time is slow, but quality is incredible. I wonder if the code can be optimised to speed up the generation, but I don't think the author is maintaining it any longer.[0]

[0]: https://github.com/neonbjb/tortoise-tts
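
For anyone who wants to kick the tires, here's a minimal usage sketch adapted from the repo's README (treat the exact API and the "tom" voice as assumptions; things may have drifted):

    # Minimal Tortoise sketch adapted from the project's README examples; untested here.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()
    # "tom" is one of the bundled reference voices; swap in your own clips to clone a voice.
    voice_samples, conditioning_latents = load_voice("tom")
    gen = tts.tts_with_preset(
        "Generation is slow, but the quality is worth the wait.",
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="fast",  # presets trade quality for speed
    )
    torchaudio.save("tortoise_out.wav", gen.squeeze(0).cpu(), 24000)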

Tepix · 3 years ago
Coqui also looks interesting.

https://github.com/coqui-ai/TTS

Support for it was recently added to vocode:

https://github.com/vocodedev/vocode-python/pull/56
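
The Coqui API is pleasantly small; a quick sketch (the model name is just one of their published English models, so adjust to taste):

    # Coqui TTS sketch; downloads the model on first run.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Hello from Coqui TTS.", file_path="coqui_out.wav")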

bondsynth · 3 years ago
If you're generally interested in TTS, check out Bondsynth AI; it's a product I've been working on for long-form text-to-speech (think ebook to audiobook, or website to audio). It's still in beta, so it's free, but I'm looking for feedback.

Demo at: https://www.youtube.com/watch?v=OmQup3kst5s

Signup at: https://signup.bondsynth.ai

dontwearitout · 3 years ago
Very impressive demo! FYI https://signup.bondsynth.ai doesn't work, but https://bondsynth.ai/signup.html does
turnsout · 3 years ago
Tortoise looks really nice! The output is very "polished" and audiobook-like. It's a contrast to Bark[0] which is far more expressive but unpredictable.

[0]: https://github.com/suno-ai/bark

discardedrefuse · 3 years ago
You might want to check out Mimic 3 by the Mycroft team. It's also open source and runs offline.

https://mycroft.ai/mimic-3/

mdaniel · 3 years ago
It took quite a bit of digging to find the repo link (https://github.com/MycroftAI/mimic3#readme), and it's AGPL-3, for those interested in such things.
generalizations · 3 years ago
There are a lot of use cases just waiting for a good system that can generate at least at live speed.
tasty_freeze · 3 years ago
I pitched this on a recent thread, but it was 12+ hours after it was posted, so I'll try again here.

What I really want is a program to waste the time of people making unsolicited sales calls.

It would do speech-to-text, run a simple language model to generate responses, then synthesize speech back. The model doesn't need to be sophisticated; not much more than the classic "Eliza" program. A few years back someone did this with a canned loop of vague responses, and it fooled the salespeople for a surprisingly long time:

https://www.youtube.com/watch?v=XSoOrlh5i1k

It seems like it could all run locally for low latency. Probably the most important part to get right would be a TTS system that isn't immediately pegged as a robot.
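
Roughly the loop I have in mind, with transcribe/speak as placeholders for whatever local STT and TTS you wire in (the stock responses are just for illustration):

    # Toy sketch of the call-wasting loop; Eliza-level "intelligence" is plenty.
    import random

    STALLS = [
        "Sorry, could you say that again? The line is terrible.",
        "Hang on, someone's at the door.",
        "Hmm, and how much did you say that costs?",
        "Wait, which company did you say you're calling from?",
    ]

    def transcribe(chunk) -> str:
        # Placeholder: swap in a local STT engine (e.g. a whisper.cpp binding).
        raise NotImplementedError

    def speak(text: str) -> None:
        # Placeholder: swap in a local TTS engine and play it back onto the call.
        raise NotImplementedError

    def respond(transcript: str) -> str:
        # Vague, open-ended stalls keep the caller talking.
        if "price" in transcript.lower() or "$" in transcript:
            return "That seems like a lot. Can you go over it one more time?"
        return random.choice(STALLS)

    def handle_call(audio_chunks):
        for chunk in audio_chunks:  # e.g. fixed-length audio chunks off the line
            transcript = transcribe(chunk)
            if transcript.strip():
                speak(respond(transcript))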

Zetaphor · 3 years ago
Have you seen Lenny? It's a much lower tech solution that seems to work quite well.

https://lennytroll.com/

tasty_freeze · 3 years ago
Yes, that was the YouTube video I linked to. If that can be successful with some telemarketers, I'm sure a ChatGPT-aided program could be even more successful at frustrating them.

Deleted Comment

mdaniel · 3 years ago
Just in case someone finds that YT video entertaining: you'll really love https://www.youtube.com/@KitbogaShow
sramam · 3 years ago
Very cool - the demo was simple, functional and clear.

It was a bit laggy, but for a free demo from an open source project, I should be the one being shamed!

Well done.

juliennakache · 3 years ago
Interesting. How do local development and remote debugging work if the entire production toolchain is abstracted away behind proprietary software?
thundergolfer · 3 years ago
This docs guide has some answers: https://modal.com/docs/guide/developing-debugging
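
The short version is that your code stays ordinary Python you can run and debug locally, and the decorators decide what runs in the cloud. A rough sketch of what an app looked like around the time of this thread (names may have changed since; the whisper bits are just an example workload):

    # Rough sketch of a Modal app; check the docs for the current API.
    import modal

    stub = modal.Stub("whisper-transcribe")
    image = (
        modal.Image.debian_slim()
        .apt_install("ffmpeg")
        .pip_install("openai-whisper")
    )

    @stub.function(image=image, gpu="any")
    def transcribe(audio_bytes: bytes) -> str:
        import whisper  # available inside the container image
        with open("/tmp/audio.wav", "wb") as f:
            f.write(audio_bytes)
        model = whisper.load_model("tiny.en")
        return model.transcribe("/tmp/audio.wav")["text"]

    if __name__ == "__main__":
        # This script runs locally, but the decorated function executes remotely.
        with stub.run():
            print(transcribe.call(open("sample.wav", "rb").read()))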

Deleted Comment