This link is the most worthwhile hijack of an HN thread that I've ever seen.
How is tiny.en so damned accurate?
How much faster does this run natively?
Do the NEON instructions work for ARM devices like an RPi, or is it just tuned for Apple?
With this and Alpaca 13B you could probably replace an entire window manager.
Edit: it seems like it was less than a year ago that local speech recognition was a slow slog through a swamp of crufty, complicated research projects, with most of the high-quality training data hidden behind walled gardens. Now I can stutter-step over a word and a demo in my browser correctly transcribes what I meant, sans stutter. What happened?
This is so good. I tried playing around with whisper.cpp last week and got absolutely terrible performance. I played around with the "step size" as well as "n" (not sure what these do; I should probably read the docs properly).
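For what it's worth, "step" and "length" knobs in streaming transcription tools typically control a sliding window over the incoming audio: each window is `length` samples long and advances by `step` samples, so a smaller step means lower latency but more redundant work. A minimal sketch of the idea (a hypothetical helper, not whisper.cpp's actual code):

```python
def sliding_windows(samples, step, length):
    """Yield overlapping windows over `samples`: each window is `length`
    samples long and the start advances by `step` samples per window.
    Smaller step = lower latency, more overlap (and more recompute)."""
    for start in range(0, max(len(samples) - length + 1, 1), step):
        yield samples[start:start + length]

# e.g. 10 samples, window of 4, advancing by 2:
windows = list(sliding_windows(list(range(10)), step=2, length=4))
```

Each window would then be fed to the transcription model independently, with overlapping results merged.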
I tried a lot of tweaks but this one is definitely the best I've seen.
How did it know how to spell my name correctly when I just spoke into my microphone??? The two L's usually trip up the transcription models. What????
Been using textgen and downloading tons of models; the models are all over the place. Accuracy and short-term memory are major issues that people are trying to work around.
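The usual short-term-memory workaround is crude: keep only the last few exchanges and prepend them to each new prompt. A minimal sketch of that idea (names and prompt format are my own, not any particular tool's):

```python
from collections import deque

class RollingMemory:
    """Naive short-term memory: keep only the last `max_turns` exchanges
    and prepend them to each new prompt; older turns simply fall off."""
    def __init__(self, max_turns=3):
        self.turns = deque(maxlen=max_turns)

    def add(self, user, bot):
        self.turns.append((user, bot))

    def build_prompt(self, user_msg):
        history = "\n".join(f"User: {u}\nBot: {b}" for u, b in self.turns)
        return (history + "\n" if history else "") + f"User: {user_msg}\nBot:"
```

This is why these setups "forget" anything said more than a few turns back: it was literally dropped from the prompt.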
That's a pretty cool showcase of Modal [1]. From a marketing perspective I have to congratulate them; this is a really well done way to get people to check out your platform.
Nice to see Tortoise being used - I still think it's the best TTS system out there now. Generation is slow, but the quality is incredible. I wonder if the code could be optimised to speed up generation, but I don't think the author is maintaining it any longer. [0]
If you're generally interested in TTS, check out Bondsynth AI, a product I've been working on for long-form text-to-speech (think ebook to audiobook, or website to audio). It's still in beta, so it's free, but I'm looking for feedback.
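Long-form pipelines like this generally can't feed a whole ebook to the model at once; a common approach is to split the text into sentence-sized chunks, synthesize each independently, and concatenate the audio. A rough sketch of the chunking step (my own illustration, not Bondsynth's actual code):

```python
import re

def chunk_sentences(text, max_chars=200):
    """Split text at sentence boundaries into chunks of at most
    max_chars, so each chunk fits the TTS model's input budget."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Chunking at sentence boundaries matters because most TTS models produce unnatural prosody when a chunk cuts off mid-sentence.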
Tortoise looks really nice! The output is very "polished" and audiobook-like. It's a contrast to Bark[0] which is far more expressive but unpredictable.
I pitched this on a recent thread, but it was 12+ hours after it was posted, so I'll try again here.
What I really want is a program to waste the time of phone calls making unsolicited sales pitches.
It would do voice-to-text, run a simple language model to generate responses, then synthesize the voice back. It doesn't need to be a sophisticated model, not much more sophisticated than the classic "Eliza" program. A few years back someone did this with a canned loop of vague responses and it fooled the salespeople for a surprisingly long time:
It seems like it could all run locally for low latency. Probably the most important part to get right would be a TTS system that isn't immediately pegged as a robot.
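The language-model part really could be Eliza-simple: match a few keywords, otherwise emit a vague stall. A toy sketch of that middle stage (all responses and rules invented for illustration):

```python
import random

# Vague stalls for when nothing matches; the point is to keep them talking.
CANNED = [
    "Sorry, can you say that again? The line cut out.",
    "Hang on, someone's at the door.",
    "Hmm, that sounds interesting, tell me more.",
    "Wait, how much did you say that costs?",
]

# Keyword-triggered replies, Eliza-style.
KEYWORD_RULES = {
    "price": "That seems expensive. Is there a discount?",
    "warranty": "What does the warranty cover, exactly?",
    "offer": "Is this offer available next month too?",
}

def respond(utterance, rng=random.Random(0)):
    """Return a keyword-matched reply if any rule fires,
    otherwise a random canned stall."""
    lowered = utterance.lower()
    for kw, reply in KEYWORD_RULES.items():
        if kw in lowered:
            return reply
    return rng.choice(CANNED)
```

In the full pipeline this would sit between the speech-to-text output and the TTS input; latency of the TTS stage is probably the hard part, as the comment notes.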
Yes, that was the YouTube video I linked to. If that can be successful with some telemarketers, I'm sure a ChatGPT-aided program could be even more successful at frustrating them.
If the variations are pronounced the same? Luck, probably.
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
Compare it yourself.
Check out textgen, it has voice in/out, graphics in/out, memory plugin, api, plugins, etc, all running locally.
https://github.com/oobabooga/text-generation-webui
Do you know how to get this working? I looked through the README and didn't see any options for it.
You need to enable the extensions.
I only did voice out locally with silero_tts; it also supports voice out via the Eleven Labs API.
Voice input is via whisper (speech-to-text).
1: https://modal.com/
What's the state of container-based ML deployments?
Can I take a container orchestration of these services, put them on a VPS with a GPU, and run this?
Is there some secret or special sauce in ML infra?
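Roughly, yes: if the VPS has an NVIDIA GPU with drivers and the NVIDIA Container Toolkit installed, a plain Docker image can serve these models, started with `docker run --gpus all`. A minimal sketch of such an image (base-image tag, file names, and port are all placeholders, not any specific project's setup):

```dockerfile
# Sketch only: tag, requirements, and server script are placeholders.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
# Hypothetical inference API port.
EXPOSE 8000
CMD ["python3", "server.py"]
```

The "special sauce" in platforms like Modal is less the container itself and more the orchestration around it: fast cold starts, model-weight caching, and autoscaling GPUs on demand.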
[0] https://github.com/neonbjb/tortoise-tts
https://github.com/coqui-ai/TTS
Support for it was recently added to vocode:
https://github.com/vocodedev/vocode-python/pull/56
Demo at: https://www.youtube.com/watch?v=OmQup3kst5s
Signup at: https://signup.bondsynth.ai
[0]: https://github.com/suno-ai/bark
https://mycroft.ai/mimic-3/
https://www.youtube.com/watch?v=XSoOrlh5i1k
https://lennytroll.com/
Deleted Comment
It was a bit laggy, but for a free demo from an open-source project, I should be the one ashamed for complaining!
Well done.
Deleted Comment