Local, using WhisperX. Precompiled binaries available.
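For anyone curious what the local pipeline looks like, here's a minimal sketch roughly following the WhisperX README (the model name, audio path, and HF token are placeholders, and the exact module layout can shift between versions):

```python
import whisperx

device = "cuda"           # or "cpu"
audio_file = "audio.mp3"  # placeholder path
batch_size = 16
compute_type = "float16"  # use "int8" on smaller GPUs or CPU

# 1. Transcribe with the batched Whisper backend
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align for accurate word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. Diarize and assign speakers (needs a Hugging Face token for pyannote)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(result["segments"])  # each segment now has text, timestamps, and a speaker label
```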
I'm hoping to find and try a local-first setup for something like nvidia/canary (e.g. https://huggingface.co/nvidia/canary-qwen-2.5b), since it's almost twice as fast as Whisper with an even lower word error rate.
Allegedly Groq will be offering diarization with their cloud offering and super-fast API, which will be huge for those willing to go off-local.
Some things never change.
What gets in between? Because I'd bet the first two have a 99% success rate.
This is not that obvious. Calculating VRAM usage for VLMs/LLMs is something of an arcane art. There are about 10 calculators online you can use and none of them work. Quantization, KV caching, activations, layer count, etc. all play a role. It's annoying.
But anyway, for this model you need 40+ GB of VRAM. System RAM isn't going to cut it unless it's unified memory on Apple Silicon, and even then memory bandwidth is the bottleneck, so inference is much, much slower than on a GPU/TPU.
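For a rough sense of where those numbers come from, here's a back-of-the-envelope sketch. All the inputs are hypothetical, and real usage also depends on framework overhead, activation buffers, and paged-attention tricks, which is exactly why the online calculators disagree:

```python
def estimate_vram_gb(
    params_b: float,        # parameter count in billions
    weight_bits: int,       # 16 for fp16, 8 or 4 for quantized weights
    n_layers: int,          # transformer layers
    n_kv_heads: int,        # KV heads (fewer with grouped-query attention)
    head_dim: int,          # dimension per attention head
    ctx_len: int,           # tokens held in the KV cache
    kv_bits: int = 16,      # KV cache precision
    overhead: float = 1.2,  # fudge factor for activations / runtime buffers
) -> float:
    weights = params_b * 1e9 * weight_bits / 8                               # bytes for weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bits / 8  # bytes for K and V
    return (weights + kv_cache) * overhead / 1e9                             # decimal GB

# Hypothetical example: a ~30B-parameter model, 8-bit weights, 8k context
print(f"{estimate_vram_gb(30, 8, 60, 8, 128, 8192):.1f} GB")  # ~38 GB
```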
- https://www.befreed.ai/knowledge-visualizer
- https://kodisc.com/
- https://github.com/hesamsheikh/AnimAI-Trainer
- https://tiger-ai-lab.github.io/TheoremExplainAgent/
- https://tma.live/, HN discussion: https://news.ycombinator.com/item?id=42590290
- https://generative-manim.vercel.app/
No doubt the results can be impressive: https://x.com/zan2434/status/1898145292937314347
The only reason I'm aware of all these attempts is that I'm betting the 'one-shot LLM animation' technique isn't scalable long term. I'm trying to build an AI animation app with a good human-in-the-loop experience, though I'm building with bevy instead of manim.
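For anyone who hasn't looked at these, the output the one-shot pipelines target is ordinary Manim (Community edition) scene code, something like this hand-written toy example (not taken from any of the projects above):

```python
from manim import Scene, Square, Circle, Create, Transform

class SquareToCircle(Scene):
    def construct(self):
        square = Square()
        circle = Circle()
        self.play(Create(square))             # draw the square
        self.play(Transform(square, circle))  # morph it into a circle
        self.wait()

# render with: manim -pql scene.py SquareToCircle
```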