Posted by u/sanchitmonga22 6 days ago
Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon (github.com/RunanywhereAI/...)
Hi HN, we're Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon. LLMs, speech-to-text, text-to-speech: MetalRT beats llama.cpp, Apple's MLX, Ollama, and sherpa-onnx on every modality we tested. Custom Metal shaders, no framework overhead.

Also, we've open-sourced RCLI, the fastest end-to-end voice AI pipeline on Apple Silicon. Mic to spoken response, entirely on-device. No cloud, no API keys.

To get started:

  brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
  brew install rcli
  rcli setup   # downloads ~1 GB of models
  rcli         # interactive mode with push-to-talk
Or:

  curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash
The numbers (M4 Max, 64 GB, reproducible via `rcli bench`):

LLM decode – 1.67x faster than llama.cpp, 1.19x faster than Apple MLX (same model files):

- Qwen3-0.6B: 658 tok/s (vs mlx-lm 552, llama.cpp 295)
- Qwen3-4B: 186 tok/s (vs mlx-lm 170, llama.cpp 87)
- LFM2.5-1.2B: 570 tok/s (vs mlx-lm 509, llama.cpp 372)
- Time-to-first-token: 6.6 ms

STT – 70 seconds of audio transcribed in *101 ms*. That's 714x real-time. 4.6x faster than mlx-whisper.

TTS – 178 ms synthesis. 2.8x faster than mlx-audio and sherpa-onnx.

We built this because demoing on-device AI is easy but shipping it is brutal. Voice is the hardest test: you're chaining STT, LLM, and TTS sequentially, and if any stage is slow, the user feels it. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure is.

The thing that's hard to solve is latency compounding. In a voice pipeline, you're stacking three models in sequence. If each adds 200ms, you're at 600ms before the user hears a word, and that feels broken. You can't optimize one stage and call it done. Every stage needs to be fast, on one device, with no network round-trip to hide behind.
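The compounding here is plain addition, which is exactly why it is unforgiving: a minimal sketch (per-stage numbers are the illustrative 200 ms figures from the paragraph above, not MetalRT measurements):

```python
def pipeline_latency_ms(stages: dict[str, float]) -> float:
    """Time before the user hears the first audio: the plain sum of sequential stages."""
    return sum(stages.values())

# Three merely "okay" stages stack up to a response that feels broken:
naive = pipeline_latency_ms({"stt": 200.0, "llm_ttft": 200.0, "tts": 200.0})
assert naive == 600.0

# Only speeding up *every* stage moves the end-to-end number:
fast = pipeline_latency_ms({"stt": 100.0, "llm_ttft": 100.0, "tts": 100.0})
assert fast == 300.0
```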

We went straight to Metal. Custom GPU compute shaders, all memory pre-allocated at init (zero allocations during inference), and one unified engine for all three modalities instead of stitching separate runtimes together.
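The "all memory pre-allocated at init" point is a general pattern worth spelling out: size every buffer for the worst case once, then only write into those buffers during inference, so there are no allocator pauses mid-decode. A toy Python sketch of the shape (class and buffer names are illustrative; MetalRT's real implementation is native Metal code):

```python
class PreallocatedEngine:
    """Sketch of the allocate-at-init, zero-alloc-at-inference pattern.
    Buffer names and sizes are illustrative, not MetalRT's."""

    def __init__(self, max_seq_len: int, hidden_dim: int) -> None:
        # Worst-case buffers allocated once, up front (4 bytes per element).
        self.activations = bytearray(max_seq_len * hidden_dim * 4)
        self.kv_cache = bytearray(2 * max_seq_len * hidden_dim * 4)

    def decode_step(self, token_id: int) -> None:
        # Writes into pre-allocated storage; no new allocations happen here.
        self.activations[0:4] = token_id.to_bytes(4, "little")

engine = PreallocatedEngine(max_seq_len=4096, hidden_dim=1024)
engine.decode_step(42)
assert engine.activations[:4] == (42).to_bytes(4, "little")
```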

MetalRT is the first engine to handle all three modalities natively on Apple Silicon. Full methodology:

LLM benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...

Speech benchmarks: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

How: Most inference engines add layers between you and the GPU: graph schedulers, runtime dispatchers, memory managers. MetalRT skips all of it. Custom Metal compute shaders for quantized matmul, attention, and activation - compiled ahead of time, dispatched directly.

Voice pipeline optimization details: https://www.runanywhere.ai/blog/fastvoice-on-device-voice-ai...

RAG optimizations: https://www.runanywhere.ai/blog/fastvoice-rag-on-device-retr...

RCLI is the open-source voice pipeline (MIT) built on MetalRT: three concurrent threads with lock-free ring buffers, double-buffered TTS, 38 macOS actions by voice, local RAG (~4 ms over 5K+ chunks), 20 hot-swappable models, and a full-screen TUI with per-op latency readouts. Falls back to llama.cpp when MetalRT isn't installed.
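The "lock-free ring buffers" piece refers to a standard single-producer/single-consumer queue: fixed-size storage plus two monotonically advancing indices, so the audio thread never blocks on the inference thread. A toy sketch of the shape (Python cannot express real atomics, so treat this as structure only, not a faithful lock-free implementation of RCLI's buffers):

```python
class SPSCRing:
    """Single-producer/single-consumer ring buffer sketch.
    Real implementations use atomic head/tail with acquire/release ordering."""

    def __init__(self, capacity: int) -> None:
        self.buf = [None] * capacity
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.buf):
            return False  # full: producer drops or retries, never blocks
        self.buf[self.tail % len(self.buf)] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # empty
        item = self.buf[self.head % len(self.buf)]
        self.head += 1
        return item

ring = SPSCRing(capacity=4)
ring.push("audio-frame-0")
assert ring.pop() == "audio-frame-0"
assert ring.pop() is None
```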

Source: https://github.com/RunanywhereAI/RCLI (MIT)

Demo: https://www.youtube.com/watch?v=eTYwkgNoaKg

What would you build if on-device AI were genuinely as fast as cloud?

stingraycharles · 6 days ago
I'm a bit confused by what you're offering. Is it a voice assistant / AI as described on your GitHub? Or is it more general-purpose / an LLM?

How does the RAG fit in, a voice-to-RAG seems a bit random as a feature?

I don’t mean to come across as dismissive, I’m genuinely confused as to what you’re offering.

shubham2802 · 6 days ago
RunAnywhere builds software that makes AI models run fast locally on devices instead of sending requests to the cloud.

Right now, our focus is Apple Silicon.

Today there are two parts:

MetalRT - our proprietary inference engine for Apple Silicon. It speeds up local LLM, speech-to-text, and text-to-speech workloads. We’re expanding model coverage over time, with more modalities and broader support coming next.

RCLI - our open-source CLI that shows this in practice. You can talk to your Mac, query local docs, and trigger actions, all fully on-device.

So the simplest way to think about us is: we’re building the runtime / infrastructure layer for on-device AI, and RCLI is one example of what that enables.

Longer term, we want to bring the same approach to more chips and device types, not just Apple Silicon.

For people asking whether the speedups are real, we've published our benchmark methodology and results:

LLM: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...

Speech: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

mirekrusin · 6 days ago
From LLM benchmarks it looks like it's better to use open source uzu than RunAnywhere's proprietary inference engine.

[0] https://github.com/trymirai/uzu

concats · 5 days ago
How does it compare for models of any meaningful size?

These 0.6B-4B models are, frankly, just amusing curiosities, commonly regarded as too error-prone for any non-demo work.

The reason people are buying Apple Silicon today is that unified memory lets them run larger models that are cost-prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model in the 20B+ range. Thanks.

sanchitmonga22 · 5 days ago
Fair question, let me clarify.

RunAnywhere is an inference company. We build the runtime layer for on-device AI.

There are two pieces:

MetalRT, a proprietary GPU inference engine for Apple Silicon. It runs LLMs, speech-to-text, and text-to-speech faster than anything else available (benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...). This is our core product.

RCLI, an open-source CLI (MIT) that demonstrates what MetalRT enables. It wires STT + LLM + TTS into a real voice pipeline with 43 macOS actions, local RAG, and a TUI. Think of it as the reference application built on top of the engine.

On RAG specifically: voice + document Q&A is a natural pairing for on-device use cases. You have sensitive documents you don't want to upload to the cloud, so you ingest them locally and then ask questions by voice. The retrieval runs at ~4 ms over 5K+ chunks, so it feels instant in the voice pipeline. It's not random; it's one of the strongest privacy arguments for running everything locally.
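For intuition on what hybrid retrieval typically means in a setup like this: combine a lexical score (keyword overlap) with a vector-similarity score and rank by a weighted sum. A toy pure-Python sketch; the weighting, scoring functions, and chunk format are illustrative assumptions, not RCLI's actual implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    """Fraction of query words that appear in the chunk."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_search(query, query_vec, chunks, alpha=0.5):
    """chunks: list of (text, embedding). Returns the best-scoring chunk text."""
    scored = [
        (alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, vec), text)
        for text, vec in chunks
    ]
    return max(scored)[1]

chunks = [
    ("quarterly revenue report", [1.0, 0.0]),
    ("cat photos from vacation", [0.0, 1.0]),
]
best = hybrid_search("what was revenue last quarter", [0.9, 0.1], chunks)
assert best == "quarterly revenue report"
```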

The longer-term vision is bringing MetalRT to more chips and platforms, so any developer can get cloud-competitive inference on-device with minimal integration effort.

drcongo · 6 days ago
I came to the comments here to see if anyone had worked out what it is, so you're not alone.
glitchc · 6 days ago
From TFA: Document Intelligence (RAG): Ingest docs, ask questions by voice — ~4ms hybrid retrieval.

Seems pretty clear. You can supply documents to the model as input and then verbally ask questions about them.

vessenes · 6 days ago
Just tried it. Really cool, and a fun tech demo with rcli. I filed a bug report; not everything loads properly when installed via Homebrew.

Quick request: unsloth quants; they're usually better bit-for-bit. Or, more generally, a UI for Hugging Face model selection. I understand you won't be able to serve everything, but I want to mix and match!

Also - grounding:

"open safari" (safari opens, voice says: "I opened safari")

"navigate to google.com in safari" (nothing happens, voice says: "I navigated to google.com")

Anyway, really fun.

sanchitmonga22 · 5 days ago
Thanks for trying it and for filing the bug, we're looking into the homebrew install issue.

On unsloth quants: agreed, they're consistently better bit-for-bit. Adding broader quantization format support (including unsloth's approach) is on the roadmap. Right now MetalRT works with MLX 4-bit files and GGUF Q4_K_M; we want to expand that.

On the grounding issue ("navigate to google.com" not actually navigating): you're right, that's a gap. The "open_url" action exists, but the LLM doesn't always route to it correctly, especially with compound commands. Small models (0.6B-1.2B) have limited tool-calling accuracy; upgrading to Qwen3.5 4B via `rcli upgrade-llm` helps significantly. We're also improving the action routing prompts.

Appreciate the detailed feedback, this is exactly what we need.

blks · 6 days ago
> "open safari" (safari opens, voice says: "I opened safari")
> "navigate to google.com in safari" (nothing happens, voice says: "I navigated to google.com")

So you're describing a core feature that's broken, with the application failing at the simplest test.

sanchitmonga22 · 5 days ago
Fair criticism. The action executed on the LLM side but didn't translate to the correct macOS action: the model hallucinated success instead of routing to the open_url tool.

This is a known limitation with small LLMs (0.6B-1.2B) doing tool calling. They sometimes confuse "I know what you want" with "I did it." Upgrading to a larger model improves tool-calling accuracy significantly.

We're also working on verification: having the pipeline confirm the action actually succeeded before reporting back. That's a fair expectation, and we should meet it.

Tacite · 6 days ago
How did you try it? You said on github it doesn't work.
wlesieutre · 6 days ago
They said it didn't work installed from homebrew, so I assume they went back and did the curl | bash install option
vessenes · 6 days ago
It loads after those errors. Tap space and talk to it.

jonhohle · 6 days ago
If I send a Portfile patch, would you consider MacPorts distribution?
halostatue · 6 days ago
You're welcome to add me as a co-maintainer on this if you submit it to macports/macports-ports:

     {macports.halostatue.ca:austin @halostatue}
I regularly maintain https://github.com/macports/macports-ports/blob/master/sysut... amongst other things.

sanchitmonga22 · 5 days ago
Absolutely, we'd welcome a Portfile contribution. Happy to review and merge. If halostatue wants to co-maintain, even better.

Feel free to open a PR or issue on the RCLI repo and we'll coordinate.

AmanSwar · 6 days ago
yes please
mhamann · 4 days ago
Can you help me understand MetalRT a bit more? Based on the name, it sounds like something that's Apple-only (although, Apple basically co-opted the name Metal, which was traditionally more generic). Does or will MetalRT run on more platforms?

What about MetalRT's relationship to llama.cpp, onnx, MLX, transformers, etc? Is MetalRT a replacement for those? Designed to be compatible with a wide variety of model formats? Or are you just providing an abstraction on top of these?

AmanSwar · 4 days ago
MetalRT is a Metal-only inference engine (we're building versions for other hardware too). Think of it like SGLang or vLLM, but for single-batch inference on Apple Silicon. See this blog post: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...
tristor · 6 days ago
> What would you build if on-device AI were genuinely as fast as cloud?

I think this has to be the future for AI tools to really be truly useful. The things that are truly powerful are not general purpose models that have to run in the cloud, but specialized models that can run locally and on constrained hardware, so they can be embedded.

I'd love to see this added in-path as an audio passthrough device, so you could add native on-device transcription to any application that handles audio, such as video conferencing apps.

sanchitmonga22 · 5 days ago
This is a great idea. A virtual audio device that sits in the path of any audio stream and provides live transcription would be huge for video conferencing, lectures, and podcasts.

MetalRT's STT numbers make this feasible: 70 seconds of audio transcribed in 101ms means you could process audio chunks in real-time with massive headroom. The latency would be imperceptible.

We haven't built this yet, but it's a compelling use case. CoreAudio supports virtual audio devices (aggregate devices) that could pipe audio through the pipeline. If anyone in this thread has experience building macOS audio HAL plugins and wants to collaborate, we're very open to contributions; RCLI is MIT.

tristor · 5 days ago
Something that could be possible is serving the model as a virtual audio device; then you could use existing macOS tools like Rogue Amoeba's Loopback to split audio to that virtual device and your other output (you'd configure the Loopback device as the output in your system audio settings).

I have never written audio drivers on macOS, but maybe it's worth exploring to see if I can make this happen. I really appreciate high-quality AI transcripts in my meetings, but right now only Webex has good transcription, and a lot of meetings use other services like MS Teams, Zoom, Meet, et al.

DetroitThrow · 6 days ago
Wow, this is such a cool tool, and love the blog post. Latency is killer in the STT-LLM-TTS pipeline.

Before I install, is there any telemetry enabled here or is this entirely local by default?

shubham2802 · 6 days ago
Fully local - no data is collected!!

mips_avatar · 6 days ago
Have you tried any really big models on a mac studio? I'm wondering what latency is like for big qwens if there's enough memory.
sanchitmonga22 · 5 days ago
Not yet with MetalRT; right now we support models up to ~4B parameters (Qwen3 4B, Llama 3.2 3B, LFM2.5 1.2B). These are optimized for the voice pipeline use case, where decode speed and latency matter more than model size.

Expanding to larger models (7B, 14B, 32B) on machines with more unified memory is on the roadmap. The Mac Studio with 192 GB would be an interesting target: a 32B model at 4-bit would fit comfortably, and MetalRT's architectural advantages (fused kernels, minimal dispatch overhead) should scale well.
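The "fits comfortably" estimate follows from simple arithmetic: quantized weight size is roughly parameter count × bits per weight / 8, ignoring quantization block overhead, KV cache, and activations. A rough sketch:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 32B model at 4-bit is roughly 16 GB of weights,
# leaving ample headroom on a 192 GB Mac Studio.
assert weights_gb(32, 4) == 16.0

# The ~4B models currently supported are around 2 GB at 4-bit.
assert weights_gb(4, 4) == 2.0
```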

What model / use case are you thinking about? That helps us prioritize.

mips_avatar · 5 days ago
Well, it's just that I've noticed in the agents I've built that Qwen doesn't get reliable until around 27B, so unless you want to RL a small Qwen I don't think I'd get much useful help out of it.
asimovDev · 5 days ago
I'm running the 80B Qwen Coder Next 4-bit quant (MLX version) on a 96 GB M3 MacBook and it responds quickly, almost immediately. I can fit the model + 128k context comfortably into memory.
rushingcreek · 6 days ago
Very cool, congrats! I'm curious how you were able to achieve this given Apple's many undocumented APIs. Does it use private Neural Engine APIs or fully public Metal APIs?

Either way, this is a tremendous achievement and it's extremely relevant in the OpenClaw world where I might not want to have sensitive information leave my computer.

sanchitmonga22 · 5 days ago
Fully public Metal APIs, no private frameworks, no Neural Engine, no undocumented entitlements.

MetalRT is built on the public Metal API. The performance comes from how we use the GPU, not from accessing anything Apple doesn't document.

We specifically chose to stay on public APIs so that MetalRT works on any Apple Silicon Mac without special entitlements or SIP workarounds. This also means it's App Store-compatible for future macOS/iOS distribution.

The results speak for themselves: 1.1-1.19x faster than Apple's own MLX on identical model files, 4.6x faster on STT, 2.8x faster on TTS. Full methodology published here: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...

Appreciate the kind words, the "OpenClaw world" framing is exactly why we built this.