I would be interested if MetalRT can be used by other products, if you have some plans for open source products?
To me this is the nut to crack w.r.t. tool calling and locally running inference. This seems like a really cool project and I'm going to dig around a little later, but if it's hallucinating on something as basic as this, it makes me think it's more at the POC stage right now (to echo other sentiment here).
The inference engine (MetalRT) is production-grade, the pipeline architecture is solid, but the models at this size are still the weak link for complex tool routing. Larger model support (where tool calling is much more reliable) is next on the roadmap. Please stay tuned!
These 0.6B-4B models are, frankly, just amusing curiosities, and commonly regarded as too error-prone for any non-demo work.
The reason people are buying Apple Silicon today is that the unified memory lets them run larger models that are cost-prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model in the 20B+ range. Thanks.
You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, and 32B+ is on the roadmap. The architectural advantages MetalRT has should matter even more at that scale, where everything becomes memory-bandwidth-bound.
We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.
What does the ANE have to do with this?
The Neural Engine (ANE) and the M5 Neural Accelerator (NAX) are not the same thing. NAX can accelerate LLM prefill quite dramatically, although autoregressive decoding remains memory-bandwidth-bound.
I suspect the biggest blocker for Metal 4 adoption is the macOS Tahoe 26 requirement.
What about for on-device RAG use cases?
rcli rag ingest ~/Documents/notes
rcli ask --rag ~/Library/RCLI/index "summarize the project plan"
It uses hybrid retrieval (vector + BM25 with Reciprocal Rank Fusion) and runs at ~4ms over 5K+ chunks. Embeddings are computed locally with Snowflake Arctic, so nothing leaves your machine.
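For anyone curious what the fusion step looks like: Reciprocal Rank Fusion just scores each chunk by the sum of 1/(k + rank) across the vector and BM25 result lists. A minimal Python sketch (the function name, chunk IDs, and k=60 are illustrative, not RCLI's actual code):

```python
def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists of chunk IDs via Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears
    in; k=60 is the constant commonly used for RRF.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a vector search and a BM25 search
vector_hits = ["c7", "c2", "c9"]
bm25_hits = ["c2", "c4", "c7"]
print(rrf_fuse([vector_hits, bm25_hits]))  # → ['c2', 'c7', 'c4', 'c9']
```

A chunk that ranks well in both lists ("c2" here) beats one that tops only a single list, which is why RRF is a popular no-tuning way to combine lexical and semantic retrieval.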
We run anywhere, hence RunAnywhere. MetalRT is the fastest inference engine we've made for Apple silicon, and we'll be covering other edge devices as well. All edge is about to hit warp speed!