I would be interested if MetalRT can be used by other products, if you have some plans for open source products?
To me this is the nut to crack w.r.t. tool calling and locally running inference. This seems like a really cool project and I'm going to dig around a little later, but if it's hallucinating on something as basic as this, it makes me think it's more at the POC stage right now (to echo other sentiment here).
The inference engine (MetalRT) is production-grade, the pipeline architecture is solid, but the models at this size are still the weak link for complex tool routing. Larger model support (where tool calling is much more reliable) is next on the roadmap. Please stay tuned!
These 0.6B-4B models are, frankly, just amusing curiosities, and commonly regarded as too error-prone for any non-demo work.
The reason people are buying Apple Silicon today is that the unified memory lets them run larger models that are cost-prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model in the 20B+ range. Thanks.
You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, and 32B+ is on the roadmap. The architectural advantages MetalRT has should matter even more at that scale, where everything becomes memory-bandwidth-bound.
We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.
What does the ANE have to do with this?
The Neural Engine (ANE) and the M5 Neural Accelerator (NAX) are not the same thing. NAX can accelerate LLM prefill quite dramatically, although autoregressive decoding remains memory-bandwidth-bound.
I suspect the biggest blocker for Metal 4 adoption is the macOS Tahoe 26 requirement.
What about for on-device RAG use cases?
rcli rag ingest ~/Documents/notes
rcli ask --rag ~/Library/RCLI/index "summarize the project plan"
It uses hybrid retrieval (vector + BM25 with Reciprocal Rank Fusion) and runs at ~4ms over 5K+ chunks. Embeddings are computed locally with Snowflake Arctic, so nothing leaves your machine.
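For anyone curious what the fusion step looks like: Reciprocal Rank Fusion just scores each chunk by the sum of 1/(k + rank) across the vector and BM25 result lists. A minimal Python sketch (the function name, chunk IDs, and k=60 are illustrative, not RCLI's actual code):

```python
def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists of chunk IDs via Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears
    in; k=60 is the constant commonly used for RRF.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a vector search and a BM25 search
vector_hits = ["c7", "c2", "c9"]
bm25_hits = ["c2", "c4", "c7"]
print(rrf_fuse([vector_hits, bm25_hits]))  # → ['c2', 'c7', 'c4', 'c9']
```

A chunk that ranks well in both lists ("c2" here) beats one that tops only a single list, which is why RRF is a popular no-tuning way to combine lexical and semantic retrieval.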
We run anywhere, hence RunAnywhere. MetalRT is the fastest inference engine we've made for Apple silicon, and we'll be covering other edge devices as well. All edge is about to hit warp speed!