darkolorin commented on Show HN: We made our own inference engine for Apple Silicon (github.com/trymirai/uzu...) · Posted by u/darkolorin
giancarlostoro · a month ago
Hoping the author can answer, since I'm still learning how this all works. My understanding is that inference is "using the model", so to speak. How is this faster than established inference engines, specifically on Mac? Are models generic enough that an inference engine focused on, say, AMD or even Intel GPUs could achieve reasonable performance? I always assumed that because Nvidia is king of AI you had to suck it up, or is it just that most inference engines in use are married to Nvidia?

I would love to understand how universal these models can become.

darkolorin · a month ago
Basically, “faster” means better performance, e.g. tokens/s, without losing quality (benchmark scores for the models). So when we say faster, we mean we deliver more tokens per second than llama.cpp. That means we utilize the available hardware APIs more effectively (for example, we wrote our own kernels) to perform better.
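For context, "tokens per second" here is simply generated tokens divided by wall-clock decode time. A minimal sketch of how you might measure it for any engine; the `generate_fn` hook and the dummy stand-in engine are illustrative assumptions, not part of uzu's or llama.cpp's API:

```python
import time

def tokens_per_second(generate_fn, prompt: str, max_tokens: int) -> float:
    """Time a generation call and report decode throughput.

    `generate_fn` is a placeholder for whatever engine you are
    benchmarking (llama.cpp bindings, MLX, uzu, ...); it is assumed
    to return the list of generated token ids.
    """
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Stand-in engine so the sketch runs on its own; swap in a real one.
def dummy_engine(prompt, max_tokens):
    time.sleep(0.5)                  # pretend to decode
    return list(range(max_tokens))   # pretend these are token ids

print(f"{tokens_per_second(dummy_engine, 'Hello', 128):.1f} tok/s")
```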

darkolorin commented on 90T/s on my iPhone llama3.2-1B-fp16 (reddit.com/r/LocalLLaMA/s...) · Posted by u/darkolorin
darkolorin · 5 months ago
I made it! 90 t/s on my iPhone with llama1b fp16

We completely rewrote the inference engine and used a few tricks. This is summarization with Llama 3.2 1B in float16, and most of the time we're much faster than MLX. Let me know in the comments if you want to test the inference and I'll post a link.
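If you want a rough MLX baseline to compare against on your own Apple Silicon machine, the `mlx_lm` package can report generation speed directly. A sketch under assumptions: the exact model repo name below is a guess, so substitute whatever Llama 3.2 1B checkpoint you actually have.

```python
# Rough MLX baseline for comparison; requires `pip install mlx-lm`
# and runs only on Apple Silicon.
from mlx_lm import load, generate

# Assumed model repo; substitute the Llama 3.2 1B build you use locally.
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-bf16")

prompt = "Summarize: Apple Silicon unifies CPU and GPU memory, which ..."

# verbose=True prints generation stats (including tokens/sec) with the output.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```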

u/darkolorin

Karma: 73 · Cake day: August 5, 2014
About
prev CEO & co-founder Prisma & Capture, now CEO & co-founder LFG