Readit News
aazo11 commented on The Accuracy of On-Device LLMs   medium.com/@aazo11/on-the... · Posted by u/aazo11
aazo11 · 7 months ago
I tested on-device LLMs (Gemma, DeepSeek) across prompt cleanup, PII redaction, math, and general knowledge on my M2 Max laptop using LM Studio + DSPy.

Some observations:

- Gemma-3 is the best model for on-device inference.
- 1B models look fine at first but break under benchmarking.
- 4B can handle simple rewriting and PII redaction. It also did math reasoning surprisingly well.
- General knowledge Q&A does not work with a local model. This might work with a RAG pipeline or additional tools.
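
For reference, a minimal sketch of how a task like the PII redaction check can be driven against LM Studio's local OpenAI-compatible server; the port, model identifier, and prompt are assumptions, and the DSPy scoring harness is omitted:

```python
# Hedged sketch: exercising the PII-redaction task against a model served by
# LM Studio's local OpenAI-compatible endpoint. The port, model identifier,
# and prompt are assumptions; the DSPy scoring harness is omitted.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def redact_pii(text: str) -> str:
    resp = client.chat.completions.create(
        model="gemma-3-4b-it",   # whatever identifier LM Studio shows for the loaded model
        messages=[
            {"role": "system",
             "content": "Replace all names, emails, and phone numbers with [REDACTED]. "
                        "Return only the rewritten text."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

print(redact_pii("Contact Jane Doe at jane@example.com or 555-0100."))
```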

I plan to train and fine-tune 1B models to see if I can build high-accuracy, task-specific models under 1GB in the future.

aazo11 commented on AI's Version of Moore's Law   metr.org/blog/2025-03-19-... · Posted by u/aazo11
aazo11 · 8 months ago
The trend is that the length of tasks AI can do is doubling every 7 months.

Accompanying YT video: https://www.youtube.com/watch?v=evSFeqTZdqs

aazo11 commented on Lossless LLM compression for efficient GPU inference via dynamic-length float   arxiv.org/abs/2504.11651... · Posted by u/CharlesW
aazo11 · 8 months ago
This is a huge unlock for on-device inference. The download time of larger models makes local inference unusable for non-technical users.
aazo11 commented on Local LLM inference – impressive but too hard to work with   medium.com/@aazo11/local-... · Posted by u/aazo11
byyoung3 · 8 months ago
It's an if statement on whether the model has downloaded or not.
aazo11 · 8 months ago
A better solution would train/fine-tune the smaller model on the responses of the larger model, and only push inference to the edge if the smaller model is performant and the hardware specs can handle the workload.
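
A rough sketch of what that gating could look like; the accuracy threshold, RAM check, and model-calling helpers are hypothetical placeholders:

```python
# Hedged sketch: route a request to the distilled on-device model only when it
# has proven accurate enough and the machine has room for it; otherwise fall
# back to the larger cloud model. Thresholds and helpers are illustrative
# placeholders, not a real API.
import psutil

MIN_ACCURACY = 0.90      # agreement with the teacher model on a held-out eval set
MIN_FREE_RAM_GB = 8      # rough headroom needed for a quantized 1B-4B student

def local_student_model(prompt: str) -> str:
    ...  # placeholder: call the on-device model (e.g., via a local inference server)

def cloud_teacher_model(prompt: str) -> str:
    ...  # placeholder: call the hosted vendor API

def should_use_edge(student_eval_accuracy: float) -> bool:
    free_ram_gb = psutil.virtual_memory().available / 1e9
    return student_eval_accuracy >= MIN_ACCURACY and free_ram_gb >= MIN_FREE_RAM_GB

def answer(prompt: str, student_eval_accuracy: float) -> str:
    if should_use_edge(student_eval_accuracy):
        return local_student_model(prompt)
    return cloud_teacher_model(prompt)
```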
aazo11 commented on Local LLM inference – impressive but too hard to work with   medium.com/@aazo11/local-... · Posted by u/aazo11
zellyn · 8 months ago
Weird to give MacBook Pro specs and omit RAM. Or did I miss it somehow? That's one of the most important factors.
aazo11 · 8 months ago
Thanks for calling that out. It was 32GB. I updated the post as well.
aazo11 commented on Local LLM inference – impressive but too hard to work with   medium.com/@aazo11/local-... · Posted by u/aazo11
ijk · 8 months ago
There are two general categories of local inference:

- You're running a personal hosted instance. Good for experimentation and personal use, though there's a tradeoff versus renting a cloud server.

- You want to run LLM inference on client machines (i.e., you aren't directly supervising it while it is running).

I'd say that the article is mostly talking about the second one. Doing the first one will get you familiar enough with the ecosystem to handle some of the issues he ran into when attempting the second (e.g., exactly which model to use). But the second has a bunch of unique constraints--you want things to just work for your users, after all.

I've done in-browser neural network stuff in the past (back when using TensorFlow.js was a reasonable default choice) and based on the way LLM trends are going I'd guess that edge-device LLM inference will be relatively reasonable soon; I'm not quite sure that I'd deploy it in production this month, but ask me again in a few.

Relatively tightly constrained applications are going to benefit more than general-purpose chatbots; pick a small model that's relatively good at your task, train it on enough of your data, and you can get a 1B or 3B model with acceptable performance, let alone the 7B ones being discussed here. It absolutely won't replace ChatGPT (though we're getting closer to replacing ChatGPT 3.5 with small models). But if you've got a specific use case that will hold still long enough to deploy a model, it can definitely give you an edge versus relying on the APIs.
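
A minimal sketch of that kind of task-specific fine-tuning, assuming Hugging Face transformers/peft/datasets; the base model, data file, and hyperparameters are placeholders rather than anything from the thread:

```python
# Hedged sketch: LoRA fine-tuning a small causal LM on task-specific data with
# Hugging Face transformers + peft. The base model, data file, and
# hyperparameters below are placeholders, not values from the thread.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "google/gemma-3-1b-it"          # hypothetical ~1B base model
tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Attach low-rank adapters so only a small fraction of weights are trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Expect a JSONL file of {"text": "..."} records drawn from your task.
ds = load_dataset("json", data_files="task_data.jsonl")["train"]
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="task-model", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("task-model-adapter")   # small adapter shipped alongside the base weights
```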

I expect games to be one of the first to try this: per-player-action API costs murder per-user revenue, most of the gaming devices have some form of GPU already, and most games are shipped as apps so bundling a few more GB in there is, if not reasonable, at least not unprecedented.

aazo11 · 8 months ago
Very interesting. I had not thought about gaming at all but that makes a lot of sense.

I also agree the goal should not be to replace ChatGPT. I think ChatGPT is way overkill for a lot of the workloads it is handling. A good solution should probably use the cloud LLM outputs to train a smaller model to deploy in the background.

aazo11 commented on Local LLM inference – impressive but too hard to work with   medium.com/@aazo11/local-... · Posted by u/aazo11
bionhoward · 8 months ago
LM Studio seems pretty good at making local models easier to use
aazo11 · 8 months ago
They look awesome. Will try it out.
aazo11 commented on Local LLM inference – impressive but too hard to work with   medium.com/@aazo11/local-... · Posted by u/aazo11
antirez · 8 months ago
Download the model in the background. Serve the client with an LLM vendor API just for the first requests, or even with that same local LLM installed on your own servers (likely cheaper). By doing so, in the long run the inference cost is near zero, which allows you to use LLMs in otherwise impossible business models (like freemium).
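
A rough sketch of that routing; the download check, endpoint, and model names are assumptions:

```python
# Hedged sketch: serve requests from the cloud vendor API until the local model
# has finished downloading, then switch to the local OpenAI-compatible server.
# The path, endpoint, and model names are illustrative placeholders.
from pathlib import Path
from openai import OpenAI

LOCAL_MODEL_PATH = Path.home() / "models" / "local-model.gguf"   # written by a background downloader
local = OpenAI(base_url="http://localhost:1234/v1", api_key="local")
cloud = OpenAI()                                                 # uses OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    model_ready = LOCAL_MODEL_PATH.exists()          # the "if statement" mentioned upthread
    client = local if model_ready else cloud
    model = "local-model" if model_ready else "gpt-4o-mini"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```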
aazo11 · 8 months ago
Exactly. Why does this not exist yet?

u/aazo11

Karma: 325 · Cake day: March 20, 2019