Mostly SOTA performance at the 3B level. A notable addition to the small club of truly open models that provide full disclosure: the code and recipes to reproduce their work.
Looks like ballpark a million dollars of GPU time if you want to train one up for yourself (4000 GPUs / 24 days).
Very nice write-up that's generous in sharing their learnings.
I spent about 10 minutes this AM cross-checking against the Phi-4-mini benchmarks, as it seemed very odd not to include the leader in the comparisons, and SmolLM3 seemed to be universally behind it.
For context, I develop an LLM client; a core tenet is keeping local as close to cloud parity as is possible (via llama.cpp).
Companies aren't taking local AI seriously on a sustained basis outside Microsoft.
Overall, I would usually bite my tongue. HF is a great citizen, and I doubt this will be a one-off. But when I see superlatives affirmed while the local SoTA of many, many moons, a godsend in this sector, is left out of the comparison, I think it is better to stand up and say so than to shy away.
From the blog post:
"SmolLM3 supports tool calling, and its chat template incorporates two distinct sections for tool descriptions: XML Tools and Python Tools"
It's small (3B) and does great on benchmarks. This is a model for edge / mobile deployments, so the gains over Gemma 3 4B are meaningful. It has dual-mode reasoning / non-reasoning, AND they released the full training method:
> We're releasing SmolLM3 with our engineering blueprint. It includes architecture details, exact data mixtures showing how we progressively boost performance across domains in a three-stage pretraining approach, and the methodology for building a hybrid reasoning model. Usually, achieving these results would require months of reverse engineering. Instead, we're providing the full methodology.
I hate to say it, but reasoning models simply aren't suited for edge computing. I just ran some tests on this model, and even at 4-bit weight quantisation it blows past 10 GB of VRAM with just ~1000 tokens while it is still reasoning. So even if you're running on a dedicated ML edge device like a $250 Jetson, you will run out of memory before the model even formulates a real answer. You'll need a high-end GPU to make full use of it for limited answers and an enterprise-grade system to support longer contexts. And with reasoning turned off I don't see any meaningful improvement over older models.
So this is primarily great for enterprises who want to do on-prem with limited budgets and maybe high-end enthusiasts.
You should use flash attention with KV cache quantization. I routinely use Qwen 3 14B with the full 128k context and it fits in under 24 GB VRAM. On my Pixel 8, I've successfully used Qwen 3 4B with 8K context (again with flash attention and KV cache quantization).
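The parent doesn't say which runtime, but as one concrete example, llama.cpp's server exposes these as a flash-attention switch plus quantized KV-cache types (flag spellings can shift between llama.cpp versions, and the model filename and context size below are just placeholders):

    llama-server -m Qwen3-14B-Q4_K_M.gguf -c 131072 -fa -ctk q8_0 -ctv q8_0 -ngl 99

Quantizing the K and V caches to q8_0 roughly halves KV memory versus f16, which is what makes long contexts fit in this kind of VRAM budget.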
Wow. Close to a Qwen3 distill with 75% the size. That's great!
I've been using the SmolLM base models for my own finetunes just because they're so high quality; it looks like I might be using them to drive local agents/code completion in the near future too.
Their RL algorithm looks interesting. I'm still using OpenAI's algorithm for my stuff; I've been meaning to check on the SoTA, since I know my code is pretty outdated. (It's crazy how fast that happens with this stuff.)
This seems to be a persistent issue with almost all weight releases, even from bigger companies like Meta.
Are the people who release these weights not testing them in the various inference engines? It seems they make it work with Hugging Face's Transformers library and then call it a day, but sometimes not even that.
Oh yes, chat template issues are sadly quite pervasive: Llama as you mentioned, but also Qwen, Mistral, Google, the Phi team, DeepSeek. It's actually very common!
My take is that large labs with closed-source models also had issues in the beginning, but have most likely standardized their chat templates by now (e.g. OpenAI using ChatML). The OSS community, on the other hand, keeps experimenting with new templates, and adding tool calling causes a large headache. For example, in https://unsloth.ai/blog/phi3 we found many bugs in OSS models.
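(For reference, ChatML is just the convention of wrapping every turn in <|im_start|>role ... <|im_end|> markers, so a rendered prompt looks like:)

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    Hello!<|im_end|>
    <|im_start|>assistant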
No, they don't. Why would they? Most of them use a single inference engine, most likely developed in-house. Or they go for something like vLLM; llama.cpp in particular flies under their radar.
The reason is simple: there isn't much money in it. llama.cpp is free and targets the lower end of the hardware spectrum. Corporations will run something else or, even more likely, offload the task to a contractor.
Which small model is good for fine-tuning on various enterprise datasets? Our business units want to run small models in the browser and on mobile devices, without dealing with RAG or cloud resources.
Small models are bad at knowing things. Trying to train knowledge into small models is probably not the way you want to go. You could try building an offline embedded RAG system that is deployable as wasm; some folks have been having success with this.
We do use WebLLM and a hosted Weaviate database, but there are complaints about speed (both retrieval and time to first token, as the context gets big). The Gemma 3n "nesting doll" approach sounds like it could be useful, but I haven't found anyone specifically using it to add domain-specific knowledge.
You really need to try them all out yourself and make sure you have proper benchmarks.
While machine learning is not my field, I've tried to finetune Mistral 7B (following their official guide and toolset) and the results did not satisfy. I had a few very specific questions from the dataset that, no matter how much I finetuned and tweaked the process, it was never able to answer with the correct information.
A mix of vector search + keyword search is still better at building the right question context than expecting the model to learn all the information (see the sketch below for the kind of hybrid retrieval I mean).
I used the pretraining-style dataset approach. Maybe building synthetic questions and answers around the dataset yields better results, but I didn't have time to experiment with that approach.
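A minimal sketch of the vector + keyword hybrid mentioned above: fuse a BM25-style keyword ranking with an embedding-based ranking using reciprocal rank fusion (the ranked lists here are placeholders; a real setup would produce them from its own indexes):

    def reciprocal_rank_fusion(rankings, k=60):
        """Merge several ranked lists of document ids into one fused ranking."""
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # In a real setup these would come from a BM25 index and an embedding index.
    keyword_ranking = ["doc3", "doc1", "doc7"]
    vector_ranking = ["doc1", "doc9", "doc3"]
    context_docs = reciprocal_rank_fusion([keyword_ranking, vector_ranking])[:5]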
> Maybe building synthetic questions and answers around the dataset yields better results, but I didn't have time to experiment with that approach.
While they answer a slightly different question in the Physics of Language Models [1], based on their results it seems likely that you need to do that kind of augmentation of the dataset to get good results.
However, they also show that the dataset the base model is trained on can drastically affect finetuning performance. So if the base model was trained on a dataset that's poor for your specific task, perhaps you'll never get good performance.
[1]: https://physics.allen-zhu.com/part-3-knowledge/part-3-1
Bite the bullet and do some kind of RAG; you need to provide clear, authoritative information to a model that is skilled enough to remix it for the user.
Tuning the model to imitate the dataset will damage the model's skills and "common sense", but won't reliably train it to recall the information.
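Concretely, "provide clear, authoritative information" usually just means putting the retrieved passages into the prompt and asking the model to stay within them; a minimal sketch (the prompt wording is only an example):

    def build_grounded_prompt(question, passages):
        # Delimit the sources clearly so the model remixes them instead of
        # falling back on its own (unreliable) parametric memory.
        context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        return (
            "Answer the question using only the sources below. "
            "If the sources do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
        )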
I have fine-tuned Gemma 3N 2B and it's pretty good, but it loads slowly on my S23U; once it's loaded, though, it works fine.
I also tried SmolVLM 256M and 500M; they load faster and you can embed them in the app's assets. They work if you know what you're doing.
Just keep in mind that smaller models don't perform as well due to their limited parameter count.
Also, on Android you can't ship files larger than 2 GB due to Java compression issues, so you need to download models separately. You then can't load the model straight from the download folder; you have to copy it into the app's own folder. Since the download and the copy exist side by side, a Gemma 3N 2B model that's 3.14 GB needs at least 7 GB of free space on the user's phone.
I'm having trouble running this on my Mac - I've tried Ollama and llama.cpp llama-server so far, both using GGUFs from Hugging Face, but neither worked.
(llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'smollm3')
I've managed to run it using Python and transformers with PyTorch in device="cpu" mode but unsurprisingly that's really slow - it took 35s to respond to "say hi"!
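For anyone else going the transformers route on a Mac, targeting PyTorch's MPS backend instead of CPU should help; a minimal sketch, assuming the HuggingFaceTB/SmolLM3-3B repo id and a PyTorch build with MPS support:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed repo id, check the model card
    device = "mps" if torch.backends.mps.is_available() else "cpu"

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to(device)

    messages = [{"role": "user", "content": "say hi"}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Small token budget just to check speed; raise it for real answers.
    out = model.generate(inputs, max_new_tokens=64)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))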
Anyone had success with this on a Mac yet? I really want to get this running with tool calling, ideally via an OpenAI-compatible serving layer like llama-server.
Hey Simon, VB from Hugging Face here, the person who added the model to MLX and llama.cpp (with Son). The PR hasn't landed in llama.cpp yet, hence it doesn't work out of the box with llama.cpp installed via brew (and similarly doesn't work with Ollama, since they need to bump their llama.cpp runtime).
Could you please enlighten me regarding all these engines? I'm using llama.cpp and Ollama. Should I also try MLX, ONNX, vLLM, etc.? I'm not quite sure what the difference between all of them is. I'm running on CPU and sometimes GPU.
This is a solid and positive contribution.
Like if I really really just wanted to build it from scratch, could I do so? (not that I have that money but just curious)
./llama.cpp/llama-cli -hf unsloth/SmolLM3-3B-GGUF:Q4_K_XL --jinja -ngl 99
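(For reference: -hf pulls the GGUF straight from Hugging Face, --jinja applies the model's own chat template, and -ngl 99 offloads all layers to the GPU.)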
I hope you continue the 50-100M parameter models.
I think there is a case for models that finish fast on CPUs in "solve it with an LLM" test cases.
The easiest would be to install llama.cpp from source: https://github.com/ggml-org/llama.cpp
If you want to avoid it, I added SmolLM3 to MLX-LM as well:
You can run it via `mlx_lm.chat --model "mlx-community/SmolLM3-3B-bf16"`
(requires the latest mlx-lm to be installed)
here's the MLX-lm PR if you're interested: https://github.com/ml-explore/mlx-lm/pull/272
similarly, llama.cpp here: https://github.com/ggml-org/llama.cpp/pull/14581
Let me know if you face any issues!
Just curious, how frequently does that happen?