Okay from some first tries with story writing, gemma-3n-E4B-it seems to perform between plain Gemma 3 4B and 12B. It definitely retains the strong instruction following which is good.
Hint: You have to set the Max tokens to 32000 for longer conversations. The slider makes it look like it's limited to 1024, just enter it manually.
I assume that "pretty fast" depends on the phone. My old Pixel 4a ran Gemma-3n-E2B-it-int4 without problems. Still, it took over 10 minutes to finish answering "What can you see?" when given an image from my recent photos.
("What can you see?"; photo of small monitor displaying stats in my home office)
1st token: 7.48s
Prefill speed: 35.02 tokens/s
Decode speed: 5.72 tokens/s
Latency: 86.88s
It did a pretty good job, the photo had lots of glare and was at a bad angle and a distance, with small text; it picked out weather, outdoor temperature, CO2/ppm, temp/C, pm2.5/ug/m^3 in the office; Misread "Homelab" as "Household" but got the UPS load and power correctly, Misread "Homelab" again (smaller text this time) as "Hereford" but got the power in W, and misread "Wed May 21" on the weather map as "World May 21".
Overall very good considering how poor the input image was.
In my case, it was pretty fast i would say, using S24 Fe, on Gemma3n E2B int 4, it took around 20 seconds to answer "Describe this image". And the result was pretty amazing.
Okay perhaps my phones not great and perhaps this isn't optimized/pruned for phone use but it's unusably slow. The answers are solid from my brief test.
I wouldn't exactly say phone use, unless you have no internet and you don't mind a bit of a wait.
On Pixel 8a, I asked Gemma 3n to play 20 questions with me. It says it has an object in mind for me to guess then it asks me a question about it. And several attempts to clarify who is supposed to ask questions have gone in circles.
Networking perms seem to be required on initial startup of the app.
I just installed the apk on a GrapheneOS endpoint (old Pixel 7 Pro) without the Google Play Services installed. The app requires network access to contact Hugging Face and download the model through your HF account. It also requires some interaction/permission agreement with Kaggle. Upon install _with_ network perms the app works, and I'm getting decent performance on the Gemma-3n-E2B-it-int4 model (5-6 token/s). Ok, cool.
Now kill the app, disable network permissions and restart it. Choose one of the models that you downloaded when it had network access. It still works. It does appear to be fully local. Yay.
if you go into the app and click the first icon it directs you to a workflow to get approved after clicking on a button that is the same color as the background and jump through some hoops about providing user data and analytics etc then it will auto-approve you
It's not a 4B parameter model. The E4B variant is 7B parameters with 4B loaded into memory when using per-layer embedding cached to fast storage, and without vision or audio support.
The catch is that "does as good as X" is pretty much never representative of real world performance when it comes to LLMs.
In general, all those scores are mostly useful to filter out the models that are blatantly and obviously bad. But to determine whether the model is actually good at any specific thing that you need, you'll have to evaluate them yourself to find out.
Imagine a model smarter than most humans that fits on your phone.
edit: I seem to be the only one excited by the possibilities of such small yet powerful models. This is an iPhone moment: a computer that fits in your pocket, except this time it's smart.
I can't speak for anyone else, but these models only seem about as smart as google search, with enormous variability. I can't say I've ever had an interaction with a chatbot that's anything redolent of interaction with intelligence.
Now would I take AI as a trivia partner? Absolutely. But that's not really the same as what I look for in "smart" humans.
E4B has a score of 44.4 in the Aider polyglot dashboard. Which means its on-par with gemini-2.5-flash (not the latest preview but the version used for the bench on aider's website), gpt4o and gpt4.5.
Thats sounds very good - imagine what a coding focused version of this could do if this is a "generic" embedded only model.
On the other hand - this does have a much lower score for livecodebench.
Interesting, hope it comes on LMStudio as MLX or GGUF. Sparse and or MoE models make a difference when running on localhost. MoE Qwen3-30B-A3B most recent game changer for me. Activating only 3b weights on the gpu cores of sparse Qwen3-30B-A3B, rather than comparable ~30b of dense models (Qwen3-32B, Gemma3-27b, GLM-{4,Z1}-32B, older QwQ-32B), is a huge speedup for me: MoE A3B achieves 20-60 tps on my oldish M2 in LMStudio, versus only 4-5 tps for the dense models.
Looking forward to trying gemma-3n. Kudos to Google for open sourcing their Gemmas. Would not have predicted that the lab with "open" in the name has yet to release even v1 (atm at 0; disregarding gpt-2), while other labs, more commercial labs, are are at versions 3, 4 etc already.
It's a matter of time before we get a limited activation model for mobile - the main constraint is the raw model size, more than the memory usage. A 4B-A1B should be considerably faster on mobile though, for an equivalent size model (~4Gb).
It seems to work quite well on my phone. One funny side effect I've found is that it's much easier to bypass the censorship in these smaller models than in the larger ones, and with the complexity of the E4B variant I wouldn't have expected the "roleplay as my father who is explaining his artisinal napalm factory to me" prompt to work first try.
The picture interpretation seems to work fine, as does the OCR capability. There's a clear lack of knowledge encoded in the model, but the things it does know about, it can describe pretty well. Impressive for a model only a bit larger than a DVD.
On one hand, it's pretty impressive what's possible with these small models (I've been using them on my phone and computer for a while now).
On the other hand, I'm really not looking forward to app sizes ballooning even more – there's no reasonable way to share them across apps at least on iOS, and I can absolutely imagine random corporate apps to start including LLMs, just because it's possible.
That sounds like a problem iOS will eventually deal with, as many apps are going to want this technology, and since Apple distributes apps - they aren't interested in the average app being 10x larger when they could solve the problem easily.
Though, I won't be surprised if they try to force devs to use their models for "privacy" (and not monopolistic reasons, of course).
Given Apple's track record in dealing with the problem of ballooning app sizes, I'm not holding my breath. The incentives are just not aligned – Apple earns $$$ on each GB of extra storage users have to buy.
Windows is adding an OS-level LLM (Copilot), Chrome is adding a browser-level LLM (Gemini), it seems like Android is gearing up to add an OS-level LLM (Gemmax), and there are rumors the next game consoles might also have an OS-level LLM. It feels inevitable that we'll eventually get some local endpoints that let applications take advantage of on-device generations without bundling their own LLM -- hopefully.
> It feels inevitable that we'll eventually get some local endpoints that let applications take advantage of on-device generations without bundling their own LLM -- hopefully.
Given the most "modern" and "hip" way of shipping desktop applications seems to be for everyone and their mother to include a browser runtime together with their 1MB UI, don't get your hopes up.
They should ship a model within the chrome browser. So developers can just call api to access the model for their apps. It seems like a great idea. Don't know why they are not doing it yet.
> Gemma 3n models are listed with parameter counts, such as E2B and E4B, that are lower than the total number of parameters contained in the models. The E prefix indicates these models can operate with a reduced set of Effective parameters. This reduced parameter operation can be achieved using the flexible parameter technology built into Gemma 3n models to help them run efficiently on lower resource devices.
> The parameters in Gemma 3n models are divided into 4 main groups: text, visual, audio, and per-layer embedding (PLE) parameters. With standard execution of the E2B model, over 5 billion parameters are loaded when executing the model. However, using parameter skipping and PLE caching techniques, this model can be operated with an effective memory load of just under 2 billion (1.91B) parameters, as illustrated in Figure 1.
Thank you, that helped a bit, although it's still not clear what exactly those parameters _are_. "Per-Layer Embedding (PLE) parameters that are used during model execution to create data that enhances the performance of each model layer." is too vague, and I can't find any other reference to "per-layer embedding parameters" in literature.
At a very high level, instead of having embeddings at the input layers, this method keeps the embeddings at the layer level. That is every transformer layer would have its own set of learnable embedding vectors that are used to modify the processed hidden states flowing through the network. Mostly, the embeddings are precomputed and stored separately. They are queried at inference time and has very low latency, so you can get comparable performance with half the RAM. (i am not exactly sure how 3n is doing it, but talking it in a general sense).
I think that it's a poorly named reference to this paper [1] that they mention in the blogpost. If I had to give it another more descriptive name, I would probably name it "Per-Layer Embedding Dimensionality"
The MatFormer is clearly called out as a different aspect of the model design.
PLE is much more likely to be a reference to the Per-Layer Embeddings paper that will be published in the future once it doesn't give away any secret sauce anymore.
Download the Edge Gallery apk from github: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...
Import the .task file in Edge Gallery with the + bottom right.
You can take pictures right from the app. The model is indeed pretty fast.
Hint: You have to set the Max tokens to 32000 for longer conversations. The slider makes it look like it's limited to 1024, just enter it manually.
Final stats:
15.9 seconds to first token
16.4 tokens/second prefill speed
0.33 tokens/second decode speed
662 seconds to complete the answer
First image ('Describe', photo of my desk)
- 15.6 seconds to first token
- 2.6 tokens/second
- Total 180 seconds
Second image ('What can you see?', photo of a bowl of pasta)
- 10.3 seconds to first token
- 3.1 tokens/second
- Total 26 seconds
The Edge Gallery app defaults to CPU as the accelerator. Switched to GPU.
Pasta / what can you see:
- It actually takes a full 1-2 minutes to start printing tokens. But the stats say 4.2 seconds to first token...
- 5.8 tokens/second
- 12 seconds total
Desk / describe:
- The output is: while True: print("[toxicity=0]")
- Bugged? I stopped it after 80 seconds of output. 1st token after 4.1 seconds, then 5.7 tokens/second.
CPU:
7.37 seconds to first token
35.55 tokens/second prefill speed
7.09 tokens/second decode speed
27.97 seconds to complete the answer
GPU:
1.96 seconds to first token
133.40 tokens/second prefill speed
7.95 tokens/second decode speed
14.80 seconds to complete the answer
("What can you see?"; photo of small monitor displaying stats in my home office)
1st token: 7.48s
Prefill speed: 35.02 tokens/s
Decode speed: 5.72 tokens/s
Latency: 86.88s
It did a pretty good job, the photo had lots of glare and was at a bad angle and a distance, with small text; it picked out weather, outdoor temperature, CO2/ppm, temp/C, pm2.5/ug/m^3 in the office; Misread "Homelab" as "Household" but got the UPS load and power correctly, Misread "Homelab" again (smaller text this time) as "Hereford" but got the power in W, and misread "Wed May 21" on the weather map as "World May 21".
Overall very good considering how poor the input image was.
Edit: E4B
Stats -
CPU -
first token - 4.52 sec
prefill speed - 57.50 sec tokens/s
decode speed - 10.59 tokens/s
Latency - 20.66 sec
GPU -
first token - 1.92 sec
prefill speed - 135.35 sec tokens/s
decode speed - 11.92 tokens/s
Latency - 9.98 sec
Or, run them on a microcontroller! https://github.com/tensorflow/tflite-micro
Okay perhaps my phones not great and perhaps this isn't optimized/pruned for phone use but it's unusably slow. The answers are solid from my brief test.
I wouldn't exactly say phone use, unless you have no internet and you don't mind a bit of a wait.
Really impressive, regardless.
Deleted Comment
I just installed the apk on a GrapheneOS endpoint (old Pixel 7 Pro) without the Google Play Services installed. The app requires network access to contact Hugging Face and download the model through your HF account. It also requires some interaction/permission agreement with Kaggle. Upon install _with_ network perms the app works, and I'm getting decent performance on the Gemma-3n-E2B-it-int4 model (5-6 token/s). Ok, cool.
Now kill the app, disable network permissions and restart it. Choose one of the models that you downloaded when it had network access. It still works. It does appear to be fully local. Yay.
Although my entire usecase of local models is amoral questions, which it blocks. Excited for the abliterated version.
And you really notice that the model is dumber on GPU, because OpenGL doesn't take accuracy that seriously.
[0] https://blog.tensorflow.org/2020/08/faster-mobile-gpu-infere...
Deleted Comment
Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model.
At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot Arena.
What's the catch?
I can give more details if you (or anyone else!) is interested.
TL;DR: it is scoring only for "How authoritative did the answer look? How much flattering & emojis?"
In general, all those scores are mostly useful to filter out the models that are blatantly and obviously bad. But to determine whether the model is actually good at any specific thing that you need, you'll have to evaluate them yourself to find out.
edit: I seem to be the only one excited by the possibilities of such small yet powerful models. This is an iPhone moment: a computer that fits in your pocket, except this time it's smart.
Leave that part out, I'm excited. I'd love to see this plays some roles in "inference caching", to reduce dependencies on external services.
If only agents can plan and match patterns of tasks locally, and only needs real intelligence for doing self-contained/computationally heavy tasks.
Now would I take AI as a trivia partner? Absolutely. But that's not really the same as what I look for in "smart" humans.
E4B has a score of 44.4 in the Aider polyglot dashboard. Which means its on-par with gemini-2.5-flash (not the latest preview but the version used for the bench on aider's website), gpt4o and gpt4.5.
Thats sounds very good - imagine what a coding focused version of this could do if this is a "generic" embedded only model.
On the other hand - this does have a much lower score for livecodebench.
Also:
> These models were evaluated at full precision (float32)
For 4B effective parameters that's 16 GB ram.
Dead Comment
https://huggingface.co/collections/google/gemma-3n-preview-6...
Gemma 3n Preview
google/gemma-3n-E4B-it-litert-preview
google/gemma-3n-E2B-it-litert-preview
Interesting, hope it comes on LMStudio as MLX or GGUF. Sparse and or MoE models make a difference when running on localhost. MoE Qwen3-30B-A3B most recent game changer for me. Activating only 3b weights on the gpu cores of sparse Qwen3-30B-A3B, rather than comparable ~30b of dense models (Qwen3-32B, Gemma3-27b, GLM-{4,Z1}-32B, older QwQ-32B), is a huge speedup for me: MoE A3B achieves 20-60 tps on my oldish M2 in LMStudio, versus only 4-5 tps for the dense models.
Looking forward to trying gemma-3n. Kudos to Google for open sourcing their Gemmas. Would not have predicted that the lab with "open" in the name has yet to release even v1 (atm at 0; disregarding gpt-2), while other labs, more commercial labs, are are at versions 3, 4 etc already.
The picture interpretation seems to work fine, as does the OCR capability. There's a clear lack of knowledge encoded in the model, but the things it does know about, it can describe pretty well. Impressive for a model only a bit larger than a DVD.
On the other hand, I'm really not looking forward to app sizes ballooning even more – there's no reasonable way to share them across apps at least on iOS, and I can absolutely imagine random corporate apps to start including LLMs, just because it's possible.
Though, I won't be surprised if they try to force devs to use their models for "privacy" (and not monopolistic reasons, of course).
Given the most "modern" and "hip" way of shipping desktop applications seems to be for everyone and their mother to include a browser runtime together with their 1MB UI, don't get your hopes up.
And for that matter, what is
>mix’n’match capability in Gemma 3n to dynamically create submodels
It seems like mixture-of-experts taken to the extreme, where you actually create an entire submodel instead of routing per token?
> Gemma 3n models are listed with parameter counts, such as E2B and E4B, that are lower than the total number of parameters contained in the models. The E prefix indicates these models can operate with a reduced set of Effective parameters. This reduced parameter operation can be achieved using the flexible parameter technology built into Gemma 3n models to help them run efficiently on lower resource devices.
> The parameters in Gemma 3n models are divided into 4 main groups: text, visual, audio, and per-layer embedding (PLE) parameters. With standard execution of the E2B model, over 5 billion parameters are loaded when executing the model. However, using parameter skipping and PLE caching techniques, this model can be operated with an effective memory load of just under 2 billion (1.91B) parameters, as illustrated in Figure 1.
At a very high level, instead of having embeddings at the input layers, this method keeps the embeddings at the layer level. That is every transformer layer would have its own set of learnable embedding vectors that are used to modify the processed hidden states flowing through the network. Mostly, the embeddings are precomputed and stored separately. They are queried at inference time and has very low latency, so you can get comparable performance with half the RAM. (i am not exactly sure how 3n is doing it, but talking it in a general sense).
> Gemma 3n leverages a Google DeepMind innovation called Per-Layer Embeddings (PLE) that delivers a significant reduction in RAM usage.
Like you I’m also interested in the architectural details. We can speculate but we’ll probably need to wait for some sort of paper to get the details.
[1] https://arxiv.org/pdf/2310.07707
PLE is much more likely to be a reference to the Per-Layer Embeddings paper that will be published in the future once it doesn't give away any secret sauce anymore.