I assume that "pretty fast" depends on the phone. My old Pixel 4a ran Gemma-3n-E2B-it-int4 without problems. Still, it took over 10 minutes to finish answering "What can you see?" when given an image from my recent photos.
In my case it was pretty fast, I would say. On an S24 FE with Gemma 3n E2B int4, it took around 20 seconds to answer "Describe this image", and the result was pretty amazing.
Final stats:
15.9 seconds to first token
16.4 tokens/second prefill speed
0.33 tokens/second decode speed
662 seconds to complete the answer
Stats -
CPU -
first token - 4.52 sec
prefill speed - 57.50 tokens/s
decode speed - 10.59 tokens/s
Latency - 20.66 sec
GPU -
first token - 1.92 sec
prefill speed - 135.35 tokens/s
decode speed - 11.92 tokens/s
Latency - 9.98 sec
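
For anyone comparing these runs: total latency depends on how long the answer was, not just on the phone, so it can help to back out a rough output length from the posted numbers. The snippet below is only a sketch. It assumes latency is roughly time to first token plus decode time, and it assumes the first stats block belongs to the Pixel 4a run and the second to the S24 FE run (the thread does not label them). The function name is made up for illustration.

```python
def estimated_output_tokens(total_latency_s, first_token_s, decode_tokens_per_s):
    """Rough number of tokens generated after the first token appeared.

    Assumes: total latency ~= time to first token + decode time.
    """
    decode_time_s = total_latency_s - first_token_s
    return decode_time_s * decode_tokens_per_s


# (total latency s, time to first token s, decode tokens/s), copied from the stats above.
# The device labels are an assumption based on which comment each block seems to match.
runs = {
    "Pixel 4a": (662.0, 15.9, 0.33),
    "S24 FE (CPU)": (20.66, 4.52, 10.59),
    "S24 FE (GPU)": (9.98, 1.92, 11.92),
}

for name, (latency, first_token, decode) in runs.items():
    tokens = estimated_output_tokens(latency, first_token, decode)
    print(f"{name}: ~{tokens:.0f} output tokens")
```

Under those assumptions the three answers differ in length by roughly a factor of two, so decode speed (tokens/s) is the fairer number to compare across phones than total latency.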