In our deployments, we've seen open source models rival and even outperform lower-tier cloud counterparts. Happy to share some benchmarks if you like.
Our pricing is per monthly active device, regardless of utilization. For voice-agent workflows, you typically reach break-even once you process more than roughly 2 minutes of inference per device per day.
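To make the break-even arithmetic concrete, here's a minimal sketch. The prices below are made-up placeholders for illustration only, not actual rates for either service:

```python
# Hypothetical break-even sketch: a flat per-device monthly fee vs.
# metered cloud inference billed per minute. All numbers are
# illustrative placeholders, not real pricing.
def break_even_minutes_per_day(device_price_per_month: float,
                               cloud_price_per_minute: float,
                               days_per_month: int = 30) -> float:
    """Daily inference minutes at which the flat per-device fee
    equals the metered cloud cost for the month."""
    return device_price_per_month / (cloud_price_per_minute * days_per_month)

# e.g. a $3/month device fee vs. $0.05/min cloud inference:
# 3 / (0.05 * 30) = 2.0 minutes per day
print(break_even_minutes_per_day(3.0, 0.05))  # 2.0
```

Above 2 minutes of daily inference per device (under these assumed prices), the flat fee comes out cheaper.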
I'm very curious what could be done with your impressive optimization on an RK3588, since it has pretty decent bits in all 3 categories, and I'm now seriously considering a Radxa Orion to play with this on :)
One more if you have a moment: will this be limited to text generation, or will it have audio and image capabilities as well? It would be neat to enable not only image generation but also voice recognition, translation, computer vision, and image editing and enhancement in mobile apps, beyond what the big players deign to give us :)
We don't advise using GPUs on smartphones, since they're very energy-inefficient. Mobile GPU inference is actually the main driver behind the stereotype that "mobile inference drains your battery and heats up your phone".
Wrt your last question – the short answer is yes, we'll have multimodal support. We currently support voice transcription and image understanding, and we'll be expanding these capabilities with more models, voice synthesis, and much more.