Inference in Python uses harmony [1] (for the request and response format), which is written in Rust with Python bindings. Another OpenAI Rust library is tiktoken [2], used for all tokenization and detokenization. OpenAI Codex [3] is also written in Rust. It looks like OpenAI is increasingly adopting Rust (at least on the inference side).
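For illustration, here's a minimal sketch of what the Python side of tiktoken looks like; the encoding name "o200k_base" is just an example and may not match the model in question:

```python
# Minimal sketch: tiktoken's Rust core is exposed through Python bindings.
# "o200k_base" is an example encoding name; pick whichever your model actually uses.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

tokens = enc.encode("Hello from the Rust-backed tokenizer!")  # list[int] token ids
text = enc.decode(tokens)                                     # round-trip back to str

print(tokens)
print(text)
```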
I see that you're using gemma3n, which is a 4B parameter model and uses around 3 GB of RAM. How do you handle loading/offloading the model from RAM? Or does it stay in memory as long as the app is running?