This is a guide for the casual observer who wants to try things out, given that getting started with other AI platforms is so much more straightforward. It's all open source, with transparent hosting, catering to any remaining concerns someone interested in exactly that may have.
This is blowing my mind. gemini-1.5-flash accidentally knows how to transcribe amazingly well but it is -very- hard to figure out how to use it well and now Amazon comes out with a gemini flash like model and it explicitly ignores audio. It is so clear that multi-modal audio would be easy for these models but it is like they are purposefully holding back releasing it/supporting it. This has to be a strategic decision to not attach audio. Probably because the margins on ASR are too high to strip with a cheap LLM. I can only hope Meta will drop a mult-modal audio model to force this soon.
Edit: SteamOS != Arch Linux[0]
[0] https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...