One thing that I thought was fairly clear in my write-up but feels a little lost in the comments: I didn't just try this with Whisper. I also tried it with their newer gpt-4o-transcribe model, which seems considerably faster. There's no way to run that one locally.
Clearly the next thing we need to test is removing all the vowels from words, or something like that :)