Do you have any sense of whether this type of model would work with children's speech? There are plenty of educational applications that would value a privacy-first, locally deployed model. But my understanding is that Whisper performs pretty poorly with younger speakers.
One thing humans seem to be uniquely good at is picking goalposts that separate us from other species.