I really should package it up so people can try it. The one problem that makes it feel a little unnatural is that determining when the user is done talking is tough. What's needed is a speech conversation turn-taking dataset and model; that's missing from off-the-shelf speech recognition systems, but it should be trivial for a company like OpenAI to build. That's what I'd work on right now if I were there, because truly natural voice conversations are going to unlock a whole new set of users and use cases for these models.
I don't want to edit a "file"; I want to edit these two functions that exist in some module(s). Why can't I just see those two?
I constantly jump between many different languages and the cognitive load is noticeable: was it "!=", "/=" or "~="? Why am I writing/viewing ASCII art when I'm coding?

If I am most comfortable/fluent viewing Python, why can't I view a JavaScript source as Python?
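As a minimal sketch of that projection idea: the same comparison node could be rendered with whatever operator spelling the reader prefers. The dialect table below is illustrative, not a real cross-language transpiler, and only handles a single "not equal" node.

```python
# Render the same AST node under different surface "dialects" (sketch only).
import ast

NOT_EQUAL = {"python": "!=", "haskell": "/=", "lua": "~="}  # illustrative table

def render_not_equal(node: ast.Compare, dialect: str) -> str:
    """Render a simple `a != b` comparison using the chosen dialect's operator."""
    assert isinstance(node.ops[0], ast.NotEq)
    left = ast.unparse(node.left)
    right = ast.unparse(node.comparators[0])
    return f"{left} {NOT_EQUAL[dialect]} {right}"

expr = ast.parse("x != y", mode="eval").body  # a single Compare node
for dialect in ("python", "haskell", "lua"):
    print(dialect, "=>", render_not_equal(expr, dialect))
```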
I think the remaining challenge is: what sort of projections are the best/most useful, and how do we manipulate them? I have played around with these ideas and made an AST viewer for the browser where you could configure exactly how a node was represented (using CSS) and navigation was done in block mode (node traversal), but I found it really hard to build an editing experience that felt smooth.
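One simple projection along the lines of "just show me these two functions" can be sketched with Python's ast module: parse the module, keep only the requested top-level function definitions, and unparse them back to source. The sample module and names here are made up for illustration; a real editor would also have to map edits back into the original file.

```python
# A function-level "view" of a module: show only the requested functions (sketch).
import ast

SOURCE = """
def helper(x):
    return x + 1

def main():
    print(helper(41))

def unrelated():
    pass
"""

def view_functions(src: str, names: set[str]) -> str:
    """Return the source text of just the named top-level functions."""
    tree = ast.parse(src)
    wanted = [node for node in tree.body
              if isinstance(node, ast.FunctionDef) and node.name in names]
    return "\n\n".join(ast.unparse(node) for node in wanted)

print(view_functions(SOURCE, {"helper", "main"}))
```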