Finn, who is big-time dyslexic, has been using dictation software since the sixth grade, when his dad set him up on Dragon Dictation. He used it through school to write papers and has been keeping his own transcription benchmarks since college. In all that time, writing with your voice has remained a cumbersome, brittle experience riddled with pain points.
Dictation software is still terrible. All the solutions basically compete on accuracy (i.e., speech recognition), but none of them deal with the fundamentally brittle nature of the text they generate. They don't try to format text correctly, and they require you to learn a bunch of specialized commands that often aren't worth the effort. They're not even close to being a voice replacement for the keyboard.
Even post-LLM, you are limited to a set of specific commands, and the most accurate models don't support any commands at all. Outside of these rules, the models have no sense of what is an instruction and what is content. You can't say "and format this like an email" or "make the last bullet point shorter". Aqua solves this.
This problem is important to Finn and millions of other people who would write with their voice if they could. Initially, we didn't think of it as a startup project. It was just something we wanted for ourselves. We thought maybe we'd write a novel with it - or something. After friends started asking to use the early versions of Aqua, it occurred to us that, if we didn't build it, maybe nobody would.
Aqua Voice is a text editor that you talk to like a person. Depending on the way that you say it and the context in which you're operating, Aqua decides whether to transcribe what you said verbatim, execute a command, or subtly modify what you said into what you meant to write.
For example, if you were to dictate: "Gryphons have classic forms resembling shield volcanoes," Aqua would output your text verbatim. But if you stumble over your words or start a sentence over a few times, Aqua is smart enough to figure that out and to only take the last version of the sentence.
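To make that concrete, here's a minimal sketch of the restart-repair idea in Python. The heuristic below (a segment that repeats the opening words of the previous segment is treated as a restart) is purely illustrative and is not how Aqua actually detects restarts:

    def keep_last_restart(segments):
        # Collapse consecutive attempts at the same sentence, keeping the last.
        kept = []
        for seg in segments:
            opening = seg.lower().split()[:2]
            if kept and kept[-1].lower().split()[:2] == opening:
                kept[-1] = seg        # looks like a restart: replace the earlier try
            else:
                kept.append(seg)
        return " ".join(kept)

    print(keep_last_restart([
        "Gryphons have classic",
        "Gryphons have classic forms resembling shield volcanoes.",
    ]))
    # -> Gryphons have classic forms resembling shield volcanoes.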
The vision is not only to provide a more natural dictation experience, but to enable, for the first time, an AI writing experience that feels natural and collaborative. This requires moving away from using LLMs for one-off chat requests and toward something more like streaming, where you are in constant contact with the model. Voice is the natural medium for this.
Aqua is actually six models working together to transcribe, interpret, and rewrite the document according to your intent. Technically, running a real-time voice application with a language model at its core requires complex coordination between multiple pieces. We use mixture-of-experts (MoE) transcription to outperform what was previously thought possible in real-time accuracy. Then we sync up with a language model to determine, as quickly as possible, what should be on the screen.
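For the curious, here's a toy Python sketch of the kind of coordination this implies: a transcriber streams segments into a queue while an interpreter (standing in for the language model) decides, segment by segment, what lands on screen. All names are hypothetical; this is not Aqua's actual architecture:

    import asyncio

    async def transcribe(chunks, out):
        # Pretend each chunk has already been decoded into text.
        for chunk in chunks:
            await asyncio.sleep(0.05)   # simulated recognizer latency
            await out.put(chunk)
        await out.put(None)             # end-of-stream sentinel

    async def interpret(inp, doc):
        while (segment := await inp.get()) is not None:
            # A real system would ask an LLM whether this is content or a
            # command; a keyword check stands in for that decision here.
            if segment.startswith(("make", "format")):
                doc.append(f"[command: {segment}]")
            else:
                doc.append(segment)     # verbatim dictation

    async def main():
        q, doc = asyncio.Queue(), []
        chunks = ["Gryphons have classic forms.", "make the last sentence shorter"]
        await asyncio.gather(transcribe(chunks, q), interpret(q, doc))
        print(doc)

    asyncio.run(main())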
The model isn't perfect, but it is ready for early adopters and we’ve already been getting feedback from grateful users. For example, a historian with carpal tunnel sent us an email he wrote using Aqua and said that he is now able to be five times as productive as he was previously. We've heard from other people with disabilities that prevent them from typing. We've also seen good adoption from people who are dyslexic or simply prefer talking to typing. It’s being used for everything from emails to brainstorming to papers to legal briefings.
While there is much left to do in terms of latency and robustness, the best experiences with Aqua are beginning to feel magical. We would love for you to try it out and give us feedback, which you can do with no account on https://withaqua.com. If you find it useful, it’s $10/month after a 1000-token free trial. (We want to bump the free trial in the future, but we're a small team, and running this thing isn’t cheap.)
We'd love to hear your ideas and comments, ideally written with voice-to-text!
- As others have said, "1000 tokens" doesn't mean anything to non-technical users and barely means anything to me. Just tell me how many words I can dictate!
- That serif-font LaTeX error-rate table is also way too boring. People want something flashy: "Up to 7x fewer errors than macOS dictation" is cool; a comparison table is not.
- Similarly, ".05 Word Error Rate" has to go. Spell out what that means and use percentages.
- "Forgot a name, word, fact, or number? Just ask Aqua to fill it in for you." It would be nice to be able to turn this off, or at least have a clear indication when content that I did not say is inserted into my document. If I'm dictating, I don't usually want anything but the words I say on the page.
Respectfully disagree on this one: as a startup, you can't effectively compete with the likes of Apple on flashiness. However, the target market of people who dictate large amounts of text will include a significant number of people in academia. For them, Aqua Voice will feel relevant. Those who aren't interested in comparison tables will simply skip over them :)
I'm a heavy voice dictation user, and I would switch to this in a heartbeat. I'll tell you why this is so impressive: it means you can make mistakes and correct them with your voice, which takes away the overhead of preparing a sentence in your mind before saying it, one of the hardest things about voice dictation.
My shoulder is often in pain, and I have to reach for my mouse to change a word; I wouldn't have to if I used this. This software would literally spare me pain.
However, I cannot use it without a privacy policy. I have to know where the recording of my voice is being saved, whether it's being saved at all, and what it's going to be used for.
I would pay extra for my voice to be entirely deleted and never used afterwards; that could even be an upsell in your packages: an extra $5 to never save your voice or data.
I love it, but I can't use it for most things without a privacy policy.
I'm sure you get this feedback 100 times a day, but I'd gladly pay a substantial amount to use this in place of the system dictation on my Mac and iPhone. Right now, the main limitation to me using it constantly would be the endless need to copy and paste from this separate new document editor into my email app or into Notion or Google Docs, etc.
- Consider a time-based free trial. As others have said, tokens are confusing, but also your paid plan is unlimited, so a chunk of tokens doesn't let me see what actually using your product would be like. I'm more than halfway through my tokens after writing an HN comment and a brief todo list for work, so I've been able to see what it'd be like to pay the $10 for about five minutes' worth of work, which feels like a very short trial. A week, say, seems fair? And then you have some kind of cap on tokens that only comes up if someone uses an abusively huge amount (an issue, I'm sure, you'd face with paying customers too, right?)
- I had a bit of trouble with making a todo list—I kept wanting the system to do a "new line" or "next item" and show me a new line with a dash so I know I'm dictating to the right place, but I couldn't coax it into doing that for me. I had to sort of just start on the next item and then use my keyboard to push return. When making lists, it's good to be able to do so fluidly and intentionally as much as possible. Sometimes it did figure out, impressively, that long pauses meant I wanted a new line. But not always.
But I do think that the reliability needs to take a few more steps before it becomes a true keyboard replacement.
You've moved us all a lot closer to my dream: taking a long walk outside, AirPods in, and handling the day's email without even looking at a screen once.
One suggestion I have here is to have at least two different sections of the UI: one part would be the actual document and the other would be a scratchpad. It seems like much of what you say would not actually make it into the document (edits, corrections, etc.), so those would only be shown in the scratchpad. Once the editor has processed the text from the scratchpad, it can go into the document the way it's supposed to. Having text immediately show up in the document as it's dictated is weird. (A rough sketch of what I mean is below.)
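For what it's worth, a tiny toy version of the two-pane idea, with my own made-up names (obviously not Aqua's code):

    class Editor:
        def __init__(self):
            self.document = []    # finalized text the user sees as "the doc"
            self.scratchpad = []  # raw, not-yet-interpreted dictation

        def hear(self, segment):
            self.scratchpad.append(segment)

        def resolve(self, interpret):
            # interpret() turns the buffered speech into final text (or "").
            final = interpret(" ".join(self.scratchpad))
            if final:
                self.document.append(final)
            self.scratchpad.clear()

    ed = Editor()
    ed.hear("Dear Sam,")
    ed.hear("uh, make that Dear Dr. Smith,")
    ed.resolve(lambda raw: "Dear Dr. Smith,")  # stand-in for the LLM pass
    print(ed.document)                         # -> ['Dear Dr. Smith,']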
Your big challenge right now is just that STT is still relatively slow for this use case. Time will be on your side in that regard, as I'm sure you know.
Good luck! Voice is the future of a lot of the interactions we have with computers.
I actually made a demo just like Aqua Voice internally (unfortunately it didn't get prioritized), and there is really no lag. However, it will always be the case that the model wants to "revisit" transcribed words based on what comes next. So if you want the best accuracy, you do want to wait a second or two for the transcription to settle down a bit.
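To illustrate what "settling" looks like, here's a toy version of committing only the prefix that has stayed stable across the last few partial hypotheses (my own sketch, not code from any real recognizer):

    def stable_prefix(hypotheses, window=3):
        # Longest word prefix shared by the last `window` partial hypotheses.
        recent = [h.split() for h in hypotheses[-window:]]
        prefix = []
        for words in zip(*recent):
            if len(set(words)) == 1:
                prefix.append(words[0])
            else:
                break
        return " ".join(prefix)

    partials = [
        "recognize speech",
        "wreck a nice speech",
        "wreck a nice beach",
        "wreck a nice beach today",
        "wreck a nice beach today",
    ]
    print(stable_prefix(partials))  # -> "wreck a nice beach"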
I kid. Your comment made me think of a shower thought I had recently where I wished my audiobook had subtitles.
The next most important thing would be the ability to write action routines for grammars. My preference is for Python because it's the easiest target when using ChatGPT to write code. However, I could probably learn to live with other languages (except JavaScript, which I hate). I refer you to Joel Gould's "natPython" package, which he wrote for NaturallySpeaking. Here's the original presentation that people built on: https://slideplayer.com/slide/5924729/
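For those who never used NaturallySpeaking macros, an "action routine" is roughly a callback bound to a grammar rule. A hypothetical mini-version in Python, mimicking the spirit of natPython rather than its actual API:

    GRAMMAR = {
        "new paragraph": lambda doc: doc.append("\n\n"),
        "scratch that":  lambda doc: doc.pop() if doc else None,
    }

    def handle_utterance(text, doc):
        action = GRAMMAR.get(text.strip().lower())
        if action:
            action(doc)       # recognized command: run its action routine
        else:
            doc.append(text)  # otherwise treat it as dictation

    doc = []
    for utterance in ["Hello world", "scratch that", "Take two"]:
        handle_utterance(utterance, doc)
    print(doc)  # -> ['Take two']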
Here's a lesson from the past. In the early days of DragonDictate/NaturallySpeaking, when the Bakers ran Dragon Systems, they regularly had employees drop into the local speech recognition user group meetings and talk to us about what worked for us and what failed. They knew that watching us Crips would give them more information about how to build a good speech recognition environment than almost any other user community. We found the corner cases before anybody else. They did some nice things, such as supporting a couple of speech recognition user group conferences with space and employee time.
It seems like Nuance has forgotten those lessons.
Anyway, I was planning on getting work done today, but your announcement shoots that in the head. :-)
[edit] Freaking impressive. It is clear that I should spend more time on this. I can see how my experience of NaturallySpeaking limited my view, and you have a much wider view of what the user interface could be.
For those who don't know what happened next, and why Dragon seemed to stagnate so much in the aughts, the story of how Goldman Sachs helped them sell to what was essentially the Belgian Enron, months before it collapsed, was quite illuminating to me, and sad.
https://archive.ph/Zck6i
> Professor Gompers opined that at the time the acquisition closed, Dragon was a troubled company that was losing money and had regularly missed its own financial projections. It was highly uncertain whether Dragon could survive as a stand-alone entity. Professor Gompers also showed that technology stocks were on a downward trend, and L&H was the only buyer willing to pay the steep price Dragon demanded. Thus, he concluded that if the company had not accepted the L&H deal, Dragon likely would have declared bankruptcy. The jury found in favor of the defendants and awarded no damages to the plaintiffs.
It just so happens that many of the interfaces one has to deal with are somewhat low bandwidth. (For example, many people spend most of their time stepping over, stepping into, or setting breakpoints in a debugger.) Code completion greatly cuts down the number of options to be navigated second to second. It seems like the time has arrived for an interactive, voice-operated AI pair-programmer agent, where the human takes the "strategic" role.
We want to get Aqua into as many places as possible, and we'll go full tilt on that as soon as the core is extremely, extremely solid (this is our focus right now).
Great lessons from Dragon Dictation. Would love to learn more about the speech recognition user group meetings! Are those still running? Are you a part of any?
FWIW, I am fleeing Fusebase, formerly known as Nimbus, because they "pivoted" and messed up my notetaking environment. In the beginning, I went with Nimbus because it was the only notetaking environment that worked with Dragon. After the pivot, not so much. I'm giving Joplin a try. Aqua might work well as an extension to Joplin, especially if there were a WYSIMWYG (what you see is mostly what you get) front-end like Rich Markdown. I'd also look at Heynote.
My use case is dictating text into various applications and correcting that text within the text area. If I have to, I can use the dictation box and then paste it into the target application.
When you talk about using speech recognition for creating code, I've been through enough brute-force solutions like Talon to know they are the wrong way because they always focus the user on the wrong thing. When creating code, you should be thinking about the data structure and the environment in which it operates. When you use speech-driven programming systems, you focus on what you have to say to get the syntax you need to make it compile correctly. As a result, you lose your connection to the problem you're trying to solve.
Whether you like it or not, ChatGPT is currently the best solution as long as you never edit the code directly.
I would really happily pay $10/month for this, but what I really want is either:
- A Raycast plugin or desktop app that lets this interact with any editable text area in my environment
- An API that I can pass existing text/context plus an audio stream to and get back a heartbeat of full document updates (roughly sketched below). Then the community can build Obsidian/VSCode/browser plugins for the huge surface area of text entry
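Here's roughly what I'm imagining for that API, sketched in Python. The endpoint, message shapes, and field names are all invented, and mic_chunks() is a placeholder for your audio source; Aqua hasn't published anything like this:

    import asyncio, base64, json
    import websockets  # pip install websockets

    async def dictate(audio_chunks, context_text):
        async with websockets.connect("wss://example.invalid/v1/stream") as ws:
            await ws.send(json.dumps({"type": "context", "text": context_text}))
            for chunk in audio_chunks:
                await ws.send(json.dumps({
                    "type": "audio",
                    "data": base64.b64encode(chunk).decode(),
                }))
                # The server replies with a "heartbeat" of full-document
                # states, so the client can replace its buffer wholesale.
                update = json.loads(await ws.recv())
                if update.get("type") == "document":
                    print(update["text"])

    # asyncio.run(dictate(mic_chunks(), open("draft.txt").read()))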
Going to give you $10 later this afternoon regardless, and congrats!
That would let me quickly build an interface for editing basically any application state, which would be awesome!
>Certainly - let me grok your text!!... OK - I am ready!
BLAH BLAH BLAH...
etc
Have you explored this market segment?
It's less about how quickly all that transpires and more about presenting the product in a way that doesn't require a lot of talking around it. Well done.
My first thought, when reading the headline, was that this could be useful for my coworker who got RSI in both hands and codes using special voice commands into a mic. But after watching the demo, I think it can be much more than such a niche product.