Finn, who is big-time dyslexic, has been using dictation software since the sixth grade, when his dad set him up on Dragon Dictation. He used it through school to write papers and has been keeping his own transcription benchmarks since college. In all that time, writing with your voice has remained a cumbersome, brittle experience riddled with pain points.
Dictation software is still terrible. All the solutions basically compete on accuracy (i.e., speech recognition), but none of them deal with the fundamentally brittle nature of the text they generate. They don't try to format text correctly, and they require you to learn a bunch of specialized commands that often aren't worth the effort. They're not even close to being a voice replacement for the keyboard.
Even post-LLM, you are limited to a set of specific commands, and the most accurate models don't support any commands at all. Outside of these rules, the models have no sense of what is an instruction and what is content. You can't say "and format this like an email" or "make the last bullet point shorter". Aqua solves this.
This problem is important to Finn and millions of other people who would write with their voice if they could. Initially, we didn't think of it as a startup project. It was just something we wanted for ourselves. We thought maybe we'd write a novel with it - or something. After friends started asking to use the early versions of Aqua, it occurred to us that, if we didn't build it, maybe nobody would.
Aqua Voice is a text editor that you talk to like a person. Depending on the way that you say it and the context in which you're operating, Aqua decides whether to transcribe what you said verbatim, execute a command, or subtly modify what you said into what you meant to write.
For example, if you were to dictate: "Gryphons have classic forms resembling shield volcanoes," Aqua would output your text verbatim. But if you stumble over your words or start a sentence over a few times, Aqua is smart enough to figure that out and to only take the last version of the sentence.
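To make that concrete, here's a minimal sketch of the restart-repair idea in Python. The heuristic below (a segment that repeats the opening words of the previous segment is treated as a restart) is purely illustrative and is not how Aqua actually detects restarts:

    def keep_last_restart(segments):
        # Collapse consecutive attempts at the same sentence, keeping the last.
        kept = []
        for seg in segments:
            opening = seg.lower().split()[:2]
            if kept and kept[-1].lower().split()[:2] == opening:
                kept[-1] = seg        # looks like a restart: replace the earlier try
            else:
                kept.append(seg)
        return " ".join(kept)

    print(keep_last_restart([
        "Gryphons have classic",
        "Gryphons have classic forms resembling shield volcanoes.",
    ]))
    # -> Gryphons have classic forms resembling shield volcanoes.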
The vision is not only to provide a more natural dictation experience, but to enable, for the first time, an AI writing experience that feels natural and collaborative. This requires moving away from using LLMs for one-off chat requests and toward something more like streaming, where you are in constant contact with the model. Voice is the natural medium for this.
Aqua is actually six models working together to transcribe, interpret, and rewrite the document according to your intent. Technically, running a real-time voice application with a language model at its core requires complex coordination between multiple pieces. We use mixture-of-experts (MoE) transcription to outperform what was previously thought possible in real-time accuracy. Then we sync up with a language model to determine, as quickly as possible, what should be on the screen.
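For the curious, here's a toy Python sketch of the kind of coordination this implies: a transcriber streams segments into a queue while an interpreter (standing in for the language model) decides, segment by segment, what lands on screen. All names are hypothetical; this is not Aqua's actual architecture:

    import asyncio

    async def transcribe(chunks, out):
        # Pretend each chunk has already been decoded into text.
        for chunk in chunks:
            await asyncio.sleep(0.05)   # simulated recognizer latency
            await out.put(chunk)
        await out.put(None)             # end-of-stream sentinel

    async def interpret(inp, doc):
        while (segment := await inp.get()) is not None:
            # A real system would ask an LLM whether this is content or a
            # command; a keyword check stands in for that decision here.
            if segment.startswith(("make", "format")):
                doc.append(f"[command: {segment}]")
            else:
                doc.append(segment)     # verbatim dictation

    async def main():
        q, doc = asyncio.Queue(), []
        chunks = ["Gryphons have classic forms.", "make the last sentence shorter"]
        await asyncio.gather(transcribe(chunks, q), interpret(q, doc))
        print(doc)

    asyncio.run(main())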
The model isn't perfect, but it is ready for early adopters and we’ve already been getting feedback from grateful users. For example, a historian with carpal tunnel sent us an email he wrote using Aqua and said that he is now able to be five times as productive as he was previously. We've heard from other people with disabilities that prevent them from typing. We've also seen good adoption from people who are dyslexic or simply prefer talking to typing. It’s being used for everything from emails to brainstorming to papers to legal briefings.
While there is much left to do in terms of latency and robustness, the best experiences with Aqua are beginning to feel magical. We would love for you to try it out and give us feedback, which you can do with no account on https://withaqua.com. If you find it useful, it’s $10/month after a 1000-token free trial. (We want to bump the free trial in the future, but we're a small team, and running this thing isn’t cheap.)
We'd love to hear your ideas and comments, ideally written with voice-to-text!
- As others have said, "1000 tokens" doesn't mean anything to non-technical users and barely means anything to me. Just tell me how many words I can dictate!
- That serif-font LaTeX error-rate table is also way too boring. People want something flashy: "Up to 7x fewer errors than macOS dictation" is cool; a comparison table is not.
- Similarly, ".05 Word Error Rate" has to go. Spell out what that means and use percentages.
- "Forgot a name, word, fact, or number? Just ask Aqua to fill it in for you." It would be nice to be able to turn this off, or at least have a clear indication when content that I did not say is inserted into my document. If I'm dictating, I don't usually want anything but the words I say on the page.
Respectfully disagree on this one: as a startup, you can't effectively compete with the likes of Apple on flashiness. However, the target market of people who dictate large amounts of text will include a significant number of people in academia. For them, Aqua Voice will feel relevant. Those who aren't interested in comparison tables will simply skip over them :)
I'm a heavy voice dictation user, and I would switch to this in a heartbeat. I'll tell you why this is so impressive: it means you can make mistakes and correct them with your voice, which takes away the overhead of preparing a sentence in your mind before saying it, one of the hardest things about voice dictation.
My shoulder is often in pain, and I have to reach for my mouse to change a word; I wouldn't have to if I used this. This software would literally spare me pain.
However, I cannot use it without a privacy policy. I have to know where the recording of my voice is being saved, whether it's being saved at all, and what it's going to be used for.
I would pay extra for my voice to be entirely deleted and never used afterwards; that could even be an upsell in your packages: an extra $5 to never save your voice or data.
I love it, but I can't use it for most things without a privacy policy.
I'm sure you get this feedback 100 times a day, but I'd gladly pay a substantial amount to use this in place of the system dictation on my Mac and iPhone. Right now, the main limitation to me using it constantly would be the endless need to copy and paste from this separate new document editor into my email app or into Notion or Google Docs, etc.
- Consider a time-based free trial. As others have said, tokens are confusing, but also your paid plan is unlimited, so a chunk of tokens doesn't let me see what actually using your product would be like. I'm more than halfway through my tokens after writing an HN comment and a brief todo list for work, so I've been able to see what it'd be like to pay the $10 for about five minutes' worth of work, which feels like a very short trial. A week, say, seems fair? And then you have some kind of cap on tokens that only comes up if someone uses an abusively huge amount (an issue, I'm sure, you'd face with paying customers too, right?)
- I had a bit of trouble with making a todo list—I kept wanting the system to do a "new line" or "next item" and show me a new line with a dash so I know I'm dictating to the right place, but I couldn't coax it into doing that for me. I had to sort of just start on the next item and then use my keyboard to push return. When making lists, it's good to be able to do so fluidly and intentionally as much as possible. Sometimes it did figure out, impressively, that long pauses meant I wanted a new line. But not always.
But I do think that the reliability needs to take a few more steps before it becomes a true keyboard replacement.
You've moved us all a lot closer to my dream: taking a long walk outside, AirPods in, and handling the day's email without even looking at a screen once.
One suggestion I have here is to have at least two different sections of the UI: one part would be the actual document and the other would be a scratchpad. It seems like much of what you say would not actually make it into the document (edits, corrections, etc.), so those would only be shown in the scratchpad. Once the editor has processed the text from the scratchpad, it can go into the document the way it's supposed to. Having text immediately show up in the document as it's dictated is weird. (A rough sketch of what I mean is below.)
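For what it's worth, a tiny toy version of the two-pane idea, with my own made-up names (obviously not Aqua's code):

    class Editor:
        def __init__(self):
            self.document = []    # finalized text the user sees as "the doc"
            self.scratchpad = []  # raw, not-yet-interpreted dictation

        def hear(self, segment):
            self.scratchpad.append(segment)

        def resolve(self, interpret):
            # interpret() turns the buffered speech into final text (or "").
            final = interpret(" ".join(self.scratchpad))
            if final:
                self.document.append(final)
            self.scratchpad.clear()

    ed = Editor()
    ed.hear("Dear Sam,")
    ed.hear("uh, make that Dear Dr. Smith,")
    ed.resolve(lambda raw: "Dear Dr. Smith,")  # stand-in for the LLM pass
    print(ed.document)                         # -> ['Dear Dr. Smith,']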
Your big challenge right now is just that STT is still relatively slow for this use case. Time will be on your side in that regard, as I'm sure you know.
Good luck! Voice is the future of a lot of the interactions we have with computers.
I actually made a demo just like Aqua Voice internally (unfortunately it didn't get prioritized), and there is really no lag. However, it will always be the case that the model wants to "revisit" transcribed words based on what comes next. So if you want the best accuracy, you do want to wait a second or two for the transcription to settle down a bit.
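To illustrate what "settling" looks like, here's a toy version of committing only the prefix that has stayed stable across the last few partial hypotheses (my own sketch, not code from any real recognizer):

    def stable_prefix(hypotheses, window=3):
        # Longest word prefix shared by the last `window` partial hypotheses.
        recent = [h.split() for h in hypotheses[-window:]]
        prefix = []
        for words in zip(*recent):
            if len(set(words)) == 1:
                prefix.append(words[0])
            else:
                break
        return " ".join(prefix)

    partials = [
        "recognize speech",
        "wreck a nice speech",
        "wreck a nice beach",
        "wreck a nice beach today",
        "wreck a nice beach today",
    ]
    print(stable_prefix(partials))  # -> "wreck a nice beach"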
I kid. Your comment made me think of a shower thought I had recently where I wished my audiobook had subtitles.
The next most important thing would be the ability to write action routines for grammars. My preference is for Python because it's the easiest target when using ChatGPT to write code. However, I could probably learn to live with other languages (except JavaScript, which I hate). I refer you to Joel Gould's "natPython" package, which he wrote for NaturallySpeaking. Here's the original presentation that people built on: https://slideplayer.com/slide/5924729/
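For those who never used NaturallySpeaking macros, an "action routine" is roughly a callback bound to a grammar rule. A hypothetical mini-version in Python, mimicking the spirit of natPython rather than its actual API:

    GRAMMAR = {
        "new paragraph": lambda doc: doc.append("\n\n"),
        "scratch that":  lambda doc: doc.pop() if doc else None,
    }

    def handle_utterance(text, doc):
        action = GRAMMAR.get(text.strip().lower())
        if action:
            action(doc)       # recognized command: run its action routine
        else:
            doc.append(text)  # otherwise treat it as dictation

    doc = []
    for utterance in ["Hello world", "scratch that", "Take two"]:
        handle_utterance(utterance, doc)
    print(doc)  # -> ['Take two']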
Here's a lesson from the past. In the early days of DragonDictate/NaturallySpeaking, when the Bakers ran Dragon Systems, they regularly had employees drop into the local speech recognition user group meetings and talk to us about what worked for us and what failed. They knew that watching us Crips would give them more information about how to build a good speech recognition environment than almost any other user community. We found the corner cases before anybody else. They did some nice things, such as supporting a couple of speech recognition user group conferences with space and employee time.
It seems like Nuance has forgotten those lessons.
Anyway, I was planning on getting work done today, but your announcement shoots that in the head. :-)
[edit] Freaking impressive. It is clear that I should spend more time on this. I can see how my experience of NaturallySpeaking limited my view, and you have a much wider view of what the user interface could be.
For those who don't know what happened next, and why Dragon seemed to stagnate so much in the aughts, the story of how Goldman Sachs helped them sell to what was essentially the Belgian Enron, months before it collapsed, was quite illuminating to me, and sad.
https://archive.ph/Zck6i
> Professor Gompers opined that at the time the acquisition closed, Dragon was a troubled company that was losing money and had regularly missed its own financial projections. It was highly uncertain whether Dragon could survive as a stand-alone entity. Professor Gompers also showed that technology stocks were on a downward trend, and L&H was the only buyer willing to pay the steep price Dragon demanded. Thus, he concluded that if the company had not accepted the L&H deal, Dragon likely would have declared bankruptcy. The jury found in favor of the defendants and awarded no damages to the plaintiffs.
It just so happens that many of the interfaces one has to deal with are somewhat low bandwidth. (For example, many people spend most of their time stepping over, stepping into, or setting breakpoints in a debugger.) Code completion greatly cuts down the number of options to be navigated second to second. It seems like the time has arrived for an interactive, voice-operated AI pair-programmer agent, where the human takes the "strategic" role.
We want to get Aqua into as many places as possible, and we'll go full tilt on that as soon as the core is extremely, extremely solid (this is our focus right now).
Great lessons from Dragon Dictation. Would love to learn more about the speech recognition user group meetings! Are those still running? Are you a part of any?
FWIW, I am fleeing Fusebase, formerly known as Nimbus, because they "pivoted" and messed up my notetaking environment. In the beginning, I went with Nimbus because it was the only notetaking environment that worked with Dragon. After the pivot, not so much. I'm giving Joplin a try. Aqua might work well as an extension to Joplin, especially if there were a WYSIMWYG (what you see is mostly what you get) front-end like Rich Markdown. I'd also look at Heynote.
My use case is dictating text into various applications and correcting that text within the text area. If I have to, I can use the dictation box and then paste it into the target application.
When you talk about using speech recognition for creating code, I've been through enough brute-force solutions like Talon to know they are the wrong way because they always focus the user on the wrong thing. When creating code, you should be thinking about the data structure and the environment in which it operates. When you use speech-driven programming systems, you focus on what you have to say to get the syntax you need to make it compile correctly. As a result, you lose your connection to the problem you're trying to solve.
Whether you like it or not, ChatGPT is currently the best solution as long as you never edit the code directly.
I would really happily pay $10/month for this, but what I really want is either:
- A Raycast plugin or desktop app that lets this interact with any editable text area in my environment
- An API that I can pass existing text/context plus an audio stream to and get back a heartbeat of full document updates (roughly sketched below). Then the community can build Obsidian/VSCode/browser plugins for the huge surface area of text entry
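Here's roughly what I'm imagining for that API, sketched in Python. The endpoint, message shapes, and field names are all invented, and mic_chunks() is a placeholder for your audio source; Aqua hasn't published anything like this:

    import asyncio, base64, json
    import websockets  # pip install websockets

    async def dictate(audio_chunks, context_text):
        async with websockets.connect("wss://example.invalid/v1/stream") as ws:
            await ws.send(json.dumps({"type": "context", "text": context_text}))
            for chunk in audio_chunks:
                await ws.send(json.dumps({
                    "type": "audio",
                    "data": base64.b64encode(chunk).decode(),
                }))
                # The server replies with a "heartbeat" of full-document
                # states, so the client can replace its buffer wholesale.
                update = json.loads(await ws.recv())
                if update.get("type") == "document":
                    print(update["text"])

    # asyncio.run(dictate(mic_chunks(), open("draft.txt").read()))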
Going to give you $10 later this afternoon regardless, and congrats!
That would let me quickly build an interface for editing basically any application state, which would be awesome!
>Certainly - let me grok your text!!... OK - I am ready!
BLAH BLAH BLAH...
etc
Have you explored this market segment?
It's less about how quickly all that transpires and more about presenting the product in a way that doesn't require a lot of talking around it. Well done.
My first thought, when reading the headline, was that this could be useful for my coworker who got RSI in both hands and codes using special voice commands into a mic. But after watching the demo, I think it can be much more than such a niche product.