Video: https://withaqua.com/watch
Try it here: https://withaqua.com/sandbox
Finn is uber dyslexic and has been using dictation software since sixth grade. For over a decade, he’s been chasing a dream that never quite worked: using his voice instead of a keyboard.
Our last post (https://news.ycombinator.com/item?id=39828686) about this seemed to resonate with the community - though it turned out that version of Aqua was a better demo than product. But it gave us (and others) a lot of good ideas about what should come next.
Since then, we’ve remade Aqua from scratch for speed and usability. It now lives on your desktop, and it lets you talk into any text field -- Cursor, Gmail, Slack, even your terminal.
It starts up in under 50ms, inserts text in about a second (sometimes as fast as 450ms), and has state-of-the-art accuracy. It does a lot more, but that’s the core. We’d love your feedback — and if you’ve got ideas for what voice should do next, let’s hear them!
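For anyone new to this category of tool, the core loop an app like this automates is: capture a short burst of speech, transcribe it, and type the result into whatever text field has focus. Below is a minimal, generic sketch of that loop, not Aqua's implementation; sounddevice, soundfile, openai-whisper, and pyautogui are stand-ins for the real (much faster, streaming, cloud-backed) pipeline, and the fixed five-second clip stands in for a push-to-talk hotkey.

    # Generic sketch of the dictation loop, NOT Aqua's implementation.
    # Assumes the sounddevice, soundfile, openai-whisper, and pyautogui
    # packages; the fixed 5-second clip stands in for a push-to-talk hotkey.
    import sounddevice as sd
    import soundfile as sf
    import whisper
    import pyautogui

    SAMPLE_RATE = 16_000
    SECONDS = 5

    model = whisper.load_model("base")  # local model as a stand-in for the cloud ASR

    def dictate_once() -> None:
        # 1. Record a short clip from the default microphone.
        audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
        sd.wait()
        sf.write("clip.wav", audio, SAMPLE_RATE)

        # 2. Transcribe it.
        text = model.transcribe("clip.wav")["text"].strip()

        # 3. Type the result into whichever text field currently has focus.
        pyautogui.write(text, interval=0.01)

    dictate_once()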
MacWhisper [0] (the app I settled on) is conspicuously missing from your benchmarks [1]. How does it compare?
[0]: https://goodsnooze.gumroad.com/l/macwhisper
[1]: https://withaqua.com/blog/benchmark-nov-2024
For that benchmarking table, you can use Whisper Large V3 as a stand-in for MacWhisper and Superwhisper accuracy.
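For anyone who wants to reproduce that comparison themselves, here is a minimal sketch of scoring a transcript against a reference with word error rate. It assumes a recent openai-whisper and the jiwer package; the audio and transcript file names are placeholders.

    # Hypothetical benchmark snippet: score Whisper Large V3 (the suggested
    # stand-in for MacWhisper / Superwhisper) against a reference transcript.
    # Assumes openai-whisper and jiwer; file names are placeholders.
    import whisper
    import jiwer

    reference = open("sample_reference.txt").read()

    model = whisper.load_model("large-v3")
    hypothesis = model.transcribe("sample.wav")["text"]

    print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")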
Aqua looks good and I will be testing it, but I do like that with superwhisper nothing leaves my computer unless I add AI integrations.
Side-comment of something this made me think of (again): tech builds too much for tech. I've lived in the Bay before, so I know why this happens. When you're there, everyone around you is in tech, your girlfriend is in tech, you go to parties and everyone invariably ends up talking about work, which is tech. Your frustrations are with tech tools and so are your peers', so you're constantly thinking about tech solutions applicable to tech's problems.
This seems very much marketed to SF people doing SF things ("Cursor, Gmail, Slack, even your terminal"). I wonder how much effort has gone into making this work with code editors or the terminal, even though I doubt this would be a big use-case for this software if it ever became generally popular. I'd imagine the market here is much larger in education, journalism, film, accessibility, even government. Those are much more exciting demos.
I share the same sentiment. I remember thinking in college how annoying it was that I was reading low-resolution, marked-up, skewed, b&w scans of a book using Adobe Acrobat while CS concentrators were doing everything in VS Code (then brand new).
but we do think voice is actually great with Cursor. It’s also really useful in the terminal for certain things. Checking out or creating branches, for example.
"we also collect and process your voice inputs [..] We leverage this data for improvements and development [..] Sharing of your information [..] service providers [..] OpenAI" https://withaqua.com/privacy
No mention of privacy (or on-prem), so I assume it's 100% cloud.
Non-starter for me. Accuracy is important, but privacy is more so.
Hopefully a service with these capabilities will be available where the first step has the user complete a brief training session, sends it to the cloud to tailor the recognition parameters to their voice and mannerisms, and then loads the result locally.
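Purely as an illustration of that wished-for flow, here is a rough sketch; the endpoint, request shape, and response field are all invented, since no such API exists today.

    # Hypothetical enroll-in-the-cloud, run-locally flow. The URL and the
    # "adapted_model_url" field are invented for illustration only.
    import requests

    ENROLL_URL = "https://example.com/v1/adapt"  # hypothetical endpoint

    def enroll_and_download(audio_path: str, out_path: str) -> None:
        # 1. Upload the brief training session.
        with open(audio_path, "rb") as f:
            resp = requests.post(ENROLL_URL, files={"audio": f})
        resp.raise_for_status()
        model_url = resp.json()["adapted_model_url"]  # hypothetical response field

        # 2. Download the speaker-adapted model for local, offline use.
        model = requests.get(model_url)
        model.raise_for_status()
        with open(out_path, "wb") as out:
            out.write(model.content)

    enroll_and_download("enrollment.wav", "adapted_model.bin")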
Given that we need the cloud, we offer zero data retention -- you can see this in the app. Your concern is as much about UX and communications as it is about privacy.
And self-hosting real-time streaming LLMs will probably also come out at around 50 cents per hour. Arguing for a $120/month price for power users is probably going to be very difficult, especially if there are free open-source alternatives.
This is where I bounce (out of this discussion).
I’d say local is necessary for a delightful product experience, and the added bonus is that it ticks the privacy box.
I do wish there was a mobile app though (or maybe an iOS keyboard). It would also be nice to be able to have a separate hotkey you can set up to send the output to a specific app (instead of just the active one).
Things I've learned are:
1. It works better if you're connected by Ethernet than by Wi-Fi.
2. It needs to have a longer recognition history because sometimes you hit the wrong key to end a recognition session, and it loses everything.
3. Besides the longer history, a debugging mode that records all the characters sent to the dictation box would be useful. Sometimes, I see one set of words, blink, and then it's replaced with a new recognition result. Capturing that would be useful for describing what went wrong.
4. There should be a way to tell us when a new version is running. Occasionally, I've run into problems where I'm getting errors, and I can't tell if it's my speaking, my audio chain, my computer, the network, or the app.
5. Grammarly is a great add-on because it helps me correct mis-speakings and odd little errors, like too many spaces caused by starting and stopping recognition.
When Dragon Systems went through bankruptcy court, a public benefits corporation bid for the core technology because it recognized that Dragon was a critical tool for people with disabilities to function in a digital world.
In my opinion, Aqua has reached a similar status as an essential tool, though it doesn't yet fully replace Dragon for those who need command and control. The recognition accuracy and smoothness are so amazing that I can't envision returning to Dragon without much pain. The only thing worse would be going back to a keyboard.
Aqua Guys, don't fuck it up.