Had to read the privacy policy to see they use Google.
>We share your audio recording with Google Cloud’s speech-to-text service to assist us in processing and carrying out your commands. Audio recordings are shared without personally identifiable metadata, and we’ve instructed Google’s service not to retain the audio or transcript associated with a command after it processes the command
I built something like this 7 years ago, it's called Hands Free for Chrome, a now languishing project that I lost interest in a long time ago, unfortunately. It made the top 10 of HN back then though! My site's design is not nearly as nice as yours.
I just didn't get enough users or support to really care about it. But I wish you the best. It was an exciting thing to build and using it always felt futuristic to me.
This is just so fascinating though. It's like seeing what could have been if I had been a better developer and found the dedication to really stick with the project in the long term.
Edit: I see we had the exact same idea! Your "tag" is my "map." Love it. One big difference is that mine was just a free project. I'd be super interested to know how many users you've got. I never had more than ~1100. From looking through your website, mine was a much less intensive project. (Oh, CWS says 4000+ for you... wow, I wonder how many are paid.)
Edit2: Looking over your update history is almost nostalgic. "Fixed issue with overlapping commands -- delaying commands that are partial matches of other commands." Had to do the exact same thing!
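The "delay commands that are partial matches" fix can be pictured roughly like this. It is a minimal Python sketch, not code from either extension: the command list, the `DELAY_S` grace period and the `resolve` helper are all hypothetical. If a transcript exactly matches a command that is also the start of a longer command, it waits for a short silence before firing.

```python
from typing import Optional

# Hypothetical command set: "click" is a word-level prefix of "click submit".
COMMANDS = ["scroll", "scroll up", "click", "click submit"]
DELAY_S = 0.5  # hypothetical grace period before firing an ambiguous command


def is_partial_match(transcript: str, commands: list[str]) -> bool:
    """True if transcript is a word-level prefix of some longer command."""
    words = transcript.split()
    return any(
        len(cmd.split()) > len(words) and cmd.split()[: len(words)] == words
        for cmd in commands
    )


def resolve(transcript: str, silence_s: float, commands: list[str] = COMMANDS) -> Optional[str]:
    """Return the command to run now, or None to keep waiting.

    silence_s is how long the user has been silent since the last word.
    """
    transcript = transcript.strip().lower()
    if transcript not in commands:
        return None  # no exact match (yet)
    if is_partial_match(transcript, commands) and silence_s < DELAY_S:
        return None  # could still grow into a longer command; hold off
    return transcript
```

With this, `resolve("click", 0.1)` waits, `resolve("click", 0.8)` fires "click", and "click submit" fires immediately because nothing longer starts with it.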
Edit3: We have so many overlapping command names that I almost wonder if you took inspiration from my project. Either that or it's just a case of convergent evolution.
Edit4: Suggestion for dictation: a way to choose between typing a special character and writing out its name. It doesn't look like there's a way to get ^ vs. "caret" or & vs. "ampersand". Something like "Enter special character caret". Maybe you already have a plugin for this, though, idk.
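A rough Python sketch of how such a "special character" command could behave. The trigger phrase and the symbol table are assumptions for illustration, not an existing LipSurf feature: a symbol name right after the trigger becomes the symbol, and everything else is typed as the literal word.

```python
# Hypothetical name -> symbol table for dictation.
SYMBOLS = {"caret": "^", "ampersand": "&", "asterisk": "*", "hash": "#"}
TRIGGER = ["special", "character"]  # hypothetical trigger phrase


def dictate(utterance: str) -> str:
    """Turn a dictated utterance into text.

    "special character caret" becomes "^"; a bare "caret" stays the word "caret".
    """
    words = utterance.split()
    out = []
    i = 0
    while i < len(words):
        # Trigger phrase followed by a known symbol name -> emit the symbol.
        if (
            [w.lower() for w in words[i : i + 2]] == TRIGGER
            and i + 2 < len(words)
            and words[i + 2].lower() in SYMBOLS
        ):
            out.append(SYMBOLS[words[i + 2].lower()])
            i += 3
        else:
            out.append(words[i])  # ordinary word, typed as-is
            i += 1
    return " ".join(out)
```

For example, "x special character caret 2" yields "x ^ 2", while "the caret key" is left untouched.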
Edit5: God, this is so well architected! Plugins and contexts are just fantastic ideas for this domain. Click by voice using hidden search-for-text is also a perfect solution to that problem. I wonder if this could be made more intelligent, e.g. "Click Submit in the sidebar on the left". Challenging, though.
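To make the click-by-voice idea concrete, here is a language-neutral sketch in Python of matching a spoken "click <text>" against the visible labels of clickable elements. In a real extension this would run against the DOM in JavaScript; the plain list of labels and the `pick_target` helper are hypothetical, and `difflib` similarity merely stands in for whatever fuzzy matching an extension actually uses.

```python
import difflib
from typing import Optional


def pick_target(spoken: str, labels: list[str], cutoff: float = 0.6) -> Optional[int]:
    """Return the index of the label best matching the spoken text, or None.

    Case-insensitive substring hits win (shortest label first); otherwise fall
    back to difflib's similarity ratio, ignoring anything below `cutoff`.
    """
    spoken = spoken.strip().lower()
    lowered = [label.lower() for label in labels]
    # 1) Prefer labels that contain the spoken text verbatim.
    hits = [i for i, text in enumerate(lowered) if spoken in text]
    if hits:
        return min(hits, key=lambda i: len(lowered[i]))
    # 2) Otherwise take the closest fuzzy match, if it is close enough.
    best = difflib.get_close_matches(spoken, lowered, n=1, cutoff=cutoff)
    return lowered.index(best[0]) if best else None
```

With labels `["Submit", "Submit feedback", "Cancel", "Log in"]`, "submit" picks index 0 and a misrecognized "log inn" still lands on "Log in". Region qualifiers like "in the sidebar on the left" would need layout information this sketch ignores.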
Edit6: Wow, just noticed someone else built something called "Handsfree for Web" somewhere along the way and theirs is ALSO way better than what I had built. Geez. Starting to feel bad about my awful website.
Never saw yours before, but I discovered "Handsfree for Web" a few months after I started and thought he had ripped mine off. I no longer think so, but yes, it seems like many commands are the same. Shame that so much wheel reinvention is going on. One thing that makes LipSurf "special" is the deep integration with sites. I wanted to use Duolingo, Reddit, HN and some others more with voice, so they get special plugins. Doing Duolingo with voice is a game changer for language learning, and if it weren't for use cases like that I would likely have lost interest long ago, like you.
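One way to picture per-site plugins is a registry that adds extra commands only when the current page matches. A minimal Python sketch: the `Plugin` class, the host patterns and the command/action names are invented for illustration and do not reflect LipSurf's actual plugin API.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from urllib.parse import urlparse


@dataclass
class Plugin:
    """Hypothetical site plugin: extra commands that apply on matching hosts."""
    name: str
    hosts: list[str]  # glob patterns for hostnames
    commands: dict[str, str] = field(default_factory=dict)  # phrase -> action id

    def matches(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        return any(fnmatch(host, pattern) for pattern in self.hosts)


# Commands that work on every page (hypothetical).
GLOBAL_COMMANDS = {"scroll down": "scroll_down", "go back": "history_back"}

PLUGINS = [
    Plugin("hn", ["news.ycombinator.com"], {"upvote": "hn_upvote"}),
    Plugin("reddit", ["reddit.com", "*.reddit.com"], {"upvote": "reddit_upvote"}),
]


def active_commands(url: str) -> dict[str, str]:
    """Global commands plus those of every plugin whose host pattern matches."""
    commands = dict(GLOBAL_COMMANDS)
    for plugin in PLUGINS:
        if plugin.matches(url):
            commands.update(plugin.commands)
    return commands
```

The same spoken phrase ("upvote") can then map to a different site-specific action depending on which page is open, while global commands stay available everywhere.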
"On some browsers, like Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline."
Seems like the answer for why the API isn't in Firefox. It's also not standardized, and is prefixed, so...
You should try Rhasspy. It's open source and respects your privacy by using offline services. It's fully customizable (each service can be replaced by another), and all the services are containerized for easy installation and available for several architectures, such as ARM (e.g. on a Raspberry Pi). There is even an option to use Mozilla DeepSpeech for speech-to-text.
I'm sorry, but cloud based speech recognition in itself would already be a red flag, even if Mozilla was doing it in-house. Outsourcing it to Google though? I feel like a company as ostensibly privacy-focused as Mozilla should really know better by now...
Mozilla DeepSpeech is rapidly maturing, but it needs thousands of hours of validated audio data to train each language. It's a feat that with only 2000 hours of audio they can achieve a 5.97% word error rate.
Baidu had 5000 hours of audio data to train their DeepSpeech and DeepSpeech 2 models, meanwhile Google, Microsoft & IBM have people constantly giving them fresh audio to train and validate their models with.
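For reference, word error rate is the word-level edit distance between the recognizer's output and a reference transcript, divided by the number of reference words: (substitutions + deletions + insertions) / N. So 5.97% means roughly 6 errors per 100 spoken words. A small Python sketch of that standard definition (the function name is ours):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference.

    Computed as the word-level Levenshtein distance via dynamic programming.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 ref words and first j hyp words
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: ref word missing from hyp
                curr[j - 1] + 1,     # insertion: extra hyp word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)
```

For example, hearing "set the timer for five minutes" when "set a timer for five minutes" was said is one substitution over six words, a WER of about 16.7%.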
Hoping they are able to continue their efforts and be more respectful. They are the only company I'm fully lenient with about "allow analytics", in the hope that they can improve and compete with their more nefarious competitors.
A few hours ago they pushed an update to me on Android that disabled all add-ons except uBO, because they don't support them yet.
That was the straw that broke the camel's back for me, after using FF since 1.0. I just rm -rf'd my Firefox profile on my desktop a few minutes ago.
I've defended them for a long time even if I didn't agree with everything they did, but they're so completely off the rails, enough is enough. Blink mono-culture it is then.
I feel the same way. If they had FOSS speech-to-text available, then why not. But otherwise, why? For me, this voice thing is totally useless. Instead of adding it or developing new STT, they could have spent time on other innovative things. I am a big fan of Mozilla and a daily user, and I am still very thankful for the more open and privacy-friendly ecosystem it offers, but implementing voice on top of Google and presenting it as a big feature looks strange to me.
Bigger players have the leverage to get companies to do something they don't do out-of-the-box. They can contractually oblige them to do that, as well as sue each other if one side breaks its part of the deal.
DeepSpeech is a lot less accurate and much slower than Facebook's FOSS offering, wav2letter, on equivalent data. If they want something competitive they'll need to drop DeepSpeech, or overhaul it. Common Voice is where the value is.
Like, there's nothing stopping Firefox from just using wav2letter. It's BSD-licensed.
But DeepSpeech has already been trained with millions of data samples!
I'd feel way better about it if they went for a slightly worse DeepSpeech-based implementation but kept it working in the free software spirit they have been known for over many years.
Also, for desktop devices inference on DeepSpeech is cheap enough, so they could even go the extra mile and work on some Wasm magic to get offline recognition.
That's the kind of work I'd expect from Mozilla! Not wiring up your data collection to the Google Cloud APIs and calling it a day! I'm genuinely disappointed with them...
Mozilla needs to come more around to Apple's way of thinking. These things need to be done locally on the device, not farmed out to some cloud. Use the cloud (CDN) to deploy the software, but run the software locally.
* "Make me laugh" always brings me to the same YouTube video.
* Had pretty much no issues with the default prompts. It was able to find some challenging Spotify playlists and open random websites (including non-standard, non-English domains when I spelled them out).
* "Read this page" uses an awful TTS engine, which is a shame considering that I might actually use this feature on a somewhat regular basis. I'm assuming it uses whatever it detects on the OS level, and so far I haven't bothered with finding a better one (on Ubuntu, if you know of one, please suggest).
* "Set a timer for X min" works just fine, which is probably the only thing I use Google's assistant for on my phone (or whatever its name might be now).
* I like the idea of routines in the app settings, which are supposed to tie multiple queries together. I could see myself using it for something like a morning routine (tell me what time it is, give me weather info, read me news, etc.)
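Conceptually, a routine is just a saved list of queries handled in order. A toy Python sketch of that idea; the intents, regexes and action strings are hypothetical stand-ins rather than Firefox Voice's implementation, and the timer pattern only understands digit forms like "set a timer for 10 minutes".

```python
import re
from typing import Optional

# Each intent pairs a regex over the spoken query with a handler that
# produces an action string. All of these are hypothetical placeholders.
INTENTS = [
    (re.compile(r"set a timer for (\d+) min(?:ute)?s?"),
     lambda m: f"timer:{int(m.group(1)) * 60}s"),
    (re.compile(r"what time is it"), lambda m: "speak:current_time"),
    (re.compile(r"read me (?:the )?news"), lambda m: "open:news"),
]


def handle(query: str) -> Optional[str]:
    """Map one spoken query to an action string, or None if nothing matches."""
    text = query.strip().lower()
    for pattern, handler in INTENTS:
        match = pattern.fullmatch(text)
        if match:
            return handler(match)
    return None


def run_routine(queries: list[str]) -> list[Optional[str]]:
    """Run a routine: handle each saved query in order."""
    return [handle(q) for q in queries]
```

A morning routine like `["what time is it", "read me the news", "set a timer for 10 minutes"]` would then produce the three corresponding actions in sequence.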
>We share your audio recording with Google Cloud’s speech-to-text service to assist us in processing and carrying out your commands. Audio recordings are shared without personally identifiable metadata, and we’ve instructed Google’s service not to retain the audio or transcript associated with a command after it processes the command
I've wanted to port it to Firefox, but the HTML5 SpeechRecognition API (https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...) is still not available. Why not just make the API available and leave this in addon territory for all developers?
https://www.handsfreechrome.com/
According to the article, Firefox is offering this through an add-on... obviously not using the SpeechRecognition API, however.
https://rhasspy.readthedocs.io/en/latest/
> Note: In the future, we expect to enable Mozilla’s own technology for Speech-to-Text which enables us to stop using Google’s Speech-to-Text engine.
edit: s/texting/testing
Firefox Voice data should help rapidly expand the Common Voice audio corpus beyond the 1,492 hours it currently contains: https://commonvoice.mozilla.org/en/datasets
Hahaha! :D Thanks for the good laugh.
I agree with and applaud their truthful choice of words. There’s no such thing as “contractually oblige”.
// For perspective, consider one party receiving a National Security Letter (“NSL”) with a gag order.
Like, there's nothing stopping Firefox from just using wav2letter. It's BSD-licensed.
https://github.com/facebookresearch/wav2letter