CNNs are mentioned mainly for processing considerations, since they are computationally cheaper to work with. But given the nature of speech recognition, which is so highly temporally correlated, it shouldn't be a surprise that a recurrent neural network would be used. This is pretty much exactly what the RNN architecture was designed for.
Also, if you haven't looked into how exactly an RNN Transducer functions, I highly recommend doing so. RNN-Ts resolve a great deal of problems that traditional RNNs and CNNs are unable to deal with.
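To make the transducer's behavior concrete, here is a toy greedy decoding loop in Python. Everything in it (the random stand-in weights, the simple additive joint, the symbols-per-frame cap) is made up purely for illustration and is not Google's model; it only shows the control flow that lets an RNN-T stream: at each step the joint network either emits a label (staying on the current audio frame) or emits "blank" (advancing to the next frame).

```python
import random

random.seed(0)

# Toy greedy RNN-T decoding sketch. All weights are random stand-ins,
# purely to illustrate the control flow -- this is NOT Google's model.
# The transducer joins an acoustic encoder output f_t with a label
# predictor output g_u and, at each step, emits either a label
# (advance u, stay on the same frame) or "blank" (advance t, i.e.
# consume the next audio frame). That interleaving is what lets it
# stream, and handle output sequences shorter or longer than the input.

V, T, H = 5, 8, 4    # vocab size (index 0 = blank), frames, hidden size

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

enc = [rand_vec(H) for _ in range(T)]     # stand-in encoder frames f_1..f_T
pred = [rand_vec(H) for _ in range(V)]    # stand-in predictor state per label
joint = [rand_vec(V) for _ in range(H)]   # stand-in joint-network weights

def greedy_decode(enc):
    hyp = []                  # labels emitted so far
    g = [0.0] * H             # predictor state for the empty prefix
    for f in enc:             # streaming: one encoder frame at a time
        for _ in range(3):    # cap symbols per frame (real decoders do too)
            # joint(f_t, g_u): a tiny additive joint plus a projection
            logits = [sum((f[i] + g[i]) * joint[i][v] for i in range(H))
                      for v in range(V)]
            k = max(range(V), key=lambda v: logits[v])
            if k == 0:        # blank: move on to the next frame
                break
            hyp.append(k)     # non-blank: emit label, update predictor
            g = pred[k]
    return hyp

print(greedy_decode(enc))
```

Because the loop consumes one encoder frame at a time, transcription can begin before the utterance ends, which is where the latency win over server round-trips comes from.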
"But it’s sort of funny considering hardly any of Google’s other products work offline. Are you going to dictate into a shared document while you’re offline? Write an email? Ask for a conversion between liters and cups? You’re going to need a connection for that!"
While offline, you might write email drafts, your blog, or even a book: https://medium.com/@augustbirch/what-i-learned-writing-an-en...
What's missing is the ability to make edits using your phone. You can probably speak at over 100 words a minute but then you need to stop to bring up the software keyboard.
The offline aspect is hardly the main draw here, though. As mentioned earlier in the article, the latency reduction is huge. Another aspect they didn't really cover is the privacy implications. Lastly, even when you're not offline, a dodgy connection is a pain if you need a stable stream of packets going back and forth.
I refuse to put an amazon/apple/google surveillance device in my home, so I am very interested in a DIY digital assistant device. I'm aware of a few options but it seems like offline voice recognition is always a little sub-par. I am really looking forward to the day when an offline, open source digital assistant can compare in quality to a proprietary/cloud device.
>As mentioned earlier in the article, the latency reduction is huge.
Well, on macOS offline voice recognition is actually much slower than online. Not to mention the vocabulary and choice of words is quite limited. I'd love to get an offline version, but so far every online version seems to be better.
FWIW, Google Translate (including the "translate from picture" feature) is an example of a product that has had an offline option for quite some time. You have to tell it to download a pack for each language pair, IIRC.
For the record, it wasn't always this way; over the last couple of years, though, they have made a lot of improvements on this front. I think it may have something to do with Google's "next billion devices" being in countries with bad connectivity.
With that said I especially like the Google Maps offline features which have been added recently. You can even have it calculate driving directions completely offline if you have the starting and ending addresses.
I just switched my Pixel 1 to airplane mode and tried voice input. Sure enough, it worked offline and it was fast! Very impressive work. (I've tried that before, but in the past it could only understand a few special phrases.) I suppose this new feature came with the security update my phone downloaded a few days ago.
There are lots of ways to spin this, but I see it as a significant improvement for any app that could benefit from voice input. It's immediate and not susceptible to network glitches. The benefit for Google, IMHO, is primarily more sales of updated Android devices.
Unless you very recently (meaning today) accepted a download of a new language pack for English, it's likely just the old model, which is perfectly functional, though not as accurate as the online version.
> But it’s sort of funny considering hardly any of Google’s other products work offline.
I dunno, Android and a lot of Google's mobile apps that aren't about online communication work fine offline. Actually, a lot of the online-communication ones do too, as much as is even conceivable; they just don't transmit and receive offline, because how would they?
On the other hand, when you can transcribe locally, uploading whole days' worth of eavesdropping would not cause a noticeable spike in traffic. I'd consider it more a lateral change than an improvement.
I'm unclear on whether this moves the privacy needle. It says they do offline transcription, but they may still attempt to send the audio clip to compare against the resulting text.
It could be used to improve privacy, I just don't know if it will be used that way.
I generally think of Google the same way I think of the NSA. If they stop doing something invasive, either it didn't work, they found a better way of doing it, or it was transferred to a legally distinct category, and we only hear about it because of PR considerations.
Gboard is governed by Google's catch-all privacy policy, which allows them to gather all data and mine everything.
If you have an Android device with Google services and a firewall, you'll see that the device is constantly phoning home, which is also noted in the privacy policy.
This does nothing for privacy; rather, it provides the illusion of privacy.
Does the Pixel have some specific hardware that this uses, or is it simply limited to Pixels to limit the rollout? I'm curious whether I should get my hopes up about seeing this in Gboard on non-Pixel Android devices.
The Pixel 2 and later do have a coprocessor for compute workloads (the Pixel Visual Core). However, users here have reported this working on a Pixel 1, which doesn't have that chip.
The Verge says it may reach other devices later.
It sounds like it's both better than the old dictation model, and significantly smaller.
Even so, offline AI solutions have been piss-poor, and Google moving the state of the art forward, in spite of their vested interest in keeping people online, is a good thing.
Yes, we want an open source solution, but I'm not going to work on it. So who's going to work on it? Are you?
In absence of resources working towards the ideal, I'll applaud any step in the right direction.
Didn't they advertise something like this a few years ago? I seem to remember trying it and finding that it didn't really work as well as the online recognition at the time.
arXiv: https://arxiv.org/abs/1811.06621
Gboard > Voice Typing > Faster voice typing
It says it's an 85MB download for US English.
This is translating what you said after the wake word from voice to text on the local [Pixel] hardware rather than sending it into Google's Cloud.
The biggest benefits here are speed and reliability. It could also handle some actions offline.
The thought that every interaction with my phone is being streamed in realtime to a third party server freaks me out.
Kudos to Google for working on this.
You want an open source solution, not just an offline solution.
EDIT: Looks like something was added in Jelly Bean: https://stackoverflow.com/questions/17616994/offline-speech-...