Shank · 3 years ago
Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined. A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.

Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration. There are many failure points: it can "hear" you wrong, it can miss the wake word, it can hear correctly but interpret wrong, miss context clues, or simply be unable to process whatever the request is. In my experience, most normal people relegate voice commands to ultra-specific tasks, like timers, weather, and music, and that's that. Google and Alexa are relatively good at "trivia" questions, but Siri is a complete failure. All systems have edge cases that make them brittle.

I think there's potential here. Cortana was the most promising: an assistant that's integrated into the OS and can change any setting or perform anything on-screen would, again, be really awesome. We just don't have that. I think maybe OS-wide + GPT 4 (or later) might get closer to what we expect, but it's just not great right now. I really want to be able to say something as unstructured as "hey siri, create alarms every 5 minutes starting at 6am tomorrow" or "hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news". There /is/ power to-be-had, but nobody has really tapped it.

qsort · 3 years ago
The problem isn't voice, it's natural language.

Natural language is a fundamentally wrong vehicle to convey information to a computer. It can be useful for some specific tasks, automated Q/A, simple interfaces to databases, stuff where I can't be properly f_ed to remember the syntax or the shortcut like IDE commands.

But the idea it can replace formal language is fundamentally and dangerously incorrect. I agree with Dijkstra's quip, we shouldn't regard formal language as a burden, but rather as a privilege.

bombcar · 3 years ago
I'd be perfectly happy with a list of Siri commands that I would have to learn to be able to do things. I don't care if I ended up sounding like:

Hey Siri

Turn lights on 50 percent

For one hour

Dim over that time

Play music.

I can learn what I need to do; JUST LET ME KNOW THE MAGIC WORDS!

albertzeyer · 3 years ago
On the other side, humans have been fine using natural language to delegate commands to each other.

So maybe it's just that the subfield of natural language understanding is still too early to be really useful. Speech recognition itself has gotten really good but then understanding the context, the intent, etc, all that is natural language understanding, and that is often the problem.

pjc50 · 3 years ago
The problem is that it's not actually a conversation. To significantly improve it, you'd want to:

- identify users by voice

- ask them clarifying questions

- remember the answers on a per-user basis

- understand "no, that was the wrong answer"

If you're going to provide a formal interface to the computer, you also have to provide teaching in that formal interface, which is far more of a burden to the user than the cost of the device. And we've completely moved away from that model (not necessarily a good thing, but that's what the market has chosen).

4b11b4 · 3 years ago
An example backing this is voice assistants that DO work, e.g. Talon voice. But these require defining a language, and then they are very accurate and powerful.

I don't see why a voice assistant for the masses couldn't "train its own users", for example by suggesting the language it does expect. But even then, most of the time people are talking in noisy environments, or talking too fast, or don't have an understanding of how the machine might work. Regardless, who cares. They ruin the audio environment of a home. They're good for setting timers while you're cooking, and that's about it.

version_five · 3 years ago
Right - natural language works for people because we have minds that are communicating. A virtual assistant has a list of things it can do, and uses language as an interface to them. So the language just becomes obfuscation instead of allowing clarification.

I've said before that I would prefer a voice assistant optimized for traversing its menu system in response to unambiguous noises (could be high and low pitch hums or whatever), letting me bypass the guessing game and use the menu it's hiding.

foobarian · 3 years ago
The problem is that it doesn't make money.

Otherwise, it works great :-) We love the hands-off usage mode because we cook a lot, so adding things to shopping lists or looking stuff up doesn't require cleaning hands in the middle of prep. Also the speakers are pretty darn good for the size and work well for music.

Doing complicated things is right out though. But the simple stuff works fine.

Ajedi32 · 3 years ago
I'm just waiting for someone to finally release a voice assistant built around an actual language model, like GPT-3 or LaMDA.

It would be more error prone in a lot of ways, which is probably why nobody's done it yet, but it would also be a _lot_ more powerful, and fulfill the vision of conversational AI in a way the current rules-based assistants do not.

I think if powerful language models were easily accessible to normal people (in an inexpensive and completely unrestricted fashion, like with Stable Diffusion) we'd already see this happening in the open source world. Companies are going to be a lot more hesitant to try it though until they have a way to 100% prevent the models from making mistakes that could reflect poorly on the company, which is going to take _way_ longer to achieve.

RupertEisenhart · 3 years ago
Are you trying to say, Alexa should be funding the synthetic language nerds over at Lojban[0] or the Universal Networking Language[1]???

That would be a fun universe.

[0] https://mw.lojban.org/index.php?title=Lojban&setlang=en-US

[1] https://en.wikipedia.org/wiki/Universal_Networking_Language

gernb · 3 years ago
Natural language conveys information to other people just fine. So the problem isn't that "Natural language is a fundamentally wrong vehicle to convey information to a computer". The problem is getting the computer to understand natural language to the same level as a human.
darkerside · 3 years ago
The problem is both
ClumsyPilot · 3 years ago
> we shouldn't regard formal language as a burden, but rather as a privilege

What the hell? Is riding public transport or riding a bike either a burden or a privilege? Is driving a car?

I am trying to control shit in my home, it should be neither.

bambax · 3 years ago
> I think there's potential here.

But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.

And even if you're all alone in a silent place, giving instructions out loud takes more time than configuring a screen, and will always be error prone, because the feedback will always be ambiguous and imprecise.

Except maybe if the feedback is on a screen, but then if there's already a screen, why not use it?

t-sauer · 3 years ago
I think the best use cases for voice assistants are when you don't have free hands. I have two scenarios where I use voice assistants: setting a timer while cooking and changing the music while showering. Both could be done by other means as well, but they wouldn't be more convenient.
Shank · 3 years ago
> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.

I would separate out the two, actually. There's a "natural language control system for the entire OS" and then there's the actual voice part. Voice is mostly useful for accessibility purposes -- hands full, running, driving, etc. However, the other side is that a text-based NL assistant would also be profoundly useful. On iOS, you can enable "Type to Siri" and just type sentences, and Siri will respond back in text.

If we make progress on NL-driven command-lines, we can actually make progress on voice-assistants, and vice versa. The catch is that the voice side still needs recognition work.

papito · 3 years ago
Well, you are not trying to operate heavy machinery with Amazon Echo - hopefully. Voice as a common interface - I agree with all of that, but to me the everyday utility of being able to add something to my shopping list or my TODO list without having to fire up an APP greatly increases my quality of life. That part is magical, but I don't expect a lot more from it.
stephc_int13 · 3 years ago
If the assistant AI was advanced enough for pleasant conversations to occur, it would be useful.

It would be trivial to use the interface on screen when appropriate, and a truly smart assistant should be able to follow the context and be aware of your preferences and mood.

This is not fundamentally impossible, we're simply not there yet.

ClumsyPilot · 3 years ago
> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click.

Smart home lights/etc. while hands are occupied, like with a baby. But use cases are quite limited

pmontra · 3 years ago
> But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click

Working from home changes that. I can see many more opportunities for a multimodal input interface. Examples:

1. My fingertips now are closer to the "reply" button below this text area than they are even to the touchpad. Touching "reply" is half a second, moving one hand to the touchpad, aiming the pointer at the button and clicking takes longer. With a mouse: much longer. Anyway, my screen is not a touchscreen. I'll click.

2. Or, with an assistant, I could have said "Click reply", provided that the assistant knows where the focus is and that it can read the form I'm typing in.

matthewmacleod · 3 years ago
"hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news"

I think the problem with that is that even I, as a human, struggle to know for sure what you want.

You want to turn all the lights on in the house? Does that include the lamps in the bedroom? How about new lights that you add later? Or the ones in the garden? It's full of ambiguity. What device do you want to watch the news on? Or did you mean the radio? Do you want this to apply when you get back at 2am one night, meaning your family gets woken up when you turn on all the lights and start playing the news in their bedrooms?

I think that's probably why voice interfaces aren't likely to work well for anything beyond direct, specific, well-scoped requests: turn on the lights in the bedroom; turn off the heating at home; roll up the blinds; what's the weather like today; what's the remaining range on my car. They really struggle to deal with anything more complex – not so bad in theory, but really incredibly irritating when they make the wrong decision.

If you had some kind of 24-hour live-in assistant (a butler, maybe?), then they probably have the knowledge and intuition to make sensible decisions in response to fairly unstructured requests. But I think we're miles off getting a voice assistant to do it – not because they can't, necessarily, but because if they mess it up at all it's infuriating.

bombcar · 3 years ago
You can do some of this with shortcuts, and then use Siri to trigger the shortcut. But that involves thinking; the magic of Jeeves is that he knows what you want even before you do.
gspencley · 3 years ago
I might be in the minority, but I also don't want to add things to my life that make my environment noisier or that require me or others living with me to speak more. As much of a Star Trek fan as I am, I never found "The Computer" to be appealing, and always thought of it more as an artistic device. It's a lot easier to communicate a character's intent / action if they are vocalizing it for performance. Even in scenes where they are "typing" something into the computer, they will inevitably be communicating to the captain or another character what they are doing.

In practical reality these interfaces feel, to me, extremely inefficient. As someone who doesn't particularly like to speak, and prefers silent environments, these interfaces require more energy from me to use. Unless they are serving someone who has a physical impairment, I don't see what problems these solve, but I can identify lots of problems that they introduce (not only noise but privacy/security vulnerabilities, etc.)

Personal preference.

eternityforest · 3 years ago
Timers and reminders alone are enough to make them a pretty nice thing to have though.

I don't really want them to be all that much more powerful, because natural language can be imprecise, and... there's just not much that I want to automate in a home setting beyond some real simple timers for lights and stuff.

What if I had a bad day and didn't want to see depressing news? Or what if I came home and was talking on the phone when it turned the news on?

True automation as opposed to just telemetry and remote control can easily be annoying more than helpful.

I like the idea of automation... but I don't actually... automate anything aside from timers and reminders.

ghaff · 3 years ago
I think that's generally true though playing music is a little more freeform. (And, guess what? Voice assistants tend to be worse at that.)

The problem is that many, many billions of dollars have been sunk into making these devices about more than setting alarms and timers. There's actually been a lot of pretty amazing progress. But it's yet another one of those things where getting to 90% isn't good enough for anyone but techies who want to fiddle with their smarthome stuff or otherwise play with the technology.

antupis · 3 years ago
If I were in this space I would build voice assistants for very specific situations where you cannot type, like driving, cooking, or doing sports. There is lots of potential, but the big players are trying to build a generic tool for every situation, which is a super hard problem.
dmitriid · 3 years ago
You want utility. The big players want a product that can be monetized and milked for revenue.
brycehalley · 3 years ago
Voice assistants have reached the Unhelpful Valley stage.

When they were a novelty, I recall the excitement of trying new commands and layering in context; after many failures, I've been conditioned to attempt, and expect success with, only generic queries.

mc32 · 3 years ago
To me what’s interesting is that MS smelled that it was a problem a while ago and pulled the plug before it ate a hole in their wallet but Amazon and Google keep plugging along ploughing money into a bottomless pit. Apple has a different play and looks like they are controlling their losses there quite well and may act as a slight loss leader for other products.
foobarian · 3 years ago
I can't fathom how they managed to spend so much on it, though. The product has been around for quite a while, as well, so it's not some initial ramp-up cost. $3B/quarter $10B/year? Wow.

Edit: Maybe things like this happen because there are various nerds who lead these products and are good at talking the businesspeople into funding them. Maybe this was only possible at the big-tech growth stage, while the business side wasn't that good at judging the value proposition. So, end result, lots more engineers get paid, which is great in my book :-)

serial_dev · 3 years ago
> Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration

My biggest frustration with Alexa is getting it play the podcasts I want to listen to. Even popular podcasts with English names are hard to get just right for Alexa. The same goes for song titles and bands that are not popular, or they are in other languages.

Usually when I want to take a shower, I try to get the podcasts/music to play for 2 minutes, then sigh, give up and just say "Alexa play Britney Spears".

ghaff · 3 years ago
And discoverability. For a long drive I probably want to pick out some specific podcast episodes rather than play whatever. I'm just not a whatever background sound sort of person. The interfaces aren't really good enough to present me with some options with voice control only. So I end up mostly pre-populating a "Car" playlist.
phkahler · 3 years ago
>> A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.

And even then, a voice assistant is essentially a user interface, not a product or service.

It could be a service if you could reliably say "Alexa, plan my trip to customer X the week of the 30th and send me my itinerary". But for now they are an alternative to a phone UI.

ghaff · 3 years ago
The reality is that even a human personal assistant can rapidly devolve to being more of a hindrance than a help if they're not very good once you get beyond simple mechanical tasks. Even with all the knowledge about the world that most adults carry around in their heads. Yes, a poor human assistant can fall down in other ways such as forgetting to do something--but they have a lot of context.

This seems a really high bar for voice assistants aspiring to do much more than set alarms or turn the odd light etc. on or off.

PurpleRamen · 3 years ago
The potential would be there if they focused on the assistant part and treated voice as just one means of interacting with the assistant, besides other means like clicking, typing, showing complex information on a screen, etc.

Voice alone sucks; it's just too limited to be useful on a grand scale. Similarly, command lines suck too. The shell in general has the same problems that voice assistants have, just that it has more value and has had decades to mature into something actually useful. And today we have unix-shells which reduce the problematic parts by many levels, and still receive constant improvements. This is missing for voice assistants, because unix-shells are growing and improving in an open space, where everyone can add their own things. This is not happening in big tech.

sublinear · 3 years ago
I don't think this is actually reliably possible, because while grammar does tend to follow patterns, we're fundamentally dealing with an exponential number of ways to say things to a voice assistant.

In the spirit of the title of this post, someone else also has to say something.

If your argument is that this is a "non-visual command line" there's slim hope of the layperson learning a whole secret grammar without even a goddamn man page just to do their menial tasks.

ianai · 3 years ago
I really doubt *nix would have made it so far if the cli were audio based, too. It's a fundamentally slower and lower bandwidth communication channel.
_dain_ · 3 years ago
>Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined.

This got me thinking. Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager. Now there's just the language part.

Thing is, I don't want to speak to my computer using English. Aside from the enormous practical problems in natural language processing you've outlined, I just find the idea creepy[1].

What I want is to unambiguously tell it to do arbitrary things. I.e. use it as an actual computer, not a toy that can do a few tricks. I.e. actually program it. In some kind of Turing complete shell language that is optimized for being spoken aloud. You would speak words into the open source voice recognizer, it writes those to stdout, then an interpreter reads from stdin and executes the instructions.

Is there any language like this? What should it look like?
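I'm not aware of one off the shelf, but here's a toy sketch of that stdin pipeline under my own invented conventions (every verb and the "dot" spelling rule are made up for illustration, not any real system):

```python
# Hypothetical spoken-shell vocabulary: short, acoustically distinct
# verbs that expand to argv fragments. All entries are invented examples.
VERBS = {
    "list": ["ls", "-l"],
    "show": ["cat"],
    "goto": ["cd"],
}

def join_filename(tokens):
    """Spell filenames aloud with a toy convention: the word "dot" means "."."""
    return "".join("." if t == "dot" else t for t in tokens)

def interpret(words):
    """Translate a list of spoken tokens into an argv-style command."""
    if not words:
        return None
    verb, *rest = words
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return VERBS[verb] + ([join_filename(rest)] if rest else [])

# A driver would read the recognizer's output line by line, e.g.:
#   for line in sys.stdin: subprocess.run(interpret(line.split()))
```

So "show readme dot txt" becomes `["cat", "readme.txt"]`. The interesting design question is picking words that are short to say but hard for a recognizer to confuse with each other.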

And yeah that would take effort to learn to use it right, just like any other programming language; so be it. This would be a hobbyist thing.

[1] https://i.kym-cdn.com/photos/images/original/002/054/961/748...

viraptor · 3 years ago
> So the recognition part is solved

If you're using an averaged American voice - maybe. But it's really not solved for everyone. Google assistant can't set the right timer for me 1/10 times. And that's before we get to heavy accent Scots and others.

simiones · 3 years ago
> Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager.

This is potentially far from true, depending on how exactly you draw the line between "voice recognition" and "language". I've looked at quite a few transcription services, and they fail a lot of the time for most people - those who either have a non-native accent (even if very slight!) or those who do any amount of stammering or other vocal tics.

Shank · 3 years ago
> Voice recognition is basically a commodity now .. there are open source AI engines that can do it offline really well. So the recognition part is solved, you can just grab it from your distro's package manager.

I personally don't consider this a fully-solved problem. The best transcription system I've used is OpenAI Whisper, and it doesn't work in realtime. Maybe it's fine on small amounts but it's still not perfect. You really need error to be driven down dramatically. Zoom auto-captions are a joke in terms of how badly they work for me, and Live Text (beta) on macOS is equally dreadful. YouTube auto-captions suck. All of these use industry-leading APIs. If I'm speaking a voice command and one single word is wrong, usually the whole thing fails.

There's an entirely separate issue about things that are Proper Nouns that don't exist. For example, "Todoist" is often misunderstood by Siri. Thus, people started saying "Two doist (where doist rhymes with joist)" to fool it into understanding "Todoist". Media like anime with strange titles from other languages often flat out trolls these transcription systems. ("Hey Siri, remind me to watch Kimetsu no Yaiba tomorrow".)

Aramgutang · 3 years ago
That reminds me of the handwriting recognition approach [1] used in old Palm Pilot devices. Even though the shapes it expected you to draw resembled the corresponding letters, you would never draw them like that if you were writing on paper.

You knew that you were drawing something designed for a computer to recognise as unambiguously as possible, while being efficient to draw quickly and easy to learn for you. I feel like that's the kind of notion that voice interfaces should somehow expand upon.

[1] https://en.wikipedia.org/wiki/Graffiti_(Palm_OS)

draugadrotten · 3 years ago
> And yeah that would take effort to learn to use it right, just like any other programming language; so be it. This would be a hobbyist thing.

There are quite a few hobbyists working on local, on-prem, privacy-focused voice assistants with conversation support.

https://www.home-assistant.io/integrations/#voice

https://www.home-assistant.io/integrations/conversation/

Have fun. It is a rabbit hole.

spookthesunset · 3 years ago
To me the hardest problem is simply remembering what every light on my network is named. Did I call the light next to my desk “desk light” or did I call it “office light”? If I don’t get the name exactly right, I cannot control the light. Multiply that by every other light in the house and it becomes a lot to remember. I have probably 15 lights controlled by Alexa and I can only remember the name of like three of them. Thus most of the time it is just “Alexa turn on the lights” so it can turn everything on in a room.

If these voice assistants were smarter about “alternative” names for every device it might be easier to use. But as it stands, it’s kind of a pain because the way you phrase each request is so unforgiving…

Oh yeah, and god help you if your device name is similar to your room name. If your room is "office" (or did I name it "the office"?) and your light is "office light", Alexa is gonna have a bad time telling the two apart.

I have no clue how to fix this…
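One partial fix I could imagine (purely a sketch, not how Alexa actually works): fuzzy-match the spoken name against the registered device names instead of requiring an exact hit, e.g. with Python's standard-library difflib. A real assistant would need phonetic matching and room context on top of this.

```python
from difflib import get_close_matches

# Example registry of device names (stand-ins for whatever you configured).
DEVICES = ["desk light", "office light", "kitchen light", "hall lamp"]

def resolve_device(spoken, devices=DEVICES, cutoff=0.6):
    """Return the closest-matching device name, or None if nothing is close.

    cutoff is the minimum similarity ratio (0..1) to accept a match.
    """
    matches = get_close_matches(spoken, devices, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this, "kitchen lights" still resolves to "kitchen light" instead of failing outright, while total gibberish returns None and can trigger a clarifying question.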

PS: this is why I question steering wheel free self driving cars. How will we tell these things exactly where to go when we cannot even reliably tell our voice assistants exactly what light to turn on?

7952 · 3 years ago
I think the biggest potential is with Microsoft Teams in business. It is ubiquitous in people's work lives, has access to data, and is integrated with everything. And adding Cortana to calls would be an easy step for people to understand and learn. People would say "Cortana, share my screen". People would learn phrases from each other.
happymellon · 3 years ago
But Teams hasn't figured out how to send text in a coherent way.

It's used because companies can cheap out on buying licenses for other communication applications; it is fundamentally worse than anything else on every other metric. If voice let me respond to a message without hunting for the hidden reply button that Teams shoves below the bottom of the screen, it could be a win. Considering how low a priority UX is for Teams, I doubt it will be.

SheinhardtWigCo · 3 years ago
> There /is/ power to-be-had, but nobody has really tapped it.

This kind of thing can't be built for modern mainstream operating systems because they generally prevent subjugation of the OS components and other programs, even if the user wants that, ostensibly for security reasons.

Unlike a human operator, an assistant "app" can only operate within the bounds of APIs defined by the OS vendors and third-party developers. Gone are the days of third-party software that extends the operating system in ways that the overlords couldn't (or wouldn't) dream of.

sdf4j · 3 years ago
That's not entirely true. Accessibility APIs on macOS, for example, would let you control many aspects of the OS from user-land apps, provided the permissions are granted. But voice assistants are not up to the task.
bistable · 3 years ago
I think you're identifying some of the right problems here. All voice assistants are based on turn-taking, and when the VoiceAI hits one of those failure points and just comes back with "I didn't get that" it leaves the user in a frustrating state trying to debug what's wrong.

I work at SoundHound, where we've been worried about these issues. (I'm going to plug our recent work...) Our new approach is to do natural language understanding in real time instead of at the utterance (turn) level. That way we can give the user constant feedback in real time. In the case of a screen, that means the user sees right away whether they are understood, and if not, gets a better hint of what went wrong. For example, a likely mistake is an ASR mistranscription of a word or two.

We still need to prove this is a better paradigm for VoiceAI in products that people can try for themselves, and are working towards that goal. I hope that voice interfaces that were clunky with turn-taking will finally be more naturally usable with real-time NLU.

https://www.youtube.com/watch?v=5WLYH1qHfq8

sliken · 3 years ago
I tried Amazon's Alexa, the top-end model with a display. Often it would taunt you about new/interesting things on the screen, but I could never get them to work. I'd have to memorize things to get even the basics working. Ended up unplugging it.

However, Google's Assistant in comparison worked great: no memorization, and very useful. Sure, time, weather, setting timers, and alarms all worked great with a very flexible set of natural language queries. So did more complex things like what the temperature will be tomorrow at 10pm, simple calculations, and unit conversions. But also things like IMDB-style queries about directors, actors, which movies someone was in, etc. generally worked well. It seemed to really understand things, not just "A web search returned ...". Even something like the wheelbase of a 2004 WRX would return an answer, not a search result.

With all that said I'm looking for a non-cloud/on site solution, even if it requires more work, most recently noticed https://github.com/rhasspy/rhasspy

Sakos · 3 years ago
The big issue is that there's no clearly defined interface for users. What commands are possible? Nobody knows. So people default to the most obvious things, like setting a timer. Is it possible to set up your own commands and build your own workflows? AFAIK, no. So the tech is essentially dead in the water until companies fundamentally rethink what they're trying to do with voice assistants.
jasmer · 3 years ago
Yup. At the risk of being glib I would say this is 90% of the issue. Or more like 'the big blocking issue' at the moment.

Voice can do way more than we know, but we have no idea what it does or how to use it.

Standardizing the interface and providing tutorials would possibly change things dramatically.

And this goes for the back-end protocols as well.

The tech is way, way ahead of the UI and integration.

Imagine getting the power of 'git' with no tutorial and not really an understanding of what it does? Good luck with that.

90% of us would be using it in the car to do a lot of things if we really knew how to do it:

You: "Siri: Command. Open. Mail. Prompt. Recipients starting with S"

Siri: "Sarah, Sue, Sundar"

You: "Stop. Command. Message. To: Sundar. Thanks for the note. Stop. Send without Review"

Some of this already exists, but it's product-specific, etc. There needs to be some kind of natural universal interface - or we have to wait until the AI is really, really that good.

4b11b4 · 3 years ago
Talon voice can do everything a keyboard and mouse offers, plus more (contextual awareness, higher level abstraction). Very powerful in combination with modal editing. I'm not affiliated, just a user.

Granted, this is for a specific user base and yes, not in coffee shops.

Razengan · 3 years ago
This timeline is such a mishmash of mediocrity. Voice assistants could have been a vibrant ecosystem of different personalities, like, say, buying a Darth Vader voice pack or having your computer sound like a snooty English butler.

There's a great little game series called Megaman Battle Network (Rockman.exe in Japan) which diverges from the mainline by showing an alternate universe where scientists focused on AI instead of robotics, resulting in a world where "Navis" are ubiquitous.

I wonder, what if our early software engineers focused on bringing natural voice control to CLIs, before perfecting GUIs first?

amelius · 3 years ago
> There /is/ power to-be-had

This is not power. This is just first-world problems.

bogdanstanciu · 3 years ago
I think these assistants just need to give the user a way to edit interpretations.

A 'debug' area that lets you ask a command, see what was interpreted - and immediately edit or click "that's not what I wanted". But not an afterthought and not a cumbersome process like setting up an automation that is triggered by specific commands.

Imagine telling your voice assistant "You're wrong, as usual" and instead of giving you the boilerplate "I'm sorry", it actually offered a way to improve itself.

iquerno · 3 years ago
I would think that a good command-line is one that responds to me within milliseconds on a crapbox i386 machine, and that I can COMMAND to do what I want. A good command-line is not a binary blob that cannot parse simple instructions correctly.

At the same time, siri seems to be getting slower and fatter every iteration so perhaps it is becoming more human ;)

sokoloff · 3 years ago
> "hey siri, create alarms every 5 minutes starting at 6am tomorrow"

“OK, I’ve created an infinite number of alarms, every five minutes, starting at 6 AM tomorrow!”

(As a native English speaker, I'm not sure what specific outcome you want to happen from that request. That's the one that makes the most sense.)

ghaff · 3 years ago
As a native English speaker, that seems a profoundly odd request but that is what you asked for.

And you now have me wondering how open-ended calendar requests are actually implemented given that they can't literally have entries out to infinity. (I assume they go out some finite period and some background process periodically re-populates future entries.)
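That rolling-window guess can be sketched in a few lines. This is purely hypothetical (no vendor's actual implementation is known here): materialize the open-ended recurrence only out to a finite horizon, and let a background job re-run it so the window keeps sliding forward.

```python
from datetime import datetime, timedelta

def materialize(start, interval, now, horizon=timedelta(days=30)):
    """Expand an open-ended recurrence into concrete entries,
    but only out to a finite horizon past 'now'."""
    entries = []
    t, limit = start, now + horizon
    while t <= limit:
        entries.append(t)
        t += interval
    return entries

# A background job would re-run this periodically, so the stored
# window always extends e.g. ~30 days past the current time.
alarms = materialize(start=datetime(2022, 6, 1, 6, 0),
                     interval=timedelta(minutes=5),
                     now=datetime(2022, 6, 1, 0, 0),
                     horizon=timedelta(hours=7))
```

The trade-off is the one you'd expect: a longer horizon means more storage and churn, a shorter one means the background re-population job has to run more reliably.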

1MachineElf · 3 years ago
Another pitfall of most voice assistants is that they are really designed first with the corporation in mind rather than the user. Most are proxies for surveillance, advertising, or are just steering consumers back to a preferred set of walled-garden services.
PhasmaFelis · 3 years ago
Yeah, the whole idea has a lot of potential that seems like it should be within reach, but somehow it's 2022 and my phone still can't handle "hey Google, play my driving playlist on Spotify."
freeone3000 · 3 years ago
Your queries continue to be money-sinks -- even in your ideal case, you aren't buying anything! This query costs them money but earns them nothing. This is useless.
gernb · 3 years ago
> an assistant that's integrated into the OS and can change any setting

That sounds like a security nightmare. Someone walks by and starts changing your system settings? No thank you

Eleison23 · 3 years ago
Me and voice assistants are like me on the ballroom dance floor. I loved to take the lessons and learn all sorts of moves and chain them all together and look impressive, but when I got onto the floor with a partner, I just wouldn't know what to do or where to start. I kept to the "basic" steps and maybe a timid little turn once in a while.

Maybe it's possible to learn a working vocabulary and know how to command a voice assistant. I know my way around several command lines, but I have no idea what to say to Hey Google.

Avicebron · 3 years ago
it almost sounds like you are describing how it feels to learn a new language. And if that's the case and people need to learn "voice assistant" to communicate with their device effectively, hasn't it utterly failed as a natural language processor?

Also I know this is true in other domains as well, obviously there is a common "google-ese" that people learn to narrow down their searches.

samwillis · 3 years ago
I think it's fairly clear now that the only time a voice-based UI is better is when the user is unable to use their hands. Driving, or in the kitchen when cooking, seem to be the most successful. There are barely any other strong use cases.

On top of that, the general distrust of the privacy of these systems has stopped a significant number of people (myself included) from wanting to use them at all. I don't have an in-home device, and have turned off Siri on my Apple devices.

algesten · 3 years ago
And then it typically interacts and fails without feedback. I've tried so many times to tell Siri "Send a message to x that I'm 10 mins away", only to realize much later that "message delivery failed".

There's no clear feedback, and there are weird timing issues where it just stalls and shows the message it's about to send in case it got it wrong.

It's just a terrible UX all around.

causi · 3 years ago
I've had to stop using Google Assistant to send messages. It used to ask the user to choose the correct word when it misheard something. Now it just makes a wild-ass guess and sends it on. It's caused me to send some very odd messages to people and/or look like an idiot.
kevsim · 3 years ago
Totally agree. "Hey Siri, start a timer for X minutes" and "Hey Siri, play Y on Spotify" are the entire extent of my voice assistant interactions.
aflag · 3 years ago
And I’m always annoyed that it doesn’t tell me how long I set each timer to when I have multiple timers, just the remaining time, which makes it hard for me to know which is which.
nonanonymo · 3 years ago
I discovered recently that you can set a timer just by saying "Hey Siri, four minutes," or however long you want, and she will set a timer to that length. Not that I'm doing anything with the extra second I've saved myself, but it feels good anyway.
goosedragons · 3 years ago
I don't think that's true. It's way faster for things like smart lights and playing music. The problem is it's still so limited for other things. For example I can tell Alexa to play The Simpsons on Fire TV, it will do so on Disney+ but always the first episode even though the last watched one was in Season 15. It also can't seem to find my purchased episodes from iTunes (watchable with the Apple TV app on Fire TV). Simple searches also have a high chance of being misunderstood still with poor results.

I think if the accuracy was better and more content/things were available through voice it would be a pretty good input method for any scenario where you don't need visual feedback.

samwillis · 3 years ago
Selecting content via voice only works when you can either name exactly the content you want, or you are choosing a general category to play.

Browsing content, or looking it up, via voice is slow and painful, and it always will be. Who wants to be saying "next", "scroll down", or have a full-on conversation with an AI to try and work out what you want to play? The fingers on our hands have evolved to be incredible at interacting with things; we are good at using them. Touch screens and physical UIs will never be superseded by voice.

So yes, there is a small use case for voice for controlling music/TV, or controlling a few things in the home (heating, blinds, lights), but that's it. I don't believe there is this massive opportunity to expand it into our everyday lives where we are constantly interacting with devices via voice.

phh · 3 years ago
This is a real problem, though the problem is not technical, it's purely contract/legal: Disney+, AppleTV (all content providers actually) refuse to allow third parties to know what you've watched, because viewer data is closely protected

That's why an open-source, ToS-violating assistant has a chance to work better than the legal ones: it can just scrape all that info off the internet. But once you go into that grey area, you might as well be pirating the content already.

red_Seashell_32 · 3 years ago
To be fair, that’s an issue with Siri integration, not Siri itself. Kinda sad that Apple’s own products don’t implement that properly :/
bryanrasmussen · 3 years ago
>the only time a voice based UI is better is when the user is unable to use their hands. Driving

my observation of people on the road has led me to conclude that Driving is an activity where people think they can do absolutely anything else while engaged in it.

bluGill · 3 years ago
When you put it like that you will get a lot of upvotes. However, for almost any specific activity you name, you will get a ton of downvotes, and probably someone saying no, that one is okay. (These days if you say "you can't drink alcohol while driving" you are safe; 30 years ago, if there had been an internet, people would have said you are wrong.)
dmitriid · 3 years ago
> Driving or in the kitchen when cooking seem the be the most successful.

Since the voice assistants are incredibly stupid I find it extremely stressful and distracting to ask them for anything while driving.

swores · 3 years ago
Figuring out what works well or not while driving isn't a great idea, but using the ones that work well seems fine for most people.

Saying "Hey Siri, text Fred <pause> I'm on my way but stuck in traffic, eta 4 o'clock" or something along those lines nearly always works fine for me and is no more distracting than having a conversation with somebody in the car with me. If Siri gets some of the message wrong I'll either send a new one using clearer speech or wait until I'm not driving to fix it if the mistake isn't important.

Sure, it would be possible to then allow myself to get distracted by focussing too much on some weird aspect of it, but equally it would be possible to get so emotional in a conversation with somebody sat next to you that you stop paying attention to the road. And we (most people at least) don't say "it's not safe to talk at all while driving", we just make sure not to go over that line of getting too distracted by the conversation.

enobrev · 3 years ago
Google (android auto) was significantly better at this early on than it is now. I used to be able to search random topics by voice while driving, and it would read me excerpts and results. I used it often. Now it's map-specific, messaging, or music-specific and nothing else.
wintermutestwin · 3 years ago
>Driving

While driving, I wanted to have Siri read a lengthy webpage to me. I pulled up the page, got in the car and asked Siri to "speak screen." Siri says it can't do that when I am driving! What idiot thought that was a necessary safety measure? What if I were the passenger?

Overall, I am stunned at how bad Siri is at things that don't even require AI. It's almost as if this insanely profitable company failed to invest a tiny bit of money into researching ways that people would like to use Siri.

JCharante · 3 years ago
> What if I were the passenger?

I often go places with my sister (she drives). Her car doesn't allow pairing or swapping bluetooth connections to the car's entertainment system while it's moving. If we want to switch to my phone we have to come to a complete stop.

gcanyon · 3 years ago
I disagree: a voice based UI is better IFF:

1. The command set is broad enough or user input is complex enough to make other UIs inefficient.

2. The voice UI is up to the task of correctly interpreting the voice input most of the time.

What "most of the time" means for the second item is somewhat personal and use-case specific.

For item 1, examples where voice is better right now, or could be with reasonable NLU improvements:

"Text my wife that I'll be there in five minutes."

"Get me driving directions to the nearest Indian restaurant with at least 3 stars on Yelp."

"Order six rolls of paper towels and a bottle of Windex from Walmart, delivered to my home address, for delivery by Saturday"

"Remind me tomorrow morning to review this web page"

"Create a shopping list with the items from this recipe"

"Create a basic presentation with one slide each for each entry in the table of contents for this book"

Voice can be better. As others have pointed out, as long as it's like playing Zork where half the time the response is "you can't do that" or "I don't understand", voice interfaces will continue to flounder.

thiht · 3 years ago
> Driving

Siri never triggers when I'm driving, it just doesn't hear me. I think it's because of the noise of the car or because of my music, but it doesn't work. I have to move my face closer to my phone so that it can hear me, but that's even more dangerous than using the controls.

Same when I'm in the shower and I ask it to change the music, it doesn't hear me, I have to shout and get angry every time.

balfirevic · 3 years ago
Is your phone mounted close to the air vent? Siri hears me perfectly in my car, even when the phone is in my pocket but it doesn't hear me at all when I mount the phone on the air vent.
college_physics · 3 years ago
> when the user is unable to use their hands

this is still potentially a huge domain. one could imagine a benign scenario where voice assistants enhance people's abilities to interact with each other (and digital devices) when a more potent UI is not within reach

privacy concerns (->controversial business models) and technical ability to deliver a desirable service (that people would pay for) might indeed prevent this vision from catching on in the short term

another factor that may complicate adoption might be just cultural / perceptions. It is a somewhat odd thing to be shouting at devices - especially in the presence of other people. User interfaces that interfere strongly with communication habits and behaviors established over millennia (see also wearing VR goggles) might have a harder time seeing adoption outside very specific scenarios

fxtentacle · 3 years ago
Fully agree on the privacy distrust.

BTW, another use case for speech recognition is when you're carrying a baby around.

newaccount74 · 3 years ago
I've gotten pretty good at doing chores one handed...
maweki · 3 years ago
I'd say my baby preferred me playing around with my phone than speaking up. Doubly so when they were carried around sleeping.
genocidicbunny · 3 years ago
Even while driving, it's useless past basic commands.

My most egregious example of this is that there's a grocery store near me that the Google Assistant is incapable of finding because of a few people in my contacts list. Whenever I try to ask it for directions to that store, it picks (at pretty much random) one of three of my contacts instead. This is despite the fact that the only thing those contacts' names have in common with the grocery store is that they all start with the same letter.

Basically, imagine asking for directions to Albertsons, and the assistant giving you directions to Andrew.

red_Seashell_32 · 3 years ago
Or in countries where we get snow during Autumn. Getting messages read out loud, and responding via speech-to-text, is great too.
aflag · 3 years ago
The main issue for me is that they are not stateful. Perhaps the main thing in the role of an assistant is to keep state. You want someone or something that understands you and what you want, so that you don't have to put too much thought into it.

If you tell it you want more coffee, it should know what you like and suggest a mixture of brands you bought before and new ones you may enjoy. If you tell it you're hungry, depending on the time of day it could suggest you some takeaway you've ordered previously or something else you may like. If you say the same some other time it may suggest recipes based on what you have at home or it may suggest nearby restaurants. It should keep track of your friends and otherwise and tell you when their birthdays are coming and it would be nice if it could even suggest some presents based on things you've told the assistant before, or their wishlist on amazon or something else.

There are a lot of things assistants could do, but it needs to know you. The model where everyone has the same assistant doesn't quite work out.

somehnacct3757 · 3 years ago
Complicating the situation is that I don't trust any of the companies making virtual assistants with this level of personal data; so the first thing I do on a new device is block the assistant's access to my location or any other behavior learning functionalities.
boplicity · 3 years ago
I think that lack of trust is the single biggest factor holding back this technology.

I imagine that a truly top-notch virtual assistant would always be listening and aware of your behavior and context. However, the level of trust required for such monitoring is usually reserved only for one or two people in our lives, and even then, it's quite incomplete awareness on their part.

I don't know how a for-profit company can reconcile this disconnect, though I imagine that someone will eventually try.

rubyfan · 3 years ago
I believe tech companies are aware of the creepiness factor associated with surprising customers with too much context about them. I think they cripple some of the more context-aware features they could offer, because those draw a lot of attention to just how much data they have about you.
hadrien01 · 3 years ago
Cortana on Windows Phone had a "notebook" of everything it had learned and you could modify it to your liking. For example it detected my home and work locations and transit hours and displayed every weekday at 17h15 bus information before I left, and I could modify that info if it was wrong.
aflag · 3 years ago
Google also identifies commute, it’s actually a maps feature and not the assistant’s, though.
est · 3 years ago
Yeah, I wanted something like "voice notes". For example, I give my kids credit scores to see who earns the most each week; I want my smart speaker to have a simple voice-activated k-v store ready that can increase, decrease, get, and reset.
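The store being asked for here is tiny to model: a counter map with increase/decrease/get/reset. A hypothetical sketch of what such a "voice notes" skill could sit on top of (the class and method names are made up for illustration):

```python
class VoiceCounters:
    """Minimal key-value counter store for a hypothetical voice
    skill: 'add 5 points for Alice', 'reset Bob', 'what's Alice at?'"""

    def __init__(self):
        self._counts = {}

    def increase(self, key, amount=1):
        # Missing keys start at zero, so "add 5 for Alice" just works.
        self._counts[key] = self._counts.get(key, 0) + amount
        return self._counts[key]

    def decrease(self, key, amount=1):
        return self.increase(key, -amount)

    def get(self, key):
        return self._counts.get(key, 0)

    def reset(self, key):
        self._counts[key] = 0

scores = VoiceCounters()
scores.increase("Alice", 5)
scores.decrease("Alice", 2)
```

The hard part isn't the store, it's the platform exposing a generic "remember this number under this name" intent instead of requiring a bespoke skill per use case.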
gw98 · 3 years ago
It's useful for trivial unambiguous tasks where you have your hands full or don't want to touch your device or it's dangerous to. That's all I can muster mine for.

"Hey Siri, add more toilet paper to the shopping list" (while pooping)

"Hey Siri, shuffle my music" (while driving)

"Hey Siri, countdown 10 minutes" (while shoving a pizza in the oven)

Anything else is a shit show. Anything where trust or accuracy is involved i.e. mutating data, spending money, absolutely no way can I trust it at all and never will.

ciaron · 3 years ago
Agreed, but I find for even these simple tasks it's hit-and-miss for accuracy. My Google device will randomly not know what a "shopping list" is, or the interactions go something like this:

"Hey Google, put dishwasher salt on the shopping list" "OK, I added 'put dishwasher salt'" (strangely, this particular bug only manifests for dishwasher salt).

Timers are useful, but sometimes they can't be shut off by voice command.

gw98 · 3 years ago
Yeah it doesn't always work well. I say "hey siri add green milk to the shopping list". I want "green milk" added to the shopping list which in the UK is semi-skimmed milk. What does it do? Adds "green" and "milk" because it thinks I'm a weed smoker...
sanitycheck · 3 years ago
Trust and accuracy is involved in the first and last of your examples - I'd end up having to check that the TP was actually added to the list, and that the timer had actually begun and was set to 10 mins.

Shuffling music, turning lights on, yes fine - because confirmation that the right thing has happened is instant and effortless. Anything else, I'll use a button or a screen.

gw98 · 3 years ago
Definitely agree with this. You get that confirmation with siri. I mostly use my watch for it and it will show me what it did on the screen without having to touch anything.

Confirmation is required when dealing with humans as well ... https://www.youtube.com/watch?v=11fCIGcCa9c (this reminds me of Alexa)

eternityforest · 3 years ago
Google is pretty good about that. It will say "Ok, your alarm is set for 7 hours and 40 minutes from now" and similar.
muspimerol · 3 years ago
Not really - adding toilet paper to a shopping list is not clicking the "buy" button. And if you set up a timer you get quick confirmation that it has been set. If the timer is accidentally set for 100 mins it's easily corrected.

Deleted Comment

jayelbe · 3 years ago
Mmhmm, I never handle my phone while pooping, no siree.
Shank · 3 years ago
> "Hey Siri, add more toilet paper to the shopping list" (while pooping)

This is the main reason why I have an Echo in my bathroom! The one advantage Alexa has over everything else is that you can voice shop -- "alexa buy more toilet paper" solves the problem that much faster than a reminder for later.

gw98 · 3 years ago
I don't want that to happen because the price variation in toilet paper is huge based on deals and offers available, and Amazon is rarely the cheapest provider these days, so it's actually worth me spending a few minutes on it to save some money.

The reason Alexa exists is to sell you Amazon's prices, not necessarily a good deal.

sambeau · 3 years ago
Lights on lights off is also useful, especially when in bed or carrying a basket-full of washing.
ageitgey · 3 years ago
They still just don't work very well unless you memorize very specific exact commands.

The other day, we had to remember to book a school thing for the kid. I said "Hey Google, set a reminder for 9pm to [book the thing]".

Google replied "Here are web search results for set a reminder...."

When they fail constantly at the most basic tasks, usage is going to drop way off.

hengheng · 3 years ago
They also refuse to ever show a reference so that you can learn what you can say, or teach you any other way. Refusing to show and teach your users feels arrogant to me. They had to promise that you can "just speak naturally" and now they can't roll it back.

Meanwhile my success rate for the way I speak is below 20%.

willhinsa · 3 years ago
they do chime in and tell you all sorts of things "by the way, i can also play music while you're getting ready for bed" while i'm trying to concentrate and live my life lol.

just compile a list of all commands into an email and send that to me. i don't actually want to hear you talk, siri

thrwawy74 · 3 years ago
What annoys me the most is this:

"Hey Google, turn off the den light switch in 30 minutes"

"Sorry, for safety reasons we cannot...."

It's a light. Because it heard "switch" it thinks there might be some power tool connected to it and won't let me set delayed actions. I want to be able to say something like "Hey Google, turn the lights on when the sun comes up every day", but no one has gone to that next step.

Or how about "Hey Google, turn off the tone played when you say Hey Google". These settings aren't accessible from the voice interface itself.

Can't wait for Alexa to fail so my SmartTV will stop nagging me to use integrations I will never use. Anti-competitive but whatevs.

twelvedogs · 3 years ago
some of the stuff Google doesn't put effort into is super weird, like how if the device is in a different "room" it has to loudly announce what it's doing, but if it's in the same room it turns down the volume on all of the devices for 10 seconds

really makes me think they almost want to discontinue it but it's so integrated with android and the chromecast they can't really kill it

joncrocks · 3 years ago
I think you can set 'routines' now, might be able to help with sunrise actions - https://support.google.com/assistant/answer/7672035?hl=en-GB...
thrwawy74 · 3 years ago
I have explored those, but I ran into many, many issues when I wanted to build routines of routines. I wanted to compose a routine from others, but I couldn't simply/directly invoke a routine from another without voice interpretation, and it led to strangely phrased triggers just to get it to work. It's frustrating that I should be able to make these explicit actions in the app, but it's just very clumsy. In the ideal scenario I should be able to configure all of that through the voice interface itself: "Hey Google, create a routine that initiates from 'It's dinner time' that turns on all the lights in the dining room and sets them to 50% with the 'candlelight' color."

I hate how many products need to be configured from apps nowadays. You gave me the voice assistant and I can't use the voice assistant to manage the voice assistant. I shouldn't even need an iPad or phone to set it up. Give me a minimum local processing/interpretation so I can get it online and it can begin interpreting more articulate requests.

We are so far away from zen.
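Routines-of-routines is just function composition, which makes it all the stranger that it's hard to express in the app. A hypothetical sketch (nothing here reflects Google Home's actual internals) of how trivially routines compose once they're treated as callables:

```python
log = []  # stands in for actually driving smart-home devices

def set_lights(room, level, color):
    # Returns a step; calling the step performs the action.
    def step():
        log.append(f"{room}: lights {level}% {color}")
    return step

def routine(*steps):
    # A routine is a sequence of callables. Because the routine
    # itself is callable, routines nest inside other routines with
    # no voice round-trip and no strangely phrased trigger.
    def run():
        for step in steps:
            step()
    return run

dinner_time = routine(set_lights("dining room", 50, "candlelight"))
movie_night = routine(dinner_time, set_lights("hallway", 20, "warm white"))
movie_night()
```

The gap the comment describes is purely in the product surface: the composition model is simple, but the app only exposes voice phrases as triggers, not routines themselves.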

thethethethe · 3 years ago
You can change the device type in Google home to work around the scheduling issue
ggm · 3 years ago
Voice assistants are shit. The number of times my friends have got alexa to turn the light on first time is functionally zero.

And, they don't really explain the syntax constraints. Which are massive.

Try, ab initio and without knowing how to do it, getting OK Google to open an arbitrary Google-authored app and direct it to do something. Compare that to learning the OS UI keyboard shortcuts or AppleScript (which, btw, like Windows is basically fully documented, because all the libraries self-document their call structures).

The voice interfaces are universally badly designed because spoken command sentences are not well understood as a modality of command, distinct from mouse, gesture, touch or keyboard.

Until voice is baked in with a documented syntax in "man" format, I won't believe it's first class.

How do I even know for any arbitrary app what voice directives it uses? How do they correlate to any other command input? How consistent is this with other commands in other apps? Does "stop now" always mean the same thing between a mapping routing app, and a tape backup app? Isn't "stop" contextually defined in a way ^C isn't?
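A "man page for voice" could start with apps shipping a machine-readable manifest of their directives, including which global commands they shadow. This format is entirely made up for illustration; no platform ships anything like it:

```python
# Hypothetical per-app voice manifest. Each directive declares its
# phrasing, what it maps to, and any global command it shadows,
# so the "stop now" ambiguity is at least documented.
MANIFEST = {
    "app": "tape-backup",
    "directives": [
        {"phrase": "stop now",
         "maps_to": "cancel_backup",        # app-local meaning
         "shadows_global": "media.pause"},  # conflict made explicit
        {"phrase": "back up {folder}",
         "maps_to": "start_backup",
         "params": ["folder"]},
    ],
}

def describe(manifest):
    """Render a man-style listing so users can discover the syntax."""
    lines = [f"VOICE DIRECTIVES for {manifest['app']}"]
    for d in manifest["directives"]:
        lines.append(f"  \"{d['phrase']}\" -> {d['maps_to']}")
    return "\n".join(lines)
```

With something like this, the assistant could answer "what can I say to this app?" the same way a shell answers `man` or `--help`, instead of leaving the grammar undiscoverable.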

eternityforest · 3 years ago
I bet you could sell a lot of fully offline voice assistants that just did timers and maybe reminders that sync to the phone or other smart speakers with bluetooth. No privacy concern if there's no WiFi at all!

I'll stick to Google assistant for the extra occasionally useful stuff, but the idea of a device that won't stop working if the server goes away is pretty cool.

dmitriid · 3 years ago
> The number of times my friends have got alexa to turn the light on first time is functionally zero.

I have a room called "Study" and the only lights there are called "Main Lights".

- Hey, Siri, study lights 100%

- Did you mean these lights: "Study: Main Lights"

:facepalm:

r0m4n0 · 3 years ago
I don’t disagree with the stupidity sometimes, but I have probably upwards of 50 lightbulbs/plugs connected to my Echo speakers and use them exclusively for turning on my lamps, dimming lights, etc., so I can say confidently: it makes mistakes sometimes, but once you learn the quirks and set things up with proper names it’s actually pretty impressive. And you can learn the oddities, you just have to be dedicated, sorta like learning anything else.

I’ve learned that I can’t say “what’s the temperature” because it will tell me the temperature my Dyson fan is set to lol. It’s not wrong, it’s just missing some context (me holding my jacket wondering if I should wear it) that I can provide going forward. Maybe I should just ask it “should I wear my jacket today”

ggm · 3 years ago
Yeah, a single-member set shouldn't require full enumeration; it's detectably the only operating device in context.

If it was the only "main light" across all rooms I'd hesitate to say the same mind you, nesting scopes should be respected.

What would you want it to do when you add a second light bulb? Tell you the old command is no longer unique or turn on both?

zamalek · 3 years ago
> The number of times my friends have got alexa to turn the light on first time is functionally zero.

Do your friends have accents? Alexa frequently fucks up due to my South African/Zimbabwean accent. I joke that Alexa is racist. Google seems to handle it way better. Here's hoping that Mycroft does better (which I plan to switch to in the coming months).

andrewstuart · 3 years ago
They managed to control the tech so tightly that no potential is fulfilled and they’re boring.

Truly opened to developers, they could have been really interesting and fun.

This is what happens when big companies develop technologies and think they are too valuable to share.

lixtra · 3 years ago
This also pissed me off. Why not start the day with a weather report, short news, and some trivia? When I checked a long time ago, I didn’t find any way to integrate stuff like that together.
laidoffamazon · 3 years ago
It's possible to do that with a morning routine in Alexa. I know this because it tells me the weather (and other things) after I kill my alarm in the morning.
stirlo · 3 years ago
If you check again now, this is an out-of-the-box feature on Google Assistant. You just say “good morning” and it tells you the time, weather, news, and more.