This clearly elucidated a number of things I've tried to explain to people who are so excited about "conversations" with computers. The example I've used (with varying levels of effectiveness) was to get someone to think about driving their car by only talking to it. Not a self-driving car that does the driving for you, but telling it things like: turn, accelerate, stop, slow down, speed up, put on the blinker, turn off the blinker, etc. It would be annoying and painful and you couldn't talk to your passenger while you were "driving" because that might make the car do something weird. My point, and I think it was the author's as well, is that you aren't "conversing" with your computer, you are making it do what you want. There are simpler, faster, and more effective ways to do that than to talk at it with natural language.
I had the same thoughts on conversational interfaces [1]. Humane AI failed not just because of terrible execution; the whole assumption that voice is a superior interface (and the attempt to invent something beyond smartphones) was flawed.
[1]: https://shubhamjain.co/2024/04/16/voice-is-bad-ui/
> Theoretically, saying, “order an Uber to airport” seems like the easiest way to accomplish the task. But is it? What kind of Uber? UberXL, UberGo? There’s a 1.5x surge pricing. Acceptable? Is the pickup point correct? What would be easier, resolving each of those queries through a computer asking questions, or taking a quick look yourself on the app?
> Another example is food ordering. What would you prefer, going through the menu from tens of restaurants yourself or constantly nudging the AI for the desired option? Technological improvement can only help so much here since users themselves don’t clearly know what they want.
And 10x worse than that is booking a flight: I found one that fits your budget, but it leaves at midnight, or requires an extra stop, or is on an airline for which you don't collect frequent flyer miles, or arrives at a secondary airport in the same city, or it only has a middle seat available.
How many of these inconveniences will you put up with? Any of them, all of them? What price difference makes it worthwhile? What if by traveling a day earlier you save enough money to even pay for a hotel...?
All of that is for just 1 flight; what if there are several alternatives? I can't imagine having a dialogue about this with a computer.
Why couldn't the interface ask you about your preferences? Because instead, what we have right now are clunky web interfaces that cram every choice onto the small screen in front of you, leaving you to figure out how the options actually differ and to sort out for yourself how to make things work.
Of course a conversational interface is useless if it tries to just do the same thing as a web UI, which is why it failed a decade ago when it was trendy: the tech was nowhere near clever enough to make it useful. But today, I'd bet the other way round.
And then there is the fact that voice isn't the dominant mode of expression for all people. Some are predominantly visual thinkers, some are analytic and slow to decide, while some prefer to use their hands and so on.
I guess there's just no substitute for someone actually doing the work of figuring out the most appropriate HMI for a given task or situation, be it voice controls, touch screens, physical buttons or something else.
> I had the same thoughts on conversational interfaces [1]. Humane AI failed not just because of terrible execution; the whole assumption that voice is a superior interface (and the attempt to invent something beyond smartphones) was flawed.
Amen to that. I guess it would help to get off the IT high horse and have a talk with linguists and philosophers of language. They have been dealing with this shit for centuries now.
You're onto something. We've learned to make computers and electronic devices feel like extensions of ourselves. We move our bodies and they do what we expect. Having to switch now to using our voice breaks that connection. It's no longer an extension of ourselves but a thing we interact with.
Two key things that make computers useful, specificity and exactitude, are thrown out of the window by interposing NLP between the person and the computer.
Yeah, it comes and goes in games for a reason. If it's not already some sort of social game, then the time to speak an answer is always slower than 3 button presses to select a pre-canned answer. Navigating a menu with Kinect voice commands will often be slower than a decent interface a user clicks through.
Voice interface only prevails in situations with hundreds of choices, and even then it's probably easier to use voice to filter down choices rather than select. But very few games have such scale to worry about (certainly no AAA game as of now).
> Even in a car, being able to control the windscreen wipers, radio, ask how much fuel is left are all tasks it would be useful to do conversationally.
are you REALLY sure you want that?
how much fuel is left is a quick glance at the dash, and you can control the radio volume precisely without even looking.
'turn up the volume', 'turn down the volume a little bit', 'a bit more',...
and then a radio ad going 'get yourself a 3 pack of the new magic wipers...' and car wipers going off.
Agree. Not all systems require convo mode.
I personally find Chat/Convo/IVR type interfaces slow/tedious.
Keyboard/Mouse ftw.
However, a CEO using Power BI with Convo can get more insights/graphs rather than slicing and dicing his data himself. The dashboards do have fixed metrics, but it helps in case they want something not already displayed.
Honestly that just says that the interface is too low level. Telling a car to drive you to some place and make it fast is how we interact with taxi drivers. It works fine as a concept, it just needs a higher level of abstraction that isn't there yet.
This only works for tasks where the details of execution are not important. Driving fits that category well, but many other tasks we're throwing at AI don't.
> you couldn't talk to your passenger while you were "driving" because that might make the car do something weird.
This even happens while walking my dog. If my wife messages me, my iPhone reads it out, and if at the same time I'm trying to cross a road, she'll get a garbled reply which is just me shouting random words at my dog to keep her under control.
I think a lot of these "voice assistant" systems are envisioned and pushed by senior leadership in companies like SVPs and VPs. They're the ones who make the decision to invest in products like this. Why do they think these products make sense? Because they themselves have personal assistants and nannies and chauffeurs and private chefs, and voice is their primary interface to these people. It makes sense that people who spend all their time vocally telling others to do work, think that voice is a good interface for regular people to tell their computers to do work.
An empirical example would be Amazon's utter failure at making voice shopping a thing with the Echo. There were always a number of obvious flaws with the idea. There's no way to compare purchase options, check reviews, view images, or just scan a bunch of info at once with your eyeballs at 100x the information bandwidth of a computer generated voice talking to you.
Even for straightforward purchases, how many people trust Amazon to find and pick the best deal for them? Even if Amazon started out being diligent and honest it would never last if voice ordering became popular. There's no way that company would pass up a wildly profitable opportunity to rip people off in an opaque way by selecting higher margin options.
If the driver could queue actions, it would make chat-interfaced driving easier, since the desired actions could be staged and then triggered by a single button press rather than needing a dedicated button designed by an engineer and built into the car at the factory.
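A rough sketch of that queueing idea (everything below is hypothetical, not any real vehicle API): the conversational layer only stages actions, and nothing happens until a physical confirm button flushes the queue.

```python
# Hypothetical sketch (not any real vehicle API): spoken/chat commands are
# only staged, never executed directly; a single physical "confirm" button
# flushes the queue, and a "cancel" press drops it.

class ActionQueue:
    def __init__(self):
        self.pending = []  # actions parsed from the conversational layer

    def stage(self, action, **params):
        """Called by the chat/voice layer; nothing is executed yet."""
        self.pending.append((action, params))
        print(f"staged: {action} {params}")

    def confirm(self):
        """Bound to a physical button; executes everything staged, in order."""
        for action, params in self.pending:
            execute(action, **params)
        self.pending.clear()

    def cancel(self):
        """Second button (or saying 'never mind') discards staged actions."""
        self.pending.clear()


def execute(action, **params):
    # Stand-in for the real control layer.
    print(f"executing: {action} {params}")


queue = ActionQueue()
queue.stage("set_wipers", speed="intermittent")
queue.stage("tune_radio", station="101.5 FM")
queue.confirm()  # driver presses the button when ready
```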
1. "Natural language is a data transfer mechanism"
2. "Data transfer mechanisms have two critical factors: speed and lossiness"
3. "Natural language has neither"
While a conversational interface does transfer information, its main qualities are what I always refer to as "blissful ignorance" and "intelligent interpretation".
Blissful ignorance allows the requester to state an objective while not being required to know, or even be right about, how to achieve it. It is the opposite of an operational command. Do as I mean, not as I say.
"Intelligent Interpretation" allows the receiver the freedom to infer an intention in the communication rather than a command. It also allows for contextual interactions such as goal oriented partial clarification and elaboration.
The more capable of intelligent interpretation the request execution system is, the more appropriate a conversational interface will be.
Think of it as managing a team. If they are junior, inexperienced and not very bright, you will probably tend towards handholding, microtasking and micromanagement to get things done. If you have a team of senior, experienced and bright engineers, you can with a few words point out a desire and trust them to ask for information when there is relevant ambiguity, and expect a good outcome without having to detail-manage every minute of their days.
> If you have a team of senior, experienced and bright engineers, you can with a few words point out a desire and trust them to ask for information when there is relevant ambiguity, and expect a good outcome
It's such a fallacy. First thing an experienced and bright engineer will tell you is to leave the premises with your "few words about a desire" and not return without actual specs and requirements formalized in some way. If you do not understand what you want yourself, it means hours/days/weeks/months/literally years of back and forths and broken solutions and wasted time, because natural language is slow and lossy af (the article hits the nail on the head on this one).
Re "ask for information", my favorite example is when you say one thing if I ask you today and then something else (maybe the opposite; it has happened) if I ask you a week later, because you forgot or just changed your mind. I bet a conversational interface will deal with this just fine /s
> First thing an experienced and bright engineer will tell you is to leave the premises with your "few words about a desire" and not return without actual specs and requirements formalized in some way.
No, that's what a junior engineer will do. The first thing that an experienced and bright senior engineer will do is think over the request and ask clarifying questions in pursuit of a more rigorous specification, then repeat back their understanding of the problem and their plan. If they're very bright they'll get the plan down in writing so we stay on the same page.
The primary job of a senior engineer is not to turn formal specifications into code, it's to turn vague business requests into formal specifications. They're senior enough to recognize that that's the actually difficult part of the work, the thing that keeps them employed.
I do understand that in bad cases it can be very frustrating as an engineer to chase vague statements only to be told later 'nah, that was not what I meant'. This is especially true when the gap in both directions is very large or there is incompetence and/or even adversarial stances between the parties. Language and communication only work if both parties are willing to understand.
Unfortunately, if either is the case, "actual specs and requirements formalized", while sounding logical and possibly helpful, in my experience did very little to save any substantial project (and I've seen a lot). The common problem is that the business/client/manager is forced to sign off on formal documents far outside their domain of competence, or the engineers are straitjacketed into commitments that do not make sense, or have no idea of what is considered tacit knowledge in the domain and so can't contextualize the unstated. Those formalized documents then mostly become weaponized in a mutually destructive CYA.
What I've also seen more than once is years of formalized specs and requirements work while nothing ever gets produced, and the project is aborted before even the first line of code hits test.
I've given this example before: when Covid lockdowns hit, there were digitization projects years in planning and budgeted for years of implementation that were hastily specced, coded, and rolled out into production by a 3-person emergency team over a long weekend. Necessity apparently has a way of cutting through the BS like nothing else can.
You need both sides capable, willing and able to understand. If not, good luck mitigating, but you're probably doomed either way.
Star Trek continues to be prescient. It not only introduced the conversational interface to the masses, it also nailed its proper uses in ways we're still (re)discovering now.
If you pay attention to how the voice interface is used in Star Trek (TNG and upwards), it's basically exactly what the article is saying - it complements manual inputs and works as a secondary channel. Nobody is trying to manually navigate the ship by voicing out specific control inputs, or in the midst of a battle, call out "computer, fire photon torpedoes" - that's what the consoles are for (and there are consoles everywhere). Voice interface is secondary - used for delegation, queries (that may be faster to say than type), casual location-independent use (lights, music; they didn't think of kitchen timers, though (then again, replicators)), brainstorming, etc.
Yes, this is a fictional show and the real reason for voice interactions was to make it a form of exposition, yadda yadda - but I'd like to think that all those people writing the script, testing it, acting and shooting it, were in perfect position to tell which voice interactions made sense and which didn't: they'd know what feels awkward or nonsensical when acting, or what comes off this way when watching it later.
I have similar thoughts on LCARS: the Doylist requirements for displays that are bold enough and large enough to feel meaningful even when viewed on a 1990s-era TV are also the requirements for real-life public information displays.
At first glance it feels like real life will not benefit from labelling 90% of the glowing rectangles with numbers as the show does, but second thoughts say spreadsheets and timetables.
Yeah, this and I think even weapons control, happened on the show. But the scenario for these cases is when the bridge is understaffed for episode-specific plot reasons, and the protagonist has to simultaneously operate systems usually handled by distinct stations. That's when you get an officer e.g. piloting the shuttle/runabout while barking out commands to manage power flow, or voice-ordering evasions while manually operating weapons, etc.
(Also worth noting is that "pre-programmed evasion patterns" are used in normal circumstances, too. "Evasive maneuver JohnDoe Alpha Three" works just as well when spoken to the helm officer as to a computer. I still don't know whether such preprogrammed maneuvers make sense in real-life setting, though.)
There was an episode where Beverly Crusher was alone on the ship, and controlled everything just by talking to the computer. I wondered why there is a bridge, much less a bridge crew. But yes it makes sense to use higher bandwidth control systems when possible.
If that was the episode where the crew disappeared with nobody else but her noticing, it doesn't really count because she was trapped in a Negative Space Wedgie pocket dimension based on her own thoughts at the time she was trapped.
Star Trek's crews overall are chosen in a way that seems to consider redundancies, as well as meshing as a team that can offer varying viewpoints.
It runs directly counter to the more capitalistic mindset of "why don't we do more with less?": when spending years navigating all kinds of unknown situations, you want as many options as possible available.
Completely agree, voice UI is best as an augmentation of our current HCI patterns with keyboard/mouse. I think one of the reasons is that our brains kind of have separate buffers for visual memory and aural memory (Baddeley's working memory model). Most computer use takes up the visual buffer, and our aural buffer has extra bandwidth. This also means we can do things aurally while still maintaining focus/attention on what we're doing visually, allowing a kind of multitasking.
One thing I will note is that I'm not sure I buy the example for voice UIs being inefficient. I've almost never said "Alexa what's the weather like in Toronto?". I just say "Alexa, weather". And that's much faster than taking my phone out and opening an app. I don't think we need to compress voice input. Language kind of auto-compresses, since we create new words for complex concepts when we find the need.
For example, in a book club we recently read "As Long as the Lemon Trees Grow". We almost immediately stopped referring to it by the full title, and instead just called it "lemons" because we had to refer to it so much. Eg "Did you finish lemons yet?" or "This book is almost as good as lemons!". The context let us shorten the word. Similarly, the context of my location shortens the whole query to just "weather". I think this might be the way voice UIs can be made more efficient: in the same way human speech makes itself more efficient.
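A minimal sketch of that kind of context-driven compression (the names are hypothetical, not Alexa's actual skill model): the terse utterance only works because the assistant fills in the missing pieces from stored context.

```python
# Hypothetical sketch: a terse utterance like "weather" is expanded using
# context the assistant already holds (location, recently discussed items).

context = {
    "location": "Toronto",  # e.g. known from the device's registered address
    "recent_book": "As Long as the Lemon Trees Grow",
}

def resolve(utterance: str) -> str:
    """Expand a compressed utterance into a full query using stored context."""
    u = utterance.lower().strip()
    if u == "weather":
        return f"weather forecast for {context['location']}"
    if u == "lemons":  # nickname established by repeated use
        return f"the book {context['recent_book']}"
    return utterance  # no known compression; pass through unchanged

print(resolve("weather"))  # -> weather forecast for Toronto
print(resolve("lemons"))   # -> the book As Long as the Lemon Trees Grow
```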
> This also means we can do things aurally while still maintaining focus/attention on what we're doing visually, allowing a kind of multitasking.
Maybe you, but I most definitely cannot focus on different things aurally and visually. I never successfully listened to something in the background while doing something else. I can't even talk properly if I'm typing something on a computer.
Yup, we are all different. I require auditory stimulation to work at my peak.
I did horribly in school, but once I was in an environment where I could have some kind of background audio/video playing I began to excel. It also helps me sleep at night. It's like the audio keeps the portion of me that would otherwise distract me occupied.
Or to clarify, I don't think one can be in deep flow eg programming and simultaneously in deep flow having an aural conversation; we're human, we can't truly multitask. But I do think that if you're focusing on something using your computer, it's _less_ disruptive to eg say "Alexa remind me in twenty minutes to take out the trash" than it is to stop what you're doing and put that in an app on your computer.
The multitasking is something I like about smart home speakers. I can be asking it to turn the lights on/off or check the temperature, while doing other things physically and not interrupting them, often while walking through the room. Even if voice commands are slower, they don't interrupt other processing nearly as much as having to visually devote attention and fine motor skills, and navigate to the right screen in an app to do what you want.
I feel like the people using Voice Attack or whatever in space sims zeroed in on this.
It's very useful being able to request auxiliary functions without losing your focus, and I think that would apply to, say, word editing as well - e.g. being able to say "insert a date here" rather than having to get into the menus to find it.
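As a toy illustration of what such a "verbal hotkey" layer might look like inside an editor (the phrase list and the editor hook are made up for the example):

```python
# Toy sketch: a small registry of spoken phrases mapped to editor actions,
# so auxiliary commands can be fired without leaving the current focus.
# The phrases and the insert_text() hook are invented for illustration.
import datetime

def insert_text(text: str):
    # Stand-in for the editor's real "insert at cursor" hook.
    print(f"[editor] insert: {text}")

VOICE_HOTKEYS = {
    "insert a date here": lambda: insert_text(datetime.date.today().isoformat()),
    "new bullet point": lambda: insert_text("\n- "),
}

def on_utterance(utterance: str):
    action = VOICE_HOTKEYS.get(utterance.lower().strip())
    if action:
        action()  # fire the auxiliary command; the user never touches a menu
    else:
        print(f"(no hotkey for: {utterance!r})")

on_utterance("Insert a date here")  # -> [editor] insert: <today's date>
```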
> The second thing we need to figure out is how we can compress voice input to make it faster to transmit. What’s the voice equivalent of a thumbs-up or a keyboard shortcut? Can I prompt Claude faster with simple sounds and whistles?
The number of times in the last few years I've wanted that level of "verbal hotkeys"... The latencies of many coding LLMs are still a little bit too high to allow for my ideal level of flow (though admittedly I haven't tried ones hosted on services like Groq), but I can clearly envision a time when I'm issuing tight commands to a coder model that's chatting with me and watching my program evolve on screen in real time.
On a somewhat related note to conversational interfaces, the other day I wanted to study some first aid stuff - I used Gemini to read the whole textbook and generate Anki flash cards, then copied and pasted the flashcards directly into ChatGPT voice mode and had it quiz me. That was probably the most miraculous experience of voice interface I've had in a long time - I could do chores while being constantly quizzed on what I wanted to learn, and anytime I had a question or comment I could just ask it to explain or expound on a term or tangent.
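For what it's worth, the flashcard-generation half of that workflow is easy to sketch; here ask_llm() is just a placeholder for whichever model/API you use, not the actual Gemini or ChatGPT call:

```python
# Rough sketch of the flashcard half of that workflow. ask_llm() is a
# placeholder for whichever model/API you actually use; it is not the
# real Gemini or ChatGPT interface.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model/API of choice here")

def make_flashcards(textbook_text: str) -> list[tuple[str, str]]:
    """Ask the model for question/answer pairs in an easily parsed format."""
    reply = ask_llm(
        "Turn the following first-aid text into flashcards. "
        "One card per line, formatted as 'question ;; answer'.\n\n"
        + textbook_text
    )
    cards = []
    for line in reply.splitlines():
        if ";;" in line:
            question, answer = line.split(";;", 1)
            cards.append((question.strip(), answer.strip()))
    return cards

# The resulting pairs can be exported to Anki, or pasted into a voice mode
# with an instruction like "quiz me on these, one card at a time".
```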
I worked like that for a year in uni because of RSI, and it's very easy to get voice strain if you use your voice for coding like that. Issuing many short commands is very tiring for the voice.
It's also hard to dictate code without a lot of these commands because it's very dense in information.
I hope something else will be the solution. Maybe LLMs being smart enough to guess the code out of a very short description and then a set of corrections.
Would be nice to be able to do something like write a function signature and then just say “fill out this function,” with it having the implicit needed context, as though it had been pairing with you all along and is just taking the wheel for a second. Or when you’ve finished writing a function, “test this function with some happy path inputs.” I feel like I’d appreciate that kind of use, which could integrate decently into the flow state I get into when programming. The current suite of tools for me often feels too clunky, with the need to explicitly manage context and queries: it takes me out of my flow state and feels slower than just doing it myself.
Oh wow. That video is 12 years old. Early in the presentation Travis reveals he used Dragon back then.
Do you recall Swype keyboard for Android? The one that popularized swyping to write on touch screens? It had Dragon at some point.
IT WAS AMAZING.
Around 12-14 years ago (Android 2.3? Maybe 3?) I was able to easily dictate full long text messages and emails, in my native tongue, including punctuation and occasional slang or even word formation. I could dictate a decent long paragraph of text on the first try and not have to fix a single character.
It's 2025 and the closest I can find is a dictation app on my newest phone that uses online AI service, yet it's still not that great when it comes to punctuation and requires me to spit the whole paragraph at once, without taking a breath.
Is there anything equally effective for any of you nowadays? That actually works across the whole device?
> It's 2025 and the closest I can find is a dictation app on my newest phone that uses online AI service, yet it's still not that great [...]
> Is there anything equally effective for any of you nowadays?
I'm not affiliated in any way. You might be interested in the "Futo Keyboard" and voice input apps - they run completely offline and respect your privacy.
The source code is open and it does a good job at punctuation without you needing to prompt it by saying, "comma," or, "question mark," unlike other voice input apps such as Google's gboard.
https://keyboard.futo.org/
>I admit that the title of this essay is a bit misleading (made you click though, didn’t it?). This isn’t really a case against conversational interfaces, it’s a case against zero-sum thinking.
No matter the intention or quality of the article, I do not like this kind of deceitful link-bait article. It may have higher quality than pure link-bait, but nobody likes to be deceived.
I did not find the article to be deceitful at all. He does make a case against overuse of conversational interfaces. The author is just humbly acknowledging his position is more nuanced than the title of article might suggest.
I simply saw that as tongue in cheek about how the author wanted to use a more general core point. The lens of conversational interfaces makes a good case for that while keeping true to the idea.
You can argue against something but also not think it's 100% useless.
There’s an interesting… paradox? Observation? That up until 20-30 years ago, humans were not computerized beings. I remember a thought leader at a company I worked at said that the future was wearable computing, a computer that disappears from your knowing and just integrates with your life. And that sounds great and human and has a very thought leadery sense of being forward thinking.
But I think it’s wrong? Ever since the invention of the television, we’ve been absolutely addicted to screens. Screens and remotes, and I think there’s something sort of anti-humanly human about it. Maybe we don’t want to be human? But people I think would generally much rather tap their thumb on the remote than talk to their tv, and a visual interface you hold in the palm of your hand is not going away any time soon.
I went through Waldorf education and although Rudolf Steiner is quite eccentric, one thing I think he was spot on about was WHEN you introduce technology. He believed that introducing technology or mechanized thinking too early in childhood would hinder imaginative, emotional, and spiritual development. He emphasized that children should engage primarily with natural materials, imaginative play, storytelling, artistic activities, and movement, as opposed to being exposed prematurely to mechanical devices or highly structured thinking. I seem to recall he recommended this till the age of 6.
My parents did this with me, no screens till 6 (wasn't so hard as I grew up in the early 90s, but still, no TV). I notice too how much people love screens, that non-judgmental glow of mental stimulation, it's wonderful, however I do think it's easier to "switch off" when you spent the first period of your life fully tuned in to the natural world. I hope folks are able to do this for their kids, it seems it would be quite difficult with all the noise in the world. Given it was hard for mine during the era of CRT and 4 channels, I have empathy for parents of today.
I will counter this by saying that my time spent with screens before 6 was unimaginably critical for me.
If I hadn't had it, I would have been trapped by the racist, religiously zealous, backwoods mentality that gripped the rest of my family and the majority of the people I grew up with. I discovered video games at age 3 and it changed EVERYTHING. It completely opened my mind to abstract thought and, among other things, influenced me to teach myself to read at age 3. I was reading at a collegiate level by age five and discovered another passion, books. Again, it propelled me out of an extremely anti-intellectual upbringing.
I simply could not imagine where I would be without video games, visual arts or books. Screens are not the problem. Absent parenting is the problem. Not teaching children the power of these screens is the problem.
I’ve been theory crafting around video games for children on the opposing premise. I think fundamentally the divide is on the quality of content — most games have some value to extract, but many are designed to be played inefficiently, and require far more time investment than value extracted.
Eg Minecraft, Roblox, CoD, Fortnite, Dota/LoL, the various mobile games clearly have some kind of value (mechanical skill, hand-eye coordination, creative modes, 3D space navigation / translation / rotation, numeric optimization, social interaction, etc), but they’re also designed as massive timesinks mostly through creative mode or multiplayer.
Games like Paper Mario, Pikmin, Star Control 2, Katamari Damacy, and Lego titles, however, are all children-playable but far more time efficient and, importantly, time-bounded for play. Even within timesink games there are higher quality options — you definitely get more, and faster, out of Satisfactory/Factorio than modded Minecraft. If you can push kids towards the higher quality, lower timesink games, I think it's worth it. Fail to do so and it's definitely not.
The same applies to TV, movies, books, etc. Any medium of entertainment has horrendous timesinks to avoid, and if you can do so, avoiding the medium altogether is definitely a missed opportunity. Screens are only notable in that the degenerate cases are far more degenerate than anything that came before them.
When I was teaching, I used to force students using laptops to sit near the back of the room for exactly this reason. It's almost impossible for humans to ignore a flickering screen.
Sensitivity to the stimuli behind the orienting impulse varies by individual, and I wish I were less sensitive on a daily basis.
These days screen brightness goes pretty high and it is unbelievable how many people seem to never use their screen (phone or laptop) on anything less than 100% brightness in any situation and are seemingly not bothered by flickering bright light or noise sources.
I am nostalgic about old laptops’ dim LCD screens that I saw a few times as a kid, they did not flicker much and had a narrow angle of view. I suspect they would even be fine in a darkened classroom.
Playing computer games since an early age made me who I am. It required learning English a decade earlier than my peers. It pulled me into programming around start of primary school. I wouldn’t be a staff engineer in a western country without these two.
Computers are tools, not people. They should be made easier to use as tools, not tried to be made people. I actually hate people, tools are much better.
It's no wonder extraverted normie and managerial types that get through their day by talking think throwing words at a problem is the best thing since sliced bread.
They have problems like "compose an email that vaguely makes the impression I'm considering various options but I'm actually not" and for that, I suspect, the conversational workflow is quite good.
Anyone else that actually just does the stuff is viscerally aware of how sub-optimal it is to throw verbiage at a computer.
I guess it depends on what level of abstraction you're working at.
The best executives to work for are the ones who are able to be as precise at their level of abstraction as I am at mine. There’s a shared understanding at an intermediate level, and we can resolve misunderstandings quickly. And then there are the executives who think we should just feed our transducer data into an llm.
Knowing what you want is, sadly, computationally irreducible.
I don't get it at all.
There are some apps (I'm thinking of Jira as an example) where I'd like to do 90% of the usage conversationally.
I'd hate a conversational UI in my car.
There are 1-5 things any individual finds them useful for (timers/lights/music/etc) and then... that's it.
For 99.9% of what I do on a computer, it's far faster to type/click/touch my phone/tablet/computer.
Expecting a good outcome is different from expecting to get exactly what you intended.
Formal specifications are useful in some lines of work and for some projects, less so for others.
Wicked problems would be one example where formal specs are impossible by definition.
Conversely, latency would be a big issue.
This reminds me of the amazing 2013 video of Travis Rudd coding python by voice: https://youtu.be/8SkdfdXWYaI?si=AwBE_fk6Y88tLcos
But then Microsoft bought them (Nuance, the company behind Dragon) a few years ago. Weird that it took so long though.
Not a case against, but the case against.
Actually, it's the reverse. The orienting response is wired in quite deeply. https://en.wikipedia.org/wiki/Orienting_response