Voice user interfaces and AI: what to know today

A deep dive into the latest on voice user interfaces.

On a yellow background, a collage of a hand holding a phone with a large open mouth on it.

While voice user interfaces (VUIs) have their roots in the 1950s, the first VUI as we know them today was Siri, released by Apple in 2011. Today, Siri and Alexa are the most popular voice assistants, followed by Google Assistant. But who dominates this landscape looks set to change soon.

VUIs have been commonly used for basic tasks like setting timers, playing music, controlling a smart device or checking the weather. They’re also particularly useful for people with access needs.

Generally, adoption has been slow, and voice doesn't compete with other interfaces for most tasks. VUIs have their uses but have generally fallen short of performance expectations — there is a gulf between their potential and actual functionality. However, with the advent of GPT-4o, things could be about to change.

To get a better view of what’s going on in the world of VUI and its intersection with Generative AI, we conducted research to explore the space. Is it time to double down, or approach with caution?

How people currently use VUI

Our findings revealed that VUI hasn't lived up to the early hype.

The people we spoke to typically use VUI for tasks such as playing music, enabling do not disturb on devices, setting timers, checking weather reports and controlling smart home devices such as heating controls.

However, people shared that VUI is unreliable and often frustrating to use, despite the many use cases and benefits.

A strong theme that emerged from our discussions was that many of the people we spoke to felt that VUI has its uses but is fundamentally non-essential to daily life.

They are fine, but I can function without it.

[VUI was] positioned as a game-changer disrupter life companion, but ended up being a smart timer to command when your hands are full.

There's a big problem underlying these concerns: to get around the usability issues with VUI, people must change how they speak.

Sometimes I have Alexa set an alarm or timer, and when it goes off, I would like to be able to say "Alexa, thank you" to tell it I have heard the alarm. But she merely pauses the alarm, says "you're welcome" then returns to the alarm. So instead, I have to say "Alexa, quiet", which feels rude.

[It's] a bit rubbish. Can never understand what I'm saying. Have to put on a southern accent.

You learn to ask questions with zero ambiguity.

Despite these problems, with GPT-4o on the horizon, is the way people use VUI about to change? OpenAI has demonstrated improvements to the tone and flow of conversations, which now seem more "human", along with the ability to interrupt the voice assistant; this is a big leap forward and addresses some of the concerns we found in our research.

However, access to relevant information beyond the application in which someone is using the VUI, or the ability to ask for help in real time when switching between tasks, still seems some way off.

Voice modulation

We found many of the people who use VUI change the way they speak. For example, people will use unnatural speech patterns; adopt a different accent; make their voice deeper; or speak in clipped, abrupt sentences to be understood.

This runs counter to the principles of accessibility and good user experience, which advocate for tools and platforms adapting to the user, and not the other way around.

Several participants in our research strongly preferred using their hands to complete tasks, opting to physically interact with their phones rather than verbally formulating commands for a VUI. As people will always choose the path of least resistance, this suggests the combination of typing, tapping and talking to an app requires less effort than solely using speech.

Again, GPT-4o's promotional content suggests OpenAI has removed the need for users to change the way they talk when using VUI. However, this alone won't be enough to normalise voice as an input for task performance and search.

A lack of intent

VUI cannot understand user intent.

When asked how often voice assistants understood their commands, only 9% of respondents reported that their assistant 'always' understood, while an additional 22% indicated 'often'.

Examples cited included failures at everyday tasks like playing requested music, where VUI sometimes struggled due to noisy environments or a lack of context from previous conversations. In some instances, users noted that VUI seemed to have no record of recent conversations, even those occurring within the last couple of minutes.

To date, most people use VUIs for basic tasks which don't factor in intent. Even with GPT-4o's promise, it's unlikely it will understand the 'why' behind the action. It still cannot make logical leaps in the way a human can or keep continuity over a series of exchanges. Moreover, the data it has access to is limited to what has been shared with or displayed to it.

On a yellow background, an illustration of a ring toss game, with only one ring on the stand and the rest scattered around.

Negative feedback loops

The most predominant frustration revolved around commands being misunderstood. People told us they must structure their commands in several different ways to get their voice assistant to understand them. This repetitive process creates a negative feedback loop, eroding the user experience over time.

Compared to a regular smartphone or computer screen (a graphical user interface, or GUI), correcting mistakes with voice can be difficult. It takes time for the system to surface an error, or for the user to notice a mistake.

This stilted feedback loop creates an unnatural dialogue, disconnected from the way people usually communicate.

Voice also presents major issues for people with speech impediments, despite offering many advantages over GUIs for people with both situational and permanent accessibility needs.

Given the more conversational flow of GPT-4o and the ability to interrupt and redirect the voice assistant, some of the issues leading to poor user experience may have been addressed.

However, despite the big step forward, it remains to be seen how generative AI enabled VUIs handle noisy environments and secondary conversations not directed at them. This could be a stepping stone, given mass adoption on mobile, desktop and other smart devices is likely to rely on interaction in busy environments.

Accuracy and reliability of interaction

Many voice assistants are hit and miss, often giving irrelevant results or playing a different song to the one requested.

People also told us they feel their voice assistant "tries to guess" and is often wrong, which is more frustrating than simply not returning any results at all.

These issues mean that no one we spoke to would trust a VUI for complex or high-stakes tasks such as ordering an item online.

Privacy matters

People told us they have concerns around how personal data is being used. There is a perceived lack of transparency around data privacy.

There is also a perception of being "listened to" by voice assistants, fuelled by experiences such as seeing targeted ads related to something you've recently spoken about.

This looks set to continue, with a lack of transparency around how information has been sourced for GPT-4o as well as how a user’s interactions will be used for future iterations.

Why adoption of AI-assisted VUI may continue to be slow

The unease around AI

Through our research, we found an unease about allowing AI assistants and smart technology to do more for us.

AI is like a soup we're all swimming in. There's a perception that it's "everywhere" and "in the background" and even "watching us" - 66% of respondents said AI impacts their life already.

We spoke to people about this unease and found a picture of many overlapping anxieties around job security, privacy, data security and more.

This is only compounded by negative stories and legal disputes around information and access. Scarlett Johansson's legal dispute with OpenAI over its use of a voice very similar to her own for its assistant is only the tip of the iceberg. This isn't just an issue that impacts OpenAI; it shapes the way the AI industry is perceived as a whole.

On a yellow background, a black and white image of a hand holding a glass of water.

Transparency matters

A key issue is the lack of transparency around how voice assistants work, with users generally not aware of how data is collected and used.

We found a general lack of trust in technology firms. There is a perception that voice assistants are “listening” and that people's data is being used to target advertising.

We always pay for things in the end whether it is with our privacy, data or intelligence.

Given that data is crucial to the continuous evolution of models, this concern looks set to continue bubbling away.

Embarrassing issues

At times, it's unlikely to matter how advanced the VUI is. In fact, a good VUI that feels like you're interacting with a real person could be off-putting for some, especially when searching for embarrassing or sensitive issues through voice.

If you were looking up embarrassing or sensitive symptoms, typing them could, psychologically speaking, feel like the more private option rather than vocalising them out loud to a "real" person.

How the next generation of VUIs will handle sensitive situations remains to be seen. This needs to be addressed to take common usage beyond basic tasks.

Dependency on smart assistants

Some people said they worry that an increasing dependency on smart assistants could lead to a loss of control of their own lives.

Will it make us lazier or mean we forget how to do things for ourselves?

[I'm worried about] the reduction of peoples core skills. Forgetting how to do things manually. And a reliance on it. What if it goes down... what then?

On the social side I'd always try and resist it, even if it was perfect. Even if it was better at interacting with my friends than I am.

We also heard concerns that increased interaction with AI-enabled VUI could change us, and our interpersonal relationships, for the worse.

Fear of job replacement

Some fear losing their jobs as tools like Gen AI and VUI become more advanced.

There are ethical considerations too, such as bias in AI algorithms. 59% of people we spoke to said they're worried or very worried about potential discrimination or stereotyping related to AI.

[AI] makes me question my professional value going forward.

On a yellow background, a black and white image of a robot hand appears, next to arrows pointing up.

Why VUI use may rapidly increase now it has assistance from AI

A more personalised experience

Based on initial examples, VUIs powered by increasingly advanced AI look set to provide a more personalised and intuitive experience. They're likely to feel more like speaking to another person.

We expect to see voice experiences where the thread of a conversation is maintained, compared to the stilted exchanges that people currently experience.

Responses are also likely to become increasingly personalised, taking into account greater context to understand the user's intent.

Poor quality voice assistants will feel increasingly frustrating.

Instant language translation

One of the more impressive new capabilities from the recent GPT-4o demonstration is the instant language translation feature.

Although participants from our research didn’t mention this as a need, it’s easy to see why this might bring more users to VUI. It has implications for both business and personal use; removing language barriers for companies or holidaymakers could be a major mass adoption catalyst.

Mathematics and coding help

We weren't sure whether this feature is a positive or not, even though the demonstrations of the mathematics and coding features are impressive.

Apps that solve mathematical equations for you already exist; the impressive thing in the GPT-4o demonstration is that you can ask it to help you solve the equation, with rationale, rather than just giving you the answer. You can see how this could become a teacher when kids are away from the classroom.
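To make the "with rationale" point concrete, here is a small worked example of our own (not taken from the demonstration), written in LaTeX, showing the kind of step-by-step explanation a tutoring-style response might give for 3x + 5 = 20:

    % Hypothetical worked example; our own illustration, not from the GPT-4o demo.
    \begin{align*}
    3x + 5 &= 20 && \text{the original equation} \\
    3x     &= 15 && \text{subtract 5 from both sides} \\
    x      &= 5  && \text{divide both sides by 3}
    \end{align*}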

Similarly, the coding features are impressive. They allow those with and without coding skills to complete a number of different tasks. Recently, one NFT artist launched a crypto "memecoin" with zero coding ability. ChatGPT is democratising coding, and it's easy to see how this feature could become very popular with its user base.

Interrupting the voice assistant

We've watched a good amount of the various demonstrations now available. As you watch, you start to get the impression that GPT-4o has its own personality. It's enthusiastic; some will find this endearing, while others will find it less so.

In several demonstrations, this enthusiasm translated into the user interrupting the “super excited” voice assistant.

However, the ability to do this will be a welcome one for users; VUIs of old had a habit of misunderstanding the question, which meant you had to wait while they relayed information that wasn't relevant to the query. Quickly steering the assistant back on track will make the user experience better and cut down the time it takes to get what you're looking for.

Anticipating needs

VUIs are likely to become more predictive, where they are currently reactive. They will be able to anticipate user needs.

Soon, we can imagine an always-on, AI-powered voice assistant that can make simple personalised suggestions, such as items to buy or tasks to remember, without needing a command at all.

User expectations of AI assistants are high. Use of more capable AI tools will make poor quality voice assistants feel obsolete or frustrating.

Proactive interactions

Voice can also provide the medium for generative AI to interact with people proactively and provide human-like nuance and sensitivity to digital interactions.

This could have particular utility in care settings. For example, we can imagine elderly or unwell people speaking to a device that can monitor their needs and alert their caregivers.

This technology is already being used in medical settings to alleviate loneliness and even care for people with dementia.

Smart home integration

There is also an opportunity to enhance the value of voice interactions and integrate more deeply with smart home and IoT devices.

Take the Samsung fridge that automatically orders milk when you're running low: this kind of predictive ability offers huge value in retail alone.
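As a minimal sketch of the mechanism being described, the Python below shows the kind of threshold rule a predictive reorder could run on; the function names and threshold are hypothetical stand-ins, not any vendor's real API.

    # Minimal sketch of a predictive reorder rule for a smart fridge.
    # read_milk_level_ml and place_grocery_order are hypothetical stand-ins.
    REORDER_THRESHOLD_ML = 300            # reorder when roughly a glass remains
    STANDARD_ORDER = {"item": "semi-skimmed milk", "quantity": 2}

    def check_and_reorder(read_milk_level_ml, place_grocery_order):
        """Place a standard order when the measured level drops below the threshold."""
        level_ml = read_milk_level_ml()   # e.g. from a weight or fill-level sensor
        if level_ml < REORDER_THRESHOLD_ML:
            place_grocery_order(**STANDARD_ORDER)
            return True                   # an order was placed
        return False                      # still enough milk; do nothing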

Apple make their move

While we were writing this article, Apple made a major AI announcement. Up until now, they've taken a different approach, even going as far as to refuse to say "AI" in their previous keynotes.

Monday 10th June changed all that, with Apple announcing that their devices will now integrate with OpenAI; they're now calling AI in their products Apple Intelligence. This feels like the most groundbreaking product update since they unveiled the iPhone, and has seismic implications for Apple users.

Their demonstration showed off a range of new features the integration brings, ranging from the sublime to the ridiculous.

Surprisingly, one of the jaw-dropping updates is to the calculator app. Apple has integrated the mathematical abilities we mentioned earlier into their native calculator; the way it instantly solves algebra equations is very impressive. You can also ask Apple Intelligence to rewrite your work to be more concise, friendly or professional. And you can instantly edit your photos and videos in ways that go well beyond adding a filter or changing the contrast.

The tasks we've just mentioned are app-specific, but they don't have to be. Apple Intelligence draws deeply on all the apps on your device. This makes describing what you can do with the technology next to impossible, because so many things become possible. You could, for example, ask it what time your mum's flight is due to arrive, ask it to book her a taxi to collect her from the airport and bring her to your house, and set an alarm for when she's a couple of minutes away.
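As a rough illustration of that kind of cross-app orchestration, here is our own sketch in Python; none of these functions are Apple's real APIs, they only stand in for the app data and actions an assistant would need to reach.

    # Hypothetical orchestration of one spoken request across three "apps".
    # get_flight_arrival, book_taxi and set_alarm are illustrative stand-ins,
    # not real Apple (or any other vendor) APIs.
    from datetime import timedelta

    def plan_airport_pickup(get_flight_arrival, book_taxi, set_alarm):
        """Find the flight, book the taxi, and set an alarm shortly before arrival."""
        flight = get_flight_arrival("mum")                   # e.g. parsed from an email
        taxi = book_taxi(pickup="airport", dropoff="home",
                         pickup_time=flight.arrival_time)    # ride-hailing app
        # Alarm a couple of minutes before the taxi is due at the house.
        set_alarm(at=taxi.estimated_dropoff_time - timedelta(minutes=2))
        return flight, taxi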

However, allowing OpenAI deep access across all your apps to perform these sorts of tasks will be concerning to some. After the keynote, Elon Musk took to X (formerly Twitter) and said he would ban Apple products from his workplaces over concerns about data leaks. Apple seems to have pre-empted these privacy worries; part of their keynote was spent reassuring users that AI will use people's data, not store it. Whether this is enough to allay people's concerns remains to be seen, as it also raises questions about Apple's commitment to privacy and trust at a brand level.

But what about VUI in all this? Well, almost everything we've mentioned (and much more besides) you can do with your voice. This is great news for those with accessibility needs, as it makes more advanced tasks possible, moving well beyond Siri and its previous capabilities.

However, Apple is hedging its bets on VUI. They were keen to point out in their keynote that you can direct the AI assistant with your voice or type out your commands. So, it looks like they're unsure whether this technology will accelerate mass VUI adoption.

On a yellow background, a large grey question mark appears in a speech bubble.

What we think will happen next with AI assisted VUIs

There are many ways this could go in the next five to ten years. We propose three possible trajectories for the future of voice interfaces.

Changing how we use VUI

The first is a step change in people's usage of voice assistants and other VUI, fuelled by a dramatic improvement in usability sparked by better responses to requests.

This paradigm shift could eventually see voice become the primary method for interacting with software and media, facilitated by increasingly smart LLMs.

Something seemingly trivial like voice messaging, long a staple in non-western nations, is becoming an increasingly used feature. This could spread into new tools and utilities, giving them the behavioural nudge required to gain traction and adoption.

Gradual adoption

While we are transitioning into a generative AI-powered world, where delivering high value through VUI may finally become a reality, adoption is still likely to be gradual, with GUIs remaining the primary mode of digital interaction for the foreseeable future.

We believe user confidence will only be won through vastly improved usability along with greater transparency and privacy.

Many people are unconvinced about the benefits of interacting via voice despite the demonstrated potential, and are justifiably concerned about the risks.

Even with a step-change in their capabilities, the success of voice will require sustained commitment to an inclusive experience that protects people's privacy and security.

Deeper integrations with GUI

A final possibility is an intertwining of VUI with existing GUIs to create a more seamless transition and hand-off between the two.

VUIs and GUIs already complement each other. For instance, when following a new recipe, we often choose between using voice or touch. A future where both modalities are integrated into a single journey, with seamless handovers, appears to be on the horizon.

For example, verbally asking a device to find information in a document and then transitioning to physically editing the document.

This would allow people to switch from browsing with gestures to completing a task with voice, or vice versa.

Wrapping up

The VUI and AI-assistant marketplace continues to grow rapidly. Deeper, more contextual use, at the point where these tools intersect, seems the most likely source of a shift in how we think about interacting with software.

However, barriers remain; entrenched behaviours will still need to be shifted through design, marketing and communications that capture cultural nuance and address user concerns before adoption takes hold.
