Voice User Interfaces (VUI) — The Ultimate Designer’s Guide
“Set my alarm to 7:15am”
— “Ok, calling Selma Martin”
“No! Set my alarm to 7:15am”
— “I’m sorry. I can’t help you with that.”
“Sigh” *Manually sets alarm*
Our voices are diverse, complex, and variable. Voice commands are even more daunting to process — challenging even between people, let alone for computers. The way we frame our thoughts, the way we culturally communicate, the way we use slang and infer meaning… all of these nuances influence how our words are interpreted and understood.
So, how are designers and engineers tackling this challenge? How can we cultivate trust between user and AI? This is where VUIs come into play.
Voice User Interfaces (VUI) are the primary or supplementary visual, auditory, and tactile interfaces that enable voice interaction between people and devices. Simply stated, a VUI can be anything from a light that blinks when it hears your voice to an automobile’s entertainment console. Keep in mind, a VUI does not need to have a visual interface — it can be completely auditory or tactile (e.g., a vibration).
While there’s a vast spectrum of VUIs, they all share a set of common UX fundamentals that drive usability. We’ll explore those fundamentals so, as a user, you can dissect your everyday VUI interactions — and, as a designer, you can build better experiences.
The way we interact with our world is highly shaped by our technological, environmental, and sociological constraints: the speed at which we can process information, the accuracy with which we can translate that data into action, the language or dialect we use to communicate that data, and the recipient of that action (whether it’s ourselves or someone else).
Before we dive into the interaction design itself, we must first identify the environmental context that frames the voice interaction.
The device type influences the modes and inputs that underlie the spectrum and scope of the voice interaction.
Phones
Wearables
Stationary Connected Devices
Non-Stationary Computing Devices (Non-Phones)
What are the primary, secondary, and tertiary use cases for the voice interaction? Does the device have one primary use case (like a fitness tracker)? Or does it have an eclectic mix of use cases (like a smartphone)?
It is very important to create a use case matrix that will help you identify why users are interacting with the device. What is their primary mode of interaction? What is secondary? What is a nice-to-have interaction mode and what is essential?
You can create a use case matrix for each mode of interaction. When applied to voice interaction, the matrix will help you understand how your users currently use, or want to use, voice to interact with the product — including where they would use the voice assistant.
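To make this concrete, here is a minimal, purely hypothetical sketch of such a matrix for a smart TV, expressed as code. The use cases, interaction modes, and rankings are illustrative assumptions, not research findings.

```python
# Hypothetical use case matrix for a smart TV (illustrative values only).
# Rank per interaction mode: 1 = primary, 2 = secondary, 3 = nice-to-have.
use_case_matrix = {
    "change channel":    {"remote": 1, "voice": 2, "phone app": 3},
    "search for a show": {"voice": 1, "remote": 2, "phone app": 3},
    "adjust volume":     {"remote": 1, "voice": 3},
    "dictate a message": {"voice": 1, "phone app": 2},
}

def voice_primary_use_cases(matrix: dict) -> list[str]:
    """Return the use cases where voice is ranked as the primary mode."""
    return [use_case for use_case, modes in matrix.items() if modes.get("voice") == 1]

print(voice_primary_use_cases(use_case_matrix))
# ['search for a show', 'dictate a message']
```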
If you’re using user research to inform your use cases (whether usage data or raw quantitative/qualitative research), then it is important to qualify your analysis by rank ordering the prospective modes of interaction.
If someone tells you: “OMG that would be so cool if I could talk to my TV and tell it to change the channel”, then you really need to dig deeper. Would they really use it? Do they understand the constraints? Do they truly understand their own propensity to use that feature?
For instance, let’s say we’re examining whether a user is likely to use voice commands to interact with their TV. In this case, it is safe to assume that voice interaction is one of many possible types of interaction.
The user has access to multiple alternative input methods: a remote, a paired smartphone, a gaming controller, or a connected IoT device. Voice, therefore, does not necessarily become the default mode of interaction. It is one of many.
So the question becomes: what is the likelihood that a user will rely on voice interaction as the primary means of interaction? If not primary, then would it be secondary? Tertiary? This will qualify your assumptions and UX hypotheses moving forward.
Translating our words into actions is an extremely difficult technological challenge. With unlimited time, connectivity, and training, a well-tuned computational engine could expediently ingest our speech and trigger the appropriate action.
Unfortunately, we live in a world where we don’t have unlimited connectivity (i.e., omnipresent gigabit internet), nor do we have unlimited time. We want our voice interactions to be as immediate as the traditional alternatives: visual and touch — even though voice engines require complex processing and predictive modeling.
Here are some sample flows that demonstrate what has to happen for our speech to be recognized:
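The original flow diagrams are not reproduced here, but as a rough, illustrative sketch (the stage names, stub functions, and confidence threshold below are assumptions, not any specific platform’s pipeline), one such flow might look like this:

```python
# Illustrative only: a heavily simplified voice recognition flow with stubbed-out model stages.
# Real platforms run far more elaborate acoustic, language, and intent models at each step.

def detect_wake_word(audio: bytes) -> bool:
    return True  # stub: a wake-word model would score the audio here

def transcribe(audio: bytes) -> tuple[str, float]:
    return "set my alarm to 7:15 am", 0.92  # stub: speech-to-text plus a confidence score

def parse_intent(text: str):
    if "alarm" in text:
        return "set_alarm", {"time": "7:15 am"}  # stub: NLU extracts the intent and its parameters
    return None, {}

def handle_utterance(audio: bytes) -> str:
    if not detect_wake_word(audio):       # 1. Decide whether to listen at all
        return ""
    text, confidence = transcribe(audio)  # 2. Speech-to-text
    if confidence < 0.6:                  # 3. Low confidence: ask the user to repeat
        return "Sorry, could you say that again?"
    intent, slots = parse_intent(text)    # 4. Intent and parameter extraction
    if intent is None:
        return "I'm not sure what you meant."
    return f"OK: {intent} {slots}"        # 5. Map the recognized command to an action

print(handle_utterance(b"raw audio bytes"))
```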
As we can see, there are numerous models that need to be continuously trained to work with our lexicons, accents, variable tones, and more.
Every voice recognition platform has a unique set of technological constraints. It is imperative that you embrace these constraints when architecting a voice interaction UX.
Analyze the following categories:
Furthermore, we should consider that users can interact with the device in a non-linear way. For example, if I want to book a plane ticket on a website, I am forced to follow the website’s progressive information flow: select destination, select date, select number of tickets, look at options, and so on.
But VUIs have a bigger challenge. The user can say, “We want to fly to San Francisco on business class.” Now the VUI has to extract all of the relevant information from the user in order to harness existing flight-booking APIs. The logical ordering may be skewed, so it is the VUI’s responsibility to gather the missing details from the user (either by voice or with visual supplements).
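As a minimal sketch of that extraction problem (the slot names, required fields, and parsed utterance below are assumptions for illustration, not a real booking API):

```python
# Illustrative slot-filling sketch for a flight-booking VUI (slot names are hypothetical).
REQUIRED_SLOTS = ["origin", "destination", "date", "passengers", "cabin_class"]

def missing_slots(filled: dict) -> list[str]:
    """Return the booking details the user has not provided yet, in the order we'd ask for them."""
    return [slot for slot in REQUIRED_SLOTS if slot not in filled]

# "We want to fly to San Francisco on business class" fills only two of the five slots...
filled = {"destination": "San Francisco", "cabin_class": "business"}

# ...so the VUI must prompt (by voice, or with a visual supplement) for the rest.
for slot in missing_slots(filled):
    print(f"Follow-up needed for: {slot}")
```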
Now that we’ve explored our constraints, dependencies, and use cases, we can start to dive a little deeper into the actual voice UX. First, we’ll explore how devices know when to listen to us.
For some added context, a basic voice UX flow runs from an input trigger, through a listening cue and the user’s speech, to an end cue, processing, and feedback. These are the stages we’ll walk through below.
There are four types of voice input triggers:
As a designer, you must understand which triggers will be relevant to your use cases, and rank order those triggers from most relevant to least relevant.
Typically, when a device is triggered to listen, there will be an auditory, visual, or tactile cue.
These cues should adhere to the following usability principles:
Feedback is critical to successful voice interface UX. It allows users to get consistent and immediate confirmation that their words are being ingested and processed by the device. Feedback also lets users take corrective or affirmative action.
Here are some UX principles that contribute to effective VUI feedback:
The end cue signals that the device has stopped listening to the user’s voice and will begin processing the command. Many of the same ‘leading cue’ principles apply to the end cue (immediate, brief, clear, consistent, and distinct). However, a few additional principles apply specifically to the end cue.
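To make the role of these cues concrete, here is a minimal sketch of the states a voice interaction moves through and where the leading and end cues fire. The state names and cue descriptions are illustrative assumptions, not any particular device’s behavior.

```python
# Illustrative sketch: where the leading cue and end cue sit in a voice interaction's lifecycle.
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    RESPONDING = auto()

def on_trigger() -> tuple[State, str]:
    """Wake word or button press: start listening and fire the leading cue immediately."""
    return State.LISTENING, "leading cue: short chime + pulsing light"

def on_speech_end() -> tuple[State, str]:
    """End of speech detected: fire the end cue so the user knows listening has stopped."""
    return State.PROCESSING, "end cue: falling tone + steady light"

def on_result() -> tuple[State, str]:
    """Processing finished: respond, then return to idle."""
    return State.RESPONDING, "spoken and/or visual response"

state = State.IDLE
for step in (on_trigger, on_speech_end, on_result):
    state, cue = step()
    print(f"{state.name}: {cue}")
```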
Simple commands like “Turn on my alarm” don’t necessarily require a lengthy conversation, but more complex commands do. Unlike traditional human-to-human interaction, human-to-AI interaction requires additional layers of confirmation, redundancy, and rectification.
More complex commands or iterative conversations typically require multiple layers of speech and option verification to ensure accuracy. Complicating matters even more, oftentimes the user is not sure what to ask or how to ask for it. So it becomes the VUI’s job to decipher the message and allow the user to provide additional context.
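For example, a VUI might scale how much verification it layers on based on recognition confidence and the riskiness of the command. A minimal sketch, assuming hypothetical confidence thresholds and wording:

```python
# Illustrative confirmation policy (thresholds and phrasing are assumptions, not a platform rule).
def confirmation_strategy(intent: str, confidence: float, is_destructive: bool) -> str:
    """Pick how much verification to layer onto a recognized command."""
    if confidence < 0.5:
        return "Re-prompt: 'Sorry, I didn't catch that. What would you like to do?'"
    if is_destructive or confidence < 0.8:
        # Risky or uncertain commands get an explicit yes/no confirmation.
        return f"Explicit confirm: 'Did you want me to {intent}?'"
    # High confidence, low risk: confirm implicitly while acting.
    return f"Implicit confirm: 'OK, {intent}.'"

print(confirmation_strategy("turn on your alarm", 0.95, is_destructive=False))
print(confirmation_strategy("delete all reminders", 0.95, is_destructive=True))
print(confirmation_strategy("book the flight", 0.60, is_destructive=True))
```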
Giving human-like traits to voice interaction creates a relationship between human and device. This anthropomorphization can manifest in various ways: patterns of lights, shapes that bounce, abstract spherical patterns, computer-generated voice, and sounds.
This relationship cultivates a more intimate bond between user and machine, which can also span across products with similar operating platforms (e.g., Google’s Assistant, Amazon’s Alexa, and Apple’s Siri).
Voice interactions should be fluid and dynamic. When we converse with each other in person, we typically use a myriad of facial expressions, changes in tone, body language, and movement. The challenge is capturing this fluid interaction in a digitized environment.
When possible, the entire voice experience should feel rewarding. Of course, more fleeting interactions like “Turn the lights off” don’t necessarily require a full relationship. However, any more robust interaction, like cooking with a digital assistant, does require a prolonged conversation.
An effective voice motion experience would benefit from the following principles:
VUIs are extremely complex, multifaceted, and often hybrid amalgams of interaction. In fact, there isn’t really an all-encompassing definition. What’s important to remember is that an increasingly digitized world means that we may actually be spending more time with our devices than we do with each other. Will VUIs eventually become our primary means of interaction with our world? We’ll see.
In the meantime, are you looking to build a world-class VUI? Here are some helpful resources: