Design guidelines for voice user interfaces
If you are working in design or software development, then you have probably heard of Nielsen’s ten usability heuristics. The guide lists rules of thumb backed by a long history of usability testing across all sorts of interfaces. We’ve seen these guidelines in action mostly in graphical interfaces, but how do we apply them to speech interaction? Fortunately, Murad and colleagues have summarized research-backed issues and suggestions specific to voice user interfaces. Without further ado, here are the ten heuristics:
During conversations, a person can judge if they are understood or not based on facial expressions and body language. People can also tell if it’s their turn to speak or listen.
Good voice user interfaces are particularly challenging to build because speech is inherently transient and invisible, according to Schnelle and Lyardet. Once the user has spoken a command, it is gone unless a visual interface is tied to the system.
To make the system status visible, the start and end of the speech interaction must be obvious. In cases of errors and missed information, make sure to say which information was understood and which was missed. For example, in a form-filling context, ask only for the missing information instead of defaulting to “Sorry, I didn’t get that.” for all types of errors.
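As a sketch of what this looks like in practice (the slot names and prompts below are invented for illustration), a form-filling dialog can track which slots were understood and reprompt only for the ones that are still missing:

```python
# Hypothetical form-filling reprompt: ask only for slots that are still missing.
REQUIRED_SLOTS = ["departure_city", "destination_city", "travel_date"]  # assumed slot names

PROMPTS = {
    "departure_city": "Which city are you leaving from?",
    "destination_city": "Where would you like to go?",
    "travel_date": "What date do you want to travel?",
}

def reprompt(filled_slots: dict) -> str:
    """Acknowledge what was understood and ask only for what is missing."""
    missing = [slot for slot in REQUIRED_SLOTS if slot not in filled_slots]
    if not missing:
        return "Great, I have everything I need."
    understood = ", ".join(f"{k.replace('_', ' ')} {v}" for k, v in filled_slots.items())
    parts = []
    if understood:
        parts.append(f"So far I have {understood}.")
    parts.append(PROMPTS[missing[0]])  # ask for one missing piece at a time
    return " ".join(parts)

# Example: the destination was understood but the travel date was missed.
print(reprompt({"departure_city": "Boston", "destination_city": "Denver"}))
```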
Speech interaction has moved from fixed voice commands and keywords toward conversation, so that users can interact with informal, free-form speech. Experts recommend learning from interpersonal communication to make these interactions as natural and intuitive as possible.
To create a convincing user experience, a system should understand not only short, fixed commands but also more natural ways of phrasing an instruction. Responses should be varied and technical terms should be avoided, unless the application area calls for them (e.g. for systems used in formal settings like a hospital operating room, it might be better to keep technical terms and consistent responses).
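One lightweight way to vary responses, sketched below with invented wording, is to rotate through a pool of equivalent confirmations while avoiding an exact repeat; a formal domain could instead pin a single consistent phrasing:

```python
import random

# Hypothetical pools of equivalent responses; a formal setting (e.g. an
# operating room) might instead use one fixed, technical phrasing.
CONFIRMATIONS = [
    "Sure, setting a reminder for 3 pm.",
    "Okay, I'll remind you at 3 pm.",
    "Got it, reminder set for 3 pm.",
]

def pick_confirmation(last_used: str = "") -> str:
    """Avoid repeating exactly the same phrasing twice in a row."""
    options = [c for c in CONFIRMATIONS if c != last_used]
    return random.choice(options)

print(pick_confirmation())
```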
Users want to feel in charge of the system: they want it to respond to their actions. However, experiments by Limerick and colleagues have shown that users exhibit a diminished sense of agency (or sense of control) when using speech interaction.
Keyboards achieve strong intentional binding with the user because each keystroke results in a letter on the screen; each user action corresponds to an instantaneous response. In contrast, voice user interfaces tend to respond more slowly and sometimes miss the user’s intention, so users perceive less control over the system.
Voice user interfaces will respond faster and more accurately as natural language understanding improves, through platforms such as DialogFlow and Amazon Lex. To leverage the power of such models, designers still need to supply good training phrases. However, even when using advanced platforms, designers and developers should still monitor the performance of their systems. Missed user intentions should be captured and targeted for improvement in the next iteration.
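A minimal sketch of that monitoring loop is shown below; the fallback-intent name, confidence threshold, and log format are assumptions rather than any specific platform’s API:

```python
import csv
from datetime import datetime, timezone

FALLBACK_INTENT = "Default Fallback Intent"  # assumed name; varies per platform

def log_if_missed(user_utterance: str, matched_intent: str, confidence: float,
                  path: str = "missed_intents.csv") -> None:
    """Record utterances the NLU could not match confidently, so they can be
    reviewed and turned into new training phrases in the next iteration."""
    if matched_intent == FALLBACK_INTENT or confidence < 0.4:  # assumed threshold
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow(
                [datetime.now(timezone.utc).isoformat(), user_utterance,
                 matched_intent, confidence]
            )

# Example: a turn that was matched to a hypothetical intent with low confidence.
log_if_missed("can you ring my sister maybe later tonight", "call_contact", 0.31)
```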
Because there are no graphic elements like icons, there seems to be nothing to keep consistent, right? Well, not really. People form judgments based on how a person speaks. In the same manner, factors like tone and word choice affect the user’s experience of a voice user interface.
Strohmann and colleagues argue that a virtual assistant’s persona significantly impacts user experience. They refer to crafting the persona as part of representational design as opposed to interaction design. Representational design deals with aesthetics, style, and overall look and feel of the system.
Virtual assistants need a well-defined persona, including a name, background, and personality. This dictates the kinds of words they use, their mannerisms, and their pitch and tone, among other traits. When crafting a persona, make sure to consider the results of previous user research and the company’s brand values. The persona should then serve as a style guide throughout the design process. Over time, several people could work on the same virtual assistant, and a well-documented persona enables various teams at various stages of development to stay consistent.
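One way to keep such a persona actionable, sketched below with invented fields and values, is to document it as a machine-readable style guide that any team can load and check drafted responses against:

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """A documented persona that acts as a style guide across teams."""
    name: str
    background: str
    traits: list = field(default_factory=list)
    preferred_words: dict = field(default_factory=dict)   # plain term -> on-brand term
    banned_words: list = field(default_factory=list)

# Hypothetical persona; real values would come from user research and brand guidelines.
ARIA = Persona(
    name="Aria",
    background="A patient, well-read travel concierge",
    traits=["warm", "concise", "never sarcastic"],
    preferred_words={"purchase": "book", "error": "hiccup"},
    banned_words=["obviously", "calm down"],
)

def violates_persona(response: str, persona: Persona) -> list:
    """Return any banned words found in a drafted response."""
    return [w for w in persona.banned_words if w in response.lower()]
```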
This is easier said than done.
Based on a study by Myers et al., more than half of the errors users encounter stem from the system mishearing them: a combination of issues with microphone quality, speech-to-text conversion, and mapping to the correct user intention. Most developers rely on speech-to-text and natural language understanding software provided by technology powerhouses or open source communities. If improving the underlying technology is not within the designer’s reach, then what can they do?
First, as much as possible, designers and companies must use the systems that they make! By doing so, they experience the kinds of errors users might encounter and can address the common ones before releasing the product. Second, carry out research with actual users. It costs more, but well-designed user research yields a lot of insight into making the product a success. Lastly, after releasing the system, monitor the logs for user errors and address them in a later version.
Recognizing something is always easier than recalling it. Recognition involves picking up cues that help users reach into their memory, thereby enabling relevant information to surface. This is why in written tests, multiple choice questions are easier than open-ended ones. Nielsen recommends making objects, actions, and options visible so that the user’s task is to simply recognize them.
Do not expect the user to remember speech commands. To make explicit what a conversational agent can do and which voice commands it understands, provide a straightforward What Can I Say help menu.
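A minimal sketch of such a help response (the command list is invented for illustration) simply enumerates what the agent currently understands:

```python
# Hypothetical registry of supported commands and example phrasings.
SUPPORTED_COMMANDS = {
    "check the weather": "How's the weather tomorrow?",
    "set a reminder": "Remind me to call mom at 5 pm.",
    "play music": "Play some jazz.",
}

def what_can_i_say() -> str:
    """Respond to 'What can I say?' with the commands the agent understands."""
    lines = ["Here are some things you can try:"]
    lines += [f'{task}, for example "{example}"'
              for task, example in SUPPORTED_COMMANDS.items()]
    return " ".join(lines)

print(what_can_i_say())
```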
As much as possible, create voice user interfaces that are consistent with users’ current experiences and mental models. Instead of reinventing the wheel, apply established design patterns for common tasks. Borrow from how users transact in face-to-face and phone conversations. Analyze competing products and look for repeatable UX flows and patterns.
Speech I/O almost always loses to text I/O in measures of ease of use, largely because users are familiar with keyboards and their shortcuts. However, researchers are hopeful that speech interaction can improve efficiency because users can simply say their commands instead of searching through a graphical user interface (GUI). Speech interaction, when done right, adds flexibility and efficient operation on top of standard GUIs.
Within speech interaction itself, flexibility is achieved when the same command can be given in different ways. For example, the weather forecast should be available whether the user says “Weather forecast please” or “How’s the weather tomorrow?” Natural language understanding platforms like DialogFlow can handle such free-form input; DialogFlow leverages Google’s machine learning expertise to infer the user’s intention from a few sample phrases.
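To make the idea concrete without tying it to a particular platform, here is a toy matcher, with invented phrases and intent names, in which several different phrasings resolve to the same weather-forecast intent, much as sample training phrases would in an NLU platform:

```python
import re

# Toy intent matcher standing in for the sample training phrases an NLU
# platform would generalize from. Phrases and intent names are invented.
TRAINING_PHRASES = {
    "get_weather_forecast": [
        "weather forecast please",
        "how's the weather tomorrow",
        "will it rain today",
    ],
    "set_reminder": [
        "remind me to call mom",
        "set a reminder for five pm",
    ],
}

def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))

def match_intent(utterance: str) -> str:
    """Return the intent whose training phrases overlap most with the utterance."""
    words = tokenize(utterance)
    best_intent, best_overlap = "fallback", 0
    for intent, phrases in TRAINING_PHRASES.items():
        overlap = max(len(words & tokenize(p)) for p in phrases)
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent

print(match_intent("How's the weather tomorrow?"))  # -> get_weather_forecast
```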
Beyond understanding free-form input, conversation design recommends some human-like maneuvering throughout the interaction. Systems that use voice or chat should gracefully handle cases where people give multiple pieces of information in one go, digress, or use ambiguous language.
Speech interaction increases cognitive load. Unlike pointing and clicking on a graphical interface, speaking and listening draw on the same short-term and working memory as recall and problem-solving. This is why people prefer quiet environments when working on tough problems. Humans can think while doing physical activities like typing quite easily, whereas thinking while speaking requires more effort.
Given that speech takes precious working memory away from other tasks, it is best to keep it minimal. Do not give the user all the information in one go. Provide only the most relevant information, then confirm with the user which part to elaborate on. Remember that minimalism is about providing exactly the amount of information the user needs.
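A sketch of this kind of progressive disclosure (the content below is invented) gives a short answer first and elaborates only on the part the user asks about:

```python
# Hypothetical layered answer: lead with the essentials, elaborate on request.
SUMMARY = "Your flight to Denver leaves tomorrow at 9:15 am from gate B12."
DETAILS = {
    "baggage": "You can check one bag for free; carry-ons must fit under the seat.",
    "check-in": "Online check-in opens 24 hours before departure.",
}

def answer(follow_up: str = "") -> str:
    if not follow_up:
        return f"{SUMMARY} Would you like details about baggage or check-in?"
    return DETAILS.get(follow_up, "Sorry, I don't have details on that.")

print(answer())            # short answer plus an offer to elaborate
print(answer("baggage"))   # elaborate only on the part the user asked about
```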
Babu notes that it’s tempting to give all the information at once. To avoid this, he recommends writing as if the user were having an instant chat with a real human rather than a machine. Designers should put themselves in their users’ shoes and help them step by step, as they would help a friend or an elderly family member.
Errors cannot be completely prevented, so designers need a good strategy to handle them. Google’s conversation design guidelines provide a strategy for three kinds of errors:
1. The user did not respond to the interface (no input).
2. The user confuses the interface (no match).
3. The user asks the interface something it can’t do (system error).
In general, users should not experience more than three “no input” or “no match” errors in a row. On the first no match error, the system should give a rapid reprompt that combines an apology with a condensed repetition of the original question; repeating the question verbatim would sound robotic. On the second no match error, the system should include some support or help information. On the third, the system should end the conversation or transfer to a human customer service agent to prevent user frustration.
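A minimal sketch of this escalation policy is shown below; the prompt wording is invented, and the code only illustrates the strategy, not Google’s implementation:

```python
def handle_no_match(no_match_count: int, original_question: str) -> str:
    """Escalate after consecutive no-match errors, following the three-strike strategy."""
    if no_match_count == 1:
        # Rapid reprompt: apology plus a condensed version of the question.
        return f"Sorry, {original_question}"
    if no_match_count == 2:
        # Add support: offer examples of what the system can understand.
        return (f"Sorry, I still didn't get that. {original_question} "
                'You can say things like "next Friday" or "June 3rd".')
    # Third strike: end gracefully or hand off to a human agent.
    return "I'm having trouble understanding. Let me connect you to an agent."

print(handle_no_match(1, "what date do you want to travel?"))
```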
For more ideas on handling no input and system errors, check out Google’s conversation design guidelines.
There are two fundamental challenges with regard to the learnability and discoverability of voice user interfaces. First, users tend to assume that the system can understand beyond its actual capabilities. Second, users are unaware of the available functionalities. The opaqueness of voice user interfaces prevents users from forming accurate mental models of how the interface works.
Aside from a straightforward help menu like What Can I Say, Corbett and Weber recommend interactive tutorials that provide contextualized help. In graphical user interfaces, it’s common to have a separate menu devoted to help: users can keep their current work context and view help in a separate window. In speech interaction, users shouldn’t need to exit the current conversation in order to navigate to help. Instead, help should be built in around the actions currently available in the conversation context. For example, conversation design recommends adding help information whenever the user makes the same mistake twice.
Aside from keeping help available in every context of the conversational agent, Strohmann et al. also recommend that virtual assistants proactively inform the user. Whenever relevant to the current situation, a virtual assistant can volunteer information, especially about functions the user has never used before.
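As a sketch with invented feature names, the assistant can track which features a user has already tried and proactively mention a relevant, unused one when the context fits:

```python
# Hypothetical registry mapping a conversation context to features worth suggesting.
CONTEXT_FEATURES = {
    "weather": ["severe_weather_alerts"],
    "reminders": ["recurring_reminders"],
}

FEATURE_HINTS = {
    "severe_weather_alerts": "By the way, I can also send you severe weather alerts.",
    "recurring_reminders": "Did you know I can set reminders that repeat every week?",
}

def proactive_hint(context: str, used_features: set) -> str:
    """Suggest one relevant feature the user has never used, if any."""
    for feature in CONTEXT_FEATURES.get(context, []):
        if feature not in used_features:
            return FEATURE_HINTS[feature]
    return ""

# After answering a weather question for a user who has never used alerts:
print(proactive_hint("weather", used_features={"daily_forecast"}))
```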
And there you have it — the ten usability heuristics applied to voice user interfaces. Now, it’s time to design!