Voice User Interface (VUI) Design is emerging as a new discipline in the world of user experience, driven by the rapid adoption of smart speakers and voice assistants. While a lot has been written about key principles of VUI Design, or how designing for voice radically differs from designing for a screen, not much has been written on the actual concepts that VUI designers can manipulate or should think about when crafting a conversational experience. Visual designers play with layers, filters, gradients, etc. But what can VUI designers play with? What does their toolbox look like?
Concepts like invocation phrase, intents, entities or slots that define the interaction model of a voice application are well defined and understood, but they apply to only a very small part of the conversation design: recognizing what the user means. How to act on that meaning, and how to organize the conversational flow from that point, have not been codified from a design point of view.
This blog post proposes four sets of fundamental VUI design elements upon which every computer conversation is built. These elements have emerged from PullString’s seven-year history of crafting hundreds of advanced multi-turn voice- or text-based conversational experiences.
At the highest level, every computer conversation is an exchange of words between a user and a machine. From the machine's point of view, anything the user says or does is an INPUT, whereas every response said or displayed by the machine, or any computation that the machine executes, is an OUTPUT. Starting with a wake word, computer conversations are a succession of input and output sequences.
Talking to the user: Prompts & Statements
Once the machine hears the invocation phrase, it wakes up and responds with a text-to-speech (TTS) output. While every speech output appears similar in the code of a developer, VUI designers should differentiate between a prompt and a statement.
Prompts and statements constitute the core building blocks of any human conversation. When it is your turn to speak, you will likely deliver a statement and follow it with a prompt to let your interlocutor back into the conversation, if only to confirm they have heard you.
Prompts ask questions of the other party and seek an answer; they take many forms, from open-ended questions ("what do you think?") and closed-ended questions ("are you ready?") to intonation, body language or eye contact that invite the other person to jump into the conversation.
Statements deliver an opinion, a comment or share information and do not solicit input; interrupting a statement would be considered rude in most social settings.
VUI designers need to differentiate between prompts and statements because each requires a unique authoring strategy.
Voice user interfaces create two challenges for designers:
- users do not know what they can or cannot ask the machine;
- once a response is said by the machine, it resides only in the short term memory of the user and has no persistence.
The mission of prompts in VUI design is to overcome these two challenges: they need to clearly describe the universe of things that users can ask, in a way that they can remember. Best practices in prompt writing typically recommend using closed-ended questions, placing the question as the last thing said, and limiting the number of options to a maximum of three.
Prompts have a "re-prompt" property; in the case of Alexa, for example, if no input is detected for eight seconds, Alexa issues a re-prompt. The re-prompt should always acknowledge the fact that no input was detected to keep the conversation natural.
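To make the distinction concrete, here is a minimal Python sketch of how a design tool might model the two kinds of output. The class names, the speak function and the convention that None means "no input detected" are all hypothetical choices for illustration, not part of any platform's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Statement:
    """Speech that delivers information and expects no reply."""
    text: str


@dataclass
class Prompt:
    """Speech that ends the machine's turn and invites the user to answer."""
    text: str
    reprompt: str  # spoken if the user stays silent


def speak(output: Union[Statement, Prompt], user_reply: Optional[str]) -> List[str]:
    """Return what the machine says for one turn.

    user_reply is None when no input was detected (a convention assumed here).
    Only a Prompt can trigger a re-prompt; a Statement never waits for input.
    """
    lines = [output.text]
    if isinstance(output, Prompt) and user_reply is None:
        lines.append(output.reprompt)
    return lines


ready = Prompt(
    text="Would you like news, weather, or music?",
    reprompt="I didn't hear anything. You can say news, weather, or music.",
)
```

Note how the re-prompt acknowledges the silence, repeats the three options and keeps them as the last thing said.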
Statements also have their own challenges and techniques. One of the first functions of a statement is to confirm what was just heard. Statements will be used for implicit confirmation, whereas a prompt will be used for explicit confirmation.
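As a rough illustration of the two confirmation styles, the Python sketch below builds either a statement or a prompt from the same heard value. The function name, its parameters and the sample wording are assumptions made for this example only.

```python
from typing import Tuple


def confirm(value: str, next_step: str, explicit: bool) -> Tuple[str, bool]:
    """Return (speech, expects_reply) confirming what was just heard."""
    if explicit:
        # Explicit confirmation is a prompt: stop and wait for a Yes or No.
        return (f"I heard {value}. Is that right?", True)
    # Implicit confirmation is a statement: echo the value and keep going.
    return (f"Okay, {value}. {next_step}", False)
```

Implicit confirmation keeps the conversation moving, while explicit confirmation costs the user one extra turn, so a designer might reserve it for actions that are hard to undo.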
Next, statements function as providers of information. With users' attention spans shrinking, it is important to put in place strategies that summarize information while giving users the ability to follow up for more details.
In a multi-modal world, prompts and statements also take different forms. For example, in the Google Assistant world, statements might be represented by basic cards while prompts use suggestion chips to help with a choice or a response.
Listening to the user: Local Intents & Global Intents
It is now the turn of the user to speak. The magic of speech recognition or natural language understanding will transform the user’s words into an intent and in some instances an entity value (or slot value in Alexa parlance). Machine learning driven interaction models like Alexa or Dialogflow work as a flat list of intents. However, VUI designers should create an intent hierarchy model and distinguish between a local intent and a global intent.
Local intents represent the user inputs expected as a response to a prompt. To a question like “are you ready?”, you will expect a Yes or No intent. In this case Yes and No are local intents. The follow-up to a local intent is straightforward: it is the expected next step in the conversation.
Global intents are the user inputs that should be handled at any point in the conversation. They include intents like Help, Cancel, and Stop that Alexa requires every skill to handle. But they could also include key navigational intents like “Checking account” or “Savings account” to let the user change topics at any time in the conversation. Following up on a global intent is trickier, as designers have to decide how to get back into the conversation. Do you return the user to where they were or do you follow a different path?
During the conversation, local intents should usually take priority over global intents, but this does not always have to be the case. VUI designers need to specify in their design the priority order in which these intents are handled.
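One way to picture this hierarchy is as a lookup that checks one scope before the other. The Python sketch below is only an illustration: the intent names in GLOBAL_INTENTS, the resolve function and its local_first flag are hypothetical, and a real platform would hand you the recognized intent name from its own flat list.

```python
from typing import Set

# Intents the application listens for at every point (illustrative names).
GLOBAL_INTENTS = {"Help", "Cancel", "Stop", "CheckingAccount", "SavingsAccount"}


def resolve(intent: str, local_intents: Set[str], local_first: bool = True) -> str:
    """Classify a recognized intent as 'local', 'global', or 'fallback'.

    local_intents holds the answers expected by the current prompt.
    local_first sets the priority order the designer chose for this prompt.
    """
    order = [("local", local_intents), ("global", GLOBAL_INTENTS)]
    if not local_first:
        order.reverse()
    for scope, intents in order:
        if intent in intents:
            return scope
    return "fallback"
```

With this layout, a prompt that expects "Stop" as a legitimate answer (for example, "Should I stop or continue?") can claim it locally, while every other prompt leaves it to the global handler.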
Keeping the conversation moving: Fallbacks, Interjections & Segues
The sequence prompt - local intent - statement forms the basic flow of any VUI: the happy path. The fun, however, starts once the happy path is set. VUI designers now need to think about all the different directions that the conversation can take and how to handle them in order to keep moving it forward.
A well-known principle of screenwriting says that true character is only revealed under pressure. When a user takes your voice application off the happy path, it puts your application under pressure, revealing its true personality.
VUI designers can handle the unexpected with two strategies: first, create a design that reacts to unknown inputs; second, create lists of unexpected things that the user may say and design conversation flows that react to them.
Fallbacks help with the first strategy.
A fallback triggers when the input received does not match any of the intents your application is listening for. A basic fallback is “sorry, I did not get that” followed by a repeat of the prompt. A more advanced fallback can track how many times an unexpected input is received and, for example, take the user to a different place in the conversation after the third time.
Fallbacks can also be used to move the conversation forward irrespective of the input from the user, in which case no local intents need to be defined. Fallbacks should be written with the understanding that unrecognized input might be due to your interaction model falling short or to the user trying to trick your application.
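The counting fallback described above can be sketched in a few lines of Python. The function name, the "same" marker, the "main_menu" destination and the redirect wording are hypothetical; only the idea of escalating after the third miss comes from the text.

```python
from typing import Tuple


def fallback(prompt: str, misses: int, max_misses: int = 3) -> Tuple[str, str]:
    """Return (speech, next_node) for an input that matched no intent.

    misses counts unrecognized inputs at this prompt, including this one.
    """
    if misses >= max_misses:
        # Stop repeating the same question and move the user elsewhere.
        return ("Let's try something else.", "main_menu")
    # Basic fallback: apologize, then repeat the prompt.
    return (f"Sorry, I did not get that. {prompt}", "same")
```

Because the apology does not blame the user, the same wording works whether the interaction model fell short or the user was testing the application.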
Interjections and Segues help with the second strategy.
Interjections are similar to global intents. Interjections comprise the universe of intents you do not expect as part of your happy path but that you want to handle to build a personality for your voice application. They can range from handling profanity to one of the questions most often asked of Alexa: “Will you marry me?” As with global intents, the hard part of interjection design is figuring out how to get back into the conversation. Segues define how to transition back into the conversation. Following a response to the interjection, will you return the user to where they left off or will you navigate to an entirely new part of the experience? If returning, will you repeat the prompt or trigger the fallback? Designing advanced segues will make your conversation sound much more natural.
Just as with global intents, VUI designers should specify, at any point in the conversation, the priority between local intents and interjections, and whether the application should even listen for interjections instead of going to a fallback.
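Here is one possible way to pair each interjection with its segue, sketched in Python. Everything in it is an assumption for illustration: the Interjection class, the two segue labels, the intent names and the responses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Interjection:
    response: str
    segue: str                     # "repeat_prompt" or "jump"
    target: Optional[str] = None   # node to go to when segue == "jump"


INTERJECTIONS = {
    "MarryMe": Interjection("I'm flattered, but I'm married to my work.", "repeat_prompt"),
    "Bored": Interjection("Let's change things up.", "jump", "main_menu"),
}


def handle_interjection(
    intent: str, current_node: str, current_prompt: str
) -> Optional[Tuple[str, str]]:
    """Return (speech, next_node), or None if intent is not an interjection."""
    interjection = INTERJECTIONS.get(intent)
    if interjection is None:
        return None  # let the fallback handle it
    if interjection.segue == "jump":
        return (interjection.response, interjection.target)
    # Segue back: answer the interjection, then repeat the open prompt.
    return (f"{interjection.response} {current_prompt}", current_node)
```

Keeping the segue next to the response forces the designer to decide, for every interjection, how the conversation gets back on track.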
Fallbacks, Interjections and Segues are critical design elements to make your voice application more expressive and natural sounding.
Creating context: States & Conditional Branching
Great user experiences are personalized experiences that remember previous user choices. This is true of any experience, whether in-person, screen-based or voice-based.
VUI designers can use states and conditional branching to personalize their VUI.
States help you keep track of what users say and do. More specifically, states allow you to store (and later recall) information about the conversation. Examples of states are:
- Is the user a new user? (yes/no)
- User name (“William”)
- How many right answers (0,1,2)
Conditional branching is the ability to condition the machine output based on the value of a state.
States and conditional branching are the equivalent of variables and “if” statements in programming languages.
States should be updated throughout the conversation, for example by incrementing the number of visits or a score. They can be used to capture system data (number of visits) or user inputs (first name). States can persist across conversation sessions or be reset. Designers need to think about which states need to be reset at the beginning of every experience and which should not. States can and should be used in a prompt or statement. If you capture the user’s first name in a state, use it in your responses to personalize the experience.
Conditions can be used either to branch the conversation, for example routing the user to a game tutorial if it is their first visit, or to provide alternate output, for example choosing between a “good morning” and a “good evening” statement based on the time of day.
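Since states and conditions map so directly onto code, a short Python sketch may help. The state keys, the node names and the noon cutoff between morning and evening are assumptions chosen for this example.

```python
from typing import Dict, Tuple

# States reset at the start of every session; all others persist (assumed split).
SESSION_STATES = ("score",)


def start_session(state: Dict, hour: int) -> Tuple[str, str]:
    """Update states, then return (greeting, next_node) for a new session."""
    for key in SESSION_STATES:
        state[key] = 0
    state["visits"] = state.get("visits", 0) + 1  # persistent system data

    # Alternate output: pick the statement from the time of day.
    greeting = "Good morning" if hour < 12 else "Good evening"
    # Use a captured state in the response to personalize it.
    if state.get("first_name"):
        greeting += f", {state['first_name']}"

    # Branch: first-time users go to the tutorial, returning users to the game.
    next_node = "tutorial" if state["visits"] == 1 else "game"
    return (greeting + ".", next_node)
```

Calling it twice with the same state dictionary shows both the persistence and the branching: the first call routes to the tutorial, and the second goes straight to the game.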
In the exploration of these four sets of design elements, we have progressed from creating basic conversational exchanges that define the happy path of a voice application, to crafting a personality for the application based on its handling of the unexpected, to personalizing the experience through the use of states and conditional branching. I believe these are the four key sets of tools every VUI designer should be familiar with when they embark on the design of a voice application.
The field is so new, however, that the words I use in this post to describe these concepts might not be the best ones. Some concepts might overlap with others or require further granularity. The goal of this blog post is to start a conversation to create a common vocabulary around VUI design so that VUI designers have their own layers, gradients and filters. Let’s start the conversation. How would you label these concepts? What other concepts would you add to the list?