Conversational user interfaces (or CUIs) are platforms that mimic a natural human conversation, such as the voice assistant platforms Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Assistant. Until recently, computers relied on graphical user interfaces (GUIs) that require additional hardware for visuals and input, such as a keyboard or touch pad. Today, CUIs let the user communicate with the computer through natural language understanding techniques based on advanced artificial intelligence and machine learning algorithms. Creating voice applications that produce natural and seamless computer conversations is a significant and long-term challenge in the field of Artificial Intelligence. It’s a core belief at PullString that succeeding in this challenge requires a combination of technical and creative talents working together. And given the complexities involved in pulling this off, you need great visual authoring tools to help you design, implement, and iterate on voice interfaces. We believe these authoring tools should:
- Let you craft complex multi-turn conversational flows between the user and the computer
- Support dynamic conversations that can switch fluidly between different topics
- Provide higher-level abstractions that embody conversational best practices
There is a common misconception that we often see repeated—even among experts in the field—that visual authoring tools are inherently limiting and fundamentally cannot produce dynamic conversational experiences, resulting in a poor user experience. We believe this is a false claim: any such limitations are a function of the runtime conversational engine, not of the authoring interface itself. This is an important distinction to appreciate because great authoring tools are critical to the advancement of the field of conversational AI. The future of voice-enabled devices and services depends on it.
Directed Graph Visual Interfaces
Holding a conversation normally involves two or more participants taking turns in expressing opinions and exchanging ideas. At each point, the conversation could go in myriad different directions. One common way to represent this visually is with a directed graph, also known as a flow network, flow chart, or dialog tree.
Figure 1: An example conversation graph, or dialog tree.
Some people look at a flow chart like this and conclude that any conversation based on this structure can only follow a rigid, predefined path in a live conversation with a user. For example, they believe that after the user interacts with the “Mountain” node in the graph above they can only ever progress to the “Yes Photo” or “No Photo” nodes. This belief may stem from older voice technologies like VoiceXML that were used to drive phone trees in Interactive Voice Response (IVR) systems. In a traditional phone tree, you start in a particular menu and have a list of options to pick between, some of which take you to another menu with options specific to that menu. It’s important to note that this is the defined runtime behavior of an IVR system. That is, the IVR system has been programmed to behave this way at runtime when the user interacts with it, but there is nothing stopping it from behaving another way.
The misconception that a flow chart must define a rigid or static experience like an IVR phone tree comes from failing to appreciate that a flow chart is an author-time construct that does not necessarily imply any runtime behavior. That is, the graph layout is input data that can provide hints to the runtime system, but the runtime system doesn’t have to blindly follow this structure. For example, the runtime system could behave as follows:
- There can be conditional expressions that cause a different path to be followed each time, e.g., the greeting flow could be different if you’re a returning user.
- The runtime system can have state (e.g., variables) that causes behavior to be different each time through a part of the content, e.g., if the user has provided both the pizza size and the toppings, then jump to the payment flow.
- There can be arbitrary code execution, such as calling out to a web API, that can cause different behavior and different output depending upon the current state of the conversation, such as checking a stock price or the current weather conditions.
- There can be out-of-context questions that cause the conversational flow to jump to a completely different part of the decision tree. For example, if the computer asks “What stock would you like to look up?” you could still respond “Actually, can you tell me what time it is?” and then be taken to that part of the content.
- After responding to an out-of-context interjection, you could be returned to where you were before with an inserted segue to smooth the transition, e.g., “So, we were looking up a stock price. Which stock should I look up?”
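The behaviors above can be made concrete with a minimal sketch of a runtime that walks an authored graph dynamically. This is only an illustration of the principle: the `Node` class, the condition functions, and the state dictionary are hypothetical, not PullString’s actual engine.

```python
# Minimal sketch of a runtime that walks an authored graph dynamically.
# All names here (Node, next_node, the state keys) are hypothetical.

class Node:
    def __init__(self, name, prompt, edges=None):
        self.name = name
        self.prompt = prompt
        # Each edge is a (condition(state) -> bool, target node name) pair.
        self.edges = edges or []

def next_node(graph, current, state):
    """Follow the first outgoing edge whose condition holds for the state."""
    for condition, target in graph[current].edges:
        if condition(state):
            return target
    return None

# The same static graph yields different paths depending on runtime state,
# e.g., a different greeting flow for a returning user.
graph = {
    "greeting": Node("greeting", "Welcome!",
        edges=[(lambda s: s.get("returning"), "welcome_back"),
               (lambda s: True, "first_visit")]),
    "welcome_back": Node("welcome_back", "Good to see you again."),
    "first_visit": Node("first_visit", "Nice to meet you."),
}

print(next_node(graph, "greeting", {"returning": True}))  # welcome_back
print(next_node(graph, "greeting", {}))                   # first_visit
```

The authored graph data never changes, but the path taken through it depends entirely on the state held by the runtime system.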
The key insight is that using a visual system to script how a computer might respond does not mean it must always respond the same way at runtime. As long as the visual representation provides a way to manipulate state (e.g., variables), receive user input (e.g., voice commands), and define conditional branching (e.g., if-then-else statements), then it is expressing a Turing complete system and can, at least in theory, be just as flexible as writing source code in a programming language.
Given that this “static flow” objection is often raised by engineers, we can use a programming example to drive the point home. The argument is equivalent to looking at the static text of a computer program in a text editor and stating that the resulting program can therefore only be static. Obviously, when this static text is fed to the programming language interpreter or compiler, it can produce countless combinations of program flow at runtime due to conditional expressions, variable state, and different user inputs.
Figure 2: The source code text for this program may never change, but its runtime behavior can be dynamic and ever changing.
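Even a trivial function demonstrates this point. In the sketch below (the function and its inputs are invented for this example), the source text is fixed, yet the output depends entirely on runtime input and accumulated state:

```python
# The source text of this function never changes, but its runtime
# behavior varies with user input and conversation history.
def respond(user_input, history):
    if "weather" in user_input:
        return "Let me check the forecast."
    elif history:
        return f"Earlier you asked about {history[-1]}."
    return "What would you like to know?"

print(respond("what's the weather?", []))  # Let me check the forecast.
print(respond("hello", ["stocks"]))        # Earlier you asked about stocks.
```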
At a fundamental level, a flow diagram defines a set of branching logic that lets a system progress through a series of states. That’s essentially what a computer program does too. In fact, it’s worth observing that there are many visual programming interfaces that let you access the full expressivity of a programming language using a visual chart or block interface. So, in reality, the main reason that someone might say a visual authoring interface is static or constrained is because the runtime system that is being driven by the authoring interface can only produce static and constrained results.
In contrast, the PullString Conversation Cloud provides a powerful runtime system that is much more flexible than a traditional IVR system. Internally it maintains a stack of conversation flows to represent the dynamic behavior of context switching in a natural conversation. When the user switches context, then a new conversation flow can be pushed onto the stack. When the user completes that flow the AI engine can automatically return back to the previous context, inserting a segue filler to smooth the transition such as “Let’s go back to what we were talking about earlier…”.
In support of this, the PullString Conversation Cloud also defines a series of contexts that can be used to match user input. A context defines the user’s location in the conversational graph as well as state from any previous interactions. For example, if the user is playing a voice app that offers multiple audio games, they could respond to the current query from the voice app, or they could say “restart this game” and the app will know from its context which game they are referring to. Or the user could say “play my favorite game,” which might be a global context that takes into consideration the state of the user’s previous interactions. All of this dynamic behavior is possible while still being able to craft individual conversation flows using a visual authoring tool like PullString Converse.
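The stack-of-flows idea can be pictured in a few lines of code. This is only a sketch of the concept; the class name, flow labels, and segue wording are assumptions, not PullString’s internal implementation:

```python
# Sketch of a conversation stack for handling context switches.
# When a flow finishes, the engine returns to the previous context
# with a segue to smooth the transition.

class ConversationStack:
    def __init__(self):
        self.flows = []

    def push(self, flow):
        """User switched context: suspend the current flow."""
        self.flows.append(flow)

    def pop_with_segue(self):
        """Flow finished: return to the previous context with a segue."""
        self.flows.pop()
        if self.flows:
            return f"Let's go back to what we were talking about: {self.flows[-1]}."
        return None

stack = ConversationStack()
stack.push("stock lookup")
stack.push("ask the time")     # out-of-context interjection
print(stack.pop_with_segue())  # segue back to the stock lookup flow
```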
Representing Higher Level Abstractions
A further concern of using directed graph visual interfaces to define conversational paths is that they describe the system at a very low level, and hence do not scale well to more complex scenarios. Again, we believe this claim is based on assumptions that are easy to address.
For example, consider a voice app that lets users order a T-shirt, where the user can select the gender, size, and color of the shirt. The logic to collect all this information could become quite complex. If the user says “I want a blue shirt,” then the app needs to follow up and ask “What gender is the shirt for?” and “What size shirt do you want?”. But if the user says “I want a large red men’s T-shirt,” then all the information has been provided and the app can proceed immediately to the purchase flow. All of this decision logic can be expressed in a flow chart, and at this level of detail the nodes of a flow chart represent quite low-level concepts, equivalent perhaps to a single line of code, or a single condition, in a program. As noted above, there is nothing preventing an authoring tool from expressing all of this logic visually using a flow diagram. However, there is a valid concern that if you try to represent all of the logic of a vast and complex computer program in a flow diagram, it will turn into a ball of spaghetti and become very difficult to navigate. There are several points to consider here:
- Large computer programs are not normally contained within a single source file, so in the same way a large conversational experience does not need to be represented visually on a single canvas. Just as programmers use multiple source files, a visual authoring tool can break an entire voice app across several different containers. (In PullString Author, we call these Categories.)
- Complex programs are normally broken up into functions that group several operations into a coherent higher-level concept with known inputs and outputs, such as a CollectTShirtInformation() function. This same concept can be applied to a visual authoring tool. For example, it would be possible to select a group of nodes in a flow chart and collapse them into a single node. This lets the user visualize the conversation flow at a higher level of detail and not get distracted by the lower-level details (unless they really want to dive down to that level and expand out the group again).
- Programming languages invariably come with standard libraries of functions that provide solutions to common tasks. Similarly, as we learn more about what makes successful voice experiences, these best practices can be encapsulated into higher-level nodes in the graph that come with a lot of behavior and logic that the user doesn’t need to worry about. In PullString Converse, we call these Blocks, and you can think of them as abstracting a certain amount of conversational logic. For example, Converse has a block called Data Capture which can ask users a series of questions until all required inputs are filled out and then call out to some arbitrary code to perform some operation with those inputs, i.e., the whole T-shirt purchase flow, or any other task that fits this template, could potentially be expressed as a single node in a conversational flow.
- Flow charts are good at showing high level structure, i.e., how different components are connected together, but they may indeed not be a good option to express low-level logic. Other mechanisms may make sense too. For example, a single node in a graph could embody a segment of arbitrary code that can be run (in PullString Converse, this is our Web Service capability). Also, there are other ways to express a directed graph visually, such as the tree hierarchy visualization we use in PullString Author.
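As a rough illustration of the kind of logic a Data Capture-style block can encapsulate, here is a minimal slot-filling sketch for the T-shirt example above. The slot names and prompt wording are assumptions made for this example:

```python
# Sketch of slot filling for the T-shirt example: ask only for the
# missing pieces of information, then proceed to the purchase flow.

REQUIRED_SLOTS = {
    "gender": "What gender is the shirt for?",
    "size": "What size shirt do you want?",
    "color": "What color would you like?",
}

def next_prompt(filled):
    """Return the next follow-up question, or None when all slots are filled."""
    for slot, question in REQUIRED_SLOTS.items():
        if slot not in filled:
            return question
    return None

# "I want a blue shirt" fills one slot, so follow-up questions are needed.
print(next_prompt({"color": "blue"}))  # What gender is the shirt for?
# "I want a large red men's T-shirt" fills everything: proceed to purchase.
print(next_prompt({"gender": "men's", "size": "large", "color": "red"}))  # None
```

A visual block can hide exactly this kind of loop behind a single node, with the slots and prompts as its authorable parameters.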
Using Engineering Resources Wisely
The core argument of this post is that a visual authoring tool for constructing a conversational experience does not need to suffer from limitations of static content that cannot scale to large tasks. In fact, it can be just as expressive as a computer program written by a software engineer, if it is backed by a sufficiently sophisticated runtime system.
It should be noted, however, that asserting the need for good authoring tools does not mean that engineers do not have a role to play in developing great computer conversation experiences. For example, if you’re developing a voice app that can tell you the weather, an engineer will likely be needed to write code to perform the integration with an appropriate weather API. However, it’s important to not confuse the engineering work required to provide what is returned with how it is returned.
That is, an engineer who is developing her company’s Amazon Alexa skill or Google Assistant action should not be spending her time reinventing the field of AI and trying to build an entire stateful and contextually-aware natural language production system. Instead, she would make much better use of her time writing the custom logic or web API integrations that are unique to her company’s voice application. This is shown in the figure below, which illustrates the difference between a developer creating the API integrations for their specific skill at author time and the developers who have created the underlying AI runtime system.
Figure 3: The key users and activities in building and using a voice app.
This is a Good Thing: developers are often a limited resource at a company, so allowing them to focus on the unique value add of your voice app is much better than requiring them to become experts in AI dialog management system design. Having a separation between author-time content creation and runtime AI engine behavior helps make this division of tasks more obvious. Yet another benefit of a dedicated authoring environment!
To learn more about common misperceptions and challenges that must be overcome to advance the field of computer conversation, download this white paper on Three Fallacies of Conversational Artificial Intelligence.