Think about how hard it is to maintain a text or voice assistant. A service like Siri has to respond to a huge number of different inquiries across an ever-expanding list of domains, such as travel, music, scheduling, reminders, texting, photos, search, sports, entertainment, and more. Apple doesn’t publish a complete list of what Siri can respond to, but it’s not hard to imagine that there are many thousands of possible variations of questions that Siri can answer. Now think about the problem of maintaining all of that carefully crafted behavior, i.e., manually curating and training all those intents, while also trying to add support for new queries. How can you be sure that any new features won’t break existing behavior?
For example, consider a personal assistant that can understand a user saying “I’m feeling great” and responds appropriately to that statement. Then at some later point, someone extends this bot so that users can state their name in a format like “I’m Joe Bloggs”. The new feature works as intended, so it’s deployed to production. But then you get feedback from your users that if they say “I’m feeling great”, the assistant responds erroneously with “Hello! I’ll call you Feeling Great from now on.”
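To see how easily this kind of regression can sneak in, here is a minimal sketch (using hypothetical regex-based intents, not any specific bot framework) of how an overly broad pattern added later can shadow an existing one:

```python
import re

# The original intent: recognize statements of mood.
MOOD_PATTERN = re.compile(r"I'm feeling (\w+)", re.IGNORECASE)

# The new intent, added later: capture the user's name.
# Because it matches anything after "I'm", it also matches mood statements.
NAME_PATTERN = re.compile(r"I'm (.+)", re.IGNORECASE)

def respond(utterance: str) -> str:
    # If intents are checked in the wrong order, the broad name pattern
    # wins and the mood intent is never reached.
    name_match = NAME_PATTERN.match(utterance)
    if name_match:
        return f"Hello! I'll call you {name_match.group(1).title()} from now on."
    mood_match = MOOD_PATTERN.match(utterance)
    if mood_match:
        return f"Glad to hear you're feeling {mood_match.group(1)}!"
    return "Sorry, I didn't understand that."

print(respond("I'm feeling great"))
# The broad name pattern intercepts the mood statement:
# Hello! I'll call you Feeling Great from now on.
```

Nothing about the name feature is wrong in isolation; the bug only appears in how the two intents interact, which is exactly why it slips past manual spot-checks of the new feature.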
Even for conversational experiences that are not as deep as Siri, the complexity and inherent ambiguity of human language mean that changes to intent definitions can easily have unintended consequences for queries your bot handled correctly before. So how can you, as a bot designer or conversational writer, feel confident that, as you evolve and improve your intent models over time, you don’t break key workflows your users expect to keep working? In the software development world, this problem is addressed with a combination of automated and manual testing, and for any nontrivial system automated testing is critical. Similarly, we believe that having automated tests for your conversational flows is critical for maintaining your bots over time.
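In practice, such a test can be as simple as a “golden” suite of utterances with their expected intents, run against the model on every change. The sketch below assumes a hypothetical `classify_intent()` function standing in for whatever hook your bot framework exposes for running an utterance through the intent model:

```python
import re

def classify_intent(utterance: str) -> str:
    # Stand-in for a real intent classifier; ordered so the more
    # specific pattern is tried before the broad one.
    if re.match(r"I'm feeling \w+", utterance, re.IGNORECASE):
        return "report_mood"
    if re.match(r"I'm .+", utterance, re.IGNORECASE):
        return "set_name"
    return "fallback"

# Golden utterances: every release must keep classifying these correctly.
REGRESSION_SUITE = {
    "I'm feeling great": "report_mood",
    "I'm Joe Bloggs": "set_name",
    "What's the weather?": "fallback",
}

def test_intents() -> None:
    failures = [
        (text, expected, classify_intent(text))
        for text, expected in REGRESSION_SUITE.items()
        if classify_intent(text) != expected
    ]
    assert not failures, f"Intent regressions: {failures}"

test_intents()
print("All intent regression checks passed.")
```

Run under a test runner such as pytest in continuous integration, a suite like this turns the “I’m feeling great” regression from a user-reported bug into a failing build before deployment.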