Putting Conversation-Driven Development to the Test

When it comes to creating better software products, you won't find much disagreement about the need to understand the user. But we're just beginning to recognize what understanding the user really means for building conversational AI products.

It boils down to two things:

Adapting to the requests and conversation paths users are naturally inclined to take, instead of forcing the user to adapt to the software.
Training your model on things users actually say, instead of a broad but improbable data set.

We use the term conversation-driven development (CDD) to describe practices for building user-centered assistants. We wrote about it here. Recently, we've looked for ways that developers can demonstrate the benefits of developing assistants with CDD to their teammates.

So we set up an experiment. A CDD hackathon, if you will.

For this first experiment, we decided to focus on building the best NLU model. We tested two assistants side by side: one developed using conversation-driven development, the other developed while breaking as many CDD "rules" as we could.

CDD bot vs (non) CDD bot

Six of us at Rasa got together and split into two groups: one group would develop their assistant according to CDD practices, the other would ignore CDD altogether.

Each group started with their own copy of the Financial Services Starter Pack, a banking assistant that handles tasks like paying a credit card balance and checking spending history. The assistant assigned to the CDD group was left unchanged because it had already been connected to Rasa X and developed using CDD from day 1.

The group not practicing CDD received the same bot, but with the training examples under each intent label removed, creating a blank slate of a data set.

The CDD group shared their assistant with volunteers from Rasa as well as their friends and family. This generated 30 new test conversations in Rasa X. The non-CDD group did not share their assistant with test users during the hackathon.

Over the course of several hours, the group practicing CDD worked their way through the NLU inbox in Rasa X, annotating user messages. They also reviewed whole conversations to look for insights as well as conversations that could be converted to training examples and tests.

Meanwhile, the group not practicing CDD (unfortunately we never did come up with a catchy name for this group) used a JavaScript-based tool to generate training examples for each intent. One team member also spent some time coming up with training examples by hand. All in all, this group produced approximately 1000 training examples.
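The post doesn't name the JavaScript tool or show its templates, so the following is only a minimal Python sketch of how template-based generation typically works. The intent names, templates, and slot values are all invented for illustration. It shows why this kind of data grows quickly in volume while staying narrow in phrasing: every example is one of a handful of sentence shapes.

```python
from itertools import product

# Hypothetical templates and slot values, invented for illustration only.
TEMPLATES = {
    "pay_credit_card": [
        "I want to pay my {card} bill",
        "please pay off my {card} balance",
    ],
    "check_spending": [
        "how much did I spend at {vendor}",
        "show my spending at {vendor}",
    ],
}
SLOTS = {
    "card": ["credit card", "visa", "mastercard"],
    "vendor": ["Starbucks", "Amazon", "Target"],
}


def expand(template, slots):
    """Yield every filling of the {slot} placeholders found in a template."""
    names = [name for name in slots if "{" + name + "}" in template]
    for values in product(*(slots[name] for name in names)):
        yield template.format(**dict(zip(names, values)))


def generate(templates, slots):
    """Return {intent: [example, ...]} built from all template expansions."""
    return {
        intent: [text for pattern in patterns for text in expand(pattern, slots)]
        for intent, patterns in templates.items()
    }


examples = generate(TEMPLATES, SLOTS)
total = sum(len(texts) for texts in examples.values())
print(total, "examples from", sum(len(p) for p in TEMPLATES.values()), "templates")
```

Adding slot values multiplies the example count, but the model still only ever sees the sentence structures written into the templates.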

You can find the data set for the CDD group here, and you can find the data set for the non-CDD group here.

Lastly, the CDD group used a CI/CD pipeline to automatically test the new model before merging updates. The non-CDD group ran a data validation check to catch formatting errors in training data, but otherwise did no testing before changes were merged.

What we found...

TL;DR: On real-world conversations, the CDD model performed better (a weighted average F1 of 90%) than the model trained on autogenerated data (81%).

First, we performed a 5-fold cross-validation on the CDD dataset, which consists of real user messages and is representative of things users actually say. The intent classifier's F1 score was, as mentioned above, 90%.
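For readers unfamiliar with the metric, here is a small sketch of how a weighted average F1 is usually computed: F1 is calculated per intent, then averaged with each intent weighted by how many true examples it has. This is a plain-Python illustration, not the evaluation code used in the experiment, and the function name and intent labels are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-intent F1, averaged with weights equal to each intent's support."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)      # true examples per intent
    predicted = Counter(y_pred)    # predictions per intent
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for intent, n in support.items():
        tp = correct[intent]
        precision = tp / predicted[intent] if predicted[intent] else 0.0
        recall = tp / n
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n            # weight by support
    return total / len(y_true)


# Hypothetical labels: three "pay" messages and one "check" message.
y_true = ["pay", "pay", "pay", "check"]
y_pred = ["pay", "pay", "check", "check"]
score = weighted_f1(y_true, y_pred)
print(round(score, 3))
```

Because the average is weighted by support, frequent intents dominate the score, which suits a test set that mirrors what users actually send.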

We then trained a model on all of the autogenerated data and tested it on the CDD dataset. It achieved a weighted average F1 of 81%, nine points lower than the CDD result.

Now you might think that that doesn't seem like a fair comparison, and you're right! CDD gives you an unfair advantage. One of the key pillars of CDD is using real user messages to give you the best possible NLU model.

What's more, just working with autogenerated data creates a false sense of security. If you run a 5-fold cross-validation on the autogenerated data set, you'll get an F1 of 94%, which sounds pretty good. But all this tells you is that your NLU model finds it easy to reverse-engineer the script you used to generate that data. It doesn't reflect real-world conditions. In production, the autogen model will always be making predictions against messages from real users, not the artificial messages it's been fitted for. So no matter how high the F1 score, testing the autogen model against its own data distribution is a poor predictor of actual performance. The proof? When we tested the autogen model against a test set of CDD data, it scored a full 13 points lower, at 81%.
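The gap between the two evaluation setups can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the token-overlap classifier is a toy stand-in for a real NLU pipeline, the folds are simple interleaved splits rather than stratified ones, plain accuracy stands in for weighted F1 to keep the code short, and both tiny data sets are invented. The numbers it prints say nothing about the experiment itself.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs using k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_token_overlap(examples):
    """Toy classifier: remember which tokens were seen for each intent."""
    vocab = {}
    for text, intent in examples:
        vocab.setdefault(intent, set()).update(text.lower().split())

    def predict(text):
        tokens = set(text.lower().split())
        # Highest overlap wins; ties go to the alphabetically first intent.
        return max(sorted(vocab), key=lambda intent: len(tokens & vocab[intent]))

    return predict


def accuracy(predict, examples):
    return sum(predict(text) == intent for text, intent in examples) / len(examples)


def cross_validate(examples, k, fit):
    """Average score when train and test come from the same data set."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        predict = fit([examples[i] for i in train_idx])
        scores.append(accuracy(predict, [examples[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Invented "autogenerated" data: many examples, two sentence shapes.
autogen = [
    ("please pay my credit card bill", "pay"),
    ("please pay my visa bill", "pay"),
    ("please pay my mastercard bill", "pay"),
    ("please pay my amex bill", "pay"),
    ("show my spending at amazon", "spend"),
    ("show my spending at target", "spend"),
    ("show my spending at starbucks", "spend"),
    ("show my spending at walmart", "spend"),
]
# Invented "real user" messages with freer phrasing.
real = [
    ("i owe money on my card", "pay"),
    ("show me what i owe", "pay"),
    ("where did my money go", "spend"),
    ("what did i buy at starbucks", "spend"),
]

in_distribution = cross_validate(autogen, k=4, fit=fit_token_overlap)
cross_distribution = accuracy(fit_token_overlap(autogen), real)
print("autogen cross-validation:", in_distribution)
print("autogen model on real messages:", cross_distribution)
```

On this toy data the cross-validation score comes out higher than the score on the "real" messages, because held-out template expansions look almost identical to the ones the model was trained on.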

On the other hand, testing the CDD model against data from the same distribution is actually quite similar to real-world conditions.

Over time, a CDD assistant gains another advantage. Practicing CDD on an assistant running in production means that more and more of the messages users send will already be represented in your training data. That makes your NLU model's job even easier.

Conclusion

In this exercise, we focused primarily on the impact CDD has on NLU training data, which is an important part of conversation-driven development. Building a data set based on real-world data produces a model that does a better job of classifying real-world messages, which is really the only evaluation that matters.

CDD encompasses more than just annotation though. We lightly touched on reviewing conversations and automating tests; in a real-life scenario, these practices would be central to understanding how well the assistant is performing and making sure nothing breaks when changes are made. Sharing your assistant with test users (early and often), tracking issues and successes, and iteratively fixing bugs are all part of CDD too.

Look out for more resources on the How behind CDD, not just the Why. And join the Conversation-Driven Development LinkedIn group to continue the discussion. Ask questions, tell us how you agree (or disagree), and share what conversation-driven development looks like on your team.