10 Best Practices for Designing NLU Training Data

It's almost a cliche that good data can make or break your AI assistant. But cliches exist for a reason, and getting your data right is the most impactful thing you can do as a chatbot developer.

At Rasa, we've seen our share of training data practices that produce great results, and habits that might be holding teams back from achieving the performance they're looking for. We put together a roundup of best practices for making sure your training data not only results in accurate predictions, but also scales sustainably.

The Challenge

What causes good training data to go bad? And what happens when things go off track?

Data sets often pass through many hands, which leaves plenty of opportunity to accumulate errors. These errors can have quite serious effects on your assistant's behavior: the inability to recognize user messages that don't exist in your training data (also known as overfitting), poor entity extraction, and intent confusion.

Models aren't static; it's necessary to continually add new training data, both to improve the model and to allow the assistant to handle new situations. It's important to add new data in the right way to make sure these changes are helping, and not hurting.

If you've inherited a particularly messy data set, it may be better to start from scratch. But if things aren't quite so dire, you can start by removing training examples that don't make sense and then building up new examples based on what you see in real life. Then, assess your data based on the best practices listed below to start getting your data back into healthy shape.

Here are 10 best practices for creating and maintaining NLU training data.

1. Use real data.

One common mistake is going for quantity of training examples over quality. Often, teams turn to tools that autogenerate training data to produce a large number of examples quickly.

There are a few pitfalls to this approach. One is overfitting: over time, the model loses the ability to generalize and is only able to recognize phrases it's seen before. The other is that autogeneration tends to produce implausible examples, not things users would actually say in real life. This implausible data isn't doing your model any favors.

Instead, focus on building your data set over time, using examples from real conversations. This means you won't have as much data to start with, but the examples you do have aren't hypothetical. They're things real users have said, which is the best predictor of what future users will say.

2. Keep training examples distinct across intents.

In order for the model to reliably distinguish one intent from another, the training examples that belong to each intent need to be distinct. That is, you definitely don't want to use the same training example for two different intents. When training examples are too similar, intent confusion results.

This sounds simple, but categorizing user messages into intents isn't always so clear cut. What might once have seemed like two different user goals can start to gather similar examples over time. When this happens, it makes sense to reassess your intent design and merge similar intents into a more general category.
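One way to catch overlap early is to scan your data for messages that appear under more than one intent. The sketch below is a minimal illustration, assuming the training data has already been loaded into a plain dictionary of intent names and example strings. The function name find_shared_examples and the sample data are hypothetical and not part of Rasa.

```python
from collections import defaultdict


def find_shared_examples(training_data):
    """Return {normalized example: sorted intent names} for examples used by 2+ intents."""
    intents_by_example = defaultdict(set)
    for intent, examples in training_data.items():
        for example in examples:
            # Normalize case and whitespace so trivial variations count as duplicates.
            key = " ".join(example.lower().split())
            intents_by_example[key].add(intent)
    return {
        example: sorted(intents)
        for example, intents in intents_by_example.items()
        if len(intents) > 1
    }


# Hypothetical data: the same message is labeled with two different intents.
data = {
    "provide_name": ["my name is Sara", "it's Sara"],
    "provide_email": ["it's sara@example.com", "My name is  Sara"],
}
shared = find_shared_examples(data)
print(shared)
```

An exact-match check like this only finds verbatim duplicates. Near-duplicates that differ by a word or two still need a human review, but a non-empty result is a good prompt to reconsider whether the intents involved should be merged.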

One common situation where many intents can be merged into one is when users are providing information in response to an assistant's question. You might think the best approach would be to create multiple intents: provide_name, provide_address, provide_email. But these user messages actually aren't so different: they differ mainly in their entities, and the surrounding words in the sentence are fairly similar. In this case, the best way forward would be to create a single inform intent, and group all training examples where a user is providing information underneath it. But what if you want to vary the assistant's response depending on the information provided? More on that in the next tip.

3. Merge on intents, split on entities.

There are many situations where you might want to perform conditional logic depending on the information provided by the user. For example, after asking the user if they're a new or returning customer, you want to take the conversation down a different path depending on what they answer. You might assume the best way to solve this problem is to create two different intents: inform_new and inform_returning. But just like the inform intent we discussed in the previous tip, it's better to group both 'new' and 'returning' user messages into a single intent.

So how do you control what the assistant does next, if both answers reside under a single intent? You do it by saving the extracted entity (new or returning) to a categorical slot, and writing stories that show the assistant what to do next depending on the slot value. Slots save values to your assistant's memory, and entities are automatically saved to slots that have the same name. So if we had an entity called status, with two possible values (new or returning), we could save that entity to a slot that is also called status.

Here's what that looks like for a returning user:

## returning user
* greet
    - utter_ask_if_new
* inform
    - slot{"status": "returning"}
    - utter_welcome_back

Versus a new user:

## new user onboarding
* greet
    - utter_ask_if_new
* inform
    - slot{"status": "new"}
    - utter_create_account

And the training data for the inform intent? Be sure to include examples like I'm [new](status) and [Returning](status) so the NLU model can learn to recognize these entities.
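To make the annotation syntax concrete, here is a small sketch of what a [value](entity_name) annotation encodes: the plain text the user typed, plus the entity name, value, and character offsets. This is a simplified illustration written for this post, not Rasa's own parser, and the function name parse_example is hypothetical.

```python
import re

# Matches the [value](entity_name) annotation syntax used in the examples above.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_example(annotated):
    """Split an annotated example into plain text and a list of entity dicts."""
    entities = []
    plain_parts = []
    cursor = 0  # position in the annotated string
    length = 0  # length of the plain text built so far
    for match in ANNOTATION.finditer(annotated):
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        length += len(before)
        value, entity = match.group(1), match.group(2)
        entities.append(
            {"entity": entity, "value": value, "start": length, "end": length + len(value)}
        )
        plain_parts.append(value)
        length += len(value)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), entities


text, entities = parse_example("I'm [new](status)")
print(text)
print(entities)
```

The model trains on the plain text ("I'm new") and learns from the offsets which span of it is the status entity.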

You can learn more about slots in the docs.

4. Use synonyms wisely.

A common misconception is that synonyms are a method of improving entity extraction. In fact, synonyms are more closely related to data normalization, or entity mapping. Synonyms convert the entity value provided by the user to another value, usually a format needed by backend code.

For example, let's say you're building an assistant that searches for nearby medical facilities (like the Rasa Masterclass project). The user asks for a "hospital," but the API that looks up the location requires a resource code that represents hospital (like rbry-mqwu). So when someone says "hospital" or "hospitals" we use a synonym to convert that entity to rbry-mqwu before we pass it to the custom action that makes the API call.

Let's look at another example of a good use case for synonyms. Let's say you're building an assistant that asks insurance customers if they want to look up policies for home, life, or auto insurance. The user might reply "for my truck," "automobile," or "4-door sedan." It would be a good idea to map truck, automobile, and sedan to the normalized value auto. This allows us to consistently save the value to a slot so we can base some logic around the user's selection.

The key is that you should use synonyms when you need one consistent entity value on your backend, no matter which variation of the word the user inputs. Synonyms don't have any effect on how well the NLU model extracts the entities in the first place. If that's your goal, the best option is to provide training examples that include commonly used word variations. But you don't want to break out the thesaurus right away. The best way to understand which word variations you should include in your training data is to look at what your users are actually saying, using a tool like Rasa X.
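The normalization step can be pictured as a simple lookup applied after entities have already been extracted. The sketch below is an illustration under that assumption, not the actual implementation of Rasa's synonym mapping. The table contents reuse the examples above, while the entity name policy_type and the function normalize_entities are hypothetical.

```python
# Hypothetical synonym table: surface forms (lowercased) -> normalized value.
SYNONYMS = {
    "truck": "auto",
    "automobile": "auto",
    "sedan": "auto",
    "hospital": "rbry-mqwu",
    "hospitals": "rbry-mqwu",
}


def normalize_entities(entities, synonyms=SYNONYMS):
    """Replace each extracted entity value with its normalized form, if one is defined."""
    normalized = []
    for entity in entities:
        value = entity["value"]
        # Values without a synonym entry pass through unchanged.
        mapped = synonyms.get(value.lower(), value)
        normalized.append({**entity, "value": mapped})
    return normalized


# Entities as they might come out of the extractor (hypothetical).
extracted = [
    {"entity": "policy_type", "value": "Truck"},
    {"entity": "policy_type", "value": "home"},
]
print(normalize_entities(extracted))
```

Notice that the function only ever sees entities that were already extracted. If the model misses "truck" in the user message, there is nothing for the synonym table to map, which is why synonyms can't improve extraction.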

5. Understand lookup tables and regexes.

Lookup tables and regexes are methods for improving entity extraction, but they might not work exactly the way you think. Lookup tables are lists of entities, like a list of ice cream flavors or company employees, and regexes check for patterns in structured data types, like 5 numeric digits in a US zip code. You might think that each token in the sentence gets checked against the lookup tables and regexes to see if there's a match, and if there is, the entity gets extracted. But actually, lookup tables and regexes get featurized. That means they're used as features when training the NLU model itself, rather than as direct matching rules. This is why you can include an entity value in a lookup table and it might still not get extracted. It's not common, but it is possible.

For best results, you should make sure to include a few of the entities used in lookup tables and regexes in your training examples. This gives the model a better representation of how the entity is actually used in a sentence and increases the likelihood that the entities will be correctly extracted.
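Here is a rough picture of what "featurized" means. In the sketch below, each token gets a flag for whether it matches a regex or appears in a lookup table. Those flags are extra inputs to the model, not extraction decisions. The pattern, the flavor list, and the function name token_features are made up for illustration, and Rasa's own featurizers differ in the details.

```python
import re

# Hypothetical pattern and lookup entries, for illustration only.
ZIP_CODE = re.compile(r"\d{5}")
FLAVORS = {"vanilla", "chocolate", "pistachio"}


def token_features(message):
    """Return one feature dict per whitespace-separated token."""
    features = []
    for token in message.split():
        word = token.strip(".,!?").lower()
        features.append(
            {
                "token": token,
                # Binary flags the model can weigh alongside its other features.
                "matches_zip_regex": bool(ZIP_CODE.fullmatch(word)),
                "in_flavor_lookup": word in FLAVORS,
            }
        )
    return features


rows = token_features("Ship pistachio to 10115 please")
for row in rows:
    print(row)
```

The model still has to learn from annotated training examples that a token with the in_flavor_lookup flag set is usually a flavor entity. Without examples that show the flag and the label together, the flag alone carries little weight.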

6. Leverage pre-trained entity extractors.

Names, dates, places, email addresses...these are entity types that would require a ton of training data before your model could start to recognize them. That's because there are a lot of possible values.

Instead of flooding your training data with a giant list of names, take advantage of pre-trained entity extractors. These models have already been trained on a large corpus of data, so you can use them to extract entities without training the model yourself.

There are two pre-trained entity extractors available in Rasa. The first is SpacyEntityExtractor, which is great for names, dates, places, and organization names. DucklingEntityExtractor is another option. It's used to extract amounts of money, dates, email addresses, times, and distances. You can find more info in the docs.

When using spaCy or Duckling to extract entities, you still need to include a few examples of sentences that contain the entities in your training data, but since entity training is happening outside of Rasa Open Source, you don't need to annotate the entities themselves in your training examples.

7. Always include an out-of-scope intent.

An out-of-scope intent is a catch-all for anything the user might say that's outside of the assistant's domain. If your assistant helps users manage their insurance policy, there's a good chance it's not going to be able to order a pizza. When an out-of-scope intent is detected, the assistant can reply with something like "That sounds interesting, but that's not a skill I've learned yet. Here's what you can ask me...." That's a much nicer user experience than "sorry, I don't understand."

It also takes the pressure off of the fallback policy to decide which user messages are in scope. While you should always have a fallback policy as well, an out-of-scope intent allows you to better recover the conversation, and in practice, it often results in a performance improvement.

8. Handle misspelled words.

It's a given that the messages users send to your assistant will contain spelling errors; that's just life. Many developers try to address this problem using a custom spellchecker component in their NLU pipeline. But we'd argue that your first line of defense against spelling errors should be your training data.

Spellcheckers vary wildly in quality. Some actually introduce more errors into user messages than they remove. Before turning to a custom spellchecker component, try including common misspellings in your training data, along with the NLU pipeline configuration below. This pipeline uses character n-grams in addition to word n-grams, which allows the model to take parts of words into account, rather than just looking at the whole word. By doing so, it can better recover from misspellings.

language: "en"

pipeline:
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
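  # Word-level features (default analyzer: word)
  - name: CountVectorsFeaturizer
  # Character n-grams within word boundaries, which help with misspellings
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4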
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

But you don't want to start adding a bunch of random misspelled words to your training data. That could get out of hand quickly! Instead, focus on the words that users actually commonly misspell. You can learn what these are by reviewing your conversations in Rasa X. If you notice that multiple users are searching for nearby "resteraunts," you know that's an important alternative spelling to add to your training data.
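To see why character n-grams help with a misspelling like "resteraunts," it's useful to compare the pieces the two spellings share. The sketch below is a simplified approximation of word-boundary character n-grams (each word padded with a space on both sides, n from 1 to 4). It is not the featurizer's actual code, and the helper names char_ngrams and overlap are hypothetical.

```python
def char_ngrams(word, min_n=1, max_n=4):
    """Return the set of character n-grams of a word padded with spaces on both sides."""
    padded = f" {word.lower()} "
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }


def overlap(a, b):
    """Jaccard similarity between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


correct, typo = "restaurants", "resteraunts"
print(f"shared n-grams: {sorted(char_ngrams(correct) & char_ngrams(typo))}")
print(f"overlap with typo: {overlap(correct, typo):.2f}")
print(f"overlap with unrelated word: {overlap(correct, 'pharmacy'):.2f}")
```

At the whole-word level the two spellings have nothing in common, but they share pieces like "rest" and "nts," so a model that sees character n-grams has something to work with even before the misspelling shows up in your training data.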

9. Treat your data like code.

You wouldn't write code without keeping track of your changes, so why treat your data any differently? Like updates to code, updates to training data can have a dramatic impact on the way your assistant performs. It's important to put safeguards in place to make sure you can roll back changes if things don't quite work as expected. No matter where you host your repository (GitHub, Bitbucket, GitLab, etc.), it's essential to track changes and centrally manage your code base, including your training data files.

Rasa X connects directly with your Git repository, so you can make changes to training data in Rasa X while properly tracking those changes in Git. Check out the docs on Integrated Version Control for more details.

10. Test your updates.

Finally, once you've made improvements to your training data, there's one last step you shouldn't skip. Testing ensures that things that worked before still work and your model is making the predictions you want.

The best way to incorporate testing into your development process is to make it an automated process, so testing happens every time you push an update, without having to think about it. We've put together a guide to automated testing, and you can get more testing recommendations in the docs.
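At its simplest, an automated check is a list of messages with the intents you expect, run against the model on every update. The sketch below illustrates the idea only. The function find_regressions, the keyword-based stub standing in for a trained model, and the test messages are all hypothetical, and this is not how Rasa's own test tooling is implemented.

```python
def find_regressions(predict_intent, test_cases):
    """Return (message, expected, predicted) for every test case the model gets wrong."""
    failures = []
    for message, expected in test_cases:
        predicted = predict_intent(message)
        if predicted != expected:
            failures.append((message, expected, predicted))
    return failures


# Stand-in for a trained model: a hypothetical keyword rule, used only so the sketch runs.
def stub_predict_intent(message):
    return "greet" if "hello" in message.lower() else "inform"


# Hypothetical held-out test messages with the intent each one should produce.
test_cases = [
    ("Hello there", "greet"),
    ("I'm a returning customer", "inform"),
    ("hey", "greet"),
]

failures = find_regressions(stub_predict_intent, test_cases)
for message, expected, predicted in failures:
    print(f"{message!r}: expected {expected}, got {predicted}")
```

In a CI job, a non-empty failure list would fail the build, so a change to the training data that breaks a previously working message gets caught before it reaches users.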

Conclusion

That's a wrap for our 10 best practices for designing NLU training data, but there's one last thought we want to leave you with. There's no magic, instant solution for building a quality data set.

That's because the best training data doesn't come from autogeneration tools or an off-the-shelf solution; it comes from real conversations that are specific to your users, assistant, and use case.

The good news is that once you start sharing your assistant with testers and users, you can start collecting these conversations and converting them to training data. Rasa X is the tool we built for this purpose, and it also includes other features that support NLU data best practices, like version control and testing. This method of growing your data set and improving your assistant based on real data is called conversation-driven development (CDD); you can learn more here and here.

Whether you're starting your data set from scratch or rehabilitating existing data, these best practices will set you on the path to better-performing models. Follow us on Twitter to get more tips, and connect in the forum to continue the conversation.