Customize how Rasa imports training data

TL;DR

Rasa 1.2 makes training data importing customizable
You can now implement your own importer by extending the TrainingDataImporter interface
This blog post gives you a quick example of how to do so

As an open-source framework for contextual AI assistants, Rasa has a diverse community all over the world. Individual developers, small teams, and large enterprises are deploying Rasa in a variety of settings and infrastructures. Delivering an out-of-the-box solution for every use case is tough. At Rasa we want to leverage the expertise of developers by giving them the opportunity to tailor Rasa to their use cases. Hence, we are building Rasa as a modular framework with customizable plug-and-play components. For example, you can add support for new platforms by adding a custom connector, or build your own NLU pipeline component.

The recently released Rasa 1.2 takes another step towards a fully customizable framework whose architecture can adapt to your individual requirements: you can now customize the way Rasa imports the data used to train the model.

Why is this useful?

Support data from different sources: FTP, your company's file server, end-to-end encrypted data fetching - you name it.
Support different data formats: you write your stories in Excel or model stories as a graph? No problem, write a parser and hit the train button.
Support different strategies: change the way Rasa collects training data - e.g. generate data on the fly.

How to Use It

Rasa comes with two different importers out of the box:

RasaFileImporter: the default importer in Rasa. It will use the arguments supplied through the command line interface to load data from the files at the given paths.
MultiProjectImporter: an experimental importer which allows you to split up your assistant into multiple subprojects. You can work on each project independently and then use the MultiProjectImporter to import the training data of all selected projects at training time.

If you want to use the default importer, you don't have to change anything. But what if you want to configure another one?

It's as easy as adding an importers key to your Rasa configuration file and listing each importer either by its name (if it's already included in Rasa) or by its full module path:

... # More Rasa configuration

importers:
- name: RasaFileImporter
- name: path.to.your.CustomImporter

Why a list of importers? Rasa automatically combines the data of the given importers at training time. In the given example, Rasa would combine the training data of both the RasaFileImporter and the CustomImporter.
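
To make the idea of combining importers more concrete, here is a minimal sketch. It does not show Rasa's actual implementation: StaticImporter and combine_nlu_data are hypothetical names, and NLU data is simplified to a dictionary that maps intents to example messages.

```python
import asyncio
from typing import Dict, List


class StaticImporter:
    """Hypothetical stand-in for an importer (not a Rasa class)."""

    def __init__(self, intents: Dict[str, List[str]]) -> None:
        self._intents = intents

    async def get_nlu_data(self) -> Dict[str, List[str]]:
        return self._intents


async def combine_nlu_data(importers: List[StaticImporter]) -> Dict[str, List[str]]:
    """Load from all importers concurrently and merge the examples per intent."""
    results = await asyncio.gather(*(i.get_nlu_data() for i in importers))
    merged: Dict[str, List[str]] = {}
    for data in results:
        for intent, examples in data.items():
            merged.setdefault(intent, []).extend(examples)
    return merged


# Assumed toy data: one importer for local files, one custom importer.
file_importer = StaticImporter({"greet": ["hi", "hello"]})
custom_importer = StaticImporter({"greet": ["hey"], "bye": ["see you"]})
combined = asyncio.run(combine_nlu_data([file_importer, custom_importer]))
```

In this sketch the examples for an intent that appears in both importers end up in one list, which is the behavior you would typically want from a combined data set.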

How to Implement Your Own Importer

Every importer has to satisfy the interface of the TrainingDataImporter class. This class provides the Rasa configuration, domain, NLU data, and stories.

To demonstrate how you can implement your own importer, this blog post guides you through the implementation of a data importer which loads data from a GitHub repository. For the sake of simplicity, it's a very naive implementation and assumes that the repositories follow the default Rasa directory layout. Let's start.

What you need:

Rasa >= 1.2.0 installed
Git on your machine
The PyGithub package installed

You can find the complete source code here.

The TrainingDataImporter Interface

Start by building a class which extends the TrainingDataImporter interface:

You have to implement four methods:

get_stories: provides the training data to train the dialogue model
get_nlu_data: provides the training data to train the NLU model
get_domain: provides the domain of your assistant
get_config: provides configuration for the dialogue and NLU model training

Each function is async, which means you can use Python's asyncio module to work with modern IO frameworks and speed up the data loading.
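
The shape of such a class, and the benefit of async methods, can be sketched without Rasa installed. ImporterSketch below is a simplified stand-in and not Rasa's TrainingDataImporter: the four method names come from the list above, while the parameter-free signatures and the plain-container return types are assumptions made to keep the example self-contained.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ImporterSketch(ABC):
    """Simplified stand-in for the interface (not Rasa's actual class)."""

    @abstractmethod
    async def get_stories(self) -> List[str]:
        """Return the training data for the dialogue model."""

    @abstractmethod
    async def get_nlu_data(self) -> Dict[str, List[str]]:
        """Return the training data for the NLU model."""

    @abstractmethod
    async def get_domain(self) -> Dict[str, Any]:
        """Return the domain of the assistant."""

    @abstractmethod
    async def get_config(self) -> Dict[str, Any]:
        """Return the configuration for dialogue and NLU training."""


class InMemoryImporter(ImporterSketch):
    """Toy implementation; asyncio.sleep(0) marks where real IO would go."""

    async def get_stories(self) -> List[str]:
        await asyncio.sleep(0)
        return ["greet -> utter_greet"]

    async def get_nlu_data(self) -> Dict[str, List[str]]:
        await asyncio.sleep(0)
        return {"greet": ["hi", "hello"]}

    async def get_domain(self) -> Dict[str, Any]:
        await asyncio.sleep(0)
        return {"intents": ["greet"], "actions": ["utter_greet"]}

    async def get_config(self) -> Dict[str, Any]:
        await asyncio.sleep(0)
        return {"language": "en"}


async def load_all(importer: ImporterSketch) -> Dict[str, Any]:
    # Because every method is a coroutine, the four loads can overlap.
    stories, nlu_data, domain, config = await asyncio.gather(
        importer.get_stories(),
        importer.get_nlu_data(),
        importer.get_domain(),
        importer.get_config(),
    )
    return {"stories": stories, "nlu": nlu_data, "domain": domain, "config": config}


loaded = asyncio.run(load_all(InMemoryImporter()))
```

With real network or disk access in place of asyncio.sleep(0), asyncio.gather lets the slow calls wait at the same time instead of one after another.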

Implementing the Interface

Getting Files From GitHub
Start by connecting to the GitHub repository. An easy way to do so is to use the PyGithub library:

When the importer is loaded, Rasa passes in the file paths which the user provided through the command line interface (config_path, domain_path, training_data_paths). In this case they are not required, so you can safely ignore them. The parameter you need is the repository parameter, which you can later specify in the importers section of the configuration file.
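
As a rough illustration of how the configuration entry and the constructor fit together, consider the sketch below. Only the parameter names (config_path, domain_path, training_data_paths, repository) come from the text above. GitImporterSketch and build_importer are hypothetical, and forwarding every key except name as a keyword argument is an assumption about the mechanism, not Rasa's actual loading code.

```python
from typing import Any, Dict, List, Optional


class GitImporterSketch:
    """Hypothetical constructor shape for an importer like the one described."""

    def __init__(
        self,
        config_path: Optional[str] = None,
        domain_path: Optional[str] = None,
        training_data_paths: Optional[List[str]] = None,
        repository: str = "",
    ) -> None:
        # The command line paths are accepted but deliberately ignored,
        # because all data comes from the repository.
        self.repository = repository


def build_importer(
    entry: Dict[str, Any],
    config_path: str,
    domain_path: str,
    training_data_paths: List[str],
) -> GitImporterSketch:
    """Assumed mechanism: pass every key except `name` as a keyword argument."""
    extra = {key: value for key, value in entry.items() if key != "name"}
    return GitImporterSketch(config_path, domain_path, training_data_paths, **extra)


# One entry of the `importers` list, as it might look after the YAML is parsed.
entry = {"name": "git_importer.GitImporter", "repository": "rasahq/rasa-demo"}
importer = build_importer(entry, "config.yml", "domain.yml", ["data"])
```

The takeaway is that any extra key you put next to name in the configuration needs a matching constructor parameter in your importer.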

After connecting to the repository, the importer searches for the stories and NLU files in the data folder of the repository and stores any files it finds in a temporary directory on your machine. Storing the files on disk makes parsing a bit easier because you can reuse functions which are already part of Rasa, and it also gives you better performance when you need the actual content of a file later. Finally, use a Rasa function called get_core_nlu_files to separate the story files from the NLU files:
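
The two steps described above (store the downloaded files on disk, then separate stories from NLU data) can be sketched with the standard library alone. Both helper functions are hypothetical. In particular, split_story_and_nlu_files is only a crude stand-in for Rasa's get_core_nlu_files: it assumes Markdown training data and treats any file containing an "## intent:" section as NLU data.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def store_in_temp_dir(files: Dict[str, str]) -> List[str]:
    """Write {file name: content} pairs to a fresh temporary directory."""
    directory = tempfile.mkdtemp()
    paths = []
    for name, content in files.items():
        path = os.path.join(directory, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
        paths.append(path)
    return paths


def split_story_and_nlu_files(paths: List[str]) -> Tuple[List[str], List[str]]:
    """Illustrative heuristic only: files with an "## intent:" section
    count as NLU data, everything else counts as stories."""
    story_files: List[str] = []
    nlu_files: List[str] = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            is_nlu = "## intent:" in handle.read()
        (nlu_files if is_nlu else story_files).append(path)
    return story_files, nlu_files


# Assumed toy content standing in for files fetched from the repository.
downloaded = {
    "nlu.md": "## intent:greet\n- hi\n- hello\n",
    "stories.md": "## happy path\n* greet\n  - utter_greet\n",
}
story_files, nlu_files = split_story_and_nlu_files(store_in_temp_dir(downloaded))
```

Keeping the two lists of paths on the importer means the later get_stories and get_nlu_data calls only have to read local files.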

get_stories
get_stories returns the StoryGraph which contains the parsed training stories. It further gets a couple of parameters which you can pass on to the Rasa StoryFileReader. If you are implementing your own parsing algorithm, you can also decide to ignore these parameters.

Since you already collected the story files in a list, you can simply point the StoryFileReader to it. The StoryFileReader reads the files and parses their content into StorySteps. The reader does not return a plain list of StorySteps but a coroutine, so you have to await it to get the result. If you are using a custom parser which is not asynchronous, you don't need await. Finally, wrap the list of story steps in a StoryGraph and return it.
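
If the difference between an asynchronous reader and a synchronous parser is unclear, the following sketch may help. None of these names are Rasa APIs: read_story_steps_async, parse_story_steps_sync, and StoryGraphSketch are hypothetical stand-ins for the reader, a custom parser, and the StoryGraph.

```python
import asyncio
from typing import List, Tuple


async def read_story_steps_async(lines: List[str]) -> List[str]:
    """Stand-in for an asynchronous reader; calling it returns a coroutine."""
    await asyncio.sleep(0)  # placeholder for non-blocking file IO
    return [line.strip() for line in lines if line.strip()]


def parse_story_steps_sync(lines: List[str]) -> List[str]:
    """Stand-in for a custom synchronous parser; returns the list directly."""
    return [line.strip() for line in lines if line.strip()]


class StoryGraphSketch:
    """Minimal wrapper around the parsed steps."""

    def __init__(self, story_steps: List[str]) -> None:
        self.story_steps = story_steps


async def get_stories(lines: List[str]) -> Tuple[bool, StoryGraphSketch]:
    pending = read_story_steps_async(lines)  # nothing has been parsed yet
    was_coroutine = asyncio.iscoroutine(pending)
    steps = await pending  # await yields the actual list of steps
    return was_coroutine, StoryGraphSketch(steps)


raw = ["* greet", "  - utter_greet", ""]
was_coroutine, graph = asyncio.run(get_stories(raw))
sync_steps = parse_story_steps_sync(raw)  # no await needed here
```

Forgetting the await would hand a coroutine object to the graph instead of a list, which is an easy mistake to make when mixing both styles.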

get_nlu_data
This method returns the NLU training data as a TrainingData object. The method receives a language parameter which can be used to distinguish between training data for multiple languages. There is a handy function in rasa.importers.utils which makes reading the NLU data very short:

get_domain
Now you have to get the Domain of the assistant. For simplicity, it is assumed that the domain is stored as YAML in a file domain.yml in the repository. Using this you can pull the file from GitHub, parse its content as text, and load the domain from it.
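
The "pull the file and parse its content as text" step might look roughly like this. The method name get_contents and the attribute decoded_content mirror PyGithub as far as we know, so please check them against the PyGithub documentation. FakeRepository and FakeContentFile are offline stand-ins that let the sketch run without network access, and fetch_text is a hypothetical helper. Turning the text into a Domain object is left to Rasa and is not shown.

```python
from typing import Any, Dict


class FakeContentFile:
    """Offline stand-in exposing the one attribute the sketch reads."""

    def __init__(self, raw: bytes) -> None:
        self.decoded_content = raw


class FakeRepository:
    """Offline stand-in for a repository object; holds files in a dict."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def get_contents(self, path: str) -> FakeContentFile:
        return FakeContentFile(self._files[path])


def fetch_text(repository: Any, path: str) -> str:
    """Fetch one file from the repository and decode its bytes as UTF-8."""
    return repository.get_contents(path).decoded_content.decode("utf-8")


# Assumed toy content for the domain file.
repo = FakeRepository({"domain.yml": b"intents:\n- greet\n"})
domain_text = fetch_text(repo, "domain.yml")
```

The same helper would work for config.yml, since both files are plain YAML text in the repository.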

get_config
The last missing piece is the model configuration, which is a simple Python dictionary. It defines your dialogue policies and NLU components. Similar to the domain file, the assumption is that the configuration is stored in a file called config.yml:

Training with the GitImporter

Now it's time to take the importer for a test drive. Put the source code in a file called git_importer.py in your Rasa project directory. Then add these lines to your configuration file:

importers:
- name: "git_importer.GitImporter"
  repository: "rasahq/rasa-demo"

This will get the training data from the rasa-demo repository, but you can also use any other public repository that follows the default Rasa project layout.

Finally, you can simply execute rasa train to train a bot with the training data from the GitHub repository (since rasa-demo is a complex bot, training might take a while 😀).

Where to Go From Here

Rasa is a framework for makers. Being developers ourselves, we acknowledge the need to customize and extend software for your individual use case. While you know what's required to do so, it's our job to make this as quick and easy as possible. As of now, you can already plug the following components into Rasa:

Is anything missing from this list that you need to use Rasa in your environment? Then please create a feature request for it on GitHub, and we will discuss the details with you in the issue.

This tutorial gave you an overview of how to use the new TrainingDataImporter interface to implement your own data importer. We are stoked to see where you take this. Please share your experiences, implementations and ideas on the Rasa forum so that we can take Rasa to the next level - together.