How to make a Custom Printer Component in Rasa NLU

Rasa offers many useful components for building a digital assistant, but sometimes you may want to write your own. This document is part of a series in which we will create increasingly complex components from scratch. In this document we will keep it relatively simple: we're going to create a Printer component that prints all the features that the pipeline has created so far. This will allow us to appreciate how components add features to the pipeline, and it will also serve as a nice introduction to creating custom components.

Note that this tutorial was made with Rasa version 1.10.0 in mind.

Example Project

You can clone the repository found here if you'd like to run the same code. The repository contains a relatively small Rasa project; we're only dealing with four intents and one entity. Here are some of the files in the project:

data/nlu.md

## intent:greet
- hey
- hello
...

## intent:goodbye
- bye
- goodbye
...

## intent:bot_challenge
- are you a bot?
- are you a human?
...

## intent:talk_code
- i want to talk about [python](proglang)
- Code to ask yes/no question in [javascript](proglang)
...

data/stories.md

## just code
* talk_code
  - utter_proglang

## check bot
* bot_challenge
  - utter_i_am_bot
* goodbye
  - utter_goodbye

## hello and code
* greet
    - utter_greet
* talk_code{"proglang": "python"}
    - utter_proglang

Once we call rasa train on the command line, these files will be used as training data for our machine learning pipeline. You can see the definition of this pipeline in the config.yml file.

config.yml

language: en

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
- name: DIETClassifier
  epochs: 20
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
  - name: MappingPolicy

The goal of this tutorial is to add our own component to this file.

Printing Context

Let's make a component that will help with debugging. The goal of the component will be to print all the available information known at a certain point in the pipeline. This way, our new pipeline may look something like this:

config.yml

language: en

pipeline:
- name: WhitespaceTokenizer
- name: printer.Printer 
  alias: after tokenizer
- name: CountVectorsFeaturizer
- name: printer.Printer
  alias: after 1st cv
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: printer.Printer
  alias: after 2nd cv
- name: LexicalSyntacticFeaturizer
- name: printer.Printer
  alias: after lexical syntactic featurizer
- name: DIETClassifier
  epochs: 20
- name: printer.Printer
  alias: after diet classifier
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
  - name: MappingPolicy

Let's note a few things.

  1. We've added new steps with the name printer.Printer. This is a custom component that we'll create.
  2. We've placed the printer.Printer component after each featurization step. The goal is that this component prints what information is created in each step.
  3. We've also placed the printer.Printer component after the DIETClassifier step. This should allow us to directly see the model output.
  4. The custom component takes an argument alias that allows us to give it an extra name. This means that the component that we'll create needs to be able to read in parameters passed in config.yml.
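The fourth point is worth dwelling on: settings from config.yml are merged with the component's `defaults` dictionary before they reach the component. The sketch below illustrates that idea in plain Python; it is a simplified stand-in, not Rasa's actual merging code (which lives in the Component base class).

```python
# A simplified sketch (not Rasa's actual implementation) of how a component's
# `defaults` dict is merged with the settings found in config.yml.
defaults = {"alias": None}

def merge_config(defaults, user_config):
    # Settings from config.yml override the component defaults.
    merged = dict(defaults)
    merged.update(user_config)
    return merged

# Roughly what the component receives when config.yml contains
# `alias: after tokenizer` for this pipeline step:
component_config = merge_config(
    defaults, {"name": "printer.Printer", "alias": "after tokenizer"}
)
print(component_config["alias"])  # after tokenizer
```

Because `alias` has a default of `None`, the component still works when you leave the setting out of config.yml entirely.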

Making the printer.Printer Component

The schematic below shows the lifecycle of components in Rasa.

Our own custom component will be a Python object and it will need to implement some of the methods that you see in the diagram. We will create a new file called printer.py in the project directory to put the new Printer component in. Note that this is also how config.yml is able to find the printer.Printer component: the module is named printer and the class is named Printer. To get started writing the component I took the example from the documentation and made some changes to it.

printer.py

import typing
from typing import Any, Optional, Text, Dict, List, Type

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.tokenizers.tokenizer import Token

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


def _is_list_tokens(v):
    """
    This is a helper function.
    It checks if `v` is a list of tokens.
    If so, we need to print differently.
    """
    if isinstance(v, List):
        if len(v) > 0:
            if isinstance(v[0], Token):
                return True
    return False


class Printer(Component):
    
    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return []

    defaults = {"alias": None}
    language_list = None

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        pass

    def process(self, message: Message, **kwargs: Any) -> None:
        if self.component_config['alias']:
            print("\n")
            print(self.component_config['alias'])
        for k, v in message.data.items():
            if _is_list_tokens(v):
                print(f"{k}: {[t.text for t in v]}")
            else:
                print(f"{k}: {v.__repr__()}")

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""

        if cached_component:
            return cached_component
        else:
            return cls(meta)

Most of the code in this file is exactly the same as what you will find in the documentation. Let's observe a few things here.

  1. We've created a Printer object that inherits from  rasa.nlu.components.Component.
  2. This component does not depend on other components. You can confirm this by looking at the required_components method. If this component were a CountVectorizer, it would depend on tokens being present, and this method would be the place where you would specify that.
  3. Right after this method we declare defaults = {"alias": None}. This sets the default value for the alias setting that we could set in the config.yml.
  4. Right after this statement we declare language_list = None. This means that the component does not depend on a language. It's important to note that some components only work for certain languages. For example, the ConveRTFeaturizer will only work for the English language.
  5. The load, persist and train methods are untouched and are also not relevant for this component. Since we're merely printing there's no need for a training phase or a phase where we load/store everything we've trained on disk.

The main change that we've made is in the process method which we'll zoom in on below.

def process(self, message: Message, **kwargs: Any) -> None:
    if self.component_config['alias']:
        print("\n")
        print(self.component_config['alias'])
    for k, v in message.data.items():
        if _is_list_tokens(v):
            print(f"{k}: {[t.text for t in v]}")
        else:
            print(f"{k}: {v.__repr__()}")

The process method of the Component object is where all the logic gets applied. In our case this is where all the printing happens. We can access all the available data by inspecting the message that the method receives. In particular, we peek inside of message.data and iterate over all the items. These all get printed.
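To make the mechanics concrete, here is a tiny stand-in for Rasa's Message object, just to show what the loop in process iterates over. FakeMessage is hypothetical; real code would use rasa.nlu.training_data.Message instead.

```python
# A hypothetical stand-in for Rasa's Message, only to illustrate what
# Printer.process iterates over. Real code would use
# rasa.nlu.training_data.Message.
class FakeMessage:
    def __init__(self, data):
        # Rasa stores all accumulated pipeline information in `message.data`.
        self.data = data

def show(message):
    # The same loop as in Printer.process, minus the token special-casing.
    lines = []
    for k, v in message.data.items():
        lines.append(f"{k}: {repr(v)}")
    return lines

msg = FakeMessage({"intent": {"name": None, "confidence": 0.0}, "entities": []})
for line in show(msg):
    print(line)
```

Every component that runs before the Printer has had a chance to add keys to this dictionary, which is exactly what we will see in the output below.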

See the Effect

If you now train this system you should be able to see the effect. Let's train and run it.

> rasa train
> rasa shell

When you now talk to the assistant you'll see extra printed lines appear. When we type hello there you should see messages being printed from each printer.Printer component in our pipeline. We'll go over all of them.

After the Tokenizer

This is the information that we see right after tokenisation. Note that the alias setting is printed here.

after tokenizer
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']

Also note that we have three tokens. The __CLS__ token is a special token that summarises the entire sentence.

After the first CountVectorizer

We now see that there are some sparse text features that have been added.

after 1st cv
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x272 sparse matrix of type '<class 'numpy.int64'>'
        with 4 stored elements in COOrdinate format>

Note the size of the sparse matrix. We keep track of features for three tokens,
one of which is the __CLS__ token.
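The shape 3x272 can be read as: one row per token ('hello', 'there', '__CLS__') and one column per word in the training vocabulary (272 here; the exact number depends on your nlu.md). The toy bag-of-words sketch below, over a made-up five-word vocabulary, shows the idea; the vocabulary is an assumption for illustration.

```python
# Toy sketch of what the CountVectorsFeaturizer computes: one row per token,
# one column per vocabulary word. The vocabulary below is made up.
vocab = ["bye", "goodbye", "hello", "hey", "there"]
tokens = ["hello", "there", "__CLS__"]

def bag_of_words(token, vocab):
    # 1 in the column whose vocabulary word matches the token, 0 elsewhere.
    return [1 if token == word else 0 for word in vocab]

rows = [bag_of_words(t, vocab) for t in tokens]
print(rows[0])  # [0, 0, 1, 0, 0] -> 'hello' hits one vocabulary column
```

In the real pipeline these rows are stored as a scipy sparse matrix, since almost every entry is zero.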

After the second CountVectorizer

We now see that more sparse text features have been added. Because the settings specify character n-grams of length 1 to 4, we add roughly 2300 extra features per token by doing so.

after 2nd cv
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x2581 sparse matrix of type '<class 'numpy.longlong'>'
        with 80 stored elements in COOrdinate format>
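The jump from 272 to 2581 columns comes from counting character n-grams inside word boundaries. A minimal sketch of such an analyzer is below; it pads each word with spaces and slides windows of length 1 to 4 over it, in the spirit of the char_wb analyzer (this is an illustrative re-implementation, not Rasa's or scikit-learn's actual code).

```python
# Sketch of a char_wb-style analyzer: pad the word with spaces, then slide
# windows of each n-gram length over it. Illustrative only.
def char_ngrams(word, min_n=1, max_n=4):
    padded = f" {word} "
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("hey", 2, 3))
```

Every distinct n-gram seen during training becomes one extra column in the sparse matrix, which is why the feature count grows so quickly.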

After the LexicalSyntacticFeaturizer

The LexicalSyntacticFeaturizer adds another 24 features per token.

after lexical syntactic featurizer
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x2605 sparse matrix of type '<class 'numpy.float64'>'
        with 112 stored elements in COOrdinate format>

Note that the features for the __CLS__ token at this point are the sum of the sparse features of all the other tokens. Since all the features are sparse, this is a reasonable way to summarise the features of all the words into a single set of features that represents the entire utterance.

After the Diet Classifier

All the sparse features went into the DIETClassifier and this produced some
output. You can confirm that the pipeline now actually produces an intent.

after diet classifier
intent: {'name': 'greet', 'confidence': 0.9849509000778198}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x2605 sparse matrix of type '<class 'numpy.float64'>'
        with 112 stored elements in COOrdinate format>
intent_ranking: [{'name': 'greet', 'confidence': 0.9849509000778198}, {'name': 'talk_code', 'confidence': 0.008203224278986454}, {'name': 'goodbye', 'confidence': 0.005775876808911562}, {'name': 'bot_challenge', 'confidence': 0.0010700082639232278}]

If you were now to utter i want to talk about python you should see similar lines being printed but at the end you will now also see that entities have been detected too.
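Note that the reported intent is simply the highest-confidence entry in intent_ranking. Using the confidences from the output above:

```python
# The final intent is the entry of intent_ranking with the highest confidence.
intent_ranking = [
    {"name": "greet", "confidence": 0.9849509000778198},
    {"name": "talk_code", "confidence": 0.008203224278986454},
    {"name": "goodbye", "confidence": 0.005775876808911562},
    {"name": "bot_challenge", "confidence": 0.0010700082639232278},
]

top = max(intent_ranking, key=lambda d: d["confidence"])
print(top["name"])  # greet
```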

Conclusion

So what have we seen in this guide?

  • We've seen how to create a custom component that can read in settings from config.yml.
  • We've seen what features the component receives by looking at the output from the printer.Printer.
  • We've seen that the Rasa components continuously add information to the message that is passed along.

You may want to think twice about using this in production though. The printer.Printer is great when you're writing custom components because you can see its effects on the messages. The downside is that every time you add a printer.Printer to the pipeline you'll need to call rasa train to see the effects. All the print statements might also flood your logs, so it's best to keep this component for local development.

Feel free to use the example project found here to start playing around with this custom component.