Evaluating NLU Services: Conversational Question Answering System

We talked to Daniel Braun from the Technical University of Munich about his SigDial paper evaluating NLU services for conversational question answering systems. Learn about his research and his thoughts on the future of NLU.

Daniel, tell us a bit about yourself. What are your main research areas, and how did you end up in this field?

I'm a PhD student at the chair of Software Engineering for Business Information Systems at TU Munich, and my main research area is natural language generation (NLG), but I'm also interested in natural language processing in general. I studied computer science with a minor in computational linguistics at Saarland University, so I was always interested in language processing. The focus on NLG developed when I was a research student at the University of Aberdeen. More recently, I have also been working in the area of conversational interfaces and chatbots, since they combine both aspects: understanding and generating natural language.

So tell us about your SigDial paper: "Evaluating Natural Language Understanding Services for Conversational Question Answering Systems" - what questions does it try to answer? Why is this important to the research community?

This paper is the result of a cooperation project with the Corporate Technology department at Siemens, called Vertical Social Software (https://wwwmatthes.in.tum.de/pages/2lilqthsigbu/Vertical-Social-Software-VSS). As part of this project, my colleague Adrian Hernandez-Mendez and I do a lot of prototyping, and we try out new technologies and tools for building social software in a corporate environment. One of the prototypes we built was a chatbot, which we wanted to implement using a Natural Language Understanding (NLU) service. There are a lot of these services available right now, so we checked the scientific literature to find out which of the services would be the best choice. We soon realized that although many research projects are using NLU services, most of them don't really explain why they chose one service over another. However, this choice can potentially influence the performance of the whole system. Therefore, we wanted to find a way to compare different services and their classification performance, in order to enable scientists, but also developers from industry, to make a more educated decision about which service they want to use and how this decision might influence their results.

How did you collect the training data used for the benchmark?

For our evaluation, we used two different corpora. One consisted of data that was collected by a Telegram chatbot for public transport in Munich; the other was extracted from the StackExchange platforms "Ask Ubuntu" and "Web Applications" and tagged with Mechanical Turk. Both datasets are available online (https://github.com/sebischair/NLU-Evaluation-Corpora).

What are the results of your evaluation?

As we expected, we did find differences in the classification quality of the different services, depending on the corpus and hence the domain. A bit more surprising for most people was the fact that Rasa performed better than most of the proprietary services, except for LUIS from Microsoft. However, that's only for the corpora we've tested, and the results are not generalizable to other domains. The aim of our paper is not to suggest a specific service, but to encourage people to evaluate the existing alternatives for their use case and to show them what such an evaluation could look like.
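
Editor's note: as a rough illustration of comparing classification quality, here is a minimal sketch that scores one service's intent predictions against gold labels. It assumes per-intent precision, recall, and F1 as the metrics, which the interview does not specify. The function name `per_intent_scores`, the intent labels, and the example predictions are hypothetical and are not taken from the published corpora or from any service's output. Collecting the predictions from each NLU service is left out.

```python
from collections import Counter


def per_intent_scores(gold, predicted):
    """Return {intent: (precision, recall, f1)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if gold_label == predicted_label:
            true_pos[gold_label] += 1
        else:
            # A wrong prediction is a false positive for the predicted intent
            # and a false negative for the intent that should have been chosen.
            false_pos[predicted_label] += 1
            false_neg[gold_label] += 1

    scores = {}
    for intent in sorted(set(gold) | set(predicted)):
        tp, fp, fn = true_pos[intent], false_pos[intent], false_neg[intent]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[intent] = (precision, recall, f1)
    return scores


# Hypothetical gold labels and hypothetical predictions from one service.
gold = ["FindConnection", "FindConnection", "DepartureTime", "DepartureTime"]
service_a = ["FindConnection", "DepartureTime", "DepartureTime", "DepartureTime"]

scores = per_intent_scores(gold, service_a)
for intent, (precision, recall, f1) in scores.items():
    print(f"{intent}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Running the same function over the predictions of each candidate service, on test utterances from your own domain, gives a like-for-like comparison.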

What do you think are the most important challenges in NLU in the next couple of years?

It will be very interesting to see what influence deep learning will have on NLU. So far, the results we saw for NLP applications were not as impressive as for computer graphics. I'm also happy to see that recently, more and more people realize that machine learning doesn't make decades of research about language obsolete. I hope that in the future we will see more approaches combining machine learning with language theory and rules, because I think such hybrid approaches will lead to better results, especially for cases where there are no huge training datasets available.

Looking beyond NLU - what are the most important tools you'd like to have to support your research?

The progress we saw around NLU in recent years is very impressive, not only on an academic level, but also with regard to usability and accessibility. It's really easy to use modern NLU services, even without any programming skills. Other parts of the chatbot pipeline have not kept up with this development. I think there is still a lot of work to do when we leave the level of a single message and come to the discourse level. And for the NLG part, we're missing the accessibility and usability that make the NLU services so popular. I hope I can contribute to making NLG more accessible for chatbot creators.

Thanks!