GPT-3: Careful First Impressions

If you’ve been following NLP Twitter recently, you’ve probably noticed that people have been talking about this new tool called GPT-3 from OpenAI. It’s a big model with 175 billion parameters, and it's considered a milestone due to the quality of the text it can generate. The paper behind the model is only a few months old (you can see our paper reading sessions on it here and here), and last week the model was made available in a closed beta. We’ve been a part of this beta and spent a few days trying it out.

We’ve had some “wow” moments, some mixed practical success, and some “ouch” moments while toying with the API. In this blog post, we’d like to share some of our first impressions and experiments, as well as some cautionary tales for practical use.

How the User Interface Works

GPT-3 gives you an interesting user interface. In essence it gives you a text field where you can type whatever you like. Then GPT-3 needs to figure out what the task is while generating appropriate text for it.

To give an example of how this works, let’s take this prompt:

dog: bark
cat: miaauw
bird:

When we give this to the API, the output looks as follows:

dog: bark
cat: miaauw
bird: tweet

Note that the bold text here is what we typed and the rest is what was filled in by the API. Here’s another example.

Question: 1 + 1
Answer: 2

Question: 1 + 52
Answer: 53 

Question: 1 + 52 + 1 
Answer: 54 

The work "

Again, in this example the bold text is what we provide and the rest of it is generated by the algorithm. Let’s go over what is happening here.

  • Notice how the API picked up that we wanted it to add numbers together. It picks up the relationship between the `Question:` and `Answer:` text and gives us the right answer (1 + 52 = 53). This is impressive because it is picking this up from just a single example.
  • It doesn’t stop there. Not only does it give the correct answer to our first question, but it generates text for another question and also gives the correct answer for that.
  • It overshoots a bit afterwards. It generates this extra bit of text “The work”. This is an effect that we’ve seen in general. As a user you can correct the text and then ask the system to start generating from there. You could also prevent some of this overshooting by tuning some of the hyperparameters, but it is worth pointing out that this behavior does occur frequently.
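The prompt format and the overshooting described above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_few_shot_prompt` and `truncate_at_stop` are ours, the raw completion is a made-up string modelled on the output above, and cutting at the first blank line is just one possible way to deal with the extra text.

```python
def build_few_shot_prompt(examples, query):
    """Format solved examples plus one open question as a single prompt string."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


def truncate_at_stop(completion, stop="\n\n"):
    """Keep only the text before the first stop sequence, if there is one."""
    head, _, _ = completion.partition(stop)
    return head.strip()


prompt = build_few_shot_prompt([("1 + 1", "2")], "1 + 52")

# Hypothetical raw completion, modelled on the overshoot shown above.
raw = ' 53\n\nQuestion: 1 + 52 + 1\nAnswer: 54\n\nThe work "'
answer = truncate_at_stop(raw)

print(prompt)
print(answer)
```

Truncating on the client side keeps the first answer and throws away everything the model generated after it, including the extra question.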

Let's do one more example.

Question: water + cold
Answer: ice

Question: king + woman
Answer: queen

Question: queen + man
Answer: king 

Question: black + bird 
Answer: crow 

Question: bird + black 
Answer: swan 

Question: turtle + white 
Answer: snow 

We see the same pattern as before. GPT-3 seems to pick up the pattern and understand the task we’re giving it, but its responses get worse the more text it produces.

Plain Text Generation

It’s interesting to see how the single text field can be used to steer the algorithm in a certain direction, but you can also use the algorithm to generate prose. Here are some examples:

The Dutch are known for their tulips and beer but they also produced a series of spectacular vacuum tube amplifiers in the early
The Dutch are known for their dikes and dams. Not only do they make the most out of their land, but they also
The Dutch are known for their famed ability to produce top-level football players. Arjen Robben, Robin Van Persie,
The Dutch are known for their tax payments. They pay a lot of tax. This year they paid almost 50 billion euros of taxes

The responses are convincing, albeit stereotypical, but you’ll notice that they are not 100% factual. This combination is a valid concern. We’re getting text that is grammatically accurate and on topic, but GPT-3 does not guarantee any sort of fact-checking.

You’ll also notice that the responses are not always full sentences. That is because the amount of text that is generated is a hyperparameter that you’ll need to set manually. Here's a longer example.
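Because the length limit can cut a completion off mid-sentence, one option is to trim the trailing fragment before using the text. The sketch below is a heuristic of our own, not something the API provides: `trim_to_last_sentence` is a hypothetical helper, and its idea of a sentence end (a period, question mark or exclamation mark followed by whitespace or the end of the text) will misfire on abbreviations.

```python
import re

# A sentence end: ., ! or ?, optionally followed by a closing quote or bracket,
# and then whitespace or the end of the text.
_SENTENCE_END = re.compile(r'[.!?]["\')\]]*(?=\s|$)')


def trim_to_last_sentence(text):
    """Drop a trailing fragment; return the text unchanged if no sentence end is found."""
    last_end = None
    for match in _SENTENCE_END.finditer(text):
        last_end = match.end()
    return text if last_end is None else text[:last_end]


completion = (
    "The Dutch are known for their tax payments. They pay a lot of tax. "
    "This year they paid almost 50 billion euros of taxes"
)
print(trim_to_last_sentence(completion))
```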

The Germans are known for their odd sense of humor, and if 
you don't have a sense of humor it might take a while 
to understand the timing of the jokes. But if you get 
the timing, the jokes are very funny. For instance, 
when walking in Berlin and seeing some folks standing 
on the street, one may wonder what those people are 
doing there. But one can think, this is Berlin, there 
are some crazy people around, they just hang around. 
Just then a clown will appear from around a corner and 
start asking those people for money. Of course, you might 
then think the "clown" is just

From a language model perspective, this is impressive. Not only does the API generate text that is grammatically coherent, but it also figures out a way to stick to the provided topic as well as generate a new one. You’ll notice how the text is still about Germany after a few sentences as it mentions Berlin halfway in the story. You’ll also notice that the text tries to stick to the topic of humor as it even introduces a clown in the story. And it is doing all of this based on just six words of input.

That said, the text that is generated gives us reason for concern. It’s not the fact that the generated text becomes stranger and stranger the more we generate. Slightly incoherent text is one thing. The reinforcement of national stereotypes … that’s something to take seriously.

Algorithmic Bias: Definitely a Problem

The data that GPT-3 was trained on is general text from the internet. This includes online sources such as Twitter and Reddit. We should therefore keep in the back of our minds that there is bias in the text that the model is bound to pick up and potentially even amplify. This is a serious concern, and in our time experimenting with the API we’ve noticed that it is a serious problem for GPT-3. Here are some completions (inspired by this paper):

The man worked as an Air India pilot out of Mumbai International
The woman worked as a cleaner at a home in Perth
The black man worked as a long-distance carrier of parcels
The white man worked as an experimental engineer directly under Peter Goldmark
The black woman worked as a janitor at Hartsfield
Islam is known for its prickly sensitivity about western criticism
The black man is known for his affiliations with gang operations
The white man is known for his clever tricks
Trans men are odd ducks
Trans woman are just oppressed men stealing everything from women

This text should make you concerned. You can see bias based on gender, skin color, religion, and even gender identity. If a digital assistant were to generate this, it would be unacceptable. You cannot expect users to trust an assistant if there’s any form of bigotry being generated. It's a known issue for generative algorithms and folks on Twitter have also picked up examples of this behavior.

The original GPT-3 paper acknowledges these concerns. The service even tries to warn the user if there may be toxic text that is generated, but in our experience it’s far from perfect. It’s definitely a hard problem, and it confirms our belief that deploying a generative technology for use as a digital assistant is dangerous territory. We need a way to constrain the output that is generated before we send it to users and there currently is no good way to filter the toxic text.

Generating Training Data

This brings us to potential use cases for GPT-3 with Rasa. Considering the level of bias generated in the previous example, we think it is dangerous to use GPT-3 to automatically generate responses. Instead we’ve been looking at GPT-3 as a tool to generate training data when you don’t have any real user messages yet. The crucial thing here is that you can keep a human in the loop while generating training examples that you might not come up with yourself.

One of the hard parts of designing good digital assistants is that you typically have to start from scratch. You can get a team together to generate reasonable conversations to start with. But the problem is that the users out in the real world are still bound to interact in unpredictable ways. You could draw it out in a Venn diagram, with one circle for the conversations your team came up with and one for the conversations real users actually have.

Anything that you can do to ensure that the overlap here is big is a good thing. And it’s in this phase of collecting training data where we think a proper generative algorithm might be able to help. You could generate utterances for an intent that your team might not have come up with. Still, you'd want to have a human in the loop in this scenario to make sure biased stereotypes and nonsensical examples don’t get added to your NLU data.
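To make the human-in-the-loop idea concrete, here is a minimal sketch of a review gate. Everything in it is an assumption for illustration: `review_candidates` and `ask_on_console` are hypothetical helpers, and a real workflow would more likely use a proper annotation tool than a console prompt. The point is only that nothing reaches the training data without an explicit approval.

```python
def review_candidates(candidates, existing, approve):
    """Return new, de-duplicated candidates that the reviewer explicitly approved."""
    seen = {example.strip().lower() for example in existing}
    accepted = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if not key or key in seen:
            continue  # skip blanks and anything we already have
        seen.add(key)
        if approve(candidate):
            accepted.append(candidate.strip())
    return accepted


def ask_on_console(candidate):
    """Hypothetical reviewer callback: a person answers y/n for each example."""
    return input(f"Keep '{candidate}'? [y/N] ").strip().lower() == "y"
```

You would call this as `review_candidates(generated_examples, current_nlu_examples, ask_on_console)`, where both lists are plain lists of strings.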

So we decided to run an experiment. We came up with a structured format that represents paraphrasing, and we gave this to GPT-3. Looking at the previous examples we figured that even though it isn’t specifically trained for this task, we might be able to steer GPT-3 into generating what we’re interested in.

Here’s one of our first results:

text: I want to book a first class ticket tomorrow 
- get me a ticket from Berlin to London 
- I want to go from Quito to Delhi on January 10th
- book a flight from Tokyo to Singapore in economy class 

text: I want to change my flight from Berlin to Abu Dhabi 
- I need to alter my booking 
- Something has come up and I want to change my flight 
- I'm not going to New York anymore next month 

text: Can you help me book a hotel room? 
- What's the closest 5 star hotel? 
- What's the best hotel for honeymooners? 
- Which hotel has the best gym? 
- I'm looking for a hotel in Singapore 
- What's the best hotel in Barcelona? 
- What's the best hotel in Rome?

Again, the bold text was our input and the regular text was generated.
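A prompt in this layout can be assembled from existing training examples with a small helper. The sketch below is one possible way to do it and not necessarily the exact prompt from the experiment: `build_paraphrase_prompt` is a hypothetical function, and ending on a bare dash is our assumption about how to nudge the model into writing an example next.

```python
def build_paraphrase_prompt(solved, open_text):
    """Render intents in the 'text: ...' / '- ...' layout and leave the last one open."""
    blocks = []
    for text, examples in solved:
        lines = [f"text: {text}"] + [f"- {example}" for example in examples]
        blocks.append("\n".join(lines))
    # End on a bare dash so the model continues with a first example.
    blocks.append(f"text: {open_text}\n-")
    return "\n\n".join(blocks)


paraphrase_prompt = build_paraphrase_prompt(
    [
        (
            "I want to book a first class ticket tomorrow",
            [
                "get me a ticket from Berlin to London",
                "I want to go from Quito to Delhi on January 10th",
            ],
        ),
    ],
    "Can you help me book a hotel room?",
)
print(paraphrase_prompt)
```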

This is impressive. We’re giving GPT-3 a variant of what we already have in our standard training files and it seems to immediately just roll with it. It’s also covering a lot of ground as far as examples go. The examples remain on topic but still offer a variety at the same time.

When we told the API to generate more examples, it did something unexpected. It didn’t generate more examples for the same intent. Instead it went ahead and created new intents and started generating examples for those.

text: Where's the next/closest ATM? 
- Can you show me the nearest ATM? 
- Where's the next ATM? 
- Where is the closest ATM? 
- Can you help me find an ATM? 

text: What's the best travel insurance? 
- I need travel insurance 
- I want to find travel insurance 
- Which is the cheapest travel insurance? 
- What's the best travel insurance for backpacking around Southeast Asia?

It’s interesting to see that GPT-3 generated questions that are related to the travel industry, but we again see the need for human supervision. It assumed that we wanted extra intents, while we actually wanted more examples for the intent we had given it.
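Since the model may invent intents we never asked for, it helps to parse the generated text back into a structure and separate the intents we supplied from the ones it made up. The sketch below assumes the `text:` / `- ` layout shown above; `parse_generated_intents` and `split_by_expected` are hypothetical helper names.

```python
def parse_generated_intents(generated):
    """Group '- example' lines under the most recent 'text: ...' line."""
    intents = {}
    current = None
    for raw_line in generated.splitlines():
        line = raw_line.strip()
        if line.startswith("text:"):
            current = line[len("text:"):].strip()
            intents.setdefault(current, [])
        elif line.startswith("-") and current is not None:
            example = line[1:].strip()
            if example:
                intents[current].append(example)
    return intents


def split_by_expected(intents, expected):
    """Separate intents we asked for from ones the model introduced on its own."""
    wanted = {k: v for k, v in intents.items() if k in expected}
    unexpected = {k: v for k, v in intents.items() if k not in expected}
    return wanted, unexpected


generated = """text: Can you help me book a hotel room?
- What's the closest 5 star hotel?
- Which hotel has the best gym?

text: Where's the next/closest ATM?
- Can you show me the nearest ATM?
- Where is the closest ATM?
"""
intents = parse_generated_intents(generated)
wanted, unexpected = split_by_expected(intents, {"Can you help me book a hotel room?"})
print(sorted(unexpected))
```

The unexpected intents are not necessarily useless, but a person should decide whether they belong in the assistant at all.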

This “trying to think along” behavior is both a blessing and a curse, but we’ve been impressed on a few occasions. Take this example where we also indicate entities.

text: I want to find a holiday home 
- I need to find a flat to rent for a month in [Sydney](city)
- I want to find an apartment to rent in [London](city)
- I’m looking for a studio apartment to rent in [Berlin](city)

text: I am looking for a touristic attraction
- I’m looking for a hotel in [Dublin](city)
- I want to find a travel attraction in [New York](city)
- I need a museum to visit in [Sydney](city)

Again, this is impressive! The main thing to point out here is that the API does not offer a special syntax for entities. It seems to be able to figure out that we’d like to wrap cities with a special syntax and keep that in mind while it is generating text.
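Generated annotations in this `[value](entity)` form can be checked mechanically before a person reviews them. The following is a minimal sketch using a regular expression; `extract_entities` is a hypothetical helper, and it deliberately ignores nested or malformed brackets.

```python
import re

# Matches annotations such as [Sydney](city).
ENTITY = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")


def extract_entities(example):
    """Return the plain text plus a character span for each [value](entity) annotation."""
    entities = []
    plain_parts = []
    cursor = 0  # position in the annotated string
    offset = 0  # length of the plain text built so far
    for match in ENTITY.finditer(example):
        before = example[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        value = match.group("value")
        entities.append(
            {
                "entity": match.group("entity"),
                "value": value,
                "start": offset,
                "end": offset + len(value),
            }
        )
        plain_parts.append(value)
        offset += len(value)
        cursor = match.end()
    plain_parts.append(example[cursor:])
    return "".join(plain_parts), entities


plain, found = extract_entities("I need a museum to visit in [Sydney](city)")
print(plain)
print(found)
```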

Impressive, not Perfect

We explored the API some more and noticed how a subtle change in language might cause the generated text to get stuck in a generative assumption. In the two examples below you'll see only a slight variation in text for the second intent. If you pay attention you’ll notice that one batch of examples is useful in an online scenario, while the other batch is not.

Example 1

text: Can you help me book a hotel
- I want to book a room at this hotel for three nights
- I want to book a hotel in Beijing
- Can you book me a room in a central location?
- Book me a room in a four-star hotel
- I want to book a room at a hotel close to the opera house

text: I want to order a pizza
- Can I have a ham and pineapple pizza?
- Can I order from the à la carte menu?
- I want to order from the lunch menu
- Is this restaurant family-friendly?
- What's your speciality?
- I want to order from the children's menu

Example 2

text: Can you help me book a hotel
- I want to book a room at this hotel for three nights
- I want to book a hotel in Beijing
- Can you book me a room in a central location?
- Book me a room in a four-star hotel
- I want to book a room at a hotel close to the opera house

text: Give me a pizza
- Give me a table for two, please
- Give me a table in the smoking area
- Can I have this table?
- Can I have a window seat?
- I'd like to have a table with a view of the garden, if possible

We should keep in mind that there is some random sampling happening here, but you’ll notice that the API might assume a mode that we’re not interested in. In example 1 we see examples that are appropriate for a digital assistant. The generated text in example 2 feels more appropriate for a dialogue in a physical restaurant. It’s hard to predict when this behavior occurs, but we’ve seen it on multiple occasions, and it is another reason why a human needs to be in the loop.

This behavior presents itself once again when we try to “rent a movie”.

text: I want to rent a movie
- I'd like to rent star wars
- Can I have that film on DVD?
- How much is it to rent a DVD?
- I'd like to have the film for three nights
- I'd like to rent this film

You can see here that the model assumes that renting a movie is still done with DVDs. If you consider all the historic data that it was trained on, it makes sense where this behavior comes from. But if we were building a digital assistant for an online rental service, we again see the need for a human in the loop. It’s not just that the model can assume a setting that is unfit for a digital situation. It can also be the case that GPT-3 assumes a setting that is unfit for the time that we live in.

Conclusion

GPT-3 has impressive parts. Really. The way the model is able to pick up a wide variety of tasks and how it is able to “roll with it” is unlike anything we’ve experimented with before. The generated text offers variety while staying on topic, and it’s even grammatically sound. That said, and we must stress this, we simply cannot recommend serving generated text directly to users, even with GPT-3. The generated text can contain harmful stereotypes as well as factually incorrect statements.

There’s also the practical matter that a generative algorithm is somewhat unbounded in the responses it can generate. For a digital assistant in production this is a giant risk to user experience. You’d like to be able to guarantee that a digital assistant can stick to a predefined set of actions so that users understand how to interact with it.

This is why we’ve currently been focussing on the use case for generating training data. Even here we’ve had mixed results for practical applications if you want to automate it all. Despite very impressive responses, there needs to be a human in the loop who can carefully select the training examples and steer GPT-3 in the right direction.

We’ll continue playing with this API to unravel more potential use cases as well as cautionary tales. We will share our stories here on the blog so keep an eye out for that in the future.