So let’s talk about intents. The documentation is not bad at explaining what an intent is, but it doesn’t really go into their strengths, or the best means of collecting them.
First, the important thing to understand about intents: how Watson perceives the world is defined by its intents. If you ask Watson a question, it can only understand it in relation to those intents. It cannot answer a question in a context it has not been trained on.
So for example, “I want to get a fishing license” may work for what you trained on, but “I want to get a driving license” may give you the same response, simply because it closely matches the first question while falling outside what your application is intended for.
So it is just as important to understand what is out of scope, but which you may still need to give an answer to.
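As a concrete illustration, here is a minimal sketch of one way to handle that: treat a low top-intent confidence as a sign the question is outside what you trained on. The threshold, the response shape and the function name are my own assumptions for illustration, not anything mandated by the Conversation service.

```python
# Minimal sketch (assumed response shape and threshold, not from the docs):
# treat a low top-intent confidence as "outside what I was trained on".
CONFIDENCE_FLOOR = 0.2  # arbitrary illustrative cut-off


def answer(response):
    """`response` is assumed to be the parsed JSON from a message call,
    containing an "intents" list ordered by confidence."""
    intents = response.get("intents", [])
    if not intents or intents[0]["confidence"] < CONFIDENCE_FLOOR:
        return "Sorry, that looks like something I haven't been trained on."
    return "Route to the dialog for #" + intents[0]["intent"]
```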
Getting your questions for training.
The strength of intents is the ability to map your customers’ language to your domain language. I can’t stress this enough. While Watson can be quite intelligent in understanding terms from its training, the important part is making those connections from language that does not directly relate to your domain.
This is where you can get the best results. So it is important to collect questions in the voice of your end-user.
The “voice” can also mean where and how the question was asked. How someone asks a question on the phone can be different from how they ask it over instant messaging. How you plan to build your application determines how you should capture those questions.
When collecting, make sure you do not accidentally bias the results. For example, if you have a subject matter expert collecting questions, you will find they unconsciously change the question when writing it down. Likewise, if you collect questions from surveys, try to avoid prompts that will bias the results. Take these two examples.
- “Ask questions relating to school timetables”
- “You just arrived on campus, and you don’t know where or what to do next.”
The first one will generate a very narrow scope of test questions related to your application, and not what a person would ask when actually in that situation. The second prompt is broader, but you may still find that people echo words from it, like “campus”, “where” and “what”.
Which comes first? Questions or Intents?
If you have defined the intents first, you need to get the questions for them. However, there is a danger that you are creating more work for yourself than needed.
If you do straight question collection, when you start to cluster the questions into intents you will see something like this:
Everything to the right of the orange line (the long tail) does not have enough questions to train Conversation on. Now you could go out and try to find questions for the long tail, but that is the wrong way to approach this.
Focus on the left side (the fat head); this is the most common stuff people will ask. It will also allow you to work on a very well polished user experience for the questions most users will actually hit. A quick way to see this distribution in your own clustered questions is sketched below.
The long tail still needs to be addressed, and if you have a completely flat line then you need to look at a different solution, for example Retrieve & Rank. There is an example that uses both.
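Here is a minimal sketch of that check in Python, assuming you hold your clustered questions as (question, intent) pairs. The example data and names are purely illustrative.

```python
from collections import Counter

# clustered: list of (question, intent) pairs after you have grouped the
# collected questions into intents. The examples are illustrative only.
clustered = [
    ("Where do I get a fishing license?", "get_license"),
    ("How much is a fishing license?", "license_cost"),
    ("Can I get a fishing license for my son?", "get_license"),
    # ... the rest of your collected questions
]

# Count how many real questions landed in each intent, most common first.
# The intents at the top are your fat head; anything with only one or two
# questions is sitting in the long tail.
counts = Counter(intent for _, intent in clustered)
for intent, total in counts.most_common():
    print(f"{total:4d}  {intent}")
```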
Manufacturing Intent
Now, creating manufactured questions is always a bad thing, but there may be instances where you need to do it, and it has to be done carefully. Watson is pretty intelligent when it comes to understanding a cluster of questions, but the person who creates those questions may not speak the way the customer does (even if they believe they do).
Take these examples:
- What is the status of my PMR?
- Can you give me an update on my PMR?
- What is happening with my PMR?
- What is the latest update of my PMR?
- I want to know the status of my PMR.
Straight away you can see “PMR”, which is a common term for an SME but may not be for the end-user. Nowhere does it mention what a PMR is. You can also see “update” and “status” repeated, which is unlikely to be an issue for Watson but doesn’t really create much variance.
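One rough way to spot this lack of variance is to look at how little vocabulary the manufactured questions actually cover. A small sketch, using the example questions above:

```python
# Rough variance check on one intent's training questions: the fewer
# distinct words across the examples, the less varied the wording is.
questions = [
    "What is the status of my PMR?",
    "Can you give me an update on my PMR?",
    "What is happening with my PMR?",
    "What is the latest update of my PMR?",
    "I want to know the status of my PMR.",
]

words = [w.strip("?.,!").lower() for q in questions for w in q.split()]
distinct = set(words)
# A low ratio of distinct words to total words means the questions are
# mostly re-arrangements of the same few terms ("status", "update", "PMR").
print(f"{len(distinct)} distinct words out of {len(words)} total "
      f"({len(distinct) / len(words):.0%})")
```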
Test, Test, Test!
Just like a human that you teach, you need to test to make sure they understood the material.
Get real world data!
After you have clustered all your questions, take out a random 10%-20% (depending on how many you have). Set these aside and do not look at their contents. This is normally called a “Blind Test”.
Run it against what you have trained and get the results. These should give you an indication of how the system will react in the real world*. Even if the results are bad, do not look into why.
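Before moving on to the other tests, here is a minimal sketch of the blind test idea in Python. The (text, intent) pair format and the `classify` function are placeholders for however you store your questions and call your trained workspace; they are not part of any SDK.

```python
import random


def split_blind_set(questions, fraction=0.1, seed=42):
    """Shuffle and set aside a blind test set before you start training.
    `questions` is a list of (text, expected_intent) pairs; the 10%
    default and the seed are illustrative choices."""
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[cut:], shuffled[:cut]   # (keep for training, blind set)


def score(test_set, classify):
    """Run each held-out question through the trained system and return
    the fraction it got right. `classify(text)` stands in for a call to
    your trained workspace (e.g. the Conversation message endpoint)."""
    correct = sum(1 for text, intent in test_set if classify(text) == intent)
    return correct / len(test_set)
```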
Instead of digging into the blind test results, you can run one or more of the following tests to see where things are going wrong.
Test Set : Similar to the blind test, you remove 10%-20% and use that to test (don’t add it back until you get more questions). You should get results pretty close to your blind test, and you can examine them to see why the system is not performing. The problem with a test set is that you are reducing the size of your training set, so if you have a low number of questions to begin with, the next two tests help.
K-fold cross validation : You split your training set into K random segments, use one segment to test and the rest to train, then work your way through all of them. This method will test everything, but it is extremely time-consuming. You also need to pick a good size for K so that you can test correctly.
Monte Carlo cross validation : In this instance you take out a random 10%-20% (depending on training set size), train on the rest and test against the held-out part. Normally you run this test at least three times and take the average. It is quicker to test. I have a sample Python script which can help you here, and a rough sketch of the idea is shown below.
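For example, here is a rough sketch of the Monte Carlo approach (k-fold follows the same pattern, just with fixed segments instead of random draws). `train_and_classify` is a placeholder for however you create a workspace from a training set and query it; it is not a Watson SDK call.

```python
import random
from statistics import mean


def monte_carlo_cv(questions, train_and_classify, fraction=0.1, runs=3, seed=1):
    """Monte Carlo cross validation: repeatedly hold out a random slice,
    train on the rest, score against the held-out slice, then average.

    `questions` is a list of (text, expected_intent) pairs and
    `train_and_classify(train_set)` is assumed to train a workspace and
    return a classify(text) -> intent function (placeholder, not an SDK call).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = list(questions)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * fraction)
        test_set, train_set = shuffled[:cut], shuffled[cut:]
        classify = train_and_classify(train_set)
        correct = sum(1 for text, intent in test_set if classify(text) == intent)
        scores.append(correct / len(test_set))
    return mean(scores)
```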
* If your questions were manufactured, then you are going to have a problem testing how well the system is going to perform in real life!
I got the results. Now what?
First, check the results of your blind test against whatever test you did above. They should fall within 5% of each other. If not, then your system is not correctly trained.
If this is the case, you need to look at the clusters of the questions that were answered wrongly, and also the clusters that supplied those wrong answers. You need to factor in the confidence of the system as well, and look for patterns that explain why it picked the wrong answer.
More on that later.