Visualising Coverage in Conversation Logs.

One of the most important parts of running a conversational system is making sure your end users are getting the most benefit out of it. Doing this requires looking at patterns in your conversation logs, which can be time consuming.

A common approach is to put markers into your nodes and then look for those entry/exit point markers. But a user question can hit multiple nodes and slots across multiple log lines, making it trickier to see. Here are a couple of approaches to more easily get information on your complex flows.

For this demo I am using the default demo skill in Watson Assistant to generate logs. I have created a number of simple conversations, a couple of which demonstrate an issue with how the user may interact. I have also supplied the example notebook and files for you to try out.

Creating the graph.

To generate the graph, I first need to convert the log to a graph format. The easiest way is to look at the nodes_visited column in the logs. Here is an example of a user making a reservation.

['Opening']
['Reservation using slots', 'handler_104_1498132501942', 'slot_102_1498132501942', 'handler_103_1498132501942', 'handler_6_1509695999145', 'handler_104_1498132501942', 'slot_102_1498132501942', 'handler_103_1498132501942', 'handler_107_1498132552870', 'slot_105_1498132552870']
['slot_105_1498132552870', 'handler_106_1498132552870', 'handler_10_1509132875735', 'slot_8_1509132875735', 'handler_9_1509132875735', 'handler_17_1509135162089', 'handler_104_1498132501942', 'slot_102_1498132501942']
['slot_102_1498132501942', 'handler_103_1498132501942', 'handler_107_1498132552870', 'slot_105_1498132552870', 'handler_106_1498132552870', 'handler_10_1509132875735', 'slot_8_1509132875735']
['slot_8_1509132875735', 'handler_9_1509132875735', 'handler_14_1509133469904', 'handler_24_1522444583114', 'slot_22_1522444583114', 'handler_23_1522444583114', 'handler_22_1522598191131', 'node_3_1519173961259', 'Reservation using slots']

Although each line is a single interaction, you can see that it is in fact part of a chain of events. When you join the chains you end up with:

['Opening'] ['Reservation using slots', 'handler_104_1498132501942', 'slot_102_1498132501942', 'handler_103_1498132501942', 'handler_6_1509695999145', 'handler_104_1498132501942', 'slot_102_1498132501942', 'handler_103_1498132501942', 'handler_107_1498132552870', 'slot_105_1498132552870', 'handler_106_1498132552870', 'handler_10_1509132875735', 'slot_8_1509132875735', 'handler_9_1509132875735', 'handler_17_1509135162089', 'handler_104_1498132501942', 'slot_102_1498132501942', 'handler_103_1498132501942', 'handler_107_1498132552870', 'slot_105_1498132552870', 'handler_106_1498132552870', 'handler_10_1509132875735', 'slot_8_1509132875735', 'handler_9_1509132875735', 'handler_14_1509133469904', 'handler_24_1522444583114', 'slot_22_1522444583114', 'handler_23_1522444583114', 'handler_22_1522598191131', 'node_3_1519173961259', 'Reservation using slots']

The second part is the whole interaction the user had while trying to book an appointment. It's still not that readable, so I converted the node IDs to something more readable (see the sketch after the list below).

  • slot_ = Replaced with the variable that the slot collects.
  • node_ = Replaced with the condition for the node in the skill.
  • frame = The top-level slot node (not shown above; it's part of the skill node attributes). Replaced with the condition of the node.
  • response = The node that responds to the end user, or part of the slot. Added “response to: <parent node name>”.
  • handler = Left the same.
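A rough sketch of that conversion, assuming skill is the exported skill JSON (the dialog_nodes field names below follow the skill export format; adjust to your own workspace):

nodes_by_id = {n['dialog_node']: n for n in skill['dialog_nodes']}

def readable(node_id):
    # Map a raw nodes_visited ID to a more readable label.
    n = nodes_by_id.get(node_id)
    if n is None:
        return node_id
    node_type = n.get('type', 'standard')
    if node_type == 'slot':
        return 'slot: ' + n.get('variable', node_id)
    if node_type == 'frame':
        return 'frame: ' + (n.get('conditions') or n.get('title') or node_id)
    if node_type == 'response_condition':
        parent = nodes_by_id.get(n.get('parent'), {})
        return 'response to: ' + (parent.get('title') or parent.get('conditions') or node_id)
    if node_type == 'event_handler':
        return node_id  # handlers are left as-is
    return n.get('conditions') or n.get('title') or node_id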

Once this is done, I converted each chain into graph nodes and edges. Each time an edge is repeated, a count on the edge object is incremented. You end up with this.
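As a minimal sketch of that step (assuming df is a dataframe of log rows with conversation_id and nodes_visited columns, already converted to readable labels):

import networkx as nx

def join_chain(visited_lists):
    # Consecutive log lines overlap on the boundary node, so drop the duplicate.
    chain = []
    for nodes in visited_lists:
        if chain and nodes and chain[-1] == nodes[0]:
            nodes = nodes[1:]
        chain.extend(nodes)
    return chain

G = nx.DiGraph()
for _, group in df.groupby('conversation_id'):
    chain = join_chain(list(group['nodes_visited']))
    for src, dst in zip(chain, chain[1:]):
        if G.has_edge(src, dst):
            G[src][dst]['count'] += 1
        else:
            G.add_edge(src, dst, count=1)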

Red nodes are entry points to a single flow. Orange is a flow which could have been entered through other parts of the conversation. Blue are the slot values. Pink is a final response to the user from the flow.

As you can see it’s still a mess!

By selecting the entry point node you can delete all other nodes that do not have a path to it. In this case I selected “frame: #Customer_Care_Appointments”. This was generated.

Still a bit of a mess, and not easy to see how the paths flow through the appointment booking. NetworkX was designed more for analysing graphs than visualising them.
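With NetworkX that pruning is only a couple of lines. A minimal sketch, using the graph G built above and the readable entry-point label:

import networkx as nx

# Keep the entry point plus every node that can reach it or be reached from it.
entry = 'frame: #Customer_Care_Appointments'
keep = {entry} | nx.ancestors(G, entry) | nx.descendants(G, entry)
G_flow = G.subgraph(keep).copy()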

Graph to Sankey

So, using the generated graph data, I moved it over to a Sankey diagram. The nice thing with plotly is that you can easily move the flows around to see what is going on. Here is what is generated using the graph information from the last image.
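Converting the NetworkX graph to a Sankey is mostly a matter of mapping node names to indices. A hedged sketch, assuming G_flow is the pruned graph from the previous step and a recent plotly version:

import plotly.graph_objects as go

labels = list(G_flow.nodes())
index = {name: i for i, name in enumerate(labels)}

sources = [index[u] for u, v in G_flow.edges()]
targets = [index[v] for u, v in G_flow.edges()]
values = [G_flow[u][v]['count'] for u, v in G_flow.edges()]

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=15),
    link=dict(source=sources, target=targets, value=values),
))
fig.show()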

Edge colors are red where there is more output from a node than there is input. In a normal, well-trained conversational flow the traffic through a node should be fairly balanced. Not all red is an issue though. Taking the two biggest, we can drill down to a root cause.
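The red colouring can be worked out straight from the graph. A rough sketch that flags a link as red when its source node pushes out more flow than it receives (the resulting link_colors list can be passed as link=dict(..., color=link_colors) in the Sankey above):

def out_minus_in(node):
    out_flow = sum(d['count'] for _, _, d in G_flow.out_edges(node, data=True))
    in_flow = sum(d['count'] for _, _, d in G_flow.in_edges(node, data=True))
    return out_flow - in_flow

# Entry points legitimately have more output than input, so skip them.
link_colors = ['rgba(200,30,30,0.6)' if G_flow.in_degree(u) > 0 and out_minus_in(u) > 0
               else 'rgba(150,150,150,0.4)'
               for u, v in G_flow.edges()]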

#1

This shows that a lot of users are not progressing through the phone section of the flow and are going into a loop. As the second part is much smaller, it suggests that people are giving up on the flow. Looking through the logs shows the following pattern.

Clearly the end users are having problems trying to enter a valid phone number, so this is something that should be looked at and resolved.

#2

You can see three inputs into the handler before it passes over to the “Ask for date” slot. This isn't an issue, as there are three conditions under which this can happen:

  • User supplies a date when asking for the appointment.
  • System asks the user for the date.
  • User asks to redo the appointment at final confirmation.

The handler is doing what it should be doing.

Conclusion

So this example is showing just one way to approach this problem. I’d be interested to hear how others are dealing with this.

Testing your intents

So this really only helps if you are doing a large number of intents, and you have not used entities as your primary method of determining intent.

First, let's talk about perceived accuracy and what this is trying to solve. Perceived accuracy is where someone types in a few questions they know the answer to, and then, depending on that manual test, perceives the system to be working or failing.

It puts the person training the system into a false sense of how it is performing.

If you have done the Watson Academy training for Conversation, you will hear it mention K-fold testing. For this blog post I'm going to skip the details, as I briefly mentioned them before.

K-fold cross validation : You split your training set into random segments (K). Use one set to test and the rest to train. You then work your way through all of them. This method will test everything, but will be extremely time-consuming. Also you need to pick a good size for K so that you can test correctly.

K-fold works well by itself if you have a large training set that has come from real-world, representative users. You will find this rarely happens, so you should use it in conjunction with a blind test set.

Previously I didn’t cover how you actually do the test. So with that, here is the notebook giving a demonstration:
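The notebook has the full workflow, but the core loop is roughly as follows. This is only a sketch: training.csv, train_workspace() and classify() are hypothetical stand-ins for your own question/intent file and for the Watson API calls that create a temporary workspace and classify a question against it.

import pandas as pd
from sklearn.model_selection import KFold

data = pd.read_csv('training.csv', names=['question', 'intent'])  # hypothetical file

kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_idx, test_idx in kf.split(data):
    train_df, test_df = data.iloc[train_idx], data.iloc[test_idx]
    workspace_id = train_workspace(train_df)              # hypothetical helper
    correct = sum(classify(workspace_id, q) == expected   # hypothetical helper
                  for q, expected in zip(test_df['question'], test_df['intent']))
    scores.append(correct / len(test_df))

print('Average accuracy across folds:', sum(scores) / len(scores))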

 

 

Removing the confusion in intents.

While the complexity of building Conversation has been reduced for non-developers, one of the areas where people can sometimes struggle is training intents.

When trying to determine how the system performs, it is important to use tried and true methods of validation. If you go by someone just interacting with the system you can end up with what is called “perceived accuracy”. A person may ask three to four questions, and three may fail. Their perception becomes that the system is broken.

Using cross validation gives you a better feel for how the system is working, as will a blind / test set. But knowing how the system performs, and interpreting the results, is where it takes practice.

Take this example test report. Each line shows a question that was asked, the intent it should map to, and the intent that came back. The X denotes where an answer failed.

report_0707

Unless you are used to doing this analysis full time, it is very hard to see the bigger picture. For example, is DOG_HEALTH the issue, or is it LITTER_HEALTH + BREEDER_GUARANTEE?

You have to manually analyse each of these clusters and determine what changes are required to be made.

Thankfully, scikit-learn makes your life easier by letting you create a confusion matrix. With this you can see how each intent performs against the others. So you end up with something like this:

confusion_matrix

So now you can quickly see which intents are getting confused with others, and focus on those to improve your accuracy. In the example above, DOG_HEALTH and BREEDER_INFORMATION offer the best areas to investigate.
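The heavy lifting is only a couple of lines of scikit-learn. A minimal sketch, assuming results is a dataframe holding the expected and returned intent for every test question:

import pandas as pd
from sklearn.metrics import confusion_matrix

labels = sorted(set(results['expected']) | set(results['returned']))
cm = confusion_matrix(results['expected'], results['returned'], labels=labels)

# Wrap it in a dataframe so the rows and columns are labelled by intent.
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)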

I’ve created a Sample Notebook which demonstrates the above, so you can modify to test your own conversations training.

 

I love Pandas!

Not the bamboo eating kind (but they are cute too), Python Pandas!

But first… Conversation has a new feature!

Logging! 

You can now download your logs from your conversation workspace into a JSON format. So I thought I’d take this moment to introduce Pandas. Some people love the “Improve” UI, but personally I like being able to easily mold the data to what I need.

First, if you are new to Python, I strongly recommend getting a Python Notebook like Jupyter set up or use IBM Data Science Experience. It makes learning so much easier, and you build your applications like actual documentation.

I have a notebook created so you can play along.

Making a connection

As the feature is just out, the SDKs don't have the API for it yet, so I will be using the requests library.

import json
import requests
from requests.auth import HTTPBasicAuth

url = 'https://gateway.watsonplatform.net/conversation/api/v1/workspaces/WORKSPACE_ID/logs?version=2017-04-21'
basic_auth = HTTPBasicAuth(ctx.get('username'), ctx.get('password'))
response = requests.get(url=url, auth=basic_auth)
j = json.loads(response.text)

So we have the whole log now sitting in j but we want to make a dataframe. Before we do that however, let’s talk about log analysis and the fields you need. There are three areas we want to analyse in logs.

Quantitative – These are fixed metrics, like number of users, response times, common intents, etc.

Qualitative – This is analysing how the end user is speaking, and how the system interpreted and responded. Some examples would be where the answer returned may give the wrong impression to the end user, or users ask things out of expected areas.

Debugging – This is really looking for coding issues with your conversation tree.

So, on to the fields that cover these areas. These are all contained within each log record's response object. The usage area for each field is given in parentheses.

  • input.text (Qualitative) – This is what the user or the application typed in.
  • intents[] (Qualitative) – This tells you the primary intent for the user's question. You should capture the intent and confidence into columns. If the value is [] then the question was deemed irrelevant.
  • entities[] (Quantitative) – The entities found in relation to the call. With this and intents though, it's important to understand that the application can override these values.
  • output.text[] (Qualitative) – This is the response shown to the user (or application).
  • output.log_messages (Debugging) – Capturing this field is handy for spotting coding issues within your conversation tree. SpEL errors show up here if they happen.
  • output.nodes_visited (Debugging, Qualitative) – This can be used to see how a progression through a tree happens.
  • context.conversation_id (All) – Use this to group a user's conversation together. In some solutions, however, one-pass calls are sometimes made mid-conversation. If you do this, you need to factor that in.
  • context.system.branch_exited (Debugging) – This tells you if your conversation left a branch and returned to root.
  • context.system.branch_exited_reason (Debugging) – If branch_exited is true then this will tell you why. "completed" means that the branch found a matching node and finished. "fallback" means that it could not find a matching node, so it jumped back to root to find the match.
  • context.??? (All) – You may have context variables you want to capture. You can either do these individually, or write code to remove the conversation objects and grab what remains.
  • request_timestamp (Quantitative, Qualitative) – When Conversation received the user's message.
  • response_timestamp (Quantitative, Qualitative) – When Conversation responded to the user. You can do a delta to see if there are performance issues, but generally keep one of the timestamp fields for analysis.

 

So we create a rows array and fill it with dict objects of the columns we want to capture. For clarity of the blog post, the sample code below only captures a handful of the fields listed above.

import pandas as pd
rows = []

# for object in Json Logs array.
for o in j['logs']:
    row = {}
 
    # Let's shorthand the response object.
    r = o['response']
 
    row['conversation_id'] = r['context']['conversation_id']
 
    # We need to check the fields exist before we read them. 
    if 'text' in r['input']: row['Input'] = r['input']['text']
    if 'text' in r['output']:row['Output'] = ' '.join(r['output']['text'])
 
    # Again we need to check it is not an Irrelevant response. 
    if len(r['intents']) > 0:
        row['Confidence'] = r['intents'][0]['confidence']
        row['Intent'] = r['intents'][0]['intent']

    rows.append(row)

# Build the dataframe. 
df = pd.DataFrame(rows,columns=['conversation_id','Input','Output','Intent','Confidence'])
df = df.fillna('')

# Display the dataframe. 
df

When this is run, all going well you end up with something like this:

report1-1804

The notebook has a better report, and is also sorted so it is actually readable.

report2-1804

Once you have everything you need in the dataframe, you can manipulate it very quickly and easily. For example, let's say you want to get a count of the intents found.

# Get the counts.
q_df = df.groupby('Intent').count()

# Keep only the conversation_id column (these extra columns exist in the
# fuller dataframe built in the notebook).
q_df = q_df.drop(['request TS', 'response TS', 'User Input', 'Output', 'Confidence', 'Exit Reason', 'Logging'],axis=1)

# Rename the conversation_id field to "Count".
q_df.columns = ['Count']

# Sort and display. 
q_df = q_df.sort_values(['Count'], ascending=[False])
q_df

This creates this:

report3-1804

The Jupyter notebook also allows you to visualise the data, although I haven't put any visualisations in the sample notebook.
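As a quick hedged example of what one could look like, here is a bar chart of the intent counts computed above (assuming matplotlib is installed):

import matplotlib.pyplot as plt

q_df.plot(kind='bar', legend=False, title='Intent counts')
plt.show()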

Compound Questions

One problem that is tricky to solve is when a user asks two questions at once. Previously, some solutions were to look for conjunctions (“and”) or question marks, and then try to guess whether there is more than one question.

But you could end up with a question like “Has my dog been around other dogs and other people?”. This is clearly one question.

With the new Conversation feature of “absolute confidences”, it is now possible to detect this. In earlier versions of Conversation, all intent confidences added up to 1.0.

Now each confidence has its own value. Taking the earlier example, if we map the confidences to a chart, we get:

conv060217-1

Visually we can see that the first and second intent are not related. The next sentence “Has my dog been around other dogs and is it certified?” is two questions. When we chart this we see:

conv060217-2

Very easy to see that there are two questions. So how to do it in your code?

You can use a clustering technique called K-means. This will cluster your data into K sets. In this case we have “important intents” and “unimportant intents”. Two groups means K = 2.

For this demonstration I am going to use Python, but K-means exists in a number of languages. I have a sample of the full code and an example conversation workspace, so here I will only show code snippets.

Walkthrough

The Conversation request needs to set alternate_intents to true, so that you get access to the top 10 intents.
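A hedged sketch of that request, following the same URL pattern as the logs call earlier (WORKSPACE_ID and ctx are placeholders for your own workspace ID and service credentials):

import requests
from requests.auth import HTTPBasicAuth

url = 'https://gateway.watsonplatform.net/conversation/api/v1/workspaces/WORKSPACE_ID/message?version=2017-04-21'
payload = {
    'input': {'text': 'Has my dog been around other dogs and is it certified?'},
    'alternate_intents': True
}
response = requests.post(url, auth=HTTPBasicAuth(ctx.get('username'), ctx.get('password')),
                         json=payload).json()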

Once you get your response back, convert your confidence list into an array.

intent_confidences = list(o['confidence'] for o in response['intents'])

Next the main method will return True if it thinks it is a compound question. It requires numpy + scipy.

import numpy as np
from scipy.cluster.vq import kmeans, vq

def compoundQuestion(intents):
    v = np.array(intents)
    codebook, _ = kmeans(v, 2)
    ci, _ = vq(v, codebook)

    # We want to make everything in the top bucket to have a value of 1.
    if ci[0] == 0: ci = 1-ci
    if sum(ci) == 2: return True
    return False

The first three lines will take the array of confidences and generate two centroids. A centroid is the mean of each cluster found. It will then group each of the confidences into one of the two centroids.

Once it runs ci will look something like this: [ 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 ] . This however can be the reverse.

The first value is the first intent. So if the first value is 0 we invert the array and then add up all the values:

[ 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ] => 2 

If we get a value of 2, then the first two intents are related to the question that was entered. For any other value, we either have only one question, or potentially more than two important intents.

Example output from the code:

Has my dog been around other dogs and other people?
> Single intent: DOG_SOCIALISATION (0.9876400232315063)

Has my dog been around others dogs and is it certified?
> This might be a compound question. Intent 1: DOG_SOCIALISATION (0.7363447546958923). Intent 2: DOG_CERTIFICATION (0.6973928809165955).

Has my dog been around other dogs? Has it been around other people?
> Single intent: DOG_SOCIALISATION (0.992318868637085)

Do I need to get shots for the puppy and deworm it?
> This might be a compound question. Intent 1: DOG_VACCINATIONS (0.832768440246582). Intent 2: DOG_DEWORMING (0.49955931305885315).

Of course you still need to write code to take action on both intents, but this might make it a bit easier to handle compound questions.

Here is the sample code and workspace.

Data Science Experience

Apologies for the long gap in updates; life has been a bit crazy busy at the moment. I have a few entries cached and ready to go, but couldn't get around to finishing them. As this year is nearly at an end for me, I should have some spare time to catch up.

So this is a brief entry to talk about IBM Data Science Experience. This is a new service which hooks into Bluemix. Using Spark, it allows you to build Python/R/Scala notebooks.

For those not familiar with notebooks, they are a really cool way to create prototyping code as documentation. It also has a whole host of extras that you can hook into to visualise and manipulate your data. As well as loads of datasets to play with.

dse.png

You can check it out for yourself. Here is the example notebook from above.

The road to good intentions.

So let’s talk about intents. The documentation is not bad in explaining what an intent is, but doesn’t really go into its strengths, or the best means to collect them.

First, the important thing to understand about intents: how Watson perceives the world is defined by its intents. If you ask Watson a question, it can only understand it in relation to those intents. It cannot answer a question where it has not been trained on the context.

So, for example, “I want to get a fishing license” may work for what you trained, but “I want to get a driving license” may give you the same response, simply because it closely matches the training while falling outside of what your application is intended for.

So it is just as important to understand what is out of scope but may still need an answer.

Getting your questions for training.

The strength of intents is the ability to map your customers' language to your domain language. I can't stress this enough. While Watson can be quite intelligent in understanding terms from its training, what matters is making those connections from language that does not directly relate to your domain.

This is where you can get the best results. So it is important to collect questions in the voice of your end-user.

The “voice” can also mean where and how the question was asked. How someone asks a question on the phone can be different to instant messaging. How you plan to create your application determines how you should capture those questions.

When collecting, make sure you do not accidentally bias the results. For example, if you have a subject matter expert collecting, you will find they will unconsciously change the question when writing it down. Likewise, if you collect questions from surveys, try to avoid asking questions which will bias the results. Take these two examples.

  • “Ask questions relating to school timetables”
  • “You just arrived on campus, and you don’t know where or what to do next.”

The first one will generate a very narrow scope of test questions related to your application, and not what a person would ask when in that situation. The second question is broader, but you may still find that people will use words like “campus”, “where”, “what”.

Which comes first? Questions or Intents?

 

If you have defined the intents first, you need to get the questions for them. However there is a danger that you are creating more work for yourself than needed.

If you do straight question collection, when you start to cluster into intents you will start to see something like this:

Longtail.png

Everything right of the orange line (long tail) does not have enough to train Conversation. Now you could go out and try and find questions for the long tail, but that is the wrong way to approach this.

Focus on the left side (the fat head); this is the most common stuff people will ask. It will also allow you to work on a very well polished user experience which most users will hit.

The long tail still needs to be addressed, and if you have a full flat line then you need to look at a different solution. For example Retrieve & Rank. There is an example that uses both.

Manufacturing Intent

Creating manufactured questions is generally a bad thing, but there may be instances where you need to do it, and it has to be done carefully. Watson is pretty intelligent when it comes to understanding a cluster of questions, but the person who creates those questions may not speak in the way of the customer (even if they believe they do).

Take these examples:

  • What is the status of my PMR?
  • Can you give me an update on my PMR?
  • What is happening with my PMR?
  • What is the latest update of my PMR?
  • I want to know the status of my PMR.

Straight away you can see “PMR”, which is a common term for an SME but may not be for the end-user. Nowhere does it mention what a PMR is. You can also see “update” and “status” repeated, which is unlikely to be an issue for Watson but doesn't create much variance.

Test, Test, Test!

Just like a human that you teach, you need to test to make sure they understood the material.

Get real world data!

After you have clustered all your questions, take out a random 10%-20% (depending on how many you have). You set these aside and don’t look at the contents. This is normally called a “Blind Test”.

Run it against what you have trained on and get the results. These should give you an indicator of how it reacts in the real world*. Even if the results are bad, do not look into why.

Instead you can create one or more of the following tests to see where things are going weird.

Test Set : Similar to the blind test, you remove 10%-20% and use that to test (don't add it back until you get more questions). You should get results pretty close to your blind test, and you can examine them to see why it's not performing. The problem with the test set is that you are reducing the size of your training set, so if you have a low number of questions to begin with, the next two tests help.

K-fold cross validation : You split your training set into random segments (K). Use one set to test and the rest to train. You then work your way through all of them. This method will test everything, but will be extremely time-consuming. Also you need to pick a good size for K so that you can test correctly.

Monte Carlo cross validation : In this instance you take out a random 10%-20% (depending on training set size) and test against it. Normally you run this test at least 3 times and take the average, so it is quicker to run. I have a sample python script which can help you here, and a sketch is shown after the footnote below.

* If your questions were manufactured, then you are going to have a problem testing how well the system is going to perform in real life!
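Here is a sketch of the Monte Carlo test: repeat a random split a few times and average the accuracy. As before, training.csv, train_workspace() and classify() are hypothetical stand-ins for your own question/intent file and the calls that train a temporary workspace and classify a question against it.

import pandas as pd

data = pd.read_csv('training.csv', names=['question', 'intent'])  # hypothetical file

scores = []
for run in range(3):
    test_df = data.sample(frac=0.15, random_state=run)
    train_df = data.drop(test_df.index)
    workspace_id = train_workspace(train_df)              # hypothetical helper
    correct = sum(classify(workspace_id, q) == expected   # hypothetical helper
                  for q, expected in zip(test_df['question'], test_df['intent']))
    scores.append(correct / len(test_df))

print('Average accuracy over %d runs: %.3f' % (len(scores), sum(scores) / len(scores)))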

I got the results. Now what?

First, check the results of your blind test vs whatever test you did above. They should fall within 5% of each other. If not, then your system is not correctly trained.

If this is the case, you need to look at the questions that were answered incorrectly, and also the clusters that returned the wrong answer. You need to factor in the confidence of the system as well, and look for patterns that explain why it picked the wrong answer.

More on that later.

 

Building a Conversation interface in minutes.

I come from a Java development background, but since joining Watson I’ve started using Python and love it. 🙂 It’s like it was made for Conversation.

The Conversation test sidebar is handy, but sometimes you need to see the raw data, or certain parts that don’t show up in the side bar.

Creating a Bluemix application can be heavy if you just want to do some testing of your conversation. Python allows you to test with very little code. Here are some easy steps to get you started. I am making the assumption that you already have a Conversation service and workspace set up.

1: If you are using a Mac, you have Python already installed. Otherwise you need to download it from python.org.

2: Install the Watson Developer Cloud SDK. You can also just use Requests, but the SDK will make your life easier.

3: In your conversation service, copy the service credentials as-is (if you are using the latest UI). If it doesn’t look like below, you may need to alter it.

conv0310-2

4: Go to your conversation workspace, and check the details to get your workspace ID. Make a note of that.

5: Download the following code.

conv0310-1

For the “ctx” part, just paste in your service credentials and update the workspace ID. The version number you can get from the Conversation API documentation.

6: Run the Python code. Assuming you put in the correct details, you can type into the console and get your responses back from conversation. Just type “…” to quit.
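If you just want to see the shape of that script, here is a rough sketch of the same loop using the requests library rather than the SDK (paste your own credentials into ctx and set workspace_id; the version string comes from the API documentation):

import requests
from requests.auth import HTTPBasicAuth

ctx = {'username': 'USERNAME', 'password': 'PASSWORD'}
workspace_id = 'WORKSPACE_ID'
url = 'https://gateway.watsonplatform.net/conversation/api/v1/workspaces/{}/message?version=2017-04-21'.format(workspace_id)

context = {}
while True:
    text = input('> ')
    if text == '...':
        break
    r = requests.post(url, auth=HTTPBasicAuth(ctx['username'], ctx['password']),
                      json={'input': {'text': text}, 'context': context}).json()
    context = r.get('context', {})
    print(' '.join(r.get('output', {}).get('text', [])))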

conv0310-3