Chihuahua or Muffin, revisited.

I just finished reading Maria Yao’s article Chihuahua OR Muffin? Searching For The Best Computer Vision API. It’s a fun read, but I felt it didn’t really show off the power of Watson Visual Recognition.

The demo in the article used only the general classifier.

One of the main advantages of Watson Visual Recognition is that you can create your own custom classifiers. It’s also very simple to do.

First, you need data.

Using Maria’s article as a starting point, I pulled the Chihuahua and Muffin pictures from ImageNet.

Like most data, it needed a bit of cleaning. I deleted any image below 14KB in size, since the majority of files that small were simply corrupted. I also went through and removed any images that were adverts or “this image is no longer there” banners.

Overall that was 500 images deleted, which still left 3,000 images to play with.
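
For reference, the size-based cleanup is easy to script. A minimal sketch, assuming the images are sitting in a local folder (the folder name here is just a placeholder):

import os

MIN_SIZE = 14 * 1024  # files below roughly 14KB were almost always corrupted

for name in os.listdir('chihuahua'):
    path = os.path.join('chihuahua', name)
    if os.path.getsize(path) < MIN_SIZE:
        os.remove(path)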

Next I created a Visual Recognition service, on the free plan. This limited me to 250 events a day, so I had to lower my training sets to 100 pictures from each set.

I took a random 100 from each. I didn’t examine the photos at all, but here are a few to give you an idea of how the images look.

[Image: example training pictures of chihuahuas and muffins]

As you can see, no thought was put into worrying about other items in the pictures.

I zipped the images up, and created a classifier like so.

[Image: creating the custom classifier]

Then I clicked create, and waited a little over 10 minutes for it to analyse the pictures.
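
If you would rather script this step than use the tooling, the Python SDK of the time (watson_developer_cloud) can do the same thing. This is a rough sketch only: the API key, zip names, classifier name, and version date are placeholders, and the exact parameter names may differ between SDK versions.

from watson_developer_cloud import VisualRecognitionV3

visual_recognition = VisualRecognitionV3('2016-05-20', api_key='YOUR_API_KEY')

# One zip of positive examples per class; the class name comes from the keyword prefix
with open('chihuahua.zip', 'rb') as chihuahuas, open('muffin.zip', 'rb') as muffins:
    model = visual_recognition.create_classifier(
        'chihuahuaMuffin',
        chihuahua_positive_examples=chihuahuas,
        muffin_positive_examples=muffins)

print(model)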

Once the classifier had finished training, I was ready to test. As some may not be aware, Visual Recognition also offers a food classifier. So for my first two tests I tried my classifier, General, and Food.
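
The testing can be scripted in the same way. Another rough sketch, reusing the service object from the previous snippet; the custom classifier ID is a placeholder, and older SDK versions pass the classifier list slightly differently.

# 'default' is the general classifier and 'food' is the built-in food classifier
with open('muffin_test.jpg', 'rb') as image_file:
    results = visual_recognition.classify(
        images_file=image_file,
        classifier_ids=['chihuahuaMuffin_0000000000', 'default', 'food'],
        threshold=0.0)

print(results)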

[Image: muffin test results from the three classifiers]

You can see the red bar on the classifier I made. That is mostly because I only gave it 100 examples; as you give more training examples, its confidence increases. But you can see that the difference between Muffin and Chihuahua is clear.

You can also see the food classifier got it as well.

What about the Chihuahua?

[Image: chihuahua test results from the three classifiers]

As you can see, all three do quite well on classification. But what about the original look-alike pictures? I ran those through and ended up with this.

[Image: results for the look-alike images]

As you can see, it got them all right! None of these images were used in training.

As demos go, this is simple and fun. But with well-classified images, it can be scarily accurate for proper real-world use cases.

Having said that, I did have one failure. Testing the samples on Maria’s page, it was able to understand the Cookie Monster muffin and the man holding the chihuahua.

But it could not get the muffin in the plastic bag with the Chihuahua. I tried cropping out the dog, but it still failed, albeit with a lower confidence. I suspect this is a combination of limited training and a poor-quality photo.

[Image: the failed muffin-in-a-bag test]

I have no confidence in Entities.

I have something I need to confess. I have a personal hatred of Entities. At least in their current form.

There is a difference between deterministic and probabilistic programming that a lot of developers new to Watson find hard to switch to. Entities bring them back to that warm place of normal development.

For example, you are tasked with creating a learning system for selling Cats, Dogs, and Fishes. Collecting questions, you get this:

  • I want to get a kitten
  • I want to buy a cat
  • Can I get a calico cat?
  • I want to get a siamese cat
  • Please may I have a kitty?
  • My wife loves kittens. I want to get her one as a present.
  • I want to buy a dog
  • Can I get a puppy?
  • I would like to purchase a puppy
  • Please may I have a dog?
  • Sell me a puppy
  • I would love to get a hound for my wife.
  • I want to buy a fish
  • Can I get a fish?
  • I want to purchase some fishes
  • I love fishies
  • I want a goldfish

The first instinct is to create a single intent of #PURCHASE_ANIMAL and then create entities for the cats, dogs, and fishes, because it’s easier to wrap your head around entities than it is to reason about how Watson will respond.

So you end up with something like this:

[Image: the animal entities]

Wow! So easy! Let’s set up our dialog. To make it easier, let’s use a slot.

[Image: dialog node using a slot]

In under a minute, I have created a system that can help someone pick an animal to buy. You even test it and it works perfectly.

IT’S A TRAP!

First, the biggest red flag is that you have now turned your conversation into a deterministic system.

Still doing cross validation to test your intents? Give up, it’s pointless.

You can break it just by typing something like “I want to buy a bulldog”. Since “bulldog” doesn’t match any entity value, the slot keeps prompting and you are stuck in an endless loop.

The easiest solution is to tell the person what to type, or offer a link/button. But that doesn’t exhibit intelligence (and I hate buttons more than I hate entities 🙂).

The other option is to add “bulldog” to the @Animals:Dog entity. But when you go down that rabbit hole, you could realistically add the following.

  • 500+ types of breeds.
  • Common misspellings of those words.
  • Plurals of each breed.
  • Slang, variations, and nicknames of those animals.

You are easily into the thousands of keywords to match, and all it takes is one person to make a typo you don’t have in the list and it still won’t work.

Using entities in a probabilistic way.

All is not lost! You can still use entities and keep your system intelligent. First, we break the single intent up by type of animal, like so:

[Image: intents split by animal type]

Now if I type “I want to buy a bulldog” I get #PurchaseDog with 68% confidence, which is great, as I didn’t even train it on that word.

So next I try “I want to buy a pet” and I get #PurchaseCat with 55% confidence.

[Image: “I want to buy a pet” returning #PurchaseCat]

Hmm, great for cat lovers, but we want the conversation to be less sure about this one.

So we create the entities as before for Cat, Dog, Fish. You can use the same values.

Next before you check intents, add a node with the following condition.

[Image: the confidence-adjustment node condition]

This basically ensures that irrelevant hasn’t been triggered, and then checks that none of the animal entities have been mentioned.

Then in your JSON response you add the following code.

{
    "context": {
        "adjust_confidence": "<? intents[0].confidence = intents[0].confidence - 0.36 ?>"
    },
    "output": {
        "text": {
            "values": [],
            "selection_policy": "sequential"
        }
    }
}

The important part is the “adjust_confidence” context variable. This will lower the first intent’s confidence by 0.36 (36%).

We set the node to jump to the next node in line, so it can check the intents.

Now we get “I don’t understand” for the pet question (0.55 - 0.36 = 0.19, which falls below the 20% threshold). Bulldog still works, as 0.68 - 0.36 = 0.32 keeps it above.

Demo Details.

I used 36% for the demo, but this will vary in other projects. If your confidence levels run high, you can pick a smaller adjustment and then add another check for a lower bound. In other words, set your conversation to ignore any intent with a confidence lower than 30%, and then set your adjustment to -10%.

Advantages

Using this approach, you don’t need to worry as much about training your entities, only your intents. This lets you build a probabilistic model which isn’t impacted unless it is unsure to begin with.

I have supplied a Sample Conversation workspace which demonstrates the above.

Manufacturing Intent

Let me start this article with a warning:  Manufacturing questions causes more problems than it solves.

Sure, the documentation and many videos say the reverse. But they tend to give examples that have a narrow scope.

Take the car demo, for example. It works because there is a common domain language that everyone who uses a car knows. Someone who has never seen a car before won’t understand what a “window wiper” is, but they may say something like “I can’t see out the window, because it is raining”.

This is why, when building your conversation system, it is important to get questions from people who will actually use the system but don’t know the content. They tend not to know the “right” way to ask a question to get the answer, which is exactly what you need to capture.

But there are times when it can’t be avoided. For example, you might be creating a system that has no end users yet. In this case, manufacturing questions can help bootstrap the system.

There are some things to be aware of.

Manual creation.

This is actually very hard, even for the experienced. Here are the things you need to be aware of.

Your education and culture will shape what you write.

You can’t avoid it. Even if you are aware of this, you will fall back into the same patterns as you progress through creating questions. It’s not easy to see until you have a large sample: sorting the questions can give you a quick glance, while a bag-of-words analysis makes it more evident.

If you know the content, you will write what you know.

Again, having knowledge of the system’s answers will have you writing domain language into the questions. You will use terms that define the system, rather than describing what it does.

If you don’t know the content, use user stories.

If you manage to get someone who could be a representative user, be careful about how you ask them to write questions. If they don’t fully understand what you ask for, they will reuse your terms as keywords rather than express their underlying meaning.

Let’s compare two user stories:

  • “Ask questions about using the window wipers.”
  • “It is raining outside while you are driving, and it is getting harder to see. How might you ask the car to help you?”

With the first example, you will find that people use “window wipers”, “wipers”, and “window” frequently. Most of the questions will be about switching the wipers on/off.

With the second example, you may end up seeing questions like this.

  • Switch on the windshield wipers.
  • Activate the windscreen wipers.
  • Is my rain sensor activated?
  • Please clear the windows.

Your final application will shape the questions as well.

If you have your question-creation team working on desktop machines, they are going to create questions that won’t be the same as what someone typing on mobile, or talking to a robot, would ask.

The internet can be your friend.

Looking for similar questions in online forums can help you see the terms that people may actually use. For example, all of these mean the same thing: “NCT”, “MOT”, “Smog Test”, “RWC”, “WoF”, “COF”.

But each of those is meaningless to people in a different country.

Automated Creation

A lot of what I have seen in automation tends not to fare much better. If it did, we wouldn’t need people to write questions. 🙂

One technique is to try to generate a few new questions from existing ones. Again, I should stress that this is generally a bad idea, and this example doesn’t work well, but it might give you something to build on.

Take this example.

  • Can my child purchase a puppy?
  • Are children allowed to buy dogs?

From a manual review we can see the intent is about whether minors are allowed to buy. Now, over to the code.

For this I am using spaCy, which can check the similarity of words against each other. For example:

import spacy

# Load the English model, which includes the word vectors used for similarity
nlp = spacy.load('en')

dog = nlp(u'dog')
puppy = nlp(u'puppy')

# Similarity of the two word vectors: closer to 1.0 means more similar
print(dog.similarity(puppy))

Will output: 0.760806754875

The higher the number, the closer the words are to each other. By setting a threshold on the value, you can reduce a pair of questions to their important shared words. Setting a threshold of 0.7, we get:

[Image: word pairs from the two questions with similarity of 0.7 or higher]
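
A rough sketch of that comparison (my own illustration rather than the exact code behind the screenshot), reusing the nlp object from above:

q1 = nlp(u'Can my child purchase a puppy?')
q2 = nlp(u'Are children allowed to buy dogs?')

# Keep only the word pairs that sit close together in vector space
for t1 in q1:
    for t2 in q2:
        score = t1.similarity(t2)
        if score >= 0.7:
            print(t1.text, t2.text, round(score, 3))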

Playing with larger questions, you will find that certain parts of speech (POS) are mostly noise, so you can drop the following to remove it (a sketch of the filtering follows the list).

  • DET = determiner
  • PART = particle
  • CCONJ = conjunction
  • ADP = adposition
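
A minimal sketch of that filtering, again my own illustration rather than a definitive implementation:

NOISE_POS = {'DET', 'PART', 'CCONJ', 'ADP'}

def content_words(doc):
    # Drop tokens whose part of speech is in the noise list
    return [t for t in doc if t.pos_ not in NOISE_POS]

print([t.text for t in content_words(nlp(u'Are children allowed to buy dogs?'))])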

Now that you have reduced it to the main terms, you can build a synonym list off of these, like so:

dog = nlp(u'dog')

# Look the word up in the vocabulary so we can compare it against every other entry
word = nlp.vocab[dog.text]

# Sort the whole vocabulary by similarity to "dog", most similar first
sym = sorted(word.vocab,
             key=lambda w: word.similarity(w),
             reverse=True)

# Print the ten closest entries
print('\n'.join([w.orth_ for w in sym[:10]]))

Which will print out the following:

  • dog
  • DOG
  • Dog
  • dogs
  • DOGS
  • Dogs
  • puppy
  • Puppy
  • PUPPY
  • pet

As you can see, there is a lot of repetition, so you will want to remove the duplicates (mostly case variants). Also be wary of how far into the sym list you read; set a sensible upper bound.
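
A quick sketch of one way to do both at once, lower-casing the entries and only reading a bounded slice of the sorted list:

seen = set()
synonyms = []
for w in sym[:50]:  # upper bound on how far down the sorted vocab we read
    lower = w.orth_.lower()
    if lower not in seen:
        seen.add(lower)
        synonyms.append(lower)

print(synonyms)  # e.g. dog, dogs, puppy, pet, ...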

So after you generate a group of sample synonyms, you end up with something like this.

[Image: table of sample synonyms for each main term]

Now it’s a simple matter of just generating a random set of questions from this table. You end up with something like:

  • need children purchasing dogs
  • can kids buying puppy
  • may child purchases dog
  • will children purchase dogs
  • will kids buy pets
  • make children cheap puppies
  • need children purchase puppies

As you can see, they are pretty bad. Not so much because of the word salad, but because you have a very narrow scope of what can be answered. Still, it can give you enough to have your intent trigger.
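
For illustration, here is roughly how those combinations might be generated. The table below is a hypothetical stand-in for the synonym table above, not the one from my run:

import random

# Hypothetical synonym table, one entry per slot in the question
table = {
    'can': ['can', 'may', 'will', 'need'],
    'child': ['child', 'children', 'kids'],
    'purchase': ['purchase', 'buy', 'purchasing', 'buying'],
    'dog': ['dog', 'dogs', 'puppy', 'puppies', 'pets'],
}

def make_question():
    # Pick one synonym per slot and join them into a rough question
    return ' '.join(random.choice(words) for words in table.values())

for _ in range(5):
    print(make_question())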

You can also mitigate this by building a tensor from a number of similar questions, using n-grams instead of single words, adding a custom domain dictionary, and increasing your dictionary terms.

At the end of the day though, they are still going to be manufactured.