Zero-Shot Learning the Alphabetic Characters

Is it possible to recognise alphabetic characters that have not been seen during training?

Sebastian Poliak
Towards Data Science


In this article, I would like to explain and practically demonstrate an area of machine learning called zero-shot learning, which I find really fascinating. Let’s start with a little description of what it actually is:

Zero-shot learning is a method of recognising categories that have not been observed during training.

Compared to the traditional supervised learning approach, which relies on tons of labelled examples for every category, the main idea of zero-shot learning is semantic transfer from observed categories to previously unseen ones.

Imagine that you have never seen the letter “H” in your life. What if I told you that it consists of two vertical lines connected with a horizontal line in the middle? Would you be able to recognise it?

10 out of 15 manually designed features. Image by Author.

The key to such a semantic transfer is to encode categories as vectors in a semantic space. This is needed for both training and testing categories, and can be done in a supervised or an unsupervised way.

A supervised way would be to manually annotate the categories by coming up with a set of features (e.g. a dog = has a tail, has fur, has four legs, etc.) and encoding them into category vectors. The features could also be taken from existing taxonomies in the given field.
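To make the encoding concrete, here is a toy sketch (the feature names and categories are made up for illustration):

```python
import numpy as np

# A fixed, ordered list of manually designed features (hypothetical).
FEATURES = ["has_tail", "has_fur", "four_legs", "can_fly", "lays_eggs"]

# Each category is annotated with the subset of features that apply to it.
ANNOTATIONS = {
    "dog":    {"has_tail", "has_fur", "four_legs"},
    "parrot": {"has_tail", "can_fly", "lays_eggs"},
}

def encode(category):
    """Turn a category's feature annotations into a binary vector."""
    return np.array([1 if f in ANNOTATIONS[category] else 0 for f in FEATURES])

print(encode("dog"))  # [1 1 1 0 0]
```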

An unsupervised way would be to use word embeddings of the category names. Word embeddings already capture the semantic meaning of the categories, based on the contexts in which the category names appear in a text corpus (e.g. Wikipedia).
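As a minimal sketch of this approach, assuming a pre-trained word2vec model loaded with gensim (the file name is a placeholder):

```python
from gensim.models import KeyedVectors

# Load pre-trained word vectors (path and format are placeholders;
# e.g. vectors trained on a large corpus such as Google News).
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# The category vector is simply the embedding of the category name.
dog_vector = vectors["dog"]  # a dense vector, e.g. 300 dimensions

# Semantically related categories end up close together in this space.
print(vectors.most_similar("dog", topn=3))
```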

Experiment

I decided to use the supervised way and create the category vectors myself, so that I could get more insight into the resulting model. I therefore started looking for a task with a reasonable number of categories, for which it would be feasible to come up with features that make sense. As you already know, I ended up with the alphabetic characters, using the A-Z Handwritten Alphabets dataset.

The next step was to design the features for all 26 categories (characters of the alphabet). The features had to be general enough to always cover more than one category; however, every category also had to be described by a unique set of features, in order to be distinguishable later on. Altogether, I came up with 15 features that meet these restrictions (10 of them are shown in the first image).
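Both restrictions are easy to verify programmatically. A small sketch, with a random placeholder standing in for the real hand-crafted 26×15 attribute matrix:

```python
import numpy as np

# One row per category (26 characters), one column per feature (15).
# Random placeholder here; in the experiment every entry was set by hand.
rng = np.random.default_rng(0)
attribute_matrix = rng.integers(0, 2, size=(26, 15))

# Every category must map to a distinct feature vector, otherwise
# nearest-neighbour matching could not tell the categories apart.
unique_ok = len(np.unique(attribute_matrix, axis=0)) == attribute_matrix.shape[0]

# Every feature should cover more than one category.
coverage_ok = (attribute_matrix.sum(axis=0) > 1).all()

print(unique_ok, coverage_ok)
```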

After that, I had to divide the dataset and decide which characters would be used as the zero-shot categories. For this I selected five characters (“J”, “D”, “H”, “R”, “Z”) with relatively different features, and put all their data aside. The rest of the dataset, with the remaining 21 categories, was split into training and testing sets for fitting and evaluating the model.
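A sketch of the split, assuming the Kaggle CSV layout where the first column is the label (0 = “A”, …, 25 = “Z”) and the remaining 784 columns are pixel values (the file name is a placeholder):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("a_z_handwritten_data.csv")  # placeholder file name

# Label indices of the five zero-shot characters: D, H, J, R, Z.
zero_shot_labels = {3, 7, 9, 17, 25}

label_col = data.columns[0]
mask = data[label_col].isin(zero_shot_labels)
unseen = data[mask]   # put aside entirely for zero-shot evaluation
seen = data[~mask]    # the remaining 21 categories

# Split the seen categories into training and testing sets
# (the 80/20 ratio is an assumption, not taken from the article).
train, test = train_test_split(seen, test_size=0.2, random_state=42)
```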

There are several approaches to creating a model for zero-shot learning. I wanted to try the most straightforward one, which simply predicts the features for any given input. The input of this model is an image of a character, and the target is its encoded category vector (0 or 1 for every feature). This task can be regarded as multi-label classification, for which I used a setup with two convolutional layers in Keras.
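The exact architecture is in the notebook linked at the end; a minimal sketch of such a setup, assuming 28×28 grayscale inputs and the 15-dimensional feature targets, could look like this:

```python
from tensorflow.keras import layers, models

NUM_FEATURES = 15  # length of the category vectors

# Two convolutional layers followed by a multi-label (sigmoid) head.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    # Sigmoid + binary cross-entropy treats each of the 15 features as
    # an independent yes/no prediction, i.e. multi-label classification.
    layers.Dense(NUM_FEATURES, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```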

To evaluate and actually use the trained model, I had to somehow map the predicted category vectors back to their corresponding categories. This was done using nearest-neighbour matching with Euclidean distance. At first, I evaluated the model on the observed categories using the testing set, and it reached an accuracy of 96.53%.
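A sketch of this mapping step (candidate_vectors holds the category vectors under consideration, and candidate_labels the matching category names):

```python
import numpy as np

def predict_categories(model, images, candidate_vectors, candidate_labels):
    """Map each predicted feature vector to its nearest category vector."""
    predicted = model.predict(images)  # shape: (n_images, 15)
    # Euclidean distance from every prediction to every candidate vector.
    distances = np.linalg.norm(
        predicted[:, None, :] - candidate_vectors[None, :, :], axis=-1
    )
    return [candidate_labels[i] for i in distances.argmin(axis=1)]
```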

Once I knew that the model was able to generalise properly, I could start evaluating it on the unseen zero-shot categories. Based on this paper, I found out that there are actually two ways in which zero-shot models are evaluated.

The first one is a somewhat restrictive setup, where at prediction time we know whether an instance comes from the seen or the unseen categories. Zero-shot learning has been criticised for this, since in real-world applications we usually do not have this information available. Evaluated in this way on the data that had been put aside, our model reached an accuracy of 68.36%.

The second setup takes all the possible categories into consideration at prediction time. The field dealing with this is called generalised zero-shot learning. The results evaluated in this way are usually significantly lower, since the observed classes act as distractors in the search space. This was our case as well: evaluated on the same data, the model reached an accuracy of only 10.83%. The predictions for the character “R” were simply almost always closer to “P”, the predictions for “J” closer to “U”, and so on. One way of solving this can be found in this paper.
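With the predict_categories helper from above, the two setups differ only in which category vectors are offered as candidates (the seen_*/unseen_* variables are assumed to hold the attribute-matrix rows and labels for the respective category sets):

```python
import numpy as np

# Restricted setup: only the 5 unseen categories compete.
restricted_preds = predict_categories(
    model, unseen_images, unseen_vectors, unseen_labels
)

# Generalised setup: all 26 categories compete, so seen categories
# such as "P" can act as distractors for the unseen "R".
all_vectors = np.vstack([seen_vectors, unseen_vectors])
all_labels = list(seen_labels) + list(unseen_labels)
generalised_preds = predict_categories(
    model, unseen_images, all_vectors, all_labels
)
```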

Conclusion

This experiment gave me a nice first insight into the field of zero-shot learning and what to really expect from it. I hope it did the same for you. I must admit that I was expecting better performance, mainly in the generalised setting. However, the model that I used was pretty simple, and a more sophisticated one could surely do better. An approach that seems promising, and which I would like to try in the future, is a bilinear model. Such a model takes both the image and a category vector as input, and predicts whether the vector belongs to the actual category of the image or not.
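For illustration only, here is one way such a compatibility model could be wired up in Keras; the architecture details are my own guess, not taken from the article or the paper:

```python
from tensorflow.keras import layers, models

# Two inputs: an image and a candidate category vector.
image_in = layers.Input(shape=(28, 28, 1))
vector_in = layers.Input(shape=(15,))

# Project the image...
x = layers.Conv2D(32, (3, 3), activation="relu")(image_in)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dense(64)(x)

# ...and the category vector into a shared embedding space.
v = layers.Dense(64)(vector_in)

# Bilinear compatibility: the dot product of the two projections is
# equivalent to image^T W vector for a factorised W. The sigmoid turns
# the score into "does this category vector belong to this image?".
score = layers.Dot(axes=1)([x, v])
match = layers.Activation("sigmoid")(score)

model = models.Model(inputs=[image_in, vector_in], outputs=match)
model.compile(optimizer="adam", loss="binary_crossentropy")
```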

Thank you for reading. All the code is available in this Kaggle notebook, so feel free to experiment and train your own models. I will leave you with a visualisation of the first convolutional layer, since I was curious whether its learned filters look anything like the features that I designed.

A feature map of the first convolutional layer. Image by Author.
