Active learning deals with incrementally obtaining more labelled data for supervised learning by querying an oracle (e.g. a human annotator) for the labels of unlabelled data instances. This involves retraining a learning model many times, whenever newly labelled data is obtained, in order to make new predictions that allow us to evaluate how informative unlabelled data instances would be if we obtained their labels. For high-dimensional data such as images, however, models, in particular (convolutional) neural networks, often take a long time to train. Having to retrain a complex neural network after each newly queried label is therefore very slow and often infeasible in practice. Although active learning targets situations where labelled data is scarce, it is often possible to obtain large amounts of unlabelled data.

We propose an active learning scheme for image data that fully utilises this pool of unlabelled data by first learning useful representations from it. We use a (convolutional) variational autoencoder (VAE) to obtain latent space representations of all the data. The VAE is trained under the assumption that the latent variables follow a given prior distribution, which results in a latent space that provides a more structured and lower-dimensional representation of the data. These properties allow us to use a relatively simple and fast classification network, a multilayer perceptron (MLP), in an active learning setting, so that each active learning iteration is much faster than if we had to retrain a slower, more complex neural network every time. Our experiments on the MNIST and SVHN image data sets show that simple MLPs trained on latent space representations obtained by convolutional VAEs can achieve decent accuracy. We show that when performing active learning in latent space, querying labels by means of uncertainty sampling can provide significant improvements over passive learning (i.e. querying random samples from the unlabelled data). Moreover, the latent space appears to make (active) learning more reliable, in the sense that classification accuracy fluctuates less between experiments.
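To make the pipeline concrete, the sketch below illustrates a single active learning iteration in latent space. It assumes the VAE encoder has already mapped the labelled set and the unlabelled pool to latent vectors; the names `z_train`, `y_train`, `z_pool` and the MLP configuration are illustrative assumptions, not the exact setup used in our experiments.

```python
# Minimal sketch of one uncertainty-sampling iteration on VAE latent vectors.
# `z_train`, `y_train`: latent vectors and labels of the labelled set.
# `z_pool`: latent vectors of the unlabelled pool. Names are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

def entropy(probs):
    # Predictive entropy; higher means the classifier is less certain.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def active_learning_step(z_train, y_train, z_pool, query_size=10):
    # Retrain a small MLP on the latent representations; this is fast
    # because the latent space is low-dimensional.
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
    clf.fit(z_train, y_train)

    # Uncertainty sampling: pick the pool instances with the highest
    # predictive entropy and send them to the oracle for labelling.
    scores = entropy(clf.predict_proba(z_pool))
    query_idx = np.argsort(scores)[-query_size:]
    return clf, query_idx
```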
We also investigate whether explicitly modelling representativeness can prevent uncertainty sampling from querying outliers, but we observe no improvements in performance for the methods we used.
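One standard way to model representativeness is information density, where the uncertainty score of a candidate is weighted by its average similarity to the rest of the pool, so that isolated outliers are down-weighted. The sketch below illustrates this idea; it is not necessarily the exact variant we evaluated, and the function and parameter names are assumptions.

```python
# Sketch of density-weighted uncertainty sampling (information density).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def density_weighted_scores(uncertainty, z_pool, beta=1.0):
    # Average similarity of each pool instance to all others acts as a
    # representativeness term; beta controls how strongly it is weighted.
    density = cosine_similarity(z_pool).mean(axis=1)
    return uncertainty * density ** beta
```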
To further speed up the active learning iterations, we query batches of informative instances at once, rather than single instances, before retraining the classification model. We evaluate several diversity techniques that aim to reduce redundancy within a batch, which leads to only small improvements in a limited number of situations.
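As an example of such a diversity technique, the sketch below selects a batch by restricting attention to the most uncertain candidates and then greedily keeping instances that are far apart in latent space (farthest-point selection). This is an illustrative heuristic under assumed names, not necessarily one of the specific techniques we evaluated.

```python
# Sketch of a diversity-aware batch query in latent space.
import numpy as np

def diverse_batch(z_pool, uncertainty, batch_size=10, candidate_factor=5):
    # Restrict to the most uncertain candidates, then spread the batch out
    # by greedily adding the candidate farthest from the current selection.
    cand = np.argsort(uncertainty)[-batch_size * candidate_factor:]
    chosen = [cand[-1]]  # start from the single most uncertain instance
    while len(chosen) < batch_size:
        dists = np.min(
            np.linalg.norm(z_pool[cand, None, :] - z_pool[chosen], axis=2),
            axis=1,
        )
        chosen.append(cand[int(np.argmax(dists))])
    return np.array(chosen)
```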