Learning with Limited Labeled Data

FF10 · Apr 2019

This is an applied research report by Cloudera Fast Forward. We write reports about emerging technologies, and conduct experiments to explore what’s possible. Read our full report about Learning with Limited Labeled Data below, or download the PDF. The prototype for our report on Learning with Limited Labeled Data is called Active Learner. It is a tool that sheds light on and provides intuition for how and why active learning works. The prototype allows one to visualize the process of active learning over different types of datasets and selection strategies. We hope you enjoy exploring it.

1. Introduction

Data makes what is impossible today possible tomorrow. In recent years, machine learning technologies – especially deep learning – have made breakthroughs that have turned science fiction into reality. Autonomous cars are almost possible, and machines can comprehend language. These technical advances are unprecedented, but they hinge on the availability of vast amounts of data. And for the form of machine learning known as supervised learning, having data itself is not sufficient. Supervised machine learning, while powerful, needs data in a form that can serve as examples for what the machine should learn. These examples often manifest themselves in the form of labeled data; the labels are used to teach and guide the machine.

Figure 1.1: Supervised machine learning excels in cases where vast amounts of labeled data are available.

Unfortunately, data in the real world does not come nicely packaged with labels. Enterprises collect massive amounts of data, but only a small sliver (if any) is labeled. In order to leverage supervised machine learning opportunities, efforts are made to manually label the data – but this undertaking can be prohibitively expensive, inefficient, and time-consuming.

Figure 1.2: Real world data is often unlabeled and unorganized.

Learning with limited labeled data is a machine learning capability that enables enterprises to leverage their vast pools of unlabeled data to open up new product possibilities. While there are many approaches, most of them attempt to capture the available labeled data using a representation that can be further adjusted if and when new labels are obtained.

Active learning is one such approach. It takes advantage of collaboration between humans and machines to smartly pick a small subset of data to be labeled. A machine learning model is then built using this subset.

This report, with its accompanying prototype, explores active learning and its implications. While not a new framework, active learning has recently been adapted to deep learning applications, where the labeled data requirement is even more stringent. Along with the availability of tooling and a maturing supporting ecosystem, active learning is now newly exciting.

2. Machine and Human Collaboration

Machines learn broadly in two ways: with supervision and without. When learning with supervision, they rely on examples that illustrate the relationship between input features and output. With sufficient examples, machines can learn to accurately predict an output given a set of input features. Machines can also learn without supervision. In doing so, they do not require examples and learn to draw inferences purely from the input features.

Figure 2.1: Supervised machine learning requires labels, while unsupervised machine learning draws inferences from the structure of the data.

Being able to teach machines with examples is powerful, but it requires vast amounts of data. In addition, the data needs to exist in a form that allows relationships between input features and output to be uncovered. One way to fulfill this requirement is through creating labels for the input features. If we want machines to analyze sentiment for product reviews, we need to provide examples of positive and negative reviews. This implies labeling each review, as represented by a set of input features, as either positive or negative. Labeled data is therefore a crucial ingredient in supervised machine learning.

But in reality, neatly labeled data is not available for most supervised machine learning opportunities. For financial institutions, risk assessment is crucial because it sets the amount of capital required to absorb systemic instability. Large volumes of contracts and loan agreements exist and can be used to build a risk model, but not all of them have been processed to extract relevant information such as the purpose of the agreement, the loan amount, and the collateral amount. In corporate IT, customer service chat logs are available and can be used to identify customer concerns and satisfaction levels, but not many are annotated. In healthcare, medical images are abundant and can be used to build a diagnostic model, but these are rarely labeled properly. In retail, many customer reviews exist and can be used to build recommendation systems, but most reviews are not even labeled as positive or negative.

Figure 2.2: Risk assessment, medical diagnosis, and chat log analysis are a few example use cases where data may be plentiful but unlabeled.

While some opportunities can be reframed through the lens of unsupervised learning, this typically provides a proxy solution and does not answer the original question. One can cluster chat logs into different types; each cluster can be analyzed to provide insight into whether the customer is happy or angry. This is a coarse estimate, however, and does not allow one to leverage the many advances that have been made in supervised machine learning.

One straightforward solution is to manually create labels linking all input features to outputs. Intuitively, this does not scale. Labeling is expensive – projects range from tens of thousands to millions of dollars. Labeling is also time-consuming and often requires domain expertise. Some labels are easy to attach, others less so. It is easy to label images of cars or boats, but it is not simple to label medical images of tumors as benign or malignant.

Surely, there has to be a better way. Instead of labeling all input features (datapoints), perhaps one can identify and only label the ones that are most impactful? If machines were allowed to ask questions, perhaps that would help to identify those datapoints. If both datapoint identification and labeling were done properly, perhaps machines could learn just as well with a smaller set of data?

This is the premise for active learning. Active learning opens up product possibilities previously constrained by limited labels. Applications can be built much faster and are less costly, while requiring a smaller subset of data to be labeled.

2.1 Labeling Data Smartly

Active learning attempts to smartly pick a small subset of datapoints for which to obtain labels. It is an iterative process, and ideally has access to some initial labels to start. These initial labels allow a human to build a baseline machine learning model, and use it to predict outputs for all the unlabeled datapoints. The model then looks through all its predictions, flags the one that it has the most difficulty with, and requests a label for it. The human steps in to provide the label, and the newly labeled data is combined with the initial labeled data to improve the model. Model performance is recorded, and the process repeats.
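
To make the loop concrete, here is a minimal sketch in Python using scikit-learn as the baseline model. The array names and the `oracle` callable standing in for the human labeler are placeholders, not part of any particular library.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle, n_rounds=10):
    """Minimal pool-based active learning loop (illustrative sketch)."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        # 1. Build (or rebuild) the baseline model from all labels collected so far.
        model.fit(X_labeled, y_labeled)

        # 2. Predict on the unlabeled pool and score how difficult each point is;
        #    here, difficulty = 1 - highest predicted class probability.
        probs = model.predict_proba(X_pool)
        difficulty = 1.0 - probs.max(axis=1)

        # 3. Flag the hardest datapoint and request its label from the human.
        idx = int(np.argmax(difficulty))
        new_label = oracle(X_pool[idx])

        # 4. Combine the newly labeled point with the existing labeled data and repeat.
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.append(y_labeled, new_label)
        X_pool = np.delete(X_pool, idx, axis=0)
    return model
```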

Figure 2.3: The active learning loop.

In each iteration, the amount of labeled data increases, and the performance of the model changes. If active learning is effective, two conditions should be met. First, the model’s performance should trend upward. Second, the amount of labeled data required to hit a performance threshold should be small; not all unlabeled data needs to be labeled.

Figure 2.4: Model performance trending up with each round of active learning.

Given that active learning is an iterative process, how does one know when to stop? Each label has an associated cost of acquisition, representing the resources it consumes (both time and money). Labeling images as cats or bears is simple and fast; no domain expertise is required. Assigning a label takes less than a second. Labeling chat logs as positive, negative, or neutral might require some domain expertise. Assigning a label takes less than a minute. Labeling automotive insurance claims for damage severity requires significant domain expertise. Assigning a label takes a couple of minutes.

In active learning, each round of labeling can result in a performance improvement. Each round of labeling also incurs a cost of acquisition. While it is possible to estimate the cost of acquisition, the performance improvement only becomes apparent after a round of active learning. Determining when to stop, then, can either be a static or a dynamic decision. Setting a budget for labeling or a threshold for a performance target are examples of static criteria. On the other hand, dynamic decisions are based on the trade-off between the cost of acquisition and model performance. In the initial rounds, one should expect to see a relatively large increase in performance per label. As the model is refined, the incremental improvement that each round of newly labeled data provides decreases, while the cost per label remains somewhat constant. When it is no longer worthwhile to obtain new labels, the active learning process stops.

Figure 2.5: Stopping criteria for active learning.

The type of active learning we’ve been describing so far is known as pool-based active learning. The prerequisite for pool-based active learning is, as the name implies, a pool of unlabeled data and some labeled data. In this scenario, all the unlabeled data is available – the machine can evaluate all the datapoints concurrently to determine which one is the hardest to label. For some machine learning applications, such as predictions based on streaming sensor data in autonomous vehicles,[1] an accessible pool of unlabeled data does not exist. The data is too voluminous to store. Rather, data arrives in a stream. In this scenario, the machine can only look at the data once, and needs to decide whether to request a label or throw the data away. This is known as stream-based active learning. This report will not attempt to cover both scenarios. Instead, the focus will be on the most common scenario for enterprises with access to large pools of data: pool-based active learning.

2.2 Choosing What to Label

At the heart of active learning is a machine (learner) that requests labels for datapoints that it finds particularly hard to predict. The learner follows a strategy, and uses it to identify these datapoints. To evaluate the effectiveness of the strategy, a simple approach for choosing datapoints needs to be defined. A good starting point is to remove the intelligence of the learner; the datapoints are chosen independently of what the learner thinks.

2.2.1 Random Sampling

When we take the learner out of the picture, what is left is a pool of unlabeled data and some labeled data from which a model can be built. To improve the model, the only reasonable option is to randomly start labeling more data. This strategy is known as random sampling, and selects unlabeled datapoints from the pool according to no particular criteria. You can think of it as being akin to picking a card from the top of a shuffled deck, then reshuffling the deck without the previously chosen card and repeating the action. Because the learner does not help with the selection process, random sampling is also known as passive learning.
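
A random sampling query is trivial to implement; the sketch below assumes `X_pool` holds the unlabeled datapoints.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def random_query(X_pool, n_queries=1):
    """Passive-learning baseline: pick unlabeled datapoints uniformly at random."""
    return rng.choice(len(X_pool), size=n_queries, replace=False)
```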

Figure 2.6: Random sampling is like picking the top card from a shuffled deck.

With this strategy you are guaranteed more – but not necessarily the best set of – labeled data. The model may improve as more data is added, especially in the initial rounds. But it is possible that some random queries, where a query is defined as choosing an unlabeled datapoint and getting a label for it, will be very uninformative. As a result, more labeled data (as compared to other, smarter strategies) is needed to achieve a target model performance.

Although random sampling might not be data-efficient, it is useful in two ways. First, it serves as a baseline against which all other strategies should be compared. Second, it is an appropriate strategy when there is very little or no labeled data available. Randomly getting some labeled data allows one to jumpstart a model; it is better than having no model at all.

2.2.2 Uncertainty Sampling

If we (humans) were asked to predict classifications of images, we would have difficulty with images about which we are uncertain. Getting labels for these images would help refine our classification approach. The same applies to learners. In uncertainty sampling, the learner looks at all unlabeled datapoints and surfaces ones about which it is uncertain. Labels are then provided by a human, and fed back into the model to refine it.

Uncertainty sampling is different from random sampling in that it enlists the help of the learner in choosing datapoints to label, and that it uses a specific criterion – uncertainty – to guide the selection of these datapoints. There are various ways to quantify uncertainty; we will discuss two important ones.

2.2.2.1 Margin Sampling

Suppose we have a simple model for sentiment analysis, along with predetermined lists of positive and negative words. The model takes in two features – the number of words with positive connotation, along with the number of words with negative connotation – and predicts whether a product review is positive or negative. Further, suppose we have looked over the features and determined that a model with a linear decision boundary is appropriate, as in Figure 2.7. Points to the right of the decision boundary represent positive reviews; the number of positive words is greater than the number of negative words in these reviews. Points to the left of the decision boundary represent negative reviews; there are more negative than positive words in these reviews. The decision boundary, then, is the line that separates the two classes. It is intricately related to the model. Any model refinement causes the boundary to shift; as a result, datapoints are classified differently and model performance metrics reflect that.

Figure 2.7: Sentiment model with a linear decision boundary. Points right on the boundary are equally likely to be classified as positive or negative.

Datapoints far away from the decision boundary are safe from changes in the decision boundary. This implies that the model has high certainty in these classifications. Datapoints close to the boundary, however, can easily be affected by small changes in the boundary. The model (learner) is not certain about them; a slight shift in the decision boundary will cause them to be classified differently. The margin strategy therefore dictates that we surface the datapoint closest to the boundary and obtain a label for it.

Figure 2.8: A slight change in decision boundary changes the classifications for datapoints close to the boundary.

Choosing datapoints this way allows us to refine the decision boundary, resulting in better classification accuracy. This necessarily clusters our labeled datapoints around that boundary, and exposes a fundamental trade-off in active learning: informativeness vs. representativeness. In focusing our data selection on the region near the decision boundary, we are choosing to label very informative datapoints, but failing to represent the whole data distribution. In 2.3.2 The Impact of a Strategy we discuss the implications of this trade-off.
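
As a rough sketch of the margin strategy for a binary linear classifier, the learner simply surfaces the pool point with the smallest distance to the separating hyperplane. The fitted `model` (e.g., scikit-learn's LinearSVC or LogisticRegression) and `X_pool` are assumed.

```python
import numpy as np

def margin_query(model, X_pool):
    """Surface the unlabeled point closest to a linear decision boundary.

    Assumes a fitted binary linear classifier exposing decision_function and coef_.
    """
    scores = model.decision_function(X_pool)           # signed score for each pool point
    distances = np.abs(scores) / np.linalg.norm(model.coef_)
    return int(np.argmin(distances))                   # the point the model is least sure about
```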

2.2.2.2 Entropy Sampling

Margin sampling aims to classify datapoints accurately by focusing on the difficult ones near the decision boundary. All misclassifications are treated equally; the goal is simply to reduce them. Misclassifying a cat as a tiger is treated the same as misclassifying a pony as a horse. In many use cases, however, one might want to treat misclassifications differently. A strongly confident misclassification of a benign image as malignant should not be treated the same as a weakly confident one; it should be penalized more. To achieve this, we can use entropy sampling.

Entropy formalizes the intuitive notion that the outcome of more uncertain events carries more information than that of events about which we are confident. In the case of a coin flip, where the outcome is either a head or a tail, entropy is at most 1 bit. For a fair coin, a coin flip is equally likely to result in a head or a tail. The outcome is difficult to predict; it has the maximum amount of uncertainty along with an entropy of 1. For a biased coin (where either a head or a tail is much more likely), the outcome of a coin toss is less uncertain, and the entropy is less than 1. For a double-headed coin, the outcome of the coin toss is known. There is no uncertainty, and entropy is 0.

Figure 2.9: Entropy of a coin flip.

In the context of active learning, labeling datapoints that have high entropy should improve our classifier more than those that have low entropy. To employ the entropy strategy, the learner computes entropy for all the unlabeled datapoints and requests a label for the one with the highest entropy. In contrast to margin sampling, the entropy approach results in a more representative dataset.
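
A minimal sketch of entropy-based selection, assuming a probabilistic classifier with `predict_proba` and an unlabeled pool `X_pool`:

```python
import numpy as np
from scipy.stats import entropy

def entropy_query(model, X_pool):
    """Request a label for the point whose predicted class distribution
    has the highest entropy (the fair-coin end of the spectrum above)."""
    probs = model.predict_proba(X_pool)       # shape: (n_points, n_classes)
    scores = entropy(probs.T, base=2)         # per-point entropy, in bits
    return int(np.argmax(scores))
```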

Although the idea behind entropy sampling is simple, it is surprisingly effective and often outperforms more complex and computationally demanding approaches.[2] As such, entropy sampling and random sampling form the core strategies for most active learning approaches.

2.2.3 Other Strategies

There are many other ways a learner can choose which points to label. For example, to address the tension between informativeness and representativeness, density sampling can be used. Density sampling attempts to pick from regions with many datapoints. Whereas uncertainty sampling tends to select informative points lying near decision boundaries, density sampling moderates this, favoring selection from regions where the data is concentrated.

For comprehensive and technical reviews of other strategies, please refer to the seminal paper by Burr Settles. For active learning applied to deep learning, we explore various strategies in 3. Modifications for Deep Learning. In 7. Future, we highlight other approaches of different maturity levels that can be used in the context of learning with limited labeled data.

2.3 Peeling Back the Layers

The promise of active learning is tempting – it relaxes the stringent labeled data requirement and makes it possible to build solid models faster, using limited labeled data. Can it be applied to any supervised machine learning problem? Does it always work? What is the catch?

To use active learning, decisions need to be made about both the type of learner and the strategy. The learner makes predictions for all the unlabeled data, and relies on the strategy to surface datapoints that are informative. In our sentiment analysis example, an explicit decision was made to use a learner with a linear decision boundary. This decision was driven by the labeled data available prior to starting active learning. An analysis of the labeled data at that point suggested that it was linearly separable; hence a linear model was appropriate. Once the learner is chosen, the strategy and the learner work hand in hand. The strategy helps the learner identify difficult datapoints, and the learner uses these to refine itself.

Figure 2.10: Feedback loop between the learner and the strategy

2.3.1 The Impact of a Learner

If we make the wrong choice for a learner – for example, picking a linear model when the problem is nonlinear – the strategy will not be effective, and more labeled data will not help much. Assume we have a dataset that cannot be separated by a linear decision boundary. Further assume we have chosen to use a linear learner along with margin sampling. Points around the decision boundary will be surfaced, and labels will be provided by a human. The linear learner will proceed to incorporate the newly provided labels to adjust its decision boundary. This will cause the boundary to shift, but will not turn a linear decision boundary into a nonlinear one capable of separating the two classes.

Figure 2.11: A linear model applied to a non-linear classification setting makes the data selection strategy inefficient.

The need to choose a learner and understand the consequences of this choice is not unique to active learning; it is fundamental to machine learning. But the lack of labeled data, a common feature of active learning settings, makes it harder to ascertain the appropriateness of a learner. The impact of choosing the wrong learner is also more pronounced in these settings, because the learner does not just predict an outcome; together with the strategy, it also selects the data used to refine itself. This feedback loop amplifies the effect of an inappropriate choice of learner.

One way to mitigate this effect is to use multiple learners, each of a different type (linear and nonlinear, for example). The best datapoints to select for labeling are then the ones that cause the most disagreement among the learners. Query by committee is an example of such an approach. Instead of training a single learner, a “committee” of diverse learners is maintained. The datapoint selected to be queried is the one whose label the committee most disagrees about. This approach prevents the labeled pool from becoming too tailored to one algorithm, at the expense of maintaining several learners.
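
A hedged sketch of query by committee, using vote entropy over a list of already fitted, diverse classifiers (the committee itself is assumed to exist):

```python
import numpy as np
from scipy.stats import entropy

def vote_entropy_query(committee, X_pool):
    """Pick the pool point whose predicted label the committee of fitted,
    diverse classifiers disagrees about the most (vote entropy)."""
    votes = np.stack([clf.predict(X_pool) for clf in committee])   # (n_members, n_points)
    disagreement = []
    for point_votes in votes.T:
        _, counts = np.unique(point_votes, return_counts=True)
        disagreement.append(entropy(counts / len(committee), base=2))
    return int(np.argmax(disagreement))
```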

2.3.2 The Impact of a Strategy

In active learning, the coupling between learner and strategy surfaces the datapoints that are most informative, but these datapoints only give us a glimpse into some portion of the unlabeled dataset – a specific region around the decision boundary, for example. Our understanding of the data is therefore biased. This bias is dominated by the choice of strategy, but is introduced by both the learner and the strategy.

Figure 2.12: The most informative datapoints are not always representative of all the unlabeled datapoints.

The existence of a bias leads to two questions. First, can we still use the final labeled dataset for other machine learning opportunities? The answer is sometimes. Research has shown that it is possible for one type of learner to learn from datasets constructed by another type of learner[3], but this is not true in all circumstances. Second, what is the effect of a biased representation? The dominant effect is a model that performs well during training, but does not generalize. Intuitively, this is because the learner is trained to perform well on a type (or region) of input features; once we venture outside of this region, the learner’s performance degrades.

In practice, the approach to strategy selection is to start with the easiest strategy and move up to progressively harder ones. This implies starting with random sampling and moving on to entropy-based uncertainty sampling, while documenting and understanding the improvement (or lack thereof) in performance. To understand the effect of a biased dataset, the density-based approaches described in 2.2.3 Other Strategies can be attempted.

2.3.3 Label Quality

In our description of active learning, we assume a human will provide labels for the datapoints that are surfaced by an active learning strategy. In practice, this step is not trivial and can be accomplished through various approaches. “A human” can mean an in-house domain expert, an outsourced domain expert, or a crowdsourced worker, or the task may be outsourced to an enterprise that provides a workforce for labeling. Label quality varies within and across each of these options. It becomes harder to control and manage as the labeling task becomes more difficult. In addition, humans become fatigued over time, and it is easy to imagine a scenario where the labeling quality gradually becomes inconsistent.

In some cases, labeling tasks are too difficult for a human. Take credit card fraud as an example. A human simply cannot accurately determine whether a transaction is fraudulent just by looking at it; one needs to wait for the actual outcome. For these cases, active learning will not be helpful.

To solve the label quality issues, one can use a two-pronged approach to target workforce and fatigue issues separately. Regarding the workforce, managed and well-trained human labelers (annotators) will provide consistently better labels. (See 6. Ethics for more on workforce issues.) Multiple annotators can be used to arrive at a majority decision for labels in difficult tasks. Reducing human fatigue in the labeling process can be accomplished through a clever user interface designed to reduce cognitive load. As an example, Prodigy (see 5.3 Data Labeling Service Providers), an active learning workflow tool, formulates labeling tasks as a binary decision.

2.4 When and How to Use Active Learning

Active learning is particularly suited for machine learning opportunities that rely on image recognition and language processing. In addition, for applications where label availability across different classes is highly skewed, active learning tends to request labels for the underrepresented class, thus softening the impact of class imbalance. Medical diagnosis use cases, for example, have many fewer malignant images than benign ones.

Active learning changes the model development workflow, and requires budget to be set aside for labeling purposes. Given that more data does not always result in better performance, Figure 2.13 illustrates one possible way to determine how and when to use active learning.

Figure 2.13: Active learning flowchart

2.4.1 When to Stop the Active Learning Loop

If active learning is the right approach, the follow-on task is to determine the number of datapoints for which to obtain labels. Unfortunately, there is not a one-size-fits-all number. Rather, it is dependent on the problem and the properties of the dataset. Various stopping criteria were mentioned earlier, in 2.1 Labeling Data Smartly. In particular, it is helpful to monitor the performance and effectiveness of each round of newly labeled data. This can be accomplished using two plots: the learning curve and data efficiency. The learning curve shows how each round of new labels impacts the model’s performance. Data efficiency compares the number of labeled datapoints required to achieve a given error rate under two different strategies. It pits one strategy against another and illustrates the incremental gain (or lack thereof) from using a more complex strategy. Combining both plots allows us to separate out the effectiveness of the labeled data from the effectiveness of the active learning strategy itself.

Figure 2.14: Learning curve shows the impact of each additional round of active learning. Data efficiency curve shows the effectiveness of a strategy when compared to random sampling.

3. Modifications for Deep Learning

The active learning approach described in 2.1 Labeling Data Smartly focuses on finding the best datapoint to have labeled. The process is iterative. First, the most informative datapoint, as judged by various criteria, is selected from a pool of unlabeled data. Second, this datapoint is added to the training dataset. Third, the model is retrained with the new training set. This process repeats until some stopping criterion is met, driven by a combination of budget and performance constraints.

Deep learning introduces a couple of wrinkles that make direct application of this existing approach ineffective. The most obvious issue is that adding a single labeled datapoint does not have much impact on deep learning models, which train on batches of data. In addition, because the models need to be retrained until convergence after each point is added, this becomes an expensive undertaking – especially when viewed in terms of the performance improvement vs. acquisition cost (time and money) trade-off. One straightforward solution is to select a very large subset of datapoints to label. But depending on the type of heuristics used, this could result in correlated datapoints. Obtaining labels for these datapoints is not ideal – datapoints that are independent and diverse are much more effective at capturing the relationship between input and output.

The second problem is that existing criteria used to help select datapoints do not translate to deep learning easily. Some require computation that does not scale to models with high-dimensional parameters. (For example, methods such as optimal experiment design require computing the inverse of the Hessian matrix at each iteration.) These approaches are rendered impossible with deep learning. For the criteria that are computationally viable, reinterpretation in the light of deep learning is necessary.

There are various approaches to making active learning applicable to deep learning. Solutions specific to the batch nature of deep learning focus on selecting a set of unlabeled datapoints, such that a model trained on this set will perform as closely as possible to a model trained on the entire unlabeled dataset. Others focus on reformulating various datapoint selection approaches specifically for deep learning. The rest of this chapter takes the idea of uncertainty first introduced in 2.2.2 Uncertainty Sampling and examines it in the context of neural networks.

3.1 Distance from the Decision Boundary

In 2.2.2.1 Margin Sampling we introduced a simple heuristic to capture the notion of uncertainty. This heuristic is based on the distance between a datapoint and the decision boundary of the model. Because a slight change in the decision boundary may change the classifications of datapoints right around it, the model is most uncertain about those datapoints. Providing labels for the datapoints closest to the decision boundary is therefore most effective for refining the model. This approach requires computing the distance between a datapoint and the decision boundary. For deep neural networks, this computation is intractable because the decision boundary is highly complex and nonlinear.

One alternative is to compute the distance between the datapoint and its closest neighbor of a different class. This is computationally expensive, however, and only provides a coarse estimate of the distance between the datapoint and its decision boundary.

Figure 3.1: Approximating the distance between a datapoint and the decision boundary by using the distance between the same datapoint and its closest neighboring datapoint from another class

Another (more creative) approach is to use adversarial perturbations to estimate the distance. An adversarial perturbation occurs when the input to a neural network is modified with a small but specific noise that results in an unexpected misclassification – the input (datapoint) has crossed over the decision boundary. If one can find the smallest perturbation that causes a misclassification, its magnitude can then be used to estimate the distance between the datapoint and its decision boundary.

There are different types of adversarial attacks with varying degrees of strength. A technique that is harder to counter will generally provide a better estimate of the distance between the datapoint and the decision boundary. In addition, since the adversarial sample generated by the perturbation shares the same label as the original sample, one can potentially add the newly generated sample to the training dataset. This could result in a model that is both more robust and less sensitive to small adversarial perturbations.
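
The sketch below illustrates the general idea rather than any specific published attack: take small gradient-sign steps until the predicted class flips, and use the norm of the accumulated perturbation as a rough distance estimate (a stronger attack would give a tighter one). The `model` and input tensor `x` are assumed to be PyTorch objects.

```python
import torch
import torch.nn.functional as F

def adversarial_distance(model, x, n_steps=50, step_size=1e-2):
    """Rough estimate of a datapoint's distance to the decision boundary:
    nudge the input with small gradient-sign steps until the prediction
    flips, then return the norm of the accumulated perturbation."""
    model.eval()
    x = x.detach()
    original_class = model(x.unsqueeze(0)).argmax(dim=1)
    perturbed = x.clone()
    for _ in range(n_steps):
        perturbed.requires_grad_(True)
        logits = model(perturbed.unsqueeze(0))
        if logits.argmax(dim=1).item() != original_class.item():
            break                                    # crossed the decision boundary
        loss = F.cross_entropy(logits, original_class)
        grad, = torch.autograd.grad(loss, perturbed)
        perturbed = (perturbed + step_size * grad.sign()).detach()
    return torch.norm(perturbed - x).item()          # magnitude ~ distance to the boundary
```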

Figure 3.2: Adversarial active learning strategy - approximating the distance between a datapoint and the decision boundary based on the perturbation magnitude

Adversarial perturbations occur when the inputs to a neural network are modified with small but specific noises, resulting in misclassifications by the network. These noises or perturbations are almost imperceptible to the human eye. When an adversarial perturbation is added to an image of a panda, for example, the image ends up being classified as a gibbon with high confidence even though the correct label for both images is “panda.”

Figure 3.3: Adversarial input causing a classifier to miscategorize a panda as a gibbon. Image from Attacking Machine Learning with Adversarial Examples.

Although using adversarial perturbations allows one to translate the margin sampling approach to deep neural networks, each unlabeled datapoint needs to be perturbed. Depending on the type of adversarial perturbation used, this approach can become computationally heavy. Is there a fundamentally different approach to measuring uncertainty?

3.2 Revisiting Uncertainty Using a Bayesian Framework

Bayesian inference is a statistical technique that allows us to encode expert knowledge into a model by stating prior beliefs about what we think our data looks like. As we collect more data, we then update those beliefs, refining our predictions based on the evidence. Bayesian inference has three components: the probability that a hypothesis is true (the prior); the probability of observing the data, if the hypothesis is true; and the probability that the hypothesis is true, given the data we have (the posterior). The prior is the extent to which we believe a hypothesis; the posterior is our updated belief given the hypothesis and the data. (See our Probabilistic Programming report for an introduction to Bayesian approaches.)

Figure 3.4: Bayes’ rule.

In neural networks, the training and optimization procedures result in a point estimate of the weights – the “best” set of weights is obtained, but we do not know how confident we should be in the result. Instead of a point estimate, Bayesian neural networks give a full picture of all the possible values these weights can take, and how likely each value is to occur given the training data. This is the posterior, and with it we can make statements like “the best set of weights is likely to occur 80% of the time.”

Figure 3.5: Probability distribution of weights for a neural network.

Once we have the posterior, expressing the probability of a prediction given an input and training data – the uncertainty of the neural network – is mathematically straightforward. (For each realization – a specific value – of the weights, we first compute the probability of a prediction using these weights by passing the output of the neural net through a softmax layer. We then multiply by the posterior, which is the probability that this set of weights will occur given the training data. We do this for all possible realizations of the weights, and sum them up.) This mathematical formulation turns out to be non-trivial to compute, and requires the implementation of a Bayesian neural network.

3.2.1 Dropout

One way to estimate the uncertainty of a neural network is to use dropout. Dropout is a well-known regularization technique. The idea is that during training, weights connecting to neurons will be dropped out (set to 0) with some probability. The neural network then becomes smaller, but is still trained on the same training data. This is akin to asking an organization to function well, even with a portion of its workforce removed: the remaining workers are forced to generalize. As a regularization technique, dropout is only turned on during training to force the network to generalize. At inference, dropout is turned off, and the weights are adjusted proportionally. As an example, a network trained with dropout of 50% implies that the final weights should be divided by 2 before the model can be used for inference.

Figure 3.6: Neural network before and after applying dropout.

Imagine now that dropout remains turned on at inference. Weights connecting to neurons will be set to 0 probabilistically, and a different neural network and associated set of weights will materialize at each inference. In effect, each inference introduces a new dropout mask. Given a sufficient number of inferences, we will have many sets of weights, along with the probabilities of their occurrence. This allows us to construct an approximation of the posterior, which represents the probability of a set of weights given the training data. Used this way, dropout becomes a way to sample from the approximate posterior.

Figure 3.7: Each inference with dropout turned on results in a different neural network.

In practice, to estimate uncertainty using dropout masks, we add a dropout layer to the weights and train the neural network as usual. During inference, multiple forward passes are performed, and the averaged softmax vectors are used to predict uncertainty. The uncertainty prediction is fed into any uncertainty-based active learning sampling approach (entropy, for example).
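
A minimal PyTorch sketch of this recipe, assuming a trained `model` that contains dropout layers. Note that `model.train()` also affects layers like batch normalization, so a careful implementation would enable only the dropout modules.

```python
import torch
import torch.nn.functional as F

def mc_dropout_probs(model, x, n_masks=20):
    """Monte Carlo dropout: keep dropout active at inference, run several
    stochastic forward passes, and average the softmax vectors."""
    model.train()                      # keeps dropout layers active at inference
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_masks)])
    return probs.mean(dim=0)           # averaged softmax, ready for entropy-based selection
```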

Estimating uncertainty with dropout introduces a new hyperparameter: the number of dropout masks. The theoretical underpinnings suggest that a high number gives a better approximation to the posterior. In practice, this parameter is tuned, and we have seen it range from single digits to double digits.

3.2.2 Ensembles

An alternative approach to estimating neural network uncertainty is to train an ensemble of classifiers and use the averaged softmax vectors to predict uncertainty. All classifiers in the ensemble are trained with the same training data on the same network architecture, but each is initialized differently. The uncertainty prediction is then fed into any uncertainty-based acquisition function.

Figure 3.8: An ensemble of neural networks leading to averaged softmax vectors.

The ensemble approach is computationally intensive because multiple networks need to be trained. But compared to the dropout approach, which can be interpreted as an approximation to a full ensemble, it is more diverse. In the dropout approach, the weights and initializations are shared among all dropout-induced ensemble members; the ensemble method uses independent networks, each with its own weights and initialization. This difference could matter when the network in the dropout approach converges to a solution that is only locally optimal.

Figure 3.9: Local minimum vs global minimum.

When using ensemble approaches to estimate uncertainty in practice, one needs to determine the number of networks (ensemble members) to use. Although research has shown that the performance of ensemble-based active learning strategies is relatively stable with respect to the number of members used, we expect this to vary across different datasets.

3.3 In a Nutshell

Active learning leverages collaboration between humans and machines to smartly pick the right set of datapoints for which to create labels. Direct application of classical datapoint selection strategies is not effective and sometimes not possible for deep learning. This chapter discusses the reasons and introduces modifications to classical uncertainty approaches to make them amenable to deep learning.

Active learning has also been used somewhat differently in deep learning settings. Labeling is no longer a task limited to humans; the machine and the strategy can also provide labels (pseudo-labels). These pseudo-labels are combined with human-provided labels and fed back into the machine (learner) for the next active learning round.

In uncertainty sampling, both uncertain datapoints and highly certain datapoints are surfaced. Labels for the uncertain datapoints are provided by humans; labels for the highly certain datapoints are provided by the machine. In adversarial active learning strategies (see 3.1 Distance from the Decision Boundary), the adversarial sample shares the same label as the original sample. The label is implied and is a by-product of the strategy. Using active learning in this way provides cost-effective data augmentation and helps the model learn a better representation.

Active learning as a framework for learning with limited labeled data requires a change in the typical machine learning workflow. In the next chapter, we describe our prototype, which was built to illustrate the process of active learning.

4. Prototype

Designing a prototype to showcase a fundamental capability is challenging. Sometimes fundamental capabilities make building machine learning applications faster and more cost-effective – but users do not observe the speed with which the product was built. Other times, fundamental capabilities enable product possibilities – but users may not associate the product breakthrough with the capability. The goal of our prototype is to shed light on and provide intuition for how and why active learning works.

4.1 Datasets

Many of the decisions that need to be made when using active learning depend on the dataset (see Figure 2.13). The dataset affects the choice of learner and the selection strategy. The interaction between the dataset, learner, and selection strategy affects the stopping criteria. Therefore, it is important to illustrate active learning across different types of datasets. Accordingly, the datasets chosen for this prototype range from the simple and well-behaved to the more complex.

To illustrate active learning, we start with a fully labeled dataset but do not use most of the labels. Rather, we pretend only a small subset of this data has been labeled. A model is built using this subset, and all other existing labels are withheld. As each round of active learning surfaces new datapoints to be labeled, their corresponding labels are then extracted from the fully labeled dataset.

Admittedly, this is not an accurate depiction of the true active learning process, during which a human being would step in to provide the requested labels. In 4.4 Product: Active Learner, we discuss the rationale behind this approach.

4.1.1 MNIST

The MNIST dataset consists of handwritten digits from 0 to 9. It has a training set of 60,000 examples, and a test set of 10,000 examples. It is the “hello, world” classification dataset for machine learning and is included in many machine learning libraries.

For our prototype, we assume only 5,000 datapoints in the training set have labels and use this subset to build a classifier for all 10 digits. Each round of active learning surfaces another 1,000 datapoints for which to obtain labels.

Figure 4.1: Sample images from MNIST dataset.

4.1.2 Quick Draw

The Quick Draw dataset consists of 50 million doodles (hand-drawn figures) across 345 categories. We randomly selected 10 categories for which to build a classifier, and only used a subset of the available data for each category. This resulted in a training set of 65,729 examples and a test set of 16,436 examples. Restricting the number of categories to 10 was not driven by a technical limitation; it was done to keep the prototype visualization manageable.

For our prototype, we assume only 5,000 datapoints in the training set have labels and use this subset to build a classifier for all 10 classes. Each round of active learning surfaces another 1,000 datapoints for which to obtain labels.

Figure 4.2: Sample images from Quick Draw dataset.

4.1.3 Caltech 256

The Caltech 256 dataset consists of 30,607 images from 256 categories. We randomly selected 10 categories for which to build a classifier. The training set has 822 examples, and the test set has 212 examples. As with the Quick Draw dataset, the small number of categories (10) was not driven by a technical limitation but was intended to make the prototype visually more digestible.

For our prototype, we assume only 300 datapoints in the training set have labels and use this subset to build a classifier for all 10 classes. Each round of active learning surfaces another 50 datapoints for which to obtain labels.

Figure 4.3: Sample images from Caltech dataset.

4.2 Learner

Given the goal of the prototype, we did not set out to build the best classifier for each dataset. Instead, we built a relatively straightforward learner without much optimization, and used that as a baseline to observe and understand the impact of active learning.

For the MNIST and Quick Draw datasets, we used a two-layer convolutional network with dropout followed by two linear layers (see our Deep Learning: Image Analysis report [FF03] for a refresher on convolutional networks). The layer feeding into the fully connected layers contains rich feature information about an image. We extracted this information (the embeddings) for the prototype.

Figure 4.4: Architecture of the convolutional network.

For the Caltech 256 dataset we used Resnet18, a pretrained classifier trained on the much larger ImageNet dataset, to build the classifier. The total number of labeled images we had was too small to build one from scratch, but active learning can be leveraged along with transfer learning to build a model using only 300 labeled images. As with the models for MNIST and Quick Draw, we extracted the embeddings to use for the prototype.
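
A rough sketch of this transfer learning setup with torchvision (layer freezing, optimizer, and training loop omitted; the exact configuration used in the prototype may differ):

```python
import torch.nn as nn
from torchvision import models

# Pretrained ResNet18 backbone; newer torchvision versions use the `weights=` argument instead.
backbone = models.resnet18(pretrained=True)
embedding_dim = backbone.fc.in_features      # 512 for ResNet18
backbone.fc = nn.Identity()                  # expose the embeddings used in the prototype

classifier = nn.Linear(embedding_dim, 10)    # small 10-class head trained on ~300 labeled images

def forward(images):
    embeddings = backbone(images)            # (batch, 512) feature vectors
    return classifier(embeddings), embeddings
```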

4.3 Selection Strategy

In the prototype, we illustrate the active learning process using various selection strategies. Because the learners are deep learning models, we include both classical approaches, such as entropy-based uncertainty sampling (see 2.2.2.2 Entropy Sampling), and others specific to deep learning, such as ensemble and adversarial approaches (see 3. Modifications for Deep Learning). In an attempt to make it easier to compare strategies for a particular dataset, we used the same active learning parameters on all strategies. These parameters include the size of the initial labeled dataset and the number of datapoints for which to request labels.

4.4 Product: Active Learner

In our prototype, Active Learner, we help the user build intuition about how active learning works. We use three sample datasets to show how active learning strategies can more efficiently improve the accuracy of a model. We simulate the active learning process by discarding most of the labels and then adding them back in, one thousand points or fifty points at a time, according to the strategy and dataset selected.

The accuracy results are shown in a graph at the bottom of the prototype. The accuracy numbers demonstrate the quantifiable value of active learning. Beyond showing accuracy improvements, we wanted to help the user understand how and why active learning works. We use a dimensionality reduction technique called Uniform Manifold Approximation and Projection (UMAP) to visualize the clustering of the dataset. By showing which datapoints different active learning strategies select to be labeled, we help the user build intuition about how these strategies work.
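
Producing the 2D layout with the umap-learn package is a short step once the embeddings are extracted; `embeddings` below is a placeholder for the feature vectors pulled from the network's penultimate layer.

```python
import umap  # the umap-learn package

# Project the learner's high-dimensional embeddings down to 2D for the cluster view.
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)   # (n_points, 2) layout for plotting
```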

The active learning strategies select points at the cluster boundaries. These visualized boundaries are the places where the model is most uncertain about how to predict labels for the data. By showing how the visualized clusters change as the selected points are labeled and the model is retrained, we help the user see how the model refines its boundaries based on the new labels.

Figure 4.5: In our prototype, you can see how active learning strategies select points near cluster boundaries for labeling.

4.4.1 Labeling and visualizing at scale

Early in the process, we considered a prototype where the user participates in the labeling themselves. This would have highlighted the level of effort required to label a dataset, reinforcing the importance of using a strategy like active learning to deploy that effort efficiently. We decided that approach risked trying the user’s patience, however. Our prototypes are meant to serve as explanatory tools, demonstrating a point that can be grasped within a few minutes. The number of labels that would be required to make a difference in the model accuracy was too great to ask a user to do by hand, so instead we decided to simulate the labeling process, adding one thousand labels in each round (or fifty, in the case of the Caltech dataset).

Simulating the labeling in rounds solved the problem of demonstrating the overall effect of an active learning strategy, but we still had to work hard to try and make what was happening approachable for the user. We used the UMAP visualization to give a spatial sense of what was happening, but the amount of data can still feel overwhelming. This makes sense, since the promise of machine learning is that it can operate on a scale beyond what humans can normally do. This is a great power, but it requires the development of new interface and visualization tools, in order to make sure we truly understand the effects of the new systems we’re building. Active Learner shows just one technique for using visualizations to better understand our machine learning systems. We’ll need to develop many more.

4.4.2 Technical Challenges

Figure 4.6: We used sprite sheets and shaders to display the image for each point in the visualization.

Another challenge of working with datasets of this scale is being able to smoothly display and interact with them in the browser. We’d used the 3D JavaScript library three.js on a past prototype to do cluster visualization of a dataset. Because it uses WebGL, three.js can render many more points efficiently. For this prototype, we wrote three.js code that uses a shader to display the actual image for each datapoint in the visualization. This makes it much easier to get a feel for how the data is clustered. We were also able to write transitions for the visualization, which show how the clusters adjust after retraining. Being able to animate between these states was key to showing the user how the model develops, round by round.

4.4.3 Practical Challenges

4.4.3.1 Stopping Conditions and Strategy Selection

Our prototype illustrates the process of active learning and ends after eight iterations. In practice, multiple factors need to be considered when determining when to stop the active learning loop. In 2.1 Labeling Data Smartly, we introduced a dynamic stopping criterion that looks at the incremental gain in performance over each additional round of active learning. In 2.4.1 When to Stop the Active Learning Loop, we introduced the concept of data efficiency to help determine the effectiveness of a particular active learning strategy. By displaying the accuracy curves, our prototype lets the user explore both concepts.

While interacting with the prototype, the user can infer from the accuracy curve the incremental gain in performance that each round of active learning unlocks. If curious about the effectiveness of a particular strategy, the user can compute the data efficiency of that strategy compared to a baseline approach. How effective is the ensemble approach for MNIST? The accuracy curve shows that 8,000 labeled datapoints are required to achieve an accuracy of 95.5% using random sampling, while 6,000 labeled datapoints are required to achieve the same accuracy using the ensemble approach. The data efficiency of the ensemble approach is therefore 75%. What about the effectiveness of the ensemble approach for other datasets? On the Quick Draw dataset (a more complex dataset than MNIST), 11,000 labeled datapoints are required to achieve an accuracy of 82.6% with random sampling, while 6,000 labeled datapoints are required to achieve the same accuracy using an ensemble approach. The data efficiency of the ensemble approach here is 54.5%.
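
The data efficiency numbers above are simple ratios of label budgets:

```python
def data_efficiency(labels_needed_by_strategy, labels_needed_by_random):
    """Fraction of the random-sampling label budget a strategy needs to
    reach the same accuracy; lower means more label-efficient."""
    return labels_needed_by_strategy / labels_needed_by_random

print(data_efficiency(6_000, 8_000))    # MNIST, ensemble vs. random: 0.75
print(data_efficiency(6_000, 11_000))   # Quick Draw, ensemble vs. random: ~0.545
```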

4.4.3.2 A Pause Between Iterations

As alluded to in 4.4.1 Labeling and visualizing at scale, the labeling process in our prototype does not demonstrate a real human interaction loop; it is simulated. When applying active learning in real life, surfaced datapoints will need to be sent to a human for labeling. The next round of active learning cannot proceed until the newly labeled datapoints are ready and available.

The length of time between each active learning iteration varies depending on who provides the label. In a research scenario, a data scientist who builds the model and also creates labels will be able to iterate through each round of active learning quickly. In a production scenario, an outsourced labeling team will need more time for data exchange and label (knowledge) transfer to occur.

For active learning to be successful, the pause between iterations should be as small as practically possible. In addition to considering different types of labeling workforces, an efficient pipeline needs to be set up. This pipeline should include a platform for exchanging unlabeled datapoints, a user interface for creating labels, and a platform for transferring the labeled datapoints.

5. Landscape

Adoption of active learning requires a change in workflow to loop in human labelers. The ecosystem that enables and supports this change includes open source tools that support active learning from an algorithmic perspective, vendors that provide active learning capabilities, data labeling service providers, and companies that provide access to labor pools.

5.1 Open Source

Open source libraries for active learning focus on providing a framework for implementing datapoint selection methods, and often include a variety of strategies out of the box. New algorithms can easily be added by the user, but these libraries lack a labeling user interface (see 5.3 Data Labeling Service Providers for services that fill this particular gap). As such, open source libraries are generally a good fit for research projects, or for cases where the annotator is also the data scientist aiming to build better models with additional training data.

Given the impact a user interface has on reducing human fatigue while labeling, we hesitate to recommend open source libraries in active learning production environments. When and if sufficient resources are allocated to designing a great labeling interface, open source libraries serve as a good starting point for developing a fully custom active learning system. To further qualify as a solution for production environments, these custom systems need to incorporate quality control for both labelers and labels.

Two actively maintained and popular Python libraries are modAL and libact. The differences in functionality are highlighted in Table 5.1.

Table 5.1: Active learning libraries, Python

5.1.1 modAL

modAL is a modular active learning framework for Python. It makes fast prototyping possible, and simplifies development of both real-life active learning pipelines and novel algorithms. A wide range of pool- and stream-based active learning algorithms are supported. modAL is built on top of scikit-learn, and is designed to be easily extensible. To ensure code quality, extensive unit tests are provided and continuous integration is applied. In addition, detailed documentation with several tutorials is available for ease of use.
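
As a rough illustration of the pool-based workflow (a toy example of our own using synthetic data, not code taken from the modAL documentation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

# Synthetic data: a small labeled seed set plus a large unlabeled pool.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_seed, y_seed = X[:20], y[:20]
X_pool, y_pool = X[20:], y[20:]

learner = ActiveLearner(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    query_strategy=uncertainty_sampling,   # pool-based least-confidence strategy
    X_training=X_seed, y_training=y_seed,
)

for _ in range(10):                        # ten rounds of active learning
    query_idx, query_inst = learner.query(X_pool)
    # In real life a human would label query_inst; here we reveal the held-out label.
    learner.teach(X_pool[query_idx], y_pool[query_idx])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx)

print("accuracy on remaining pool:", learner.score(X_pool, y_pool))
```

The `teach` step is where the human labeler enters the loop in a real deployment; the held-out labels above simply keep the example self-contained.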

5.1.2 libact

libact implements pool-based active learning algorithms for Python. It has a unified interface for adding custom strategies and models. The package not only implements several popular active learning strategies, but also gives users the option to automatically select the best strategy on the fly. libact supports cost-sensitive active learning algorithms – a user can specify the cost of classification errors by passing a cost matrix as a parameter. Basic usage examples for interactive labeling functionality are provided.
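
A comparable sketch for libact, based on its documented Dataset and query-strategy interface (exact constructor arguments may vary between versions, so treat this as approximate):

```python
from sklearn.datasets import make_classification
from libact.base.dataset import Dataset
from libact.models import LogisticRegression
from libact.query_strategies import UncertaintySampling

# Synthetic pool: the first 10 points are labeled; None marks an unlabeled entry.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
dataset = Dataset(X, [int(label) if i < 10 else None for i, label in enumerate(y)])

qs = UncertaintySampling(dataset, model=LogisticRegression())
ask_id = qs.make_query()                 # index of the most informative unlabeled point
dataset.update(ask_id, int(y[ask_id]))   # in practice, a human supplies this label
```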

5.2 Active Learning Tool Vendors

Our definition of an active learning tool vendor is one that serves three functions: it supports a customizable user interface for labeling purposes, a customizable active learning strategy, and a customizable model (learner) for surfacing informative datapoints.

5.2.1 Prodigy

Prodigy by Explosion.ai is a stream-based active learning productivity tool. It provides a workflow that enables rapid prototyping. Data scientists can use Prodigy to quickly label data in order to screen for viable modeling ideas. General-purpose natural language processing (NLP) models can also be refined by interactively labeling training data. Prodigy is an install-only tool; as such, data stays on premises.

Prodigy shines in its user interface design, which is highly optimized to reduce the cognitive load of labeling. Multi-class labeling problems are restated as simple binary decisions. This reduces fatigue and enables efficient and consistent annotation. A variety of built-in interfaces are available for labeling tasks; custom interfaces can also be incorporated via HTML templates.

Prodigy is modular and can be extended to interface with all Python machine learning frameworks. A typical workflow looks like this: a user specifies a model (learner), Prodigy surfaces the data that it needs labels for, and the user proceeds to provide labels. Each step of the workflow can be customized, but built-in options are available. As an example, a specific active learning strategy or classification model can be used. For NLP use cases specifically, spaCy is an industrial-strength tool that many of us at Cloudera Fast Forward Labs use. Models (available as built-ins) can be fine-tuned using active learning and exported back into the spaCy environment.

Extensive documentation and various implementation examples are available.

5.3 Data Labeling Service Providers

In this section, we survey a sample of data labeling service providers as categorized by their primary focus (language, images, or both). Their main function in the context of learning with limited labeled data is to provide labels for datapoints surfaced by active learning algorithms. In addition, these service providers address common pain points in scaling data annotation projects by i) providing a platform for hosting the data to be labeled, ii) providing a labeling user interface, iii) managing a large team of annotators, and iv) performing quality control of labels.

Figure 5.1: Vendor overview.

5.3.1 NLP

5.3.1.1 LightTag

LightTag provides a platform to execute and manage large-scale annotation projects. Specifically, it provides a web application and API for text annotation tasks such as entity recognition and phrase and document segmentation. It is language-agnostic and supports Chinese-derived languages and languages that use right-to-left scripts.

LightTag provides a labeling interface optimized for speed. It supports multiple annotators and offers functionality to manage the process of collaborative labeling and quality of labels.

To increase labeling efficiency, LightTag offers label suggestions to the user in near real time. The suggestion model learns as it goes. LightTag keeps the model updated and the UI speedy when multiple annotators are working in parallel. LightTag also allows users to provide initial models that can be used in recommending labels. The final trained model can be downloaded.

Customers of LightTag can host their data on premises or in the cloud via Amazon Web Services (AWS). The on-premises enterprise version comes with audit capabilities and is aimed primarily at customers subject to regulations (e.g., HIPAA or the GDPR). A community version of the software, with reduced functionality, is also available; with the community version, data is stored on the user’s local machine.

5.3.1.2 Defined Crowd

Defined Crowd focuses on speech transcription and NLP tasks such as sentiment annotation, semantic annotation, and named entity tagging. It also offers tagging, categorization, and transcription for images and videos.

The company runs its own annotator community, which specializes in highly regulated spaces like finance and healthcare – areas where translations have to be precise.

Defined Crowd provides an API through which clients can upload their data at the beginning of a project. This data is labeled by the annotator community using an internal user interface that also provides label suggestions. Labeled data then goes through an internal quality assessment process. Once the labeling project is completed, the data is available for download.

5.3.2 Computer Vision

5.3.2.1 Labelbox

Labelbox is a data labeling and training-data management platform company that provides out-of-the-box labeling interfaces for images, videos, and – to a lesser extent – text data.

It supports collaborative labeling and provides functionality to manage both the process and the quality. The labeling interface is open sourced and thus completely customizable by using the JavaScript SDK. Predefined interfaces are also available. This labeling interface can be combined with other open source libraries mentioned in 5.1 Open Source to form a baseline active learning workflow tool.

Labelbox offers different solutions with varying levels of data privacy. A fully managed solution in the cloud is the least private option, as Labelbox has access to both the labels and the images. A version where only generated label assets are managed by Labelbox is the middle option. A fully private on-premises installation is the most private option, as Labelbox does not have access to either the labels or the images.

5.3.2.2 Mighty AI

Mighty AI is a full-service computer vision annotation platform for the automotive, robotics, and retail sectors. It offers a fully managed solution and handles all aspects of the data lifecycle, including ingestion, classification, annotation, quality assessment, and dataset export.

Mighty AI maintains and trains an international community of annotators sourced through its website. Several layers of monitoring coupled with accuracy checks are performed on the labels before they are exported back to the client. In addition to generating ground-truth datasets, Mighty AI also validates labels generated by a client’s machine learning systems.

5.3.2.3 Scale

Scale provides services like semantic segmentation, image annotation, and sensor fusion. It focuses on the autonomous vehicle industry but also targets the robotics, retail, and augmented reality/virtual reality sectors.

Scale employs a team of internally trained contractors who examine and categorize visual data. It offers a fully managed solution. Client data is sent to Scale via an API. Contractors then provide labels and have the option of using labels suggested by machine learning models. These labels are reviewed by a human and go through a quality control process before being returned as ground-truth data. Human-provided labels are also compared against model predictions to ensure label quality.

5.3.3 Computer Vision & NLP

5.3.3.1 Amazon SageMaker Ground Truth

Ground Truth helps build training datasets for text classification, image classification, object detection (locating objects in images with bounding boxes), semantic segmentation, and custom user-defined tasks.

When it comes to labelers, there are multiple options to choose from: the public crowd via Amazon Mechanical Turk, a private workforce drawn from the client’s own employees, or third-party vendors available through the AWS Marketplace.

With Ground Truth, one can additionally opt in for multiple annotators to label a datapoint and then consolidate their responses to build a high-fidelity label. The client can specify how the consolidation works; an example is to use majority voting.

For the labeling task, Ground Truth uses a web page to present task instructions to the human labelers. The client can choose to use a predefined template for the UI or create custom labeling workflows using HTML 2.0 components. A custom workflow can give the client a lot of control over what is being performed and how it is performed. From a UI perspective, it appears to be fairly unexceptional.

The client can optionally use the Automated Data Labeling feature, which decides which datapoints should be labeled by human annotators. This feature also provides the client with a trained machine learning model from the built-in algorithms available in Amazon SageMaker. That means Automated Data Labeling is going to incur Amazon SageMaker training and inference costs.

5.3.3.2 Figure Eight (now Appen)

Figure Eight is an enterprise annotation platform that focuses on use cases relating to speech and audio recognition, computer vision, natural language processing, and search relevance.

When engaging Figure Eight’s service, a client can choose to use either its own domain expertise or Figure Eight’s workforce to perform labeling tasks. Figure Eight provides multiple types of labeling workforces. It staffs a team of contractors with varying levels of expertise and experience. In addition, it provides access to outside independent vendors (5.4.1 iMerit is an example) who in turn provide the workforce. These vendors typically sign nondisclosure agreements before working on tasks where privacy is of paramount concern.

In terms of the user interface, a client can choose to use Figure Eight’s predefined version or customize its own interface and annotator instructions. Datasets to be annotated can be uploaded either via Figure Eight’s web interface or an API.

To assess and monitor the quality of the labels and labelers, gold-standard test questions are used. Figure Eight currently has a beta program that relies on IBM Watson models to help with labeling (through label suggestions). As the models are refined with more labeled data, they can then be used for quality control through comparison of human- and computer-generated labels.

5.4 Labeling Workforce Providers

Labeling workforce providers enable cost-efficient access to labor pools. The workforce is taught to annotate and tag photos, videos, text, and voice recordings with accurate vocabulary and descriptions. These providers typically support tools to manage the workforce, but they lack a platform for labeling data. As such, partnerships between workforce providers and data labeling service providers (see 5.3 Data Labeling Service Providers) are established. The workforce can then utilize the secure platform and user interface to provide labels.

Here, we survey two companies that operate in this space.

5.4.1 iMerit

iMerit is a managed services company with an in-house workforce that provides data labeling services for images, documents, and unstructured data.

iMerit solves the scalability issue of data annotation by employing a trained workforce that follows a systematic approach. Its workforce is mostly drawn from underserved and underprivileged communities and is trained by domain experts to label by example. The training lasts a couple of weeks, during which the correct labeling approach is progressively refined through edge cases.

In addition to training its workforce, iMerit uses multiple annotators from diverse backgrounds to deliver high-quality and consistent labels. These labels are further audited by humans and compared to machine learning model predictions.

Clients of iMerit include eBay, TripAdvisor, and Figure Eight.

5.4.2 CloudFactory

CloudFactory provides a scalable labeling workforce for machine learning, document transcription, and data enrichment. Within machine learning, it helps prepare datasets to support applications such as autonomous driving, computer vision, geospatial mapping, augmented reality, and natural language processing.

CloudFactory has a global workforce and small-scale delivery centers in Nepal and Kenya, which are talent-rich but lacking in infrastructure and work opportunities. Workers are trained and organized into small teams, whose skills and strengths are known and valued by their team leads. Clients communicate directly with team leads to provide feedback, ensure quick task iterations, or discuss problems and new use cases.

Clients of CloudFactory include Embark, Microsoft, drive.ai, facetec, and pilot.ai.

6. Ethics

Having access to copious amounts of labeled data is important for any supervised machine learning problem. The ability to learn with less data opens up the possibility of doing many more exciting things, but it also introduces a different set of constraints to consider. In the framework of human and machine collaboration, the ability to learn with limited labeled data hinges on human-provided labels and machine-selected datapoints. Here, we explore the ethical implications for both.

6.1 Labeling Workforce

Figure 6.1: To get more labeled data, teams of human labelers are often employed.

We speak at length in this report about the need for human-generated labels. There are several different ways of filling this need. One is by having subject matter experts within an organization hand-label certain datapoints. A second avenue is by hiring a company that exclusively employs “expert” labelers who have been trained to do this work. A third option is to rely on crowdsourced labelers, who typically work for only a few cents to a few dollars per task, with a high risk of job insecurity.

6.1.1 Misaligned Incentives

To various degrees, regardless of which workforce option one chooses for labeling, misaligned incentives could rear their ugly heads. In-house employees typically have benefits and incentives aligned with the organization building the models. Employees of an enterprise labeling service company, while not direct employees of the organization building the models, are generally steadily employed. Many of them are from underserved regions and view their job as an opportunity to be integrated into the global digital economy. While severe incentive misalignments are unlikely in either case, bad actors can creep into any labeling effort.

The picture for crowdsourced labelers is less rosy. At the heart of all modern crowdsourced work is an intricate push and pull between the “worker” and the one creating the work, leading to an often opposing set of incentives. In addition, the unregulated nature of crowdsourcing could result in worker exploitation. Crowdsourced workers tend to earn very little for each job they complete and often have to do very repetitive tasks. These workers have a keen understanding of their compensation, sometimes down to the minute. As such, they may resort to a multitude of ways to maximize the return for their time, often leading to subpar quality of work. In addition, crowdsourcing platforms can be very vulnerable to bad actors attempting to intentionally thwart labeling efforts.

6.1.2 Skewed Selection

The need for human-provided labels implies that a group of labelers needs to be identified at the beginning of each project. This is a simple task for in-house labelers. Typically, a subject matter expert (or several of them) is pulled in. For enterprises providing labeling services, the nature of the project drives a training process for a selected group of workers. The selection criteria are unclear and not visible to outsiders.

For crowdsourced labeling, effective tests need to be created to determine who is capable of doing this work, both technically and culturally. Labeling images, for instance, might require a large vocabulary in a particular language. In cases where the data being labeled could be sensitive to cultural or socioeconomic experience, it is especially important to critically evaluate the selection criteria for each labeling task.

The selection process for in-house, enterprise, and crowdsourced labelers can help narrow down a group of qualified data labelers, but it could also result in a pool that is skewed in some way. It is important to think through the process properly and set it up in such a way as to mitigate this effect.

6.2 Data Bias

The data used to train machine learning models underlies the classification decisions that will come from those models. It is becoming increasingly apparent that data science and machine learning practitioners must be thoughtful when choosing the data they use to train their models. For instance, in 2018, researchers evaluated three facial analysis programs from major software companies and found them to be highly biased in terms of classifying both race and gender.[4] When the dataset used in one model was examined, they found that it consisted of 77 percent male and more than 83 percent white faces.

This problem is not unique to learning with limited data, but it is of particular concern when the dataset being used to train a model may not be representative. Additionally, if the original model itself is biased (and so the decision boundaries may be set with those biases), further winnowing down the pool of data will likely not reverse these biases, but rather reinforce them. If we’re only choosing a subsample of the data to train our models on, and that limited set of data is algorithmically chosen, it opens the door for potential unintended data biases. Although there are ways to combat data bias[5], we recommend critically examining and understanding both the unlabeled and labeled datasets.

6.3 Human Biases in Labeling

In active learning frameworks, the images or texts that are sent to humans for labeling are often those that are not easily categorized by machines. While techniques that judiciously use human labelers can decrease cost and the amount of labeled data needed, there is a real danger of human bias creeping in. It’s much easier under uncertain conditions to apply one’s own beliefs and biases to make decisions. One commonly employed solution to this problem (which mitigates bias and increases the accuracy of the labels) is to have multiple people label a single datapoint and only accept labels with high between-labeler agreement.

In fact, in situations both ambiguous and not, clear effects of cognitive biases can be observed in the behavior of crowdsourced data labelers. A recent study that collected data from Mechanical Turk workers found evidence of in-group age bias when the workers were presented with an ambiguous image of a woman who could be interpreted as being either young or old. Even when data service providers train their workforces to be as accurate as possible, they cannot overcome all the inherent cognitive biases that are more prevalent in ambiguous settings. Studies like this one highlight the ways in which crowdsourced data labelers are not like machines in the way they label data, but rather still suffer from fundamental human shortcomings. (Even when “experts” are labeling data, human bias is still entirely possible, whether conscious or not.)

We recommend considering not only the biases of those labeling the data, but also those of the people designing the systems that present the data to be labeled. Bias in the way screening questions are set up can impact which group of labelers even gets to see the data (see 6.1.2 Skewed Selection). There is also room for bias in the questions asked of the labelers, and subtle cultural norms hidden in the questions can sway the outcome of the labeling greatly. Finally, the type of data itself can be vulnerable to human bias (for example, if labelers are asked to make gender discriminations).

Getting around human biases in labeling is not straightforward. We see a potential increased risk of biased labels for ambiguous data and therefore highlight this as a consideration here. Instead of trying to control for or thwart the effects of these human biases, we recommend identifying potential biases and building those assumptions into the way in which quality assurance is performed on the labeled data.

7. Future

Active learning utilizes a model and a strategy to identify the datapoints that are most valuable to the model. Human annotators then step in to label those datapoints, so that the model can be iteratively improved.

In practice, this approach hinges on two things. First, within the human and machine collaboration framework, a human is required to provide the labels. Second, a small subset of labeled data must be available; the model and selection strategy are chosen based on the limited signals available early on from this data.

7.1 When humans are unable to provide labels

The first point implies that active learning is only effective for data that humans can provide labels for – mainly images and text. There may be situations where it is not possible for humans to accurately label the data (for example, labeling credit card transactions as fraudulent or not). There could also be cases where obtaining labeled data is simply not possible until an event occurs. For instance, in the medical domain, capturing whether or not an irregular heartbeat signals a serious heart condition is not possible until there is an actual occurrence of a medical episode. In cases like these, other areas of machine learning that do not rely on label acquisition from humans can come to the rescue.

7.1.1 Reinforcement Learning

Reinforcement learning, for instance, does not need labeled data, but instead attempts to learn actions by trial and error. In contrast to a supervised approach, optimal actions are learned not from a label, but from a time-delayed metric called a reward. This coarse-grained metric tells the model whether the outcomes of its actions were good or bad. Hence, the goal of reinforcement learning is to take actions in order to maximize reward.

7.1.2 Weak Supervision

Another way to combat the need for humans to provide ground-truth labels is through weak supervision, which programmatically creates lower-quality training data to weakly supervise models. The idea is to collect a set of noisy labels for each datapoint in the unlabeled dataset. These noisy labels could come from a variety of sources: for example, they could be crowdsourced, or they could use a weak or biased classifier. The labels could also take the form of functions that encode domain expertise. Once you have a set of multiple noisy labels, a key technical challenge is to somehow unify and denoise them. One way to do this is through a generative modeling process that essentially learns which labels are relatively more accurate than the others and assigns the most accurate label from the set of labels to each datapoint. The denoised dataset can then be used to train a supervised (discriminative) model.
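
As a toy illustration, here is a sketch of three noisy labeling sources for a hypothetical spam task. A real system would learn the accuracy of each source with a generative label model; for brevity we unify the votes with simple majority voting over the non-abstaining sources:

```python
import numpy as np

# Three noisy labeling sources: two heuristic labeling functions (hypothetical rules)
# and one weak classifier's prediction. 1 = spam, 0 = ham, -1 = abstain.
def lf_contains_link(text):
    return 1 if "http://" in text or "https://" in text else -1

def lf_long_message(text):
    return 0 if len(text.split()) > 25 else -1   # long, detailed messages are usually ham

def lf_weak_model(text):
    return 1 if "free" in text.lower() else 0    # stand-in for a weak/biased classifier

texts = [
    "Click here for a FREE prize http://spam.example",
    "Hi team, attaching the quarterly report we discussed. Let me know if anything is missing.",
]

votes = np.array([[lf(t) for lf in (lf_contains_link, lf_long_message, lf_weak_model)]
                  for t in texts])

def majority_vote(row):
    row = row[row != -1]                         # drop abstentions
    return int(np.round(row.mean())) if len(row) else -1

noisy_labels = np.array([majority_vote(r) for r in votes])
print(noisy_labels)   # denoised labels to train a discriminative model on
```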

7.2 Alternatives for representation learning

Bias is inherent to active learning, because it relies on a small subset of labeled data to choose both the model and the datapoint selection approach. As discussed in 2.3 Peeling Back the Layers, the lack of labeled data makes it harder to ascertain the appropriateness of a learner and a strategy. This is a fundamental problem with active learning, and it can be exacerbated by the fact that each round of active learning is greedy: in each round, active learning looks for the best set of datapoints for which to obtain labels and does not consider the longer-term impact of selecting these datapoints.

Figure 7.1: Bias caused by learner and strategy.

At the heart of active learning is an attempt to learn a generalizable representation that is initially based on a small labeled dataset, and is then refined by iterating with more labeled data in each round. Although this is quite a straightforward and reasonable approach, many other areas in machine learning share this idea of representation learning (also known as feature learning).

Figure 7.2: Learning representations.

7.2.1 Transfer Learning

Transfer learning – in which you “transfer” a model trained on a separate (but similar) and larger labeled dataset – is a common way of dealing with limited labeled data. This is typically achieved by initializing the parameters of a new model with those from the original model, and fine-tuning this new model using the smaller labeled dataset. Another way is to use the original model as a feature extractor: for instance, we may use a neural network trained on ImageNet as the first layers of the new model, and then add a few layers to the end to train on the smaller labeled dataset. The advantage of either approach is that it effectively encapsulates the information from large amounts of labeled data in the model. The trade-off is that the data used to train the original model may be only tangentially related to the new domain.
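
A minimal sketch of the feature-extractor variant, assuming PyTorch and torchvision and a hypothetical five-class target task with very few labels:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet and reuse it as a feature extractor.
backbone = models.resnet18(pretrained=True)
for param in backbone.parameters():
    param.requires_grad = False          # freeze the transferred representation

# Replace the final classification layer with one sized for the new, smaller task.
num_features = backbone.fc.in_features   # 512 for resnet18
backbone.fc = nn.Linear(num_features, 5) # e.g., a hypothetical 5-class problem

# Only the new head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a (small) labeled batch: images [N, 3, 224, 224], labels [N].
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 5, (8,))
optimizer.zero_grad()
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```

Unfreezing some or all of the backbone and training with a small learning rate gives the fine-tuning variant of the same idea.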

7.2.2 Meta Learning

There are also several approaches that fall under the meta learning category that attempt to teach machines to learn representations from a very small number of labeled examples. These systems are trained by being exposed to a large number of tasks and are then tested on their ability to learn new tasks. Training meta learners takes a two-level approach consisting of a learner and a trainer. The first level is a fast learner; the goal is to quickly learn new tasks from a small amount of new data. (A “task” here refers to any machine learning problem – predicting a class given a small number of examples is one such task.) At the second (higher) level, a trainer trains this learner by repeatedly showing it hundreds or thousands of different tasks. Learning, then, happens at two levels: the first level focuses on quick acquisition of knowledge within each task, and the second level slowly pulls out and digests information from across all the tasks.

Consider this in the context of image classification. If you were given a set of examples and were asked to classify a new image, the natural thing to do would be to compare the new image to the examples, find the one that is the most similar, and use its class as the label for the new image. This is the idea behind similarity-based meta learning approaches. To classify a new image based on the examples available, first find its closest image from the examples, then use that image’s label as a prediction. In matching networks, images are represented by their embeddings, and distances between images are simply cosine distances between image embeddings.

Embeddings, which we discussed in the context of language in our Summarization (FF04) and Semantic Recommendations (FF07) reports, are rich numerical representations of text that a computer can understand. For images, embeddings can be thought of as groups of features (lines, edges) that richly represent the images. The goal of the matching network approach is to converge to embeddings that will result in choosing a labeled image that is closest to the one we want to classify.
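
At classification time this reduces to a cosine-similarity nearest-neighbor lookup in embedding space. A toy sketch with made-up embedding vectors (a real matching network learns the embedding function end to end and weights the support labels with a softmax over similarities):

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical support set: embeddings of a handful of labeled example images.
support_embeddings = np.array([[0.9, 0.1, 0.0],    # "cat"
                               [0.1, 0.8, 0.1],    # "dog"
                               [0.0, 0.2, 0.9]])   # "bird"
support_labels = ["cat", "dog", "bird"]

# Embedding of the new, unlabeled image we want to classify.
query_embedding = np.array([0.7, 0.3, 0.1])

# Pick the label of the most similar example (nearest neighbor in embedding space).
similarities = [cosine_similarity(query_embedding, s) for s in support_embeddings]
print(support_labels[int(np.argmax(similarities))])   # -> "cat"
```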

With matching networks, we end up with a model capable of generating representations of images (through embeddings) that capture both the difference and commonality between images. This suggests another approach to teaching machines to learn quickly with little training data. We start by finding an internal representation that can be fine-tuned easily with new tasks – the model can then rapidly adapt to new tasks using only a few datapoints.

In the context of deep learning, we can think of this internal representation as an initial set of network parameters. The goal is to initialize the network with this set of parameters, thereby allowing it to quickly and efficiently learn the parameters of new tasks. Alternatively, one can explicitly optimize for the model’s ability to generalize well quickly (i.e., only using a small number of updates). The meta learner is trained to help the learner converge to a good solution rapidly on each task.
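
A minimal MAML-style sketch of this two-level loop on a toy family of 1-D regression tasks, written with JAX so the outer gradient can flow through the inner adaptation step (the task distribution, learning rates, and single inner step are arbitrary choices for illustration, not drawn from this report or any specific implementation):

```python
import jax
import jax.numpy as jnp

# Toy task family: 1-D linear regression problems y = w * x. The meta learner optimizes
# the shared initialization so a single inner gradient step adapts it well to a new task.
def predict(params, x):
    return params["w"] * x + params["b"]

def loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

def inner_update(params, x, y, inner_lr=0.1):
    # Fast learner: one gradient step of adaptation to a single task.
    grads = jax.grad(loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - inner_lr * g, params, grads)

def meta_loss(params, x_tr, y_tr, x_val, y_val):
    # How well do the adapted parameters generalize within the task?
    adapted = inner_update(params, x_tr, y_tr)
    return loss(adapted, x_val, y_val)

params = {"w": jnp.array(0.0), "b": jnp.array(0.0)}
outer_lr, key = 0.01, jax.random.PRNGKey(0)

for step in range(500):                              # trainer: loop over many sampled tasks
    key, subkey = jax.random.split(key)
    true_w = jax.random.uniform(subkey, minval=-2.0, maxval=2.0)
    x = jnp.linspace(-1.0, 1.0, 10)
    y = true_w * x
    # Outer update: gradient of the post-adaptation loss w.r.t. the shared initialization.
    grads = jax.grad(meta_loss)(params, x[:5], y[:5], x[5:], y[5:])
    params = jax.tree_util.tree_map(lambda p, g: p - outer_lr * g, params, grads)
```

The inner loop plays the role of the fast learner; the outer loop is the trainer that slowly digests information across tasks.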

7.2.3 Semi-supervised and Unsupervised approaches

Finally, the idea of representation learning is also prevalent in both semi-supervised and unsupervised learning approaches. These can leverage both labeled and unlabeled datasets to produce representations of general utility, which can then be used for downstream tasks. For example, one might use clustering to first identify the underlying structure of the data and then a supervised machine learning approach to find the decision boundary. A two-phased approach like this can help improve classification accuracy when limited labeled data is available.
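
A minimal sketch of such a two-phase approach with scikit-learn, using synthetic blobs and only nine labeled points (all names and numbers here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 500 points in 3 clusters, but only 9 of them come with labels.
X, y = make_blobs(n_samples=500, centers=3, random_state=0)
labeled_idx = np.concatenate([np.where(y == c)[0][:3] for c in range(3)])  # 3 labels per class

# Phase 1 (unsupervised): learn the structure of ALL the data, labeled or not.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
features = kmeans.transform(X)            # distance to each cluster centre as a representation

# Phase 2 (supervised): fit a classifier on that representation using only the labeled points.
clf = LogisticRegression().fit(features[labeled_idx], y[labeled_idx])
print(clf.score(features, y))             # evaluated on everything, purely for illustration
```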

In addition, a variety of representation learning algorithms have recently been proposed based on the idea of autoencoding, where the goal is to learn a mapping from high-dimensional inputs to a lower-dimensional representation space such that the original inputs can be reconstructed (approximately) from the lower-dimensional representation. A supervised learning approach can then use these mappings (pretrained weights) to perform learning with limited labeled data.
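
A minimal sketch of autoencoder pretraining followed by supervised fine-tuning, assuming PyTorch and using random tensors as stand-ins for the unlabeled and labeled datasets:

```python
import torch
import torch.nn as nn

# Phase 1: learn a low-dimensional representation from plentiful UNLABELED data
# by reconstructing the inputs from a bottleneck.
encoder = nn.Sequential(nn.Linear(100, 16), nn.ReLU())
decoder = nn.Sequential(nn.Linear(16, 100))
autoencoder = nn.Sequential(encoder, decoder)

X_unlabeled = torch.randn(1024, 100)             # stand-in for real unlabeled data
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(autoencoder(X_unlabeled), X_unlabeled)
    loss.backward()
    opt.step()

# Phase 2: reuse the pretrained encoder and train a small classifier head
# on the limited labeled data.
X_labeled, y_labeled = torch.randn(50, 100), torch.randint(0, 2, (50,))
classifier = nn.Sequential(encoder, nn.Linear(16, 2))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(X_labeled), y_labeled)
    loss.backward()
    opt.step()
```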

7.3 Sci-fi Story: The Amethyst Incident

A short story by Danielle Thorp. Inspired by active learning.

Author’s note: Naming robots is harder than you might think, so I’d like to say a special thanks to all the Cloudera employees who hopped in on a (rather lengthy and hilarious) thread to suggest their ideas - and especially to Jeff Lee who suggested the RRC (Robot Rock Collector). Thanks also to the Cloudera Fast Forward Labs team members who helped brainstorm ideas for the story and refine the tech; Chris Wallace’s “eccentric collector” idea was the inspiration for Amy, and the feedback on the storyline from Grant Custer and Shioulin Sam was invaluable.

“Houston, we have … an unusual problem.”

These were not the words Houston Summers expected to hear from her colleague during a routine check-in. Their small team from the Interplanetary Geological Society (IGS) was manning Space Station 247B, in orbit around Artemis, a newly-discovered planet in the 82 G. Eridani system.

The team’s assignment was to collect geological data from the surface of the planet, with the help of a robot designed for the purpose. The RRC2000 was equipped with state-of-the-art cameras and a small storage bay, and had been trained on carefully labeled datasets of rocks from across South Africa. Aside from an initial tendency to misidentify any purple rock as an amethyst (which earned it the affectionate nickname Amy during early training days), it had done very well in classifying new types of rocks on past missions. When aerial views of Artemis indicated terrain similar to that of western South Africa, the IGS sent Amy and its team to explore.

Amy had been exploring Artemis for about three weeks now, and had already sent thousands of labeled images to the team. Some of the rocks it found were indeed familiar, but it had already classified at least 35 new types of rocks, storing a sample of each one in its storage bay, creating new scientific names based on their similarities to known rocks, and labeling each image it sent to the station with coordinates of its location. With Amy’s data, the team was creating a fairly accurate geological map of Artemis.

The team, however, had begun to notice a consistent irregularity in Amy’s transmissions. It seemed that Amy had discovered a “favorite spot” to which it was returning daily. The question, however, was why. Amy was programmed to explore the planet’s surface at random, much like the autonomous robotic vacuum cleaners of the early 2000s. While revisiting a particular spot would not be unlikely, this pattern of recurrence seemed oddly deliberate.

“Houston, I don’t know how to explain it other than this: Amy is up to something.”

Houston blinked at her colleague in disbelief. Other than the expected error of misclassifying a rock on occasion, Amy had never shown any signs of flawed programming.

“Permission to bring Amy back to the station for testing?”

Houston deliberated. The first phase of the exploration was only meant to be four weeks long. Was this aberration worth cutting the mission short?

“No, let’s wait until the scheduled end date. What kind of trouble could a rock-collecting robot possibly get into?”

Figure 7.3: Amy the robot and her friends.

The IGS crew gathered to watch as Amy returned to the station a week later; in the interim, it had continued its odd pattern of revisiting the same spot daily, with one exception: just prior to returning to the station, Amy made one last trip to its favorite spot, and directly following, sensor data indicated several new items in its storage bay.

In keeping with protocol,the contents of Amy’s storage bay were emptied into a large basin for decontamination, before being sent to the lab. The crew watched in delight as the beautiful Artemisian rocks in various shades of white, beige, grey, pink, green, and purple began to come through to the lab on a conveyor belt, collecting them carefully into individual trays for study. Last to come through was a particularly sparkly collection of rocks which resembled amethysts. Houston reached for one - and it scurried away.

27th April 2075

To the president of the IGS, the Board, and all interested parties:

Upon review, we believe we understand at least the basic cause of the Amethyst Incident on Space Station 247B last month.

The RRC2000 is not sentient; its ability to comprehend what it “sees” is limited to a series of still images on which it bases its assumptions. It does not register movement, nor view it as a problem. In short, we do not believe that Amy understood the creatures it collected to be alive.

We are happy to report that the pholidota amethystos are thriving. We were able to replicate their home environment in the lab, and while more study is needed, it appears that they survive on a combination of minerals absorbed from the rocks around them, and basic photosynthesis. And no, they do not bite.

Why Amy chose the pholidota amethystos as a particular interest is still unclear. We speculate that it may have had something to do with bias in the training data.

Or - perhaps it just thought they were cute.

Kind regards,
Houston Summers, Ph.D.
Lead Scientist for the Artemis Team, IGS

8. Conclusion

The ability to learn with limited labeled data is a game changer. The approach explored in this report enables enterprises to begin leveraging their existing pools of unlabeled data to build new and exciting products that would otherwise be beyond reach. With our prototype, we bring the process of active learning to life, shedding light on how and why active learning works.

But active learning is only one piece of the puzzle of learning with limited labeled data. Active learning relies on the feedback loop between humans and machines to enable applications to be built using a small subset of labeled data. In this feedback loop, machines request labels for difficult datapoints and humans step in to provide the labels. This implies that active learning is most suited for natural data, or data that humans know how to label – and human biases can easily creep in. In addition, active learning relies on a very small initially labeled dataset to determine the most appropriate learner and strategy – it does not balance exploration and exploitation well.

In 7. Future we briefly mentioned other approaches, such as transfer learning, meta learning, and weak supervision, and discussed how they fit into the puzzle. As with everything else in machine learning, the right approach depends on both the use case and the type of data. As some of these approaches continue to mature, we look forward to clever combinations with unsupervised machine learning techniques to learn with limited labeled data.


  1. Autonomous cars will generate more than 300 TB of data per year

  2. A Benchmark and Comparison of Active Learning

  3. From Theories to Queries: Active Learning in Practice

  4. Study finds gender and skin-type bias in commercial artificial-intelligence systems

  5. The Dataset Nutrition Label