An Empirical Evaluation of Active Learning and Selective Sampling Variations Supporting Large Corpus Labeling

Theodore J Markowitz, Pace University


A constant challenge to researchers is the lack of large and timely datasets of domain examples (research corpora) used for training and testing their latest algorithms. Corpus examples are often annotated with special labels that represent class categories, numeric predictions, etc., depending on the research problem. While acquiring large numbers of examples is often not difficult, ensuring that each is correctly and consistently labeled certainly can be. Human experts may be required to visually inspect, annotate, and cross-check each example to guarantee its accuracy. Unfortunately, the costs incurred performing this adjudication have led to a shortage of labeled corpora, particularly bigger and more recent ones. The primary goal of our research has been to determine how larger volumes of examples could be autonomously annotated to create more substantial datasets using a minimum of human intervention while maintaining acceptable levels of labeling accuracy. We chose a form of Machine Learning, Active Learning, as the basis for building a suite of automated corpus labeling tools. Our labelers start with a few pre-labeled examples and a larger number of unlabeled examples. They then iteratively select small batches of the unlabeled examples for labeling by an "oracle", which may be a live human expert or some other authoritative source. This "selective sampling" step picks those queries that the tools themselves predict would enhance their future labeling predictions. Once the labelers have been trained, the learning iterations cease and the rest of the unlabeled examples in a corpus can be confidently labeled without additional human intervention. To sample the most informative queries we began with the well-known Uncertainty Sampling (US) technique.
However, US can be computationally expensive, and so we have proposed a new variant, Approximate Uncertainty Sampling (AUS), that is nearly as effective but has lower computational complexity and much less processing overhead. These reductions allow AUS to select queries more frequently and to support other types of computation during labeling. In this way AUS encourages the building of larger and more topical corpora for the research communities that require them.
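
The labeling loop described in the abstract can be illustrated with a minimal sketch of pool-based active learning using Uncertainty Sampling. This is not the dissertation's implementation: the function names (`least_confidence`, `select_queries`, `active_learning_loop`) and the `fit` and `oracle` callables are hypothetical, and the least-confidence score is assumed here as one common way to measure uncertainty. The abstract does not describe how AUS approximates this step, so only plain US is shown.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases for this sketch (not from the dissertation).
Example = float
Label = str
Model = Callable[[Example], Sequence[float]]  # returns class probabilities


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_queries(pool_probs: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return the indices of the batch_size most uncertain pool examples."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: least_confidence(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


def active_learning_loop(
    labeled: Sequence[Tuple[Example, Label]],
    unlabeled: Sequence[Example],
    fit: Callable[[List[Tuple[Example, Label]]], Model],
    oracle: Callable[[Example], Label],
    batch_size: int,
    rounds: int,
) -> Tuple[Model, List[Tuple[Example, Label]], List[Example]]:
    """Train, pick the most uncertain examples, ask the oracle, and repeat."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        model = fit(labeled)
        # Plain US scores every remaining pool example on every round,
        # which is the cost the abstract attributes to the technique.
        probs = [model(x) for x in unlabeled]
        picked = select_queries(probs, batch_size)
        # Remove from the highest index down so earlier indices stay valid.
        for i in sorted(picked, reverse=True):
            x = unlabeled.pop(i)
            labeled.append((x, oracle(x)))
    return fit(labeled), labeled, unlabeled
```

After the loop ends, the returned model would be applied to the remaining unlabeled examples without further oracle queries, which corresponds to the final step described in the abstract.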

Subject Area

Computer science

Recommended Citation

Markowitz, Theodore J, "An Empirical Evaluation of Active Learning and Selective Sampling Variations Supporting Large Corpus Labeling" (2011). ETD Collection for Pace University. AAI3502057.


