🤖 AI Summary
This work addresses the high cost and time demands of manual test-set annotation in natural language processing (NLP) by proposing and systematically evaluating an Active Testing framework, which selects the most informative samples for annotation so that model performance can be estimated accurately under a limited labeling budget. As the first study to formally define active testing for NLP, it constructs a large-scale benchmark encompassing 18 datasets, four task types, and four embedding strategies, enabling a comprehensive comparison of existing sample selection methods, and it introduces an adaptive stopping criterion that removes the need for a pre-specified annotation budget. Experiments show that annotation effort can be reduced by up to 95% while keeping the performance estimation error below 1%, substantially lowering evaluation costs.
📝 Abstract
Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), and test data annotation is particularly expensive because reliable model evaluation demands low-error, high-quality labels. Traditional approaches annotate the entire test set, leading to substantial resource requirements. Active Testing is a framework that selects the most informative test samples for annotation: given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort. In this work, we formalize Active Testing in NLP and conduct an extensive benchmark of existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. The experiments show annotation reductions of up to 95%, with performance estimates within 1% of full-test-set evaluation. Our analysis reveals variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior. Lastly, to address the limitation that existing sample selection strategies require a predefined annotation budget, we introduce an adaptive stopping criterion that automatically determines the number of samples to annotate.
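To make the setting concrete, here is a minimal, self-contained sketch of budget-constrained active testing. It is an assumed illustration, not the paper's benchmarked methods: samples are drawn preferentially where the model is uncertain (predictive entropy, mixed with uniform sampling to bound the weights), and importance weights correct for the resulting bias so the error estimate remains unbiased for the full test set. A toy stopping rule (`adaptive_stop`, a hypothetical helper) halts labeling once the running estimate stabilizes, standing in for the adaptive stopping criterion described above.

```python
import numpy as np

def active_test_estimate(probs, labels, budget=200, rng=None):
    """Importance-weighted active testing sketch (assumed setup):
    label samples drawn preferentially where the model is uncertain,
    then reweight so the error estimate stays unbiased.

    probs  : (N, C) model predictive probabilities on the unlabeled pool
    labels : (N,) ground truth, standing in for the human annotator
    Returns the running error estimate after each acquired label.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(probs)
    preds = probs.argmax(axis=1)
    # Acquisition distribution q: predictive entropy, mixed 50/50 with
    # the uniform distribution to keep importance weights bounded by 2.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    q = 0.5 * entropy / entropy.sum() + 0.5 / n
    idx = rng.choice(n, size=budget, replace=True, p=q)
    loss = (preds[idx] != labels[idx]).astype(float)  # 0/1 loss per draw
    weights = 1.0 / (n * q[idx])                      # importance weights
    return np.cumsum(weights * loss) / np.arange(1, budget + 1)

def adaptive_stop(running, tol=0.01, window=50):
    """Toy adaptive stopping rule: stop once the running estimate has
    moved less than `tol` over the last `window` acquired labels."""
    for t in range(window, len(running)):
        if abs(running[t] - running[t - window]) < tol:
            return t + 1  # number of labels actually annotated
    return len(running)
```

Because the estimator is importance-weighted, averaging it over repeated runs recovers the true test error even though uncertain samples are over-represented among the annotated ones; the stopping rule then trades a small amount of estimation accuracy for a much smaller annotation budget.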