Have LLMs Made Active Learning Obsolete? Surveying the NLP Community

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
With the rise of large language models (LLMs), it remains contested whether active learning (AL) retains practical relevance in NLP—particularly given alternatives like few-shot learning and synthetic data generation. Method: We conduct a large-scale, mixed-methods empirical survey (N=XXX) targeting global NLP practitioners, combining quantitative analysis with qualitative insights to systematically assess AL’s real-world role across annotation cost, effectiveness, implementation barriers, and future trajectories. Contribution/Results: This is the first community-driven empirical study to demonstrate that labeled data remains a core bottleneck, with 72% of current practitioners affirming AL’s effectiveness. We identify three persistent, decade-old challenges: deployment complexity, difficulty in quantifying return on investment (ROI), and lack of mature tooling ecosystems. Additionally, we publicly release the first anonymized, reproducible dataset capturing real-world NLP annotation practices—establishing a foundational benchmark for next-generation human-AI collaborative annotation research.

📝 Abstract
Supervised learning relies on annotated data, which is expensive to obtain. A longstanding strategy to reduce annotation costs is active learning, an iterative process in which a human annotates only data instances deemed informative by a model. Large language models (LLMs) have pushed the effectiveness of active learning, but have also improved methods such as few- or zero-shot learning and text synthesis, thereby introducing potential alternatives. This raises the question: has active learning become obsolete? To answer this fully, we must look beyond the literature to practical experiences. We conduct an online survey in the NLP community to collect previously intangible insights on the perceived relevance of data annotation, particularly focusing on active learning, including best practices, obstacles, and expected future developments. Our findings show that annotated data remains a key factor, and active learning continues to be relevant. While the majority of active learning users find it effective, a comparison with a community survey from over a decade ago reveals persistent challenges: setup complexity, estimation of cost reduction, and tooling. We publish an anonymized version of the collected dataset.
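The iterative loop the abstract describes can be sketched as pool-based active learning with least-confidence (uncertainty) sampling. This is an illustration of the general technique, not the survey's own setup; the model, synthetic data, seed size, and annotation budget are all illustrative assumptions.

```python
# Minimal pool-based active learning sketch: least-confidence sampling.
# Assumptions (not from the paper): logistic regression as the learner,
# a synthetic dataset, a 10-instance seed set, and 5 query rounds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Start with a small labeled seed set; the rest is the unlabeled pool.
labeled = [int(i) for i in rng.choice(len(X), size=10, replace=False)]
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):  # five annotation rounds (assumed budget)
    model.fit(X[labeled], y[labeled])
    # Least confidence: query the instance the model is least sure about.
    probs = model.predict_proba(X[pool])
    query = pool[int(np.argmin(probs.max(axis=1)))]
    labeled.append(query)  # the "human" supplies the label y[query]
    pool.remove(query)

print(len(labeled))  # 15 labeled instances after 5 rounds
```

Each round, only the single most uncertain instance is sent to the annotator, which is the cost-reduction mechanism whose practical value the survey investigates.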
Problem

Research questions and friction points this paper is trying to address.

Assessing whether LLMs have made active learning obsolete in NLP.
Exploring the relevance and effectiveness of active learning in data annotation.
Identifying challenges and future developments in active learning practices.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online survey in NLP community
Comparison with decade-old survey
Anonymized dataset publication
Julia Romberg
GESIS – Leibniz Institute for the Social Sciences
Christopher Schröder
Institute for Applied Informatics at Leipzig University (InfAI), Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig
Julius Gonsior
Technische Universität Dresden
Weak Supervision · Active Learning · Semi-supervised Learning
Katrin Tomanek
Google Research
Natural Language Processing · Active Learning · Automatic Speech Recognition · Machine Translation
Fredrik Olsson
Head of Data Science & Product Owner
computational linguistics · natural language processing · machine learning · artificial intelligence