Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

📅 2024-02-20
📈 Citations: 7
Influential: 0
🤖 AI Summary
This work investigates how many labelled samples specialised small models need before they match or surpass general-purpose large language models (LLMs) used without further updates. Using eight text classification tasks of varying characteristics, we systematically identify performance break-even points for seven models (including LLaMA and BERT) under fine-tuning, instruction tuning, prompt engineering, and in-context learning. We find that specialised models typically need only a few labelled samples (on average 10-1000) to be on par with or better than the general ones, and that the requirement depends strongly on task characteristics: up to 100 samples suffice on multi-class datasets, whereas binary datasets may demand up to 5000. When performance variance is explicitly taken into account, the number of required labels increases by 100-200% on average and by up to 1500% in specific cases. Our core contribution is the quantitative characterisation of how task properties, particularly the number of classes and the performance variance, affect small-model data efficiency, yielding reproducible, task-aware guidelines for determining annotation budgets in low-resource NLP scenarios.
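The break-even point described above can be illustrated with a small sketch. The numbers below are hypothetical and the simple threshold rule is an assumption for illustration, not the authors' exact estimation procedure: given a learning curve of the small model's accuracy at increasing label counts and the fixed accuracy of a general LLM, the break-even point is the smallest label count at which the small model catches up.

```python
# Minimal sketch with hypothetical numbers (not the authors' code or results):
# estimating the break-even number of labelled samples at which a tuned small
# model matches a general LLM used without further updates.
import numpy as np

# Hypothetical learning curve: mean accuracy of the small model at increasing
# training-set sizes, and the fixed accuracy of the general LLM on the same task.
sample_sizes = np.array([10, 20, 50, 100, 200, 500, 1000])
small_model_acc = np.array([0.61, 0.66, 0.72, 0.78, 0.81, 0.84, 0.86])
llm_acc = 0.77  # zero-/few-shot performance of the general model

def break_even_point(sizes, accs, reference):
    """Smallest sample size at which the small model matches or exceeds
    the reference (LLM) accuracy, or None if it never does."""
    for n, acc in zip(sizes, accs):
        if acc >= reference:
            return int(n)
    return None

print(break_even_point(sample_sizes, small_model_acc, llm_acc))  # -> 100
```

A variance-aware variant of this rule, which is what drives the 100-200% increase in required labels, is sketched after the Innovation section below.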

📝 Abstract
When solving NLP tasks with limited labelled data, researchers can either use a general large language model without further update, or use a small number of labelled examples to tune a specialised smaller model. In this work, we address the research gap of how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 7 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average 10-1000) to be on par with or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with this number being significantly lower on multi-class datasets (up to 100) than on binary datasets (up to 5000). When performance variance is taken into consideration, the number of required labels increases on average by 100-200% and even up to 1500% in specific cases.
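As a concrete illustration of the "tune a specialised smaller model" option from the abstract, the following is a minimal sketch, assuming a Hugging Face setup with BERT on the SST-2 dataset; neither of these is confirmed as the paper's exact configuration, and the hyperparameters are illustrative only.

```python
# Minimal sketch (assumed setup, not the paper's exact configuration):
# fine-tuning a small specialised model on a budget of only 100 labelled samples,
# the alternative to prompting a general LLM discussed in the abstract.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("sst2")  # example binary classification task
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

# Simulate the low-resource setting: keep only 100 labelled training samples.
train = dataset["train"].shuffle(seed=42).select(range(100)).map(tokenize, batched=True)
val = dataset["validation"].map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=10,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train, eval_dataset=val).train()
```

Because results at such small budgets vary across random seeds and data splits, repeating this run several times per budget is what makes the variance analysis in the paper possible.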
Problem

Research questions and friction points this paper is trying to address.

Determine how many labelled samples specialised small models need to outperform general large models
Compare performance break-even points across 8 text classification tasks
Assess impact of model size and quantization on performance variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialised small models match or outperform general large models after tuning on only a few labelled samples
Multi-class datasets require up to 100 labelled samples, compared with up to 5000 on binary datasets
Accounting for performance variance raises the required labels by 100-200% on average, and by up to 1500% in specific cases (see the sketch below)
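The last point can be made concrete with a small sketch. The data and the decision rule here are illustrative assumptions, not the authors' exact procedure: a naive break-even estimate compares mean accuracies, while a variance-aware one requires the small model's lower performance bound (mean minus one standard deviation over seeds) to clear the LLM's upper bound, which pushes the required label count upwards.

```python
# Illustrative sketch only; the data and the decision rule are assumptions,
# not the authors' exact estimation procedure. It contrasts a naive break-even
# estimate (mean accuracies) with a variance-aware one that requires the small
# model's lower bound to clear the LLM's upper bound.
import numpy as np

sample_sizes = np.array([10, 20, 50, 100, 200, 500, 1000])
small_mean = np.array([0.61, 0.66, 0.72, 0.78, 0.81, 0.84, 0.86])
small_std = np.array([0.06, 0.05, 0.04, 0.03, 0.02, 0.015, 0.01])  # over random seeds
llm_mean, llm_std = 0.77, 0.01

# Naive estimate: first budget where the mean accuracies cross.
naive = next((int(n) for n, m in zip(sample_sizes, small_mean) if m >= llm_mean), None)

# Variance-aware estimate: worst-case small model must beat best-case LLM.
robust = next(
    (int(n) for n, m, s in zip(sample_sizes, small_mean, small_std)
     if m - s >= llm_mean + llm_std),
    None,
)

print(naive, robust)  # -> 100 200
```

With these made-up numbers the estimate doubles from 100 to 200 labels, a 100% increase, in line with the 100-200% average increase reported in the abstract.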
Branislav Pecher
Faculty of Information Technology, Brno University of Technology, Brno, Czechia; Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia
Ivan Srba
Kempelen Institute of Intelligent Technologies
AI, Machine Learning, Natural Language Processing, Social Computing, Disinformation
M. Bieliková
Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia