LongTail-Swap: benchmarking language models' abilities on rare words

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak generalization of language models on the lexical long tail (i.e., rare words). We introduce LongTail-Swap, the first zero-shot evaluation benchmark explicitly designed for the long-tail distribution of pretraining corpora. Built upon the BabyLM dataset, it constructs grammatical/ungrammatical sentence pairs centered on extremely low-frequency tokens; model competence in the semantics and syntax of such tokens is assessed via zero-shot average log-probability scores over the sentence pairs. Unlike conventional benchmarks emphasizing high-frequency vocabulary, LongTail-Swap systematically exposes severe performance bottlenecks on rare words, revealing that architectural differences yield substantially larger performance gaps in the long tail than in the head. Empirical evaluation across 16 BabyLM models confirms the benchmark's validity and diagnostic utility for probing long-tail generalization.

📝 Abstract
Children learn to speak from a small amount of data and can be taught new words on a few-shot basis, making them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low-data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, as infants do. LT-Swap is a pretraining-corpus-specific test set of acceptable versus unacceptable sentence pairs that isolate the semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We built two such test sets associated with the 10M-word and 100M-word BabyLM training sets, respectively, and evaluated 16 models from the BabyLM leaderboard. Our results not only highlight the poor performance of language models on rare words but also reveal that performance differences across LM architectures are much more pronounced in the long tail than in the head. This offers new insights into which architectures are better at handling rare word generalization. We have also made the code publicly available.
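The evaluation protocol described above (zero-shot comparison of average log probabilities over a minimal pair) can be sketched in a few lines. The helper names and the toy per-token log probabilities below are illustrative assumptions, not the paper's actual code; in practice the per-token scores would come from a trained LM's output distribution.

```python
def avg_logprob(token_logprobs):
    """Length-normalized sentence score: the mean of per-token log probabilities."""
    return sum(token_logprobs) / len(token_logprobs)

def prefers_acceptable(acceptable_lps, unacceptable_lps):
    """A model 'passes' a pair when it assigns the acceptable sentence a
    higher average log probability than the unacceptable one."""
    return avg_logprob(acceptable_lps) > avg_logprob(unacceptable_lps)

# Toy per-token log probabilities (stand-ins for real LM outputs):
good = [-2.1, -3.0, -1.5, -4.2]   # acceptable use of a rare word
bad  = [-2.1, -3.0, -6.8, -4.2]   # unacceptable swap of that word

print(prefers_acceptable(good, bad))
```

Averaging rather than summing log probabilities keeps the score comparable when the two members of a pair tokenize to different lengths, which is common for rare words that split into multiple subword tokens.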
Problem

Research questions and friction points this paper is trying to address.

Benchmarking language models' rare word learning abilities
Evaluating few-shot word acquisition like children's learning
Testing semantic and syntactic usage of rare words
Innovation

Methods, ideas, or system contributions that make the work stand out.

LT-Swap benchmark tests rare word learning
Evaluates models using zero-shot sentence pair probabilities
Compares architecture performance on long-tail distribution