BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models

📅 2025-10-22
🤖 AI Summary
Existing benchmarks inadequately assess language models' capacity to model second language acquisition (SLA). Method: We introduce BLiSS 1.0, a benchmark built from 2.8 million authentic sentences produced by bilingual learners. It introduces a "selective tolerance" evaluation paradigm, organizing inputs into triplets of grammatically correct sentences, naturally occurring learner errors, and artificially constructed ungrammatical sentences, to isolate a model's ability to distinguish acquisition-consistent errors from arbitrary violations. Evaluation employs acceptability scoring and clustering analysis. Contribution/Results: We find that selective tolerance is empirically distinct from conventional grammaticality judgment, and that model performance clusters clearly by training objective (e.g., pretraining vs. instruction tuning). BLiSS 1.0 enables, for the first time, fine-grained, goal-sensitive assessment of models' ability to simulate SLA, establishing a cognitively grounded standard for evaluating language models.

📝 Abstract
To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.
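The abstract's selective-tolerance test (does the model find the naturalistic learner error more plausible than the matched artificial error in the same sentence?) can be sketched as follows. The `log_prob` scorer and the example triplet are illustrative assumptions, not the paper's actual data or implementation; in practice the scorer would be a language model's summed token log-likelihood.

```python
# Hedged sketch of the selective-tolerance comparison described in the
# abstract. Assumes a log_prob(sentence) scorer; the toy scorer and the
# sample triplet below are invented for illustration only.

def selective_tolerance(triplets, log_prob):
    """Fraction of (corrected, learner, artificial) triplets in which the
    naturalistic learner error scores higher than the artificial error."""
    hits = sum(log_prob(learner) > log_prob(artificial)
               for _corrected, learner, artificial in triplets)
    return hits / len(triplets)

def grammaticality(triplets, log_prob):
    """Standard acceptability check: the corrected sentence is preferred
    over the learner error."""
    hits = sum(log_prob(corrected) > log_prob(learner)
               for corrected, learner, _artificial in triplets)
    return hits / len(triplets)

def toy_log_prob(sentence):
    # Stand-in for an LM scorer: hard-coded scores for one example triplet.
    scores = {
        "She goes to school.": -10.0,    # corrected
        "She go to school.": -14.0,      # naturalistic learner error
        "School to go she the.": -25.0,  # artificial error
    }
    return scores[sentence]

triplets = [("She goes to school.",
             "She go to school.",
             "School to go she the.")]
```

A model with high selective tolerance but also high grammaticality would, as the abstract suggests, both prefer correct sentences and rank acquisition-consistent errors above arbitrary violations.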
Problem

Research questions and friction points this paper is trying to address.

Evaluating bilingual learner competence in second language models
Bridging performance benchmarks with cognitive model evaluation
Measuring how training objectives affect alignment with human acquisition patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces BLiSS benchmark for bilingual evaluation
Uses selective tolerance paradigm for error plausibility
Leverages naturalistic learner sentences for controlled triplets
👥 Authors
Yuan Gao (ALTA Institute, Department of Computer Science & Technology, University of Cambridge)
Suchir Salhan (University of Cambridge)
Andrew Caines (University of Cambridge)
Paula Buttery (ALTA Institute, Department of Computer Science & Technology, University of Cambridge)
Weiwei Sun (ALTA Institute, Department of Computer Science & Technology, University of Cambridge)

Topics: Machine Learning, Language Models, Natural Language Processing, Linguistics, Cognitive Science