Multi-label Scandinavian Language Identification (SLIDE)

📅 2025-02-10
🤖 AI Summary
This paper addresses the challenge of fine-grained, sentence-level multilabel language identification (LID) for Scandinavian languages—Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish—where sentences frequently exhibit intra-sentential code-mixing, rendering single-label classification inadequate. To tackle this, we introduce SLIDE, the first manually annotated multilabel evaluation dataset for Scandinavian LID. We propose a lightweight, multilayer neural architecture leveraging character- and token-level features, integrated with threshold optimization and label-correlation modeling to enable tunable precision–efficiency trade-offs. Experiments demonstrate that multilabel modeling is essential for accurate LID: our method achieves a mean F1-score of 89.2% on SLIDE, substantially outperforming single-label baselines. The lightweight variant processes over 10,000 sentences per second, satisfying both industrial deployment constraints and academic evaluation rigor.

📝 Abstract
Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
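To illustrate why a single label can be insufficient, here is a minimal sketch contrasting a single-label decision with a multi-label one. The per-language scores, the 0.5 threshold, and the function names are assumptions made for illustration only; they are not taken from the paper and do not represent its models or training approach.

```python
# Language codes for Danish, Norwegian Bokmål, Norwegian Nynorsk, Swedish.
LANGS = ["da", "nb", "nn", "sv"]


def single_label(scores):
    """Return the one language with the highest score."""
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    """Return every language whose score reaches the threshold.

    Falls back to the top-scoring language so the result is never empty.
    The threshold value is a hypothetical choice, not the paper's.
    """
    picked = [lang for lang in LANGS if scores[lang] >= threshold]
    return picked or [single_label(scores)]


# Hypothetical scores for a sentence that is plausible in both
# Danish and Norwegian Bokmål.
scores = {"da": 0.81, "nb": 0.77, "nn": 0.12, "sv": 0.05}

print(single_label(scores))  # keeps only one of the two plausible languages
print(multi_label(scores))   # keeps both
```

With these made-up scores, the single-label decision discards Bokmål even though it scores almost as high as Danish, while the thresholded decision keeps both.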
Problem

Research questions and friction points this paper is trying to address.

Identify multiple Scandinavian languages simultaneously.
Develop multi-label language identification models.
Create a dataset for evaluating language identification.
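As a rough sketch of how predictions could be scored against a multi-label evaluation set, the snippet below compares predicted label sets with gold label sets. The choice of exact-match accuracy and per-sentence F1, as well as the toy gold and predicted labels, are assumptions for illustration; the source does not state which metrics SLIDE uses.

```python
def exact_match(gold, pred):
    """True when the predicted label set equals the gold label set."""
    return set(gold) == set(pred)


def sample_f1(gold, pred):
    """F1 between the gold and predicted label sets of one sentence."""
    gold, pred = set(gold), set(pred)
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): three sentences with gold and predicted labels.
gold = [["da", "nb"], ["sv"], ["nb", "nn"]]
pred = [["da"], ["sv"], ["nb", "nn"]]

accuracy = sum(exact_match(g, p) for g, p in zip(gold, pred)) / len(gold)
mean_f1 = sum(sample_f1(g, p) for g, p in zip(gold, pred)) / len(gold)

print(f"exact match: {accuracy:.3f}")
print(f"mean F1:     {mean_f1:.3f}")
```

In this toy example, a single-label prediction for the first sentence is only partially correct: it fails the exact-match check and is penalized on recall, which is the kind of error a single-label system cannot avoid on sentences valid in several languages.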
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-label language identification
Manually curated dataset
Novel training approach
Mariia Fedorova
University of Oslo
NLP
Jonas Sebulon Frydenberg
Department of Informatics, University of Oslo
Victoria Handford
Department of Informatics, University of Oslo
Victoria Ovedie Chruickshank Lango
Department of Informatics, University of Oslo
Solveig Helene Willoch
Department of Informatics, University of Oslo
Marthe Loken Midtgaard
Department of Informatics, University of Oslo
Yves Scherrer
Department of Informatics, University of Oslo
Natural language processing
Petter Maehlum
Department of Informatics, University of Oslo
David Samuel
Language Technology Group, University of Oslo
language modeling, semantic parsing, natural language processing