LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

📅 2026-03-11
🤖 AI Summary
This study identifies and analyzes lexical borrowing in Luxembourgish news media and traces how it changes over time. Using a corpus of 259,305 news articles spanning 27 years, it combines sentence-level language identification with token-level borrowing annotation, supported by lemmatization, a loanword inventory, and morphological and orthographic normalization rules. The authors advocate borrowing-centric metrics (borrowed token and type rates, donor entropy, and assimilation ratios) and find that Luxembourgish text shows localized insertions rather than balanced bilingual writing. Overall, 77.1% of articles include at least one donor language, and French supplies most adapted items. Of the 25,444 token-level adaptations, 63.8% are morphological. The code-mixing index rises from 6.1 (1999-2007) to a peak of 8.4 in 2020, indicating that mixing intensifies over time.
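The borrowing-centric metrics named above can be illustrated with a minimal sketch. It is not the authors' implementation: the page does not give exact definitions, so the function name `borrowing_metrics`, the `(lemma, donor, adapted)` token format, and the reading of "assimilation ratio" as the share of borrowed tokens that are adapted are all assumptions, and the sample tokens are toy data.

```python
import math
from collections import Counter

def borrowing_metrics(tokens):
    """Compute illustrative borrowing metrics.

    tokens: list of (lemma, donor, adapted) tuples, where donor is a
    language code such as "FR" or None for native tokens, and adapted
    is True when the borrowed token shows adaptation (assumed format).
    """
    borrowed = [t for t in tokens if t[1] is not None]
    all_types = {lemma for lemma, _, _ in tokens}
    borrowed_types = {lemma for lemma, _, _ in borrowed}

    # Shannon entropy (in bits) of the donor-language distribution
    # over borrowed tokens; 0.0 when nothing is borrowed.
    donor_counts = Counter(donor for _, donor, _ in borrowed)
    total = sum(donor_counts.values())
    entropy = 0.0
    for count in donor_counts.values():
        p = count / total
        entropy -= p * math.log2(p)

    # Assumed definition: adapted borrowed tokens / all borrowed tokens.
    adapted = sum(1 for _, _, is_adapted in borrowed if is_adapted)

    return {
        "borrowed_token_rate": len(borrowed) / len(tokens) if tokens else 0.0,
        "borrowed_type_rate": len(borrowed_types) / len(all_types) if all_types else 0.0,
        "donor_entropy": entropy,
        "assimilation_ratio": adapted / len(borrowed) if borrowed else 0.0,
    }

# Toy input: placeholder lemmas, not corpus data.
sample = [
    ("native_a", None, False),
    ("native_b", None, False),
    ("pompjee", "FR", True),
    ("pompjee", "FR", True),
    ("loan_fr", "FR", False),
    ("loan_de", "DE", False),
]
metrics = borrowing_metrics(sample)
```

Computing entropy over borrowed items only, rather than over all tokens, keeps the measure focused on how borrowing is spread across donor languages instead of on how much native text surrounds it.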

📝 Abstract
We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on->oun and eur->er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.
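To make the code-mixing index (CMI) figures concrete, here is a minimal sketch of the widely used per-utterance formulation, CMI = 100 × (1 − max_i w_i / (n − u)), where w_i is the token count of language i, n the total token count, and u the number of language-independent tokens. The abstract does not state which CMI variant or unit of analysis the authors use, so this formula choice, the function name `code_mixing_index`, and the `"OTHER"` label for language-independent tokens are assumptions.

```python
from collections import Counter

def code_mixing_index(lang_tags, neutral="OTHER"):
    """Return the CMI (0 to 100) for a sequence of per-token language tags.

    Tokens tagged with `neutral` (e.g. numbers, punctuation) are treated
    as language-independent and excluded from the denominator.
    """
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    lang_tokens = sum(counts.values())  # this is n - u
    if lang_tokens == 0:
        return 0.0  # no language-bearing tokens, so no mixing
    dominant = max(counts.values())     # tokens in the matrix language
    return 100.0 * (1.0 - dominant / lang_tokens)

# Toy example: nine LU tokens and one FR token.
tags = ["LU"] * 9 + ["FR"]
cmi = code_mixing_index(tags)
```

Under this formulation a monolingual text scores 0, and a text with one inserted donor token in ten scores about 10, which is the order of magnitude of the medians reported above and fits the reading of localized insertions rather than balanced bilingual text.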
Problem

Research questions and friction points this paper is trying to address.

borrowing
code-mixing
Luxembourgish
language identification
loanwords
Innovation

Methods, ideas, or system contributions that make the work stand out.

borrowing detection
code-mixing analysis
morphological adaptation
multilingual NLP
diachronic linguistics
Nina Hosseini-Kivanani
1 University of Luxembourg, Luxembourg; 2 Radio Télévision Luxembourg (RTL), Luxembourg
Fred Philippy
University of Luxembourg
Natural Language Processing, Deep Learning, Data Science