Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets

📅 2026-01-26
🤖 AI Summary
This study proposes a subword-based framework for large-scale multilingual lexical comparison, used to quantify lexical similarity, divergence, and genealogical relationships among languages. The authors build "glottosets" from Wikipedia lexicons for 242 languages written in Latin and Cyrillic scripts and apply Byte Pair Encoding (BPE) for subword segmentation, enabling cross-linguistic comparison through rank-based subword vectors. BPE segmentation aligns with morpheme boundaries better than a random baseline across 15 languages (F1 = 0.34 vs. 0.15), and BPE vocabulary similarity correlates significantly with genealogical relatedness (Mantel r = 0.329, p < 0.001). Romance languages form the tightest cluster (mean distance = 0.51), while cross-family language pairs are markedly more distant (0.82). Of 26,939 cross-linguistic homographs, nearly half (48.7%) receive different BPE segmentations across related languages, and this variation correlates with phylogenetic distance.
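To make the idea of comparing languages through ranked subword vocabularies concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy word lists, the function names (`learn_bpe`, `rank_vector`, `rank_weighted_distance`), and the rank-weighted Jaccard distance are all assumptions chosen for illustration, since the summary does not state which distance the paper uses.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict.

    Returns the merged subwords in the order they were learned,
    so the list index serves as the subword's rank.
    """
    # Every word starts as a tuple of single characters.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken alphabetically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best[0] + best[1])
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def rank_vector(merges):
    """Map each subword to the rank (index) of its first appearance."""
    ranks = {}
    for rank, token in enumerate(merges):
        ranks.setdefault(token, rank)
    return ranks


def rank_weighted_distance(merges_a, merges_b):
    """Assumed metric: 1 - weighted Jaccard, with weight 1 / (rank + 1)."""
    wa = {t: 1.0 / (r + 1) for t, r in rank_vector(merges_a).items()}
    wb = {t: 1.0 / (r + 1) for t, r in rank_vector(merges_b).items()}
    tokens = set(wa) | set(wb)
    if not tokens:
        return 0.0
    shared = sum(min(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tokens)
    total = sum(max(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tokens)
    return 1.0 - shared / total


# Toy lexicons (hypothetical frequencies, not data from the paper).
lexicon_a = {"nacional": 5, "nación": 4, "racional": 3}
lexicon_b = {"nazionale": 5, "nazione": 4, "razionale": 3}
distance = rank_weighted_distance(learn_bpe(lexicon_a, 10), learn_bpe(lexicon_b, 10))
print(f"toy distance: {distance:.2f}")
```

Weighting by inverse rank makes early, high-frequency merges count more than late ones. That is one plausible way to use rank information, and other rank-based measures would fit the same interface.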

📝 Abstract
We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than a random baseline across 15 languages (F1 = 0.34 vs. 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, with variation correlating with phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.
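The two evaluation measures named in the abstract, boundary-level F1 and the Mantel correlation, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's evaluation code: the helper names, the Pearson statistic, the one-sided permutation scheme, the permutation count, and the toy distance matrix are all assumptions.

```python
import random


def boundaries(segments):
    """Character offsets of the internal boundaries of a segmentation."""
    pos, out = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        out.add(pos)
    return out


def boundary_f1(predicted, gold):
    """F1 between predicted and gold morpheme boundary positions."""
    p, g = boundaries(predicted), boundaries(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


def upper_triangle(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test: correlate two distance matrices and
    estimate a p-value by jointly permuting rows and columns of d1."""
    rng = random.Random(seed)
    n = len(d1)
    flat2 = upper_triangle(d2)
    observed = pearson(upper_triangle(d1), flat2)
    hits = 0
    for _ in range(permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        permuted = [[d1[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), flat2) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (permutations + 1)


# Toy symmetric distance matrix for four hypothetical languages.
toy = [
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
r, p = mantel(toy, toy, permutations=99)
```

Because the entries of a distance matrix are not independent, the Mantel test permutes language labels rather than individual cells, which is why rows and columns are shuffled together here.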
Problem

Research questions and friction points this paper is trying to address.

comparative linguistics
subword analysis
language similarity
lexical divergence
multilingual comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

subword-based analysis
Byte-Pair Encoding (BPE)
cross-linguistic comparison
glottosets
lexical divergence
Iaroslav Chelombitko
DataSpike; Neapolis University Pafos, Paphos, Cyprus; Metropolia University of Applied Sciences, Helsinki, Finland
Mika Hämäläinen
Metropolia University of Applied Sciences
NLP, NLG, computational creativity, endangered languages, digital humanities
A. Komissarov
Neapolis University Pafos, Paphos, Cyprus; aglabx, Paphos, Cyprus