Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

📅 2025-03-03

📈 Citations: 0

✨ Influential: 0

career value

145K/year

🤖 AI Summary

Cross-linguistic numeral structure comparison lacks a standardized, lightweight encoding framework. Method: We propose the first scalable numeral encoding framework and a human-in-the-loop annotation workflow, structurally annotating numerals 1–40 across 25 typologically diverse languages. Our methodology integrates rule-based morphological analysis with supervised and unsupervised morpheme segmentation (Morfessor, LSTM-segmenter) and subword algorithms (BPE, WordPiece), conducting systematic comparative experiments. Contributions/Results: (1) We systematically reveal that over 78% of numerals exhibit mismatches between surface form and underlying morphological structure; (2) we identify allomorphy as the primary cause of segmentation errors—accounting for over 41% of failures—and empirically refute the applicability of subword segmentation to low-resource numeral analysis; (3) we release the first cross-lingual structured numeral dataset (1–40), achieving a maximum segmentation F1-score of 82.3%.

Technology Category

Application Category

📝 Abstract

Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved in their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy as the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.

Problem

Research questions and friction points this paper is trying to address.

Standardize coding for numeral system comparison across languages

Analyze morphological structure in diverse numeral systems

Evaluate morpheme segmentation in low-resource language scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized coding scheme for numeral annotation

Computer-assisted workflow for numeral system coding

Automated models for morpheme segmentation analysis

🔎 Similar Papers

No similar papers found.