Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

📅 2025-10-24
🤖 AI Summary
Lexical data in language documentation frequently contain transcription errors and unannotated loanwords, which can bias phonological analysis. This paper addresses the problem for the low-resource language Kokborok with unsupervised anomaly detection methods that combine character-level and syllable-level phonotactic features to flag potential transcription errors and undocumented borrowings in wordlists. Evaluated on a multilingual dataset of Kokborok varieties with Bangla, syllable-aware features significantly outperform character-level baselines, although precision and recall both remain modest because the anomalies are subtle. The approach favors recall, giving fieldworkers a systematic way to flag entries that need verification and supporting data quality improvement in low-resource language documentation.

📝 Abstract
Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
Problem

Research questions and friction points this paper is trying to address.

Detecting phonotactic inconsistencies in Kokborok wordlists
Identifying transcription errors and undocumented borrowings
Improving data quality through automated anomaly detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised anomaly detection for phonotactic inconsistencies
Character-level and syllable-level phonotactic feature analysis
Systematic verification flagging for low-resourced language documentation
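
The idea of flagging phonotactically unusual entries with character-level and syllable-aware features can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (mean smoothed surprisal of boundary-padded bigrams), the CV skeleton used as a crude stand-in for syllable structure, the vowel set, and the wordlist are all assumptions made for illustration. The words in `toy_words` are invented placeholders, not Kokborok or Bangla data.

```python
import math
from collections import Counter

# Assumption: a toy vowel inventory, not the inventory of any real language.
VOWELS = set("aeiou")

# Invented placeholder forms; "strk" is planted as the odd one out.
toy_words = ["tala", "mana", "kali", "nama", "lika", "strk"]


def char_bigrams(word):
    """Character bigrams over the word padded with '#' boundary marks."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def cv_skeleton(word):
    """Map each character to C or V (a rough proxy for syllable shape)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word)


def cv_bigrams(word):
    """Bigrams over the CV skeleton rather than the raw characters."""
    return char_bigrams(cv_skeleton(word))


def surprisal_scores(words, featurize):
    """Mean add-one-smoothed surprisal of each word's features.

    Feature frequencies are estimated from the same wordlist that is
    being scored, so no labelled data is needed.
    """
    counts = Counter(f for w in words for f in featurize(w))
    total = sum(counts.values())
    vocab = len(counts)
    scores = {}
    for w in words:
        feats = featurize(w)
        nll = sum(-math.log((counts[f] + 1) / (total + vocab)) for f in feats)
        scores[w] = nll / len(feats)
    return scores


def flag(words, featurize, top_k=1):
    """Return the top_k highest-surprisal words for manual verification."""
    scores = surprisal_scores(words, featurize)
    return sorted(words, key=lambda w: scores[w], reverse=True)[:top_k]


if __name__ == "__main__":
    print(flag(toy_words, char_bigrams))
    print(flag(toy_words, cv_bigrams))
```

Ranking by score and reviewing the top entries by hand, rather than applying a hard threshold, matches the verification-flagging use described above: the output is a list of candidates for a fieldworker to check, not a verdict on any entry.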
Kellen Parker van Dam
University of Passau
Language documentation, Phonology, Tone, Historical linguistics, Sino-Tibetan
Abishek Stephen
Institute of Formal and Applied Linguistics, Charles University, Czech Republic