Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

📅 2025-10-24



🤖 AI Summary

Lexical data gathered during language documentation often contain transcription errors and unannotated loanwords, which can bias phonological analysis. This paper addresses the problem for Kokborok, a low-resource language, with unsupervised anomaly detection that combines character-level and syllable-level phonotactic features to flag likely transcription errors and undocumented borrowings in wordlists. On a multilingual dataset of Kokborok varieties with Bangla, syllable-aware features significantly outperform a character-level baseline, although precision and recall both remain modest because the anomalies are subtle. The approach favors recall, giving field linguists a systematic way to flag entries for manual verification and supporting data quality improvement in low-resource language documentation.


📝 Abstract

Lexical data collected in language documentation often contain transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
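
To make the character-level idea concrete, here is a minimal sketch of one common way to score phonotactic oddness: a smoothed character-bigram model trained on the wordlist itself, with words ranked by mean surprisal. This is an illustrative assumption, not the paper's algorithm; the function names, the add-one smoothing, and the `top_k` cutoff are all hypothetical choices, and any real use would work on a proper phonemic transcription.

```python
import math
from collections import Counter


def bigrams(word):
    """Return character bigrams of a word, padded with '#' word boundaries."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train_bigram_model(words):
    """Count bigrams and their left contexts over the whole wordlist."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = {"#"}
    for word in words:
        alphabet.update(word)
        for bg in bigrams(word):
            pair_counts[bg] += 1
            context_counts[bg[0]] += 1
    return pair_counts, context_counts, len(alphabet)


def mean_surprisal(word, model):
    """Average -log2 P(next char | previous char), with add-one smoothing."""
    pair_counts, context_counts, alphabet_size = model
    grams = bigrams(word)
    total = 0.0
    for bg in grams:
        prob = (pair_counts[bg] + 1) / (context_counts[bg[0]] + alphabet_size)
        total += -math.log2(prob)
    return total / len(grams)


def flag_anomalies(words, top_k=1):
    """Return the top_k words with the highest mean surprisal (assumed cutoff)."""
    model = train_bigram_model(words)
    ranked = sorted(words, key=lambda w: mean_surprisal(w, model), reverse=True)
    return ranked[:top_k]
```

Calling `flag_anomalies` on a toy list of invented forms such as `["tama", "mata", "tata", "mama", "tamata", "matama", "strkx"]` ranks the form with unfamiliar sequences first. A ranking like this only says which entries look unusual, so a fieldworker still has to decide whether each one is an error, a borrowing, or a legitimate form.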

Problem

Research questions and friction points this paper is trying to address.

Detecting phonotactic inconsistencies in Kokborok wordlists

Identifying transcription errors and undocumented borrowings

Improving data quality through automated anomaly detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised anomaly detection for phonotactic inconsistencies

Character-level and syllable-level phonotactic feature analysis

Systematic verification flagging for low-resourced language documentation
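
As a rough illustration of what a syllable-level feature could look like, the sketch below reduces each word to a consonant/vowel skeleton, splits it into syllable shapes with a naive rule (one consonant before each non-initial vowel run becomes its onset), and flags words containing a shape that is rare in the wordlist. Everything here is an assumption made for illustration: the vowel set, the syllabification rule, the `min_share` threshold, and the function names are hypothetical and do not come from the paper.

```python
from collections import Counter

# Assumed vowel symbols; a real inventory depends on the transcription used.
VOWELS = set("aeiou")


def cv_skeleton(word):
    """Map each character to 'V' (vowel) or 'C' (anything else)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word)


def syllable_shapes(word):
    """Split the CV skeleton into syllable shapes such as 'CV' or 'CVC'."""
    skel = cv_skeleton(word)
    nucleus_starts = [
        i for i, sym in enumerate(skel)
        if sym == "V" and (i == 0 or skel[i - 1] != "V")
    ]
    if not nucleus_starts:
        return [skel]  # no vowel at all: keep the whole skeleton as one shape
    # Each non-initial syllable begins one consonant before its nucleus.
    boundaries = [0] + [start - 1 for start in nucleus_starts[1:]] + [len(skel)]
    return [skel[a:b] for a, b in zip(boundaries, boundaries[1:])]


def flag_rare_shapes(words, min_share=0.1):
    """Flag words containing a syllable shape rarer than min_share overall."""
    counts = Counter(shape for w in words for shape in syllable_shapes(w))
    total = sum(counts.values())
    return [
        w for w in words
        if min(counts[s] / total for s in syllable_shapes(w)) < min_share
    ]
```

With invented forms, `syllable_shapes("tamta")` yields `["CVC", "CV"]`, and a vowelless string such as `"strkx"` stands out because its shape occurs only once. Shape-based features of this kind are blind to which segments fill each slot, which is one reason to combine them with character-level scores.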
