🤖 AI Summary
Lexical data gathered in language documentation frequently contain transcription errors and undocumented loanwords, both of which can bias phonological analysis. This paper addresses the problem for the low-resource language Kokborok with unsupervised anomaly detection methods that combine character-level and syllable-level phonotactic features to flag potential transcription errors and borrowings in wordlists. On a multilingual dataset of Kokborok varieties with Bangla, syllable-aware features significantly outperform a character-level baseline. Precision and recall remain modest because the anomalies are subtle, but the approach favors recall, giving field linguists a systematic way to flag entries that need verification. The work thus supports data quality improvement in low-resource language documentation.
📝 Abstract
Lexical data collected in language documentation often contain transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
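To make the character-level side of this idea concrete, the sketch below scores each word by how surprising its character bigrams are relative to the rest of the wordlist and flags the highest-scoring entries. It is only an illustration under stated assumptions, not the paper's algorithm: the add-one-smoothed bigram model, the function names (`bigrams`, `score_words`, `flag_anomalies`) and the toy wordlist are all hypothetical, and the strings are invented rather than Kokborok or Bangla data. The syllable-level features mentioned in the abstract are not modeled here.

```python
import math
from collections import Counter


def bigrams(word):
    """Character bigrams of a word, with '#' marking both word edges."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def score_words(words):
    """Mean negative log-probability per bigram for each word.

    Assumption: a bigram model with add-one smoothing, estimated on the
    same wordlist it scores (no labels are used at any point).
    """
    bigram_counts = Counter(bg for w in words for bg in bigrams(w))
    context_counts = Counter(bg[0] for w in words for bg in bigrams(w))
    vocab_size = len({ch for w in words for ch in "#" + w})

    scores = {}
    for w in words:
        bgs = bigrams(w)
        nll = -sum(
            math.log((bigram_counts[bg] + 1) / (context_counts[bg[0]] + vocab_size))
            for bg in bgs
        )
        scores[w] = nll / len(bgs)
    return scores


def flag_anomalies(words, top_k=1):
    """Return the top_k words with the most unusual bigram sequences."""
    scores = score_words(words)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Invented toy strings (not real lexical data): five words share a
# consonant-vowel pattern, one is a consonant cluster with no vowels.
toy_words = ["bata", "taba", "baba", "tata", "abat", "strk"]
flagged = flag_anomalies(toy_words, top_k=1)
```

In practice a fieldworker would review the flagged entries by hand, since a high score can indicate a transcription error, a borrowing, or simply a rare but legitimate form.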