Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts

📅 2025-06-02

🤖 AI Summary

Kwak’wala, an endangered Indigenous language of British Columbia, Canada, is documented in the century-old Boas–Hunt manuscript collection (11 volumes), which survives only as low-resolution scanned images. Conventional OCR fails on these scans because of archaic typography, multilingual code-switching, and severe image degradation. This paper proposes a hybrid OCR framework for endangered-language heritage texts that combines domain-adapted Tesseract and PaddleOCR with Kwak’wala language identification, dynamic region masking for segmentation, and post-editing that is both rule-guided and LLM-augmented. Evaluated across all 11 manuscript volumes, the system achieves a mean character accuracy of 92.7%, enables mapping to modern orthography, and supports downstream NLP pipeline development. The work pairs technical robustness with community engagement: its outputs have been delivered to the Kwak’wala community for language pedagogy and lexicographic development.
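
The summary does not say how the mean character accuracy is computed. A common convention, assumed here, is one minus the character error rate (CER), where CER is the Levenshtein edit distance between the reference and the OCR output divided by the reference length. The function names below are illustrative and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - CER, clipped at 0 (the paper may define it differently)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

Under this definition, a mean over volumes would average the per-volume scores; whether the reported figure weights volumes by length is not stated.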

📝 Abstract

Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.

Problem

Research questions and friction points this paper is trying to address.

Digitizing Kwak'wala legacy texts for machine readability

Applying OCR to historical Kwak'wala image-based documents

Developing a mixed-methods pipeline for accurate text transcription

Innovation

Methods, ideas, or system contributions that make the work stand out.

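Applying the latest OCR techniques to Kwak'wala texts available only as images

Combining off-the-shelf OCR, language identification, and masking to isolate Kwak'wala text

Using post-correction models to produce a high-quality final transcription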
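
The language-identification and masking step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `WordBox`, `looks_english`, and `mask_regions` are hypothetical names, the word boxes are assumed to come from a first OCR pass, and the tiny English word list stands in for a real language-identification model. Regions judged to be in another language are painted over so that a second OCR pass sees only the target-language text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WordBox:
    """One word from a first OCR pass: its text and pixel bounding box."""
    text: str
    x: int
    y: int
    w: int
    h: int


# Toy stand-in for a language-ID model (assumption): a few common English words.
COMMON_ENGLISH = {"the", "and", "of", "to", "in", "he", "said", "then"}


def looks_english(word: str) -> bool:
    return word.lower().strip(".,;:!?\"'") in COMMON_ENGLISH


def mask_regions(image: List[List[int]],
                 boxes: List[WordBox],
                 is_other_language: Callable[[str], bool],
                 fill: int = 255) -> List[List[int]]:
    """Return a copy of a grayscale image with non-target word boxes filled."""
    masked = [row[:] for row in image]
    height = len(masked)
    width = len(masked[0]) if masked else 0
    for box in boxes:
        if not is_other_language(box.text):
            continue  # keep target-language regions untouched
        for r in range(max(0, box.y), min(height, box.y + box.h)):
            for c in range(max(0, box.x), min(width, box.x + box.w)):
                masked[r][c] = fill
    return masked


# Usage: a 4x10 dark "page" with one English word and one placeholder token.
page = [[0] * 10 for _ in range(4)]
words = [WordBox("the", 0, 0, 3, 2), WordBox("zzz", 5, 0, 3, 2)]
cleaned = mask_regions(page, words, looks_english)
```

Passing the predicate as a parameter keeps the masking logic independent of whichever language-identification method is actually used; the masked image would then go to the OCR engine and post-correction model.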
