KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation

📅 2025-01-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of large model size, deployment difficulty, data scarcity, and low inference efficiency in sign language recognition (SLR), this paper proposes a lightweight and efficient end-to-end SLR framework. Methodologically, we introduce the first 3D-to-1D cross-modal multi-knowledge distillation, integrating MediaPipe-based landmark representations with text error correction pretraining; we further design landmark augmentation and TensorFlow Lite (TFLite) quantization strategies. Our contributions include: (i) releasing the first large-scale Chinese sign language dataset and a domain-specific vocabulary; and (ii) constructing the lightest (12.93 MB), fastest (deployable on Intel CPUs), and most accurate end-to-end SLR model to date, achieving a WER reduction of at least 1.4% on PHOENIX14/T, thereby significantly enhancing interaction accessibility for deaf and hard-of-hearing users.
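The page does not spell out the distillation objective, but 3D-to-1D knowledge distillation generally means a large 3D video model teaching a small 1D landmark-sequence model. As an illustrative sketch only (the Hinton-style temperature-scaled KL loss; function and parameter names are assumptions, not from the paper):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    In a 3D-to-1D setup, `teacher_logits` would come from the 3D video
    model and `student_logits` from the 1D landmark model (this pairing
    is an assumption about the paper's setup, not a confirmed detail).
    """
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student predictions
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When teacher and student agree exactly, the loss is zero; the softer the temperature, the more the student learns from the teacher's full output distribution rather than only its top prediction.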

📝 Abstract
Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models, which rely on video inputs, involve large parameter sizes, making them time-consuming and computationally intensive to deploy. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these issues, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on the PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we have designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel platform CPUs and assessed the feasibility of deploying the model on other platforms.
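The headline metric here, Word Error Rate (WER), is the standard continuous-SLR evaluation measure: the word-level edit distance between the predicted and reference gloss sequences, divided by the reference length. A minimal sketch (standard Levenshtein formulation, not code from the paper):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / len(ref)
```

For example, one substitution in a three-word reference yields a WER of about 33.3%, so the reported reduction of at least 1.4% over CorrNet is measured on this scale.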
Problem

Research questions and friction points this paper is trying to address.

Sign Language Recognition
Data Scarcity
Complexity and Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D to 1D Knowledge Distillation
Automatic Error Correction Algorithm
Sign Language Recognition Efficiency
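The abstract notes comparative experiments on augmentation methods for landmark data, but this page does not list the specific methods. As a hypothetical example of what such an augmentation could look like, random per-frame scaling and jitter of MediaPipe-style normalized (x, y) landmarks (all parameter values here are illustrative assumptions):

```python
import random

def augment_landmarks(landmarks, jitter=0.01, scale_range=(0.9, 1.1), rng=None):
    """Randomly scale (about the centroid) and jitter a frame of 2D landmarks.

    `landmarks` is a list of (x, y) pairs in normalized [0, 1] coordinates,
    as produced by MediaPipe-style hand/pose trackers. The defaults are
    illustrative, not values from the paper.
    """
    rng = rng or random.Random()
    s = rng.uniform(*scale_range)  # one global scale factor per frame
    cx = sum(x for x, _ in landmarks) / len(landmarks)
    cy = sum(y for _, y in landmarks) / len(landmarks)
    return [
        ((x - cx) * s + cx + rng.uniform(-jitter, jitter),
         (y - cy) * s + cy + rng.uniform(-jitter, jitter))
        for x, y in landmarks
    ]
```

Because landmarks are a 1D coordinate stream rather than pixels, such transforms are cheap and preserve the hand's shape up to scale, which is one reason landmark-based pipelines suit lightweight deployment.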
Bolin Ren
School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University, Suzhou, 215123, China
Ke Hu
School of Humanities and Social Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, 215123, China
Changyuan Liu
School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University, Suzhou, 215123, China
Zhengyong Jiang
Xi’an Jiaotong-Liverpool University
Deep Learning, Reinforcement Learning
Kang Dang
Xi'an Jiaotong-Liverpool University
Computer Vision, Medical Image Analysis
Jionglong Su
Xi'an Jiaotong-Liverpool University
AI, Big Data, Machine Learning, Statistics