🤖 AI Summary
This work addresses the challenge of transferring pre-trained vision encoders to continuous sign language recognition (CSLR), where existing models suffer from poor cross-task generalization. We propose CLIP-SLA, a parameter-efficient adaptation framework that transfers the CLIP vision encoder to CSLR. Our method introduces two lightweight adapter variants—SLA-Adapter and SLA-LoRA—that enable end-to-end video-to-gloss transcription with minimal parameter updates. Key contributions include: (1) establishing a CLIP adaptation paradigm tailored to sign language videos; (2) designing adapter architectures that bridge the modality and semantic gaps between image-text pre-training and continuous sign video; and (3) achieving state-of-the-art results on four major benchmarks—Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500. CLIP-SLA attains superior recognition accuracy while updating only a small fraction of the model's parameters, improving both generalization and computational efficiency.
📝 Abstract
Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder of the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters. The effectiveness of the proposed frameworks is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperform several SOTA models with fewer trainable parameters. Extensive ablation studies highlight the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, paving the way for future advancements in sign language understanding.
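To make the PEFT idea concrete, below is a minimal NumPy sketch of the general LoRA mechanism that SLA-LoRA builds on: a frozen pre-trained linear projection augmented with a trainable low-rank update. The class name, dimensions, rank, and scaling are illustrative assumptions, not details from the paper; the actual method applies such modules inside the CLIP visual encoder.

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-style layer: the pre-trained weight W is frozen,
    and only the low-rank factors A and B are trained. All sizes here
    are hypothetical, chosen only to illustrate the parameter savings."""

    def __init__(self, d_in, d_out, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pre-trained weight (stands in for a CLIP ViT projection).
        self.W = rng.standard_normal((d_out, d_in))
        # Trainable down-projection, small random init.
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        # Trainable up-projection, initialized to zero so the layer
        # initially behaves exactly like the frozen model.
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + (alpha / r) * x A^T B^T
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

    def trainable_fraction(self):
        # Fraction of this layer's parameters that would be updated.
        trainable = self.A.size + self.B.size
        return trainable / (self.W.size + trainable)
```

With a ViT-like width of 768 and rank 4, the trainable factors account for roughly 1% of the layer's parameters, which illustrates why PEFT variants like SLA-Adapter and SLA-LoRA can fine-tune a large encoder cheaply.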