CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of transferring large pre-trained vision encoders to continuous sign language recognition (CSLR), where full fine-tuning is costly and cross-task generalization is limited. The authors propose CLIP-SLA, a parameter-efficient adaptation framework that transfers the CLIP visual encoder to CSLR. Two lightweight variants, SLA-Adapter and SLA-LoRA, integrate PEFT modules into the CLIP visual encoder so that only a small fraction of the parameters is updated during training. Both variants are validated on four benchmarks: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where they outperform several state-of-the-art models with fewer trainable parameters. Extensive ablation studies show the approach remains effective with different vision-language models, highlighting the potential of adapting large-scale pre-trained models for scalable and efficient CSLR.
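The paper's adapter variant is not detailed here, so as a minimal sketch of the general bottleneck-adapter idea (module names, the bottleneck width, and the near-zero initialization are illustrative assumptions, not the paper's SLA-Adapter code), such a module down-projects frozen CLIP features, applies a non-linearity, and up-projects with a residual connection:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter (illustrative, not the paper's SLA-Adapter).

    Down-project, non-linearity, up-project, residual connection. Only these
    weights are trained; the surrounding CLIP encoder stays frozen.
    """
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Near-zero init so the adapter starts as (almost) an identity map
        # and the pre-trained CLIP features pass through unchanged at first.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

The residual form means a badly initialized adapter cannot destroy the frozen features, which is the usual reason bottleneck adapters train stably on top of large pre-trained encoders.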

📝 Abstract
Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder of the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters. The effectiveness of the proposed frameworks is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperformed several SOTA models with fewer trainable parameters. Extensive ablation studies emphasize the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, paving the way for future advancements in sign language understanding.
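The abstract's central claim is fine-tuning with minimal trainable parameters. A minimal sketch of how such a PEFT setup is typically wired, assuming the inserted modules carry "adapter" in their parameter names (a naming convention chosen here for illustration, not taken from the paper):

```python
import torch.nn as nn

def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Freeze everything, then re-enable gradients only for parameters
    whose names contain 'adapter' (illustrative naming convention)."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that will receive gradient updates."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total
```

Reporting `trainable_fraction` is the standard way PEFT papers quantify "fewer trainable parameters" relative to full fine-tuning.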
Problem

Research questions and friction points this paper is trying to address.

How to adapt large pre-trained vision-language models such as CLIP to continuous sign language recognition
Full fine-tuning of large visual encoders is computationally expensive, motivating parameter-efficient alternatives
Whether a parameter-efficient adaptation generalizes across multiple sign language datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-efficient fine-tuning (PEFT) modules integrated into the frozen CLIP visual encoder
Two variants, SLA-Adapter and SLA-LoRA, differing in how trainable modules are injected (a generic LoRA sketch follows this list)
Competitive recognition accuracy with fewer trainable parameters by reusing pre-trained visual features
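As a companion to the SLA-LoRA bullet above, here is a generic LoRA wrapper in the standard low-rank-update formulation; the rank and scaling defaults are illustrative, not the paper's SLA-LoRA settings:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are the only trained weights.
    Generic LoRA sketch, not the paper's SLA-LoRA implementation.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original CLIP weights stay frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)  # B = 0, so training starts at the base layer
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))
```

In a ViT-style CLIP encoder, wrappers like this are typically applied to the attention projection layers, which keeps the trainable parameter count a small fraction of the backbone.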