Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces Sign Language Spotting, the task of determining, in an end-to-end manner, whether a given query sign is present in continuous sign language video—without relying on intermediate gloss recognition or text-based matching. The proposed method operates directly on human pose keypoint sequences and employs a lightweight encoder-only architecture coupled with a binary classification head to decide whether the query sign appears in the input sequence. By bypassing gloss-level recognition, the approach avoids error propagation and mitigates the visual noise inherent in conventional spotting pipelines, while significantly reducing computational overhead. Evaluated on the Word Presence Prediction dataset from the WSLP 2025 shared task, the model achieves 61.88% accuracy and a 60.00% F1-score—demonstrating, for the first time, the feasibility and effectiveness of purely pose-driven, end-to-end sign language spotting.
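The pipeline described above—pose keypoint sequences fed through an encoder-only backbone with a binary classification head—can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the module name `PoseSpotter`, the keypoint count, model dimensions, the learned separator token, and mean pooling are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class PoseSpotter(nn.Module):
    """Hypothetical sketch: encoder-only spotting over pose keypoints."""

    def __init__(self, num_keypoints=75, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        in_dim = num_keypoints * 2  # (x, y) per keypoint, flattened per frame
        self.embed = nn.Linear(in_dim, d_model)
        # Learned separator token between the query clip and the target sequence
        self.sep = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # binary: query sign present / absent

    def forward(self, query_pose, target_pose):
        # query_pose: (B, T_q, in_dim); target_pose: (B, T_t, in_dim)
        q = self.embed(query_pose)
        t = self.embed(target_pose)
        sep = self.sep.expand(q.size(0), -1, -1)
        x = torch.cat([q, sep, t], dim=1)             # joint sequence
        h = self.encoder(x)                           # encoder-only backbone
        return self.head(h.mean(dim=1)).squeeze(-1)   # pooled logit per pair

model = PoseSpotter()
query = torch.randn(2, 16, 150)   # 2 query clips, 16 frames, 75 keypoints x 2
target = torch.randn(2, 64, 150)  # 2 continuous sequences, 64 frames each
logits = model(query, target)
probs = torch.sigmoid(logits)     # probability that the query sign is present
```

Training such a model with a binary cross-entropy loss on (query, sequence, label) triples matches the word-presence formulation of the WSLP 2025 shared task; since only keypoints are consumed, no RGB frames ever enter the network.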

📝 Abstract
Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
Problem

Research questions and friction points this paper is trying to address.

How to detect a specific query sign within a continuous sign language sequence
Whether pose keypoints alone can support end-to-end sign language spotting
How to reduce computational cost by operating on pose representations instead of raw RGB frames
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end encoder architecture using pose keypoints
Binary classification head for query sign detection
Pose-based approach reducing computational cost and noise
Samuel Ebimobowei Johnny
Carnegie Mellon University Africa, Kigali, Rwanda
Blessed Guda
Carnegie Mellon University Africa, Kigali, Rwanda
Emmanuel Enejo Aaron
Carnegie Mellon University Africa, Kigali, Rwanda
Assane Gueye
Associate Teaching Professor
Carnegie Mellon University Africa