STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

📅 2026-03-17
🤖 AI Summary
This work targets the high parameter count and computational cost of existing keypoint-based methods for continuous sign language recognition, which hinder efficient spatiotemporal modeling. The authors propose a unified spatiotemporal attention network that jointly captures spatial relationships among keypoints and their temporal dynamics within local time windows, yielding local context-aware spatiotemporal representations. This unified attention mechanism replaces the conventional combination of graph convolution (spatial) and 1D convolution (temporal), substantially reducing model complexity. On the Phoenix-14T dataset, the method performs comparably to state-of-the-art keypoint-based approaches while its encoder uses approximately 70–80% fewer parameters.

📝 Abstract
Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately $70\text{--}80\%$ fewer parameters than existing state-of-the-art models while achieving performance comparable to keypoint-based methods on the Phoenix-14T dataset.
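
The attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention, uses the raw keypoint features as queries, keys, and values (no learned projections, no multiple heads), and introduces a hypothetical `window` half-width parameter. Each keypoint in frame `t` attends to every keypoint in the frames `t - window` through `t + window`, so spatial and temporal context are aggregated in a single step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_st_attention(x, window):
    """Joint spatio-temporal attention over a local temporal window.

    x: nested list of shape [T][K][D] (frames, keypoints, feature dim).
    window: hypothetical half-width; frame t attends to frames
            t - window .. t + window, clipped at the sequence ends.
    Assumption: queries, keys, and values are the raw features.
    """
    T, K, D = len(x), len(x[0]), len(x[0][0])
    scale = 1.0 / math.sqrt(D)
    out = []
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # Context: every keypoint in every frame of the local window.
        context = [x[s][j] for s in range(lo, hi) for j in range(K)]
        frame_out = []
        for k in range(K):
            q = x[t][k]
            # One score per (frame, keypoint) pair in the window.
            scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
            weights = softmax(scores)
            # Attention-weighted sum of the context features.
            frame_out.append([
                sum(w * c[d] for w, c in zip(weights, context))
                for d in range(D)
            ])
        out.append(frame_out)
    return out
```

The output has the same `[T][K][D]` shape as the input. A practical encoder would add learned projections and batching in a tensor library; the nested-list form here only shows which positions attend to which.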

Problem

Research questions and friction points this paper is trying to address.

Continuous Sign Language Recognition
Keypoints
Spatio-Temporal Modeling
Model Complexity
Parameter Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-Temporal Attention
Keypoint Representation
Parameter Efficiency
Continuous Sign Language Recognition
Local Context-Aware Representation