STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

📅 2026-03-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high parameter count of existing keypoint-based methods for continuous sign language recognition, which typically pair a spatial module (graph convolution or attention) with a separate 1D temporal convolution. The authors propose a unified spatio-temporal attention network that jointly captures spatial relationships among keypoints and their temporal dynamics within local time windows, yielding local context-aware spatio-temporal representations. Replacing the conventional two-stage combination with a single attention mechanism substantially reduces model size. Evaluated on the Phoenix-14T dataset, the proposed method achieves performance comparable to state-of-the-art keypoint-based approaches with an encoder that has approximately 70–80% fewer parameters.

📝 Abstract

Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70–80% fewer parameters than existing state-of-the-art models while achieving performance comparable to keypoint-based methods on the Phoenix-14T dataset.
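
The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `local_spatiotemporal_attention`, the single-head formulation, the window parameter `window`, the projection matrices `w_q`, `w_k`, `w_v`, and the choice to let each keypoint in frame `t` attend to all keypoints in frames `t - window` through `t + window` are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_spatiotemporal_attention(x, w_q, w_k, w_v, window=1):
    """Hypothetical single-head sketch (assumption, not the paper's code).

    x: (T, K, D) array of T frames, K keypoints, D features per keypoint.
    Each keypoint in frame t attends to every keypoint in the frames
    [t - window, t + window], clipped at the sequence boundaries.
    Returns an array of shape (T, K, D_v).
    """
    T, K, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.empty((T, K, v.shape[-1]))
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # Flatten the local window into one set of spatio-temporal tokens.
        k_win = k[lo:hi].reshape(-1, k.shape[-1])  # ((hi - lo) * K, D_k)
        v_win = v[lo:hi].reshape(-1, v.shape[-1])  # ((hi - lo) * K, D_v)
        scores = (q[t] @ k_win.T) * scale          # (K, (hi - lo) * K)
        attn = softmax(scores, axis=-1)            # joint over space and time
        out[t] = attn @ v_win                      # aggregate features
    return out


# Toy input: 6 frames, 5 keypoints, 8-dimensional features (all made up).
rng = np.random.default_rng(0)
T, K, D = 6, 5, 8
x = rng.normal(size=(T, K, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
y = local_spatiotemporal_attention(x, w_q, w_k, w_v, window=1)
print(y.shape)
```

In this sketch a single softmax runs over all keypoints of all frames in the window, so spatial and temporal context are mixed in one step and the only learned weights are the three projection matrices. That is one plausible reading of "unified" attention; the actual encoder may differ in head count, windowing, and normalization.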

Problem

Research questions and friction points this paper is trying to address.

Continuous Sign Language Recognition

Keypoints

Spatio-Temporal Modeling

Model Complexity

Parameter Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-Temporal Attention

Keypoint Representation

Parameter Efficiency