Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

📅 2026-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two critical limitations of current skeleton-based action representation learning: contrastive learning neglects local details, and Masked Autoencoders (MAE) are computationally inefficient, relying on heavy decoders and suffering a computational asymmetry between pre-training and downstream tasks. To overcome these issues, the authors propose SLiM, a novel framework that, for the first time, enables decoder-free masked modeling by unifying masked modeling and contrastive learning within a shared encoder, thereby compelling the encoder to learn discriminative features directly. SLiM introduces semantic tube masking and skeletal-aware augmentation strategies to mitigate the trivial reconstructions caused by high temporal correlation. The method achieves state-of-the-art performance across all downstream tasks while reducing inference computational cost by 7.89× compared to existing MAE-based approaches.

📝 Abstract
The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework to achieve decoder-free masked modeling for representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.
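The paper's code is not shown here, but the core intuition behind tube masking can be sketched: because skeleton motion is highly correlated across adjacent frames, masking joints independently per frame lets a model trivially interpolate them from temporal neighbors; masking the same joints for the whole sequence (a temporal "tube") removes that shortcut. The sketch below illustrates only this plain tube-masking idea, not the paper's full *semantic* grouping of joints; the function name, shape convention, and 60% ratio are assumptions for illustration.

```python
import numpy as np

def tube_mask(seq, mask_ratio=0.5, rng=None):
    """Hide the same joints across the entire temporal axis (tube masking).

    seq: array of shape (T, J, C) -- frames x joints x channels.
    Returns the visible joints, shape (T, J - n_masked, C), and the
    boolean mask of shape (T, J) marking hidden positions.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, J, C = seq.shape
    n_masked = int(round(J * mask_ratio))
    masked_joints = rng.choice(J, size=n_masked, replace=False)
    mask = np.zeros((T, J), dtype=bool)
    mask[:, masked_joints] = True  # identical joints hidden in every frame
    visible = seq[~mask].reshape(T, J - n_masked, C)
    return visible, mask

# Example: 64 frames, 25 joints (NTU-style skeleton), 3-D coordinates
seq = np.random.randn(64, 25, 3)
vis, mask = tube_mask(seq, mask_ratio=0.6)
print(vis.shape)  # (64, 10, 3)
```

Feeding only the visible tubes to the encoder is what makes masked pre-training cheap; SLiM's contribution is keeping that encoder decoder-free at inference as well.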
Problem

Research questions and friction points this paper is trying to address.

skeleton-based action representation
masked modeling
contrastive learning
computational asymmetry
decoder-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

decoder-free
masked modeling
skeleton representation learning
semantic tube masking
computational efficiency