SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in one-shot skeleton-based action recognition, where each novel action has only a single labeled exemplar: modeling the hierarchy of human motion and aligning skeleton features with action semantics. The authors propose SkelHCC, a unified framework that they present as the first to introduce hyperbolic geometry into this domain. SkelHCC captures the hierarchy among joints, body parts, and the full body through an Explicitly Hierarchical Hyperbolic CLIP module (EH-HCLIP), and performs training-free inference through an LLM-guided Multi-granularity Voting Cache (LMV-Cache). Experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD benchmarks show that SkelHCC outperforms existing methods in the one-shot setting, indicating strong cross-modal generalization.
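To make "training-free inference with a voting cache" concrete, here is a minimal sketch of a generic key-value cache classifier that votes across granularities. It is not the paper's LMV-Cache: the summary does not describe the voting rule, so the cosine similarity, the softmax temperature, the fixed per-granularity `weights` (standing in for whatever the LLM guidance supplies), and the names `cache_vote` and `l2_normalize` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def cache_vote(query, cache, weights, temperature=0.1):
    """Classify one query without any training step.

    query:   dict granularity -> (dim,) feature of the test sample
    cache:   dict granularity -> (num_classes, dim) one-shot exemplar features
    weights: dict granularity -> float vote weight (hypothetical, assumed given)
    """
    total = None
    for name, keys in cache.items():
        # Cosine similarity between the query and each cached exemplar.
        sims = l2_normalize(keys) @ l2_normalize(query[name])
        # Softmax turns similarities into a per-granularity class vote.
        logits = sims / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        vote = weights[name] * probs
        total = vote if total is None else total + vote
    return int(np.argmax(total)), total

# Toy data: 3 novel classes, one random exemplar per class and granularity.
rng = np.random.default_rng(0)
granularities = ("joint", "part", "body")
cache = {g: rng.normal(size=(3, 16)) for g in granularities}
weights = {"joint": 0.2, "part": 0.3, "body": 0.5}

# A query identical to the class-2 exemplar at every granularity.
query = {g: cache[g][2] for g in granularities}
pred, scores = cache_vote(query, cache, weights)
```

Because the cache is built directly from the single exemplar per class, adding a new action only means appending one row per granularity; no gradient updates are involved.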

📝 Abstract

Skeleton-based action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and compositional structure of human motion while aligning effectively with high-level action semantics under extreme data scarcity. Existing approaches, largely based on Euclidean embeddings and low-level motion cues, struggle to model the tree-like organization of skeleton data, limiting cross-modal alignment and generalization to unseen action categories. We propose SkelHCC, a unified skeleton hyperbolic CLIP-driven cache adaptation framework for one-shot skeleton-based action recognition. SkelHCC introduces an Explicitly Hierarchical Hyperbolic CLIP (EH-HCLIP) module that embeds skeleton sequences and action language into a shared hyperbolic space. By leveraging the negative curvature and exponential volume growth of hyperbolic geometry, EH-HCLIP naturally encodes the joint-part-body hierarchy of human anatomy and yields structurally consistent cross-modal representations. To support efficient one-shot adaptation, SkelHCC further integrates a training-free LLM-guided Multi-granularity Voting Cache (LMV-Cache) for context-aware inference. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD demonstrate that SkelHCC consistently outperforms state-of-the-art methods.
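The appeal of exponential volume growth can be seen with a small numerical sketch. The abstract does not say which model of hyperbolic space SkelHCC uses, so the code below assumes the Poincaré ball with curvature -1; the function names `poincare_distance` and `expmap0` are illustrative and not taken from the paper.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance in the Poincare ball (curvature -1, assumed here)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_gap = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x * x)) * (1.0 - np.sum(y * y))
    return float(np.arccosh(1.0 + 2.0 * sq_gap / max(denom, eps)))

def expmap0(v, eps=1e-9):
    """Map a Euclidean (tangent) vector at the origin into the open unit ball."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm < eps:
        return v.copy()
    return np.tanh(norm) * v / norm

# The same Euclidean gap of 0.1, measured near the origin and near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

In this model, distances stretch toward the boundary of the ball, so there is far more room there than near the origin. That is why tree-like data is often embedded with coarse concepts (for example, the full body) near the origin and fine-grained ones (individual joints) toward the boundary, although the exact layout SkelHCC learns is not stated in the abstract.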

Problem

Research questions and friction points this paper is trying to address.

one-shot action recognition

skeleton-based action recognition

hierarchical representation

cross-modal alignment

data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyperbolic Geometry

CLIP

One-shot Action Recognition