🤖 AI Summary
This work addresses cross-modal gait recognition, which is difficult because of the large modality gap between LiDAR and RGB cameras, by proposing TCFDNet. The method is the first to use a large language model to construct a gait textual dictionary, and it leverages CLIP to align visual and textual semantic features. A text-guided feature disentanglement module combines residual decomposition with orthogonality constraints to separate modality-shared from modality-specific representations. A feature stability enhancement mechanism and a cross-modal patch exchange strategy are also proposed to improve robustness and generalization. Experiments on the SUSTech1K and FreeGait datasets show state-of-the-art performance and support the effectiveness of the proposed approach.
📝 Abstract
Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait Recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the top-k matched textual descriptions to reconstruct modality-specific representations, and derives modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on the SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.
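The TFD step described in the abstract (top-k text matching, reconstruction of a modality-specific part, residual decomposition, orthogonality constraint) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax weighting over the selected descriptions, the temperature value, and the squared-cosine form of the orthogonality penalty are all assumptions made for illustration, and plain Python lists stand in for CLIP embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    n = math.sqrt(dot(v, v))
    return [x / (n + eps) for x in v]


def text_guided_disentangle(visual, text_bank, k=2, temperature=0.1):
    """Split one visual feature into (specific, shared) parts.

    Hypothetical illustration: the weighting scheme and names are assumed,
    not taken from the paper.
    """
    v = normalize(visual)
    bank = [normalize(t) for t in text_bank]
    # Cosine similarity between the visual feature and each text embedding.
    sims = [dot(v, t) for t in bank]
    # Indices of the k best-matching textual descriptions.
    topk = sorted(range(len(bank)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over the selected similarities gives mixing weights (assumed).
    m = max(sims[i] for i in topk)
    exps = [math.exp((sims[i] - m) / temperature) for i in topk]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Modality-specific part: weighted sum of the selected text embeddings.
    dim = len(v)
    specific = [
        sum(w * bank[i][d] for w, i in zip(weights, topk)) for d in range(dim)
    ]
    # Modality-shared part: residual after removing the specific part.
    shared = [v[d] - specific[d] for d in range(dim)]
    return specific, shared, topk


def orthogonality_loss(specific, shared):
    """Squared cosine similarity; zero when the two parts are orthogonal."""
    return dot(normalize(specific), normalize(shared)) ** 2


# Toy usage with a 3-dimensional feature and four text embeddings.
visual = [1.0, 0.0, 1.0]
text_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
specific, shared, topk = text_guided_disentangle(visual, text_bank, k=2)
loss = orthogonality_loss(specific, shared)
```

In this toy case the residual is parallel to the reconstructed specific part, so the penalty is close to its maximum; during training, a term of this kind would push the two parts toward orthogonality.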