🤖 AI Summary
To address severe background interference and the difficulty of ensuring cross-modal consistency in multi-modal person re-identification (ReID), this paper proposes a Selective Interaction and Global-Local Alignment framework. Methodologically, it introduces a Selective Interaction Module (SIM) that filters salient image patch tokens and enhances class-token interactions to suppress background noise and improve feature discriminability. Additionally, it designs a Global Alignment Module (GAM) and a Local Alignment Module (LAM) within a 3D Gramian space, jointly leveraging self-attention mechanisms and geometric constraints to achieve cross-modal global semantic consistency and local structural alignment. Extensive experiments demonstrate significant improvements over state-of-the-art methods on three challenging benchmarks—RGBNT201, RGBNT100, and MSVR310—validating both the effectiveness and generalizability of the proposed approach.
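The token-selection idea behind SIM can be illustrated with a minimal sketch: score each patch token, keep only the top-k, and let those interact with the class token. The scoring here (dot product with the class token) and the function name are illustrative assumptions; the paper's module combines intra- and inter-modal cues that the summary does not specify.

```python
import numpy as np

def select_salient_patches(patch_tokens, cls_token, k):
    """Pick the k patch tokens most relevant to the class token.

    A stand-in for SIM's scoring: saliency here is just the dot
    product with the class token (the actual module also uses
    inter-modal information).
    """
    scores = patch_tokens @ cls_token       # (num_patches,) relevance scores
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return patch_tokens[top_idx], top_idx

# Toy example: 5 patch tokens of dimension 3; patch 2 aligns with the class token.
patches = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.5, 0.5, 0.0],
                    [0.2, 0.2, 0.2]])
cls = np.array([0.0, 0.0, 1.0])
selected, idx = select_salient_patches(patches, cls, k=2)
print(idx[0])  # 2 — the patch most similar to the class token
```

Only the selected tokens would then attend with the class token, which is how background patches get suppressed before fusion.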
📝 Abstract
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by exploiting complementary multi-modal image information. Existing methods mainly concentrate on fusing multi-modal features yet neglect background interference. Moreover, current multi-modal fusion methods often align modality pairs but struggle to enforce consistency across all modalities simultaneously. To address these issues, we propose Signal, a novel selective interaction and global-local alignment framework for multi-modal object ReID. Specifically, we first propose a Selective Interaction Module (SIM) that selects important patch tokens using intra-modal and inter-modal information. These selected patch tokens then interact with the class tokens, yielding more discriminative features. Next, we propose a Global Alignment Module (GAM) that aligns all modalities simultaneously by minimizing the volume of 3D polyhedra in the Gramian space. Meanwhile, we propose a Local Alignment Module (LAM) that aligns local features in a shift-aware manner. With these modules, our framework extracts more discriminative features for object ReID. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100, MSVR310) validate the effectiveness of our method. The source code is available at https://github.com/010129/Signal.
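The abstract does not give GAM's exact loss, but the "volume of 3D polyhedra in the Gramian space" idea can be sketched as the volume of the parallelepiped spanned by three modality embeddings, computed from their 3×3 Gram matrix: the volume shrinks to zero as the modality features become linearly aligned. The function below is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def gram_volume(r, n, t):
    """Volume of the parallelepiped spanned by three modality embeddings.

    V = sqrt(det(G)), where G is the 3x3 Gram matrix of pairwise inner
    products. Minimizing this volume drives the three embeddings (e.g.,
    RGB, NIR, TIR class tokens) toward mutual alignment.
    """
    V = np.stack([r, n, t])                      # (3, d) rows = modalities
    G = V @ V.T                                  # 3x3 Gram matrix
    return float(np.sqrt(max(np.linalg.det(G), 0.0)))  # clamp tiny negatives

# Identical embeddings span zero volume -> fully aligned.
v = np.ones(4)
print(gram_volume(v, v, v) < 1e-6)   # True

# Orthogonal unit vectors span the unit cube -> volume 1.
e1, e2, e3 = np.eye(3)
print(round(gram_volume(e1, e2, e3), 6))  # 1.0
```

Used as a loss over the three modality features of the same identity, this single scalar aligns all modalities jointly rather than pair by pair, which is the consistency issue the abstract highlights.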