NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID

📅 2025-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of cross-spectral feature alignment and weak fine-grained discrimination in multimodal object re-identification (ReID), this paper proposes NEXT, a text-modulated mixture-of-experts framework with multi-granularity modeling. Methodologically, NEXT decouples semantic and structural recognition into two specialized branches: a Text-Modulated Semantic Sampling Expert (TMSE) and a Context-Shared Structural Sensing Expert (CSSE). It further introduces attribute-confidence-driven multimodal image-text generation to enhance textual guidance quality and interpretability. Additionally, NEXT integrates a soft routing expert mechanism with Multimodal Feature Aggregation (MMFA) for adaptive feature fusion. Evaluated on RGB-IR and RGB-X benchmarks, NEXT achieves state-of-the-art performance: fine-grained identification accuracy is significantly improved; error rate on unseen modalities decreases by 32%; and cross-modal structural consistency increases by 41%.

Technology Category

Application Category

📝 Abstract
Multi-modal object re-identification (ReID) aims to extract identity features across heterogeneous spectral modalities to enable accurate recognition and retrieval in complex real-world scenarios. However, most existing methods rely on implicit feature fusion structures, making it difficult to model fine-grained recognition strategies under varying challenging conditions. Benefiting from the powerful semantic understanding capabilities of Multi-modal Large Language Models (MLLMs), the visual appearance of an object can be effectively translated into descriptive text. In this paper, we propose a reliable multi-modal caption generation method based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs in multi-modal semantic generation and improves the quality of generated text. Additionally, we propose a novel ReID framework NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural expert branches to separately capture modality-specific appearance and intrinsic structure. For semantic recognition, we propose the Text-Modulated Semantic-sampling Experts (TMSE), which leverages randomly sampled high-quality semantic texts to modulate expert-specific sampling of multi-modal features and mining intra-modality fine-grained semantic cues. Then, to recognize coarse-grained structure features, we propose the Context-Shared Structure-aware Experts (CSSE) that focuses on capturing the holistic object structure across modalities and maintains inter-modality structural consistency through a soft routing mechanism. Finally, we propose the Multi-Modal Feature Aggregation (MMFA), which adopts a unified feature fusion strategy to simply and effectively integrate semantic and structural expert outputs into the final identity representations.
Problem

Research questions and friction points this paper is trying to address.

Improves multi-modal object ReID via text-modulated semantic and structural experts
Reduces unknown recognition rate in multi-modal semantic generation
Enhances feature fusion for accurate cross-modality identity matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-grained Mixture of Experts via Text-Modulation
Text-Modulated Semantic-sampling Experts (TMSE)
Context-Shared Structure-aware Experts (CSSE)
🔎 Similar Papers
No similar papers found.