🤖 AI Summary
Group ReID aims to match pedestrian groups across camera views, but faces fundamental challenges due to dynamic variations in group size and spatial configuration—conditions under which existing deterministic models exhibit poor generalization to unseen group structures. To address this, we propose the first CLIP-based framework for Group ReID, introducing a novel uncertainty-aware modeling paradigm. Specifically, we design a Bernoulli-distribution-driven member variation simulation module to explicitly model stochastic member occlusion or absence; an identity-aware uncertain text generator that adaptively describes group composition and topology; and a group-relation-guided individual feature enhancement encoder. These components are jointly optimized via cross-modal contrastive learning to align visual group representations with generated textual descriptions. Extensive experiments demonstrate significant improvements over state-of-the-art methods on multiple benchmarks, particularly under unknown group configurations—achieving superior robustness, generalization, and identification accuracy.
📝 Abstract
Group Re-Identification (Group ReID) aims to match groups of pedestrians across non-overlapping cameras. Unlike single-person ReID, Group ReID focuses more on changes in group structure, emphasizing the number of members and their spatial arrangement. However, most methods rely on certainty-based models, which consider only the specific group structures present in the group images and often fail to match unseen group configurations. To this end, we propose a novel Group-CLIP Uncertainty Modeling (GCUM) approach that adapts group text descriptions to accommodate undetermined member and layout variations. Specifically, we design a Member Variant Simulation (MVS) module that simulates member exclusions using a Bernoulli distribution, and a Group Layout Adaptation (GLA) module that generates uncertain group text descriptions with identity-specific tokens. In addition, we design a Group Relationship Construction Encoder (GRCE) that uses group features to refine individual features, and employ a cross-modal contrastive loss to obtain generalizable knowledge from group text descriptions. It is worth noting that we are the first to apply CLIP to Group ReID, and extensive experiments show that GCUM significantly outperforms state-of-the-art Group ReID methods.
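To make the MVS idea concrete, here is a minimal sketch of Bernoulli-driven member exclusion: each member is independently kept with some probability, simulating occlusion or absence, while at least one member is always retained so the simulated group stays non-empty. The function name, the `keep_prob` hyperparameter, and the non-empty guarantee are illustrative assumptions, not the paper's actual implementation.

```python
import random

def simulate_member_variant(member_ids, keep_prob=0.7, rng=None):
    """Hypothetical sketch of Bernoulli member-variation simulation.

    Each member survives an independent Bernoulli(keep_prob) trial,
    modeling stochastic occlusion or absence. If every member is
    dropped, one is re-sampled so the group remains non-empty.
    """
    rng = rng or random.Random()
    kept = [m for m in member_ids if rng.random() < keep_prob]
    if not kept:  # assumed safeguard: never emit an empty group
        kept = [rng.choice(member_ids)]
    return kept

# Draw several simulated variants of a 4-member group
rng = random.Random(0)
group = ["p1", "p2", "p3", "p4"]
variants = [simulate_member_variant(group, 0.7, rng) for _ in range(5)]
```

Training on many such variants of the same group, rather than only the observed configuration, is what would expose the model to unseen member combinations.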