AI Summary
Fine-grained ambiguity in fashion item detection arises from high visual diversity and inter-subclass similarity. To address this, we propose Holi-DETR, an end-to-end holistic detection framework built upon the DETR architecture. Holi-DETR is the first to jointly model three heterogeneous contextual cues within DETR: cross-category co-occurrence patterns, relative spatial layouts, and geometric correlations between human keypoints and garments. It introduces a multi-source context encoding module, a keypoint-guided attention mechanism, and a joint relational modeling head, departing from the conventional paradigm of detecting each item independently. Evaluated on standard fashion detection benchmarks, Holi-DETR improves AP by 3.6 and 1.1 percentage points over the original DETR and Co-DETR, respectively, while significantly reducing category confusion. These results demonstrate that global, collaborative contextual modeling substantially enhances fine-grained fashion understanding.
Abstract
Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically by leveraging contextual information. Fashion items often have meaningful relationships, as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body keypoints. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed method improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percentage points (pp) and 1.1 pp, respectively, in terms of average precision (AP).
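To make the three contextual cues concrete, the sketch below shows one plausible way to compute each of them from annotations. It is an illustration only, not the Holi-DETR implementation: the function names (`cooccurrence_probs`, `relative_geometry`, `keypoint_offsets`), the `(cx, cy, w, h)` box format, and the log-ratio size encoding are all assumptions chosen for clarity, and the abstract does not specify how these features are actually encoded or fed into the transformer.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_probs(outfits):
    """Cue 1 (assumed form): estimate P(b present | a present)
    from a list of per-image category label lists."""
    single = Counter()
    pair = Counter()
    for labels in outfits:
        items = sorted(set(labels))  # count each category once per image
        single.update(items)
        for a, b in combinations(items, 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


def relative_geometry(box_a, box_b):
    """Cue 2 (assumed form): position and size of box_b relative to box_a.
    Boxes are (cx, cy, w, h); offsets are scaled by box_a's size and
    sizes are compared as log-ratios."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return (
        (bx - ax) / aw,
        (by - ay) / ah,
        math.log(bw / aw),
        math.log(bh / ah),
    )


def keypoint_offsets(box, keypoints):
    """Cue 3 (assumed form): offset of each body keypoint (x, y)
    from the box center, scaled by the box size."""
    cx, cy, w, h = box
    return [((kx - cx) / w, (ky - cy) / h) for kx, ky in keypoints]


# Toy data, invented for illustration.
outfits = [
    ["shirt", "jeans", "sneakers"],
    ["shirt", "skirt"],
    ["dress", "sneakers"],
    ["shirt", "jeans"],
]
probs = cooccurrence_probs(outfits)
shirt_box = (0.5, 0.3, 0.4, 0.2)
jeans_box = (0.5, 0.7, 0.4, 0.4)
geometry = relative_geometry(shirt_box, jeans_box)
offsets = keypoint_offsets((0.5, 0.5, 0.4, 0.2), [(0.5, 0.4)])
```

In this toy data, jeans never appear without a shirt, so `probs[("jeans", "shirt")]` is 1.0, whereas `probs[("shirt", "jeans")]` is 2/3. A holistic detector could use statistics of this kind to favor plausible category combinations over implausible ones, which is the intuition behind detecting items jointly rather than one at a time.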