Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information

📅 2025-12-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Fine-grained ambiguity in fashion item detection arises from high visual diversity and inter-subclass similarity. To address this, we propose Holi-DETR, an end-to-end holistic detection framework built upon the DETR architecture. Holi-DETR is the first to jointly model three heterogeneous contextual cues within DETR: cross-category co-occurrence patterns, relative spatial layouts, and geometric correlations between human keypoints and garments. It introduces a multi-source context encoding module, a keypoint-guided attention mechanism, and a joint relational modeling head, departing from conventional paradigms that detect each item independently. Evaluated on standard fashion detection benchmarks, Holi-DETR improves average precision (AP) by 3.6 and 1.1 percentage points over the original DETR and Co-DETR, respectively, while reducing category confusion. These results indicate that global, collaborative contextual modeling enhances fine-grained fashion understanding.

📝 Abstract

Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically by leveraging contextual information. Fashion items often have meaningful relationships because they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body keypoints. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its successors. In experiments, the proposed methods improved the average precision (AP) of the vanilla DETR and the more recently developed Co-DETR by 3.6 percentage points (pp) and 1.1 pp, respectively.
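To make the first contextual cue concrete, the following is a minimal sketch of how a co-occurrence statistic between item categories could be estimated from annotated outfits. It is an illustration only, not the paper's implementation: the function name `cooccurrence_probabilities`, the conditional-probability definition, and the toy outfits are all assumptions introduced here.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probabilities(outfits):
    """Estimate P(b present | a present) from per-image category sets.

    `outfits` is an iterable of label collections, one per image.
    This definition is an assumption for illustration; the paper may
    define its co-occurrence relationship differently.
    """
    single = Counter()  # number of images containing each category
    pair = Counter()    # number of images containing each ordered pair
    for labels in outfits:
        present = set(labels)
        single.update(present)
        pair.update(permutations(sorted(present), 2))
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


# Toy annotations (hypothetical, not from any benchmark).
outfits = [
    {"shirt", "jeans", "sneakers"},
    {"shirt", "skirt"},
    {"dress", "heels"},
    {"shirt", "jeans"},
]
probs = cooccurrence_probabilities(outfits)
print(probs[("shirt", "jeans")])  # share of shirt images that also contain jeans
```

A table like this could act as a prior that makes an ambiguous detection more or less plausible given the other items in the same image; pairs that never appear together are simply absent from the dictionary.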

Problem

Research questions and friction points this paper is trying to address.

Detects fashion items holistically using contextual relationships

Reduces ambiguities in fashion item detection via co-occurrence and spatial cues

Integrates heterogeneous contextual information into transformer-based detection models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses co-occurrence relationships between fashion items

Leverages inter-item spatial arrangements for positioning

Incorporates spatial relationships with human body keypoints
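The two spatial cues listed above can be illustrated with a short sketch. It assumes boxes in normalized `(cx, cy, w, h)` format and uses a scale-normalized offset plus log size ratio, a common pairwise geometry encoding in relation-based detectors; whether Holi-DETR uses this exact form is not stated here, and the helper names `relative_geometry` and `keypoint_offset` are hypothetical.

```python
import math


def relative_geometry(box_i, box_j):
    """Relative position and size of box_j with respect to box_i.

    Boxes are (cx, cy, w, h). Offsets are divided by box_i's size and
    sizes are compared as log ratios (an assumed encoding).
    """
    cx_i, cy_i, w_i, h_i = box_i
    cx_j, cy_j, w_j, h_j = box_j
    return (
        (cx_j - cx_i) / w_i,
        (cy_j - cy_i) / h_i,
        math.log(w_j / w_i),
        math.log(h_j / h_i),
    )


def keypoint_offset(box, keypoint):
    """Offset of a body keypoint from the box center, in box-size units."""
    cx, cy, w, h = box
    kx, ky = keypoint
    return ((kx - cx) / w, (ky - cy) / h)


# Hypothetical normalized coordinates for one outfit image.
shirt = (0.5, 0.35, 0.4, 0.3)
jeans = (0.5, 0.70, 0.4, 0.4)
left_hip = (0.42, 0.55)

rel = relative_geometry(shirt, jeans)   # jeans sit below the shirt
off = keypoint_offset(jeans, left_hip)  # hip lies in the upper part of the jeans box
print(rel, off)
```

Features like these are invariant to image scale, so the same "trousers below top" or "hip near the top of the trousers" pattern yields similar values across images of different sizes.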

🔎 Similar Papers

Content and Salient Semantics Collaboration for Cloth-Changing Person Re-Identification