Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information

📅 2025-12-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Fine-grained ambiguity in fashion item detection arises from high visual diversity and inter-subclass similarity. To address this, we propose Holi-DETR, an end-to-end holistic detection framework built upon the DETR architecture. Holi-DETR is the first to jointly model three heterogeneous contextual cues within DETR: cross-category co-occurrence patterns, relative spatial layouts, and geometric correlations between human keypoints and garments. It introduces a multi-source context encoding module, a keypoint-guided attention mechanism, and a joint relational modeling head, departing from conventional independent single-item detection paradigms. Evaluated on standard fashion detection benchmarks, Holi-DETR achieves absolute AP improvements of +3.6 and +1.1 over the original DETR and Co-DETR, respectively, while significantly reducing category confusion. These results demonstrate that global, collaborative contextual modeling substantially enhances fine-grained fashion understanding.

📝 Abstract
Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body keypoints. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percentage points (pp) and 1.1 pp, respectively, in terms of average precision (AP).
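
As a rough illustration of the first cue, co-occurrence statistics can be estimated directly from per-image category annotations. The sketch below is not the paper's implementation: the function name `cooccurrence_probabilities`, the toy outfits, and the choice of a conditional probability P(b | a) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probabilities(outfits):
    """Estimate P(b present | a present) from per-image category lists.

    `outfits` is an iterable of category-name lists, one list per image.
    This is a hypothetical helper, not an API from the paper.
    """
    single = Counter()  # number of images containing each category
    pair = Counter()    # number of images containing each ordered pair
    for labels in outfits:
        cats = set(labels)  # count each category once per image
        single.update(cats)
        for a, b in combinations(sorted(cats), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


# Toy annotations invented for this example (not from any benchmark).
outfits = [
    ["shirt", "jeans", "sneakers"],
    ["shirt", "skirt"],
    ["dress", "sneakers"],
    ["shirt", "jeans"],
]
probs = cooccurrence_probabilities(outfits)
print(probs[("jeans", "shirt")])  # P(shirt | jeans)
print(probs[("shirt", "jeans")])  # P(jeans | shirt)
```

In this toy data every image with jeans also contains a shirt, so P(shirt | jeans) is 1, while only two of the three shirt images contain jeans. How such statistics enter the DETR decoder is not described in this summary, so the sketch stops at the statistics themselves.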
Problem

Research questions and friction points this paper is trying to address.

Detects fashion items holistically using contextual relationships
Reduces ambiguities in fashion item detection via co-occurrence and spatial cues
Integrates heterogeneous contextual information into transformer-based detection models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses co-occurrence relationships between fashion items
Leverages inter-item spatial arrangements for positioning
Incorporates spatial relationships with human body keypoints
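
The second and third cues are geometric, and one plausible way to express them is as scale-normalized offsets. The following sketch assumes boxes in (cx, cy, w, h) format and uses a log-ratio size encoding common in box regression. The function names, the keypoint name, and all coordinates are hypothetical and do not come from the paper.

```python
import math


def relative_geometry(box_a, box_b):
    """Position and size of box_b relative to box_a.

    Boxes are (cx, cy, w, h). Returns (dx, dy, log_w_ratio, log_h_ratio),
    with offsets normalized by the width and height of box_a.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return (
        (bx - ax) / aw,
        (by - ay) / ah,
        math.log(bw / aw),
        math.log(bh / ah),
    )


def keypoint_offsets(box, keypoints):
    """Offsets of body keypoints from a box center, normalized by box size."""
    cx, cy, w, h = box
    return {
        name: ((kx - cx) / w, (ky - cy) / h)
        for name, (kx, ky) in keypoints.items()
    }


# Illustrative coordinates in normalized image space (assumed values).
top = (0.5, 0.3, 0.4, 0.3)
pants = (0.5, 0.65, 0.3, 0.4)
rel = relative_geometry(top, pants)
offsets = keypoint_offsets(top, {"left_shoulder": (0.35, 0.2)})
print(rel)
print(offsets)
```

Here a positive `dy` indicates that the pants lie below the top, and the keypoint offsets place the shoulder in the upper-left part of the top's box. Features of this kind could be fed to a detector as additional context, though the exact encoding used by Holi-DETR is not given in this summary.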
Youngchae Kwon
Department of CSEE, Handong Global University General Graduate School, 558 Handong-ro Buk-gu, Pohang, 37554, Gyeongbuk, Republic of Korea.
Jinyoung Choi
Department of CSEE, Handong Global University General Graduate School, 558 Handong-ro Buk-gu, Pohang, 37554, Gyeongbuk, Republic of Korea.
Injung Kim
Professor, Handong Global University
AI, deep learning, image analysis and synthesis, speech synthesis, smart factory