🤖 AI Summary
Fine-grained fashion style classification faces two challenges: large intra-class variation and high inter-class visual similarity. To address them, the paper proposes an item region-based fashion style classification network (IRSN). First, Item Region Pooling (IRP) explicitly extracts local features of individual garment regions, which are analyzed alongside the global appearance. Second, a Gated Feature Fusion (GFF) module adaptively combines the global and item-level features, and a dual-backbone architecture pairs a domain-specific feature extractor with a general backbone pre-trained on a large-scale image-text dataset. Applied to six widely used backbones (including EfficientNet, ConvNeXt, and Swin Transformer), IRSN improves classification accuracy by an average of 6.9% on FashionStyle14 and 7.6% on ShowniqV3, with a maximum gain of 15.1%. Visualization analysis further confirms its superior ability to distinguish highly similar style categories.
📝 Abstract
Fashion style classification is a challenging task because of the large visual variation within the same style and the existence of visually similar styles.
Styles are expressed not only by the global appearance, but also by the attributes of individual items and their combinations.
In this study, we propose an item region-based fashion style classification network (IRSN) to effectively classify fashion styles by analyzing item-specific features and their combinations in addition to global features.
IRSN extracts features of each item region using item region pooling (IRP), analyzes them separately, and combines them using gated feature fusion (GFF).
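The IRP/GFF pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' exact design: the use of average pooling inside each box, the sigmoid gate parameterization, and all function and variable names are assumptions.

```python
import numpy as np

def item_region_pooling(feature_map, boxes):
    """Average-pool the backbone feature map inside each item bounding box.

    feature_map: (C, H, W) array; boxes: (y0, x0, y1, x1) tuples in
    feature-map coordinates. Returns one (C,) feature vector per item region.
    """
    return [feature_map[:, y0:y1, x0:x1].mean(axis=(1, 2))
            for (y0, x0, y1, x1) in boxes]

def gated_feature_fusion(global_feat, item_feats, W_g, b_g):
    """Fuse global and item features with a learned sigmoid gate.

    For each item feature, the gate decides per channel how much that
    item contributes relative to the global appearance feature.
    """
    fused = global_feat.copy()
    for f in item_feats:
        gate = 1.0 / (1.0 + np.exp(-(W_g @ np.concatenate([global_feat, f]) + b_g)))
        fused = fused + gate * f
    return fused

rng = np.random.default_rng(0)
fmap = rng.normal(size=(256, 14, 14))          # backbone feature map (C, H, W)
boxes = [(0, 0, 7, 14), (7, 0, 14, 14)]        # e.g. top and bottom garment regions
item_feats = item_region_pooling(fmap, boxes)  # one 256-d vector per item
global_feat = fmap.mean(axis=(1, 2))           # global appearance feature
W_g = rng.normal(size=(256, 512)) * 0.01       # illustrative gate weights
b_g = np.zeros(256)
style_feat = gated_feature_fusion(global_feat, item_feats, W_g, b_g)
```

The fused `style_feat` would then be fed to a style classifier; in the paper the gate weights are learned end to end rather than drawn at random.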
In addition, we improve the feature extractor by applying a dual-backbone architecture that combines a domain-specific feature extractor with a general feature extractor pre-trained on a large-scale image-text dataset.
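A minimal sketch of the dual-backbone idea: the same image is passed through a domain-specific extractor and a general pre-trained extractor, and the two feature views are fused before classification. The random projections below merely stand in for real backbones; all dimensions and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the two extractors: in practice these would be a
# fashion-domain network and a general backbone pre-trained on
# large-scale image-text data (hypothetical dims, not the authors').
W_dom = rng.normal(size=(512, 3 * 32 * 32)) * 1e-3
W_gen = rng.normal(size=(512, 3 * 32 * 32)) * 1e-3

def domain_features(x):
    return np.tanh(W_dom @ x)   # domain-specific 512-d feature

def general_features(x):
    return np.tanh(W_gen @ x)   # general-purpose 512-d feature

def dual_backbone_features(x):
    # Concatenate the two complementary views of the same image.
    return np.concatenate([domain_features(x), general_features(x)])

image = rng.normal(size=(3 * 32 * 32,))   # flattened toy RGB image
feat = dual_backbone_features(image)      # 1024-d fused feature
num_styles = 14                           # e.g. the FashionStyle14 classes
W_cls = rng.normal(size=(num_styles, 1024)) * 1e-2
logits = W_cls @ feat                     # per-style scores
```

The design rationale is that the domain backbone encodes clothing-specific cues while the general backbone contributes broad visual semantics, so their concatenation is more discriminative than either alone.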
In experiments, applying IRSN to six widely used backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improved style classification accuracy by an average of 6.9% (maximum 14.5%) on the FashionStyle14 dataset and by an average of 7.6% (maximum 15.1%) on the ShowniqV3 dataset. Visualization analysis also shows that the IRSN models capture differences between similar style classes better than the baseline models.