Mixed-Modality Dual Face-Hair Retrieval

📅 2026-06-02

🤖 AI Summary
This work introduces a dual-reference face-hairstyle retrieval task in which a query must jointly specify facial identity and hairstyle, even though the two attributes are semantically independent and may come from different modalities. The proposed method builds a unified embedding space and fuses disentangled identity and hairstyle features through token injection and cross-modal semantic alignment, trained with multi-view supervision. The paper also introduces DFHR-Bench, presented as the first benchmark for mixed-modality face-hairstyle retrieval, comprising over 180,000 annotated triplets. The framework accepts the hairstyle reference as either an image or text (the dual-image and image-text settings) and demonstrates its effectiveness on the newly curated benchmark, thereby advancing identity-aware and attribute-controllable cross-modal retrieval.
📝 Abstract
We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes -- identity and hairstyle -- originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.
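To make the dual-reference setup concrete, here is a minimal sketch of how a query built from two references might be scored against a gallery in a shared embedding space. It is not MFHC: the token-injection fusion is replaced by a plain weighted sum, and the function names, the `alpha` weight, and the 4-dimensional toy embeddings are all assumptions made for illustration. A text hairstyle reference would enter the same way, assuming some encoder maps it into the same space.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def compose_query(identity_emb, hair_emb, alpha=0.5):
    """Hypothetical stand-in for fusion: weighted sum of the two unit vectors.

    alpha weights the identity reference; (1 - alpha) weights the hairstyle
    reference. The paper's MFHC uses token injection instead.
    """
    if len(identity_emb) != len(hair_emb):
        raise ValueError("embeddings must share one dimensionality")
    ident = l2_normalize(identity_emb)
    hair = l2_normalize(hair_emb)
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(ident, hair)]
    return l2_normalize(mixed)


def rank_gallery(query, gallery):
    """Return gallery keys sorted by cosine similarity to the query, best first."""
    scores = {
        name: sum(q * g for q, g in zip(query, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumed): dims 0-1 encode identity, dims 2-3 encode hairstyle.
identity_ref = [1.0, 0.0, 0.0, 0.0]   # face image embedding
hair_ref = [0.0, 0.0, 1.0, 0.0]       # hairstyle image or text embedding
gallery = {
    "same_id_same_hair": [1.0, 0.0, 1.0, 0.0],
    "same_id_other_hair": [1.0, 0.0, 0.0, 1.0],
    "other_id_same_hair": [0.0, 1.0, 1.0, 0.0],
    "other_id_other_hair": [0.0, 1.0, 0.0, 1.0],
}

query = compose_query(identity_ref, hair_ref)
ranking = rank_gallery(query, gallery)
print(ranking)
```

In this toy gallery, the item matching both references ranks first and the item matching neither ranks last, which is the behavior a dual-reference retriever is meant to produce.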
Problem

Research questions and friction points this paper is trying to address.

mixed-modality
face-hair retrieval
dual-reference
cross-modal alignment
attribute disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-modality retrieval
face-hair disentanglement
cross-modal alignment
dual-reference retrieval
multimodal embedding
Quoc-Anh Bui-Huynh
Vietnam National University, Ho Chi Minh City, Vietnam; University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Mai-Tuyen Lam
Vietnam National University, Ho Chi Minh City, Vietnam; University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Dai-Anh-Tuan Nguyen
Vietnam National University, Ho Chi Minh City, Vietnam; University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Thanh Duc Ngo
University of Information Technology, Vietnam National University Ho Chi Minh City, Vietnam
Computer Vision
Multimedia Content Analysis