Mixed-Modality Dual Face-Hair Retrieval

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work introduces a dual-reference face-hairstyle retrieval task, in which facial identity and hairstyle must be retrieved jointly even though the two attributes are semantically independent and may come from different modalities. The proposed method builds a unified embedding space that fuses the heterogeneous inputs through feature disentanglement, token injection, and cross-modal semantic alignment, trained with multi-view supervision. The authors present this as the first mixed-modality face-hairstyle retrieval paradigm and introduce DFHR-Bench, a benchmark of over 180,000 annotated triplets. The framework accepts the hairstyle reference as either an image or text (dual-image and image-text query settings) and is evaluated on the new benchmark, with the aim of advancing identity-aware, attribute-controllable cross-modal retrieval.

📝 Abstract

We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes (identity and hairstyle) originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.
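
To make the dual-reference query concrete, the following is a minimal sketch of retrieval in a shared embedding space. It is not the authors' MFHC: the abstract does not specify the combiner, so a simple weighted sum stands in for the learned fusion. The names `DualQuery`, `compose`, `retrieve`, and `hair_weight` are hypothetical, and the embeddings are assumed to come from encoders that already map face images, hairstyle images, and hairstyle text into the same space.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


@dataclass
class DualQuery:
    # Hypothetical container: both fields are assumed to be encoder outputs
    # that already live in one shared embedding space.
    identity: Vector   # embedding of the face image (who)
    hairstyle: Vector  # embedding of the hair reference, image or text (which style)


def compose(query: DualQuery, hair_weight: float = 0.5) -> List[float]:
    """Placeholder fusion: weighted sum of the two normalized embeddings.

    Assumption: this stands in for a learned combiner and is not the
    token-injection mechanism described in the paper.
    """
    if not 0.0 <= hair_weight <= 1.0:
        raise ValueError("hair_weight must be in [0, 1]")
    if len(query.identity) != len(query.hairstyle):
        raise ValueError("identity and hairstyle embeddings must share a dimension")
    ident = l2_normalize(query.identity)
    hair = l2_normalize(query.hairstyle)
    mixed = [(1.0 - hair_weight) * i + hair_weight * h for i, h in zip(ident, hair)]
    return l2_normalize(mixed)


def retrieve(query: DualQuery, gallery: Sequence[Vector], top_k: int = 1) -> List[int]:
    """Return indices of the top_k gallery embeddings, most similar first."""
    q = compose(query)
    ranked = sorted(range(len(gallery)), key=lambda i: cosine(q, gallery[i]), reverse=True)
    return ranked[:top_k]


# Toy example with hand-made 4-dimensional embeddings.
toy_query = DualQuery(identity=[1.0, 0.0, 0.0, 0.0], hairstyle=[0.0, 0.0, 1.0, 0.0])
toy_gallery = [
    [1.0, 0.0, 0.0, 0.0],  # matches identity only
    [0.0, 0.0, 1.0, 0.0],  # matches hairstyle only
    [1.0, 0.0, 1.0, 0.0],  # matches both references
    [0.0, 1.0, 0.0, 1.0],  # matches neither
]
print(retrieve(toy_query, toy_gallery, top_k=1))
```

In the toy gallery, the entry that matches both references ranks first, and the entry that matches neither ranks last. Because the hairstyle embedding is treated the same way whether it came from an image or from text, this sketch covers both the dual-image and image-text settings only under the stated shared-space assumption.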

Problem

Research questions and friction points this paper is trying to address.

mixed-modality

face-hair retrieval

dual-reference

cross-modal alignment

attribute disentanglement

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-modality retrieval

face-hair disentanglement

cross-modal alignment