🤖 AI Summary
Weakly supervised face parsing (WSFP) relies only on weak supervision, such as image-level labels and natural language descriptions, to reduce annotation costs; however, the high co-occurrence and visual similarity of facial components lead to ambiguous activations and degraded segmentation performance. To address this, we propose DisFaceRep, an explicit–implicit representation disentanglement framework: (1) a co-occurring component disentanglement strategy explicitly reduces dataset-level bias, and (2) a text-guided component disentanglement loss implicitly uses language supervision to separate facial components. The approach combines weakly supervised semantic segmentation, representation disentanglement, and multimodal supervision. Extensive experiments on CelebAMask-HQ, LaPa, and Helen show that DisFaceRep significantly outperforms existing weakly supervised semantic segmentation methods, supporting the effectiveness of its disentanglement mechanism.
📝 Abstract
Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. Existing methods rely on dense pixel-level annotations, which are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP poses unique challenges because facial components co-occur frequently and look visually similar, which leads to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy that explicitly reduces dataset-level bias, and a text-guided component disentanglement loss that implicitly guides component separation using language supervision. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
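
To make the co-occurrence problem concrete, the sketch below estimates how often one component appears given that another is present, using only image-level multi-hot labels. It is an illustration under stated assumptions, not the DisFaceRep method: the function name `cooccurrence_rates`, the label layout, and the toy data (including the hypothetical `rare_part` class) are invented for this example.

```python
def cooccurrence_rates(labels, names):
    """Estimate P(b present | a present) for every ordered pair (a, b).

    labels: list of multi-hot rows, one per image (1 = component present).
    names:  component names, aligned with the columns of `labels`.
    """
    n = len(names)
    counts = [[0] * n for _ in range(n)]
    for row in labels:
        present = [i for i, v in enumerate(row) if v]
        for i in present:
            for j in present:
                counts[i][j] += 1  # counts[i][i] is the frequency of i

    rates = {}
    for i in range(n):
        for j in range(n):
            if i != j and counts[i][i] > 0:
                rates[(names[i], names[j])] = counts[i][j] / counts[i][i]
    return rates


# Toy, made-up image-level labels (assumption: not taken from any dataset).
names = ["eyes", "lips", "eyebrows", "rare_part"]
labels = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
rates = cooccurrence_rates(labels, names)
print(rates[("eyes", "lips")])       # always together in this toy data
print(rates[("eyes", "rare_part")])  # rarely together
```

When two components appear together in nearly every image, as `eyes` and `lips` do in this toy data, an image-level label for one carries almost no signal that distinguishes it from the other, which is the ambiguity the abstract attributes to WSFP.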