🤖 AI Summary
Existing image-text multimodal style transfer methods suffer from two key limitations: (1) inconsistent, non-aligned cross-modal style representations; and (2) indiscriminate application of identical style patterns to both salient objects and their surroundings, causing content-style mismatch. To address these, ObjMST provides saliency-aware, disentangled style supervision and alignment. First, a novel Style-Specific Masked Directional CLIP Loss enforces fine-grained, consistent cross-modal alignment for salient objects and their surroundings. Second, a salient-to-key mapping mechanism decouples style control for salient objects from that of the background. Third, image-level style fusion and harmonization blend the stylized objects seamlessly into their environment for global visual coherence. Extensive experiments show that ObjMST outperforms state-of-the-art approaches in both quantitative metrics and perceptual quality, generating multimodal stylized images with consistent style distribution, precise content-style alignment, and natural object boundaries.
📝 Abstract
We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: https://github.com/chandagrover/ObjMST.
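To make the masked directional supervision concrete, the following is a minimal PyTorch sketch of what a masked directional CLIP-style loss *might* look like. It is an assumption-laden illustration, not the paper's implementation: the standard directional loss aligns the image-embedding shift (stylized minus content) with the text-embedding shift (style prompt minus source prompt), and here a saliency mask splits that supervision into separate object and background terms. The encoder interface, mask convention, and equal weighting of the two terms are all hypothetical.

```python
import torch
import torch.nn.functional as F


def directional_loss(img_src_emb, img_sty_emb, txt_src_emb, txt_sty_emb):
    """Directional loss: align the image-embedding shift with the
    text-embedding shift via cosine similarity (0 when parallel)."""
    d_img = img_sty_emb - img_src_emb  # shift of the image in embedding space
    d_txt = txt_sty_emb - txt_src_emb  # shift described by the style prompt
    return 1.0 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()


def masked_directional_clip_loss(encode_image, content, stylized, mask,
                                 src_text_emb, obj_style_emb, bg_style_emb):
    """Hypothetical masked variant: the saliency mask (1 = salient object)
    routes separate style prompts to the object and background regions,
    so each region receives its own directional supervision."""
    obj_loss = directional_loss(
        encode_image(content * mask), encode_image(stylized * mask),
        src_text_emb, obj_style_emb)
    bg_loss = directional_loss(
        encode_image(content * (1.0 - mask)), encode_image(stylized * (1.0 - mask)),
        src_text_emb, bg_style_emb)
    return obj_loss + bg_loss  # equal weighting is an assumption
```

In practice `encode_image` would be a frozen CLIP image encoder and the `*_emb` arguments frozen CLIP text embeddings; here any callable mapping image tensors to embedding vectors works, which is what makes the region-wise split the essential idea rather than any particular backbone.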