🤖 AI Summary
Existing image-text multimodal style transfer methods suffer from two key limitations: (1) inconsistent, non-aligned cross-modal style representations; and (2) indiscriminate application of identical style patterns to both salient objects and their surroundings, causing content-style mismatch. To address these, ObjMST provides saliency-aware, disentangled style supervision and alignment. First, a novel Style-Specific Masked Directional CLIP Loss enforces fine-grained, consistent cross-modal alignment for salient objects and their surroundings. Second, a salient-to-key mapping mechanism decouples style control for salient objects from that of the background. Third, image-level style fusion and harmonization blend the stylized objects seamlessly into their environment for global visual coherence. Extensive experiments show that ObjMST outperforms state-of-the-art approaches in both quantitative metrics and perceptual quality, generating multimodal stylized images with consistent style distribution, precise content-style alignment, and natural object boundaries.
📝 Abstract
We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: https://github.com/chandagrover/ObjMST.
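To make the masked directional supervision concrete, the following is a minimal PyTorch sketch of what a masked directional CLIP-style loss *might* look like. It is an assumption-laden illustration, not the paper's implementation: the standard directional loss aligns the image-embedding shift (stylized minus content) with the text-embedding shift (style prompt minus source prompt), and here a saliency mask splits that supervision into separate object and background terms. The encoder interface, mask convention, and equal weighting of the two terms are all hypothetical.

```python
import torch
import torch.nn.functional as F


def directional_loss(img_src_emb, img_sty_emb, txt_src_emb, txt_sty_emb):
    """Directional loss: align the image-embedding shift with the
    text-embedding shift via cosine similarity (0 when parallel)."""
    d_img = img_sty_emb - img_src_emb  # shift of the image in embedding space
    d_txt = txt_sty_emb - txt_src_emb  # shift described by the style prompt
    return 1.0 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()


def masked_directional_clip_loss(encode_image, content, stylized, mask,
                                 src_text_emb, obj_style_emb, bg_style_emb):
    """Hypothetical masked variant: the saliency mask (1 = salient object)
    routes separate style prompts to the object and background regions,
    so each region receives its own directional supervision."""
    obj_loss = directional_loss(
        encode_image(content * mask), encode_image(stylized * mask),
        src_text_emb, obj_style_emb)
    bg_loss = directional_loss(
        encode_image(content * (1.0 - mask)), encode_image(stylized * (1.0 - mask)),
        src_text_emb, bg_style_emb)
    return obj_loss + bg_loss  # equal weighting is an assumption
```

In practice `encode_image` would be a frozen CLIP image encoder and the `*_emb` arguments frozen CLIP text embeddings; here any callable mapping image tensors to embedding vectors works, which is what makes the region-wise split the essential idea rather than any particular backbone.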