MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

📅 2025-06-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image inpainting methods often suffer from semantic misalignment, structural distortion, and style inconsistency. To address these problems, the paper proposes MTADiffusion, a mask-text alignment diffusion model for object inpainting that is trained jointly on inpainting and edge prediction and regularized with a style-consistency loss. The key contributions are: (i) MTAPipeline, an automatic pipeline for annotating masks with detailed descriptions; (ii) MTADataset, built with MTAPipeline and comprising 5 million images and 25 million mask-text pairs (about five annotated masks per image on average); (iii) a multi-task training strategy that adds an edge-prediction task to improve structural stability; and (iv) a style-consistency loss based on a pre-trained VGG network and Gram matrices. On BrushBench and EditBench, MTADiffusion is reported to achieve state-of-the-art performance.

📝 Abstract

Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
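
One way to read the training setup described in the abstract is as a weighted sum of three terms. The form below is a sketch, not the paper's stated objective: the weights $\lambda_{\text{edge}}$ and $\lambda_{\text{style}}$, the choice of VGG layers $l$, the normalization of the Gram matrix, and the squared Frobenius distance are all assumptions made here for illustration, following the common neural style transfer formulation. Here $\hat{x}$ denotes the inpainted image, $x$ the reference image, and $\phi_l(\cdot)$ the VGG activations at layer $l$ reshaped to a $C \times N$ matrix ($C$ channels, $N$ spatial positions).

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{inpaint}}
  + \lambda_{\text{edge}}\,\mathcal{L}_{\text{edge}}
  + \lambda_{\text{style}}\,\mathcal{L}_{\text{style}},
\qquad
\mathcal{L}_{\text{style}}
  = \sum_{l} \bigl\lVert G\bigl(\phi_l(\hat{x})\bigr) - G\bigl(\phi_l(x)\bigr) \bigr\rVert_F^2,
\qquad
G(F) = \frac{F F^{\top}}{C\,N}
```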

Problem

Research questions and friction points this paper is trying to address.

Addresses semantic misalignment in image inpainting

Reduces structural distortion in generated content

Improves style consistency in inpainted regions

Innovation

Methods, ideas, or system contributions that make the work stand out.

MTAPipeline for automatic mask-text annotation

Multi-task training for structural stability

Style-consistency loss using VGG and Gram matrix
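
To make the Gram-matrix idea in the last item concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names `gram_matrix` and `style_loss`, the normalization by the number of feature entries, and the mean-squared comparison are assumptions, and the toy lists stand in for real VGG activations (one row per channel, flattened over spatial positions).

```python
from typing import List

Matrix = List[List[float]]  # one row per channel, flattened spatial positions


def gram_matrix(features: Matrix) -> Matrix:
    """Channel-by-channel inner products, normalised by the number of entries."""
    channels = len(features)
    positions = len(features[0])
    norm = float(channels * positions)
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) / norm
         for j in range(channels)]
        for i in range(channels)
    ]


def style_loss(generated: Matrix, reference: Matrix) -> float:
    """Mean squared difference between the two Gram matrices (assumed form)."""
    g_gen = gram_matrix(generated)
    g_ref = gram_matrix(reference)
    n = len(g_gen)
    return sum(
        (g_gen[i][j] - g_ref[i][j]) ** 2
        for i in range(n) for j in range(n)
    ) / (n * n)


# Toy feature maps standing in for VGG activations (2 channels, 4 positions).
ref = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
shifted = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]  # same pattern, moved by one position
flat = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]     # different channel statistics

print(style_loss(shifted, ref))  # 0.0: the Gram matrix ignores where features occur
print(style_loss(flat, ref))     # 0.15625: channel correlations differ
```

Because the Gram matrix sums over spatial positions, it compares channel co-activation statistics rather than pixel layout, which is why it is commonly used as a proxy for style. In practice the same computation would be applied to activations from several layers of a pre-trained VGG network, restricted or weighted however the training setup requires.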
