Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models

📅 2025-03-13
🤖 AI Summary
Text-to-image diffusion models systematically fail to generate counter-stereotypical action relations (e.g., “a mouse chases a cat”) because of distributional biases in training data, not inherent architectural limitations. To address this, we propose Role-Bridging Decomposition (RBD), a compositional reasoning framework that progressively guides rare-action generation via semantically coherent intermediate relations (e.g., “a mouse chases a boy”). We introduce ActionBench, the first benchmark explicitly designed for evaluating action-relational compositionality. Our approach integrates reverse prompt engineering, intermediate-relation distillation, and architecture-agnostic fine-tuning, which enables compositional inference within standard diffusion pipelines. Experiments demonstrate substantial improvements over state-of-the-art methods on both automated metrics (e.g., CLIP-Score, COMET) and human evaluation, improving the accuracy and plausibility of counter-stereotypical action generation.

📝 Abstract
Text-to-image diffusion models consistently fail at generating counter-stereotypical action relationships (e.g., "mouse chasing cat"), defaulting to frequent stereotypes even when explicitly prompted otherwise. Through systematic investigation, we discover this limitation stems from distributional biases rather than inherent model constraints. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., "mouse chasing boy"). To test this hypothesis, we develop a Role-Bridging Decomposition framework that leverages these intermediates to gradually teach rare relationships without architectural modifications. We introduce ActionBench, a comprehensive benchmark specifically designed to evaluate action-based relationship generation across stereotypical and counter-stereotypical configurations. Our experiments validate that intermediate compositions indeed facilitate counter-stereotypical generation, with both automatic metrics and human evaluations showing significant improvements over existing approaches. This work not only identifies fundamental biases in current text-to-image systems but also demonstrates a promising direction for addressing them through compositional reasoning.
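The idea of bridging a rare relation through intermediate compositions can be illustrated at the prompt level. The sketch below is not the authors' implementation: the names `ActionTriple` and `bridge_prompts`, the choice of bridging entity, and the ordering of the intermediates are all assumptions made for illustration. The abstract only gives "mouse chasing boy" as an intermediate; the second intermediate here ("boy chasing cat") is a hypothetical complement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionTriple:
    """A (subject, action, object) relation, e.g. mouse / chasing / cat."""

    subject: str
    action: str
    obj: str

    def prompt(self) -> str:
        # Render the triple as a plain text prompt.
        return f"{self.subject} {self.action} {self.obj}"

    def inverted(self) -> "ActionTriple":
        # Swap the two roles: the common counterpart of a rare triple.
        return ActionTriple(self.obj, self.action, self.subject)


def bridge_prompts(rare: ActionTriple, bridge_entity: str) -> list[str]:
    """Order prompts from intermediate compositions to the rare target.

    Each intermediate keeps one role of the rare triple and fills the other
    role with `bridge_entity`, a hypothetical neutral entity (assumption).
    """
    keeps_subject = ActionTriple(rare.subject, rare.action, bridge_entity)
    keeps_object = ActionTriple(bridge_entity, rare.action, rare.obj)
    return [keeps_subject.prompt(), keeps_object.prompt(), rare.prompt()]


if __name__ == "__main__":
    target = ActionTriple("mouse", "chasing", "cat")
    for step, text in enumerate(bridge_prompts(target, "boy"), start=1):
        print(step, text)
```

In a training setup of this kind, the intermediate prompts would presumably be used before the rare target, so the model sees each entity in its unusual role separately before they are combined.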
Problem

Research questions and friction points this paper is trying to address.

Text-to-image models fail at generating counter-stereotypical action relationships.
Distributional biases cause models to default to frequent stereotypes.
Whether intermediate compositions can help models learn counter-stereotypical generation is untested.
Innovation

Methods, ideas, or system contributions that make the work stand out.

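Role-Bridging Decomposition leverages intermediate compositions to gradually teach rare relationships, without architectural modifications.
ActionBench evaluates action-based relationship generation across stereotypical and counter-stereotypical configurations.
Experiments show that intermediate compositions improve counter-stereotypical generation.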
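
A benchmark that contrasts stereotypical and counter-stereotypical configurations can be sketched as paired prompts plus a per-configuration accuracy. This is a minimal illustration under stated assumptions, not ActionBench's actual format or scoring: `make_pair`, `accuracy_by_configuration`, and the (configuration, is_correct) judgment layout are hypothetical, and the judgments themselves would come from an automatic metric or human raters.

```python
from collections import defaultdict
from typing import Iterable


def make_pair(agent: str, action: str, patient: str) -> dict[str, str]:
    """Build both configurations of one action relation by swapping roles."""
    return {
        "stereotypical": f"{agent} {action} {patient}",
        "counter_stereotypical": f"{patient} {action} {agent}",
    }


def accuracy_by_configuration(
    judgments: Iterable[tuple[str, bool]],
) -> dict[str, float]:
    """Fraction of images judged correct, grouped by configuration name.

    Each judgment is (configuration, is_correct); where the boolean comes
    from (metric threshold, human vote) is outside this sketch.
    """
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for configuration, is_correct in judgments:
        totals[configuration] += 1
        correct[configuration] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


if __name__ == "__main__":
    pair = make_pair("cat", "chasing", "mouse")
    print(pair["stereotypical"])
    print(pair["counter_stereotypical"])
```

Reporting the two configurations separately makes the gap visible: a model that defaults to the frequent stereotype would score well on one group and poorly on the other.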