AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models often produce distorted outputs when generating action-centric scenes, largely because they fail to model the temporal dynamics, implicit attributes, and contextual relationships inherent in actions. To address this, we introduce AcT2I, the first benchmark explicitly designed to evaluate action generation in T2I systems, and propose a training-free knowledge distillation framework. Our method uses large language models (LLMs) to augment action prompts along three dimensions, injecting temporal, attribute-based, and situational knowledge into the prompts fed to the text encoder without any retraining. This provides the first systematic evaluation of T2I models' ability to render temporally coherent actions and reveals the critical role of fine-grained temporal cues in generation quality. Experiments across mainstream T2I models show substantial accuracy improvements, with gains of up to 72%, validating the effectiveness and generalizability of language-guided knowledge infusion for complex vision-action reasoning.

📝 Abstract
Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture the nuanced and often implicit attributes inherent in action depiction, leading to generated images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.
Problem

Research questions and friction points this paper is trying to address.

Evaluating action depiction accuracy in text-to-image models
Addressing contextual detail loss in action-centric image generation
Improving temporal reasoning for complex action visualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free knowledge distillation technique
LLM-enhanced prompts with temporal details
Dense prompt augmentation across three dimensions (temporal, attribute, contextual); see the sketch below
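
To make the training-free pipeline concrete, here is a minimal sketch of LLM-based prompt augmentation followed by off-the-shelf image generation. The specific models (gpt-4o-mini, Stable Diffusion v1.5), the prompt template, and the wording of the three dimension descriptions are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: training-free, LLM-based prompt augmentation for action-centric T2I.
# Model choices and the instruction template are illustrative, not the paper's own.
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline

# Three augmentation dimensions (temporal, attribute, contextual), paraphrased from the paper.
DIMENSIONS = {
    "temporal":   "the exact instant of the action being frozen (pose, motion blur, mid-air objects)",
    "attribute":  "implicit physical attributes of the actors and objects (grip, strain, clothing, expression)",
    "contextual": "the surrounding scene the action implies (location, props, other participants)",
}

def augment_prompt(action_prompt: str, llm: OpenAI, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to densify a terse action prompt along the three dimensions."""
    instructions = "\n".join(f"- {name}: describe {desc}" for name, desc in DIMENSIONS.items())
    response = llm.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You expand terse action descriptions into dense, visually explicit "
                        "image-generation prompts. Answer with a single paragraph."},
            {"role": "user",
             "content": f"Action: {action_prompt}\nAdd the following details:\n{instructions}"},
        ],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    dense_prompt = augment_prompt("a gymnast performing a backflip", llm)

    # Any off-the-shelf T2I model can consume the enriched prompt; no fine-tuning is involved.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe(dense_prompt).images[0].save("action.png")
```

Because the enrichment happens entirely at the prompt level, the same augmented text can be reused across different T2I backbones, which is what makes the approach training-free and model-agnostic.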