TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a challenge in long-duration, multi-event text-to-video generation: preserving clear event boundaries while maintaining cross-event coherence. The authors propose a training-free, progressive guidance mechanism built on what they report as the first identification of semantic turning points in the denoising trajectory of Diffusion Transformers (DiTs). Leveraging these turning points, they introduce a dual-regulation strategy that combines event-partitioned masks with cross-event prompt fusion to modulate textual conditioning dynamically during critical denoising stages, steering multi-event synthesis from global layout to fine-grained details. Experiments show that the method achieves state-of-the-art performance among training-free approaches across eight metrics, with text-alignment gains that grow as the number of events increases. The work also introduces Meve, described as the first benchmark for evaluating multi-event video generation.
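The stage-dependent modulation described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the function name `steering_mode`, the single turning-point fraction `turn_frac`, and the choice to apply masking before the turning point and fusion after it are hypothetical. The paper's actual turning points and schedule are not given on this page.

```python
def steering_mode(step: int, num_steps: int, turn_frac: float = 0.5) -> str:
    """Return which (hypothetical) steering handle is active at a denoising step.

    Steps are counted from 0 (noisiest) to num_steps - 1 (cleanest).
    Assumption: early steps shape global layout, so event masking is applied
    there; late steps refine details, so prompt fusion is applied there.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    if not 0.0 <= turn_frac <= 1.0:
        raise ValueError("turn_frac must lie in [0, 1]")
    turn_step = int(turn_frac * num_steps)  # first step of the late stage
    return "event_masking" if step < turn_step else "prompt_fusion"


# Schedule for a 10-step sampler with the turning point at the midpoint.
schedule = [steering_mode(s, num_steps=10, turn_frac=0.5) for s in range(10)]
```

A real sampler would likely use more than one turning point and smooth transitions between stages; a hard switch is used here only to keep the idea visible.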

📝 Abstract

Text-to-video (T2V) generation remains challenging for long-horizon videos that contain multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover turning points in the DiT denoising trajectory where the influence of the conditioning text moves from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method for multi-event generation that requires no additional training. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking, which enforces event boundaries while allowing cross-event transition bands; and (2) Cross-Event Prompt Fusion, which injects neighboring event semantics for late-stage refinement. We also contribute Meve, a self-curated prompt suite for benchmarking multi-event generation. Compared with other training-free methods, TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation. The improvement in text alignment grows with the event count, suggesting that the method scales to larger numbers of events.
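The two steering handles can be illustrated with a minimal sketch in plain Python. It rests on assumptions not stated in the abstract: events are given equal-length frame segments, the transition band is a fixed number of frames on each side of a boundary, and fusion is a linear blend with the mean of the neighboring prompt embeddings. The names `event_partition_mask`, `fuse_neighbors`, `band`, and `alpha` are hypothetical.

```python
def event_partition_mask(num_frames: int, num_events: int, band: int = 0):
    """Build a frame-by-event visibility mask (True = frame may attend to event text).

    Assumption: events occupy equal-length, contiguous frame segments.
    `band` widens each segment by that many frames on both sides, so frames
    near a boundary see both neighboring event prompts.
    """
    if num_events <= 0 or num_frames < num_events or band < 0:
        raise ValueError("need num_frames >= num_events > 0 and band >= 0")
    mask = []
    for f in range(num_frames):
        row = []
        for e in range(num_events):
            start = e * num_frames // num_events        # inclusive
            end = (e + 1) * num_frames // num_events    # exclusive
            row.append(start - band <= f < end + band)
        mask.append(row)
    return mask


def fuse_neighbors(embs, i: int, alpha: float = 0.2):
    """Blend event i's embedding with the mean of its neighbors' embeddings."""
    neighbors = [embs[j] for j in (i - 1, i + 1) if 0 <= j < len(embs)]
    if not neighbors:
        return list(embs[i])
    mean = [sum(vals) / len(neighbors) for vals in zip(*neighbors)]
    return [(1 - alpha) * x + alpha * m for x, m in zip(embs[i], mean)]


# 8 frames, 2 events, 1-frame transition band around the boundary at frame 4.
mask = event_partition_mask(num_frames=8, num_events=2, band=1)
# Toy 2-dimensional "prompt embeddings" for two events.
fused = fuse_neighbors([[1.0, 0.0], [0.0, 1.0]], i=0, alpha=0.5)
```

In this toy setting, frames 3 and 4 can attend to both event prompts while all other frames see only their own event. Increasing `band` or `alpha` trades event separation for smoother transitions, which is one way to read the tunable trade-off the abstract mentions.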

Problem

Research questions and friction points this paper is trying to address.

text-to-video generation

multi-event video

long-horizon generation

event boundary

text alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer

Training-free Steering

Multi-event Video Generation