Modality Forcing for Scalable Spatial Generation

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes Modality Forcing, a simple post-training recipe for conditional or joint image-depth generation under the practical constraint of sparse ground-truth depth supervision. By assigning separate noise levels to the image and depth modalities and using per-modality decoders within a single DiT, the method supports conditional and joint generation without requiring dense depth annotations. The authors train a family of text-to-image models from scratch (370M to 3.3B parameters) and find that larger models trained on more image data produce more accurate depth. The strongest variant is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models, which the authors present as evidence that image generation is a scalable pre-training objective for spatial perception.
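The idea of giving each modality its own noise level can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the linear noising rule, the mode names, the helper functions `add_noise`, `noise_levels`, and `masked_mse`, and the use of a validity mask to restrict the depth loss to pixels that have ground truth. The sketch shows how a noise level of zero turns a modality into clean conditioning, while nonzero levels on both modalities correspond to joint generation.

```python
import random


def add_noise(clean, t, rng):
    """Blend clean values with Gaussian noise (assumed linear schedule).

    t = 0 returns the clean signal unchanged; t = 1 returns pure noise.
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in clean]


def noise_levels(mode, rng):
    """Return (t_image, t_depth) for a hypothetical task mode."""
    if mode == "image_to_depth":   # clean image conditions noisy depth
        return 0.0, rng.random()
    if mode == "depth_to_image":   # clean depth conditions noisy image
        return rng.random(), 0.0
    if mode == "joint":            # both modalities are noised independently
        return rng.random(), rng.random()
    raise ValueError(f"unknown mode: {mode}")


def masked_mse(pred, target, valid):
    """Mean squared error over positions that have ground-truth depth only."""
    pairs = [(p, y) for p, y, v in zip(pred, target, valid) if v]
    if not pairs:
        return 0.0
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


rng = random.Random(0)
image = [0.2, 0.5, 0.9, 0.1]           # toy flattened image values
depth = [1.0, 2.0, 0.0, 4.0]           # toy depth; 0.0 marks a missing reading
valid = [True, True, False, True]      # sparse supervision mask (assumed)

t_img, t_depth = noise_levels("image_to_depth", rng)
noisy_image = add_noise(image, t_img, rng)    # unchanged because t_img == 0
noisy_depth = add_noise(depth, t_depth, rng)  # noised at level t_depth
```

In this toy setup, switching the mode string is the only change needed to move between depth prediction, depth-conditioned image generation, and joint generation, which is one way to read the "single model, separate noise levels" design described above.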

📝 Abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior work adapts T2I models to exploit these priors for depth prediction, but requires dense depth data and complex training recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. Project page: https://modality-forcing.github.io/
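For readers unfamiliar with the metric, AbsRel (absolute relative error) is conventionally the mean of |predicted depth - true depth| / true depth over pixels with valid ground truth. The sketch below computes it in plain Python and shows how a relative reduction such as the reported 57% would be derived. The function names and the example numbers (a baseline of 0.100 and an improved score of 0.043) are hypothetical and chosen only to make the arithmetic concrete; they are not results from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depth values")
    return sum(errors) / len(errors)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error relative to `baseline`."""
    return (baseline - improved) / baseline


# Toy depth maps; a ground-truth value of 0.0 marks a pixel without a measurement,
# so it is skipped, which is how sparse ground truth is usually handled.
pred = [1.1, 2.0, 5.0]
gt = [1.0, 2.0, 0.0]
score = abs_rel(pred, gt)

# Hypothetical scores, used only to illustrate what "reduces AbsRel by 57%" means.
reduction = relative_reduction(0.100, 0.043)
```

Because only pixels with positive ground truth contribute, the same function applies unchanged to sparse depth maps.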

Problem

Research questions and friction points this paper is trying to address.

text-to-image

depth estimation

spatial perception

generative modeling

scalable pre-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality Forcing

joint image-depth generation

scalable pre-training