🤖 AI Summary
Existing text-to-image models lack explicit, continuous control over camera intrinsics such as focal length and field of view, resulting in geometric and semantic inconsistencies across lens configurations and limiting their use in professional photography. This work introduces a text-driven, photorealistic image generation framework built on two new concepts, Dimensionality Lifting and Contrastive Camera Learning, that enable differentiable, scene-consistent modeling of camera parameters. Technically, it integrates camera-parameter embeddings, geometry-aware attention mechanisms, and multi-scale contrastive losses into a diffusion architecture to explicitly encode physical imaging priors. Experiments demonstrate significant improvements over Stable Diffusion 3 and FLUX on lens-switching, depth-of-field, and perspective-transformation tasks. Generated images exhibit both physical plausibility and high visual fidelity, advancing controllable, physics-informed generative modeling.
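The "camera-parameter embeddings" mentioned above can be pictured as a continuous, differentiable encoding of an intrinsic such as focal length. The sketch below uses a standard sinusoidal feature map so that nearby focal lengths yield nearby embeddings; this is an illustrative assumption, not the paper's actual encoder, and the function name and normalization constant are hypothetical.

```python
import math

def camera_embedding(focal_length_mm, dim=8, max_mm=200.0):
    """Hypothetical continuous embedding of a camera intrinsic.

    Sinusoidal features vary smoothly with focal length, so a 24mm and a
    25mm lens map to nearby codes -- the kind of continuous conditioning
    the summary describes. Illustrative sketch only.
    """
    x = focal_length_mm / max_mm  # normalize to roughly [0, 1]
    emb = []
    for i in range(dim // 2):
        freq = (2.0 ** i) * math.pi  # geometrically spaced frequencies
        emb.append(math.sin(freq * x))
        emb.append(math.cos(freq * x))
    return emb

# Distinct lens settings produce distinct, smoothly varying codes.
e24 = camera_embedding(24.0)
e70 = camera_embedding(70.0)
```

In a diffusion model, such a code would typically be injected as an extra conditioning token or added to the timestep embedding, so the denoiser can be steered continuously between lens settings.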
📝 Abstract
Image generation today can produce fairly realistic images from text prompts. However, if one asks the generator to synthesize images under a particular camera setting, such as the different fields of view of a 24mm versus a 70mm lens, it cannot interpret the setting or produce scene-consistent results. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue: bridging the gap between data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions across different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.
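The Contrastive Camera Learning idea can be sketched, under assumptions, as a margin-based contrastive objective over image features: features of the same scene under the same camera setting are pulled together, while features of the same scene under different settings are kept at least a margin apart. The function below is a minimal hypothetical sketch, not the paper's actual loss.

```python
def contrastive_camera_loss(anchor, positive, negatives, margin=0.2):
    """Hypothetical margin-based contrastive loss over feature vectors.

    `anchor` and `positive` share scene and camera setting; each vector in
    `negatives` shares the scene but uses a different camera setting.
    Illustrative sketch only -- the paper's actual formulation may differ.
    """
    def dist(a, b):
        # Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    pos = dist(anchor, positive)          # pull matched pairs together
    loss = pos
    for neg in negatives:
        # push mismatched camera settings at least `margin` apart
        loss += max(0.0, margin + pos - dist(anchor, neg))
    return loss
```

When the negative is already far from the anchor the hinge term vanishes, so the loss only penalizes camera settings whose features have collapsed together.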