🤖 AI Summary
This work addresses the limited support that existing image generation models offer for the compositional control photographers routinely exercise. The authors propose an anchor-conditioned fine-tuning framework for landscape image generation: a four-dimensional compositional anchor vector is extracted from training images, Fourier-encoded, and injected into a diffusion model through a decoupled cross-attention mechanism, trained with a three-way classifier-free guidance dropout strategy. Compared against a baseline and three ablation variants, the proposed architecture achieves the highest horizon detection rate (0.850) and the highest rule-of-thirds alignment (0.817). A category-specific ablation shows that training on compositionally homogeneous scene subsets reduces horizon deviation by up to 40% relative to mixed training, indicating that the precision of compositional control depends on scene category.
📝 Abstract
Image generative models, though widely used as creative tools, offer limited support for the kind of compositional control that photographers and visual artists routinely exercise. This paper presents early results on an anchor-conditioned fine-tuning framework for landscape image generation, in which a four-dimensional compositional anchor vector is extracted from training images and injected into a diffusion model via a decoupled cross-attention mechanism with Fourier encoding and three-way classifier-free guidance dropout. Quantitative evaluation against a baseline and three ablation variants shows that the proposed architecture achieves the highest horizon detection rate (0.850) and the highest rule-of-thirds alignment (0.817). A category-specific ablation further demonstrates that training on compositionally homogeneous scene subsets reduces horizon deviation by up to 40 percent compared to mixed training. This establishes that the precision of compositional control is category-dependent.
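The abstract does not give implementation details, so the following is only a minimal sketch of two ingredients it names: Fourier encoding of a four-dimensional anchor vector, and a three-way condition dropout for classifier-free guidance. Everything specific here is an assumption made for illustration: the function names, the octave-spaced frequencies, the number of bands, the dropout probabilities, the use of `None` as a null condition, and the reading of "three-way" as dropping text only, anchor only, or both.

```python
import math
import random


def fourier_encode(anchor, num_bands=4):
    """Map each anchor component (assumed normalized to [0, 1]) to sin/cos
    features at octave-spaced frequencies. Band count is an assumption."""
    features = []
    for value in anchor:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features


def three_way_dropout(text_cond, anchor_cond, rng,
                      p_text=0.1, p_anchor=0.1, p_both=0.1):
    """Randomly null out conditions during training.

    Assumed interpretation of "three-way": drop both conditions, drop only
    the text condition, or drop only the anchor condition. None stands in
    for a learned or zero null embedding. Probabilities are hypothetical.
    """
    u = rng.random()
    if u < p_both:
        return None, None
    if u < p_both + p_text:
        return None, anchor_cond
    if u < p_both + p_text + p_anchor:
        return text_cond, None
    return text_cond, anchor_cond


# Hypothetical anchor values; the abstract does not say what the four
# dimensions represent.
anchor = [0.5, 0.33, 0.0, 1.0]
encoded = fourier_encode(anchor)  # 4 components * 4 bands * 2 = 32 features

rng = random.Random(0)
text_cond, anchor_cond = three_way_dropout("landscape prompt", encoded, rng)
```

Training with each condition dropped independently is what would let a sampler later weight text guidance and anchor guidance separately, which is a common motivation for this style of dropout when a model has more than one conditioning signal.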