🤖 AI Summary
Existing generative image synthesis methods struggle to simultaneously preserve foreground detail fidelity and enable controllable pose/view manipulation. To address this, we propose a multi-reference image synthesis framework centered on a cross-reference feature calibration mechanism: it explicitly aligns local detail features from multiple reference images with global background context, enabling consistent inter-reference modeling and background-aware fusion. Our method adopts a generative model architecture and jointly optimizes three modules (feature extraction, cross-reference calibration, and background adaptation), thereby preserving texture and structural details while supporting flexible pose and viewpoint editing. Experiments on MVImgNet and MureCom demonstrate that our approach significantly outperforms state-of-the-art methods in FID, LPIPS, and user study metrics, achieving substantial improvements in visual realism, geometric plausibility, and detail completeness of synthesized images.
📝 Abstract
Image composition aims to seamlessly insert a foreground object into a background image. Despite huge progress in generative image composition, existing methods still struggle to simultaneously preserve details and adjust the foreground pose/view. To address this issue, we extend an existing generative composition model to a multi-reference version that accepts an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of the foreground reference images to make them compatible with the background information. The calibrated reference features supplement the original reference features with useful global and local information in the proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model benefits greatly from the calibrated reference features.
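The abstract does not specify how the calibration is implemented; one plausible reading is a cross-attention step in which tokens from all foreground references attend to background tokens, and the result is added back to the original reference features as a residual "supplement". The sketch below illustrates that reading only; the module name, token counts, and dimensions are all assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ReferenceCalibrator(nn.Module):
    """Hypothetical sketch of cross-reference feature calibration:
    multi-reference foreground tokens attend to background context tokens,
    and the attended features are added residually to the originals.
    All names and shapes here are illustrative assumptions."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_feats: torch.Tensor, bg_feats: torch.Tensor) -> torch.Tensor:
        # ref_feats: (B, n_refs * tokens_per_ref, dim) -- global/local tokens
        #            concatenated from an arbitrary number of references
        # bg_feats:  (B, bg_tokens, dim)               -- background context tokens
        calibrated, _ = self.attn(query=ref_feats, key=bg_feats, value=bg_feats)
        # Residual connection: calibrated features supplement, rather than
        # replace, the original reference features.
        return self.norm(ref_feats + calibrated)

# Toy usage: 3 reference images with 16 tokens each, 64 background tokens.
B, dim = 2, 256
refs = torch.randn(B, 3 * 16, dim)
bg = torch.randn(B, 64, dim)
out = ReferenceCalibrator(dim)(refs, bg)
print(out.shape)  # torch.Size([2, 48, 256])
```

The key property of this sketch is that the output keeps the same shape as the input reference features, so the calibrated features can be fed to the downstream generative model wherever the original reference features were used.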