COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses high-fidelity, fine-grained object-level conditioning in Diffusion Transformers, which is currently hindered by visual artifacts and imprecise control over small local regions. The authors propose a training-free, cascaded object-level latent refinement framework that progressively refines object features through a Field-of-View (FoV) expansion mechanism. The approach combines a Cross-Scale Semantic Alignment (CSSA) module with a frequency-based Cyclic Feature Injection (CFI) strategy, enabling adaptive high-frequency fusion between local objects and the global background. On the COCO-MIG and COCO-POS benchmarks, the method is reported to consistently outperform state-of-the-art approaches in semantic alignment, image quality, and spatial fidelity.

📝 Abstract

Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors such as depth and Canny maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via Field-of-View (FoV) expansion. First, we propose the Cross-Scale Semantic Alignment (CSSA) module, which addresses spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism that leverages a frequency-based adaptive strategy to selectively update the global backbone with context-aligned local information. Finally, the extended-FoV branch serves as a hub for feature optimization, ensuring that object-level features are integrated into the global generation process without compromising final image quality. Extensive experiments on the COCO-MIG and COCO-POS benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods across semantic alignment, image quality, and spatial fidelity.
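
The abstract does not spell out how the frequency-based update works, so the following is only a minimal sketch of the general idea of frequency-selective fusion, not the authors' CFI module. It assumes a latent stored as a `(C, H, W)` NumPy array, an object region given as a pixel box, and a fixed radial cutoff in place of the adaptive strategy the paper describes. The names `high_pass`, `inject_high_freq`, `cutoff`, and `strength` are hypothetical and chosen for illustration.

```python
import numpy as np


def high_pass(x, cutoff=0.25):
    """Return the part of a (C, H, W) array above a radial frequency cutoff.

    `cutoff` is a fraction of the Nyquist frequency (assumed, fixed here;
    the paper describes an adaptive strategy instead).
    """
    _, h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    mask = radius > cutoff * 0.5  # DC and low frequencies are excluded
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    return np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real


def inject_high_freq(global_latent, local_latent, box, cutoff=0.25, strength=1.0):
    """Swap high-frequency detail inside `box` for that of `local_latent`.

    `box` is (y0, x0, y1, x1) in latent coordinates; `local_latent` must
    already be resized to the box shape. Low frequencies of the global
    latent are left untouched so the background context is preserved.
    """
    y0, x0, y1, x1 = box
    region = global_latent[:, y0:y1, x0:x1]
    if local_latent.shape != region.shape:
        raise ValueError("local_latent must match the box shape")
    delta = high_pass(local_latent, cutoff) - high_pass(region, cutoff)
    out = global_latent.copy()
    out[:, y0:y1, x0:x1] = region + strength * delta
    return out


rng = np.random.default_rng(0)
global_latent = rng.normal(size=(4, 16, 16))
local_latent = rng.normal(size=(4, 8, 8))
fused = inject_high_freq(global_latent, local_latent, box=(4, 4, 12, 12))
```

In this sketch, setting `strength=0.0` returns the global latent unchanged, and everything outside the box is never modified. Only detail above the cutoff is exchanged, which is one simple way to read "selectively update the global backbone with context-aligned local information".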

Problem

Research questions and friction points this paper is trying to address.

object-level control

high-fidelity generation

visual artifacts

spatial fidelity

conditional generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded Object-Level Latent Refinement

Field-of-View expansion

Cross-Scale Semantic Alignment