CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prevalent issue of inaccurate object counting in text-to-image (T2I) generation, this paper proposes a training-free, two-stage counting-guided diffusion framework. In the first stage, an intermediate denoising result is projected to a predicted final image via one-step denoising, and a pre-trained visual counting model (e.g., CountViT) counts the objects in that prediction. In the second stage, a gradient-guided correction module built on attention maps refines the diffusion process so the generated quantity aligns with the target count. The authors present this as the first *training-free* counting-guidance mechanism: it is plug-and-play, requiring no fine-tuning or retraining of existing T2I diffusion models (e.g., Stable Diffusion), and couples attention-based guidance with an external counting module. Evaluated across multiple benchmarks, the method improves object-counting accuracy by 32.7% relative while preserving image fidelity and full compatibility with off-the-shelf diffusion models.

📝 Abstract
Stable Diffusion has advanced text-to-image synthesis, but training models to generate images with an accurate object quantity remains difficult due to the high computational cost and the challenge of teaching models the abstract concept of quantity. In this paper, we propose CountDiffusion, a training-free framework for generating images with the correct object quantity from textual descriptions. CountDiffusion consists of two stages. In the first stage, the diffusion model predicts the final synthesized image from an intermediate denoising result via one-step denoising, and a counting model counts the number of objects in this predicted image. In the second stage, a correction module adjusts the object quantity by modifying the object's attention map with universal guidance. CountDiffusion can be plugged into any diffusion-based text-to-image (T2I) generation model without further training. Experimental results demonstrate the superiority of the proposed CountDiffusion, which improves the accurate object-quantity generation ability of T2I models by a large margin.
Problem

Research questions and friction points this paper is trying to address.

Generating images with accurate object counts from text
Overcoming training challenges in quantity-aware image synthesis
Enhancing diffusion models without additional training for counting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free counting-guidance diffusion framework
Two-stage correction with counting model
Plug-in for any diffusion-based T2I models
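The two-stage procedure above can be sketched as a minimal guidance loop. This is an illustrative numpy sketch, not the paper's implementation: `one_step_x0` is the standard DDPM one-step clean-image estimate, while `correct_attention` stands in for the paper's gradient-based universal guidance with a simple hypothetical rule that scales the object's attention map up when too few objects are counted and down when too many.

```python
import numpy as np

def one_step_x0(x_t, eps_pred, alpha_bar_t):
    """Predict the clean image x0 from the noisy latent x_t in one step,
    using the standard DDPM estimate x0 = (x_t - sqrt(1-a)*eps) / sqrt(a)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def counting_loss(count_pred, count_target):
    """Squared error between the counted and the requested object quantity."""
    return (count_pred - count_target) ** 2

def correct_attention(attn_map, count_pred, count_target, lr=0.1):
    """Hypothetical correction step: nudge the object's attention map toward
    the target count (a stand-in for the gradient-based universal guidance
    update used in the paper)."""
    scale = 1.0 + lr * np.sign(count_target - count_pred)
    return np.clip(attn_map * scale, 0.0, 1.0)

# Toy usage: predict x0, "count" objects, then correct the attention map.
x_t = np.ones(4)                       # noisy latent at step t
x0_hat = one_step_x0(x_t, np.zeros(4), alpha_bar_t=0.9)
attn = np.full((2, 2), 0.5)            # attention map for the counted object
attn = correct_attention(attn, count_pred=2, count_target=4)
```

In the actual method the counting model and the denoiser replace the toy inputs, and the corrected attention map steers the remaining denoising steps; no model weights are updated at any point.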
👥 Authors
Yanyu Li (PhD, Northeastern University; Machine Learning)
Pencheng Wan (Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China)
Liang Han (Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China)
Yaowei Wang (The Hong Kong Polytechnic University)
Liqiang Nie (Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China)
Min Zhang (Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China)