Rethinking Sparse Signals for Pose-guided Text-to-image Generation

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In pose-guided text-to-image generation, dense conditioning signals (e.g., depth maps, DensePose) suffer from poor editability and frequent conflicts with text prompts. To address this, we revisit sparse pose representations—specifically OpenPose keypoints—and propose Spatial-Pose ControlNet (SP-Ctrl). Our method introduces two key innovations within the ControlNet framework: (1) modeling OpenPose keypoints as learnable embeddings, and (2) integrating keypoint concept learning with spatially aware attention to enable fine-grained, sparse conditional control. Experiments demonstrate that SP-Ctrl significantly outperforms existing sparse-conditioning methods on human and animal image generation, achieving performance on par with dense-signal approaches while offering superior editability and cross-species generalization.
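The first innovation above, modeling OpenPose keypoints as learnable embeddings, can be illustrated with a minimal sketch: instead of rasterizing the skeleton into a fixed-color image, each keypoint type carries a trainable embedding vector that is scattered onto a spatial canvas at its detected location. The class name, shapes, and scatter scheme below are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class KeypointEmbedder(nn.Module):
    """Hypothetical sketch: render sparse OpenPose keypoints into a spatial
    conditioning map using learnable per-keypoint embeddings, rather than a
    fixed-color skeleton drawing. Names and shapes are assumed, not taken
    from the paper."""

    def __init__(self, num_keypoints=17, embed_dim=64, map_size=64):
        super().__init__()
        # One learnable embedding vector per keypoint type (e.g. nose, wrist).
        self.embeddings = nn.Embedding(num_keypoints, embed_dim)
        self.map_size = map_size

    def forward(self, keypoints, visibility):
        # keypoints:  (B, K, 2) normalized (x, y) coordinates in [0, 1]
        # visibility: (B, K) binary mask for detected keypoints
        B, K, _ = keypoints.shape
        canvas = torch.zeros(B, self.embeddings.embedding_dim,
                             self.map_size, self.map_size)
        coords = (keypoints * (self.map_size - 1)).long().clamp(0, self.map_size - 1)
        for b in range(B):
            for k in range(K):
                if visibility[b, k] > 0:
                    x, y = coords[b, k]
                    # Scatter the k-th learnable embedding at its location.
                    canvas[b, :, y, x] = self.embeddings.weight[k]
        return canvas
```

A conditioning map built this way would then feed the ControlNet branch in place of a rendered skeleton image; because the embeddings are trained end to end, different keypoint types can become discriminative rather than sharing one hand-picked color code.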

📝 Abstract
Recent works have favored dense signals (e.g., depth, DensePose) over sparse signals (e.g., OpenPose) to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raise new challenges, including editing difficulty and potential inconsistency with textual prompts. This motivates us to revisit sparse signals for pose guidance, which remain underexplored despite their simplicity and shape-agnostic nature. This paper proposes Spatial-Pose ControlNet (SP-Ctrl), a novel method that equips sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial position of each keypoint, thereby improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense-signal-based methods. Moreover, SP-Ctrl shows promising capability for diverse and cross-species generation through sparse signals. Code will be available at https://github.com/DREAMXFAR/SP-Ctrl.
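The keypoint concept learning described above, where keypoint tokens are encouraged to attend to each keypoint's spatial position, can be sketched as a cross-attention alignment loss. The formulation below is one plausible reading of the abstract (a cross-entropy between each token's attention map and a one-hot target at the ground-truth location); the function name, tensor shapes, and target construction are assumptions, not the paper's exact objective.

```python
import torch

def keypoint_concept_loss(attn, kp_coords, visibility):
    """Hedged sketch of a keypoint-concept alignment loss: push each
    keypoint token's cross-attention map to concentrate at that keypoint's
    ground-truth location. Assumed formulation, not the paper's exact loss.

    attn:       (B, K, H, W) attention of keypoint token k over the image grid,
                normalized so each (H, W) map sums to 1
    kp_coords:  (B, K, 2) normalized (x, y) targets in [0, 1]
    visibility: (B, K) binary mask for annotated keypoints
    """
    B, K, H, W = attn.shape
    # Build one-hot spatial targets at the keypoint cells (a small Gaussian
    # blob would be a softer alternative).
    target = torch.zeros_like(attn)
    xs = (kp_coords[..., 0] * (W - 1)).long().clamp(0, W - 1)
    ys = (kp_coords[..., 1] * (H - 1)).long().clamp(0, H - 1)
    for b in range(B):
        for k in range(K):
            target[b, k, ys[b, k], xs[b, k]] = 1.0
    # Cross-entropy between the attention distribution and the target
    # location, averaged over visible keypoints only.
    logp = torch.log(attn.flatten(2) + 1e-8)      # (B, K, H*W)
    ce = -(target.flatten(2) * logp).sum(-1)      # (B, K)
    return (ce * visibility).sum() / visibility.sum().clamp(min=1)
```

Such a term would be added to the usual diffusion training loss with a weighting coefficient, so that pose alignment improves without sacrificing image quality.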
Problem

Research questions and friction points this paper is trying to address.

Revisiting sparse signals for pose-guided image generation
Enhancing controllability of sparse signals in image synthesis
Improving pose alignment with learnable spatial representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends OpenPose to a learnable spatial representation
Introduces keypoint concept learning for pose alignment
Proposes Spatial-Pose ControlNet (SP-Ctrl) for robust controllability