Imagine Before You Draw: Visual Prompt Engineering for Image Generation

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges in text-to-image generation: the difficulty of mapping text directly to pixels, loss of detail, and limited editability. The first arises when a model has no intermediate semantic representation; the other two arise when a two-stage pipeline supplies one but places an information bottleneck between the semantic planner and the image decoder. The authors propose Visual Prompt Engineering (VPE), a unified framework that autoregressively generates semantic visual tokens (e.g., SigLIP 2 features) as “visual prompts” and then generates the image conditioned on them, so that semantic planning and image synthesis happen inside one model. Because the image-generation step can see both the original input and the semantic plan, VPE avoids the bottleneck of conventional two-stage pipelines. Experiments show that, compared with external two-stage alternatives of the same parameter scale, VPE substantially improves editing preservation (PSNR: 26.76 vs. 19.92). VPE also accelerates convergence, raises the quality ceiling, and supports class-conditional generation, text-to-image synthesis, and image editing.
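To put the reported PSNR gap in perspective, the sketch below uses the standard PSNR definition and converts a gap in decibels into a ratio of mean squared errors. It assumes 8-bit pixel values (peak 255) and images flattened to plain sequences. The summary does not state the value range, color space, or evaluation set behind the reported numbers, so the function names and defaults here are illustrative assumptions, not the authors' evaluation code.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], candidate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Standard PSNR in dB between two equally sized, flattened images.

    max_value=255.0 assumes 8-bit pixels (an assumption; the source does
    not state the range used for its reported scores).
    """
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(psnr_high: float, psnr_low: float) -> float:
    """How many times smaller the MSE is at psnr_high than at psnr_low.

    Follows from PSNR = 10 * log10(MAX^2 / MSE) with the same MAX on both sides.
    """
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)


# The reported scores, read as a ratio of mean squared errors.
ratio = mse_ratio_from_psnr_gap(26.76, 19.92)
print(f"MSE ratio implied by the reported gap: {ratio:.2f}x")
```

Under this definition, the 6.84 dB difference between 26.76 and 19.92 corresponds to a mean squared error that is a little under five times lower, provided both scores were computed with the same peak value.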

📝 Abstract

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2 features) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and, through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
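The plan-then-generate order described above, and the difference between internal and external conditioning, can be sketched with a toy control flow. Everything in this block is hypothetical: the function names, the string tokens, and the two stand-in callables are illustrative assumptions and not the paper's models or API. The only point is which tokens the image-generation step is allowed to see.

```python
from typing import Callable, List, Sequence

Token = str


def generate_with_visual_prompt(
    input_tokens: Sequence[Token],
    next_semantic_token: Callable[[Sequence[Token]], Token],
    decode_image: Callable[[Sequence[Token]], List[Token]],
    num_semantic_tokens: int,
    internal: bool = True,
) -> List[Token]:
    """Emit a semantic plan first, then decode image tokens from it."""
    context = list(input_tokens)
    plan: List[Token] = []

    # Phase 1: autoregressively emit the "visual prompt", one token at a
    # time, each conditioned on the input and the plan so far.
    for _ in range(num_semantic_tokens):
        plan.append(next_semantic_token(context + plan))

    # Phase 2: generate the image.
    # Internal design: the generator sees the original input AND the plan.
    # External design: the decoder sees only the plan (the bottleneck).
    conditioning = context + plan if internal else plan
    return decode_image(conditioning)


# Toy stand-ins (hypothetical) so the control flow can be run end to end.
def toy_next_semantic_token(context: Sequence[Token]) -> Token:
    return f"sem{len(context)}"


def toy_decode_image(conditioning: Sequence[Token]) -> List[Token]:
    return [f"img<{t}>" for t in conditioning]


source = ["src0", "src1"]  # e.g. tokens of an image to be edited
internal_out = generate_with_visual_prompt(
    source, toy_next_semantic_token, toy_decode_image, 2, internal=True)
external_out = generate_with_visual_prompt(
    source, toy_next_semantic_token, toy_decode_image, 2, internal=False)
print(internal_out)
print(external_out)
```

In the external variant the source tokens never reach the decoding step, which mirrors the abstract's explanation of why detail preservation suffers in editing. In the internal variant they stay in the conditioning alongside the plan.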

Problem

Research questions and friction points this paper is trying to address.

text-to-image generation

semantic representation

image editing

diffusion models

visual prompting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Prompt Engineering

semantic tokens

internal architecture