UniAPO: Unified Multimodal Automated Prompt Optimization

📅 2025-08-25
🤖 AI Summary
Existing automatic prompt optimization (APO) methods struggle to adapt to multimodal tasks, hindered by two key bottlenecks: visual token inflation, which strains the context window and leaves feedback sparse, and the absence of process-level supervision. This paper introduces UniAPO, the first unified prompt optimization framework for multimodal inputs (text, images, and videos). Its core innovations are: (1) an EM-inspired heuristic optimization procedure integrating short- and long-term memory, explicitly decoupling feedback modeling from prompt updating; (2) a historical feedback cache that alleviates context pressure while leveraging past prompts to guide optimization direction; and (3) explicit modeling of process-level supervision to strengthen gradient signals. Experiments on text, image, and video benchmarks show consistent performance gains, supporting UniAPO's effectiveness and its transferability across modalities and downstream tasks.

📝 Abstract
Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks, such as video-language generation, introduces two core challenges: (i) visual token inflation, where long visual token sequences restrict context capacity and result in insufficient feedback signals; and (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling from prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompts for multimodal tasks like video-language generation
Addressing visual token inflation and lack of process-level supervision
Developing unified framework for stable multimodal prompt optimization
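
To make the visual token inflation problem concrete, the following minimal sketch estimates how many feedback examples fit into a single optimizer context. Every number in it (context size, tokens per frame, frames per video, text length) is an illustrative assumption, not a figure from the paper, and `examples_that_fit` is a hypothetical helper.

```python
def examples_that_fit(
    context_tokens: int,
    prompt_tokens: int,
    tokens_per_example: int,
    reserved_output_tokens: int = 0,
) -> int:
    """Return how many whole examples fit in the remaining context budget."""
    available = context_tokens - prompt_tokens - reserved_output_tokens
    if available <= 0 or tokens_per_example <= 0:
        return 0
    return available // tokens_per_example


# Illustrative assumptions only (not taken from the paper).
CONTEXT = 32_000          # assumed context window of the optimizer model
PROMPT = 500              # assumed size of the optimization instructions
RESERVED = 1_000          # assumed budget kept free for the model's reply
TEXT_EXAMPLE = 300        # assumed tokens for one text-only failure case
FRAMES, PER_FRAME = 32, 256  # assumed video sampling and visual tokens per frame
VIDEO_EXAMPLE = FRAMES * PER_FRAME + TEXT_EXAMPLE

text_capacity = examples_that_fit(CONTEXT, PROMPT, TEXT_EXAMPLE, RESERVED)
video_capacity = examples_that_fit(CONTEXT, PROMPT, VIDEO_EXAMPLE, RESERVED)
print(text_capacity, video_capacity)
```

Under these assumed numbers, far fewer video cases than text cases fit into one pass, which is the sense in which long visual token sequences can leave the optimizer with insufficient feedback signals.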
Innovation

Methods, ideas, or system contributions that make the work stand out.

EM-inspired decoupled optimization process
Short-long term memory mechanism
Unified multimodal prompt optimization framework
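
The contributions listed above can be pictured with a small sketch. This is not the authors' implementation: it is a toy loop, assuming a feedback step and a refinement step that alternate (loosely EM-like), a bounded memory of recent feedback, and a history of scored prompts. All function names, the memory size, and the keyword-based scorer are hypothetical stand-ins for what would be model calls in practice.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

ScoredPrompt = Tuple[str, float]


def optimize_prompt(
    init_prompt: str,
    batches: List[List[str]],
    score: Callable[[str], float],
    make_feedback: Callable[[str, List[str]], str],
    refine: Callable[[str, List[str], List[ScoredPrompt]], str],
    feedback_memory_size: int = 3,  # hypothetical bound on stored feedback
) -> ScoredPrompt:
    """Alternate feedback modeling and prompt refinement, keeping two memories."""
    feedback_memory: Deque[str] = deque(maxlen=feedback_memory_size)
    prompt_memory: List[ScoredPrompt] = [(init_prompt, score(init_prompt))]
    prompt = init_prompt
    for batch in batches:
        # Feedback step: summarize what the current prompt gets wrong on this batch.
        feedback_memory.append(make_feedback(prompt, batch))
        # Refinement step: propose a new prompt from stored feedback and past prompts.
        candidate = refine(prompt, list(feedback_memory), prompt_memory)
        prompt_memory.append((candidate, score(candidate)))
        # Continue from the best prompt seen so far.
        prompt = max(prompt_memory, key=lambda p: p[1])[0]
    return max(prompt_memory, key=lambda p: p[1])


# Toy stand-ins for model calls (assumptions for illustration only).
KEYWORDS = ["objects", "actions", "order"]


def toy_score(prompt: str) -> float:
    return sum(k in prompt for k in KEYWORDS) / len(KEYWORDS)


def toy_feedback(prompt: str, batch: List[str]) -> str:
    missing = [k for k in KEYWORDS if k not in prompt]
    return missing[0] if missing else ""


def toy_refine(prompt: str, feedback: List[str], history: List[ScoredPrompt]) -> str:
    # Apply the most recent feedback item that the prompt does not cover yet.
    for item in reversed(feedback):
        if item and item not in prompt:
            return f"{prompt} Mention {item}."
    return prompt


best_prompt, best_score = optimize_prompt(
    "Describe the video.", [["v1"], ["v2"], ["v3"]],
    toy_score, toy_feedback, toy_refine,
)
print(best_prompt, best_score)
```

The point of the sketch is the separation of concerns: feedback is produced and stored independently of the step that rewrites the prompt, so the refinement step can draw on compact past feedback and past prompts rather than on raw visual inputs.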
Authors

Qipeng Zhu, ByteDance
Yanzhe Chen, ByteDance
Huasong Zhong, Tsinghua University
Yan Li, ByteDance
Jie Chen, Fudan University
Zhixin Zhang, Ph.D. of Robotics, University of Manchester (SLAM, VINS, LIO, Sensor Fusion, Robotics)
Junping Zhang, Fudan University
Zhenheng Yang, TikTok (Computer Vision, Machine Learning, Deep Learning)