🤖 AI Summary
Existing prompt optimization methods that collapse accuracy and inference cost into a weighted sum fix the trade-off in advance and struggle to explore the Pareto frontier between the two objectives. To overcome this, the authors propose CRAFT, a front-aware Pareto optimization framework that combines complementary accuracy-oriented and cost-oriented edit generators, a Pareto-gap acquisition strategy, and NSGA-II-based population retention. Under a limited budget of target-model validation calls, the method chooses which candidate prompts to validate in each round, avoiding the collapse of the solution set that scalarization causes. Across six classification and reasoning benchmarks, the retained prompt sets span both high-accuracy and low-cost regions, whereas accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions of the frontier.
📝 Abstract
Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than a search for a single prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
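To make the contrast between a Pareto front and a weighted sum concrete, the following is a minimal sketch and not the CRAFT implementation. The candidate names, accuracies, token counts, and the weight `w` are invented for illustration, and `dominates`, `pareto_front`, and `weighted_best` are hypothetical helpers. It shows only that non-dominated filtering keeps several trade-off points, while a single fixed weight returns one of them.

```python
from typing import List, Tuple

# (prompt id, validation accuracy, prompt-token cost); all values are made up.
Candidate = Tuple[str, float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def weighted_best(cands: List[Candidate], w: float) -> Candidate:
    """Scalarized choice: maximize accuracy minus w times token cost."""
    return max(cands, key=lambda c: c[1] - w * c[2])


candidates: List[Candidate] = [
    ("short", 0.70, 40),
    ("medium", 0.80, 120),
    ("long", 0.85, 400),
    ("bloated", 0.78, 500),  # dominated by "medium": less accurate and costlier
]

front = pareto_front(candidates)
picked = weighted_best(candidates, w=0.001)

print("Pareto front:", [c[0] for c in front])
print("Weighted-sum pick (w=0.001):", picked[0])
```

With these illustrative numbers the front retains three prompts, one per region of the trade-off, whereas the weighted sum commits to a single prompt determined by a weight chosen before the search.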