IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

📅 2024-11-09

🏛️ arXiv.org

📈 Citations: 13

✨ Influential: 2

🤖 AI Summary

Problem: Large language models (LLMs) struggle to follow complex instructions, and progress is limited by scarce evaluation data and the absence of alignment algorithms dedicated to this setting. Method: This paper introduces TRACE, a benchmark for improving and evaluating complex instruction following that comprises 120K training samples and 1K evaluation samples, and proposes Input-Output Preference Optimization (IOPO), an alignment method that considers preference pairs over both instruction inputs and response outputs. Contribution/Results: Compared with supervised fine-tuning (SFT) and direct preference optimization (DPO), IOPO improves results by 8.15% and 2.18%, respectively, on in-domain data, and by 6.29% and 3.13%, respectively, on out-of-domain data.

📝 Abstract

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a certain amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing 8.15% and 2.18% improvements on in-domain data and 6.29% and 3.13% improvements on out-of-domain data compared to SFT and DPO, respectively.
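For context on the DPO baseline named in the abstract, the sketch below computes the standard per-example DPO loss, which expresses a preference over outputs only: a chosen response versus a rejected response to the same instruction. It is not the IOPO objective. The abstract says only that IOPO also considers input (instruction) preference pairs and does not give the formula, so no input-side term is shown. The function names and the default `beta=0.1` are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (instruction, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen margin exceeds the rejected margin.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference model, both margins are zero and the loss is log 2. The loss falls below that value as the policy shifts probability toward the chosen response.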

Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' ability to follow complex instructions

Addressing lack of dedicated algorithms for complex instructions

Improving performance on in-domain and out-of-domain datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the TRACE benchmark (120K training and 1K evaluation samples) for improving and evaluating complex instruction following

Proposes IOPO method for input-output preference optimization

Reports gains over SFT (8.15% in-domain, 6.29% out-of-domain) and over DPO (2.18% in-domain, 3.13% out-of-domain)
