Aligning Language Models with Demonstrated Feedback

📅 2024-06-02
📈 Citations: 17
Influential: 3
📄 PDF
🤖 AI Summary
Large language models (LLMs) tend to produce generic outputs, limiting their adaptability to individual user styles and domain-specific tasks. To address this, we propose DITTO, a novel framework that establishes an online preference feedback mechanism via imitation learning using fewer than ten user-provided demonstrations—eliminating the need for manual annotation or large-scale supervised datasets. DITTO dynamically constructs pairwise preference relations between LLM-generated responses and expert demonstrations in an online imitation learning paradigm, and is compatible with preference optimization algorithms such as DPO, enabling continual alignment across model checkpoints. Extensive experiments across news, email, and blog domains demonstrate that DITTO achieves an average 19-percentage-point improvement in win rate over few-shot prompting, supervised fine-tuning, and self-play baselines. A user study with 16 participants further confirms its effectiveness in personalized alignment.

📝 Abstract
Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number (<10) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns language model outputs to a user's demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users' demonstrations as preferred over output from the LLM and its intermediate checkpoints. Concretely, DITTO operates by having an LLM generate examples that are presumed to be inferior to expert demonstrations. The method iteratively constructs pairwise preference relationships between these LLM-generated samples and expert demonstrations, potentially including comparisons between different training checkpoints. These constructed preference pairs are then used to train the model using a preference optimization algorithm (e.g. DPO). We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants (N = 16). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an avg. of 19% points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs to specific settings with minimal demonstrations
Reducing reliance on large datasets for ad-hoc task alignment
Improving model output personalization via user demonstration feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a small number (<10) of user demonstrations directly as feedback
Cheaply generates online comparison data by treating demonstrations as preferred over model outputs
Trains the model with a preference optimization algorithm (e.g., DPO)
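The pair construction at the heart of DITTO can be sketched in a few lines: demonstrations are always preferred over model samples, and samples from later checkpoints are preferred over samples from earlier ones ("replay" comparisons). The function name `build_ditto_pairs` and the toy inputs below are illustrative, not the authors' code; the resulting `(chosen, rejected)` tuples would feed a DPO-style loss.

```python
def build_ditto_pairs(demos, samples_by_ckpt):
    """Construct DITTO-style preference pairs.

    demos: list of expert demonstration strings.
    samples_by_ckpt: list of sample lists, one per checkpoint,
        ordered from earliest to latest checkpoint.

    Returns (chosen, rejected) tuples where:
      - every demonstration is preferred over every model sample, and
      - samples from a later checkpoint are preferred over samples
        from any earlier checkpoint.
    """
    pairs = []
    # Expert demonstrations beat all model outputs, at every checkpoint.
    for demo in demos:
        for samples in samples_by_ckpt:
            for sample in samples:
                pairs.append((demo, sample))
    # Later checkpoints beat earlier ones (replay comparisons).
    for later in range(1, len(samples_by_ckpt)):
        for earlier in range(later):
            for chosen in samples_by_ckpt[later]:
                for rejected in samples_by_ckpt[earlier]:
                    pairs.append((chosen, rejected))
    return pairs


if __name__ == "__main__":
    demos = ["Expert-written email."]
    samples_by_ckpt = [
        ["draft from checkpoint 0"],
        ["draft from checkpoint 1"],
    ]
    for chosen, rejected in build_ditto_pairs(demos, samples_by_ckpt):
        print(f"{chosen!r} > {rejected!r}")
```

Each iteration of the method would regenerate samples from the current checkpoint, rebuild these pairs, and run a preference-optimization update, so the comparison data stays "online" without any human preference annotation.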