Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

📅 2026-05-30

🤖 AI Summary
Conventional language-model fine-tuning pairs each prompt with a single response, which collapses a multi-modal output distribution to one sample per prompt. The paper calls this the “mode lottery”: training emphasizes some valid response modes and leaves others underrepresented. The authors study Multi-Response Training (MRT), which retains multiple valid responses per prompt and treats prompts and responses as distinct statistical resources, so that choosing how many of each to keep becomes a data-allocation problem. The analysis yields a variance-budget tradeoff between prompts and responses, identifies Random-K-of-N as the unbiased default for response selection, shows that reward-based selection can induce mode collapse, and proposes a submodular objective that balances quality and diversity. Controlled simulations and experiments on structured and real-world datasets, including a new multi-prompt, multi-response benchmark, show that MRT improves distributional generalization, with the largest gains when response diversity is high and prompt redundancy is low.
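One standard way to see how a prompt/response variance tradeoff can arise is the law of total variance for two-stage sampling. The sketch below is illustrative and is not taken from the paper: the estimator $\hat{\mu}$, the statistic $g$, the variance components $\sigma_P^2$ and $\sigma_R^2$, and the per-prompt and per-response costs $c_p$ and $c_r$ are all assumed notation, and the paper's own formulation may differ.

```latex
% Assumed setup: M prompts x_i drawn i.i.d. from p(x), and for each prompt
% K responses y_{ij} drawn i.i.d. from p(y | x_i).
\hat{\mu} = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{K} \sum_{j=1}^{K} g(x_i, y_{ij})

% Law of total variance: a prompt-level term and a response-level term.
\operatorname{Var}(\hat{\mu})
  = \frac{\sigma_P^2}{M} + \frac{\sigma_R^2}{M K},
\qquad
\sigma_P^2 = \operatorname{Var}_x\!\big(\mathbb{E}[g \mid x]\big),
\quad
\sigma_R^2 = \mathbb{E}_x\!\big[\operatorname{Var}(g \mid x)\big]

% Assumed linear cost model with total budget B = M (c_p + K c_r).
% Minimizing the variance under this budget gives
K^{\ast} = \sqrt{\frac{c_p \, \sigma_R^2}{c_r \, \sigma_P^2}}
```

Under these assumptions, increasing $K$ only shrinks the second term, so returns diminish once the prompt-level term $\sigma_P^2 / M$ dominates. The optimum $K^{\ast}$ exceeds one when responses are cheap relative to prompts and response-level variance is large relative to prompt-level variance.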
📝 Abstract
Modern language-model fine-tuning typically pairs each prompt with a single response, even though many prompts admit multiple valid completions. This effectively reduces a multi-modal conditional distribution to a one-sample view, a phenomenon we call the "mode lottery," where training emphasizes a subset of plausible modes while leaving others underrepresented. We study multi-response training (MRT), which retains multiple responses per prompt, and develop a principled account of when and why it helps. Our key insight is that prompts and responses are distinct statistical resources: additional prompts reduce uncertainty about the input distribution, while additional responses reduce uncertainty about the conditional output distribution. This yields a variance-budget tradeoff that predicts when retaining multiple responses is worthwhile, shows diminishing returns as prompt-level uncertainty dominates, and explains why large redundant corpora can exhibit an implicit multi-response effect. We further analyze response selection, and show that Random-K-of-N is the unbiased default for distributional fine-tuning, reward-based selection can induce mode collapse, and a submodular quality-diversity objective provides an efficient alternative with theoretical guarantees. Controlled simulations validate the predicted variance and selection effects, including a striking failure mode where reward-only selection produces gradients misaligned with the true objective. Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, MRT consistently improves distributional generalization, with the largest gains in high response-diversity, low prompt-redundancy regimes. MRT reframes response multiplicity as a data-allocation problem with clear guidance: when responses are cheap and diverse, keeping more than one is not a heuristic, but a statistically grounded choice.
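The contrast between uniform and reward-only selection can be illustrated with a toy simulation. This is a minimal sketch under assumed conditions, not the paper's implementation: the function names (`random_k_of_n`, `top_k_by_reward`, `mode_share`), the two-mode pool, and the reward values are all hypothetical.

```python
import random
from collections import Counter

def random_k_of_n(responses, k, rng):
    """Keep k of the n candidates uniformly at random, without replacement."""
    return rng.sample(responses, k)

def top_k_by_reward(responses, k, reward):
    """Keep the k highest-reward candidates (reward-only selection)."""
    return sorted(responses, key=reward, reverse=True)[:k]

def mode_share(selected):
    """Fraction of the selected responses that belongs to each mode."""
    counts = Counter(mode for mode, _reward in selected)
    total = sum(counts.values())
    return {mode: count / total for mode, count in counts.items()}

# Hypothetical pool for one prompt: each response is a (mode, reward) pair.
# Six responses come from mode "A" and four from mode "B". Both modes are
# valid, but the assumed reward scores every "A" response above every "B".
pool = [("A", 1.0 + 0.01 * i) for i in range(6)]
pool += [("B", 0.9 + 0.01 * i) for i in range(4)]

K = 3
TRIALS = 20000
rng = random.Random(0)

# Average mode share under Random-K-of-N across many independent draws.
totals = Counter()
for _ in range(TRIALS):
    for mode, _reward in random_k_of_n(pool, K, rng):
        totals[mode] += 1
random_share = {mode: count / (K * TRIALS) for mode, count in totals.items()}

# Mode share under reward-only top-K selection (deterministic).
reward_share = mode_share(top_k_by_reward(pool, K, reward=lambda r: r[1]))

print("Random-K-of-N share:", random_share)
print("Reward-only share:  ", reward_share)
```

In this toy pool, uniform selection keeps each candidate with equal probability, so the expected mode shares match the pool's 60/40 split. Reward-only selection keeps only mode "A" responses, so mode "B" never reaches the training set, which is the kind of collapse the abstract warns about.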
Problem

Research questions and friction points this paper is trying to address.

mode lottery
multi-response training
language model generalization
conditional distribution
response diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Response Training
Mode Lottery
Distributional Generalization
Variance-Budget Tradeoff
Response Selection