🤖 AI Summary
Open-ended tasks—such as creative writing and instruction following—lack canonical answers, rendering conventional verification-based approaches like Reinforcement Learning with Verifiable Rewards (RLVR) inapplicable. Method: We propose Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that automatically restructures open-ended data into multiple-choice questions, thereby constructing auditable, verifiable supervision signals. VMR enables automated assessment and optimization of model reasoning without human annotation. Contribution/Results: To our knowledge, this is the first work to successfully adapt answer-dependent RLVR frameworks to tasks with no ground-truth solutions. Evaluated across eight open-ended benchmarks, VMR achieves an average improvement of 5.99 points over strong baselines, demonstrating both the critical role of reasoning in open-ended tasks and the effectiveness and generalizability of VMR.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs), achieving remarkable progress in domains such as mathematics and programming where standard answers are available. However, for open-ended tasks lacking ground-truth solutions (e.g., creative writing and instruction following), existing studies typically regard them as non-reasoning scenarios, thereby overlooking the latent value of reasoning capabilities. This raises a key question: Can strengthening reasoning improve performance in open-ended tasks? To address this, we explore the transfer of the RLVR paradigm to the open domain. Yet, since RLVR fundamentally relies on verifiers that presuppose the existence of standard answers, it cannot be directly applied to open-ended tasks. To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across eight open-ended benchmarks, our VMR-based training delivers an average gain of 5.99 points over the baseline. Code will be released upon acceptance to facilitate reproducibility.
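The abstract describes VMR only at a high level, so the following is a minimal sketch of how such a reformulation could work: two open-ended responses of differing quality are packed into a shuffled multiple-choice question whose correct letter is known, giving an RLVR-style binary reward without a canonical answer. The function names, the two-option format, and the exact-match reward are assumptions for illustration, not the paper's specification.

```python
import random

def build_vmr_item(prompt, preferred, other, rng=None):
    """Sketch of a VMR-style reformulation (assumed format):
    shuffle a preferred and a non-preferred response into a
    two-option multiple-choice question with a known answer key."""
    rng = rng or random.Random()
    options = [preferred, other]
    rng.shuffle(options)  # hide which option is preferred
    letters = ["A", "B"]
    answer = letters[options.index(preferred)]  # verifiable ground truth
    question = (
        f"{prompt}\n\nWhich response is better?\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    )
    return question, answer

def vmr_reward(model_choice, gold_answer):
    """RLVR-style binary reward: 1 if the model picks the
    verifiable correct option, else 0 (assumed reward shape)."""
    return 1.0 if model_choice.strip().upper() == gold_answer else 0.0
```

In a training loop, the policy model would answer the reformulated question and receive `vmr_reward` as its verifiable signal, allowing standard RLVR optimization to proceed on otherwise unverifiable open-ended data.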