AI Summary
This work addresses the challenges of policy space explosion and credit assignment difficulty that arise when a single agent simultaneously performs evidence retrieval and answer generation in multi-step reasoning. To mitigate these issues, the authors propose a role-decomposed multi-agent framework that decouples the task into specialized retriever and generator agents. Efficient cross-agent credit assignment is achieved by using the generator's abstention signal as a reward for the retriever. Built upon a shared backbone, the model employs parameter-efficient LoRA modules to enable collaborative training and cross-agent learning signal propagation. Additionally, robustness is enhanced through hard positive evidence augmentation. The approach significantly outperforms existing methods that rely on full fine-tuning of monolithic models on both general and multi-hop question answering benchmarks.
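The shared-backbone arrangement described above can be illustrated with a toy example. The following is a minimal sketch, assuming the standard LoRA formulation (a frozen weight plus a scaled low-rank update); it is not the authors' implementation. The class names, role keys, and numeric values are hypothetical, and real adapters would be trained rather than set by hand.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRAAdapter:
    """Low-rank update (alpha / r) * B @ A applied on top of a frozen weight."""

    def __init__(self, a: Matrix, b: Matrix, alpha: float) -> None:
        self.a = a  # shape: r x d_in
        self.b = b  # shape: d_out x r
        self.scale = alpha / len(a)  # len(a) is the rank r

    def delta(self, x: Vector) -> Vector:
        return [self.scale * y for y in matvec(self.b, matvec(self.a, x))]


class SharedLinear:
    """One frozen backbone weight shared by several role-specific adapters."""

    def __init__(self, weight: Matrix, adapters: Dict[str, LoRAAdapter]) -> None:
        self.weight = weight      # frozen, shared across roles
        self.adapters = adapters  # one small adapter per role (hypothetical keys)

    def forward(self, x: Vector, role: str) -> Vector:
        base = matvec(self.weight, x)
        return [u + v for u, v in zip(base, self.adapters[role].delta(x))]


# Toy values chosen by hand for illustration only.
layer = SharedLinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    adapters={
        "searcher": LoRAAdapter(a=[[1.0, 0.0]], b=[[0.0], [1.0]], alpha=1.0),
        "generator": LoRAAdapter(a=[[0.0, 1.0]], b=[[1.0], [0.0]], alpha=2.0),
    },
)
x = [2.0, 3.0]
searcher_out = layer.forward(x, "searcher")
generator_out = layer.forward(x, "generator")
```

The same input yields different outputs depending on the active role, while the backbone weight is stored only once. Only the small `a` and `b` matrices would need to be trained for each agent.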
Abstract
Modern language agents that perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple, potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role-decomposed multi-agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role-specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent's reward, providing structured cross-agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments through hard-positive evidence augmentation, improving its robustness. Experiments on general and multi-hop QA benchmarks show that DAC, implemented via parameter-efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine-tuning of monolithic models.
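The credit assignment argument can be made concrete with a small example. The abstract does not give the exact reward formula, so the following is a minimal sketch under stated assumptions: the searcher reward is taken to be binary and driven only by whether the generator abstains, and the function names, the `GeneratorOutput` type, and the reward values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneratorOutput:
    # None is a hypothetical marker meaning the generator abstained
    # because it judged the retrieved evidence insufficient.
    answer: Optional[str]


def exact_match(prediction: str, gold: str) -> bool:
    """Case-insensitive exact match, a common QA correctness check."""
    return prediction.strip().lower() == gold.strip().lower()


def coupled_reward(gen: GeneratorOutput, gold: str) -> float:
    """Single-policy style reward: search and generation share one signal."""
    if gen.answer is not None and exact_match(gen.answer, gold):
        return 1.0
    return 0.0


def searcher_reward(
    gen: GeneratorOutput,
    sufficient_reward: float = 1.0,  # assumed value
    abstain_reward: float = 0.0,     # assumed value
) -> float:
    """Role-specific reward: depends only on the generator's abstention signal."""
    return abstain_reward if gen.answer is None else sufficient_reward


gold = "Paris"
wrong_answer = GeneratorOutput(answer="Lyon")  # evidence judged sufficient, answer wrong
abstained = GeneratorOutput(answer=None)       # evidence judged insufficient

coupled_wrong = coupled_reward(wrong_answer, gold)
searcher_wrong = searcher_reward(wrong_answer)
searcher_abstained = searcher_reward(abstained)
```

In the first case the coupled reward penalizes the whole trajectory even though the generator accepted the evidence, whereas the role-specific reward still credits the searcher. A real system would likely combine this signal with other terms; this sketch isolates the abstention component only.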