🤖 AI Summary
To address the limited quality and robustness of summaries produced by a single large language model (LLM), this paper proposes a multi-LLM collaborative summarization framework built on a two-stage generation–evaluation paradigm: *k* LLMs generate diverse candidate summaries in parallel, after which the best summary is selected via either centralized evaluation (a single LLM scores all candidates) or decentralized evaluation (all *k* LLMs cross-evaluate the candidates). This is presented as the first systematic comparison of these two multi-model coordination mechanisms for summarization. Experiments show that the multi-LLM approaches outperform single-LLM baselines by up to 3x on standard metrics such as ROUGE and BERTScore, improving summary accuracy and consistency. The results support the effectiveness of the multi-LLM collaborative paradigm.
📝 Abstract
In this work, we propose a multi-LLM summarization framework and investigate two strategies: centralized and decentralized. At each round of conversation, the framework performs two fundamental steps, generation and evaluation, which differ depending on which strategy is used. In both strategies, *k* different LLMs generate diverse summaries of the text. During evaluation, however, the centralized approach uses a single LLM to evaluate the candidate summaries and select the best one, whereas the decentralized approach uses all *k* LLMs. Overall, we find that our multi-LLM summarization approaches significantly outperform baselines that use only a single LLM, by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.
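The two-stage generate-then-evaluate loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the LLMs and the judge are stand-in functions (here, a toy scorer that prefers shorter summaries), and in practice each would be an API call to a different model with carefully designed prompts.

```python
def generate(text, models):
    """Stage 1: each of the k LLMs produces a candidate summary."""
    return [model(f"Summarize: {text}") for model in models]

def centralized_select(candidates, judge):
    """Centralized evaluation: a single judge LLM scores every candidate
    and the highest-scoring summary is selected."""
    scores = [judge(c) for c in candidates]
    return candidates[scores.index(max(scores))]

def decentralized_select(candidates, judges):
    """Decentralized evaluation: all k LLMs score every candidate;
    the candidate with the highest total score is selected."""
    totals = [sum(j(c) for j in judges) for c in candidates]
    return candidates[totals.index(max(totals))]

# Toy demo with stubbed "LLMs" (deterministic functions).
models = [
    lambda prompt: "Cats sleep a lot.",
    lambda prompt: "Cats sleep for most of the day.",
]
judge = lambda summary: -len(summary)  # toy scorer: shorter is better

candidates = generate("Cats sleep 12-16 hours daily.", models)
best_central = centralized_select(candidates, judge)
best_decentral = decentralized_select(candidates, [judge, judge])
```

In this toy run both strategies pick the shorter candidate, since the two stub judges agree; the interesting cases in the paper arise when different LLM evaluators disagree and consensus must be aggregated.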