🤖 AI Summary
Problem: A single pre-trained large language model (LLM) can exhibit unstable or inconclusive performance on complex natural language tasks such as text-to-SQL generation, scientific question answering, and bias detection, and it is often unclear in advance which model to choose.
Method: This paper proposes a genetic-algorithm framework driven by a multi-source LLM population. It initializes a heterogeneous population of candidate solutions from diverse open- and closed-source LLMs and iteratively optimizes them via selection, crossover, mutation, and a neutral fitness function, systematically exploring cooperative co-evolution among LLMs for the first time.
Contribution/Results: The core innovation is leveraging complementary capabilities across LLMs to overcome the performance bottlenecks of any single model, without fine-tuning or human annotation. Experiments show that the framework achieves accuracy comparable to the best individual LLM across multiple reasoning benchmarks while exhibiting strong generalization and robustness, establishing a novel paradigm for LLM ensemble optimization.
📝 Abstract
Large Language Models (LLMs) are widely used across research domains to tackle complex tasks, but their performance can vary significantly depending on the task at hand. Evolutionary algorithms, inspired by natural selection, can be used to refine candidate solutions iteratively at inference time. To the best of our knowledge, no prior work has explored leveraging the collective capabilities of multi-source seeding for LLM-guided genetic algorithms. In this paper, we introduce a novel approach, MultiGA, which applies genetic algorithm principles to complex natural language tasks and reasoning problems by sampling from a diverse population of LLMs to initialize the population. MultiGA generates a range of outputs from various parent LLMs, both open-source and closed-source, and evaluates them with a neutral fitness function. Through an iterative recombination process, we mix and refine these generations until an optimal solution is reached. We benchmark our approach on text-to-SQL code generation, trip planning, the GPQA benchmark of graduate-level science questions, and the BBQ bias benchmark. Our results show that MultiGA converges to the accuracy of the LLM best suited to the task, and these insights lay the foundation for future research on integrating multiple LLMs for unexplored tasks where selecting a single pre-trained model is unclear or suboptimal.
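The loop the abstract describes (seed a population from several models, score with a neutral fitness function, then select, cross over, and mutate until convergence) can be sketched in miniature. This is not the paper's implementation: the stub "models", the token-overlap fitness, and the truncation selection are all illustrative assumptions standing in for real LLM calls and task-specific scoring.

```python
import random

# Hypothetical stand-ins for heterogeneous LLM backends (assumption:
# the real system would call different open- and closed-source models).
def stub_model_a(prompt): return prompt + " A"
def stub_model_b(prompt): return prompt + " B"
def stub_model_c(prompt): return prompt + " C"

def fitness(candidate, target):
    """Illustrative neutral fitness: fraction of target tokens present."""
    target_tokens = target.split()
    hits = sum(tok in candidate.split() for tok in target_tokens)
    return hits / len(target_tokens)

def crossover(parent1, parent2, rng):
    """Token-level one-point crossover of two candidate answers."""
    t1, t2 = parent1.split(), parent2.split()
    cut = rng.randint(0, min(len(t1), len(t2)))
    return " ".join(t1[:cut] + t2[cut:])

def mutate(candidate, vocab, rng, rate=0.1):
    """Randomly replace tokens with tokens drawn from a shared vocabulary."""
    tokens = [rng.choice(vocab) if rng.random() < rate else t
              for t in candidate.split()]
    return " ".join(tokens)

def multi_ga(prompt, target, models, generations=20, seed=0):
    rng = random.Random(seed)
    # Multi-source seeding: the initial population comes from several models.
    population = [m(prompt) for m in models]
    vocab = sorted({t for c in population for t in c.split()})
    for _ in range(generations):
        scored = sorted(population, key=lambda c: fitness(c, target),
                        reverse=True)
        if fitness(scored[0], target) == 1.0:
            break  # optimal solution reached
        # Truncation selection keeps the fitter half as parents.
        parents = scored[: max(2, len(scored) // 2)]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   vocab, rng)
            for _ in range(len(population) - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda c: fitness(c, target))
```

Because the fittest parents survive each generation, the best fitness in the population is non-decreasing, which is the mechanism behind the convergence behavior the paper reports.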