🤖 AI Summary
To address the imbalanced cross-category performance of single large language models (LLMs) in source-code time-complexity prediction, this paper proposes a Multi-Expert Consensus System. We design a performance-aware expert role assignment mechanism that specializes individual LLMs in distinct complexity classes, and we introduce a structured multi-agent debate framework that integrates class-specific instructions with a weighted ensemble strategy to mitigate reasoning degradation and prevent convergence to incorrect majority opinions, all without requiring an external adjudicator model. Evaluated on the CodeComplex benchmark, our method achieves at least 10% average improvements in accuracy and macro-F1 over open-source baselines, outperforms GPT-4o-mini, and matches GPT-4o's performance. Our core contribution is the first LLM collaboration paradigm for complexity prediction that is both adjudicator-free and class-adaptive.
📝 Abstract
Predicting the complexity of source code is essential for software development and algorithm analysis. Recently, Baik et al. (2025) introduced CodeComplex for code time-complexity prediction. Their paper shows that LLMs without fine-tuning struggle with certain complexity classes, suggesting that no single LLM excels at every class; rather, each model has advantages in particular classes. We propose MEC$^3$O, a multi-expert consensus system that extends multi-agent debate frameworks. MEC$^3$O assigns LLMs to complexity classes based on their performance and provides them with class-specialized instructions, turning them into experts. These experts engage in structured debates, and their predictions are integrated through a weighted consensus mechanism. Our expertise assignments effectively mitigate Degeneration-of-Thought, reduce reliance on a separate judge model, and prevent convergence to incorrect majority opinions. Experiments on CodeComplex show that MEC$^3$O outperforms open-source baselines, achieving at least 10% higher accuracy and macro-F1 scores. On average, it also surpasses GPT-4o-mini in macro-F1 and achieves F1 scores on par with GPT-4o and GPT-o4-mini. This demonstrates the effectiveness of multi-expert debate and the weighted consensus strategy in generating final predictions. Our code and data are available at https://github.com/suhanmen/MECO.
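To make the weighted consensus idea concrete, here is a minimal sketch of performance-weighted voting among expert predictions. The function name, the `predictions`/`expert_weights` structures, and the example values are illustrative assumptions, not the paper's actual implementation:

```python
from collections import defaultdict

def weighted_consensus(predictions, expert_weights):
    """Combine expert predictions via performance-weighted voting.

    predictions    : dict mapping expert name -> predicted complexity class
    expert_weights : dict mapping (expert, class) -> weight, e.g. that
                     expert's validation F1 on the class (an assumption).
    """
    scores = defaultdict(float)
    for expert, label in predictions.items():
        # An expert's vote counts more for classes it is strong on.
        scores[label] += expert_weights.get((expert, label), 1.0)
    # Final prediction: the class with the highest accumulated weight.
    return max(scores, key=scores.get)

# Hypothetical example: three experts debate one code snippet.
preds = {"expert_A": "O(n log n)", "expert_B": "O(n)", "expert_C": "O(n log n)"}
weights = {
    ("expert_A", "O(n log n)"): 0.9,
    ("expert_B", "O(n)"): 0.6,
    ("expert_C", "O(n log n)"): 0.8,
}
print(weighted_consensus(preds, weights))  # -> O(n log n)
```

In this toy case the two experts favoring O(n log n) carry more accumulated weight (1.7 vs. 0.6), so a strong minority cannot be overturned by an unweighted majority, which is the intuition behind preventing convergence to incorrect majority opinions.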