🤖 AI Summary
Large language models (LLMs) exhibit weak reasoning capabilities and suboptimal deployment performance on multi-domain programming tasks. Method: We propose a cloud-edge collaborative multi-agent prompting framework: a lightweight GuideLLM at the edge performs problem understanding; a powerful SolverLLM in the cloud handles complex code generation; and a JudgeLLM automatically evaluates outputs and provides optimization feedback. Contribution/Results: We introduce RefactorCoderQA, the first benchmark covering software engineering, data science, machine learning, and natural language processing, built from real Stack Overflow coding challenges with fine-grained evaluation tasks. Our RefactorCoder-MoE model achieves 76.84% accuracy on this benchmark, significantly outperforming state-of-the-art open-source and commercial LLMs. Human evaluation confirms solution accuracy, interpretability, and practical utility. System benchmarks demonstrate balanced latency and throughput, validating the architecture's feasibility for real-world deployment.
📝 Abstract
To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator that assesses solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance LLM performance across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.
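The guide–solve–judge loop described above can be sketched as a minimal orchestration skeleton. All names here (`RefactorPipeline`, the prompt wording, the `PASS` verdict convention, `max_rounds`) are illustrative assumptions, not the paper's actual implementation; each agent is reduced to a plain prompt-in, text-out callable so the control flow is visible without any model backend.

```python
from dataclasses import dataclass
from typing import Callable

# An agent is modeled as a plain callable: prompt in, text out.
# In a real deployment these would wrap an edge model (GuideLLM),
# a cloud model (SolverLLM), and an evaluator (JudgeLLM).
Agent = Callable[[str], str]

@dataclass
class RefactorPipeline:
    guide: Agent    # lightweight edge model: problem understanding
    solver: Agent   # powerful cloud model: code generation
    judge: Agent    # automated evaluator: verdict plus feedback
    max_rounds: int = 2  # hypothetical cap on refinement iterations

    def run(self, problem: str) -> str:
        # Edge step: produce methodological guidance for the solver.
        guidance = self.guide(f"Outline a solution approach for:\n{problem}")
        # Cloud step: generate an initial code solution.
        solution = self.solver(f"Problem:\n{problem}\nGuidance:\n{guidance}")
        for _ in range(self.max_rounds):
            verdict = self.judge(
                f"Problem:\n{problem}\nCandidate solution:\n{solution}"
            )
            if verdict.startswith("PASS"):
                break
            # Feed the judge's critique back to the solver for refinement.
            solution = self.solver(
                f"Problem:\n{problem}\nPrevious solution:\n{solution}\n"
                f"Reviewer feedback:\n{verdict}\nRevise the solution."
            )
        return solution

# Toy stand-ins so the sketch runs without any model backend.
if __name__ == "__main__":
    pipe = RefactorPipeline(
        guide=lambda p: "Use a loop and accumulate the sum.",
        solver=lambda p: "def total(xs): return sum(xs)",
        judge=lambda p: "PASS: correct and idiomatic",
    )
    print(pipe.run("Sum a list of integers in Python."))
```

Splitting the agents behind a single callable interface is one way to keep the cloud-edge placement a deployment detail: the same loop runs whether `guide` calls a local quantized model or `solver` calls a remote API.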