Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models

📅 2025-12-04

📈 Citations: 0

✨ Influential: 0

career value

200K/year

🤖 AI Summary

Prior work lacks systematic, cross-domain, multidimensional evaluation of both general-purpose and code-specialized large language models (LLMs). Method: This paper introduces the first unified benchmark covering six task categories—language understanding, mathematical reasoning, trustworthiness assessment, and others—and systematically evaluates five general-purpose LLMs (e.g., Llama-3-8B, Mistral-7B) and three code-specialized models (e.g., CodeLlama), augmented by in-depth analysis of code interpretation capability, output consistency, and trustworthiness using the CoNaLa dataset. Contribution/Results: Contrary to common assumptions, code-specialized models not only dominate on coding tasks but also outperform leading general-purpose models on select mathematical reasoning and syntactic precision benchmarks, indicating underappreciated generalization capacity. This work establishes a reproducible, multidimensional evaluation framework and offers novel insights into LLM capability landscapes.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model capabilities, a systematic cross-domain comparison that unifies linguistic, reasoning, and code understanding abilities remains underexplored. In this work, we present a comprehensive evaluation of five general-purpose and three code-specific state-of-the-art LLMs across six diverse benchmarks encompassing linguistic competence, mathematical reasoning, and trustworthiness. Additionally, we analyze model behavior on the CoNaLa dataset for code explanation, comparing natural language and code-specialized LLMs. Our findings reveal that models optimized for code (e.g., CodeLLaMA variants) exhibit strong reasoning and syntactic precision, that even for non-coding tasks can show measurable performance gains, in contrast to general-purpose models like Mistral-7B and Llama-3-8B.

Problem

Research questions and friction points this paper is trying to address.

Systematically compare general-purpose and code-specific LLMs across domains

Evaluate linguistic, reasoning, and code understanding abilities in unified benchmarks

Analyze performance differences on code explanation and non-coding tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-domain benchmarking of general-purpose and code-specific LLMs

Evaluation across linguistic, reasoning, and code understanding tasks

Analysis of code-optimized models' performance on non-coding tasks

🔎 Similar Papers

How Does Code Pretraining Affect Language Model Task Performance?