🤖 AI Summary
Existing benchmarks for evaluating code large language models (LLMs) suffer from insufficient task coverage, limited robustness, and poor reliability, and therefore fail to reflect model trustworthiness in realistic software engineering scenarios. To address this, we propose the first comprehensive evaluation framework for assessing the trustworthiness and reliability of code LLMs, integrating multi-task, multi-language, and multimodal understanding with semantic-preserving robustness testing. Our method introduces adaptive decoding, diverse prompt engineering, and a multi-dimensional scoring mechanism to enhance assessment validity. The framework supports semantically equivalent code transformations, multimodal input parsing, and adaptive output verification. A systematic evaluation of 26 state-of-the-art code models reveals that current multimodal models exhibit significant performance degradation on UI code generation and editing tasks, exposing critical capability gaps in real-world applicability and trustworthiness.
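To make the idea of a semantic-preserving transformation concrete, the sketch below renames identifiers in a Python snippet without changing its behavior, producing a variant on which a robust model should perform equivalently. This is a minimal illustration of the general technique; the class and function names are our own and do not reflect TREAT's actual implementation.

```python
import ast

class VariableRenamer(ast.NodeTransformer):
    """Consistently rename identifiers; semantics are unchanged."""

    def __init__(self, mapping: dict):
        self.mapping = mapping  # e.g. {"total": "acc"}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Rename variable uses and assignments.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Rename function parameters to match.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

def rename_variables(source: str, mapping: dict) -> str:
    """Return a semantically equivalent variant of `source`."""
    tree = ast.parse(source)
    tree = VariableRenamer(mapping).visit(tree)
    return ast.unparse(tree)  # requires Python 3.9+

original = "def mean(xs):\n    total = sum(xs)\n    return total / len(xs)"
variant = rename_variables(original, {"xs": "values", "total": "s"})
# Both versions compute the same function; comparing a model's outputs
# on the two prompts gives one signal of robustness.
```

Renaming is only one such transformation; others in the same family include reordering independent statements or rewriting loops as comprehensions, all of which preserve program behavior while perturbing surface form.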
📝 Abstract
Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models' trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reliability of models. To bridge this gap, we present an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance in code intelligence tasks. Our evaluation framework addresses key limitations in existing approaches with four main improvements: (1) Multi-Task Holistic Evaluation that spans diverse software engineering activities rather than a narrow set of coding tasks; (2) Multi-Language and Multi-Modality Assessment that extends beyond traditional single-language, text-only benchmarks to include multi-modality coding tasks; (3) Robustness Assessment that evaluates model reliability under semantic-preserving code transformations; and (4) Rigorous Evaluation Methodology that enhances the trustworthiness of evaluation results through diverse evaluation prompts and adaptive solution extraction. Based on this evaluation framework, we assess 26 state-of-the-art models and uncover both their strengths and limitations, yielding several key insights: (1) Current models show substantial performance variation across programming tasks; (2) Multi-modal language models demonstrate specific performance limitations in UI code generation and editing.
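The "adaptive solution extraction" mentioned in improvement (4) typically means trying progressively looser strategies to recover a candidate solution from a free-form model reply. The sketch below shows one plausible pipeline; the function name, regex, and fallback heuristics are our own illustrative assumptions, not TREAT's published API.

```python
import re

# Matches fenced code blocks such as ```python ... ``` in a model reply.
FENCE = re.compile(r"```(?:[a-zA-Z0-9_+-]*)\n(.*?)```", re.DOTALL)

def extract_solution(response: str) -> str:
    """Pull a code solution from a model reply, loosening criteria as needed."""
    # 1. Prefer the last fenced code block: models often restate the
    #    final answer at the end of a reply.
    blocks = FENCE.findall(response)
    if blocks:
        return blocks[-1].strip()
    # 2. No fences: fall back to lines that look like code
    #    (indented, or starting with a def/class/import keyword).
    code_lines = [
        line for line in response.splitlines()
        if line.startswith((" ", "\t"))
        or line.lstrip().startswith(("def ", "class ", "import ", "from "))
    ]
    if code_lines:
        return "\n".join(code_lines)
    # 3. Last resort: return the raw reply and let downstream checks
    #    (parsing, unit tests) flag it as unusable.
    return response.strip()
```

The point of layering fallbacks is to avoid penalizing a model for formatting quirks rather than coding ability, which directly supports the trustworthiness of the reported scores.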