CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the insufficient robustness of large language models (LLMs) in code understanding and reasoning tasks, particularly under structural perturbations (e.g., AST deformations) and semantic interference (e.g., comment pollution). To this end, the authors introduce CodeCrash, a unified benchmark that explicitly models both classes of code perturbation. Methodologically, they propose a two-task evaluation framework covering both input and output prediction, supporting both direct and chain-of-thought reasoning, and injecting perturbations into programs drawn from CRUXEval and LiveCodeBench. Empirical analysis across 17 LLMs and 3 Large Reasoning Models (LRMs) reveals that structural noise is the primary source of vulnerability, that LLMs over-rely on natural language cues, and that LRMs' self-reflective reasoning mechanisms degrade severely under perturbation. The benchmark, evaluation tools, and a robustness leaderboard are open-sourced to advance trustworthy assessment in code intelligence.
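The two perturbation classes the summary describes can be illustrated with a small sketch. The snippet below is an assumption-laden stand-in, not CodeCrash's actual implementation: the paper's exact perturbation operators are not detailed here, so an `ast`-based identifier-renaming pass stands in for a structural perturbation, and a misleading comment stands in for textual distraction (comment pollution).

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Hypothetical structural perturbation: strip semantic meaning from
    variable names by replacing each one with an uninformative token."""
    def __init__(self):
        self.mapping = {}

    def _rename(self, name):
        # Assign v0, v1, ... in first-seen order, reusing prior assignments.
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node

src = "total = price * count\nresult = total + tax\n"
tree = RenameIdentifiers().visit(ast.parse(src))
perturbed = ast.unparse(tree)

# Hypothetical textual distraction: prepend a misleading natural-language cue
# that contradicts what the code actually does.
perturbed = "# NOTE: this function sorts a list in place\n" + perturbed
print(perturbed)
```

A robustness benchmark would then ask a model to predict the output of the perturbed program and compare its accuracy against the unperturbed original; the gap measures reliance on naming and comments rather than execution semantics.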

📝 Abstract
Large Language Models (LLMs) have recently showcased strong capabilities in code-related tasks, yet their robustness in code comprehension and reasoning remains underexplored. In this paper, we present CodeCrash, a unified benchmark that evaluates LLM robustness under code structural and textual distraction perturbations, applied to two established benchmarks -- CRUXEval and LiveCodeBench -- across both input and output prediction tasks. We evaluate seventeen LLMs using direct and Chain-of-Thought inference to systematically analyze their robustness, identify primary reasons for performance degradation, and highlight failure modes. Our findings reveal the fragility of LLMs under structural noise and the inherent reliance on natural language cues, highlighting critical robustness issues of LLMs in code execution and understanding. Additionally, we examine three Large Reasoning Models (LRMs) and discover the severe vulnerability of self-reflective reasoning mechanisms that lead to reasoning collapse. CodeCrash provides a principled framework for stress-testing LLMs in code understanding, offering actionable directions for future evaluation and benchmarking. The code of CodeCrash and the robustness leaderboard are publicly available at https://donaldlamnl.github.io/CodeCrash/.
Problem

Research questions and friction points this paper is trying to address.

How robust is LLM code comprehension under structural and semantic perturbations?
What are the primary causes of performance degradation and the failure modes in code reasoning?
How vulnerable are the self-reflective reasoning mechanisms of Large Reasoning Models?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CodeCrash, a unified benchmark applying structural and textual-distraction perturbations to CRUXEval and LiveCodeBench
Evaluates input and output prediction tasks under both direct and Chain-of-Thought inference
Releases evaluation tools and a public robustness leaderboard