🤖 AI Summary
Existing code benchmarks overlook large language models’ (LLMs) ability to predict runtime exception behavior, hindering evaluation of their dynamic program understanding. Method: We introduce ThrowBench—the first benchmark explicitly designed to assess LLMs’ capability to predict whether a program throws an exception and, if so, its precise type. It comprises over 2,400 short user-written programs across Python, Java, JavaScript, and Go, and avoids data contamination by generating all ground-truth labels via actual program execution. Contribution/Results: Evaluations of six state-of-the-art code LLMs reveal modest F1 scores (19%–38%), exposing a fundamental gap in current LLMs’ reasoning about runtime behavior—despite their strong results on code synthesis benchmarks.
📝 Abstract
Modern Large Language Models (LLMs) have shown astounding capabilities in code understanding and synthesis. To assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code synthesis from natural-language instructions and hence do not test other forms of code understanding. Moreover, there have been concerns about contamination and leakage: benchmark problems (or closely related problems) may appear in the training set, strongly biasing benchmark results. In this work we investigate whether large language models can correctly predict runtime program behavior. To this end, we introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs in four different programming languages. The majority of these programs throw an exception at runtime (due to a bug). LLMs are asked to predict whether a presented program throws an exception and, if so, which one. Evaluating six state-of-the-art code LLMs on our benchmark, we observe modest performance, with F1 scores ranging from 19% to 38%. Benchmarking a wider set of code capabilities could improve the assessment of code LLMs and help identify weak points in current models. Moreover, because ground-truth answers have been determined through program execution, leakage is not a concern. We release ThrowBench as well as all of our results together with this work.
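The execution-based labeling idea can be illustrated with a minimal sketch: run a candidate program in a subprocess and record either a clean exit or the type of the uncaught exception. This is an assumption-laden, Python-only toy (the helper name `ground_truth_label` and the traceback-parsing heuristic are our own, not the paper's harness, which covers four languages):

```python
import subprocess
import sys
import textwrap

def ground_truth_label(source: str, timeout: float = 5.0) -> str:
    """Execute a Python program and report the uncaught exception type,
    or "no exception" if the program exits cleanly.
    Illustrative sketch only, not the benchmark's actual harness."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode == 0:
        return "no exception"
    # For an uncaught exception, CPython's traceback ends with a line of
    # the form "SomeError: message"; take the type from that last line.
    last_line = proc.stderr.strip().splitlines()[-1]
    return last_line.split(":")[0]

buggy = textwrap.dedent("""
    nums = [1, 2, 3]
    print(nums[5])   # out-of-range access -> IndexError
""")
print(ground_truth_label(buggy))        # IndexError
print(ground_truth_label("print(1)"))   # no exception
```

Because the label comes from actually running the program rather than from a human-written answer key, it cannot have leaked into a model's training data in the way a published benchmark solution can.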