TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

📅 2026-06-03

🤖 AI Summary

Repository-level coding benchmarks struggle to combine tasks hard enough for frontier models with evaluation that is both reliable and automated. This work proposes TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks span new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. Each run is graded automatically by applying the agent's patch and running the framework's test suite, which combines the pre-existing randomized regression tests with any tests the agent adds. Evaluations of seven coding agents, spanning three frontier model families and one open-weight model, yield pass rates from 22.1% to 64.8%. Agreement between agents on which tasks they pass is low (pairwise Cohen's κ from −0.07 to 0.43, and 0.05 for the two strongest agents), indicating that different agents solve largely different subsets of tasks.

📝 Abstract

Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from $64.8\%$ for the strongest agent to $22.1\%$ for the weakest. Agents pass different subsets of tasks: pairwise Cohen's $\kappa$ ranges from $-0.07$ to $0.43$, with $\kappa = 0.05$ for the two strongest agents.
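
As a rough illustration of the agreement statistic quoted in the abstract, the sketch below computes Cohen's $\kappa$ for two agents from per-task pass/fail outcomes, using the standard definition $\kappa = (p_o - p_e)/(1 - p_e)$. The function name `cohen_kappa` and the toy outcome lists are assumptions made for this example; they are not taken from the paper or its data.

```python
import math
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary (pass/fail) outcome vectors.

    `a[i]` and `b[i]` say whether each agent passed task i.
    Hypothetical helper for illustration, not the paper's code.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("outcome vectors must be non-empty and equal length")
    n = len(a)
    # Observed agreement: fraction of tasks with the same outcome.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Marginal pass rates of each agent.
    pass_a = sum(a) / n
    pass_b = sum(b) / n
    # Agreement expected by chance if outcomes were independent.
    p_e = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if p_e == 1.0:
        # Both agents are constant and identical; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)


# Toy data (made up): two agents with the same 50% pass rate
# whose passes overlap no more than chance would predict.
agent_x = [True, True, False, False]
agent_y = [True, False, True, False]
print(cohen_kappa(agent_x, agent_y))
```

A value near zero, like the $\kappa = 0.05$ reported for the two strongest agents, means the two agents agree on which tasks pass about as often as chance would predict given their individual pass rates; negative values mean they agree less often than chance.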

Problem

Research questions and friction points this paper is trying to address.

coding agents

benchmarking

compiler-based tensor framework

evaluation reliability

large codebases

Innovation

Methods, ideas, or system contributions that make the work stand out.

TensorBench

compiler-based tensor framework

code generation benchmark