Themisto: Jupyter-Based Runtime Benchmark

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) struggle to leverage runtime context for code output prediction and generation, revealing a critical, underexplored research gap: "runtime-aware code understanding." Method: We introduce the first runtime benchmark grounded in real-world Jupyter Notebook development trajectories, featuring dynamic execution-state annotations, multi-stage output-prediction tasks, and zero-shot/few-shot evaluation protocols. Contribution/Results: A systematic evaluation shows that state-of-the-art LLMs average below 35% accuracy on this benchmark, exposing fundamental deficiencies in modeling execution context. This work formally defines and empirically establishes runtime-aware code understanding as a novel research direction, and it provides a standardized benchmark and methodological foundation for evaluating and improving the runtime-context awareness of code LLMs.

📝 Abstract
In this work, we present a benchmark that consists of Jupyter notebook development trajectories and measures how well large language models (LLMs) can leverage runtime information for code output prediction and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating runtime context is a significantly understudied direction in the development of code models.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs on Jupyter notebook trajectories
Evaluating LLMs' runtime context utilization for predictions
Identifying understudied runtime context in code model development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jupyter notebook benchmark for LLMs
Measures runtime-context utilization
Highlights the understudied role of runtime context in code prediction