Themisto: Jupyter-Based Runtime Benchmark

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) struggle to leverage runtime context for code output prediction and generation, revealing a critical, underexplored research gap: "runtime-aware code understanding." Method: We introduce the first runtime benchmark grounded in real-world Jupyter Notebook development trajectories, featuring dynamic execution-state annotations, multi-stage output-prediction tasks, and zero-shot/few-shot evaluation protocols. Contribution/Results: A systematic evaluation shows that state-of-the-art LLMs average below 35% accuracy on this benchmark, exposing fundamental deficiencies in modeling execution context. This work formally defines and empirically establishes runtime-aware code understanding as a novel research direction, and it provides a standardized benchmark and methodological foundation for evaluating and improving the runtime-context awareness of code LLMs.

📝 Abstract
In this work, we present a benchmark that consists of Jupyter notebook development trajectories and measures how well large language models (LLMs) can leverage runtime information for code output prediction and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating runtime context is a significantly understudied direction in the development of code models.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs on Jupyter notebook trajectories
Evaluating LLMs' runtime context utilization for predictions
Identifying understudied runtime context in code model development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jupyter notebook benchmark for LLMs
Measures runtime-context utilization
Highlights the understudied role of runtime context in code prediction