🤖 AI Summary
Current approaches to software evolution analysis and continuous integration rely heavily on test pass/fail outcomes and often overlook fine-grained runtime behavior. This limits their ability to cope with partial test oracles and flaky failures, and to detect silent performance or output drift. This work proposes a paradigm, behavioral co-versioning, that couples the Git code history with a queryable archive of runtime behavior. During test execution, selected method-level inputs, outputs, and performance signals are captured in an append-only store keyed by commit and test context. By treating runtime behavior as a first-class artifact, the approach enables semantic diffing, behavior-aware regression localization, and retrospective auditing. A laptop-scale prototype that replays historical commits of a Python project demonstrates feasibility, detecting behavioral changes that are not apparent from textual diffs.
📝 Abstract
Behavioral Co-Versioning remains absent from mainstream practice: while developers routinely version source code with Git, they rarely persist and query how run-time behavior evolves across revisions. This paper argues that this mismatch contributes to a blind spot in software evolution analysis and CI, where rich execution information is discarded and typically reduced to pass/fail outcomes -- despite partial test oracles, flakiness, and silent output or performance drift. We propose \textit{Behavioral Co-Versioning}, a paradigm that couples the Git history with a \textit{Behavioral Archive}: an append-only, queryable store of selected run-time observations (e.g., method I/O and performance signals) collected during test runs and keyed by commit and test context. This enables semantic diffing, behavior-aware regression localization, and retrospective auditing by querying historical executions, complementing proactive, signal-specific monitoring tools. We first outline a minimal data model and change diagnostics based on code/test/behavior fingerprints, and then demonstrate feasibility with a laptop-scale prototype that replays historical commits of a Python project, archives run-time observations in a local Parquet-backed store, and detects behavioral changes not apparent from textual diffs.
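To make the idea of an append-only archive keyed by commit and test context more concrete, the following is a minimal sketch, not the paper's implementation. It assumes an in-memory list in place of the Parquet-backed store, and every name in it (`Observation`, `BehavioralArchive`, `fingerprint`, `behavior_diff`, the commit IDs, and the `cart.*` methods) is hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(value) -> str:
    """Return a short, stable hash of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class Observation:
    """One run-time observation (hypothetical schema, not the paper's)."""
    commit: str
    test_id: str
    method: str
    input_fp: str
    output_fp: str
    duration_ms: float


class BehavioralArchive:
    """Append-only store: rows are added and queried, never updated."""

    def __init__(self):
        self._rows = []

    def record(self, commit, test_id, method, args, result, duration_ms):
        # Store fingerprints of the I/O rather than the raw values.
        self._rows.append(
            Observation(commit, test_id, method,
                        fingerprint(args), fingerprint(result), duration_ms)
        )

    def query(self, commit):
        return [row for row in self._rows if row.commit == commit]

    def behavior_diff(self, old_commit, new_commit):
        """List (test, method) pairs whose output changed for the same input."""
        old = {(r.test_id, r.method, r.input_fp): r.output_fp
               for r in self.query(old_commit)}
        changed = set()
        for r in self.query(new_commit):
            key = (r.test_id, r.method, r.input_fp)
            if key in old and old[key] != r.output_fp:
                changed.add((r.test_id, r.method))
        return sorted(changed)


# Two hypothetical commits: the test may still pass under a tolerant
# oracle, yet one method's output has drifted for an identical input.
archive = BehavioralArchive()
archive.record("a1b2c3", "test_total", "cart.total", [[10, 5]], 15.0, 0.4)
archive.record("a1b2c3", "test_total", "cart.tax", [15.0], 1.5, 0.1)
archive.record("d4e5f6", "test_total", "cart.total", [[10, 5]], 15.01, 0.4)
archive.record("d4e5f6", "test_total", "cart.tax", [15.0], 1.5, 0.1)

drift = archive.behavior_diff("a1b2c3", "d4e5f6")
print(drift)
```

Storing fingerprints instead of raw values keeps rows small and makes comparison a cheap equality check, at the cost of not showing how an output changed. A real archive would likely keep selected raw values as well and would need a policy for non-deterministic outputs.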