🤖 AI Summary
Current large language models for code generation are trained solely on static code corpora, which limits their ability to capture dynamic runtime behavior and causes them to internalize erroneous patterns.
Method: We propose the “Observation Lakehouse” architecture—the first system enabling persistent, versioned, and interactive analysis of software behavior. It constructs Stimulus-Response Cubes (SRCs) via dynamic execution, stores behavioral data efficiently using Parquet/Iceberg/DuckDB, automates ingestion via LASSO-guided CI pipelines, and supports SQL-driven incremental view materialization and fine-grained slicing queries.
Contribution/Results: Evaluated on 509 benchmark problems, the system ingested 8.6 million behavioral observations (<51 MiB). On a standard laptop, SRC reconstruction and clustering complete within 100 ms, demonstrating feasibility of a low-cost, sustainable behavior-aware infrastructure. This work establishes a new paradigm for behavior-driven model training and evaluation.
📝 Abstract
Code-generating LLMs are trained largely on static artifacts (source, comments, specifications) and rarely on materializations of run-time behavior. As a result, they readily internalize buggy or mislabeled code. Since non-trivial semantic properties are undecidable in general, the only practical way to obtain ground-truth functionality is by dynamic observation of executions. In prior work, we addressed representation with Sequence Sheets, Stimulus-Response Matrices (SRMs), and Stimulus-Response Cubes (SRCs) to capture and compare behavior across tests, implementations, and contexts. These structures make observation data analyzable offline and reusable, but they do not by themselves provide persistence, evolution, or interactive analytics at scale. In this paper, therefore, we introduce observation lakehouses that operationalize continual SRCs: a tall, append-only observations table storing every actuation (stimulus, response, context) and SQL queries that materialize SRC slices on demand. Built on Apache Parquet + Iceberg + DuckDB, the lakehouse ingests data from controlled pipelines (LASSO) and CI pipelines (e.g., unit test executions), enabling n-version assessment, behavioral clustering, and consensus oracles without re-execution. On a 509-problem benchmark, we ingest ≈8.6M observation rows (<51 MiB) and reconstruct SRM/SRC views and clusters in <100 ms on a laptop, demonstrating that continual behavior mining is practical without a distributed cluster. This makes behavioral ground truth first-class alongside other run-time data and provides an infrastructure path toward behavior-aware evaluation and training. The Observation Lakehouse, together with the accompanying dataset, is publicly available as an open-source project on GitHub: https://github.com/SoftwareObservatorium/observation-lakehouse
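The core idea of the abstract, a tall, append-only observations table whose SRM slices are materialized on demand by SQL, can be sketched as follows. This is a minimal illustration, not the project's actual schema: the real system uses DuckDB over Parquet/Iceberg, while this sketch uses Python's built-in sqlite3 to stay dependency-free, and all table and column names are assumptions.

```python
import sqlite3

# In-memory stand-in for the lakehouse's analytical engine (the paper uses
# DuckDB over Parquet/Iceberg; sqlite3 keeps this sketch self-contained).
con = sqlite3.connect(":memory:")

# Tall, append-only observations table: one row per actuation
# (stimulus, response, context). Column names are illustrative.
con.execute("""
CREATE TABLE observations (
    problem_id TEXT,   -- benchmark problem
    impl_id    TEXT,   -- candidate implementation (n-version assessment)
    test_id    TEXT,   -- stimulus / test case
    response   TEXT    -- observed output
)
""")

rows = [
    ("p1", "implA", "t1", "4"),
    ("p1", "implA", "t2", "9"),
    ("p1", "implB", "t1", "4"),
    ("p1", "implB", "t2", "10"),  # implB deviates from implA on t2
]
con.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)", rows)

# Materialize an SRM slice on demand (implementations x stimuli),
# without re-executing any tests.
srm = con.execute("""
SELECT impl_id,
       MAX(CASE WHEN test_id = 't1' THEN response END) AS t1,
       MAX(CASE WHEN test_id = 't2' THEN response END) AS t2
FROM observations
WHERE problem_id = 'p1'
GROUP BY impl_id
ORDER BY impl_id
""").fetchall()

print(srm)  # each tuple is one implementation's response vector
```

Comparing the resulting response vectors row-by-row is the basis for the behavioral clustering and consensus oracles the paper describes: implementations with identical vectors fall into the same behavioral cluster, and the majority response per stimulus serves as a consensus oracle.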