🤖 AI Summary
Determining whether two differently built Java binaries are functionally equivalent remains challenging when they are not bitwise identical, especially when the verdict must come with a human-readable justification.
Method: This paper proposes a relational normalization approach based on Datalog: Java bytecode is disassembled into a relational database, the database is normalized by applying Datalog rules, and the normalized databases are compared to infer equivalence between two classes.
Contribution/Results: Unlike existing tools, the method accompanies each equivalence statement with a readable Datalog proof that records the normalization steps. In a large-scale industrial evaluation covering 2,714 JAR pairs (265,690 class pairs), it identifies more artifacts rebuilt from the same code as equivalent than the two existing bytecode transformation tools it is compared against, and it significantly reduces the manual effort needed to assess artifacts that are not bitwise identical.
📝 Abstract
The security of software builds has attracted increased attention in recent years in response to incidents like SolarWinds and xz. Several companies, including Oracle and Google, now rebuild open source projects in a secure environment and publish the resulting binaries through dedicated repositories. This practice enables direct comparison between these rebuilt binaries and the original ones produced by developers and published in repositories such as Maven Central. These binaries are often not bitwise identical; however, in most cases, the differences can be attributed to variations in the build environment, and the binaries can still be considered equivalent. Establishing such equivalence, however, is a labor-intensive and error-prone process.
While there are some tools that can be used for this purpose, they all fall short of providing provenance, i.e., a readable explanation of why two binaries are or are not equivalent. To address this issue, we present daleq, a tool that disassembles Java bytecode into a relational database and can normalise this database by applying Datalog rules. The normalised databases can then be used to infer equivalence between two classes. Notably, equivalence statements are accompanied by Datalog proofs recording the normalisation process. We demonstrate the impact of daleq in an industrial context through a large-scale evaluation involving 2,714 pairs of jars, comprising 265,690 class pairs. In this evaluation, daleq is compared to two existing bytecode transformation tools. Our findings reveal a significant reduction in the manual effort required to assess artifacts that are not bitwise identical, which would otherwise demand intensive human inspection. Furthermore, the results show that daleq outperforms existing tools by identifying more artifacts rebuilt from the same code as equivalent, even when no behavioral differences are present.
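The pipeline described above (facts extracted from bytecode, rule-based normalisation, comparison, and a recorded proof) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the fact layout, the two rules (`drop_version`, `unify_ldc`) and the helper names are hypothetical and are not daleq's actual schema or rule set, and plain Python functions stand in for Datalog rules.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical fact layout (not daleq's schema):
#   ("version", class_name, major_version)
#   ("instruction", method, position, opcode, operand)
Fact = Tuple[str, ...]
# A rule returns the fact unchanged (does not apply), a rewritten fact, or None (drop it).
Rule = Tuple[str, Callable[[Fact], Optional[Fact]]]


def drop_version(fact: Fact) -> Optional[Fact]:
    # Assumption: the class file version reflects the build setup, so it is ignored.
    return None if fact[0] == "version" else fact


def unify_ldc(fact: Fact) -> Optional[Fact]:
    # ldc and ldc_w load the same constant; they differ only in constant-pool index width.
    if fact[0] == "instruction" and fact[3] == "ldc_w":
        return fact[:3] + ("ldc",) + fact[4:]
    return fact


RULES: List[Rule] = [("drop_version", drop_version), ("unify_ldc", unify_ldc)]


def normalise(facts: Set[Fact]) -> Tuple[Set[Fact], List[str]]:
    """Apply all rules until nothing changes; return the facts and a readable proof trail."""
    current, proof = set(facts), []
    changed = True
    while changed:
        changed = False
        for name, rule in RULES:
            for fact in sorted(current):  # sorted() copies, so mutating `current` is safe
                result = rule(fact)
                if result == fact:
                    continue
                current.discard(fact)
                if result is not None:
                    current.add(result)
                outcome = "removed" if result is None else str(result)
                proof.append(f"{name}: {fact} -> {outcome}")
                changed = True
    return current, proof


def equivalent(a: Set[Fact], b: Set[Fact]) -> Tuple[bool, List[str]]:
    """Two classes count as equivalent here if their normalised fact sets are equal."""
    norm_a, proof_a = normalise(a)
    norm_b, proof_b = normalise(b)
    return norm_a == norm_b, proof_a + proof_b


original = {
    ("version", "Foo", "52"),
    ("instruction", "Foo.greet", "0", "ldc", "hello"),
    ("instruction", "Foo.greet", "1", "areturn", ""),
}
rebuilt = {
    ("version", "Foo", "55"),
    ("instruction", "Foo.greet", "0", "ldc_w", "hello"),
    ("instruction", "Foo.greet", "1", "areturn", ""),
}

same, proof = equivalent(original, rebuilt)
print("equivalent:", same)
for step in proof:
    print("  ", step)
```

The two fact sets differ before normalisation but coincide afterwards, and the proof trail lists which rule rewrote or removed which fact. This is the kind of provenance the abstract refers to, although a real Datalog engine would derive and record such steps declaratively rather than through an explicit loop.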