🤖 AI Summary
This study addresses a critical limitation in the evaluation of Java static analysis frameworks: the common assumption that analysis algorithms and configurations compose monotonically and are semantically comparable. That assumption frequently breaks down in the presence of modern language features such as lambdas and reflection, yielding inconsistent call graphs. By establishing a precision partial order, the authors conduct a large-scale empirical study across four major frameworks (Soot, SootUp, WALA, and Doop), revealing, for the first time, significant intra- and inter-framework semantic gaps in call graph construction. Their findings demonstrate that algorithmic precision rankings become unstable under advanced language constructs, that configurations and algorithms can co-fail in nontrivial ways, and that irreconcilable semantic discrepancies exist between frameworks. These results challenge conventional evaluation paradigms in static analysis and advocate for a new perspective that jointly considers algorithms, configurations, and framework-specific semantics.
📝 Abstract
Java static analysis frameworks are commonly compared under the assumption that analysis algorithms and configurations compose monotonically and yield semantically comparable results across tools. In this work, we show that this assumption is fundamentally flawed. We present a large-scale empirical study of semantic consistency within and across four widely used Java static analysis frameworks: Soot, SootUp, WALA, and Doop. Using precision partial orders over analysis algorithms and configurations, we systematically identify violations where increased precision introduces new call-graph edges or amplifies inconsistencies. Our results reveal three key findings. First, algorithmic precision orders frequently break within frameworks due to modern language features such as lambdas, reflection, and native modeling. Second, configuration choices strongly interact with analysis algorithms, producing synergistic failures that exceed the effects of algorithm or configuration changes alone. Third, cross-framework comparisons expose irreconcilable semantic gaps, demonstrating that different frameworks operate over incompatible notions of call-graph ground truth. These findings challenge prevailing evaluation practices in static analysis and highlight the need to reason jointly about algorithms, configurations, and framework semantics when assessing precision and soundness.
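The precision partial orders used in the study can be checked mechanically: if analysis A is expected to be at least as precise as analysis B, then every call-graph edge A reports should also appear in B's over-approximating call graph, and any edge A reports that B omits is a violation of the expected order. A minimal sketch of that check, assuming call graphs are represented as edge sets (the `Edge` record and `violations` helper here are illustrative, not part of Soot, SootUp, WALA, or Doop):

```java
import java.util.Set;

// Illustrative sketch, not any framework's API: a call-graph edge
// from a caller method to a callee method, identified by name.
record Edge(String caller, String callee) {}

final class PrecisionOrder {
    // Given two call graphs where `morePrecise` is expected to be a
    // refinement of `lessPrecise`, return the edges that violate the
    // expected partial order: edges the supposedly more precise
    // analysis reports that the less precise one does not.
    static Set<Edge> violations(Set<Edge> morePrecise, Set<Edge> lessPrecise) {
        var extra = new java.util.HashSet<>(morePrecise);
        extra.removeAll(lessPrecise);
        return extra;
    }
}
```

For example, if an analysis with lambda modeling enabled discovers an edge into a synthesized lambda method that a nominally less precise configuration misses entirely, that edge shows up in `violations` even though the first analysis is "more precise", which is exactly the kind of order breakage the study documents.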