🤖 AI Summary
This work addresses the challenge of understanding changes that mix behavior-preserving modifications, such as refactorings, with behavior-altering ones. As an initial validation, we quantify what proportion of the behavior-preserving changes in a dataset of functionally equivalent method pairs can be decomposed into refactoring operations; an existing refactoring detector identifies only 33.9% of them as refactorings. Adding 67 newly defined functionally equivalent operations increases this coverage by over 128%. An investigation of the differences that remain unexplained points to opportunities for improving refactoring detection, change understanding, and the separation of mixed changes.
📝 Abstract
Developers sometimes mix behavior-preserving modifications, such as refactorings, with behavior-altering modifications, such as feature additions. Several approaches have been proposed to support the understanding of such mixed changes by separating them into these two parts. Such refactoring-aware approaches are expected to be particularly effective when the behavior-preserving part can be decomposed into a sequence of more primitive behavior-preserving operations, such as refactorings, but this has not been explored. In this paper, as an initial validation, we quantify what proportion of behavior-preserving modifications can be decomposed into refactoring operations, using a dataset of functionally equivalent method pairs. With an existing refactoring detector, only 33.9% of the changes could be identified as refactoring operations. In contrast, when 67 newly defined functionally equivalent operations were included, the coverage increased by over 128%. A further investigation of the remaining unexplained differences suggests opportunities for improvement.
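
To make the notion of coverage concrete, the following is a minimal sketch of how such a measurement could be computed. It is an illustration under stated assumptions, not the paper's actual pipeline: the `Change` record, the operation names, and the four sample changes are all hypothetical, and coverage is assumed here to mean the fraction of changes explained by a known operation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Change:
    """One edit between two functionally equivalent methods (hypothetical unit)."""
    description: str
    operation: Optional[str]  # operation that explains the edit, or None if unexplained


def coverage(changes: Iterable[Change], known_operations: Set[str]) -> float:
    """Return the fraction of changes explained by an operation in the catalogue."""
    changes = list(changes)
    if not changes:
        return 0.0
    explained = sum(1 for c in changes if c.operation in known_operations)
    return explained / len(changes)


# Hypothetical changes between two versions of one method (not real data).
changes = [
    Change("rename local variable tmp to total", "Rename Variable"),
    Change("introduce a variable for a subexpression", "Extract Variable"),
    Change("swap the operands of +", "Swap Commutative Operands"),
    Change("restructure the loop body", None),  # no known operation explains this
]

# Catalogue of a refactoring detector (assumed names).
refactorings = {"Rename Variable", "Extract Variable"}
# Catalogue extended with an additional equivalence-preserving operation (assumed name).
extended = refactorings | {"Swap Commutative Operands"}

baseline = coverage(changes, refactorings)
with_extension = coverage(changes, extended)
print(f"baseline coverage: {baseline:.2f}")
print(f"extended coverage: {with_extension:.2f}")
```

In this toy setting, enlarging the catalogue raises the explained fraction, and the edit that no operation explains corresponds to the "remaining unexplained differences" examined in the paper. The numbers produced here are unrelated to the reported 33.9% figure.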