BDiff: Block-aware and Accurate Text-based Code Differencing

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code diff tools decompose multi-line block-level edits—such as method moves/copies or conditional branch relocations—into fragmented line-level operations, impairing change comprehension. To address this, we propose BDiff, a text-based diff analysis method that supports fine-grained edit operation identification. BDiff is the first to jointly model two categories of block-level edits (move, copy) and five categories of line-level edits (insert, delete, replace, wrap, unwrap). It generates candidate line/block mappings via classical diff algorithms and refines them into an optimal alignment using the Kuhn–Munkres algorithm, minimizing the resulting edit script size. We further implement an interactive web-based visualization tool. Evaluation shows that BDiff significantly outperforms state-of-the-art diff tools—including LLM-based baselines—in edit script quality (accuracy and readability), while maintaining efficient runtime performance and better aligning with developer intent and practical workflow needs.
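The pipeline the summary describes — candidate line mappings from a classical diff, refined into a minimum-cost assignment — can be sketched with Python's standard library. This is an illustrative toy, not BDiff's algorithm: `difflib` stands in for the classical diff step, and a brute-force search stands in for the Kuhn–Munkres algorithm (which solves the same assignment problem in polynomial time, e.g. via `scipy.optimize.linear_sum_assignment`). All function names and the cost model are assumptions for the sketch.

```python
import difflib
from itertools import permutations

def candidate_mappings(old_lines, new_lines):
    """Candidate line mappings from a classical diff (difflib's matcher)."""
    sm = difflib.SequenceMatcher(None, old_lines, new_lines)
    pairs = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == 'equal':
            pairs.extend(zip(range(i1, i2), range(j1, j2)))
    return pairs

def optimal_assignment(cost):
    """Minimum-cost one-to-one mapping. Brute force for clarity;
    Kuhn-Munkres solves this in O(n^3) instead of O(n!)."""
    n = len(cost)
    best, best_perm = float('inf'), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return list(enumerate(best_perm)), best

# Two swapped method blocks: the classical diff can keep only one
# block as 'equal', while the assignment view recovers both.
old = ["def f():", "    return 1", "", "def g():", "    return 2"]
new = ["def g():", "    return 2", "", "def f():", "    return 1"]
print(candidate_mappings(old, new))   # → [(0, 3), (1, 4)] — only one block survives
# Toy cost: 0 if the lines are textually identical, 1 otherwise.
cost = [[0 if a == b else 1 for b in new] for a in old]
print(optimal_assignment(cost))       # → ([(0, 3), (1, 4), (2, 2), (3, 0), (4, 1)], 0)
```

The contrast is the point: the line-oriented diff pairs only one of the two swapped blocks, whereas the assignment formulation maps every line at zero cost, which is what lets a block-aware tool report both moves.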

📝 Abstract
Code differencing is a fundamental technique in software engineering practice and research. While researchers have proposed text-based differencing techniques over the past decade that can identify line changes, existing methods exhibit a notable limitation in identifying edit actions (EAs) that operate on text blocks spanning multiple lines. Such EAs are common in developers' practice, such as moving a code block for conditional branching or duplicating a method definition block for overloading. Existing tools represent such block-level operations as discrete sequences of line-level EAs, compelling developers to manually correlate them and thereby substantially impeding the efficiency of change comprehension. To address this issue, we propose BDiff, a text-based differencing algorithm capable of identifying two types of block-level EAs and five types of line-level EAs. Building on traditional differencing algorithms, we first construct a candidate set containing all possible line mappings and block mappings. Leveraging the Kuhn-Munkres algorithm, we then compute the optimal mapping set that minimizes the size of the edit script (ES) while closely aligning with the original developer's intent. To validate the effectiveness of BDiff, we selected five state-of-the-art tools, including large language models (LLMs), as baselines and adopted a combined qualitative and quantitative approach to evaluate their performance in terms of ES size, result quality, and running time. Experimental results show that BDiff produces higher-quality differencing results than baseline tools while maintaining competitive runtime performance. Our experiments also show that LLMs are unreliable in code differencing tasks in terms of result quality and infeasible in terms of runtime efficiency. We have also implemented a web-based visual differencing tool.
Problem

Research questions and friction points this paper is trying to address.

Identifies block-level edit actions in code changes
Reduces edit script size while preserving developer intent
Improves accuracy over line-based and LLM differencing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies block-level and line-level edit actions
Uses Kuhn-Munkres algorithm for optimal mapping
Produces minimal edit scripts aligned with developer intent
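The block-level problem the bullets describe can be made concrete with a small sketch: a hypothetical post-pass over `difflib` opcodes that pairs a deleted run of lines with an identical inserted run and reports it as a single move — the kind of operation a block-aware tool like BDiff emits directly, instead of separate delete/insert hunks. The function name and matching rule are illustrative assumptions, not BDiff's actual method.

```python
import difflib

def classify_moves(old, new):
    """Pair deleted line runs with identical inserted runs and
    report them as moves. Hypothetical sketch, not BDiff's algorithm."""
    ops = difflib.SequenceMatcher(None, old, new).get_opcodes()
    deleted = [(i1, i2) for tag, i1, i2, _, _ in ops if tag == 'delete']
    inserted = [(j1, j2) for tag, _, _, j1, j2 in ops if tag == 'insert']
    moves = []
    for i1, i2 in deleted:
        for j1, j2 in inserted:
            if old[i1:i2] == new[j1:j2]:
                moves.append(('move', (i1, i2), (j1, j2)))
    return moves

old = ["a = 1", "def helper():", "    pass", "b = 2"]
new = ["a = 1", "b = 2", "def helper():", "    pass"]
print(classify_moves(old, new))  # → [('move', (3, 4), (1, 2))]
```

A plain line diff shows the same change as an unrelated insertion and deletion of `b = 2`; collapsing them into one move is what makes the edit script both smaller and closer to what the developer actually did.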
Yao Lu
National University of Defense Technology, China
Wanwei Liu
National University of Defense Technology, China
Tanghaoran Zhang
National University of Defense Technology
software engineering
Kang Yang
National University of Defense Technology, China
Yang Zhang
National University of Defense Technology, China
Wenyu Xu
National University of Defense Technology, China
Longfei Sun
National University of Defense Technology, China
Xinjun Mao
National University of Defense Technology, China
Shuzheng Gao
The Chinese University of Hong Kong
Code Intelligence · Software Engineering · Large Language Models
Michael R. Lyu
Professor of Computer Science & Engineering, The Chinese University of Hong Kong
software engineering · software reliability · fault tolerance · machine learning · distributed systems