BDiff: Block-aware and Accurate Text-based Code Differencing

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code diff tools decompose multi-line block-level edits—such as method moves/copies or conditional branch relocations—into fragmented line-level operations, impairing change comprehension. To address this, we propose BDiff, a text-based diff analysis method that supports fine-grained edit operation identification. BDiff is the first to jointly model two categories of block-level edits (move, copy) and five categories of line-level edits (insert, delete, replace, wrap, unwrap). It generates candidate line/block mappings via classical diff algorithms and refines them into an optimal alignment using the Kuhn–Munkres algorithm, minimizing the resulting edit script size. We further implement an interactive web-based visualization tool. Evaluation shows that BDiff significantly outperforms state-of-the-art diff tools—including LLM-based baselines—in edit script quality (accuracy and readability), while maintaining efficient runtime performance and better aligning with developer intent and practical workflow needs.
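The pipeline the summary describes — candidate line mappings from a classical diff, refined into a minimum-cost assignment — can be sketched with Python's standard library. This is an illustrative toy, not BDiff's algorithm: `difflib` stands in for the classical diff step, and a brute-force search stands in for the Kuhn–Munkres algorithm (which solves the same assignment problem in polynomial time, e.g. via `scipy.optimize.linear_sum_assignment`). All function names and the cost model are assumptions for the sketch.

```python
import difflib
from itertools import permutations

def candidate_mappings(old_lines, new_lines):
    """Candidate line mappings from a classical diff (difflib's matcher)."""
    sm = difflib.SequenceMatcher(None, old_lines, new_lines)
    pairs = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == 'equal':
            pairs.extend(zip(range(i1, i2), range(j1, j2)))
    return pairs

def optimal_assignment(cost):
    """Minimum-cost one-to-one mapping. Brute force for clarity;
    Kuhn-Munkres solves this in O(n^3) instead of O(n!)."""
    n = len(cost)
    best, best_perm = float('inf'), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return list(enumerate(best_perm)), best

# Two swapped method blocks: the classical diff can keep only one
# block as 'equal', while the assignment view recovers both.
old = ["def f():", "    return 1", "", "def g():", "    return 2"]
new = ["def g():", "    return 2", "", "def f():", "    return 1"]
print(candidate_mappings(old, new))   # → [(0, 3), (1, 4)] — only one block survives
# Toy cost: 0 if the lines are textually identical, 1 otherwise.
cost = [[0 if a == b else 1 for b in new] for a in old]
print(optimal_assignment(cost))       # → ([(0, 3), (1, 4), (2, 2), (3, 0), (4, 1)], 0)
```

The contrast is the point: the line-oriented diff pairs only one of the two swapped blocks, whereas the assignment formulation maps every line at zero cost, which is what lets a block-aware tool report both moves.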

📝 Abstract
Code differencing is a fundamental technique in software engineering practice and research. While researchers have proposed text-based differencing techniques over the past decade that can identify line changes, existing methods exhibit a notable limitation in identifying edit actions (EAs) that operate on text blocks spanning multiple lines. Such EAs are common in developers' practice, such as moving a code block for conditional branching or duplicating a method definition block for overloading. Existing tools represent such block-level operations as discrete sequences of line-level EAs, compelling developers to manually correlate them and thereby substantially impeding the efficiency of change comprehension. To address this issue, we propose BDiff, a text-based differencing algorithm capable of identifying two types of block-level EAs and five types of line-level EAs. Building on traditional differencing algorithms, we first construct a candidate set containing all possible line mappings and block mappings. Leveraging the Kuhn-Munkres algorithm, we then compute the optimal mapping set that minimizes the size of the edit script (ES) while closely aligning with the original developer's intent. To validate the effectiveness of BDiff, we selected five state-of-the-art tools, including large language models (LLMs), as baselines and adopted a combined qualitative and quantitative approach to evaluate their performance in terms of ES size, result quality, and running time. Experimental results show that BDiff produces higher-quality differencing results than baseline tools while maintaining competitive runtime performance. Our experiments also show that LLMs are unreliable in code differencing tasks in terms of result quality and infeasible in terms of runtime efficiency. We have also implemented a web-based visual differencing tool.
Problem

Research questions and friction points this paper is trying to address.

Identifies block-level edit actions in code changes
Reduces edit script size while preserving developer intent
Improves accuracy over line-based and LLM differencing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies block-level and line-level edit actions
Uses Kuhn-Munkres algorithm for optimal mapping
Produces minimal edit scripts aligned with developer intent
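The block-level problem the bullets describe can be made concrete with a small sketch: a hypothetical post-pass over `difflib` opcodes that pairs a deleted run of lines with an identical inserted run and reports it as a single move — the kind of operation a block-aware tool like BDiff emits directly, instead of separate delete/insert hunks. The function name and matching rule are illustrative assumptions, not BDiff's actual method.

```python
import difflib

def classify_moves(old, new):
    """Pair deleted line runs with identical inserted runs and
    report them as moves. Hypothetical sketch, not BDiff's algorithm."""
    ops = difflib.SequenceMatcher(None, old, new).get_opcodes()
    deleted = [(i1, i2) for tag, i1, i2, _, _ in ops if tag == 'delete']
    inserted = [(j1, j2) for tag, _, _, j1, j2 in ops if tag == 'insert']
    moves = []
    for i1, i2 in deleted:
        for j1, j2 in inserted:
            if old[i1:i2] == new[j1:j2]:
                moves.append(('move', (i1, i2), (j1, j2)))
    return moves

old = ["a = 1", "def helper():", "    pass", "b = 2"]
new = ["a = 1", "b = 2", "def helper():", "    pass"]
print(classify_moves(old, new))  # → [('move', (3, 4), (1, 2))]
```

A plain line diff shows the same change as an unrelated insertion and deletion of `b = 2`; collapsing them into one move is what makes the edit script both smaller and closer to what the developer actually did.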
Yao Lu
National University of Defense Technology, China
Wanwei Liu
National University of Defense Technology, China
Tanghaoran Zhang
National University of Defense Technology
software engineering
Kang Yang
National University of Defense Technology, China
Yang Zhang
National University of Defense Technology, China
Wenyu Xu
National University of Defense Technology, China
Longfei Sun
National University of Defense Technology, China
Xinjun Mao
National University of Defense Technology, China
Shuzheng Gao
The Chinese University of Hong Kong
Code Intelligence · Software Engineering · Large Language Models
Michael R. Lyu
Professor of Computer Science & Engineering, The Chinese University of Hong Kong
software engineering · software reliability · fault tolerance · machine learning · distributed systems