🤖 AI Summary
Empirical code review research is hindered by high technical barriers to cross-platform (e.g., GitHub/GitLab) data acquisition and analysis, heavy reliance on custom scripts, and poor reproducibility. To address these challenges, we propose the first LLM-integrated code review mining framework. It enables natural-language interactive querying, automatic API endpoint discovery, multi-source authentication management, and joint parsing of structured and unstructured review artifacts, including comments, patches, and metadata. The framework supports both quantitative statistics and qualitative analysis in a single pipeline, substantially reducing the need for manual script development. We implement and evaluate a prototype system across multiple platforms, demonstrating its feasibility for efficient, low-barrier empirical software engineering studies. Results show improved reproducibility and broader accessibility of code review research.
📝 Abstract
Empirical research on code review processes is increasingly central to understanding software quality and collaboration. However, collecting and analyzing review data remains a time-consuming and technically demanding task. Most researchers follow similar workflows: writing ad hoc scripts to extract, filter, and analyze review data from platforms such as GitHub and GitLab. This paper introduces RevMine, a conceptual tool that streamlines the entire code review mining pipeline using large language models (LLMs). RevMine guides users through authentication, endpoint discovery, and natural-language data collection, significantly reducing the need for manual scripting. After retrieving review data, it supports both quantitative and qualitative analysis based on user-defined filters or LLM-inferred patterns. This poster outlines the tool's architecture, use cases, and research potential. By lowering the barrier to entry, RevMine aims to democratize code review mining and enable a broader range of empirical software engineering studies.
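For illustration, the kind of ad hoc filter-and-count script that such workflows typically require, and that RevMine would generate from a natural-language query, can be sketched as follows. This is a hypothetical minimal example: the record fields and sample data are invented, modeled loosely on the shape of review-comment payloads from forge APIs such as GitHub's, and the fetching/authentication step is omitted.

```python
from collections import Counter

# Hypothetical review-comment records, loosely mimicking fields of a
# pull-request review-comments API response (author, body, file path).
COMMENTS = [
    {"author": "alice", "body": "Please add a unit test.", "path": "src/app.py"},
    {"author": "bob",   "body": "LGTM",                    "path": "src/app.py"},
    {"author": "alice", "body": "Nit: rename this var.",   "path": "docs/readme.md"},
]

def filter_comments(comments, keyword):
    """User-defined filter: keep comments whose body mentions the keyword
    (case-insensitive) -- the qualitative slicing step."""
    return [c for c in comments if keyword.lower() in c["body"].lower()]

def comments_per_author(comments):
    """Quantitative summary: number of review comments per author."""
    return Counter(c["author"] for c in comments)

print(comments_per_author(COMMENTS))        # per-author comment counts
print(filter_comments(COMMENTS, "test"))    # comments mentioning "test"
```

A researcher today writes and re-writes variations of this boilerplate per study and per platform; the point of an LLM-driven pipeline is to produce such filters and aggregations from a natural-language request instead.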