HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

📅 2026-06-10
🤖 AI Summary
Existing benchmarks struggle to support fine-grained authorship attribution in industrial-scale codebases where AI and human developers write code collaboratively; in particular, they lack realistic simulations of line-by-line mixed authorship. This work introduces the first Python benchmark dataset derived from real-world GitHub open-source projects, simulating practical scenarios in which developers alternate coding with AI programming assistants. The dataset is constructed using CodeSearchNet combined with LLM-generated code and rigorous human curation. It moves beyond the conventional assumption that an entire code snippet has a single author, enabling both line-level and chunk-level evaluation. Experimental results show that even the best-performing method tested, AIGCode Detector, achieves an F1 score of only 0.56 at the line level and 0.48 at the chunk level on this benchmark, underscoring the difficulty of the task.
📝 Abstract
Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are increasingly a hybrid of AI- and human-authored code. For risk management and productivity analysis, it is crucial to locate AI-generated code at a fine granularity. Developing algorithms for this task requires quality benchmarks to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume that a code snippet is either completely human-authored or completely AI-authored, which does not reflect the diverse intents and styles of industry codebases that use AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code that simulates authentic use of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open-source repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line and chunk levels. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark: the top-scoring algorithm, AIGCode Detector, obtains best F1 scores of 0.48 and 0.56 on the chunk-level and line-level detection tasks, respectively.
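To make the difference between the two evaluation granularities concrete, the following is a minimal Python sketch of how line-level and chunk-level F1 might be computed for one file. It is not the paper's evaluation code. The label names ("human", "ai"), the helper names, and in particular the chunk definition (a maximal run of consecutive lines with the same label, counted as correct only on an exact span match) are assumptions made for illustration; the abstract does not specify how chunks are defined or matched.

```python
from itertools import groupby


def f1_from_counts(tp, n_pred, n_true):
    """F1 from true positives, predicted positives, and actual positives."""
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_true
    return 2 * precision * recall / (precision + recall)


def line_f1(truth, pred, positive="ai"):
    """Line-level F1: each line is scored independently."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    n_pred = sum(p == positive for p in pred)
    n_true = sum(t == positive for t in truth)
    return f1_from_counts(tp, n_pred, n_true)


def to_chunks(labels):
    """Collapse runs of identical labels into (start, end, label) spans."""
    chunks, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        chunks.append((start, start + length, label))
        start += length
    return chunks


def chunk_f1(truth, pred, positive="ai"):
    """Chunk-level F1 under an assumed exact-span-match criterion."""
    true_spans = {c for c in to_chunks(truth) if c[2] == positive}
    pred_spans = {c for c in to_chunks(pred) if c[2] == positive}
    tp = len(true_spans & pred_spans)
    return f1_from_counts(tp, len(pred_spans), len(true_spans))


# Hypothetical per-line labels for an eight-line file.
truth = ["human", "human", "ai", "ai", "ai", "human", "ai", "human"]
pred = ["human", "ai", "ai", "ai", "ai", "human", "ai", "human"]

print(f"line-level F1:  {line_f1(truth, pred):.3f}")
print(f"chunk-level F1: {chunk_f1(truth, pred):.3f}")
```

In this toy example a single mislabeled line costs little at the line level but invalidates an entire chunk under exact matching. A stricter chunk criterion of this kind would be one possible reason for chunk-level scores to fall below line-level scores, although the benchmark's actual matching rule may differ.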
Problem

Research questions and friction points this paper is trying to address.

code authorship detection
AI-generated code
hybrid code
benchmark dataset
line-level detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

HybridCodeAuthorship
line-level code authorship detection
AI-generated code detection
benchmark dataset
human-AI collaborative coding