ArkEval: Benchmarking and Evaluating Automated Code Repair for ArkTS

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a high-quality, automated code repair evaluation benchmark for the ArkTS language within the HarmonyOS ecosystem, a gap that has hindered the development of related tools. To bridge this gap, the authors propose ArkEval, the first systematic benchmark tailored for ArkTS code repair. By mining over 400 applications from Huawei’s official repositories and applying a multi-stage filtering process, they curate 502 reproducible defects. A novel multi-model voting mechanism is employed to generate reliable test cases and standardized problem descriptions. The benchmark has been evaluated on four leading large language models, revealing the current limitations of these models in repairing code for low-resource programming languages and establishing a solid foundation for future research in this domain.
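The multi-model voting mechanism described above can be sketched as a simple majority vote over independent model judgments of each candidate test case. This is a minimal illustration, assuming a majority threshold; the type names, function names, and threshold are hypothetical and not taken from the paper.

```typescript
// Hypothetical sketch of multi-model voting for accepting generated tests:
// a candidate test case is kept only if a strict majority of models
// independently judge it valid. All names here are illustrative.

type Verdict = "valid" | "invalid";

// Each model's judgment of one candidate test case is abstracted
// as a Verdict; the vote passes if "valid" verdicts form a majority.
function majorityVote(verdicts: Verdict[]): boolean {
  const validCount = verdicts.filter((v) => v === "valid").length;
  return validCount > verdicts.length / 2;
}

// Example: three model verdicts on one candidate test case.
const verdicts: Verdict[] = ["valid", "valid", "invalid"];
console.log(majorityVote(verdicts)); // true: 2 of 3 models agree
```

A strict majority keeps the decision rule well defined for any odd number of voters; with an even panel, a tie is rejected, which errs on the side of benchmark reliability.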

📝 Abstract
Large language models have transformed code generation, enabling unprecedented automation in software development. As mobile ecosystems evolve, HarmonyOS has emerged as a critical platform requiring robust development tools. Software development for the HarmonyOS ecosystem relies heavily on ArkTS, a statically typed extension of TypeScript. Despite its growing importance, the ecosystem lacks robust tools for automated code repair, primarily due to the absence of a high-quality benchmark for evaluation. To address this gap, we present ArkEval, a unified framework for constructing benchmarks and evaluating automated repair workflows for ArkTS. It provides the first comprehensive benchmark specifically designed for ArkTS automated program repair. We constructed this benchmark by mining issues from a large-scale official Huawei repository containing over 400 independent ArkTS applications. Through a rigorous multi-stage filtering process, we curated 502 reproducible issues. To ensure testability, we employed a novel LLM-based test generation and voting mechanism involving Claude and other models. Furthermore, we standardized problem statements to facilitate fair evaluation. Finally, we evaluated four state-of-the-art Large Language Models (LLMs) on our benchmark using a retrieval-augmented repair workflow. Our results highlight the current capabilities and limitations of LLMs in repairing ArkTS code, paving the way for future research in this low-resource language domain.
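The retrieval-augmented repair workflow mentioned in the abstract could be sketched as follows: retrieve the most relevant reference snippets for a buggy file, then feed them to the repair model alongside the defect. The token-overlap (Jaccard) scorer below is a stand-in for a real embedding-based retriever, and every name in this sketch is an illustrative assumption, not the paper's implementation.

```typescript
// Hypothetical sketch of the retrieval step in a retrieval-augmented
// repair workflow: score each corpus document against the buggy code
// and return the top-k matches to include in the repair prompt.

function tokenize(s: string): Set<string> {
  // Lowercase and split on non-word characters, dropping empty tokens.
  return new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity over token sets: |A ∩ B| / |A ∪ B|.
function similarity(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  let inter = 0;
  for (const t of ta) if (tb.has(t)) inter++;
  return inter / (ta.size + tb.size - inter);
}

// Rank the corpus by similarity to the query and keep the top k documents.
function retrieveTopK(query: string, corpus: string[], k: number): string[] {
  return corpus
    .map((doc) => ({ doc, score: similarity(query, doc) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.doc);
}

const corpus = [
  "ArkTS state management with @State decorator",
  "HarmonyOS ability lifecycle callbacks",
  "TypeScript generics and type inference",
];
console.log(retrieveTopK("state management in ArkTS", corpus, 1));
// ["ArkTS state management with @State decorator"]
```

In a production retriever, dense embeddings or BM25 would replace the Jaccard scorer, but the workflow shape (score, rank, truncate, prompt) stays the same.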
Problem

Research questions and friction points this paper is trying to address.

automated code repair
ArkTS
benchmark
HarmonyOS
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ArkTS
automated program repair
LLM-based test generation
benchmark construction
retrieval-augmented repair
Bang Xie
Shanghai Jiao Tong University, China
Senjian Zhang
Shanghai Jiao Tong University, China
Zhiyuan Peng
Shanghai Jiao Tong University, China
Wei Chen
Shanghai Jiao Tong University, China
Chenhao Ying
Shanghai Jiao Tong University, China
Yuan Luo
Shanghai Jiao Tong University, China