🤖 AI Summary
This work addresses the scarcity of public datasets and evaluation benchmarks for ArkTS code intelligence. To bridge this gap, we construct and open-source the first large-scale ArkTS code retrieval dataset by crawling repositories from GitHub and Gitee, extracting functions together with their natural language comments, and applying precise parsing and cross-platform deduplication using tree-sitter-arkts. We formulate a comment-based single-search task and establish a systematic evaluation benchmark. Furthermore, we fine-tune existing embedding models on both ArkTS and TypeScript data. Experimental results demonstrate that the fine-tuned models significantly outperform baseline approaches on this task, providing the ArkTS community with high-quality data, effective tooling, and a standardized evaluation framework for code understanding.
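The paper parses repositories with tree-sitter-arkts; as a rough, runnable stand-in (the real pipeline uses a full AST, not regexes), the sketch below pairs each `/** ... */` doc comment with the function declaration that follows it and hashes whitespace-normalized code as a simple cross-platform deduplication key. All names here are illustrative, not from the paper.

```python
import hashlib
import re

# Crude stand-in for tree-sitter-arkts parsing: pair each /** ... */ doc
# comment with the function declaration immediately following it.
PAIR_RE = re.compile(
    r"/\*\*(?P<comment>.*?)\*/\s*"
    r"(?P<code>function\s+\w+\s*\([^)]*\)[^{]*\{.*?\n\})",
    re.DOTALL,
)

def extract_pairs(source: str):
    """Return a list of {comment, code} dicts found in `source`."""
    pairs = []
    for m in PAIR_RE.finditer(source):
        # Strip comment decoration (leading '*' and whitespace per line).
        comment = " ".join(
            line.strip(" *") for line in m.group("comment").splitlines()
        ).strip()
        pairs.append({"comment": comment, "code": m.group("code")})
    return pairs

def dedup_key(code: str) -> str:
    """Whitespace-insensitive hash: a simple take on cross-platform dedup,
    where the same function may appear on GitHub and Gitee with different
    formatting or line endings."""
    normalized = "".join(code.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

src = """
/** Adds two numbers. */
function add(a: number, b: number): number {
  return a + b;
}
"""
print(extract_pairs(src)[0]["comment"])  # → Adds two numbers.
```

A real implementation would walk the tree-sitter AST to handle nested declarations, arrow functions, and multi-line signatures that this regex misses.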
📝 Abstract
ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task in which natural language comments are used to retrieve the corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and a statistical analysis of ArkTS function types. We further evaluate existing open-source code embedding models on the single-search task and fine-tune them on both ArkTS and TypeScript training data, yielding a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. The dataset and our fine-tuned model are available at https://huggingface.co/datasets/hreyulog/arkts-code-docstring and https://huggingface.co/hreyulog/embedinggemma_arkts.
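The single-search task scores how highly an embedding model ranks each function given its comment as the query. The paper does not specify its exact metric, so the sketch below uses Mean Reciprocal Rank over cosine similarities as a representative retrieval score, with tiny hand-made embeddings standing in for real model outputs.

```python
import numpy as np

def mrr(query_embs: np.ndarray, code_embs: np.ndarray) -> float:
    """Mean Reciprocal Rank for comment -> function retrieval, assuming the
    ground-truth match for query i is candidate i (a 1:1 pairing)."""
    # Cosine similarity: L2-normalize rows, then take dot products.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (n_queries, n_candidates)
    reciprocal_ranks = []
    for i, row in enumerate(sims):
        order = np.argsort(-row)                    # best candidate first
        rank = int(np.where(order == i)[0][0]) + 1  # 1-based rank of truth
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Deterministic toy: each "comment" embedding points near its function's,
# so every true match ranks first and MRR is 1.0.
code_embs = np.eye(3)
query_embs = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.9, 0.0],
                       [0.0, 0.1, 0.9]])
print(mrr(query_embs, code_embs))  # → 1.0
```

In practice `query_embs` and `code_embs` would come from the fine-tuned embedding model applied to the held-out comment-function pairs, and Recall@k is often reported alongside MRR.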