🤖 AI Summary
This work addresses the scarcity of public datasets and evaluation benchmarks for ArkTS code intelligence. To bridge this gap, we construct and open-source the first large-scale ArkTS code retrieval dataset by crawling repositories from GitHub and Gitee, extracting functions together with their natural language comments, and applying precise parsing and cross-platform deduplication using tree-sitter-arkts. We formulate a comment-based single-search task and establish a systematic evaluation benchmark. Furthermore, we fine-tune existing embedding models on both ArkTS and TypeScript data. Experimental results demonstrate that the fine-tuned models significantly outperform baseline approaches on this task, providing the ArkTS community with high-quality data, effective tooling, and a standardized evaluation framework for code understanding.
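The paper parses repositories with tree-sitter-arkts; as a rough, runnable stand-in (the real pipeline uses a full AST, not regexes), the sketch below pairs each `/** ... */` doc comment with the function declaration that follows it and hashes whitespace-normalized code as a simple cross-platform deduplication key. All names here are illustrative, not from the paper.

```python
import hashlib
import re

# Crude stand-in for tree-sitter-arkts parsing: pair each /** ... */ doc
# comment with the function declaration immediately following it.
PAIR_RE = re.compile(
    r"/\*\*(?P<comment>.*?)\*/\s*"
    r"(?P<code>function\s+\w+\s*\([^)]*\)[^{]*\{.*?\n\})",
    re.DOTALL,
)

def extract_pairs(source: str):
    """Return a list of {comment, code} dicts found in `source`."""
    pairs = []
    for m in PAIR_RE.finditer(source):
        # Strip comment decoration (leading '*' and whitespace per line).
        comment = " ".join(
            line.strip(" *") for line in m.group("comment").splitlines()
        ).strip()
        pairs.append({"comment": comment, "code": m.group("code")})
    return pairs

def dedup_key(code: str) -> str:
    """Whitespace-insensitive hash: a simple take on cross-platform dedup,
    where the same function may appear on GitHub and Gitee with different
    formatting or line endings."""
    normalized = "".join(code.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

src = """
/** Adds two numbers. */
function add(a: number, b: number): number {
  return a + b;
}
"""
print(extract_pairs(src)[0]["comment"])  # → Adds two numbers.
```

A real implementation would walk the tree-sitter AST to handle nested declarations, arrow functions, and multi-line signatures that this regex misses.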
📝 Abstract
ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task in which natural language comments are used to retrieve the corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and a statistical analysis of ArkTS function types. We further evaluate existing open-source code embedding models on the single-search task and fine-tune them on both ArkTS and TypeScript training data, yielding a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. The dataset and our fine-tuned model are available at https://huggingface.co/datasets/hreyulog/arkts-code-docstring and https://huggingface.co/hreyulog/embedinggemma_arkts.
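The single-search task scores how highly an embedding model ranks each function given its comment as the query. The paper does not specify its exact metric, so the sketch below uses Mean Reciprocal Rank over cosine similarities as a representative retrieval score, with tiny hand-made embeddings standing in for real model outputs.

```python
import numpy as np

def mrr(query_embs: np.ndarray, code_embs: np.ndarray) -> float:
    """Mean Reciprocal Rank for comment -> function retrieval, assuming the
    ground-truth match for query i is candidate i (a 1:1 pairing)."""
    # Cosine similarity: L2-normalize rows, then take dot products.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (n_queries, n_candidates)
    reciprocal_ranks = []
    for i, row in enumerate(sims):
        order = np.argsort(-row)                    # best candidate first
        rank = int(np.where(order == i)[0][0]) + 1  # 1-based rank of truth
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Deterministic toy: each "comment" embedding points near its function's,
# so every true match ranks first and MRR is 1.0.
code_embs = np.eye(3)
query_embs = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.9, 0.0],
                       [0.0, 0.1, 0.9]])
print(mrr(query_embs, code_embs))  # → 1.0
```

In practice `query_embs` and `code_embs` would come from the fine-tuned embedding model applied to the held-out comment-function pairs, and Recall@k is often reported alongside MRR.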