ArkTS-CodeSearch: An Open-Source ArkTS Dataset for Code Retrieval

📅 2026-02-05
🤖 AI Summary
This work addresses the scarcity of public datasets and evaluation benchmarks for ArkTS code intelligence. To bridge this gap, the authors construct and open-source the first large-scale ArkTS code retrieval dataset: they crawl repositories from GitHub and Gitee, extract functions together with their natural language comments using tree-sitter-arkts, and deduplicate the resulting pairs across the two platforms. They formulate a single-search task in which a comment is the query and the matching ArkTS function is the target, and they establish a systematic evaluation benchmark around it. They then evaluate existing open-source code embedding models and fine-tune them on both ArkTS and TypeScript data. According to the reported experiments, the fine-tuned models significantly outperform the baseline models on this task, giving the ArkTS community a dataset, a fine-tuned model, and a standardized evaluation framework for code understanding.
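The summary does not say how cross-platform deduplication is carried out. As a minimal sketch of one common approach, the snippet below hashes a normalized form of each function body and keeps the first pair seen for each hash. The record fields (`source`, `comment`, `function`) and the normalization rules are assumptions made for illustration, not the authors' pipeline, and the regex-based comment stripping is far cruder than parsing with tree-sitter-arkts.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Strip comments and collapse whitespace so trivially different copies match.

    Assumption: regex stripping is good enough for a sketch; it would
    mis-handle `//` inside string literals such as URLs.
    """
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                     # line comments
    return re.sub(r"\s+", " ", code).strip()


def dedupe(pairs):
    """Keep the first (comment, function) record for each distinct normalized body."""
    seen = set()
    unique = []
    for pair in pairs:
        digest = hashlib.sha256(normalize(pair["function"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(pair)
    return unique
```

Under these assumptions, a function mirrored on both GitHub and Gitee with different formatting collapses to a single record, while genuinely different functions are kept.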

📝 Abstract
ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task, where natural language comments are used to retrieve corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and statistical analysis of ArkTS function types. We further evaluate existing open-source code embedding models on the single-search task and perform fine-tuning using both ArkTS and TypeScript training datasets, resulting in a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. The fine-tuned model is available at https://huggingface.co/hreyulog/embedinggemma_arkts and the dataset at https://huggingface.co/datasets/hreyulog/arkts-code-docstring.
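To make the single-search task concrete, the sketch below scores a comment-to-function retrieval run with mean reciprocal rank (MRR). Everything here is illustrative: the abstract does not state which metrics the benchmark uses, and `embed` is a toy bag-of-words stand-in for a real embedding model, chosen only so the example runs without external dependencies.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag of lowercase word tokens.

    camelCase identifiers are split first so `addNumbers` yields `add`, `numbers`.
    """
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return Counter(re.findall(r"[a-z]+", spaced.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def mean_reciprocal_rank(comments, functions):
    """comments[i] is the query whose correct answer is functions[i]."""
    func_vecs = [embed(f) for f in functions]
    total = 0.0
    for i, comment in enumerate(comments):
        query = embed(comment)
        scores = [cosine(query, fv) for fv in func_vecs]
        # Rank of the gold function: 1 plus the number of candidates that
        # score strictly higher (ties are resolved in the gold item's favor).
        rank = 1 + sum(1 for s in scores if s > scores[i])
        total += 1.0 / rank
    return total / len(comments)
```

Swapping `embed` for a real model that returns dense vectors, and `cosine` for the matching similarity function, would leave the ranking logic unchanged.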
Problem

Research questions and friction points this paper is trying to address.

ArkTS
code retrieval
dataset
code intelligence
evaluation benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

ArkTS
code retrieval
code embedding
fine-tuning
open-source dataset
Yulong He
St Petersburg University
Artem Ermakov
ITMO University
Sergey Kovalchuk
ITMO University
artificial intelligence, human-AI interaction, complex systems, computational science
Artem Aliev
St. Petersburg State University
Dmitry Shalymov
St. Petersburg State University