BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

📅 2026-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of information retrieval (IR) for low-resource languages, where high-quality annotated data is scarce and automatically generated labels often suffer from reliability issues and biases. The authors propose BETA-Labeling, a novel framework that systematically evaluates the effectiveness of large language model (LLM)-assisted annotation in low-resource IR. By integrating multi-model collaborative labeling, context alignment, consistency verification, and majority voting—augmented with human evaluation—they construct the first high-quality Bengali IR dataset. The study also investigates the feasibility of reusing single-hop machine-translated data from other low-resource languages, revealing performance risks in cross-lingual transfer due to inconsistent semantic preservation and language-dependent biases. Experimental results demonstrate that the proposed approach substantially improves annotation quality, while the efficacy of cross-lingual data reuse is shown to be highly dependent on the linguistic characteristics of the language pair involved.

📝 Abstract
IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we evaluated meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
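The majority-agreement step can be illustrated with a minimal sketch. This is not the paper's exact rule; the annotator labels, the `min_agreement` threshold, and the fallback to human review are illustrative assumptions:

```python
from collections import Counter

def aggregate_labels(annotations, min_agreement=2):
    """Majority-vote aggregation over labels from multiple LLM annotators.

    Items whose top label falls below the agreement threshold get None,
    marking them for human evaluation (a simplified sketch)."""
    aggregated = {}
    for item_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        aggregated[item_id] = label if votes >= min_agreement else None
    return aggregated

# Hypothetical relevance labels from three LLM annotators per query-document pair
annotations = {
    "q1-d3": ["relevant", "relevant", "not_relevant"],   # consensus
    "q2-d7": ["relevant", "not_relevant", "partial"],    # no consensus
}
print(aggregate_labels(annotations))
# {'q1-d3': 'relevant', 'q2-d7': None}
```

Disagreements surfacing as `None` would then flow into the human-evaluation stage the framework describes.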
Problem

Research questions and friction points this paper is trying to address.

low-resource IR
dataset construction
label reliability
cross-lingual reuse
semantic preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

BETA-labeling
low-resource IR
LLM-based annotation
cross-lingual dataset reuse
multilingual dataset construction
Md. Najib Hasan
Wichita State University
Mst. Jannatun Ferdous Rain
Begum Rokeya University, Rangpur
Fyad Mohammed
Khulna University of Engineering & Technology
Nazmul Siddique
Ulster University
Computational Intelligence · Machine Learning · Nature-inspired Computing · Cybernetics · Robotics