🤖 AI Summary
This work addresses the challenges of information retrieval (IR) for low-resource languages, where high-quality annotated data is scarce and automatically generated labels often suffer from reliability issues and biases. The authors propose BETA-Labeling, a novel framework that systematically evaluates the effectiveness of large language model (LLM)-assisted annotation in low-resource IR. By integrating multi-model collaborative labeling, context alignment, consistency verification, and majority voting—augmented with human evaluation—they construct the first high-quality Bengali IR dataset. The study also investigates the feasibility of reusing single-hop machine-translated data from other low-resource languages, revealing performance risks in cross-lingual transfer due to inconsistent semantic preservation and language-dependent biases. Experimental results demonstrate that the proposed approach substantially improves annotation quality, while the efficacy of cross-lingual data reuse is shown to be highly dependent on the linguistic characteristics of the language pair involved.
📝 Abstract
Information retrieval (IR) in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-Labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we assessed meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
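The multi-annotator aggregation step described above (independent LLM labels, agreement check, majority vote, human review fallback) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual BETA-Labeling implementation; the function name, threshold, and data layout are assumptions.

```python
from collections import Counter

def aggregate_labels(annotations, min_agreement=0.75):
    """Majority-vote aggregation of per-item labels from multiple LLM
    annotators. Items whose inter-annotator agreement falls below the
    threshold are flagged for human evaluation.

    Illustrative sketch only -- not the paper's exact pipeline.
    """
    results = {}
    for item_id, labels in annotations.items():
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        agreement = votes / len(labels)  # fraction of annotators agreeing
        results[item_id] = {
            "label": label,
            "agreement": agreement,
            "needs_human_review": agreement < min_agreement,
        }
    return results

# Three hypothetical LLM annotators labeling query-document pairs
# as relevant (1) or not relevant (0).
annotations = {
    "q1-d7": [1, 1, 1],  # unanimous -> accepted automatically
    "q2-d3": [1, 0, 1],  # 2/3 agreement, below 0.75 -> human review
}
print(aggregate_labels(annotations))
```

With a stricter agreement threshold, more items fall back to human evaluation; tuning this trade-off between annotation cost and label reliability is exactly the kind of decision the framework's human-evaluation stage is meant to cover.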