Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs

📅 2025-04-28

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This study addresses two challenges in multilingual subject indexing: (1) aligning semantics across languages and (2) the scarcity of annotated training data for low-resource subject thesauri (e.g., GND). To this end, the authors combine traditional eXtreme Multi-label Text Classification (XMTC), as implemented in the Annif toolkit, with LLM-based methods. Multilingual LLMs (e.g., mBART, Llama-3) are used for translation and synthetic data generation, and predictions from monolingual XMTC models are merged into a single ranking. Evaluated on the bilingual TIBKAT bibliography in SemEval-2025 Task 5, the system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in the qualitative evaluation. The results indicate that combining traditional XMTC algorithms with LLM techniques can improve subject indexing in multilingual settings.

📝 Abstract

This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.

Problem

Research questions and friction points this paper is trying to address.

Combining traditional XMTC with LLMs for subject indexing

Improving accuracy in multilingual subject prediction tasks

Enhancing efficiency using synthetic data and translation methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines traditional NLP with LLM techniques

Uses LLMs for translation and data generation

Merges predictions from monolingual models
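
The merging of predictions from monolingual models can be illustrated with a minimal sketch. The summary does not say how the scores are combined, so the weighted averaging below is an assumption, and the function name `merge_predictions` and the subject identifiers are hypothetical placeholders rather than part of Annif or GND.

```python
from collections import defaultdict


def merge_predictions(per_model_scores, weights=None, limit=5):
    """Merge subject scores from several models by weighted averaging.

    per_model_scores: list of dicts mapping subject ID -> score in [0, 1].
    weights: optional per-model weights (assumed equal if omitted).
    A subject missing from a model's output counts as score 0 for that model.
    """
    if weights is None:
        weights = [1.0] * len(per_model_scores)
    total = sum(weights)
    merged = defaultdict(float)
    for scores, weight in zip(per_model_scores, weights):
        for subject, score in scores.items():
            merged[subject] += weight * score / total
    # Highest score first; ties broken by subject ID for a stable order.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical outputs of a German-language and an English-language model.
german = {"subject:A": 0.8, "subject:B": 0.4}
english = {"subject:A": 0.6, "subject:C": 0.5}
top = merge_predictions([german, english], limit=2)
```

With equal weights, a subject suggested by both models outranks one suggested by only a single model at a similar score, which is the usual motivation for this kind of ensemble.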
