Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high computational cost of large language models (LLMs) in bibliographic subject prediction. We propose an efficient hybrid framework: lightweight language models first perform multilingual translation and synthetic data augmentation, while the Annif platform combines small-model candidate generation with LLM-based re-ranking of candidate subjects. The key design decision is restricting LLM usage to the high-value re-ranking stage, avoiding costly full-text classification with an LLM, and leveraging synthetic data to mitigate annotation scarcity in specialized domains. Evaluated on GermEval-2025 LLMs4Subjects Subtask 2, the approach achieves first place in both the quantitative and qualitative assessments, improving both accuracy and efficiency. The results indicate that selective LLM deployment, combined with lightweight-model synergy and data augmentation, enables scalable and effective subject indexing for bibliographic records.
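The two-stage pipeline described above (cheap candidate generation, expensive re-ranking only on the shortlist) can be sketched as follows. This is an illustrative sketch, not the paper's actual implementation: the `Candidate` type, the `rerank` function, the score-blending weight, and the `llm_scores` dictionary standing in for an LLM relevance judgment are all hypothetical names introduced here, and the paper does not specify how Annif combines the two score sources.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str
    score: float  # confidence from the lightweight XMTC backend


def rerank(candidates, llm_scores, weight=0.5):
    """Blend backend scores with LLM relevance scores and sort.

    `llm_scores` maps each candidate label to a 0-1 relevance value
    judged by an LLM over the record's text. Only the small candidate
    list is sent to the LLM, which is the efficiency point: the LLM
    never classifies against the full subject vocabulary.
    """
    blended = [
        (c.label, weight * c.score + (1 - weight) * llm_scores.get(c.label, 0.0))
        for c in candidates
    ]
    return sorted(blended, key=lambda pair: pair[1], reverse=True)


# Candidates from a lightweight backend (illustrative values only)
cands = [
    Candidate("Machine learning", 0.62),
    Candidate("Libraries", 0.55),
    Candidate("Quantum computing", 0.40),
]
# Hypothetical LLM relevance judgments for this record
llm = {"Machine learning": 0.9, "Libraries": 0.2, "Quantum computing": 0.1}
ranked = rerank(cands, llm)
```

With these toy numbers, re-ranking keeps "Machine learning" on top while demoting the spurious "Quantum computing" candidate; the LLM's only job is to adjudicate the shortlist.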

📝 Abstract
This paper presents the Annif system in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025. The task required creating subject predictions for bibliographic records using large language models, with a special focus on computational efficiency. Our system, based on the Annif automated subject indexing toolkit, refines our previous system from the first LLMs4Subjects shared task, which produced excellent results. We further improved the system by using many small and efficient language models for translation and synthetic data generation and by using LLMs for ranking candidate subjects. Our system ranked 1st in both the overall quantitative evaluation and the qualitative evaluation of Subtask 2.
Problem

Research questions and friction points this paper is trying to address.

Automated subject indexing for bibliographic records using language models
Improving computational efficiency in large-scale subject prediction
Enhancing subject ranking accuracy through synthetic data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Traditional XMTC augmented by efficient LLMs
Small efficient models for translation and data generation
LLMs used for ranking candidate subjects
Osma Suominen
Information Systems Specialist, National Library of Finland
Juho Inkinen
National Library of Finland, University of Helsinki
Mona Lehtinen
National Library of Finland, University of Helsinki