AI Summary
To address weak cross-domain generalization and poor schema adaptability in scientific paper metadata extraction, this paper proposes MeXtract, a family of lightweight language models (0.5B–3B parameters) fine-tuned from Qwen 2.5 with schema-aware training. The authors also extend the MOLE benchmark with model-specific metadata, yielding a challenging out-of-domain test subset. Experiments show that, within its size class, MeXtract achieves state-of-the-art metadata extraction performance on MOLE, and that fine-tuning on a given schema transfers effectively to unseen schemas. All code, data, and models are publicly released.
Abstract
Metadata plays a critical role in indexing, documenting, and analyzing scientific literature, yet extracting it accurately and efficiently remains challenging. Traditional approaches often rely on rule-based or task-specific models, which struggle to generalize across domains and schema variations. In this paper, we present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. The models, ranging from 0.5B to 3B parameters, are built by fine-tuning the corresponding Qwen 2.5 models. Within its size family, MeXtract achieves state-of-the-art metadata extraction performance on the MOLE benchmark. To further support evaluation, we extend the MOLE benchmark with model-specific metadata, providing a challenging out-of-domain subset. Our experiments show that fine-tuning on a given schema not only yields high accuracy but also transfers effectively to unseen schemas, demonstrating the robustness and adaptability of our approach. We release all code, datasets, and models openly for the research community.
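To make the schema-aware setup concrete, the following is a minimal sketch (not the authors' code; the field names and prompt wording are assumptions) of how a target schema can be embedded in the extraction prompt and how a model's JSON output can be validated against it. Because the schema travels with the prompt rather than being baked into the model head, the same model can in principle be queried with unseen schemas at inference time, which is the transfer setting the abstract describes.

```python
import json

# Illustrative metadata schema: field name -> expected type.
# These fields are hypothetical examples, not the paper's actual schema.
SCHEMA = {
    "Name": "str",
    "Year": "int",
    "Domain": "str",
}

def build_prompt(paper_text: str, schema: dict) -> str:
    """Embed the target schema in the instruction so extraction is schema-aware."""
    return (
        "Extract the following metadata from the paper as JSON "
        "matching this schema:\n"
        + json.dumps(schema, indent=2)
        + "\n\nPaper:\n"
        + paper_text
    )

def validate(model_output: str, schema: dict) -> dict:
    """Parse the model's JSON output and check it conforms to the schema."""
    record = json.loads(model_output)
    type_map = {"str": str, "int": int}
    for field, type_name in schema.items():
        if not isinstance(record.get(field), type_map[type_name]):
            raise ValueError(f"field {field!r} missing or not of type {type_name}")
    return record
```

In practice the prompt would be sent to the fine-tuned model and `validate` would gate its raw output; swapping in a different `SCHEMA` dict is all that changes for an unseen schema.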