MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

career value

143K/year

🤖 AI Summary

This study addresses the challenges of automated metadata extraction and verification from multilingual scientific papers—particularly non-Arabic datasets—where current approaches suffer from low accuracy and heavy reliance on manual annotation. We propose the first schema-driven, end-to-end LLM framework for this task. Our method integrates structured schema constraints, context-length optimization, few-shot prompting, and a multi-stage verification mechanism enhanced by web browsing, supporting diverse input formats (e.g., PDF, HTML). Key contributions include: (1) the first benchmark dataset for multilingual scientific paper metadata extraction; (2) state-of-the-art cross-lingual performance (89.3% F1); and (3) open-sourced code and data. Experiments demonstrate that modern LLMs exhibit practical viability for this task, significantly improving data discoverability and research reproducibility.

Technology Category

Application Category

📝 Abstract

Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al.,2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets' scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, highlighting the need for further future work improvements to ensure consistent and reliable performance. We release the code: https://github.com/IVUL-KAUST/MOLE and dataset: https://huggingface.co/datasets/IVUL-KAUST/MOLE for the research community.

Problem

Research questions and friction points this paper is trying to address.

Automating metadata extraction from multilingual scientific papers using LLMs

Reducing reliance on manual annotation for dataset cataloging

Validating extracted metadata for consistency and reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs for automatic metadata extraction

Schema-driven processing for multiple formats

Integrates validation mechanisms for consistent output

🔎 Similar Papers

Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets