Overview of SCIDOCA 2025 Shared Task on Citation Prediction, Discovery, and Placement

📅 2025-09-29
🤖 AI Summary
The SCIDOCA 2025 Shared Task addresses fine-grained modeling of citation relations in scientific documents through three subtasks: (1) citation discovery, identifying the references relevant to a given paragraph; (2) masked citation prediction, selecting the correct reference for each masked citation slot; and (3) citation sentence prediction, mapping each citing sentence to its reference. The task introduces a large-scale annotated dataset of over 60,000 paragraphs built on S2ORC, together with an evaluation framework covering citation identification, candidate ranking, and sentence-level alignment. The benchmark supports both end-to-end and modular evaluation; seven teams registered and three submitted valid results. Empirical analysis reveals bottlenecks in existing models, particularly in cross-sentence citation resolution and context-sensitive reasoning. The released dataset and materials provide a reproducible benchmark for scholarly citation understanding.

📝 Abstract
We present an overview of the SCIDOCA 2025 Shared Task, which focuses on citation discovery and prediction in scientific documents. The task is divided into three subtasks: (1) Citation Discovery, where systems must identify relevant references for a given paragraph; (2) Masked Citation Prediction, which requires selecting the correct citation for masked citation slots; and (3) Citation Sentence Prediction, where systems must determine the correct reference for each cited sentence. We release a large-scale dataset constructed from the Semantic Scholar Open Research Corpus (S2ORC), containing over 60,000 annotated paragraphs and a curated reference set. The test set consists of 1,000 paragraphs from distinct papers, each annotated with ground-truth citations and distractor candidates. A total of seven teams registered, with three submitting results. We report performance metrics across all subtasks and analyze the effectiveness of submitted systems. This shared task provides a new benchmark for evaluating citation modeling and encourages future research in scientific document understanding. The dataset and task materials are publicly available at https://github.com/daotuanan/scidoca2025-shared-task.
Problem

Research questions and friction points this paper is trying to address.

Identifying relevant references for scientific paragraphs
Selecting correct citations for masked citation slots
Determining correct references for each cited sentence
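The first two problems above reduce to a ranking task: score each candidate reference against a paragraph (or masked slot context) and return the best matches. As a minimal illustration, the sketch below ranks candidates by bag-of-words cosine similarity; the function names and sample data are hypothetical, and real systems submitted to the task would use far stronger retrieval models.

```python
# Hypothetical ranking baseline for citation discovery (illustrative only).
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase word counts; a stand-in for a real retrieval model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_candidates(paragraph: str, candidates: dict[str, str]) -> list[str]:
    """Return candidate reference IDs sorted by similarity to the paragraph."""
    para_vec = tokenize(paragraph)
    return sorted(candidates,
                  key=lambda cid: cosine(para_vec, tokenize(candidates[cid])),
                  reverse=True)

paragraph = "Transformer models improve citation recommendation accuracy."
candidates = {
    "ref1": "A study of transformer models for citation recommendation.",
    "ref2": "Soil chemistry of tropical rainforests.",
}
print(rank_candidates(paragraph, candidates))  # "ref1" ranks first
```

The same scoring scheme applies to masked citation prediction by treating the sentences around the masked slot as the query.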
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset from Semantic Scholar
Three subtasks for citation analysis
Public benchmark for citation modeling
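A benchmark of this kind is typically scored with ranking metrics. The overview does not fix the exact formulas here, so the following Recall@k and mean reciprocal rank (MRR) implementations are illustrative of common choices for candidate ranking, not the task's official scorer.

```python
# Illustrative ranking metrics; the shared task's official metrics may differ.
def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold references recovered in the top-k ranked candidates."""
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0

def mean_reciprocal_rank(rankings: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first correct reference across queries."""
    total = 0.0
    for ranked, gold in rankings:
        for i, cid in enumerate(ranked, start=1):
            if cid in gold:
                total += 1.0 / i
                break
    return total / len(rankings) if rankings else 0.0

ranked = ["ref3", "ref1", "ref2"]
gold = {"ref1"}
print(recall_at_k(ranked, gold, 2))            # 1.0 (gold found in top 2)
print(mean_reciprocal_rank([(ranked, gold)]))  # 0.5 (first hit at rank 2)
```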