Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in existing video generation methods: they struggle to narrate complex scientific figures step by step while highlighting the relevant regions in sync, which effective academic communication requires. To bridge this gap, we introduce a new task, “paper-grounded figure-to-video generation”, and propose MINARD, a multimodal pipeline that draws on both the paper's text and the figure's visual content to produce temporally aligned narration with component-level visual grounding. We further construct FigTalk, a new evaluation benchmark, and introduce metrics for sequential and component-level grounding. On FigTalk, videos generated by our approach outperform existing methods in both automatic and human evaluations, with narrations that are more faithful to the source paper and more humanlike.

📝 Abstract

Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.
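To make the task output concrete, here is a minimal sketch of how a narrated, region-grounded walkthrough could be represented: an ordered list of narration steps, each tied to one figure region and a duration. This is an illustration only, not MINARD's actual data format; the names `NarrationStep`, `build_timeline`, and `active_region`, the `(x, y, width, height)` box convention, and the example text are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class NarrationStep:
    """One step of a walkthrough (hypothetical structure, not from the paper)."""
    text: str                            # narrated sentence for this step
    region: tuple[int, int, int, int]    # assumed box: (x, y, width, height) in pixels
    duration_s: float                    # how long the narration for this step lasts


def build_timeline(steps):
    """Lay steps end to end and return (start_s, end_s, step) triples."""
    timeline = []
    t = 0.0
    for step in steps:
        timeline.append((t, t + step.duration_s, step))
        t += step.duration_s
    return timeline


def active_region(timeline, t):
    """Return the region to highlight at time t, or None outside the video."""
    for start, end, step in timeline:
        if start <= t < end:
            return step.region
    return None


# Example with made-up content: two steps over a two-block pipeline figure.
steps = [
    NarrationStep("The encoder reads the input.", (0, 0, 100, 50), 2.0),
    NarrationStep("The decoder produces the output.", (120, 0, 100, 50), 3.0),
]
timeline = build_timeline(steps)
```

Calling `active_region(timeline, 2.5)` looks up which box a renderer would highlight while the second sentence is being spoken. Keeping narration order and region order in a single list is what makes the grounding sequential in this sketch.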

Problem

Research questions and friction points this paper is trying to address.

scientific figures

video generation

paper-grounded narration

visual grounding

figure understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

paper-grounded video generation

figure-to-video

region grounding