🤖 AI Summary
Existing video generation methods struggle to produce step-by-step narrations of complex scientific figures while synchronously highlighting the relevant regions, as effective academic communication requires. To bridge this gap, we introduce a novel task termed “paper-grounded figure-to-video generation” and propose MINARD, a multimodal parsing framework that jointly leverages scholarly text and visual figure content to achieve temporally aligned narration and component-level visual grounding. We further construct FigTalk, a new evaluation benchmark, and introduce metrics for sequential coherence and region-alignment fidelity. In both automatic and human evaluations, videos generated by our approach significantly outperform those of existing methods, showing greater faithfulness to the source paper and more natural, human-like explanatory quality.
📝 Abstract
Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.