MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing biomedical multimodal large language model (MLLM) evaluation benchmarks lack clinical realism, particularly in modeling longitudinal, multi-agent, multimodal decision-making processes such as Molecular Tumor Board (MTB) consultations. Method: We introduce the first MTB-oriented, multimodal, longitudinal clinical decision benchmark, featuring an agent-based evaluation framework that supports multi-turn interaction, cross-modal (imaging/text/genomic) fusion, and temporal reasoning, with ground-truth annotations validated by clinical experts to ensure fidelity. Crucially, we propose a foundation-model-driven, tool-augmented agent architecture enabling evidence-conflict resolution and temporally grounded inference. Contribution/Results: Our framework substantially mitigates hallucination and weak temporal reasoning in MLLMs, achieving task-level performance gains of up to 9.0% on multimodal reasoning and 11.2% on longitudinal reasoning. It establishes a novel paradigm for trustworthy, precision oncology-aligned AI decision support.

📝 Abstract
Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability -- frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.
Problem

Research questions and friction points this paper is trying to address.

Assessing multimodal clinical reasoning in oncology workflows
Evaluating longitudinal data integration in tumor board simulations
Benchmarking reliability of LLMs for precision oncology decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework simulating Molecular Tumor Board workflows
Foundation model-based tools for multimodal reasoning enhancement
Longitudinal data integration for clinical decision-making improvement