MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

📅 2026-01-31
🏛️ Journal of Mechanical Design
🤖 AI Summary
Engineering specification documents combine text, tables, and illustrations, which poses significant challenges for conventional text-only RAG systems. This work proposes the Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), which integrates vision-language retrieval, higher-order reasoning, and self-consistency-based decision making. MCERF features a modular architecture and a dynamic query routing mechanism, implemented either as single-case routing or as a multi-agent system, enabling efficient and scalable document understanding without requiring full-document ingestion. Evaluated on the DesignQA benchmark, MCERF achieves a relative gain of 41.1% in average accuracy over the best baseline RAG results, with notable improvement on multimodal and reasoning-intensive tasks.
📝 Abstract
Engineering rulebooks and technical standards contain multimodal information, such as dense text, tables, and illustrations, that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes the Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) a Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) a High-Reasoning LLM mode for complex multimodal questions, and (iv) a Self-Consistency decision step to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. Furthermore, this work establishes and compares two routing approaches, a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to the most suitable pipeline. Evaluation on the DesignQA benchmark shows that the system improves average accuracy across all tasks, with a relative gain of +41.1% over the best baseline RAG results and significant improvement on multimodal and reasoning-intensive tasks, all without complete rulebook ingestion. These results show how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
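To make the routing and self-consistency ideas in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a keyword-based router and a majority vote over repeated samples; the pipeline labels, the `RULE_ID` pattern, the `VISUAL_CUES` list, and both function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from typing import Callable

# Hypothetical labels standing in for three of the modes named in the abstract.
HYBRID_LOOKUP = "hybrid_lookup"
VISION_TO_TEXT = "vision_to_text"
HIGH_REASONING = "high_reasoning"

# Assumed shape of an explicit rule mention, e.g. "V.1.2" or "T.7.1.3".
RULE_ID = re.compile(r"\b[A-Z]{1,2}\.\d+(?:\.\d+)*\b")
# Assumed cue words suggesting the query depends on a figure or table.
VISUAL_CUES = ("figure", "table", "image", "drawing", "diagram")


def route_query(query: str) -> str:
    """Send a query to exactly one pipeline (single-case routing, illustrative)."""
    if RULE_ID.search(query):
        return HYBRID_LOOKUP
    lowered = query.lower()
    if any(cue in lowered for cue in VISUAL_CUES):
        return VISION_TO_TEXT
    # Fall back to the most expensive mode for everything else.
    return HIGH_REASONING


def self_consistent_answer(sample: Callable[[], str], n: int = 5) -> str:
    """Draw n answers from `sample` and return the most frequent normalized one."""
    answers = [sample().strip().lower() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In practice `sample` would wrap a call to a language model with nonzero sampling temperature; here it is left as a plain callable so the voting logic can be exercised without any model dependency.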
Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval
engineering documentation
retrieval augmented generation
technical standards
visual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval
retrieval-augmented generation
vision-language models
adaptive routing
modular reasoning
Authors
Kiarash Naghavi Khanghah
School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269
Hoang Anh Nguyen
School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269
Anna C. Doris
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Amir Mohammad Vahedi
School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269
Daniele Grandi
Autodesk Research, The Landmark @ One Market, Ste. 400, San Francisco, CA 94105, USA
Faez Ahmed
Associate Professor, MIT
Generative AI, Engineering Design, Machine Learning, Engineering Optimization, Data-driven Design
Hongyi Xu
Associate Professor at University of Connecticut | Ford R&A | '14 PhD, Northwestern
Engineering Design, Digital Manufacturing, Artificial Intelligence, Microstructure, Metamaterial