Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

📅 2025-02-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the limited multimodal perception integration and compositional reasoning capabilities of large language models (LLMs) in complex scenarios, this paper introduces two novel benchmarks, CVQA and CPVQA, designed to evaluate cross-image visual understanding and synthesis, and the accurate interpretation and application of visual data, respectively. The authors propose three plug-and-play techniques: input-driven reasoning, minimum-margin decoding with randomness augmentation, and retrieval of semantically relevant visual information, which collectively enhance cross-modal compositional reasoning. Applied to state-of-the-art closed-source multimodal LLMs, these methods raise accuracy on CVQA and CPVQA by 22.17 and 9.40 percentage points, reaching 55.21% and 16.78%, respectively. These results demonstrate both the effectiveness and the generalizability of the approach for challenging compositional reasoning tasks involving multi-image, multimodal inputs.

📝 Abstract
Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated human cognitive function. With advances in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images, but they often overlook the necessity of combinatorial reasoning across multiple sources of perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal current models' poor performance on combinatorial reasoning benchmarks: even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA and drops to 7.38% on CPVQA. Notably, our approach improves model performance on combinatorial reasoning, with a 22.17% boost on CVQA and a 9.40% boost on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning with multiple perceptual inputs in complex scenarios. The code will be publicly available.
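The abstract's third technique, retrieving semantically relevant visual information, is not specified in detail. A minimal sketch, assuming the question and the candidate images have already been embedded into a shared vector space (e.g. by a CLIP-style encoder, which the paper does not name) and are ranked by cosine similarity, might look like this; `retrieve_relevant` and `top_k` are hypothetical names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_relevant(query_emb, image_embs, top_k=2):
    """Rank candidate image embeddings by similarity to the question
    embedding and return the indices of the top_k most relevant images,
    which would then be fed to the model alongside the question."""
    scored = sorted(
        enumerate(image_embs),
        key=lambda iv: cosine(query_emb, iv[1]),
        reverse=True,
    )
    return [i for i, _ in scored[:top_k]]
```

In a real pipeline the embeddings would come from a pretrained vision-language encoder; the toy vectors below only illustrate the ranking logic.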
Problem

Research questions and friction points this paper is trying to address.

Evaluate combinatorial reasoning over multiple perceptual inputs in multi-modal LLMs
Assess visual comprehension and synthesis (CVQA)
Assess accurate interpretation and application of visual data (CPVQA)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play multi-modal reasoning enhancement
Minimum margin decoding with randomness generation
Semantic visual information retrieval integration
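The paper does not publish its decoding algorithm, so the following is only a hypothetical reconstruction of "minimum margin decoding with randomness generation": when the probability margin between the top-two tokens is large the step stays greedy, and when it falls below a threshold the step samples with temperature to inject randomness. The function name, `margin_threshold`, and `temperature` are all assumed, not taken from the paper:

```python
import math
import random

def minimum_margin_decode_step(logits, margin_threshold=0.1,
                               temperature=1.0, rng=None):
    """One decoding step (hypothetical sketch): greedy when confident,
    stochastic when the top-2 probability margin is below the threshold."""
    rng = rng or random.Random(0)
    # Softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    margin = probs[ranked[0]] - probs[ranked[1]]
    if margin >= margin_threshold:
        return ranked[0]  # Confident: take the argmax token.
    # Near-tie: sample from a temperature-scaled distribution instead.
    scaled = [math.exp((x - m) / temperature) for x in logits]
    zs = sum(scaled)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(scaled):
        acc += p / zs
        if r <= acc:
            return i
    return len(logits) - 1
```

The design intuition, under these assumptions, is that randomness is only introduced where the model is genuinely uncertain, leaving high-confidence steps deterministic.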
👥 Authors
Chao Wang
Shanghai University, Shanghai, China
Luning Zhang
Shanghai University, MS student (multi-modal reasoning)
Zheng Wang
Zhejiang University of Technology, Zhejiang, China
Yang Zhou
Shanghai University, Shanghai, China