🤖 AI Summary
To address the limited ability of large language models (LLMs) to integrate multimodal perception and perform combinatorial reasoning in complex scenarios, this paper introduces two novel benchmarks—CVQA and CPVQA—designed to evaluate cross-image visual comprehension and synthesis (CVQA) and the accurate interpretation and application of visual data (CPVQA). The paper proposes three plug-and-play techniques: utilizing model input for reasoning, minimum-margin decoding with randomness generation, and retrieval of semantically relevant visual information, which collectively enhance cross-modal combinatorial reasoning. Applied to the state-of-the-art closed-source multimodal LLM, these methods improve accuracy on CVQA and CPVQA by 22.17% and 9.40%, reaching 55.21% and 16.78%, respectively. These results demonstrate both the effectiveness and the generalizability of the approach for challenging combinatorial reasoning tasks involving multi-image, multimodal inputs.
📝 Abstract
Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advances in multimodal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the need for combinatorial reasoning across multiple sources of perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types that assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum-margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal current models' poor performance on combinatorial reasoning benchmarks: even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA, and drops to 7.38% on CPVQA. Notably, our approach improves model performance on combinatorial reasoning, with a 22.17% gain on CVQA and a 9.40% gain on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning over multiple perceptual inputs in complex scenarios. The code will be publicly available.
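The abstract names "minimum margin decoding with randomness generation" as one of the plug-and-play techniques. The paper's exact formulation is not given here, but a common reading is: decode greedily while the model is confident, and inject randomness only when the probability margin between the top candidates is small. Below is a minimal sketch under that assumption; the function name `min_margin_decode`, the two-candidate pool, and the `threshold` value are illustrative choices, not the paper's specification.

```python
import numpy as np

def min_margin_decode(logits, threshold=0.1, rng=None):
    """Greedy decoding with margin-gated randomness (illustrative sketch).

    If the probability gap between the top two candidates is at least
    `threshold`, pick the argmax; otherwise sample between the two,
    injecting randomness exactly where the model is least certain.
    """
    rng = rng or np.random.default_rng()
    # Numerically stable softmax over the candidate logits.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # indices sorted by probability, descending
    margin = probs[order[0]] - probs[order[1]]
    if margin >= threshold:
        return int(order[0])                  # confident: deterministic greedy choice
    # Low-margin case: renormalize over the top two candidates and sample.
    top2 = order[:2]
    p = probs[top2] / probs[top2].sum()
    return int(rng.choice(top2, p=p))
```

For a clearly peaked distribution (e.g. logits `[5.0, 0.0, 0.0]`) the margin exceeds the threshold and the argmax is returned; for near-tied logits the choice is stochastic between the two leading candidates.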