🤖 AI Summary
Machine vision exhibits limited reasoning capability on non-linguistic tasks, such as spatial reasoning and medical image diagnosis, and struggles to adapt its decision-making to novel scenarios.
Method: We propose a human-inspired dual-system visual reasoning framework: System I performs rapid perceptual processing, while System II employs reinforcement learning–based self-play and inference-time optimization, enabling progressive performance gains with extended computation time. The framework eliminates reliance on large-scale labeled data by integrating few-shot learning with vision foundation models.
Contribution/Results: Evaluated on multiple computer-vision benchmarks and five organ-specific cancer localization tasks, our approach significantly outperforms supervised models, foundation models, and even human experts. It is the first to enable dynamic, scalable, inference-time learning for non-linguistic visual tasks, demonstrating that deliberate “extended thinking time” systematically enhances the robustness and generalization of visual intelligence.
📝 Abstract
Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex and unfamiliar scenarios. In contrast, machine intelligence remains bound to its training data, lacking the ability to dynamically refine solutions at inference time. While recent advances have explored reasoning in machines, these efforts are largely limited to verbal domains such as mathematical problem-solving, where explicit rules govern step-by-step reasoning. Other critical real-world tasks, including visual perception, spatial reasoning, and radiological diagnosis, require non-verbal reasoning, which remains an open challenge. Here we present a novel learning paradigm that enables machine reasoning in vision by allowing performance to improve with increasing thinking time (inference-time compute), even when labelled data are very limited. Inspired by dual-process theories of human cognition in psychology, our approach integrates a fast-thinking System I module for familiar tasks with a slow-thinking System II module that iteratively refines solutions using self-play reinforcement learning. This paradigm mimics human reasoning by proposing, competing over, and refining solutions in data-scarce scenarios. We demonstrate superior performance through extended thinking time, compared not only with large-scale supervised learning but also with foundation models and even human experts, on real-world vision tasks. These tasks include computer-vision benchmarks and cancer localisation on medical images across five organs, showcasing the transformative potential of non-verbal machine reasoning.
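The fast/slow interplay described above can be illustrated with a toy sketch. This is not the paper's method: the task, the `system1`/`system2` functions, and all parameters below are hypothetical stand-ins. A coarse, one-shot pass (System I) produces a quick guess; an iterative propose-and-compete loop (System II, standing in for self-play refinement) then spends extra inference-time compute to improve it, so more "thinking steps" can only help.

```python
import random

def score(x, y):
    # Toy "localisation" objective: higher is better, peak at (0.3, 0.7).
    # A stand-in for task reward; the real method's objective is not shown here.
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)

def system1(grid=3):
    # Fast thinking: a single coarse grid scan returning a quick first guess.
    cands = [(i / (grid - 1), j / (grid - 1))
             for i in range(grid) for j in range(grid)]
    return max(cands, key=lambda c: score(*c))

def system2(start, steps, rng):
    # Slow thinking: repeatedly propose a perturbed candidate and keep
    # whichever solution "wins" the comparison. More steps (more
    # inference-time compute) can only match or improve the solution.
    best = start
    for _ in range(steps):
        cand = (best[0] + rng.gauss(0, 0.1), best[1] + rng.gauss(0, 0.1))
        if score(*cand) > score(*best):
            best = cand
    return best

rng = random.Random(0)
fast = system1()                          # quick System I answer
slow = system2(fast, steps=200, rng=rng)  # refined System II answer
```

Because System II only ever replaces the incumbent with a strictly better candidate, `score(*slow) >= score(*fast)` holds for any step budget, mirroring the monotone benefit of extended thinking time that the abstract claims.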