Reasoning in machine vision: learning to think fast and slow

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Machine vision exhibits limited reasoning capabilities in non-linguistic tasks—such as spatial reasoning and medical image diagnosis—and struggles to adapt decision-making in novel scenarios. Method: We propose a human-inspired dual-system visual reasoning framework: System I performs rapid perceptual processing, while System II employs reinforcement learning–based self-play and inference-time optimization, enabling progressive performance gains with extended computation time. The framework eliminates reliance on large-scale labeled data by integrating few-shot learning with vision foundation models. Contribution/Results: Evaluated on multiple computer vision benchmarks and five organ-specific cancer localization tasks, our approach significantly outperforms supervised models, foundation models, and even human experts. It is the first to enable dynamic, scalable, inference-time learning for non-linguistic visual tasks—demonstrating that deliberate “extended thinking time” systematically enhances the robustness and generalization of visual intelligence.

📝 Abstract
Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex and unfamiliar scenarios. In contrast, machine intelligence remains bound to training data, lacking the ability to dynamically refine solutions at inference time. While some recent advances have explored reasoning in machines, these efforts are largely limited to verbal domains such as mathematical problem-solving, where explicit rules govern step-by-step reasoning. Other critical real-world tasks - including visual perception, spatial reasoning, and radiological diagnosis - require non-verbal reasoning, which remains an open challenge. Here we present a novel learning paradigm that enables machine reasoning in vision by allowing performance improvement with increasing thinking time (inference-time compute), even under conditions where labelled data is very limited. Inspired by dual-process theories of human cognition in psychology, our approach integrates a fast-thinking System I module for familiar tasks, with a slow-thinking System II module that iteratively refines solutions using self-play reinforcement learning. This paradigm mimics human reasoning by proposing, competing over, and refining solutions in data-scarce scenarios. We demonstrate superior performance through extended thinking time, compared not only to large-scale supervised learning but also foundation models and even human experts, in real-world vision tasks. These tasks include computer-vision benchmarks and cancer localisation on medical images across five organs, showcasing transformative potential for non-verbal machine reasoning.
Problem

Research questions and friction points this paper is trying to address.

Enabling machine reasoning in vision tasks with limited labeled data
Integrating fast and slow thinking modules for dynamic solution refinement
Improving performance on non-verbal reasoning tasks such as medical image diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-process integration for machine reasoning
Self-play reinforcement learning for refinement
Performance improves with thinking time
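The fast/slow loop described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: a hidden target in a grid plays the role of a visual localisation task, `system1` makes a one-shot guess, and `system2` repeatedly proposes rival solutions that compete under a noisy reward signal, so that more iterations (thinking time) tend to yield a better localisation.

```python
import random

def make_task(size=32, seed=2):
    """Toy stand-in for a visual task: locate a hidden target in a grid.
    The scorer returns a noisy reward that is higher near the target,
    mimicking a learned reward signal rather than ground-truth labels."""
    rng = random.Random(seed)
    target = (rng.randrange(size), rng.randrange(size))
    def score(candidate):
        dx, dy = candidate[0] - target[0], candidate[1] - target[1]
        return -(dx * dx + dy * dy) + rng.gauss(0.0, 1.0)
    return target, score

def system1(size, rng):
    """Fast thinking: a single-shot proposal (stand-in for a perception net)."""
    return (rng.randrange(size), rng.randrange(size))

def system2(initial, score, rng, steps, size=32):
    """Slow thinking: repeatedly propose a perturbed rival solution,
    let it compete against the incumbent, and keep the winner."""
    best = initial
    for _ in range(steps):
        rival = tuple(min(size - 1, max(0, c + rng.randrange(-3, 4)))
                      for c in best)
        if score(rival) > score(best):  # self-play style competition
            best = rival
    return best

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

rng = random.Random(1)
target, score = make_task()
fast = system1(32, rng)
slow = system2(fast, score, rng, steps=500)
print("fast error:", manhattan(fast, target),
      "slow error:", manhattan(slow, target))
```

More `steps` means more inference-time compute; the slow answer typically lands much closer to the target than the fast one, which is the scaling behaviour the paper argues for.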
Shaheer U. Saeed
University College London
Machine Learning, Medical Image Computing, Reinforcement Learning
Yipei Wang
Department of Medical Physics and Biomedical Engineering, University College London, UK
Veeru Kasivisvanathan
Division of Surgery and Interventional Sciences, University College London, UK
Brian R. Davidson
Division of Surgery and Interventional Sciences, University College London, UK
Matthew J. Clarkson
Professor of Biomedical Engineering at University College London
Image Guided Surgery, Medical Image Computing, Image Registration, Computer Vision
Yipeng Hu
Department of Computer Science, University College London, UK
Daniel C. Alexander
Professor of Imaging Science, Centre for Medical Image Computing, Department of Computer Science
Computer science, Machine learning, Medical imaging, diffusion MRI, Neuroscience