🤖 AI Summary
Machine vision exhibits limited reasoning capability on non-linguistic tasks, such as spatial reasoning and medical image diagnosis, and struggles to adapt its decision-making to novel scenarios.
Method: We propose a human-inspired dual-system visual reasoning framework: System I performs rapid perceptual processing, while System II employs reinforcement learning–based self-play and inference-time optimization, enabling progressive performance gains with extended computation time. The framework eliminates reliance on large-scale labeled data by integrating few-shot learning with vision foundation models.
Contribution/Results: Evaluated on multiple computer-vision benchmarks and five organ-specific cancer localization tasks, our approach significantly outperforms supervised models, foundation models, and even human experts. It is the first to enable dynamic, scalable, inference-time learning for non-linguistic visual tasks, demonstrating that deliberate “extended thinking time” systematically enhances the robustness and generalization of visual intelligence.
📝 Abstract
Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex and unfamiliar scenarios. In contrast, machine intelligence remains bound to its training data, lacking the ability to dynamically refine solutions at inference time. While recent advances have explored reasoning in machines, these efforts are largely limited to verbal domains such as mathematical problem-solving, where explicit rules govern step-by-step reasoning. Other critical real-world tasks, including visual perception, spatial reasoning, and radiological diagnosis, require non-verbal reasoning, which remains an open challenge. Here we present a novel learning paradigm that enables machine reasoning in vision by allowing performance to improve with increasing thinking time (inference-time compute), even when labelled data are very limited. Inspired by dual-process theories of human cognition in psychology, our approach integrates a fast-thinking System I module for familiar tasks with a slow-thinking System II module that iteratively refines solutions using self-play reinforcement learning. This paradigm mimics human reasoning by proposing, competing over, and refining solutions in data-scarce scenarios. We demonstrate superior performance through extended thinking time, compared not only with large-scale supervised learning but also with foundation models and even human experts, on real-world vision tasks. These tasks include computer-vision benchmarks and cancer localisation on medical images across five organs, showcasing the transformative potential of non-verbal machine reasoning.
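The fast/slow interplay described above can be illustrated with a toy sketch. This is not the paper's method: the task, the `system1`/`system2` functions, and all parameters below are hypothetical stand-ins. A coarse, one-shot pass (System I) produces a quick guess; an iterative propose-and-compete loop (System II, standing in for self-play refinement) then spends extra inference-time compute to improve it, so more "thinking steps" can only help.

```python
import random

def score(x, y):
    # Toy "localisation" objective: higher is better, peak at (0.3, 0.7).
    # A stand-in for task reward; the real method's objective is not shown here.
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)

def system1(grid=3):
    # Fast thinking: a single coarse grid scan returning a quick first guess.
    cands = [(i / (grid - 1), j / (grid - 1))
             for i in range(grid) for j in range(grid)]
    return max(cands, key=lambda c: score(*c))

def system2(start, steps, rng):
    # Slow thinking: repeatedly propose a perturbed candidate and keep
    # whichever solution "wins" the comparison. More steps (more
    # inference-time compute) can only match or improve the solution.
    best = start
    for _ in range(steps):
        cand = (best[0] + rng.gauss(0, 0.1), best[1] + rng.gauss(0, 0.1))
        if score(*cand) > score(*best):
            best = cand
    return best

rng = random.Random(0)
fast = system1()                          # quick System I answer
slow = system2(fast, steps=200, rng=rng)  # refined System II answer
```

Because System II only ever replaces the incumbent with a strictly better candidate, `score(*slow) >= score(*fast)` holds for any step budget, mirroring the monotone benefit of extended thinking time that the abstract claims.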