🤖 AI Summary
Existing single-agent medical vision-language models (Med-LVLMs) generalize poorly across medical specialties, while mainstream multi-agent frameworks rely on static interaction protocols and lack adaptive reasoning. To address this, the authors propose MMedAgent-RL, the first reinforcement learning–based (PPO) framework for dynamic collaboration among medical agents. Built on Qwen2.5-VL, it trains two generalist agents via RL: a triage doctor that routes cases to the appropriate specialties, and an attending physician that aggregates specialist judgments with its own knowledge to reach a final decision. A curriculum-guided RL strategy teaches the attending physician to adaptively weigh specialist inputs and actively correct their mistakes. Evaluated on five medical visual question answering benchmarks, the method achieves an average 18.4% improvement over supervised fine-tuning baselines and significantly outperforms leading open- and closed-source Med-LVLMs, while exhibiting human-like stepwise reasoning. Core contributions: (1) an RL-driven dynamic collaboration mechanism that enables adaptive inter-agent reasoning, and (2) a generalizable multi-specialty reasoning paradigm applicable across diverse medical domains.
📝 Abstract
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, in which general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments of multiple specialists with its own knowledge to make final decisions. To handle inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy that progressively teaches the attending physician to balance imitating specialists against correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns. Notably, it achieves an average performance gain of 18.4% over supervised fine-tuning baselines.
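The curriculum-guided balance between imitating specialists and correcting their mistakes can be pictured as a reward that shifts weight from specialist agreement toward ground-truth correctness as training advances. The sketch below is a minimal illustration under that assumption; the function name, the linear weighting schedule, and the majority-vote aggregation are hypothetical and not taken from the paper's implementation.

```python
def curriculum_reward(final_answer, specialist_answers, gold_answer, progress):
    """Hypothetical curriculum-guided reward for the attending physician.

    progress in [0, 1] is the fraction of training completed. Early on
    (progress near 0) the reward mostly encourages agreeing with the
    specialist majority; later (progress near 1) it mostly rewards
    matching the ground-truth answer, allowing the agent to learn to
    override specialist mistakes.
    """
    assert 0.0 <= progress <= 1.0
    # Imitation term: agreement with the (possibly noisy) specialist majority.
    majority = max(set(specialist_answers), key=specialist_answers.count)
    imitation = 1.0 if final_answer == majority else 0.0
    # Correctness term: agreement with the ground-truth label.
    correctness = 1.0 if final_answer == gold_answer else 0.0
    # Curriculum: linearly interpolate between the two objectives.
    return (1.0 - progress) * imitation + progress * correctness
```

For example, when specialists vote ["A", "A", "B"] but the correct answer is "B", answering "A" earns full reward at the start of training (pure imitation) but zero at the end, while answering "B" earns full reward once the curriculum has shifted to correctness.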