ToolFG: Towards Well-Grounded Fine-Grained Image Classification

📅 2026-06-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses fine-grained image classification, where subtle inter-class differences make it hard to discriminate reliably between highly similar categories. The authors propose ToolFG, a framework that integrates multimodal large language models (MLLMs) with external visual tools, so that the model can autonomously invoke tools and interact with the input image during reasoning to gather verifiable visual evidence. The two key contributions are a Monte Carlo Tree Search (MCTS)-guided knowledge distillation mechanism, which mines tool-use knowledge from advanced proprietary MLLMs for training, and a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy. The authors report that extensive experiments demonstrate the effectiveness of the framework.
📝 Abstract
Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel MCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.
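The abstract does not give implementation details, so the following is only a minimal sketch of what a tool-integrated classification loop of this general kind might look like. Every name here (`crop`, `mean_intensity`, `toy_policy`, `classify_with_tools`, the label strings) is a hypothetical stand-in and not part of ToolFG; in particular, the hand-written `toy_policy` takes the place of a trained MLLM, and the two toy tools take the place of real visual tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical image type: a 2D grid of pixel intensities.
Image = List[List[int]]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Hypothetical tool: sub-image for box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def mean_intensity(image: Image) -> float:
    """Hypothetical tool: average pixel value, standing in for a visual measurement."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels) if pixels else 0.0


@dataclass
class Step:
    tool: str            # name of the tool that was called
    observation: object  # what the tool returned


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    label: str = ""


# An action is (kind, tool name or label, tool arguments).
Action = Tuple[str, str, tuple]


def toy_policy(image: Image, trace: Trace) -> Action:
    """Stand-in for the model's tool-use policy (assumption: a real system
    would query an MLLM here instead of following fixed rules)."""
    if len(trace.steps) == 0:
        # First, zoom in on the top-left region of the image.
        return ("call", "crop", (image, (0, 0, 2, 2)))
    if len(trace.steps) == 1:
        # Then measure the cropped region.
        return ("call", "mean_intensity", (trace.steps[0].observation,))
    # Finally, answer based on the collected evidence.
    score = trace.steps[1].observation
    return ("answer", "species_a" if score > 100 else "species_b", ())


def classify_with_tools(
    image: Image,
    policy: Callable[[Image, Trace], Action],
    tools: Dict[str, Callable],
    max_steps: int = 5,
) -> Trace:
    """Alternate between policy decisions and tool calls until an answer is given."""
    trace = Trace()
    for _ in range(max_steps):
        kind, name, args = policy(image, trace)
        if kind == "answer":
            trace.label = name
            return trace
        trace.steps.append(Step(tool=name, observation=tools[name](*args)))
    trace.label = "unknown"  # step budget exhausted without an answer
    return trace


TOOLS = {"crop": crop, "mean_intensity": mean_intensity}
image = [[200, 200, 0], [200, 200, 0], [0, 0, 0]]
trace = classify_with_tools(image, toy_policy, TOOLS)
print(trace.label, [step.tool for step in trace.steps])
```

The returned trace records each tool call and its observation alongside the final label, which is one simple way to make the evidence behind a prediction inspectable.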
Problem

Research questions and friction points this paper is trying to address.

fine-grained image classification
well-grounded reasoning
visual cues
highly similar categories
reliable classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

ToolFG
fine-grained image classification
tool-integrated MLLM
MCTS-guided knowledge distillation
model-tool co-evolution