Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

2026-06-09
AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models (MLLMs) in real-world physical tool-use scenarios. We introduce PhysTool-Bench, the first benchmark specifically designed for this task, comprising 2,678 real-world tools and 2,510 task-oriented queries, and establish an end-to-end evaluation framework centered on tool recognition and usage planning. Leveraging a large-scale dataset of real images paired with task instructions, we evaluate 13 prominent MLLMs and find that even the strongest model, Gemini-3.1-Pro, correctly identifies only 58.7% of tools and achieves a mere 21.0% task-completion rate. These results reveal substantial limitations in current models' abilities to perceive real-world tools and reason about their functional affordances, offering critical insights for advancing embodied intelligence.
Abstract
Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite this importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.
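The two evaluation dimensions described above can be illustrated with a minimal scoring sketch. The abstract does not say how recognition or end-to-end completion are computed, so everything here is an assumption: the `Example` fields, micro-averaged recall for recognition, and exact-sequence match for completion are hypothetical stand-ins, not the benchmark's actual protocol. The sample data is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One hypothetical benchmark item (all field names are assumptions)."""
    gold_tools: set[str]   # tools annotated as present in the scene
    pred_tools: set[str]   # tools the model reported seeing
    gold_plan: list[str]   # reference tool-use sequence for the query
    pred_plan: list[str]   # tool-use sequence proposed by the model


def recognition_rate(examples: list[Example]) -> float:
    """Share of annotated tools the model identified (assumed: micro-averaged recall)."""
    found = sum(len(ex.gold_tools & ex.pred_tools) for ex in examples)
    total = sum(len(ex.gold_tools) for ex in examples)
    return found / total if total else 0.0


def completion_rate(examples: list[Example]) -> float:
    """Share of queries solved end-to-end (assumed: exact match with the reference plan)."""
    if not examples:
        return 0.0
    solved = sum(ex.pred_plan == ex.gold_plan for ex in examples)
    return solved / len(examples)


# Invented sample data, not taken from PhysTool-Bench.
examples = [
    Example({"hammer", "nail", "pliers"}, {"hammer", "nail"},
            ["nail", "hammer"], ["nail", "hammer"]),
    Example({"scalpel", "forceps"}, {"forceps"},
            ["scalpel", "forceps"], ["forceps"]),
]

print(f"recognition: {recognition_rate(examples):.1%}")
print(f"completion:  {completion_rate(examples):.1%}")
```

Under these assumptions, a missed tool lowers recognition only slightly, but a plan that needs that tool fails the whole query. This is one plausible reason an end-to-end completion rate can sit well below the recognition rate, as in the reported 58.7% versus 21.0%.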
Problem

Research questions and friction points this paper is trying to address.

physical tool use
Multimodal Large Language Models
embodied AI
functional commonsense
tool recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

physical tool use
Multimodal Large Language Models
embodied AI
tool recognition
functional commonsense