🤖 AI Summary
This study addresses the long-standing challenge of computationally modeling flexible tool selection—a core human cognitive capability that existing computational approaches capture poorly. We propose a cross-modal, low-dimensional semantic alignment framework that unifies visual, functional, and psychological tool properties with linguistic task requirements via 13 interpretable, human-defined attributes (e.g., graspability, hand-relatedness, elongation), enabling attribute-level cross-modal matching. Departing from opaque end-to-end modeling, our lightweight architecture combines ResNet or ViT for vision encoding with fine-tuned GPT-2, LLaMA, or DeepSeek for language grounding. Evaluated on our novel ToolNet dataset (115 tools × 13 attributes × scene descriptions), the method achieves 74% accuracy, substantially outperforming baselines (20%–58%). Our work establishes an interpretable, parameter-efficient paradigm for tool cognition modeling, offering a principled foundation for embodied intelligence and cognitive computation.
📝 Abstract
Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images, while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%–58%), while approaching the performance of much larger models such as GPT-4o (73%) with substantially fewer parameters. Ablation studies revealed that manipulation-related attributes (graspability, hand-relatedness, elongation) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive-science understanding and practical applications in tool selection tasks.
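The core matching step described above—scoring each tool's 13-dimensional attribute vector against the attribute requirements derived from a task description—can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the attribute values, tool names, and the cosine-similarity scoring rule are assumptions for demonstration (the paper does not specify its exact matching function here).

```python
import numpy as np

# Hypothetical 13-dimensional attribute vectors. The first three slots
# stand in for the manipulation-related attributes the ablations flag
# as most critical (graspability, hand-relatedness, elongation); the
# remaining ten are placeholders for the other attributes.
N_ATTRS = 13

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two attribute vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_tool(required: np.ndarray, tool_attrs: dict[str, np.ndarray]) -> str:
    """Pick the tool whose (vision-predicted) attribute vector best
    matches the (language-derived) required attributes."""
    return max(tool_attrs, key=lambda name: cosine(required, tool_attrs[name]))

# Toy task: "drive a nail" -> high graspability, hand-relatedness, elongation.
required = np.array([0.9, 0.8, 0.7] + [0.1] * (N_ATTRS - 3))
tools = {
    "hammer": np.array([0.9, 0.9, 0.6] + [0.1] * (N_ATTRS - 3)),
    "sponge": np.array([0.8, 0.7, 0.1] + [0.2] * (N_ATTRS - 3)),
}
print(select_tool(required, tools))  # → hammer
```

In the full system, `required` would come from a fine-tuned language model reading the scenario text and `tool_attrs` from a ResNet/ViT encoder over tool images; the low-dimensional attribute space is what makes the match interpretable at the level of individual properties.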