Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the long-standing challenge of computationally modeling human flexible tool selection—a core cognitive capability poorly captured by existing computational approaches. We propose a cross-modal, low-dimensional semantic alignment framework that unifies visual, functional, and psychological tool properties with linguistic task requirements via 13 interpretable, human-defined attributes (e.g., graspability, hand-relatedness, slenderness), enabling attribute-level cross-modal matching. Departing from opaque end-to-end modeling, our lightweight architecture combines ResNet or ViT for vision encoding with fine-tuned GPT-2 or LLaMA for language grounding. Evaluated on our novel ToolNet dataset (115 tools × 13 attributes × scene descriptions), the method achieves 74% accuracy—substantially outperforming baselines (20%–58%). Our work establishes the first interpretable, parameter-efficient, and computationally efficient paradigm for tool cognition modeling, offering a principled foundation for embodied intelligence and cognitive computation.

Technology Category

Application Category

📝 Abstract
Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks-significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Ablation studies revealed that manipulation-related attributes (graspability, hand-relatedness, elongation) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.
Problem

Research questions and friction points this paper is trying to address.

Bridging visual tool perception and linguistic task understanding
Developing parameter-efficient models for human-like tool selection
Identifying critical attributes for cross-modal tool cognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-dimensional attribute alignment for vision-language integration
Comprehensive ToolNet dataset with 13 multi-property attributes
Parameter-efficient framework outperforming larger models in accuracy
🔎 Similar Papers
2024-01-19IEEE Workshop/Winter Conference on Applications of Computer VisionCitations: 14
Guangfu Hao
Guangfu Hao
Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation, CAS
Computational NeuroscienceBrain-Inspired Neural NetworksLarge Language ModelsCognitive Models
H
Haojie Wen
State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University; IDG/McGovern Institute for Brain Research, Beijing Normal University
L
Liangxuna Guo
Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation Chinese Academy of Sciences (CASIA); School of Future Technology, University of Chinese Academy of Sciences (UCAS)
Y
Yang Chen
Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation Chinese Academy of Sciences (CASIA)
Yanchao Bi
Yanchao Bi
Professor, School of Psychological and Cognitive Sciences, Peking University
cognitive neuroscienceconceptslanguageneuroimaging
S
Shan Yu
Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation Chinese Academy of Sciences (CASIA); School of Future Technology, University of Chinese Academy of Sciences (UCAS)