🤖 AI Summary
To address the insufficient accuracy of vision-based action recognition in complex indoor environments, this paper proposes a knowledge-enhanced prompt learning framework: structured semantic descriptions of actions, formulated as verb-object-scene triplets, are encoded into learnable text prompts and injected into a frozen vision-language model (VLM), enabling low-supervision, end-to-end action recognition. The method operates solely on RGB video inputs, requiring no auxiliary modalities and no fine-tuning of the visual backbone. Its key innovation lies in explicitly modeling domain knowledge as structured textual prompts and introducing a multi-granularity encoding strategy that improves semantic alignment between visual and linguistic representations. Evaluated on the ETRI-Activity3D dataset, the approach achieves 95.2% accuracy, significantly outperforming existing state-of-the-art methods, and demonstrates both the effectiveness and the generalization potential of knowledge-guided prompt learning for robotic vision understanding tasks.
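As a concrete illustration (not the paper's released code), the minimal PyTorch sketch below shows one plausible reading of this design: each action class is rendered from a verb-object-scene triplet into a textual description, and a small set of learnable context vectors, in the style of CoOp, is prepended to the frozen embedding of that description. Only the context vectors would be trained; the VLM stays frozen. The class names, triplets, and prompt template are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical verb-object-scene triplets for a few indoor actions;
# the paper's actual class descriptions are not reproduced here.
TRIPLETS = {
    "drinking water": ("drinking", "a cup of water", "kitchen"),
    "reading a book": ("reading", "a book", "living room"),
    "washing hands":  ("washing", "their hands", "bathroom"),
}

def triplet_to_text(verb: str, obj: str, scene: str) -> str:
    """Render a verb-object-scene triplet as a class-level text description."""
    return f"a person {verb} {obj} in the {scene}"

class KnowledgePrompt(nn.Module):
    """CoOp-style learnable prompt: n_ctx trainable context vectors are
    prepended to the frozen token embeddings of each class description."""
    def __init__(self, n_classes: int, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        # The only trainable parameters; the VLM itself stays frozen.
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))
        self.n_classes = n_classes

    def forward(self, class_token_embs: torch.Tensor) -> torch.Tensor:
        # class_token_embs: (n_classes, seq_len, dim), produced by the
        # frozen text embedding layer from the triplet descriptions.
        ctx = self.ctx.unsqueeze(0).expand(self.n_classes, -1, -1)
        return torch.cat([ctx, class_token_embs], dim=1)

descriptions = [triplet_to_text(*t) for t in TRIPLETS.values()]
# -> ["a person drinking a cup of water in the kitchen", ...]
```

In this reading, the domain knowledge enters through the triplet-derived text, while the learnable context adapts that description to the frozen VLM's embedding space.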
📝 Abstract
Accurate vision-based action recognition is crucial for developing autonomous robots that can operate safely and reliably in complex, real-world environments. In this work, we advance video-based recognition of indoor daily actions for robotic perception by leveraging vision-language models (VLMs) enriched with domain-specific knowledge. We adapt a prompt-learning framework in which class-level textual descriptions of each action are embedded as learnable prompts into a frozen pre-trained VLM backbone. Several strategies for structuring and encoding these textual descriptions are designed and evaluated. Experiments on the ETRI-Activity3D dataset demonstrate that our method, using only RGB video inputs at test time, achieves over 95% accuracy and outperforms state-of-the-art approaches. These results highlight the effectiveness of knowledge-augmented prompts in enabling robust action recognition with minimal supervision.
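Training under such a framework plausibly reduces to fitting only the prompt parameters against video-text similarity logits. The sketch below, again a hypothetical PyTorch reading rather than the authors' implementation, shows one optimization step assuming a CLIP-like dual encoder: the visual backbone runs under `no_grad`, the text encoder's weights are frozen (`requires_grad=False`) yet gradients still flow through it into the learnable context vectors, and classification is cosine similarity against the encoded prompts. All function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def prompt_learning_step(video_encoder, text_encoder, prompt_module,
                         token_embs, videos, labels, optimizer,
                         logit_scale: float = 100.0) -> float:
    """One optimization step in which only prompt_module's parameters learn.

    Assumes video_encoder and text_encoder form a frozen CLIP-like pair
    whose weights have requires_grad=False; gradients still flow *through*
    the text encoder into the learnable context vectors.
    """
    with torch.no_grad():                         # frozen visual backbone
        v = video_encoder(videos)                 # (batch, dim)
    t = text_encoder(prompt_module(token_embs))   # (n_classes, dim)

    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = logit_scale * v @ t.t()              # cosine-similarity logits

    loss = F.cross_entropy(logits, labels)        # standard supervised loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this setup the trainable parameter count is tiny relative to the backbone, which is consistent with the abstract's framing of robust recognition under minimal supervision.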