NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing human motion understanding benchmarks suffer from coarse semantic granularity, undifferentiated difficulty levels, limited annotation quality, and ambiguous answers, which hinders precise diagnosis of model deficiencies. To address these limitations, this work proposes NextMotionQA, the first fine-grained, multi-level human motion understanding benchmark that combines semi-automated construction with expert verification. It comprises three tasks: multiple-choice question answering, video captioning, and fine-grained error correction, each structured along three semantic axes and three complexity levels. Evaluation of twelve vision-language models (VLMs) on the benchmark exposes capability gaps that single-task evaluations miss. In a complementary study of VLMs as judges for text-to-motion evaluation, the models agree strongly with human experts on coarse-grained criteria (Cohen's κ = 0.70) but poorly on fine-grained, body-part-level judgment (κ = 0.10), marking a clear limit of current models' perceptual and semantic capabilities.

📝 Abstract

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation on harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ=0.70) but break down on fine-grained, part-level judgment (κ=0.10), validating the paradigm in its strong regime while clarifying its limits.
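
The agreement figures above are reported as Cohen's κ, which corrects raw agreement between two raters for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows how such a score could be computed for one VLM judge against one expert. It is illustrative only and is not the authors' evaluation code: the function name `cohens_kappa` and the labels in `expert_labels` and `vlm_labels` are assumptions made up for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)

    # Observed agreement p_o: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement p_e: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)

    # Kappa is undefined when p_e == 1 (both raters used a single label);
    # returning 1.0 here is a convention chosen for this sketch.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of ten generated motions (made-up data).
expert_labels = ["good", "good", "good", "good", "good",
                 "bad", "bad", "bad", "bad", "bad"]
vlm_labels = ["good", "good", "good", "good", "bad",
              "bad", "bad", "bad", "bad", "good"]

kappa = cohens_kappa(expert_labels, vlm_labels)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is p_e = 0.5, so κ is about 0.6. A value near 0.10, as reported for part-level judgment, means agreement is barely above what chance would produce.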

Problem

Research questions and friction points this paper is trying to address.

human motion understanding

vision-language models

benchmark evaluation

semantic granularity

answer ambiguity

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models

human motion understanding

fine-grained evaluation