See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses how multimodal retail agents can proactively recognize customer behavior and deliver timely service before an explicit request is made. The authors propose a See-Infer-Intervene (SII) framework and instantiate it with the Proactive Intent World Model (PIWM), which represents customer state by combining the AIDA purchasing phases with the Belief-Desire-Intention (BDI) cognitive model, predicts action-conditioned intent transitions, and selects a service action. They also release GuidanceSalesBench, a smart-retail benchmark for proactive service. When conditioned on ground-truth customer state, PIWM reaches 0.641 macro F1 in action selection on 30 held-out videos, outperforming a zero-shot Qwen2.5-VL-7B baseline. End-to-end selection from raw video drops to 0.295, below the reported balanced random baseline of 0.414, which identifies video-to-state grounding as the main bottleneck. A preliminary staged real-store pilot, recorded with paid participants performing scripted behaviors, reaches 0.579 macro F1 on 20 fully annotated videos.
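Since every headline number here is a macro F1 score, a short illustration of the metric may help. Macro F1 is the unweighted mean of per-class F1, so a rarely chosen action counts as much as a frequent one. The sketch below is plain Python over the five response classes the paper lists (Greet, Elicit, Inform, Recommend, Hold). The function name `macro_f1` and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
# Response classes named in the paper's abstract.
ACTIONS = ["Greet", "Elicit", "Inform", "Recommend", "Hold"]


def macro_f1(y_true, y_pred, labels=ACTIONS):
    """Unweighted mean of per-class F1.

    Assumption for this sketch: a class with no true positives, false
    positives, or false negatives contributes an F1 of 0.0.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean of
        # precision and recall.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up toy labels, NOT data from the paper.
y_true = ["Greet", "Elicit", "Inform", "Recommend", "Hold", "Hold"]
y_pred = ["Greet", "Elicit", "Inform", "Hold", "Hold", "Hold"]
score = macro_f1(y_true, y_pred)
```

In this toy case a single missed Recommend zeroes that class's F1 and lowers the average by a full fifth, which is why macro F1 is sensitive to errors on infrequent actions.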

📝 Abstract

Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See-Infer-Intervene (SII) framework, where a device must see pre-interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (Belief, Desire, Intention) psychological fields, predicts action-conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart-retail benchmark containing state manifests, pre-interaction videos, candidate responses, action-conditioned outcomes, and best-action labels. When conditioned on ground-truth customer state to isolate action selection, PIWM achieves 0.641 macro F1 on 30 held-out target videos, outperforming a zero-shot Qwen2.5-VL-7B baseline and training variants without balanced action supervision; end-to-end video-only selection drops to 0.295, below the 5-class balanced random baseline of 0.414, identifying video-to-state grounding as the dominant deployment-time bottleneck. A preliminary staged real-store pilot (recorded with paid participants performing scripted customer behaviors) reaches 0.579 action macro F1 on 20 fully annotated videos, with 10 additional accessible videos released with index-level labels.
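To make the infer-then-intervene step concrete, here is a minimal sketch of how a state with an AIDA phase and BDI fields could drive action selection through an action-conditioned transition predictor. Everything beyond the phase names, the BDI field names, and the five response classes is an assumption made for illustration: `CustomerState`, `predict_next_phase`, `select_action`, the hand-written transition table, and the "pick the action with the furthest predicted phase" rule are hypothetical stand-ins, not the paper's learned model or its selection criterion.

```python
from dataclasses import dataclass
from enum import IntEnum


class AIDA(IntEnum):
    """AIDA purchasing phases, ordered from earliest to latest."""
    ATTENTION = 0
    INTEREST = 1
    DESIRE = 2
    ACTION = 3


# The five response classes named in the abstract.
ACTIONS = ("Greet", "Elicit", "Inform", "Recommend", "Hold")


@dataclass(frozen=True)
class CustomerState:
    phase: AIDA
    belief: str      # BDI fields kept as free text in this sketch
    desire: str
    intention: str


# Hypothetical toy table: which intervention advances each phase.
# A real system would replace this with a learned transition model.
_TOY_HELPFUL = {
    AIDA.ATTENTION: "Greet",
    AIDA.INTEREST: "Elicit",
    AIDA.DESIRE: "Recommend",
}


def predict_next_phase(state: CustomerState, action: str) -> AIDA:
    """Toy action-conditioned transition: state + action -> next phase."""
    if action == "Hold":
        return state.phase                    # waiting changes nothing
    if _TOY_HELPFUL.get(state.phase) == action:
        return AIDA(state.phase + 1)          # well-timed help advances
    return AIDA(max(state.phase - 1, 0))      # mistimed help sets back


def select_action(state: CustomerState) -> str:
    """Assumed rule: choose the action with the furthest predicted phase."""
    return max(ACTIONS, key=lambda a: predict_next_phase(state, a))


browsing = CustomerState(AIDA.ATTENTION, "store sells shoes", "unknown", "look around")
deciding = CustomerState(AIDA.ACTION, "found the right size", "buy", "head to checkout")
chosen_for_browsing = select_action(browsing)
chosen_for_deciding = select_action(deciding)
```

Under this toy table the browsing customer is greeted, while the customer already heading to checkout is left alone: Hold competes as an ordinary candidate and wins whenever every intervention is predicted to set the customer back.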

Problem

Research questions and friction points this paper is trying to address.

proactive intervention

customer intent inference

multimodal retail agents

goal-oriented social intelligence

pre-interaction behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive Intent World Model

See-Infer-Intervene framework

AIDA-BDI state representation

GuidanceSalesBench

action-conditioned intent prediction

🔎 Similar Papers

Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective

2024-10-08 · Citations: 0