M-CALLM: Multi-level Context Aware LLM Framework for Group Interaction Prediction

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of predicting group coordination patterns in collaborative mixed reality (MR). We propose a multi-level context-aware large language model (LLM) framework that encodes multimodal sensor streams into hierarchical natural-language contexts—capturing individual actions, group structure, and temporal dynamics—to support zero-shot prompting, few-shot learning, and supervised fine-tuning, augmented with low-latency inference techniques. Our key contribution is the first real-time semantic modeling of MR group interactions using LLMs, with systematic quantification of the differential contributions of the multimodal inputs. Evaluated on experimental data from 64 participants, our approach achieves 96% accuracy in conversation-level prediction—3.2× higher than an LSTM baseline—while maintaining end-to-end latency under 35 ms. However, autoregressive simulation suffers an 83% performance drop due to error accumulation, revealing a critical limitation of LLMs in long-horizon collaborative modeling.

📝 Abstract
This paper explores how large language models can leverage multi-level contextual information to predict group coordination patterns in collaborative mixed reality environments. We demonstrate that encoding individual behavioral profiles, group structural properties, and temporal dynamics as natural language enables LLMs to break through the performance ceiling of statistical models. We build M-CALLM, a framework that transforms multimodal sensor streams into hierarchical context for LLM-based prediction, and evaluate three paradigms (zero-shot prompting, few-shot learning, and supervised fine-tuning) against statistical baselines across intervention mode (real-time prediction) and simulation mode (autoregressive forecasting). Head-to-head comparison on 16 groups (64 participants, ~25 hours) demonstrates that context-aware LLMs achieve 96% accuracy for conversation prediction, a 3.2x improvement over LSTM baselines, while maintaining sub-35ms latency. However, simulation mode reveals brittleness, with 83% degradation due to cascading errors. A deep-dive into modality-specific performance shows that conversation prediction depends on temporal patterns, proximity prediction benefits from group structure (+6%), and shared-attention prediction fails completely (0% recall), exposing architectural limitations. We hope this work spawns new ideas for building intelligent collaborative sensing systems that balance semantic reasoning capabilities with fundamental constraints.
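The 83% degradation in simulation mode comes from cascading errors: in autoregressive forecasting, each step consumes the model's previous prediction rather than ground truth. A toy calculation (not from the paper, and assuming independent per-step errors) shows how even high single-step accuracy compounds away over a long horizon.

```python
# Toy illustration of error accumulation in autoregressive rollout:
# if every step must be correct given the previous prediction, expected
# trajectory accuracy is the per-step accuracy raised to the horizon.
# The 0.96 figure is the reported single-step conversation accuracy;
# the independence assumption is a simplification.

def expected_accuracy(per_step_acc: float, horizon: int) -> float:
    """Probability that all `horizon` chained predictions are correct."""
    return per_step_acc ** horizon

step_acc = 0.96
for h in (1, 10, 50):
    print(f"horizon {h:>2}: expected accuracy {expected_accuracy(step_acc, h):.3f}")
```

Even at 96% per step, a 50-step rollout retains only about 13% expected trajectory accuracy under this model, which is consistent in spirit with the brittleness the abstract reports.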
Problem

Research questions and friction points this paper is trying to address.

Predicting group coordination patterns using multi-level contextual information
Transforming multimodal sensor data into hierarchical context for LLM prediction
Evaluating LLM performance against statistical baselines in collaborative environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level context encoding for group interaction prediction
Transforming sensor streams into hierarchical LLM inputs
Evaluating three paradigms against statistical baselines