🤖 AI Summary
Multimodal activity recognition from sensor data faces challenges including scarce modality-aligned training samples and high computational overhead for model training. Method: This paper proposes a zero-shot, LLM-driven late-fusion framework that bypasses end-to-end multimodal modeling. It leverages pretrained large language models (LLMs) to perform cross-modal inference by feeding modality-specific encoded features—such as class labels or semantic descriptions derived from audio and motion time-series data—directly as contextual prompts. Crucially, the approach requires no modality-aligned annotations or parameter updates. Contribution/Results: Evaluated on a 12-class subset of Ego4D, the method achieves zero-shot and one-shot activity recognition with F1 scores significantly surpassing random baselines. By eliminating alignment supervision and gradient-based optimization, it substantially reduces computational and storage costs while demonstrating strong generalization and practical applicability.
📝 Abstract
Sensor data streams provide valuable information about activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time-series data. We curated a subset of the Ego4D dataset for diverse activity recognition across contexts (e.g., household activities, sports). Evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion of modality-specific model outputs can enable multimodal temporal applications where aligned training data for learning a shared embedding space is limited. Additionally, LLM-based fusion can enable model deployment without the additional memory and computation required by targeted, application-specific multimodal models.
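The late-fusion idea described above can be sketched in a few lines: each unimodal model emits its top class labels with confidences, these are formatted into a textual prompt, and a pretrained LLM picks the final activity with no gradient updates or aligned training data. This is a minimal illustrative sketch, not the paper's implementation; the class list, prediction format, and `llm_complete` callable are all assumptions.

```python
# Hypothetical activity classes; the paper uses a 12-class Ego4D subset.
ACTIVITY_CLASSES = ["cooking", "cycling", "vacuuming", "playing basketball"]

def build_fusion_prompt(audio_preds, motion_preds, classes=ACTIVITY_CLASSES):
    """Format per-modality (label, confidence) predictions as LLM context."""
    fmt = lambda preds: ", ".join(f"{lbl} ({conf:.2f})" for lbl, conf in preds)
    return (
        "An audio model and a motion model each classified the same clip.\n"
        f"Audio top predictions: {fmt(audio_preds)}\n"
        f"Motion top predictions: {fmt(motion_preds)}\n"
        f"Choose the single most likely activity from: {', '.join(classes)}.\n"
        "Answer with the class name only."
    )

def fuse(audio_preds, motion_preds, llm_complete):
    """Late fusion: the LLM reconciles the two modality-specific label
    distributions at inference time -- no parameter updates involved."""
    answer = llm_complete(build_fusion_prompt(audio_preds, motion_preds))
    answer = answer.strip().lower()
    return answer if answer in ACTIVITY_CLASSES else None

# Stub LLM for demonstration only; a real system would call a pretrained LLM.
def fake_llm(prompt):
    return "cooking"

print(fuse([("cooking", 0.61), ("vacuuming", 0.22)],
           [("cooking", 0.48), ("cycling", 0.30)],
           fake_llm))  # -> cooking
```

Because fusion happens purely in the prompt, swapping in a different set of unimodal models or activity classes requires no retraining, which is the source of the memory and compute savings the abstract highlights.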