MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of long-tail events being obscured by dominant (head) activities and environmental noise in complex social scenarios. The authors propose a multi-agent collaborative framework built on a lightweight multimodal large language model. The framework applies test-time adaptation (TTA) across the entire reasoning pipeline, uses LoRA for distillation-enhanced, instance-level fine-tuning, and renders critical long-tail events as explicit formatted text so that they are not overshadowed. By combining knowledge distillation in both training and inference with Chain-of-Thought prompting and self-reflection, the method is reported to achieve state-of-the-art results across multiple benchmarks in comparisons with both open-source and proprietary models, while using only about 30% of the IntentTrain training data.

📝 Abstract

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. Using roughly 30% of the IntentTrain training data, we achieve state-of-the-art results. Code is available at https://github.com/eeee-sys/MODF-SIR, a demo at https://huggingface.co/spaces/Harry-1234/MODF-SIR, the LoRA weights at https://huggingface.co/Harry-1234/MODF-SIR, and the router training dataset at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.
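
The abstract's idea of rendering long-tail events as formatted, explicit text can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: the `Event` schema, the frequency threshold, the function names, and the prompt layout are all hypothetical. The sketch only shows the general pattern of filtering rare events and writing them into the prompt as plain text so that they are not lost among head events.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A detected event in a clip (hypothetical schema, not from the paper)."""
    label: str        # short description of what happened
    modality: str     # e.g. "video" or "audio"
    start_s: float    # start time in seconds
    end_s: float      # end time in seconds
    frequency: float  # assumed prior frequency of this event type, in [0, 1]


def select_long_tail(events, max_frequency=0.05):
    """Keep events whose assumed prior frequency is at or below the threshold."""
    return [e for e in events if e.frequency <= max_frequency]


def render_events(events):
    """Render events as explicit text lines, ordered by start time."""
    ordered = sorted(events, key=lambda e: e.start_s)
    return "\n".join(
        f"[{e.start_s:.1f}s-{e.end_s:.1f}s] ({e.modality}) {e.label}"
        for e in ordered
    )


def build_prompt(question, events, max_frequency=0.05):
    """Prepend the textualized long-tail events to the question."""
    tail = select_long_tail(events, max_frequency)
    block = render_events(tail) if tail else "(none detected)"
    return f"Rare events observed:\n{block}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Illustrative input only; the values are made up.
    clip_events = [
        Event("people talking at a table", "video", 0.0, 30.0, 0.60),
        Event("brief eye-roll by the listener", "video", 12.4, 12.9, 0.01),
        Event("sigh before answering", "audio", 13.0, 13.8, 0.02),
    ]
    print(build_prompt("How does the listener feel?", clip_events))
```

In this sketch the head event (the ongoing conversation) is dropped from the explicit block, while the two rare cues are kept with their timestamps and modality. How the actual framework detects, scores, and formats long-tail events is not specified in the abstract.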

Problem

Research questions and friction points this paper is trying to address.

social intelligence reasoning

multimodal data

long-tail events

knowledge distillation

test-time adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation

test-time adaptation

long-tail event extraction

low-rank adaptation

multimodal social reasoning

🔎 Similar Papers

Towards Rationality in Language and Multimodal Agents: A Survey

2024-06-01 · Citations: 6

Cognitive Insights and Stable Coalition Matching for Fostering Multi-Agent Cooperation

2024-05-28 · Citations: 0