MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception

📅 2025-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing micro-expression recognition methods primarily focus on discrete emotion classification and fail to adequately model subtle, transient facial dynamics and deep affective semantics. Method: This paper introduces the first multimodal large language model (MLLM) for micro-expression analysis, the Micro-Expression Large Language Model (MELLM). It proposes a motion-enhanced color-map input that fuses onset-apex optical flow dynamics with the grayscale onset frame; constructs an instruction-description dataset from Facial Action Coding System (FACS) action-unit annotations and emotion labels; and employs joint vision-language fine-tuning coupled with FACS-driven instruction tuning. Contribution/Results: Experiments demonstrate that MELLM significantly improves robustness and generalization in micro-expression understanding (MEU) across multiple benchmarks. It enables fine-grained affective-semantic generation guided by motion-sensitive facial regions, establishing a novel, interpretable, and reasoning-capable paradigm for micro-expression analysis.

📝 Abstract
Micro-expressions (MEs) are crucial psychological responses with significant potential for affective computing. However, current automatic micro-expression recognition (MER) research primarily focuses on discrete emotion classification, neglecting a convincing analysis of the subtle dynamic movements and inherent emotional cues. The rapid progress in multimodal large language models (MLLMs), known for their strong multimodal comprehension and language generation abilities, offers new possibilities. MLLMs have shown success in various vision-language tasks, indicating their potential to understand MEs comprehensively, including both fine-grained motion patterns and underlying emotional semantics. Nevertheless, challenges remain due to the subtle intensity and short duration of MEs, as existing MLLMs are not designed to capture such delicate frame-level facial dynamics. In this paper, we propose a novel Micro-Expression Large Language Model (MELLM), which incorporates a subtle facial motion perception strategy with the strong inference capabilities of MLLMs, representing the first exploration of MLLMs in the domain of ME analysis. Specifically, to explicitly guide the MLLM toward motion-sensitive regions, we construct an interpretable motion-enhanced color map by fusing onset-apex optical flow dynamics with the corresponding grayscale onset frame as the model input. Additionally, specialized fine-tuning strategies are incorporated to further enhance the model's visual perception of MEs. Furthermore, we construct an instruction-description dataset based on Facial Action Coding System (FACS) annotations and emotion labels to train our MELLM. Comprehensive evaluations across multiple benchmark datasets demonstrate that our model exhibits superior robustness and generalization capabilities in ME understanding (MEU). Code is available at https://github.com/zyzhangUstc/MELLM.
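The abstract describes constructing an "interpretable motion-enhanced color map by fusing onset-apex optical flow dynamics with the corresponding grayscale onset frame." The exact fusion is not specified here, but a common convention is an HSV-style flow colorization (hue encodes motion direction, saturation encodes magnitude) alpha-blended with the grayscale frame. The sketch below assumes that convention with a precomputed flow field; the function names and the `alpha` blending weight are illustrative, not taken from the paper.

```python
import numpy as np

def hsv_to_rgb(h, s, v):
    """Vectorized HSV -> RGB conversion; h, s, v are arrays in [0, 1]."""
    i = np.floor(h * 6.0).astype(int) % 6
    f = h * 6.0 - np.floor(h * 6.0)
    p = v * (1.0 - s)
    q = v * (1.0 - f * s)
    t = v * (1.0 - (1.0 - f) * s)
    r = np.choose(i, [v, q, p, p, t, v])
    g = np.choose(i, [t, v, v, q, p, p])
    b = np.choose(i, [p, p, t, v, v, q])
    return np.stack([r, g, b], axis=-1)

def motion_enhanced_map(flow, gray_onset, alpha=0.5):
    """Fuse an onset->apex optical-flow field with the grayscale onset frame.

    flow:       (H, W, 2) displacement field (dx, dy), e.g. from a dense
                optical-flow estimator run between onset and apex frames.
    gray_onset: (H, W) uint8 grayscale onset frame.
    Returns an (H, W, 3) float image in [0, 1]: flow direction as hue,
    flow magnitude as saturation, blended with the onset frame.
    """
    mag = np.linalg.norm(flow, axis=-1)
    # Map direction to hue in [0, 1) and normalized magnitude to saturation.
    hue = (np.arctan2(flow[..., 1], flow[..., 0]) + np.pi) / (2.0 * np.pi)
    sat = np.clip(mag / (mag.max() + 1e-8), 0.0, 1.0)
    flow_rgb = hsv_to_rgb(hue, sat, np.ones_like(sat))
    gray = gray_onset[..., None].astype(np.float64) / 255.0
    return np.clip(alpha * flow_rgb + (1.0 - alpha) * gray, 0.0, 1.0)
```

Regions with little motion stay close to the grayscale onset frame, while moving regions take on a direction-dependent color, which is one plausible way to steer a vision-language model toward motion-sensitive facial areas as the paper describes.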
Problem

Research questions and friction points this paper is trying to address.

Enhancing micro-expression recognition via LLM-powered motion analysis
Addressing subtle facial dynamics in emotion classification tasks
Integrating multimodal LLMs for comprehensive micro-expression understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

MELLM integrates subtle motion perception with MLLMs
Uses motion-enhanced color map for sensitive regions
Specialized fine-tuning enhances ME visual perception
Zhengye Zhang
School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China
Sirui Zhao
University of Science and Technology of China
Affective Computing, MLLM, HCI
Shifeng Liu
School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, Anhui 230027, China
Shukang Yin
University of Science and Technology of China
Computer Vision, Multimodal Learning
Xinglong Mao
School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, Anhui 230027, China
Tong Xu
School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China
Enhong Chen
University of Science and Technology of China
data mining, recommender systems, machine learning