AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Facial Action Unit (AU) detection suffers from poor cross-domain generalization and overfitting due to high annotation costs and data scarcity. To address this, we propose the first vision-oriented Test-Time Training (TTT) backbone specifically designed for AU detection. Our method introduces three key innovations: (1) a novel bidirectional TTT block enabling self-supervised, iterative optimization during inference; (2) TTT Linear—a lightweight adaptation mechanism that enhances linear attention’s capacity to model sparse, localized AU activations; and (3) an AU-specific Region-of-Interest (RoI) scanning strategy for fine-grained facial region localization and feature extraction. Evaluated on multiple benchmark datasets, our approach achieves state-of-the-art performance in both in-domain and cross-domain AU detection, demonstrating significant improvements in model robustness and generalization without requiring additional labeled training data.

📝 Abstract
Facial Action Unit (AU) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. Despite its importance, AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. These constraints often lead to overfitting in existing methods, resulting in substantial performance degradation when applied across diverse datasets. Addressing these issues is essential for improving the reliability and generalizability of AU detection methods. Moreover, many current approaches leverage Transformers for their effectiveness in long-context modeling, but they are hindered by the quadratic complexity of self-attention. Recently, Test-Time Training (TTT) layers have emerged as a promising solution for long-sequence modeling. TTT also applies self-supervised learning for iterative updates during both training and inference, offering a potential pathway to mitigate the generalization challenges inherent in AU detection tasks. In this paper, we propose a novel vision backbone tailored for AU detection, incorporating bidirectional TTT blocks, named AU-TTT. Our approach introduces TTT Linear to the AU detection task and optimizes image scanning mechanisms for enhanced performance. Additionally, we design an AU-specific Region of Interest (RoI) scanning mechanism to capture fine-grained facial features critical for AU detection. Experimental results demonstrate that our method achieves competitive performance in both within-domain and cross-domain scenarios.
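The core idea behind TTT layers is that the layer's hidden state is itself a small model whose weights are updated by one self-supervised gradient step per token, at training time and at inference time alike. The toy numpy sketch below illustrates this mechanic and a bidirectional pass; it is not the paper's implementation, and the inner reconstruction loss, learning rate, and forward/backward combination rule are all assumptions for illustration.

```python
import numpy as np

def ttt_linear_scan(tokens, lr=0.1):
    """Toy TTT-Linear-style scan: the 'hidden state' is a weight matrix W,
    updated at every token by one gradient step on a self-supervised
    reconstruction loss ||W x - x||^2, then used to emit that token's output."""
    d = tokens.shape[1]
    W = np.eye(d) * 0.5              # inner model weights = the hidden state
    outputs, losses = [], []
    for x in tokens:
        err = W @ x - x              # self-supervised target: reconstruct x
        losses.append(float(err @ err))
        W -= lr * np.outer(err, x)   # one inner-loop gradient step on W
        outputs.append(W @ x)        # output with the freshly updated weights
    return np.stack(outputs), losses

def bidirectional_ttt(tokens, lr=0.1):
    """Scan the token sequence in both directions and sum the outputs,
    mimicking a bidirectional TTT block (the sum is an assumed fusion rule)."""
    fwd, _ = ttt_linear_scan(tokens, lr)
    bwd, _ = ttt_linear_scan(tokens[::-1], lr)
    return fwd + bwd[::-1]
```

Because each token triggers only one rank-1 weight update and one matrix-vector product, the cost per token is constant, which is how TTT layers sidestep the quadratic cost of self-attention over long sequences.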
Problem

Research questions and friction points this paper is trying to address.

High annotation costs and scarce datasets cause overfitting in AU detection
Self-attention's quadratic complexity limits Transformers' long-context efficiency
Generalization remains poor in cross-domain AU detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional TTT blocks in a vision backbone tailored for AU detection
TTT Linear with optimized image scanning mechanisms
AU-specific RoI scanning for fine-grained facial features
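The RoI scanning idea, at its simplest, is to crop fixed-size windows around AU-related facial landmarks on a feature map and feed the crops to the backbone as a token sequence. The sketch below is a hypothetical minimal version of that step; the window size, landmark centers, and flattening order are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def roi_scan(feature_map, centers, size=3):
    """Hypothetical AU-specific RoI scan: crop a size x size window around
    each AU-related landmark (row, col) on a C x H x W feature map, clamp
    windows to the map borders, and flatten each crop into one token."""
    C, H, W = feature_map.shape
    half = size // 2
    tokens = []
    for (r, c) in centers:
        r0 = int(np.clip(r - half, 0, H - size))   # keep window inside the map
        c0 = int(np.clip(c - half, 0, W - size))
        crop = feature_map[:, r0:r0 + size, c0:c0 + size]
        tokens.append(crop.reshape(-1))            # one token per RoI
    return np.stack(tokens)                        # (num_rois, C * size * size)
```

Scanning only landmark-centered windows keeps the token sequence short and focused on the sparse, localized regions where AU activations actually appear, rather than on the whole face.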
👥 Authors
Bohao Xing, Lappeenranta-Lahti University of Technology LUT, Finland (Emotion AI)
Kaishen Yuan, The Hong Kong University of Science and Technology (Guangzhou), China
Zitong Yu, U.S. Food and Drug Administration (Medical imaging, Deep learning, Machine learning, Image reconstruction)
Xin Liu, Lappeenranta-Lahti University of Technology LUT, Finland
Heikki Kalviainen, Lappeenranta-Lahti University of Technology LUT, Finland; Rensselaer Polytechnic Institute, USA; Brno University of Technology, Czech Republic