DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification

📅 2025-10-17
🤖 AI Summary
Historical archival footage suffers from noise, frame loss, and low contrast, all of which severely degrade camera motion classification (CMC) performance. To address this, we introduce the first cross-era unified CMC benchmark and propose DGME-T, a Video Swin Transformer–based model that incorporates optical-flow-driven directional grid motion encoding (DGME) to explicitly model structured motion priors in degraded video. We also design a learnable, normalised late-fusion layer, providing the first empirical validation that structured motion priors and Transformer representations are complementary for cross-domain video analysis. A two-stage training strategy further improves domain adaptability. Experiments show that DGME-T achieves 86.14% top-1 accuracy (+4.36 percentage points over the baseline) and 87.81% macro F1 on modern videos, and 84.62% accuracy and 82.63% F1 on WWII archival footage, demonstrating substantially better cross-domain generalisation.
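
The summary describes DGME as a directional grid motion encoding computed from optical flow, but the exact formulation is not reproduced on this page. Below is a minimal sketch of one plausible reading, assuming Farnebäck optical flow from OpenCV and a magnitude-weighted histogram of flow directions per grid cell; the grid size and bin count are illustrative choices, not the paper's configuration.

```python
# Hypothetical sketch of a directional grid motion encoding (DGME).
# Assumes OpenCV Farneback optical flow on grayscale frames; the 4x4 grid
# and 8 direction bins are illustrative, not the paper's exact settings.
import cv2
import numpy as np

def dgme(prev_gray: np.ndarray, curr_gray: np.ndarray,
         grid: int = 4, bins: int = 8) -> np.ndarray:
    """Encode frame-pair motion as per-cell direction histograms."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Magnitude and angle (radians) of each flow vector.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    h, w = mag.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            ys = slice(gy * h // grid, (gy + 1) * h // grid)
            xs = slice(gx * w // grid, (gx + 1) * w // grid)
            # Magnitude-weighted histogram of flow directions in this cell.
            hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                   range=(0.0, 2 * np.pi),
                                   weights=mag[ys, xs])
            hist /= hist.sum() + 1e-6  # normalise so noisy cells don't dominate
            feats.append(hist)
    return np.concatenate(feats)  # shape: (grid * grid * bins,)
```

Per-cell normalisation is one way to keep film grain and flicker in a few cells from swamping the global motion pattern that distinguishes, say, a pan from a tilt.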

📝 Abstract
Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
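
The abstract specifies that DGME features are injected through a learnable, normalised late-fusion layer. The sketch below is a rough reading only: it assumes the Video Swin backbone emits clip-level logits and DGME produces a fixed-length vector, and the LayerNorm-plus-linear parameterisation with a scalar gate `alpha` is an assumption, not the published design.

```python
# Hypothetical sketch of a learnable, normalised late-fusion head.
# Assumes backbone clip logits plus a fixed-length DGME vector; the
# projection and scalar gate are illustrative, not the paper's exact layer.
import torch
import torch.nn as nn

class DGMELateFusion(nn.Module):
    def __init__(self, num_classes: int, dgme_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dgme_dim)            # normalise the motion prior
        self.motion_head = nn.Linear(dgme_dim, num_classes)
        self.alpha = nn.Parameter(torch.tensor(0.1))  # learnable fusion weight

    def forward(self, backbone_logits: torch.Tensor,
                dgme_feat: torch.Tensor) -> torch.Tensor:
        motion_logits = self.motion_head(self.norm(dgme_feat))
        # Late fusion: the backbone stays dominant, the motion prior nudges scores.
        return backbone_logits + self.alpha * motion_logits

# Usage with placeholder shapes (5 classes, 128-dim DGME vector):
fusion = DGMELateFusion(num_classes=5, dgme_dim=128)
logits = fusion(torch.randn(2, 5), torch.randn(2, 128))
```

Initialising the gate small lets the pretrained backbone dominate early in training while the motion branch learns how much to contribute.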
Problem

Research questions and friction points this paper is trying to address.

Classifying camera movements in noisy, degraded archival film
Improving motion analysis robustness for historical footage with missing frames
Enhancing transformer models with directional motion encoding for film analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directional grid motion encoding from optical flow
Lightweight extension to Video Swin Transformer
Learnable normalized late-fusion layer integration
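
Both the summary and the abstract mention a two-stage training strategy, with an intermediate fine-tuning pass on modern footage before adapting to the archival domain. A minimal sketch of such a schedule follows, assuming PyTorch; `model`, `modern_loader`, `archival_loader`, and the epoch and learning-rate choices are placeholders, not the paper's settings.

```python
# Hypothetical sketch of a two-stage fine-tuning schedule: stage 1 adapts
# the model on modern clips, stage 2 on archival footage. The model and
# dataloaders are placeholders; hyperparameters are illustrative.
import torch
import torch.nn as nn

def finetune(model, loader, epochs: int, lr: float, device: str = "cuda"):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(clips), labels)
            loss.backward()
            opt.step()

# Stage 1: intermediate fine-tuning on the modern benchmark.
# finetune(model, modern_loader, epochs=10, lr=1e-4)
# Stage 2: shorter, lower-learning-rate adaptation to WWII archival clips.
# finetune(model, archival_loader, epochs=5, lr=1e-5)
```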
Tingyu Lin
Computer Vision Lab, TU Wien, Vienna, Austria

Armin Dadras
Media Computing Group, UAS St. Pölten, St. Pölten, Austria; Computer Vision Lab, TU Wien, Vienna, Austria

Florian Kleber
TU Wien
Document Analysis · Computer Vision · Machine Learning

Robert Sablatnig
Prof. for Computer Vision, TU Wien
Document Analysis · Deep Learning · Multispectral Image Analysis · 3D Vision · Computer Vision for Cultural Heritage