DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification

📅 2025-10-17
🤖 AI Summary
Historical archival footage suffers from noise, frame loss, and low contrast, all of which severely degrade camera motion classification (CMC) performance. To address this, we introduce the first cross-era unified CMC benchmark and propose DGME-T, a Video Swin Transformer–based model that incorporates optical-flow-driven directional grid motion encoding (DGME) to explicitly model structured motion priors in degraded video. We also design a learnable, normalised late-fusion layer, providing the first empirical validation that structured motion priors and Transformer representations are complementary for cross-domain video analysis. A two-stage training strategy further improves domain adaptability. Experiments show that DGME-T achieves 86.14% top-1 accuracy (+4.36 percentage points over the baseline) and 87.81% macro F1 on modern videos, and 84.62% accuracy and 82.63% F1 on WWII archival footage, demonstrating substantially better cross-domain generalisation.
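
The summary describes DGME as a directional grid motion encoding computed from optical flow, but the exact formulation is not reproduced on this page. Below is a minimal sketch of one plausible reading, assuming Farnebäck optical flow from OpenCV and a magnitude-weighted histogram of flow directions per grid cell; the grid size and bin count are illustrative choices, not the paper's configuration.

```python
# Hypothetical sketch of a directional grid motion encoding (DGME).
# Assumes OpenCV Farneback optical flow on grayscale frames; the 4x4 grid
# and 8 direction bins are illustrative, not the paper's exact settings.
import cv2
import numpy as np

def dgme(prev_gray: np.ndarray, curr_gray: np.ndarray,
         grid: int = 4, bins: int = 8) -> np.ndarray:
    """Encode frame-pair motion as per-cell direction histograms."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Magnitude and angle (radians) of each flow vector.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    h, w = mag.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            ys = slice(gy * h // grid, (gy + 1) * h // grid)
            xs = slice(gx * w // grid, (gx + 1) * w // grid)
            # Magnitude-weighted histogram of flow directions in this cell.
            hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                   range=(0.0, 2 * np.pi),
                                   weights=mag[ys, xs])
            hist /= hist.sum() + 1e-6  # normalise so noisy cells don't dominate
            feats.append(hist)
    return np.concatenate(feats)  # shape: (grid * grid * bins,)
```

Per-cell normalisation is one way to keep film grain and flicker in a few cells from swamping the global motion pattern that distinguishes, say, a pan from a tilt.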

📝 Abstract
Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
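
The abstract specifies that DGME features are injected through a learnable, normalised late-fusion layer. The sketch below is a rough reading only: it assumes the Video Swin backbone emits clip-level logits and DGME produces a fixed-length vector, and the LayerNorm-plus-linear parameterisation with a scalar gate `alpha` is an assumption, not the published design.

```python
# Hypothetical sketch of a learnable, normalised late-fusion head.
# Assumes backbone clip logits plus a fixed-length DGME vector; the
# projection and scalar gate are illustrative, not the paper's exact layer.
import torch
import torch.nn as nn

class DGMELateFusion(nn.Module):
    def __init__(self, num_classes: int, dgme_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dgme_dim)            # normalise the motion prior
        self.motion_head = nn.Linear(dgme_dim, num_classes)
        self.alpha = nn.Parameter(torch.tensor(0.1))  # learnable fusion weight

    def forward(self, backbone_logits: torch.Tensor,
                dgme_feat: torch.Tensor) -> torch.Tensor:
        motion_logits = self.motion_head(self.norm(dgme_feat))
        # Late fusion: the backbone stays dominant, the motion prior nudges scores.
        return backbone_logits + self.alpha * motion_logits

# Usage with placeholder shapes (5 classes, 128-dim DGME vector):
fusion = DGMELateFusion(num_classes=5, dgme_dim=128)
logits = fusion(torch.randn(2, 5), torch.randn(2, 128))
```

Initialising the gate small lets the pretrained backbone dominate early in training while the motion branch learns how much to contribute.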
Problem

Research questions and friction points this paper is trying to address.

Classifying camera movements in noisy, degraded archival film
Improving motion analysis robustness for historical footage with missing frames
Enhancing transformer models with directional motion encoding for film analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directional grid motion encoding from optical flow
Lightweight extension to Video Swin Transformer
Learnable normalized late-fusion layer integration
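
Both the summary and the abstract mention a two-stage training strategy, with an intermediate fine-tuning pass on modern footage before adapting to the archival domain. A minimal sketch of such a schedule follows, assuming PyTorch; `model`, `modern_loader`, `archival_loader`, and the epoch and learning-rate choices are placeholders, not the paper's settings.

```python
# Hypothetical sketch of a two-stage fine-tuning schedule: stage 1 adapts
# the model on modern clips, stage 2 on archival footage. The model and
# dataloaders are placeholders; hyperparameters are illustrative.
import torch
import torch.nn as nn

def finetune(model, loader, epochs: int, lr: float, device: str = "cuda"):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(clips), labels)
            loss.backward()
            opt.step()

# Stage 1: intermediate fine-tuning on the modern benchmark.
# finetune(model, modern_loader, epochs=10, lr=1e-4)
# Stage 2: shorter, lower-learning-rate adaptation to WWII archival clips.
# finetune(model, archival_loader, epochs=5, lr=1e-5)
```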
Tingyu Lin
Computer Vision Lab, TU Wien, Vienna, Austria

Armin Dadras
Media Computing Group, UAS St. Pölten, St. Pölten, Austria; Computer Vision Lab, TU Wien, Vienna, Austria

Florian Kleber
TU Wien
Document Analysis · Computer Vision · Machine Learning

Robert Sablatnig
Prof. for Computer Vision, TU Wien
Document Analysis · Deep Learning · Multispectral Image Analysis · 3D Vision · Computer Vision for Cultural Heritage