Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

📅 2025-05-28
🤖 AI Summary
To address the extreme scarcity of labeled samples in rare wildlife image classification, this paper proposes a hybrid architecture that combines frequency-domain and spatial-domain processing. The method introduces a learnable adaptive DCT band-partitioning mechanism that learns the boundaries between low-, mid-, and high-frequency components instead of fixing them in advance. According to the authors, it is the first to jointly model DCT frequency-domain representations with two spatial backbones: ViT-B16 for global contextual modeling and ResNet50 for local feature extraction. The two streams are integrated via cross-level feature fusion, and a Bayesian linear classifier serves as the head to improve few-shot generalization. In experiments under extreme few-shot settings (e.g., 1–5 samples per class) on a newly constructed 50-class wildlife dataset, the authors report that the approach outperforms conventional CNNs and fixed-band DCT baselines and achieves state-of-the-art (SOTA) classification accuracy.

📝 Abstract
A major challenge in rare animal image classification is data scarcity: many species have only a small number of labeled samples. To address this challenge, we design a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures frequency-domain cues from the image via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to output the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.
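
The adaptive DCT partitioning described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the soft sigmoid masks, the boundary parameters `b_low` and `b_high`, the `sharpness` value, and the normalized frequency index `(u + v) / (2(n - 1))` are all assumptions introduced here. In a trainable version, `b_low` and `b_high` would be parameters updated by gradient descent; here they are plain floats, and the input is assumed to be a square grayscale image.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row has its own normalization
    return mat

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def soft_band_masks(n: int, b_low: float, b_high: float, sharpness: float = 20.0):
    """Soft low/mid/high masks over a normalized frequency index in [0, 1].

    b_low < b_high are the (hypothetical) learnable boundaries. The sigmoid
    edges keep the masks differentiable with respect to the boundaries.
    """
    u = np.arange(n)[:, None]
    v = np.arange(n)[None, :]
    r = (u + v) / (2.0 * (n - 1))  # 0 at DC, 1 at the highest frequency
    low = sigmoid(sharpness * (b_low - r))
    high = sigmoid(sharpness * (r - b_high))
    mid = 1.0 - low - high  # the three masks sum to one everywhere
    return low, mid, high

def split_bands(image: np.ndarray, b_low: float = 0.15, b_high: float = 0.5):
    """Split a square image into low-, mid-, and high-frequency components."""
    n = image.shape[0]
    d = dct_matrix(n)
    coeffs = d @ image @ d.T  # 2-D DCT
    masks = soft_band_masks(n, b_low, b_high)
    # Inverse DCT of each masked spectrum gives one spatial-domain band image.
    return [d.T @ (coeffs * m) @ d for m in masks]

rng = np.random.default_rng(0)
img = rng.random((8, 8))
low, mid, high = split_bands(img)
```

Because the masks sum to one and the transform is orthonormal, the three band images add back up to the input, so moving a boundary only redistributes energy between bands.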

Problem

Research questions and friction points this paper is trying to address.

Addresses rare animal image classification with scarce labeled data
Introduces adaptive frequency-domain selection for optimal feature extraction
Combines ViT and ResNet for global-local feature fusion in sparse-data scenarios
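
One simple way to picture the global-local fusion is to project both embeddings to a shared width, normalize them, and concatenate. The sketch below is only an assumption about what such a fusion might look like, since the summary does not specify the cross-level strategy. The random arrays stand in for real backbone outputs (768 is the ViT-B/16 hidden size, 2048 the ResNet50 pooled feature size), and the projection matrices `w_f`, `w_s` and the shared width of 256 are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row to unit length so neither stream dominates by magnitude."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse(freq_emb, spatial_emb, w_freq, w_spatial):
    """Project both embeddings to a shared width, normalize, and concatenate."""
    f = l2_normalize(freq_emb @ w_freq)
    s = l2_normalize(spatial_emb @ w_spatial)
    return np.concatenate([f, s], axis=-1)

rng = np.random.default_rng(0)
batch, d_vit, d_res, d_shared = 4, 768, 2048, 256
vit_emb = rng.normal(size=(batch, d_vit))  # stand-in for the ViT-B16 embedding
res_emb = rng.normal(size=(batch, d_res))  # stand-in for ResNet50 pooled features
w_f = rng.normal(size=(d_vit, d_shared)) / np.sqrt(d_vit)  # hypothetical projection
w_s = rng.normal(size=(d_res, d_shared)) / np.sqrt(d_res)  # hypothetical projection
fused = fuse(vit_emb, res_emb, w_f, w_s)  # shape (4, 512)
```

Normalizing before concatenation is a design choice made here for illustration; it keeps the frequency and spatial halves on a comparable scale for a linear head.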

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive DCT preprocessing for frequency selection
Integration of hybrid ViT-B16 and ResNet50 backbones
Cross-level fusion of frequency and spatial features
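
The fused features are described as feeding a Bayesian linear classifier. The summary does not say which formulation is used, so the sketch below substitutes a standard closed-form stand-in: Bayesian linear regression onto one-hot labels with a Gaussian prior. The prior precision `alpha`, the noise precision `beta`, and the toy 2-D features are assumptions chosen for illustration only.

```python
import numpy as np

def fit_bayes_linear(x, y_onehot, alpha=1.0, beta=10.0):
    """Closed-form weight posterior for prior N(0, I / alpha) and noise precision beta."""
    d = x.shape[1]
    precision = alpha * np.eye(d) + beta * x.T @ x
    cov = np.linalg.inv(precision)
    mean = beta * cov @ x.T @ y_onehot  # shape (features, classes)
    return mean, cov

def predict(x, mean, cov, beta=10.0):
    """Return the arg-max class and the predictive variance for each row of x."""
    scores = x @ mean
    var = 1.0 / beta + np.einsum("nd,de,ne->n", x, cov, x)
    return scores.argmax(axis=1), var

# Toy few-shot setting: two classes, two labeled samples each.
x_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_train = np.eye(2)[[0, 0, 1, 1]]  # one-hot labels
mean, cov = fit_bayes_linear(x_train, y_train)

x_test = np.array([[1.0, 0.05], [0.05, 1.0], [5.0, 5.0]])
labels, var = predict(x_test, mean, cov)
```

The predictive variance grows for inputs far from the training samples (the third test point here), which is the kind of uncertainty signal that makes a Bayesian head attractive when only a handful of labeled examples exist per class.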