🤖 AI Summary
This study systematically evaluates the cross-domain transferability (ImageNet → cultural heritage) of six mainstream architectures (VGG, ResNet, DenseNet, Vision Transformer, Swin Transformer, and PoolFormer) for cultural heritage image analysis. Adopting a unified pretraining and fine-tuning framework with standardized data augmentation and cross-dataset evaluation protocols, it establishes, for the first time, a comparable benchmark in this domain. To assess practical deployment viability, the work introduces the "efficiency-computation ratio" as a core metric, jointly quantifying classification accuracy, GPU memory footprint, and inference latency. Experimental results show that DenseNet achieves the best overall trade-off: it maintains high classification accuracy while reducing GPU memory consumption by 42% and accelerating inference by 1.8× on average compared to ViT-based models. These findings provide principled guidance for model selection and establish a lightweight, transferable paradigm for intelligent cultural heritage analysis.
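The summary does not give the exact formula behind the efficiency-computation ratio, so the sketch below assumes a simple accuracy-per-resource definition (accuracy divided by the product of GPU memory and inference latency). The function name and the numeric inputs are illustrative, not measured values from the paper; the example only shows how the reported 42% memory saving and 1.8× speedup would compound under such a metric at equal accuracy.

```python
# Hypothetical "efficiency-computation ratio": the paper's exact formula is not
# given in the summary, so this assumes accuracy per unit of resource cost.

def efficiency_computation_ratio(accuracy: float, mem_gb: float, latency_ms: float) -> float:
    """Accuracy divided by the memory-latency product (higher is better)."""
    return accuracy / (mem_gb * latency_ms)

# Illustrative (not measured) baseline numbers for a ViT-style model.
vit_ratio = efficiency_computation_ratio(accuracy=0.90, mem_gb=10.0, latency_ms=18.0)

# DenseNet per the summary: 42% less GPU memory, 1.8x faster inference,
# comparable accuracy.
densenet_ratio = efficiency_computation_ratio(
    accuracy=0.90, mem_gb=10.0 * (1 - 0.42), latency_ms=18.0 / 1.8
)

print(round(densenet_ratio / vit_ratio, 2))  # → 3.1
```

Under this assumed formulation, the memory and latency savings alone multiply into roughly a 3× gain in the ratio, which illustrates why a joint metric can rank models differently than accuracy alone.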
📝 Abstract
The integration of computer vision and deep learning is essential for documenting and preserving cultural heritage, as well as for improving visitor experiences. In recent years, two deep learning paradigms have become established in computer vision: convolutional neural networks and transformer architectures. The present study offers a comparative analysis of representative models from these two paradigms with respect to their ability to transfer knowledge from a generic dataset, such as ImageNet, to cultural-heritage-specific tasks. Tests of the architectures VGG, ResNet, DenseNet, Vision Transformer, Swin Transformer, and PoolFormer showed that DenseNet offers the best efficiency-computation ratio.