🤖 AI Summary
Existing hashing methods rely on expensive, task-specific training, hindering rapid and scalable retrieval. This paper proposes a strong training-free baseline for multimodal hashing: it freezes pre-trained vision and audio encoders to extract features, applies principal component analysis (PCA) for dimensionality reduction, enhances discriminability via random orthogonal projection, and generates compact binary hash codes through threshold-based binarization. Crucially, the method entirely avoids fine-tuning, achieving both computational efficiency and broad cross-domain applicability. Evaluated on standard image retrieval benchmarks and a newly constructed audio hashing benchmark, it attains competitive performance, demonstrating that classical unsupervised techniques retain significant potential in the era of large-scale pre-training. This approach establishes a lightweight, plug-and-play paradigm for cross-modal hashing.
📝 Abstract
Information retrieval with compact binary embeddings, also referred to as hashing, is crucial for scalable fast-search applications, yet state-of-the-art hashing methods require expensive, scenario-specific training. In this work, we introduce Hashing-Baseline, a strong training-free hashing method that leverages the rich embeddings produced by powerful pretrained encoders. We revisit classical, training-free hashing techniques (principal component analysis, random orthogonal projection, and threshold binarization) to produce a strong baseline for hashing. Our approach combines these techniques with frozen embeddings from state-of-the-art vision and audio encoders to yield competitive retrieval performance without any additional learning or fine-tuning. To demonstrate the generality and effectiveness of this approach, we evaluate it on standard image retrieval benchmarks as well as a newly introduced benchmark for audio hashing.
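The three classical steps named above (PCA, random orthogonal projection, threshold binarization) compose into a short pipeline. A minimal NumPy sketch follows; the feature matrix, bit count, and random seed are illustrative stand-ins, not the paper's exact configuration.

```python
import numpy as np

# Stand-in for frozen-encoder embeddings: 1000 items, 512-dim features.
# In the actual method these would come from a pretrained vision or
# audio encoder with no fine-tuning.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))
n_bits = 64  # illustrative code length

# 1) PCA: center the data and project onto the top n_bits principal
#    components (right singular vectors of the centered matrix).
mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:n_bits].T

# 2) Random orthogonal projection: a QR-based random rotation that
#    spreads variance more evenly across the bit dimensions.
Q, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
Zr = Z @ Q

# 3) Threshold binarization at zero (valid since the data are centered),
#    yielding compact binary hash codes.
codes = (Zr > 0).astype(np.uint8)

print(codes.shape)  # one n_bits-length binary code per item
```

At query time, the same mean, `Vt`, and `Q` would be applied to a query embedding, and retrieval reduces to Hamming-distance comparison between binary codes; no step involves gradient-based learning.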