Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

📅 2025-10-31
🤖 AI Summary
In the era of foundation models, large-scale image retrieval needs hash representations that are both compact and discriminative. To address this, we propose CroVCA, a cross-view code alignment framework that replaces multi-term objectives and complex pipelines with a single binary cross-entropy loss for aligning binary codes across views, regularized by coding-rate maximization to keep the codes balanced and diverse. Its hashing network, HashCoder, is a lightweight MLP ending in a batch normalization layer; it can be trained as a probing head on frozen backbone embeddings or used to adapt the encoder through LoRA fine-tuning. On standard benchmarks, CroVCA reaches state-of-the-art results within five training epochs: at 16 bits, unsupervised hashing on COCO takes under two minutes and supervised hashing on ImageNet-100 about three minutes on a single GPU.

📝 Abstract
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
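The objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that each view's bit probabilities are trained against the other view's hard bits, that the coding rate takes the log-determinant form known from the rate-reduction literature, and that it is computed on tanh-squashed logits. The names `bce_alignment`, `coding_rate`, `crovca_style_loss`, `eps`, and `reg_weight` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_alignment(logits_a, logits_b):
    """BCE between view A's bit probabilities and view B's hard bits.

    Assumption: view B's thresholded bits act as fixed targets.
    """
    probs = np.clip(sigmoid(logits_a), 1e-7, 1.0 - 1e-7)
    targets = (logits_b > 0).astype(np.float64)
    bce = targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs)
    return float(-np.mean(bce))

def coding_rate(z, eps=0.5):
    """Assumed form: 0.5 * logdet(I + d / (n * eps^2) * Z^T Z) for Z of shape (n, d)."""
    n, d = z.shape
    scaled_gram = (d / (n * eps ** 2)) * (z.T @ z)
    _, logdet = np.linalg.slogdet(np.eye(d) + scaled_gram)
    return 0.5 * float(logdet)

def crovca_style_loss(logits_a, logits_b, reg_weight=1.0):
    """Symmetric alignment term minus a weighted coding-rate term."""
    align = 0.5 * (bce_alignment(logits_a, logits_b)
                   + bce_alignment(logits_b, logits_a))
    # Assumption: the regularizer sees soft codes in (-1, 1).
    rate = 0.5 * (coding_rate(np.tanh(logits_a))
                  + coding_rate(np.tanh(logits_b)))
    return align - reg_weight * rate
```

Under these assumptions, a batch in which every image receives the same code has a low coding rate, so subtracting the rate term penalizes the collapsed solution that the alignment term alone would accept.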
Problem

Research questions and friction points this paper is trying to address.

Learning compact binary codes for efficient image retrieval
Aligning hash codes across semantically consistent image views
Overcoming computational complexity of high-dimensional foundation model embeddings
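To make the retrieval setting concrete, here is a minimal pure-Python sketch of Hamming-distance search over binary codes. It is a brute-force illustration, not the paper's retrieval pipeline; the helper names `pack_bits`, `hamming`, and `search` are hypothetical, and codes are stored as plain Python integers for simplicity.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer, first bit most significant."""
    code = 0
    for bit in bits:
        code = (code << 1) | (1 if bit else 0)
    return code

def hamming(code_a, code_b):
    """Number of bit positions where the two codes differ (XOR, then popcount)."""
    return bin(code_a ^ code_b).count("1")

def search(query, database, k=3):
    """Return indices of the k database codes closest to the query.

    sorted() is stable, so ties keep their original database order.
    """
    ranked = sorted(range(len(database)),
                    key=lambda i: hamming(query, database[i]))
    return ranked[:k]
```

For example, with `database = [pack_bits([1, 0, 1, 1]), pack_bits([0, 0, 0, 0]), pack_bits([1, 0, 1, 0])]` and `query = pack_bits([1, 0, 1, 1])`, `search(query, database, k=2)` returns `[0, 2]`. Each comparison is one XOR and one bit count, which is why short codes such as 16 bits are cheap to search compared with high-dimensional floating-point embeddings.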
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-View Code Alignment for binary code consistency
Single binary cross-entropy loss with coding-rate regularization
Lightweight HashCoder network enabling rapid frozen or LoRA training
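The role of the final batch normalization layer can be seen in a small NumPy forward pass. This is only a sketch under stated assumptions: the source does not specify the depth, width, or activation of HashCoder, so a single hidden layer with ReLU is assumed, the normalization uses batch statistics without learnable scale or shift, and the names `hashcoder_forward` and `binarize` are hypothetical.

```python
import numpy as np

def hashcoder_forward(embeddings, w1, b1, w2, b2, eps=1e-5):
    """Map (n, d_in) embeddings to (n, n_bits) logits.

    Assumed layout: linear -> ReLU -> linear -> batch normalization.
    """
    hidden = np.maximum(embeddings @ w1 + b1, 0.0)
    pre = hidden @ w2 + b2
    # Batch normalization with batch statistics: each bit's logits get
    # zero mean and unit variance across the batch.
    return (pre - pre.mean(axis=0)) / np.sqrt(pre.var(axis=0) + eps)

def binarize(logits):
    """Threshold logits at zero to obtain 0/1 hash bits."""
    return (logits > 0).astype(np.uint8)

# Toy usage: random stand-ins for frozen backbone embeddings and MLP weights.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(32, 64))
w1, b1 = rng.normal(size=(64, 128)) * 0.1, np.zeros(128)
w2, b2 = rng.normal(size=(128, 16)) * 0.1, np.zeros(16)
logits = hashcoder_forward(embeddings, w1, b1, w2, b2)
codes = binarize(logits)  # shape (32, 16), one 16-bit code per embedding
```

Because each bit's logits are centered at zero over the batch, thresholding at zero cannot assign the same value to every sample for that bit, which is one way to read the abstract's claim that the final batch normalization layer encourages balanced codes.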
Ilyass Moummad
Postdoctoral Researcher, Inria IROKO, Montpellier
Deep Learning, Computer Vision, Machine Listening
Kawtar Zaher
INRIA, LIRMM, Université de Montpellier, France; Institut National de l’Audiovisuel, France
Hervé Goëau
CIRAD, UMR AMAP, Montpellier, Occitanie, France
Alexis Joly
Research Director, Inria, Montpellier University, LIRMM
machine learning, biodiversity, information retrieval, plant identification