🤖 AI Summary
Deploying embedding models in resource-constrained environments (e.g., browsers, edge devices) involves a fundamental trade-off between deployment feasibility and performance: lightweight models have weak representational capacity, while compactly deploying large models remains challenging.
Method: We propose Concat-Encode-Quantize, a framework that concatenates the raw embeddings of multiple lightweight base models, reduces the joint representation with a small shared decoder, and trains the decoder end-to-end with a Matryoshka Representation Learning (MRL) loss, all without fine-tuning any base model.
Contribution/Results: The framework substantially improves representation robustness under aggressive compression and quantization. On the MTEB retrieval subset, a four-model ensemble retains 89% of the original performance at 48× compression, clearly outperforming a single large model of comparable total parameter count. The result is an efficient, robust, and production-ready embedding solution for dense retrieval and semantic search.
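The three stages named by the framework can be sketched end to end. This is an illustrative mock-up, not the paper's implementation: the base-model dimensions (four models at 384 dims each), the 128-dim target, the linear decoder, and int8 quantization are all assumptions chosen so the sizes work out to the reported 48× compression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: four small base models, each emitting a 384-dim
# float32 embedding (dimensions chosen for illustration only).
n_models, base_dim, target_dim = 4, 384, 128
concat_dim = n_models * base_dim  # 1536

def embed(text: str) -> np.ndarray:
    """Stand-in for running all base models and concatenating raw outputs."""
    vecs = [rng.standard_normal(base_dim, dtype=np.float32)
            for _ in range(n_models)]
    return np.concatenate(vecs)  # shape (1536,)

# Lightweight shared decoder: here a single linear projection. In the
# framework its weights are trained end-to-end with an MRL loss; random here.
W = (rng.standard_normal((concat_dim, target_dim)).astype(np.float32)
     / np.sqrt(concat_dim))

def encode(x: np.ndarray) -> np.ndarray:
    z = x @ W
    return z / np.linalg.norm(z)

def quantize_int8(z: np.ndarray):
    """Symmetric per-vector int8 quantization (one possible scheme)."""
    scale = np.abs(z).max() / 127.0
    return np.round(z / scale).astype(np.int8), scale

x = embed("example query")       # concat
z = encode(x)                    # encode
q, scale = quantize_int8(z)      # quantize

# Compression: 1536 float32 values -> 128 int8 values.
factor = (concat_dim * 4) / (target_dim * 1)
print(factor)  # 48.0
```

Under these assumed sizes, storing 128 int8 values instead of 1536 float32 values is exactly the 48× compression factor the results quote.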
📝 Abstract
Embedding models are central to dense retrieval, semantic search, and recommendation systems, but their size often makes them impractical to deploy in resource-constrained environments such as browsers or edge devices. While smaller embedding models offer practical advantages, they typically underperform their larger counterparts. To bridge this gap, we demonstrate that concatenating the raw embedding vectors of multiple small models can outperform a single larger baseline on standard retrieval benchmarks. To overcome the high dimensionality of naive concatenation, we introduce a lightweight unified decoder trained with a Matryoshka Representation Learning (MRL) loss. This decoder maps the high-dimensional joint representation to a low-dimensional space, preserving most of the original performance without fine-tuning the base models. We also show that while concatenating more base models yields diminishing gains, the robustness of the decoder's representation under compression and quantization improves. Our experiments show that, on a subset of MTEB retrieval tasks, our concat-encode-quantize pipeline recovers 89% of the original performance at a 48x compression factor when applied to a concatenation of four small embedding models.
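The MRL loss mentioned above applies the same training objective to nested prefixes of the decoder output, so that truncated embeddings remain usable. A minimal sketch of this idea, assuming an InfoNCE-style contrastive objective and illustrative prefix lengths (the abstract does not specify either):

```python
import numpy as np

def info_nce(q: np.ndarray, d: np.ndarray, temp: float = 0.05) -> float:
    """Contrastive loss: q, d are (batch, dim) L2-normalized embeddings,
    with positive pairs on the diagonal of the similarity matrix."""
    logits = (q @ d.T) / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))

def mrl_loss(q: np.ndarray, d: np.ndarray,
             prefix_dims=(32, 64, 128)) -> float:
    """Matryoshka-style loss: average the objective over nested prefixes
    of the embedding, so every truncation is trained to stay useful.
    Prefix lengths here are hypothetical."""
    total = 0.0
    for m in prefix_dims:
        qm = q[:, :m] / np.linalg.norm(q[:, :m], axis=1, keepdims=True)
        dm = d[:, :m] / np.linalg.norm(d[:, :m], axis=1, keepdims=True)
        total += info_nce(qm, dm)
    return total / len(prefix_dims)

rng = np.random.default_rng(0)
queries = rng.standard_normal((8, 128)).astype(np.float32)
# Documents as slightly perturbed queries, standing in for positive pairs.
docs = queries + 0.01 * rng.standard_normal((8, 128)).astype(np.float32)
loss = mrl_loss(queries, docs)
```

In the actual framework this loss would backpropagate only into the shared decoder, since the base models are kept frozen.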