Nomic Embed: Training a Reproducible Long Context Text Embedder

📅 2024-02-02
🏛️ arXiv.org
📈 Citations: 57
Influential: 7
🤖 AI Summary
Existing proprietary text embedding models (e.g., OpenAI's Ada-002 and text-embedding-3-small) suffer from poor reproducibility, limited transparency, and inadequate long-context modeling capabilities. Method: We introduce the first fully reproducible, open-source (Apache 2.0), open-weight, and open-data English text embedding model for long contexts. Built on the Contrastors contrastive learning framework, the approach uses a deterministic training pipeline with publicly released cleaned data, complete training code, and model weights. The model supports contexts of up to 8,192 tokens. Contribution/Results: It achieves state-of-the-art average performance on the short-context MTEB benchmark, surpassing Ada-002, and delivers significant gains on LoCo, a long-context retrieval benchmark. Crucially, this work establishes the first end-to-end open pipeline, from data curation and training to inference, enabling full transparency and reproducibility and laying a foundation for trustworthy embedding research.
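The summary above mentions that the model is trained with a contrastive learning framework. The standard objective in this family is an in-batch-negatives InfoNCE loss: each query's paired document is its positive, and the other documents in the batch serve as negatives. A minimal numpy sketch of that objective (illustrative only; the paper's actual training code lives in the Contrastors repository, and the `temperature` value here is a common default, not one taken from the paper):

```python
import numpy as np

def info_nce_loss(queries, docs, temperature=0.07):
    """In-batch-negatives InfoNCE loss over (query, document) pairs.

    The document at index i is query i's positive; all other documents
    in the batch act as negatives.
    """
    # L2-normalize so dot products equal cosine similarities
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    logits = q @ d.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(q))      # positives sit on the diagonal

    # Cross-entropy: -log softmax probability of the positive, averaged
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_random = info_nce_loss(q, rng.normal(size=(4, 8)))
loss_aligned = info_nce_loss(q, q)  # perfectly matched positives
print(loss_aligned < loss_random)   # aligned pairs yield a lower loss
```

Lowering the loss pulls each query toward its paired document and pushes it away from the rest of the batch, which is why large batch sizes matter in this training regime.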

📝 Abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
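The benchmarks named in the abstract (MTEB retrieval tasks, LoCo) ultimately score an embedder by nearest-neighbor search over its output vectors. A toy sketch of that consumption pattern, using hand-made 3-D vectors in place of real model outputs (the function name `top_k` and the vectors are illustrative; actual nomic-embed-text-v1 inference is provided in the linked Contrastors repository):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    """Rank documents by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                 # cosine similarity per document
    return np.argsort(-scores)[:k]  # indices of the k closest documents

# Toy 3-D "embeddings" standing in for model outputs
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0])
print(top_k(query, docs))  # → [0 1]
```

Long-context benchmarks like LoCo follow the same recipe; the difference is that each document vector must summarize up to 8,192 tokens rather than a short passage.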
Problem

Research questions and friction points this paper is trying to address.

Proprietary embedding models (Ada-002, text-embedding-3-small) are closed and cannot be reproduced.
Existing open models do not release their full training data, limiting transparency.
Most text embedders handle only short contexts, not long documents.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source long-context (8,192-token) text embedding model
Fully reproducible training pipeline with released curated data
Code and model weights released under an Apache 2.0 license