F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

📅 2025-10-02
🤖 AI Summary
Existing embedding models rely heavily on large-scale contrastive pretraining, complex training pipelines, and costly synthetic data generation. Method: This paper introduces F2LLM, a family of lightweight, efficient embedding models fine-tuned directly from open foundation LLMs, without contrastive pretraining or synthetic data, on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets. Contribution/Results: F2LLM reaches state-of-the-art performance with simple fine-tuning alone, substantially reducing training cost and pipeline complexity while offering three model scales (0.6B, 1.7B, and 4B parameters). On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models of comparable (~4B) size and 7th overall, and F2LLM-1.7B ranks 1st in the 1B-2B range. The released models, training data, and code establish a reproducible, low-cost, high-performance baseline for embedding research.

📝 Abstract
We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
Problem

Research questions and friction points this paper is trying to address.

Developing efficient embedding models with minimal training resources
Achieving top performance without costly synthetic training data
Providing reproducible baseline models for embedding research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directly finetunes foundation models for embeddings
Uses 6 million open-source, non-synthetic training tuples
Balances training cost, model size, and performance
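
The query-document-negative tuples mentioned above are typically trained with an InfoNCE-style contrastive loss, where the query embedding is pulled toward its matching document and pushed away from hard negatives. The sketch below illustrates that loss for a single tuple on toy NumPy vectors; the temperature value and the exact loss formulation are illustrative assumptions, not details taken from the report.

```python
# Hedged sketch of a query-document-negative contrastive objective
# (InfoNCE-style). The loss details are an assumption for illustration,
# not the report's exact recipe. Embeddings here are toy vectors.
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Contrastive loss for one (query, document, negatives) tuple.

    query:     (d,) embedding of the query
    positive:  (d,) embedding of the matching document
    negatives: (n, d) embeddings of hard-negative documents
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pos_sim = cos(query, positive) / temperature
    neg_sims = np.array([cos(query, n) for n in negatives]) / temperature
    # softmax cross-entropy with the positive document as the target class
    logits = np.concatenate([[pos_sim], neg_sims])
    return -pos_sim + np.log(np.exp(logits).sum())

# toy example: positive roughly aligned with the query, negatives orthogonal
q = np.array([1.0, 0.0, 0.0])
pos = np.array([0.9, 0.1, 0.0])
negs = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
loss = info_nce_loss(q, pos, negs)  # small, since the positive dominates
```

Minimizing this loss over millions of such tuples is what turns a foundation LLM's hidden states into task-useful embeddings; the quality of the mined hard negatives largely determines how discriminative the final model is.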