🤖 AI Summary
Existing embedding models rely heavily on large-scale contrastive pretraining, complex training pipelines, and costly synthetic data generation. Method: This paper introduces F2LLM, a family of lightweight, efficient embedding models fine-tuned directly from open foundation LLMs, without contrastive pretraining or synthetic data, on 6 million query-document-negative tuples curated from purely open-source, non-synthetic datasets. Contribution/Results: F2LLM achieves state-of-the-art performance with simple fine-tuning alone, drastically reducing training cost and pipeline complexity while offering flexible inference at three scales (0.6B, 1.7B, and 4B parameters). Empirically, F2LLM-4B ranks 2nd among models of comparable size (and 7th overall) on the MTEB English leaderboard, and F2LLM-1.7B ranks 1st in its parameter class. F2LLM thus establishes a reproducible, low-cost, high-performance baseline for embedding tasks.
📝 Abstract
We introduce F2LLM (Foundation to Feature Large Language Models), a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is fine-tuned directly from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.
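Training on query-document-negative tuples is typically done with an InfoNCE-style contrastive loss: the query embedding is pulled toward its positive document and pushed away from hard negatives. The sketch below illustrates that objective with plain numpy; the function name, temperature value, and toy data are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, temperature=0.05):
    """InfoNCE-style loss for one (query, positive, hard-negatives) tuple.

    q: (dim,) query embedding; d_pos: (dim,) positive document embedding;
    d_negs: (n_neg, dim) hard-negative embeddings. Hypothetical sketch --
    F2LLM's exact loss, batching, and temperature may differ.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, d_pos, d_negs = normalize(q), normalize(d_pos), normalize(d_negs)
    # Cosine similarities, positive first, scaled by temperature.
    sims = np.concatenate([[q @ d_pos], d_negs @ q]) / temperature
    # Cross-entropy with the positive at index 0 (log-sum-exp for stability).
    m = sims.max()
    return np.log(np.exp(sims - m).sum()) + m - sims[0]

# Toy check: an aligned positive should yield a lower loss than a random one.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
loss_easy = info_nce_loss(q, q + 0.01 * rng.normal(size=64),
                          rng.normal(size=(7, 64)))
loss_hard = info_nce_loss(q, rng.normal(size=64), rng.normal(size=(7, 64)))
print(loss_easy < loss_hard)
```

Minimizing this loss over many such tuples is what aligns queries with relevant documents in embedding space; in practice the embeddings come from the LLM's hidden states and batches share in-batch negatives as well.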