Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-source multimodal large language models (MLLMs) significantly underperform proprietary counterparts, largely due to noisy supervised fine-tuning (SFT) data and a severe shortage of complex reasoning examples, particularly chain-of-thought (CoT) annotations. Method: The authors construct Honey-Data-15M, a high-quality vision-language SFT dataset of roughly 15 million QA pairs, produced through multi-stage cleaning and a dual-level (short and long) CoT enrichment strategy. They also release HoneyPipe, the data curation pipeline, and DataStudio, the underlying framework, giving the community a transparent, reusable methodology rather than a static dataset drop, along with training recipes and an evaluation harness. Contribution/Results: Bee-8B, an 8B-parameter model trained on Honey-Data-15M, establishes a new state of the art among fully open MLLMs and is competitive with, and on some benchmarks surpasses, the semi-open InternVL3.5-8B. The dataset, pipeline, training recipes, evaluation harness, and model weights are all publicly released.
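The paper does not include implementation details for the curation pipeline, but the two stages it names (multi-stage cleaning, then dual-level CoT enrichment) can be sketched as follows. This is a minimal illustrative sketch, not the actual HoneyPipe code: the `Sample` type, the `is_clean` filter, the `needs_long_cot` heuristic, and the placeholder rationale string are all hypothetical; a real pipeline would call an annotator model to generate the CoT text.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One vision-language QA pair (image reference omitted for brevity)."""
    question: str
    answer: str
    cot: str = ""  # filled in by the enrichment stage

def is_clean(sample: Sample) -> bool:
    # Hypothetical rule-based filter: drop empty or degenerate QA pairs.
    # A real cleaning stage would also deduplicate, check image-text
    # alignment, and filter with model-based quality scores.
    return bool(sample.question.strip()) and bool(sample.answer.strip())

def needs_long_cot(sample: Sample) -> bool:
    # Hypothetical complexity heuristic: route harder (here: longer)
    # questions to long-CoT enrichment, the rest to short CoT.
    return len(sample.question.split()) > 15

def enrich(sample: Sample) -> Sample:
    # Placeholder for a model-generated rationale; "short" vs "long"
    # is the dual-level routing decision.
    depth = "long" if needs_long_cot(sample) else "short"
    sample.cot = f"[{depth}-CoT rationale for: {sample.question[:40]}]"
    return sample

def curate(samples: list[Sample]) -> list[Sample]:
    # Stage 1: cleaning; Stage 2: dual-level CoT routing and enrichment.
    return [enrich(s) for s in samples if is_clean(s)]
```

The point of the sketch is the shape of the pipeline: filtering and enrichment are independent, composable stages, which is what makes the methodology reusable on other corpora.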

📝 Abstract
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
Problem

Research questions and friction points this paper is trying to address.

Closing the SFT data-quality gap between fully open and proprietary MLLMs
Supplying a cleaned SFT corpus enriched with dual-level (short and long) CoT reasoning
Building competitive fully open models through principled, transparent data curation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created Honey-Data-15M dataset with dual-level CoT enrichment
Developed HoneyPipe pipeline and DataStudio framework for curation
Trained Bee-8B model achieving state-of-the-art fully open performance