🤖 AI Summary
This study challenges the prevailing paradigm that model performance scales primarily with data volume, proposing a "low-data, high-precision fine-tuning" approach. Specifically, it leverages only 2,000 high-quality English–French bilingual mathematical reasoning samples, curated via rigorous filtering, and applies supervised fine-tuning (SFT) with domain-aligned training to jointly enhance French language proficiency and mathematical reasoning capability. The resulting Pensez 7B model achieves gains of up to 20 percentage points in AIME25 accuracy and 12 percentage points in French MATH Level 5 accuracy, substantially outperforming both the base model and models of comparable size. To our knowledge, this is the first empirical demonstration that a small (on the order of thousands of samples), high-fidelity bilingual domain dataset can jointly enhance multilingual competence and specialized reasoning ability. The work establishes a reproducible, resource-efficient fine-tuning paradigm for low-resource multilingual and domain-specific AI applications.
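To make the setup concrete, the sketch below shows what a small-scale SFT run of this kind could look like with Hugging Face TRL. It is a minimal illustration under stated assumptions, not the authors' released training code: the dataset identifier, base model name, and all hyperparameters are placeholders.

```python
# Hypothetical sketch of supervised fine-tuning (SFT) on a small curated dataset
# using Hugging Face TRL (recent versions expose SFTConfig/SFTTrainer as below).
# "my-org/bilingual-math-2k" and the hyperparameters are placeholders, not the
# paper's actual dataset or configuration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A small curated set (~2,000 bilingual reasoning traces); SFTTrainer expects a
# "text" or "messages" column in a chat/completion format it understands.
dataset = load_dataset("my-org/bilingual-math-2k", split="train")

config = SFTConfig(
    output_dir="pensez-sft",           # where checkpoints are written
    num_train_epochs=3,                # a few epochs suffice for a 2k-sample set
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,     # effective batch size of 16
    learning_rate=1e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # example 7B base; the paper's base model may differ
    args=config,
    train_dataset=dataset,
)
trainer.train()
```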
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, achieving strong performance in specialized domains like mathematical reasoning and non-English languages often requires extensive training on massive datasets. This paper investigates a contrasting approach: strategic fine-tuning on a small, high-quality, bilingual (English-French) dataset to enhance both the reasoning capabilities and French language proficiency of a large language model. Rather than relying on scale, we explore the hypothesis that targeted data curation and optimized training can achieve competitive, or even superior, performance. We demonstrate, through targeted supervised fine-tuning (SFT) on only 2,000 carefully selected samples, significant improvements in mathematical reasoning. Specifically, Pensez 7B exhibits an increase in accuracy over the base model of up to 20% on AIME25 and a 12% increase on a French MATH Level 5 benchmark. These results challenge the prevailing assumption that massive datasets are a prerequisite for strong reasoning performance in LLMs, highlighting the potential of strategic data curation and optimized fine-tuning for enhancing both specialized skills and multilingual capabilities. Our findings have implications for the efficient development of high-performing, multilingual LLMs, especially in resource-constrained scenarios.