Fine-tuning Language Models for Recipe Generation: A Comparative Analysis and Benchmark Study

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses open-domain recipe generation by systematically fine-tuning lightweight language models—including T5-small, SmolLM-360M/1.7B, and Phi-2—and introducing the first comprehensive evaluation framework tailored to recipe generation. Methodologically, the framework integrates standard NLG metrics (BLEU, ROUGE) with domain-specific dimensions: ingredient consistency, step executability, and allergen substitution feasibility—a novel safety-oriented criterion. Experimental results demonstrate that SmolLM variants achieve the optimal trade-off between performance and parameter count, significantly outperforming Phi-2; notably, model scale exhibits a non-monotonic relationship with recipe practicality. The proposed framework substantially enhances both the utility and safety of generated recipes, empirically validating the viability of compact models for specialized natural language generation tasks.
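The hybrid evaluation described above (standard surface-overlap metrics combined with domain-specific checks such as ingredient consistency) can be sketched as a toy scoring function. The function names, the unigram-F1 stand-in for BLEU/ROUGE, and the weighting scheme below are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter

def unigram_f1(reference: str, generated: str) -> float:
    """Toy surface-overlap score (a crude stand-in for BLEU/ROUGE): unigram F1."""
    ref, gen = Counter(reference.lower().split()), Counter(generated.lower().split())
    overlap = sum((ref & gen).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def ingredient_consistency(ingredients: list[str], steps: list[str]) -> float:
    """Fraction of listed ingredients that are actually mentioned in the steps."""
    text = " ".join(steps).lower()
    used = sum(1 for ing in ingredients if ing.lower() in text)
    return used / len(ingredients) if ingredients else 0.0

def recipe_score(reference: str, generated: str,
                 ingredients: list[str], steps: list[str],
                 w: float = 0.5) -> float:
    """Blend surface overlap with a domain-specific check (weight w is arbitrary)."""
    return (w * unigram_f1(reference, generated)
            + (1 - w) * ingredient_consistency(ingredients, steps))
```

A fuller version along the paper's lines would add further dimensions (step executability, allergen-substitution feasibility) as extra weighted terms.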

📝 Abstract
This research presents an exploration and study of the recipe generation task by fine-tuning various small language models, with a focus on developing robust evaluation metrics and comparing performance across different language models on this open-ended task. The study presents extensive experiments with multiple model architectures, ranging from T5-small (Raffel et al., 2023) and SmolLM-135M (Allal et al., 2024) to Phi-2 (Research, 2023), implementing both traditional NLP metrics and custom domain-specific evaluation metrics. Our novel evaluation framework incorporates recipe-specific metrics for assessing content quality and introduces an approach to allergen substitution. The results indicate that, while larger models generally perform better on standard metrics, the relationship between model size and recipe quality is more nuanced when considering domain-specific metrics. We find that SmolLM-360M and SmolLM-1.7B demonstrate comparable performance despite their size difference, while Phi-2 shows limitations in recipe generation despite its larger parameter count. Our comprehensive evaluation framework and allergen substitution system provide valuable insights for future work in recipe generation and broader NLG tasks that require domain expertise and safety considerations.
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning small language models for open-ended generation
Developing robust evaluation metrics for recipe quality
Comparing recipe generation across model scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning small language models
Developing domain-specific evaluation metrics
Introducing allergen substitution approach
Anneketh Vij
Arcee AI
Changhao Liu
Department of Computer Science, University of Southern California
Rahul Anil Nair
University of Southern California
Theo Ho
Department of Computer Science, University of Southern California
Edward Shi
Department of Computer Science, University of Southern California
Ayan Bhowmick
Department of Computer Science, University of Southern California