🤖 AI Summary
Task: This work addresses structured output generation -- specifically, synthesizing low-code workflow definitions in JSON format -- a task demanding strict syntactic validity, semantic fidelity, and logical consistency.
Method: We systematically compare fine-tuning small language models (SLMs) against prompting large language models (LLMs), proposing an SLM optimization framework that integrates supervised fine-tuning, structured prompt engineering, and JSON-constrained decoding. We further design a hybrid human-automated evaluation framework for rigorous assessment.
Contribution/Results: Experiments demonstrate that fine-tuned SLMs outperform prompted LLMs by 10% on average in structural quality (accuracy, field completeness, and logical consistency), achieve 3.2× faster inference, and reduce per-invocation cost by 76%. Fine-grained error analysis identifies persistent bottlenecks in nested-structure generation and adherence to semantic constraints. This study establishes a reproducible, cost-effective technical pathway for high-quality structured generation under resource constraints.
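The structural-quality criteria named above (syntactic validity, field completeness) lend themselves to automated checks in the evaluation framework. The sketch below is illustrative only: the paper's actual workflow schema and scoring rules are not reproduced here, so the field names (`trigger`, `steps`, `action`) are assumptions.

```python
import json
from jsonschema import Draft7Validator  # pip install jsonschema

# Hypothetical schema for a low-code workflow; the real schema used in the
# paper is not shown, so these required fields are illustrative assumptions.
WORKFLOW_SCHEMA = {
    "type": "object",
    "required": ["trigger", "steps"],
    "properties": {
        "trigger": {"type": "object", "required": ["type"]},
        "steps": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "object", "required": ["id", "action"]},
        },
    },
}

def structural_checks(model_output: str) -> dict:
    """Score one generated workflow: JSON validity and schema/field completeness."""
    try:
        workflow = json.loads(model_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "schema_errors": None}
    errors = list(Draft7Validator(WORKFLOW_SCHEMA).iter_errors(workflow))
    return {"valid_json": True, "schema_errors": len(errors)}

if __name__ == "__main__":
    sample = ('{"trigger": {"type": "record_created"}, '
              '"steps": [{"id": "s1", "action": "send_email"}]}')
    print(structural_checks(sample))  # {'valid_json': True, 'schema_errors': 0}
```

Checks like these cover only structure; semantic fidelity and logical consistency still require the human side of the hybrid evaluation.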
📝 Abstract
Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs fall, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications -- faster inference, lower costs -- may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still hold a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform a systematic error analysis to reveal model limitations.
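For the fine-tuning side of the comparison, each (natural-language request, workflow JSON) pair has to be serialized into a supervised training record. The sketch below shows one plausible chat-style formatting; the paper's exact prompt template and dataset layout are not given here, so the instruction text and field names are assumptions.

```python
import json

# Illustrative sketch only: the system prompt and record format below are
# hypothetical, not the paper's published fine-tuning template.
def to_sft_example(description: str, workflow: dict) -> dict:
    """Turn one (request, workflow JSON) pair into a chat-style SFT record."""
    return {
        "messages": [
            {"role": "system",
             "content": "You generate low-code workflow definitions as JSON only."},
            {"role": "user", "content": description},
            {"role": "assistant",
             "content": json.dumps(workflow, separators=(",", ":"))},
        ]
    }

example = to_sft_example(
    "When a new ticket is created, notify the on-call channel.",
    {"trigger": {"type": "ticket_created"},
     "steps": [{"id": "s1", "action": "post_message", "target": "on-call"}]},
)
print(json.dumps(example, indent=2))
```

Emitting the target JSON in compact form (no extra whitespace) keeps the assistant turn short, which matters for both training cost and per-invocation inference cost.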