🤖 AI Summary
Task: This work addresses structured output generation -- specifically, synthesizing low-code workflow definitions in JSON format -- a task demanding strict syntactic validity, semantic fidelity, and logical consistency.
Method: We systematically compare fine-tuning small language models (SLMs) against prompting large language models (LLMs), proposing an SLM optimization framework that integrates supervised fine-tuning, structured prompt engineering, and JSON-constrained decoding. We further design a hybrid human-automated evaluation framework for rigorous assessment.
Contribution/Results: Experiments demonstrate that fine-tuned SLMs outperform prompted LLMs by 10% on average in structural quality (accuracy, field completeness, and logical consistency), achieve 3.2× faster inference, and reduce per-invocation cost by 76%. Fine-grained error analysis identifies persistent bottlenecks in nested-structure generation and adherence to semantic constraints. This study establishes a reproducible, cost-effective technical pathway for high-quality structured generation under resource constraints.
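The structural-quality criteria named above (syntactic validity, field completeness) lend themselves to automated checks in the evaluation framework. The sketch below is illustrative only: the paper's actual workflow schema and scoring rules are not reproduced here, so the field names (`trigger`, `steps`, `action`) are assumptions.

```python
import json
from jsonschema import Draft7Validator  # pip install jsonschema

# Hypothetical schema for a low-code workflow; the real schema used in the
# paper is not shown, so these required fields are illustrative assumptions.
WORKFLOW_SCHEMA = {
    "type": "object",
    "required": ["trigger", "steps"],
    "properties": {
        "trigger": {"type": "object", "required": ["type"]},
        "steps": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "object", "required": ["id", "action"]},
        },
    },
}

def structural_checks(model_output: str) -> dict:
    """Score one generated workflow: JSON validity and schema/field completeness."""
    try:
        workflow = json.loads(model_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "schema_errors": None}
    errors = list(Draft7Validator(WORKFLOW_SCHEMA).iter_errors(workflow))
    return {"valid_json": True, "schema_errors": len(errors)}

if __name__ == "__main__":
    sample = ('{"trigger": {"type": "record_created"}, '
              '"steps": [{"id": "s1", "action": "send_email"}]}')
    print(structural_checks(sample))  # {'valid_json': True, 'schema_errors': 0}
```

Checks like these cover only structure; semantic fidelity and logical consistency still require the human side of the hybrid evaluation.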
📝 Abstract
Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs fall, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications -- faster inference, lower costs -- may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still hold a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform a systematic error analysis to reveal model limitations.
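For the fine-tuning side of the comparison, each (natural-language request, workflow JSON) pair has to be serialized into a supervised training record. The sketch below shows one plausible chat-style formatting; the paper's exact prompt template and dataset layout are not given here, so the instruction text and field names are assumptions.

```python
import json

# Illustrative sketch only: the system prompt and record format below are
# hypothetical, not the paper's published fine-tuning template.
def to_sft_example(description: str, workflow: dict) -> dict:
    """Turn one (request, workflow JSON) pair into a chat-style SFT record."""
    return {
        "messages": [
            {"role": "system",
             "content": "You generate low-code workflow definitions as JSON only."},
            {"role": "user", "content": description},
            {"role": "assistant",
             "content": json.dumps(workflow, separators=(",", ":"))},
        ]
    }

example = to_sft_example(
    "When a new ticket is created, notify the on-call channel.",
    {"trigger": {"type": "ticket_created"},
     "steps": [{"id": "s1", "action": "post_message", "target": "on-call"}]},
)
print(json.dumps(example, indent=2))
```

Emitting the target JSON in compact form (no extra whitespace) keeps the assistant turn short, which matters for both training cost and per-invocation inference cost.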