🤖 AI Summary
Existing general-purpose text-to-image models suffer from detail distortion, cultural feature omission, and low fidelity when generating Chinese cuisine imagery. To address these limitations, we propose Omni-Dish, the first diffusion-based generative model specifically designed for Chinese culinary visualization. Our method introduces four key innovations: (1) construction of the largest Chinese dish dataset to date, meticulously annotated with regional cuisine categories and refined via human recaptioning; (2) a coarse-to-fine two-stage training paradigm; (3) an LLM-driven prompt enhancement mechanism backed by a high-quality caption library; and (4) a Concept-Enhanced Prompt-to-Prompt (P2P) framework for local editing. Extensive experiments demonstrate that our model consistently outperforms state-of-the-art methods in fidelity, cultural accuracy, and editing controllability, with notable gains in multi-cuisine representation, complex plating composition, and fine-grained ingredient texture synthesis, enabling both photorealistic generation and semantically coherent local edits.
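To make the inference-time prompt enhancement concrete, the snippet below sketches one way a retrieval-plus-LLM rewriting step could be assembled: retrieve the closest exemplars from a curated caption library, then ask an LLM to rewrite the user's terse input in the same detailed style. This is a minimal sketch, not the paper's implementation; the `CAPTION_LIBRARY` entries, the string-similarity retrieval, the instruction template, and the `llm_complete` callable are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical caption library: a few high-quality reference captions of the
# kind a dataset-curation pipeline might produce (illustrative entries only).
CAPTION_LIBRARY = [
    "Mapo tofu in a shallow ceramic bowl, glossy chili-oil sauce, minced beef, "
    "scattered scallions, a dusting of Sichuan pepper, top-down view",
    "Cantonese steamed fish on an oval platter, shredded ginger and scallion, "
    "light soy drizzle, soft studio lighting",
]


def retrieve_similar(user_prompt: str, library: list[str], k: int = 1) -> list[str]:
    """Return the k library captions most similar to the user's prompt.
    A production system would use learned text embeddings; plain string
    similarity stands in here to keep the sketch dependency-free."""
    ranked = sorted(
        library,
        key=lambda cap: SequenceMatcher(None, user_prompt.lower(), cap.lower()).ratio(),
        reverse=True,
    )
    return ranked[:k]


def enhance_prompt(user_prompt: str, llm_complete) -> str:
    """Rewrite a terse user prompt into a detailed dish caption, conditioning
    the LLM on retrieved exemplars (few-shot style) while keeping the dish
    identity unchanged."""
    exemplars = retrieve_similar(user_prompt, CAPTION_LIBRARY)
    instruction = (
        "Rewrite the dish description below in the same detailed style as the "
        "examples, without changing which dish it is.\n\n"
        "Examples:\n"
        + "\n".join(f"- {c}" for c in exemplars)
        + f"\n\nDescription: {user_prompt}\nRewritten:"
    )
    return llm_complete(instruction)  # any text-completion callable works here
```

The structure, retrieving exemplars and conditioning the rewrite on them, is the point; the retrieval metric and LLM backend are interchangeable.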
📝 Abstract
Dish images play a crucial role in the digital era, and the demand for culturally distinctive dish images continues to grow with the digitization of the food industry and e-commerce. Existing text-to-image generation models excel at producing high-quality images in general cases; however, they struggle to capture the diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaptioning strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability to dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.
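Concept-Enhanced P2P extends Prompt-to-Prompt editing, whose core trick is to preserve the source image's layout by reusing its cross-attention maps and recomputing attention only for the tokens the edit changes. The sketch below illustrates that underlying attention-injection idea only; the row-wise mixing rule and the `edited_token_ids` interface are hypothetical simplifications, and the concept-enhancement component of the paper is not reproduced here.

```python
import numpy as np


def cross_attention(q, k, v):
    """Scaled dot-product cross-attention with q: (n_pixels, d) and
    k, v: (n_tokens, d). Returns both the output and the attention map,
    so the map can be cached during the source generation."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.T) * scale
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v, attn


def p2p_edit_attention(q, k_edit, v_edit, cached_attn, edited_token_ids):
    """Prompt-to-Prompt-style injection: keep the cached source attention
    for unchanged tokens (preserving spatial layout) and substitute freshly
    computed attention only at the edited concept's token positions."""
    _, new_attn = cross_attention(q, k_edit, v_edit)
    attn = cached_attn.copy()
    attn[:, edited_token_ids] = new_attn[:, edited_token_ids]
    attn /= attn.sum(-1, keepdims=True)  # renormalize rows after the swap
    return attn @ v_edit
```

In a full editing model, a controller like this would run inside every cross-attention layer of the denoising network at each diffusion step, which is what allows an edit such as swapping one ingredient to leave the rest of the plating untouched.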