🤖 AI Summary
Problem: Current approaches to CPS software engineering rely heavily on unimodal large language models, and so handle multi-source heterogeneous data poorly while neglecting cross-modal synergy and physical-world interpretability.

Method: This work proposes the first systematic roadmap for integrating multimodal foundation models (FMs), including vision-language, audio-text, and cross-modal fusion models, into CPS development. It establishes a unified research framework targeting three core challenges across the CPS lifecycle (requirements analysis, modeling, verification): cross-modal alignment, physics-grounded interpretability, and real-time constraints. Domain knowledge injection and model/physical-system co-verification techniques are used to analyze the remaining technical gaps.

Contribution/Results: The study identifies six key research directions and five categories of cross-cutting challenges, providing both theoretical foundations and practical paradigms for developing trustworthy, intelligent, industrial-grade CPS.
📝 Abstract
Foundation Models (FMs), particularly Large Language Models (LLMs), are increasingly used to support various software engineering activities (e.g., coding and testing). Their applications in the software engineering of Cyber-Physical Systems (CPSs) are also growing; however, research in this area remains limited. Moreover, existing studies have focused primarily on LLMs, which are only one type of FM, leaving ample opportunities to explore others, such as vision-language models. We argue that, in addition to LLMs, other FMs utilizing different data modalities (e.g., images, audio) and multimodal models (which integrate multiple modalities) hold great potential for supporting CPS software engineering, given that these systems process diverse data types. To address this gap, we present a research roadmap for integrating FMs into the various phases of CPS software engineering, highlighting key research opportunities and challenges for the software engineering community.