🤖 AI Summary
This work addresses key challenges in automated translation from natural language to Signal Temporal Logic (STL): atomic proposition errors, semantic distortion, and formula redundancy. We propose a multi-dimensional reward-guided reinforcement learning framework built upon a large language model (LLM) backbone and optimized end-to-end via the Proximal Policy Optimization (PPO) algorithm. Four specialized reward models are introduced to quantitatively assess atomic proposition consistency, semantic alignment, formula conciseness, and symbolic matching accuracy; a curriculum learning strategy further enhances reward signal quality. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches on both automatic metrics and human evaluation, yielding substantial improvements in STL formula accuracy, semantic fidelity, and readability.
📝 Abstract
Signal Temporal Logic (STL) is a powerful formal language for specifying real-time properties of Cyber-Physical Systems (CPS). Automatically transforming specifications written in natural language into STL formulas has attracted increasing attention. Existing rule-based methods depend heavily on rigid pattern matching and domain-specific knowledge, limiting their generalizability and scalability. Recently, Supervised Fine-Tuning (SFT) of large language models (LLMs) has been successfully applied to transform natural language into STL. However, the lack of fine-grained supervision on atomic proposition correctness, semantic fidelity, and formula readability often leads SFT-based methods to produce formulas misaligned with the intended meaning. To address these issues, we propose RESTL, a reinforcement learning (RL)-based framework for the transformation from natural language to STL. RESTL introduces multiple independently trained reward models that provide fine-grained, multi-faceted feedback from four perspectives, namely atomic proposition consistency, semantic alignment, formula succinctness, and symbol matching. These reward models are trained with a curriculum learning strategy to improve their feedback accuracy, and their outputs are aggregated into a unified signal that guides the optimization of the STL generator via Proximal Policy Optimization (PPO). Experimental results demonstrate that RESTL significantly outperforms state-of-the-art methods in both automatic metrics and human evaluations.
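To make the aggregation idea concrete, the sketch below combines per-aspect reward scores into one scalar training signal, mirroring how RESTL's reward-model outputs are unified before guiding PPO. Everything here is an illustrative assumption, not the paper's implementation: the reward functions, the weights, and the example formula `G[0,10](temp>30 -> F[0,5](fan_on))` (an STL rendering of "whenever temp exceeds 30, the fan turns on within 5 seconds") are hypothetical stand-ins for the paper's learned reward models.

```python
# Hypothetical sketch of multi-reward aggregation for an NL-to-STL generator.
# The paper's four reward models (atomic proposition consistency, semantic
# alignment, succinctness, symbol matching) are learned; these heuristics
# are simple stand-ins to show the aggregation step only.

def atomic_prop_reward(formula: str, props: list[str]) -> float:
    """Hypothetical: fraction of expected atomic propositions present."""
    if not props:
        return 1.0
    return sum(p in formula for p in props) / len(props)

def succinctness_reward(formula: str, max_len: int = 40) -> float:
    """Hypothetical: shorter formulas score closer to 1."""
    return max(0.0, 1.0 - len(formula) / max_len)

def aggregate_reward(scores: dict[str, float],
                     weights: dict[str, float]) -> float:
    """Weighted sum of per-aspect rewards (weights are assumptions)."""
    return sum(weights[k] * scores[k] for k in scores)

# Example: "whenever temp exceeds 30, the fan turns on within 5 seconds"
formula = "G[0,10](temp>30 -> F[0,5](fan_on))"
scores = {
    "atomic": atomic_prop_reward(formula, ["temp>30", "fan_on"]),
    "succinct": succinctness_reward(formula),
}
weights = {"atomic": 0.6, "succinct": 0.4}
reward = aggregate_reward(scores, weights)  # scalar signal fed to PPO
```

In an actual PPO loop, this scalar would replace a single monolithic reward, letting the generator receive credit (or penalty) along each dimension separately weighted.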