🤖 AI Summary
This work addresses the safety risks that the probabilistic nature of large language models (LLMs) poses in safety-critical applications such as running plan generation, where rule violations can lead to hazardous outcomes. To mitigate these risks, the authors propose SafeRun, a framework that separates soft semantic understanding from hard safety constraints. SafeRun uses an LLM to interpret user instructions and delegates constraint enforcement to a deterministic solver that applies physiological and safety rules. Evaluated across five mainstream LLMs on a newly curated running planning benchmark, SafeRun achieves 100% safety compliance, compared with averages of 79.1% for Prompt Engineering and 97.6% for CodeAct, while maintaining competitive instruction-following scores.
📝 Abstract
Large Language Models (LLMs) enable flexible natural-language planning but remain unreliable in determinism-critical domains because of their probabilistic nature. This limitation is especially problematic in running planning, where a plan that violates safety rules can be hazardous to the runner. We propose SafeRun, a framework for deterministic LLM-based planning via a decoupled architecture. SafeRun separates soft interpretation by an LLM from hard constraint enforcement by a deterministic solver, strictly enforcing safety constraints while preserving natural-language flexibility. To validate SafeRun, we build a comprehensive benchmark for running planning under realistic physiological and safety constraints. Experiments across five LLMs show that SafeRun achieves a 100\% safety score (vs.\ a 79.1\% average for Prompt Engineering (PE) and a 97.6\% average for CodeAct) while maintaining competitive instruction-following scores. The SafeRun benchmark is publicly available at \href{https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark}{huggingface}.
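
The decoupled architecture described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Intent` schema, the `interpret` stub standing in for the LLM call, and both hard rules (a 10% cap on week-over-week volume growth and an 80 km weekly ceiling) are assumptions chosen only to show the idea that the LLM proposes soft preferences while a deterministic solver alone decides the final plan.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical hard rules, for illustration only; not taken from the paper.
MAX_WEEKLY_INCREASE = 0.10  # weekly volume may grow by at most 10%
MAX_WEEKLY_KM = 80.0        # absolute ceiling on weekly volume


@dataclass(frozen=True)
class Intent:
    """Soft preferences an LLM might extract from a request (assumed schema)."""
    weeks: int
    target_weekly_km: float
    current_weekly_km: float


def interpret(request: str) -> Intent:
    """Stand-in for the LLM call; returns a fixed parse instead of querying a model."""
    return Intent(weeks=4, target_weekly_km=60.0, current_weekly_km=20.0)


def solve(intent: Intent) -> List[float]:
    """Deterministic solver: move toward the target, never past a hard limit."""
    plan = []
    prev = intent.current_weekly_km
    for _ in range(intent.weeks):
        cap = min(prev * (1 + MAX_WEEKLY_INCREASE), MAX_WEEKLY_KM)
        km = min(intent.target_weekly_km, cap)  # soft goal clipped by hard rules
        plan.append(km)
        prev = km
    return plan


def is_safe(plan: List[float], start_km: float) -> bool:
    """Independent check of the same hard rules, with a small float tolerance."""
    prev = start_km
    for km in plan:
        if km > MAX_WEEKLY_KM + 1e-9:
            return False
        if km > prev * (1 + MAX_WEEKLY_INCREASE) + 1e-9:
            return False
        prev = km
    return True


if __name__ == "__main__":
    intent = interpret("Get me to 60 km per week within a month")
    plan = solve(intent)
    print(plan, is_safe(plan, intent.current_weekly_km))
```

In this sketch an over-ambitious request (tripling weekly volume in four weeks) cannot produce an unsafe plan, because the interpreted target only enters the solver as an upper preference and every emitted value is bounded by the hard rules.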