🤖 AI Summary
Problem: Language-controlled reinforcement learning (LC-RL) struggles to interpret and execute high-level, abstract natural language instructions (e.g., “press the opponent”, “counter-attack quickly”) in complex multi-agent environments such as 5v5 football.
Method: This paper proposes Language-Controlled Diverse Style Policies (LCDSP), a paradigm that introduces interpretable style parameters (SP) to bridge natural language commands and tactical behavior styles. Diverse Style Training (DST) trains a single policy network whose behavior is modulated by the SP, and a dedicated Style Interpreter (SI) explicitly maps linguistic inputs to the corresponding SP.
Contribution/Results: LCDSP enables controllable, diverse, and semantically grounded responses to abstract tactical directives within a unified policy architecture. Experiments in the 5v5 football domain show that LCDSP comprehends abstract tactical instructions and executes the intended behavioral styles. To our knowledge, LCDSP is the first framework to achieve an end-to-end, interpretable, and high-fidelity mapping from natural language to coordinated multi-agent tactical styles.
📝 Abstract
Despite advances in language-controlled reinforcement learning (LC-RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), extending LC-RL to comprehend and execute high-level or abstract instructions in complex, multi-agent environments, such as football games, remains a significant challenge. To address this gap, we introduce Language-Controlled Diverse Style Policies (LCDSP), a novel LC-RL paradigm specifically designed for complex scenarios. LCDSP comprises two key components: a Diverse Style Training (DST) method and a Style Interpreter (SI). The DST method efficiently trains a single policy capable of exhibiting a wide range of behaviors by modulating agent actions through style parameters (SP). The SI is designed to translate high-level language instructions into the corresponding SP accurately and rapidly. Through extensive experiments in a complex 5v5 football environment, we demonstrate that LCDSP comprehends abstract tactical instructions and accurately executes the desired behavioral styles, showcasing its potential for complex, real-world applications.
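The two-stage flow described in the abstract (instruction to style parameters, then style parameters to modulated actions) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the style dimensions, the keyword table, the linear policy, and the names `interpret` and `policy_logits` are hypothetical. The abstract does not specify the paper's actual style parameters, interpreter, or network architecture.

```python
import math
import random

# Hypothetical style dimensions; the abstract does not list the real ones.
STYLE_DIMS = ("pressing", "tempo", "risk")

# Toy stand-in for the Style Interpreter: a keyword lookup, not a learned model.
KEYWORD_STYLES = {
    "press": {"pressing": 1.0},
    "counter-attack": {"tempo": 1.0, "risk": 0.7},
    "quickly": {"tempo": 1.0},
    "defend": {"pressing": 0.2, "risk": 0.0},
}


def interpret(instruction, default=0.5):
    """Map an instruction to a style-parameter vector with entries in [0, 1]."""
    sp = {dim: default for dim in STYLE_DIMS}
    text = instruction.lower()
    for keyword, overrides in KEYWORD_STYLES.items():
        if keyword in text:
            sp.update(overrides)
    return [sp[dim] for dim in STYLE_DIMS]


def policy_logits(observation, sp, weights):
    """Single linear policy over the concatenation [observation, sp]."""
    features = list(observation) + list(sp)
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Random weights stand in for a trained policy (assumption: no training here).
rng = random.Random(0)
obs_dim, n_actions = 4, 3
weights = [
    [rng.uniform(-1.0, 1.0) for _ in range(obs_dim + len(STYLE_DIMS))]
    for _ in range(n_actions)
]
observation = [0.1, -0.3, 0.5, 0.0]

for instruction in ("Press the opponent", "Counter-attack quickly"):
    sp = interpret(instruction)
    probs = softmax(policy_logits(observation, sp, weights))
    print(instruction, sp, [round(p, 3) for p in probs])
```

The point of the sketch is only the interface: one set of policy weights serves every style, and changing the instruction changes the action distribution solely through the style-parameter vector.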