🤖 AI Summary
This paper addresses fundamental challenges in reinforcement learning (RL)—including sparse prior knowledge, difficulty in long-horizon planning, and complex reward function design—by systematically surveying recent advances in integrating large language models (LLMs) and vision-language models (VLMs) into RL. We propose a novel tripartite functional taxonomy wherein LLMs/VLMs serve as agents, planners, and reward generators, and introduce a unified language-vision-action modeling framework that integrates multimodal alignment, prompt engineering, and interpretability analysis. Our contributions include: (i) the first structured survey framework for LLM/VLM-augmented RL; (ii) identification of four critical open problems—grounding fidelity, bias mitigation, representation optimization, and action grounding; and (iii) theoretical foundations and evolutionary pathways for multimodal intelligent decision-making. The work bridges symbolic reasoning and embodied control, advancing principled integration of foundation models into RL systems. (149 words)
📝 Abstract
Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Meanwhile, Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged, exhibiting impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. In this survey, we review representative works in which LLMs and VLMs are used to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that categorizes these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We conclude by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, this survey establishes a framework for integrating LLMs and VLMs into RL, advancing approaches that unify natural language and visual understanding with sequential decision-making.