Zero-shot cross-modal transfer of Reinforcement Learning policies through a Global Workspace

📅 2024-03-07
🏛️ RLJ
📈 Citations: 1
Influential: 1
🤖 AI Summary
Addressing the challenge of zero-shot cross-modal policy transfer in reinforcement learning, particularly between vision and attribute-vector modalities, this paper introduces a brain-inspired multimodal representation architecture: the first to integrate cognitive science's Global Workspace Theory into deep RL. The method enables bidirectional zero-shot policy transfer without fine-tuning or retraining, leveraging a cross-modal broadcasting mechanism and a frozen, transferable representation. Evaluated in two representative task environments, it retains over 92% of the source-modality training performance when transferring policies between image and attribute-vector inputs, substantially outperforming contrastive-learning baselines such as CLIP-like models. Key contributions: (1) the first deep RL architecture incorporating a Global Workspace mechanism, and (2) robust, generalizable zero-shot cross-modal policy reuse. This work bridges cognitive neuroscience and RL, advancing multimodal representation learning for adaptive decision-making under modality shift.
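To make the architecture concrete, here is a minimal sketch, assuming per-modality encoders into a shared latent space and decoders that "broadcast" the workspace signal back to each modality. All names and dimensions (GlobalWorkspace, dim_vision, dim_attr, dim_gw) are illustrative assumptions rather than the paper's exact implementation, and the visual input is assumed to be a pre-encoded latent vector (e.g. from a VAE) rather than raw pixels.

```python
import torch.nn as nn

class GlobalWorkspace(nn.Module):
    """Shared latent space with per-modality encoders and 'broadcast' decoders.

    Hypothetical sketch: sizes and layer choices are assumptions, not the
    paper's reported architecture.
    """

    def __init__(self, dim_vision=64, dim_attr=10, dim_gw=12):
        super().__init__()
        # Encoders map each modality's representation into the shared workspace...
        self.enc = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim_vision, 64), nn.ReLU(), nn.Linear(64, dim_gw)),
            "attr": nn.Sequential(nn.Linear(dim_attr, 64), nn.ReLU(), nn.Linear(64, dim_gw)),
        })
        # ...and decoders broadcast the workspace signal back to each modality.
        self.dec = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim_gw, 64), nn.ReLU(), nn.Linear(64, dim_vision)),
            "attr": nn.Sequential(nn.Linear(dim_gw, 64), nn.ReLU(), nn.Linear(64, dim_attr)),
        })

    def encode(self, x, modality):
        """Project a modality-specific vector into the shared workspace."""
        return self.enc[modality](x)

    def broadcast(self, z, modality):
        """Decode the workspace latent back into a given modality."""
        return self.dec[modality](z)
```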

📝 Abstract
Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors are difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a 'Global Workspace': a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a 'Global Workspace' to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train an RL agent's policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.
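The transfer recipe described in the abstract (train the workspace, freeze it, train a policy on its latent, then swap the input modality at test time) can be sketched as follows, continuing the hypothetical GlobalWorkspace sketch above. Observation tensors, sizes, and the policy head are placeholders, not the paper's actual setup.

```python
import torch
import torch.nn as nn

# Freeze the workspace: the policy cannot adapt it, which is what makes the
# later modality swap a zero-shot transfer.
gw = GlobalWorkspace(dim_vision=64, dim_attr=10, dim_gw=12)
gw.eval()
for p in gw.parameters():
    p.requires_grad_(False)

n_actions = 4  # placeholder action-space size
policy = nn.Sequential(nn.Linear(12, 64), nn.Tanh(), nn.Linear(64, n_actions))

# Training phase: the policy only ever observes the shared GW latent,
# here computed from attribute-vector observations.
attr_obs = torch.randn(1, 10)                       # placeholder attribute state
logits_train = policy(gw.encode(attr_obs, "attr"))

# Test phase (zero-shot): the same policy, untouched, acts from the GW latent
# of a visual observation (a pre-encoded visual latent stands in for the image).
vision_obs = torch.randn(1, 64)
logits_test = policy(gw.encode(vision_obs, "vision"))
```

Because the policy's input distribution is the shared workspace latent rather than either raw modality, swapping encoders at test time requires no retraining or fine-tuning.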
Problem

Research questions and friction points this paper is trying to address.

Enabling RL agents to transfer policies across sensory modalities without retraining
Exploring brain-inspired multimodal representations for robust RL performance
Assessing zero-shot cross-modal transfer between visual and attribute-based inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

A Global Workspace that combines multimodal information into a shared latent representation (a hedged sketch of plausible training objectives follows this list)
Bidirectional zero-shot cross-modal policy transfer in RL
An RL policy trained on top of the frozen Global Workspace
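As referenced in the list above, here is a hedged sketch of plausible training objectives for such a workspace: a translation loss on paired data plus demi-cycle and full-cycle consistency terms, in the spirit of prior Global Workspace work. The exact losses and weightings are not given in this section, so everything below is an assumption; it reuses the hypothetical GlobalWorkspace sketch from earlier.

```python
import torch.nn.functional as F

def gw_losses(gw, x_vision, x_attr):
    """Assumed multimodal objectives on a paired (vision latent, attributes) batch."""
    z_v = gw.encode(x_vision, "vision")
    z_a = gw.encode(x_attr, "attr")

    # Translation: encode one modality, broadcast to the other, match the pair.
    l_tr = F.mse_loss(gw.broadcast(z_v, "attr"), x_attr) \
         + F.mse_loss(gw.broadcast(z_a, "vision"), x_vision)

    # Demi-cycle: encode then broadcast back to the same modality.
    l_dcy = F.mse_loss(gw.broadcast(z_v, "vision"), x_vision) \
          + F.mse_loss(gw.broadcast(z_a, "attr"), x_attr)

    # Full cycle: translate to the other modality, re-encode, and come back.
    z_cycle = gw.encode(gw.broadcast(z_v, "attr"), "attr")
    l_cy = F.mse_loss(gw.broadcast(z_cycle, "vision"), x_vision)

    # Equal weighting is an arbitrary placeholder choice.
    return l_tr + l_dcy + l_cy
```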
Léopold Maytié
PhD student, ANITI
Multimodal Learning, Reinforcement Learning, Robotics
Benjamin Devillers
CerCo, CNRS UMR5549, Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse
Alexandre Arnold
Airbus AI Research
Rufin VanRullen
CerCo, CNRS UMR5549, Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse