OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation that multimodal large language models (MLLMs) suffer from inter-modal data conflicts when jointly trained on 2D GUI and 3D embodied tasks, this paper proposes a layer-heterogeneous Mixture-of-Experts (MoE) architecture: shallow layers share parameters to model cross-modal synergy, while deep layers use modality-specific parameters to suppress interference. The authors also introduce a unified action space and jointly train the model on large-scale GUI and embodied interaction datasets. Inspired by functional parcellation in the human brain, this design achieves both modality compatibility and task decoupling. Experiments show that the resulting agent outperforms unimodal specialized models on both GUI and embodied benchmarks, excels particularly at 2D interface manipulation, and exhibits strong cross-task generalization.
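The layer-heterogeneous split can be pictured concretely. Below is a minimal PyTorch sketch, not the authors' released code: shallow transformer blocks are shared across modalities, while each deep block keeps per-modality expert parameters selected by a hard modality tag. All names here (LayerHeterogeneousTrunk, n_shallow, n_deep, etc.) are illustrative assumptions, and whether OmniActor routes by hard tags or a learned router is not specified in this summary.

```python
# Minimal sketch of a layer-heterogeneous MoE trunk (illustrative only,
# not the paper's implementation). Shallow blocks are shared by GUI and
# embodied inputs; deep blocks hold one expert branch per modality.
import torch
import torch.nn as nn

class Block(nn.Module):
    """A stand-in pre-norm transformer block: self-attention + MLP."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class LayerHeterogeneousTrunk(nn.Module):
    """Shared shallow layers (cross-modal synergy) plus per-modality deep
    experts (conflict suppression), routed by a modality tag."""
    def __init__(self, d_model=512, n_shallow=4, n_deep=4,
                 modalities=("gui", "embodied")):
        super().__init__()
        self.shallow = nn.ModuleList(Block(d_model) for _ in range(n_shallow))
        self.deep = nn.ModuleDict({
            m: nn.ModuleList(Block(d_model) for _ in range(n_deep))
            for m in modalities
        })

    def forward(self, x, modality: str):
        for blk in self.shallow:          # parameters shared across modalities
            x = blk(x)
        for blk in self.deep[modality]:   # modality-specific parameters
            x = blk(x)
        return x

trunk = LayerHeterogeneousTrunk()
tokens = torch.randn(2, 16, 512)          # (batch, sequence, hidden)
gui_out = trunk(tokens, modality="gui")
emb_out = trunk(tokens, modality="embodied")
```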

📝 Abstract
Multimodal large language models are evolving into multimodal agents capable of proactively executing tasks. Most agent research focuses on GUI or embodied scenarios, corresponding to agents that interact with 2D virtual worlds or 3D real worlds, respectively. Many complex tasks, however, require agents to interact with both types of environment in an interleaved manner. We initially mixed GUI and embodied data for training, but observed performance degradation caused by conflict between the two. Further analysis reveals that GUI and embodied data exhibit synergy at the shallow layers and conflict at the deep layers, which resembles the cerebrum-cerebellum mechanism in the human brain. To this end, we propose OmniActor, a high-performance generalist agent designed from both structural and data perspectives. First, we propose Layer-heterogeneity MoE, which eliminates the conflict between GUI and embodied data by separating deep-layer parameters while exploiting their synergy by sharing shallow-layer parameters. By leveraging the synergy and eliminating the conflict, OmniActor outperforms agents trained only on GUI or embodied data in the respective tasks. Furthermore, we unify the action spaces of GUI and embodied tasks and collect large-scale GUI and embodied data from diverse sources for training. This significantly improves OmniActor across scenarios, especially in GUI tasks. The code will be publicly available.
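The unified action space mentioned in the abstract can be illustrated with a small schema that serializes both GUI and embodied actions into one textual call format a single decoder head can emit. This is a hypothetical sketch; the action names and argument conventions below are assumptions for illustration, not OmniActor's actual specification.

```python
# Hypothetical sketch of a unified action space: GUI and embodied actions
# share one serialized call format, so one autoregressive head can emit
# both. The action vocabulary is illustrative, not OmniActor's real spec.
from dataclasses import dataclass

@dataclass
class Action:
    name: str     # e.g. "click", "type", "move_to", "grasp"
    args: dict    # action-specific arguments

    def serialize(self) -> str:
        """Render as a function-call string for the LLM to generate."""
        arg_str = ", ".join(f"{k}={v!r}" for k, v in self.args.items())
        return f"{self.name}({arg_str})"

# 2D GUI actions: screen coordinates and text input.
click = Action("click", {"x": 0.42, "y": 0.87})
type_text = Action("type", {"text": "weather in Paris"})

# 3D embodied actions: spatial targets and manipulator commands.
move_to = Action("move_to", {"x": 1.2, "y": 0.0, "z": 0.4})
grasp = Action("grasp", {"object": "mug"})

for a in (click, type_text, move_to, grasp):
    print(a.serialize())
# click(x=0.42, y=0.87)
# type(text='weather in Paris')
# move_to(x=1.2, y=0.0, z=0.4)
# grasp(object='mug')
```

Unifying the surface form this way allows mixed GUI and embodied trajectories to train a single output head without per-domain action decoders.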
Problem

Research questions and friction points this paper is trying to address.

Developing a generalist agent for both 2D GUI and 3D embodied environments
Resolving data conflicts between GUI and embodied task training
Unifying action spaces across different multimodal interaction scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-heterogeneity MoE architecture: shared shallow-layer parameters with modality-specific deep-layer parameters
Unified action space for GUI and embodied tasks
Large-scale multimodal data integration from diverse sources
👥 Authors
Longrong Yang · Zhejiang University · Computer Vision and Pattern Recognition
Zhixiong Zeng · Meituan, Zhejiang University
Yufeng Zhong · Meituan · Multimodal LLM; Computer Vision
Jing Huang · Meituan, Zhejiang University
Liming Zheng · Meituan, Zhejiang University
Lei Chen · Meituan, Zhejiang University
Haibo Qiu · University of Sydney · Multimodal LLM; Vision and Language; Computer Vision
Zequn Qin · Zhejiang University · Computer Vision; Deep Learning; Machine Learning
Lin Ma · Meituan, Zhejiang University
Xi Li · Zhejiang University