Universal Retrieval for Multimodal Trajectory Modeling

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the explosive growth of multimodal trajectory data in GUI environments, this paper introduces a novel paradigm for trajectory modeling and retrieval tailored to AI agents. Methodologically, we (1) formally define the multimodal trajectory retrieval task; (2) construct UATD, the first unified agent trajectory dataset, together with GAE-Bench, a comprehensive evaluation benchmark; and (3) propose GAE-Retriever, a model that integrates vision-language pretraining, token-level feature selection, and GradCache (a gradient caching mechanism) to enable memory-efficient contrastive learning. Extensive experiments across multiple benchmarks show that the approach significantly improves cross-modal trajectory retrieval recall, consistently outperforming strong baselines. Ablation studies confirm the contribution of each design component, while robustness and generalization analyses validate its reliability across diverse GUI tasks and trajectory lengths. The framework establishes a foundation for scalable, interpretable, and semantically grounded agent trajectory understanding.
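The token-level feature selection mentioned in the summary can be sketched as follows. This is a minimal illustration, not the paper's implementation: the norm-based scoring heuristic, the shapes, and the helper name are all assumptions chosen for clarity.

```python
import numpy as np

def select_tokens(token_feats: np.ndarray, k: int) -> np.ndarray:
    """Keep the k tokens with the largest L2 norm and mean-pool them.

    token_feats: (num_tokens, dim) patch/token features from a
    vision-language encoder. Norm-based scoring is an illustrative
    heuristic and not necessarily the paper's selection criterion.
    """
    scores = np.linalg.norm(token_feats, axis=1)   # (num_tokens,) importance scores
    top_k = np.argsort(scores)[-k:]                # indices of the k strongest tokens
    return token_feats[top_k].mean(axis=0)         # (dim,) pooled embedding

rng = np.random.default_rng(0)
feats = rng.normal(size=(196, 64))   # e.g. a 14x14 ViT patch grid (assumed shape)
emb = select_tokens(feats, k=32)     # compact trajectory-frame embedding
print(emb.shape)                     # (64,)
```

Reducing each frame to a small set of informative tokens keeps trajectory-level sequences short enough for contrastive training over long GUI interactions.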

📝 Abstract
Trajectory data, capturing human actions and environmental states across various modalities, holds significant potential for enhancing AI agent capabilities, particularly in GUI environments. However, how to model the representation of trajectory-level data presents a significant challenge that has not been systematically addressed amid explosive trajectory data growth. In this work, we introduce Multimodal Trajectory Retrieval, bridging the gap between universal retrieval and agent-centric trajectory modeling. We construct the Unified Agent Trajectory Dataset (UATD) from annotated demonstrations and states across diverse real-world scenarios. Based on this, we present GAE-Bench, a benchmark containing a large number of trajectory-based retrieval pairs. In addition, we propose GAE-Retriever, a multimodal retrieval framework that adopts vision-language models and incorporates optimized contrastive learning through token selection and the GradCache mechanism. Comprehensive evaluations across multiple datasets show that GAE-Retriever consistently outperforms strong baselines in retrieval recall, highlighting its effectiveness in advancing multimodal trajectory retrieval.

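The contrastive learning the abstract refers to is commonly formulated as an in-batch-negative InfoNCE objective, where each query (e.g. a task instruction) is pulled toward its paired trajectory embedding and pushed away from the other trajectories in the batch. The sketch below assumes that standard formulation; the temperature, cosine similarity, and batch size are illustrative choices, not details from the paper.

```python
import numpy as np

def info_nce(q: np.ndarray, d: np.ndarray, tau: float = 0.05) -> float:
    """In-batch-negative contrastive loss: row i of q matches row i of d,
    and every other row of d serves as a negative for query i.
    A generic InfoNCE sketch, not the paper's exact objective.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # cosine similarity
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / tau                             # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))         # diagonal = positive pairs

rng = np.random.default_rng(0)
d = rng.normal(size=(8, 16))                 # 8 trajectory embeddings (toy dims)
loss_matched = info_nce(d, d)                # identical pairs: near-zero loss
loss_random = info_nce(rng.normal(size=(8, 16)), d)   # unrelated queries
print(loss_matched, loss_random)
```

Training on such pairs is what lets retrieval recall serve as the evaluation metric: higher recall means the matched trajectory lands near the top of the ranked similarity column.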
Problem

Research questions and friction points this paper is trying to address.

Modeling trajectory-level data representation challenges
Bridging universal retrieval and agent-centric trajectory modeling
Advancing multimodal trajectory retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Trajectory Retrieval bridges retrieval and agent modeling
GAE-Retriever uses vision-language models and contrastive learning
Token selection and GradCache optimize retrieval performance
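The GradCache mechanism named in the list above decouples the contrastive batch size from encoder memory: embeddings are first computed without stored activations, the loss gradient is taken once at the embedding level, and small chunks are then re-encoded to backpropagate into the encoder. Below is a minimal NumPy sketch of that idea with a toy linear encoder standing in for the vision-language backbone; all shapes, the chunk size, and the loss form are illustrative assumptions.

```python
import numpy as np

def embed(W, X):
    """Toy linear 'encoder': one matmul stands in for the VLM backbone."""
    return X @ W.T                                     # (n, dim_out)

def loss_grad_wrt_emb(Q, D, tau=0.1):
    """Gradient of in-batch InfoNCE w.r.t. query and doc embeddings."""
    logits = Q @ D.T / tau
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                  # row-wise softmax
    B = Q.shape[0]
    G = (P - np.eye(B)) / (B * tau)                    # d loss / d logits
    return G @ D, G.T @ Q                              # d loss / dQ, d loss / dD

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))                            # shared encoder weights
Xq, Xd = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))

# Pass 1: embed the full batch and cache only the embeddings.
Q, D = embed(W, Xq), embed(W, Xd)
dQ, dD = loss_grad_wrt_emb(Q, D)                       # gradients at the cache

# Pass 2: re-encode small chunks and accumulate parameter gradients,
# so peak activation memory scales with the chunk, not the batch.
grad_W = np.zeros_like(W)
for s in range(0, 16, 4):
    sl = slice(s, s + 4)                               # chain rule per chunk
    grad_W += dQ[sl].T @ Xq[sl] + dD[sl].T @ Xd[sl]

grad_ref = dQ.T @ Xq + dD.T @ Xd                       # full-batch reference
print(np.allclose(grad_W, grad_ref))                   # chunked == full-batch
```

The equality at the end is the point of the technique: the chunked two-pass computation reproduces the full-batch gradient exactly, so large in-batch-negative pools become feasible on fixed GPU memory.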