MetaForge: A Self-Evolving Multimodal Agent that Retrieves, Adapts, and Forges Tools On Demand

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

Existing multimodal agents are constrained by static toolkits, limiting their generalization to novel scenarios and often introducing redundancy and errors through indiscriminate tool invocation. This work proposes the first self-evolving multimodal agent framework capable of online skill forging and reuse, integrating four tightly coupled stages—decision, retrieval, adaptation, and forging—to dynamically determine whether to use tools, select existing ones, or synthesize new skills on demand, thereby establishing a closed-loop evolutionary mechanism. The framework employs a unified policy to dynamically choose among direct answering, tool reuse, or skill forging, and leverages reinforcement learning to jointly optimize the necessity of invocation, retrieval accuracy, execution effectiveness, and skill reusability, augmented with explicit cost penalties to suppress redundant calls. Evaluated across twelve benchmarks, the approach significantly outperforms sixteen baselines, achieving notable advances in accuracy, efficiency, and generalization.

📝 Abstract

Multimodal agents have achieved notable progress on complex reasoning tasks through tool use, yet remain limited by two issues: statically predefined tool inventories fail to generalize to unseen scenarios, and indiscriminate tool invocation incurs redundant cost and noise-induced errors. We propose MetaForge, a multimodal agent framework that learns when to invoke tools and how to evolve its toolset on demand. MetaForge factorizes agentic behavior into four coupled stages: Decide (judging whether tool use is warranted), Retrieve (selecting suitable tools), Adapt (grounding tool parameters in task context), and Forge (synthesizing new skills online and recycling them into the tool library for reuse), forming a closed judge-retrieve-adapt-forge-recycle loop. A unified orchestration policy enables the agent to choose among answering directly, reusing existing tools, or forging new ones. We jointly optimize invocation necessity, retrieval accuracy, execution effectiveness, and forged-skill reusability via reinforcement learning, with an explicit invocation-cost penalty discouraging redundant calls. Across 12 benchmarks, MetaForge consistently surpasses 16 baselines in accuracy, efficiency, and generalization, validating a paradigm shift from static tool inventories to on-demand self-evolution.

Problem

Research questions and friction points this paper is trying to address.

multimodal agents

tool use

static tool inventories

indiscriminate tool invocation

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving agent

on-demand tool forging

multimodal reasoning