From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle

📅 2024-12-17

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This paper proposes HIVE, a framework for executing multimodal user instructions that require coordinated invocation of heterogeneous models under diverse constraints. HIVE pairs an LLM with PDDL-based formal planning: it decomposes a complex query into a sequence of verifiable atomic actions and orchestrates specialized multimodal models to carry them out. The method combines knowledge-aware task decomposition, constraint-guided action planning, and model orchestration, so that each plan stays interpretable and traceable end to end. On the MuSE benchmark, the authors report state-of-the-art results in task selection and multimodal planning, outperforming existing collaborative systems. HIVE is presented as the first joint LLM+PDDL planning paradigm for multimodal, cross-model tasks, supporting strong constraint satisfaction and formal verification of execution plans.

📝 Abstract

In response to the call for agent-based solutions that leverage the ever-increasing capabilities of the deep models' ecosystem, we introduce Hive, a comprehensive solution for selecting appropriate models and subsequently planning a set of atomic actions to satisfy the end users' instructions. Hive operates over sets of models and, upon receiving natural language instructions (i.e., user queries), schedules and executes explainable plans of atomic actions. These actions can involve one or more of the available models to achieve the overall task, while respecting the end users' specific constraints. Notably, Hive supports multi-modal inputs and outputs, enabling it to handle complex, real-world queries. Our system is capable of planning complex chains of actions while guaranteeing explainability, using an LLM-based formal logic backbone empowered by PDDL operations. We introduce the MuSE benchmark in order to offer a comprehensive evaluation of the multi-modal capabilities of agent systems. Our findings show that our framework redefines the state-of-the-art for task selection, outperforming competing systems that plan operations across multiple models, while offering transparency guarantees and fully adhering to user constraints.
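To make the idea of planning a chain of atomic actions concrete, here is a minimal sketch of a STRIPS-style planner in Python. It is not Hive's implementation and does not use PDDL syntax or a real PDDL solver; it only illustrates the underlying notion of actions with preconditions and effects. The action names, the facts such as `have_audio`, and the breadth-first search strategy are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical atomic action, described PDDL-style by facts."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset = frozenset()


def plan(initial, goal, actions):
    """Breadth-first search over states; returns a shortest list of
    action names reaching the goal, or None if no plan exists."""
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return path
        for action in actions:
            if action.preconditions <= state:
                nxt = (state - action.delete_effects) | action.add_effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [action.name]))
    return None


# Illustrative action set (assumed, not taken from the paper).
ACTIONS = [
    Action("transcribe_audio", frozenset({"have_audio"}),
           frozenset({"have_text"})),
    Action("translate_text", frozenset({"have_text"}),
           frozenset({"have_translated_text"})),
    Action("synthesize_speech", frozenset({"have_translated_text"}),
           frozenset({"have_translated_audio"})),
    Action("caption_image", frozenset({"have_image"}),
           frozenset({"have_text"})),
]

result = plan({"have_audio"}, {"have_translated_audio"}, ACTIONS)
print(result)
# ['transcribe_audio', 'translate_text', 'synthesize_speech']
```

Because every step in the returned plan names an action whose preconditions were checked against an explicit state, the plan can be inspected and verified step by step, which is the kind of explainability the abstract attributes to a formal planning backbone.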

Problem

Research questions and friction points this paper is trying to address.

Planning explainable atomic action sequences for user queries

Selecting appropriate multi-modal models under user constraints

Handling complex tasks with multi-modal inputs and outputs
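The model-selection problem listed above can be pictured as filtering a catalog of models by task, modality, and user constraints. The sketch below is an assumption-laden illustration, not the paper's method: the `ModelCard` fields, the model names, the constraint types (size and license), and the smallest-model tie-break are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical metadata describing one available model."""
    name: str
    task: str
    input_modality: str
    output_modality: str
    size_gb: float
    license: str


def select_model(catalog, task, input_modality, output_modality,
                 max_size_gb=None, allowed_licenses=None):
    """Return the smallest model matching the task, the modalities, and
    the optional user constraints, or None if nothing qualifies."""
    candidates = [
        m for m in catalog
        if m.task == task
        and m.input_modality == input_modality
        and m.output_modality == output_modality
        and (max_size_gb is None or m.size_gb <= max_size_gb)
        and (allowed_licenses is None or m.license in allowed_licenses)
    ]
    # Tie-break by size: an arbitrary choice made for this sketch.
    return min(candidates, key=lambda m: m.size_gb, default=None)


# Illustrative catalog (all entries are made up).
CATALOG = [
    ModelCard("captioner-large", "image-captioning", "image", "text",
              6.0, "apache-2.0"),
    ModelCard("captioner-small", "image-captioning", "image", "text",
              1.5, "research-only"),
    ModelCard("asr-base", "speech-recognition", "audio", "text",
              0.3, "mit"),
]

chosen = select_model(
    CATALOG, "image-captioning", "image", "text",
    max_size_gb=8.0, allowed_licenses={"apache-2.0", "mit"},
)
print(chosen.name if chosen else "no model satisfies the constraints")
```

With the license constraint in place, the smaller research-only captioner is excluded and the larger permissively licensed one is chosen; tightening `max_size_gb` to 2.0 would leave no admissible model, so the function returns `None` rather than violating a constraint.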

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses PDDL operations for explainable action planning

Schedules multi-modal models to handle complex queries

Guarantees transparency while respecting user constraints

🔎 Similar Papers

NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions