Insider Attacks in Multi-Agent LLM Consensus Systems

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the vulnerability of multi-agent large language model (LLM) consensus systems to malicious insiders, which can delay or prevent agreement among benign agents. The study formulates the insider attack as a sequential decision-making task and combines a latent world model with reinforcement learning to solve it. The world model learns surrogate dynamics over the latent behavioral states of benign agents as they communicate in natural language, and an adaptive attack policy is then trained against this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than a baseline that simply prompts the agent to act maliciously.
📝 Abstract
Large language models (LLMs) are increasingly deployed in multi-agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi-agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi-agent LLM consensus systems. We formalize the problem as a sequential decision-making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world-model-based framework that learns surrogate dynamics over the latent behavioral states of benign agents and then trains an attacker using reinforcement learning based on this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious-prompt baseline. These results suggest that combining latent world models with reinforcement learning is a promising direction for adaptive insider attacks in language-based multi-agent systems.
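To make the sequential decision-making framing concrete, the following toy sketch may help. It is not the paper's method: there is no LLM, no latent world model, and no reinforcement learning here. It assumes, purely for illustration, that each benign agent holds a scalar opinion in [0, 1], that consensus means the opinion spread falls below a tolerance, and that the insider may send a different message to each benign agent every round. All names (`run_consensus`, `static_attacker`, `polarizing_attacker`) and parameter values are hypothetical.

```python
from statistics import mean

def run_consensus(attacker=None, n_benign=4, max_rounds=50, tol=0.05, step=0.5):
    """Return the round at which benign agents agree, or None if they never do.

    Toy assumption: every benign agent moves part of the way toward the mean
    of all opinions it hears (benign opinions plus one optional insider message).
    """
    opinions = [i / (n_benign - 1) for i in range(n_benign)]  # spread over [0, 1]
    for t in range(1, max_rounds + 1):
        # The insider observes the benign state and picks one message per agent.
        injected = attacker(opinions) if attacker else [None] * n_benign
        updated = []
        for o, m in zip(opinions, injected):
            heard = opinions + ([m] if m is not None else [])
            updated.append(o + step * (mean(heard) - o))
        opinions = updated
        if max(opinions) - min(opinions) < tol:
            return t
    return None

def static_attacker(opinions):
    """Fixed message to everyone, loosely analogous to a non-adaptive baseline."""
    return [1.0] * len(opinions)

def polarizing_attacker(opinions):
    """Hand-written adaptive heuristic: push each agent away from the group mean."""
    center = mean(opinions)
    return [0.0 if o < center else 1.0 for o in opinions]

if __name__ == "__main__":
    print(run_consensus())                              # no insider
    print(run_consensus(attacker=static_attacker))      # non-adaptive insider
    print(run_consensus(attacker=polarizing_attacker))  # state-dependent insider
```

Under these assumptions the first two calls reach agreement in 5 rounds, while the state-dependent insider holds the benign agents apart for the full horizon and the call returns `None`. The hand-written heuristic stands in for the learned policy only to show why conditioning the attack on the benign agents' current state matters; in the paper that state is latent and the policy is learned rather than scripted.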
Problem

Research questions and friction points this paper is trying to address.

insider attacks
multi-agent systems
LLM consensus
adversarial manipulation
malicious agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

insider attack
multi-agent LLM
consensus formation
world model
reinforcement learning