MM-tau-p²: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

šŸ“… 2026-03-10
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ“„ PDF
šŸ¤– AI Summary
This work addresses the limitations of existing evaluation frameworks, which overlook the influence of user personality on multimodal agent behavior and lack systematic robustness assessment under dual-control settings. To bridge this gap, we propose the MM-tau-p² benchmark, which integrates user-personality-adaptive prompting and user-input-informed planning to enable automated evaluation of multimodal agents in dual-control scenarios. For the first time, we unify personality adaptability, multimodal robustness, and interaction-round overhead into a cohesive evaluation framework, introducing twelve novel metrics and extending the FOCAL framework to multimodal contexts. Leveraging LLM-as-judge with carefully designed scoring rules, our approach enables efficient automatic evaluation in telecommunications and retail domains. Experiments on models including GPT-5 and GPT-4.1 quantify the robustness challenges and additional overhead introduced by multimodality, offering actionable insights for agent optimization.

šŸ“ Abstract
Current evaluation frameworks and benchmarks for LLM-powered agents focus on text-chat-driven agents and do not expose the user's persona to the agent, so they operate in a user-agnostic environment. Importantly, in the customer experience management domain, the agent's behaviour evolves as the agent learns about the user's personality. With the proliferation of real-time TTS and multi-modal language models, LLM-based agents are gradually becoming multi-modal. Towards this, we propose the MM-tau-p² benchmark with metrics for evaluating the robustness of multi-modal agents in a dual-control setting, with and without persona adaptation of the user, while also incorporating user inputs into the planning process for resolving a user query. In particular, our work shows that even with state-of-the-art frontier LLMs such as GPT-5 and GPT-4.1, introducing multi-modality into LLM-based agents brings additional considerations, which we measure via metrics such as multi-modal robustness and turn overhead. Overall, MM-tau-p² builds on our prior work FOCAL and provides a holistic, automated way of evaluating multi-modal agents by introducing 12 novel metrics. We also provide estimates of these metrics on the telecom and retail domains using the LLM-as-judge approach, with carefully crafted prompts and well-defined rubrics for evaluating each conversation.
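The LLM-as-judge evaluation step described above can be sketched roughly as follows. The rubric wording, metric names, and 0-2 score range here are illustrative assumptions for the sketch, not the paper's actual prompts or metrics:

```python
import json

# Hypothetical rubric text; the paper's real rubrics are not reproduced here.
RUBRIC = """Score the agent's conversation on each metric from 0 to 2:
- persona_adaptation: did the agent adjust its behaviour to the user's persona?
- multi_modal_robustness: did the agent handle the non-text input correctly?
Return JSON: {"persona_adaptation": <int>, "multi_modal_robustness": <int>}"""

def build_judge_prompt(conversation: str) -> str:
    """Assemble the judge prompt: scoring rubric plus the conversation transcript."""
    return f"{RUBRIC}\n\nConversation:\n{conversation}"

def parse_judge_scores(raw: str) -> dict:
    """Parse and range-check the judge model's JSON verdict."""
    scores = json.loads(raw)
    for metric, value in scores.items():
        if not 0 <= value <= 2:
            raise ValueError(f"{metric} out of range: {value}")
    return scores

# A real run would send build_judge_prompt(...) to a judge LLM (e.g. GPT-4.1);
# here the judge's reply is mocked to keep the sketch self-contained.
mock_reply = '{"persona_adaptation": 2, "multi_modal_robustness": 1}'
verdict = parse_judge_scores(mock_reply)
print(verdict)  # → {'persona_adaptation': 2, 'multi_modal_robustness': 1}
```

Per-conversation verdicts like this would then be aggregated across the benchmark's conversations to estimate each metric for a given agent and domain.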
Problem

Research questions and friction points this paper is trying to address.

multi-modal agent evaluation
persona adaptation
dual-control setting
user personality
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

persona-adaptive prompting
multi-modal agent evaluation
dual-control setting
multi-modal robustness
LLM-as-judge