CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

📅 2026-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language model agents struggle to capture complex task dependencies, non-ideal user behavior, and tasks that admit multiple valid solutions. This work proposes CRAB-Bench, a new benchmark, together with RUSE, a user simulation engine. CRAB-Bench uses a constraint graph to generate tasks over multiple interdependent entities with structured distractors, while RUSE models personalized, non-cooperative user interactions grounded in human behavioral studies. Together they allow complex tasks and realistic user behavior to be modeled jointly, with evaluation that accepts multiple valid solutions. In experiments on four frontier LLM agents, the best model reaches only 61% pass@1 on CRAB-Bench, switching to RUSE causes further drops of up to 57%, and agents tend to mask errors through implicit corrections rather than admit mistakes, which points to pronounced fragility in realistic scenarios.
📝 Abstract
Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions. Experiments on four frontier LLM agents show that the best model achieves only 61% pass@1 on CRAB-Bench, and switching to RUSE causes further drops of up to 57%, concentrated in task-solving ability rather than conversational quality. Information Disclosure is the most damaging behavioral dimension, and agents interacting with RUSE are less likely to admit mistakes, instead masking errors through implicit corrections.
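To make the idea of constraint-based tasks with distractors and multiple valid solutions concrete, here is a minimal sketch. It is not CRAB-Bench's actual generator or scorer: the travel-style entities, the two constraints, and the helper names (`is_valid`, `valid_solutions`, `pass_at_1`) are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy entities (not taken from the paper).
flights = [
    {"id": "F1", "arrives": 14, "price": 300},
    {"id": "F2", "arrives": 18, "price": 180},  # distractor: cheap but arrives late
    {"id": "F3", "arrives": 13, "price": 320},
]
hotels = [
    {"id": "H1", "checkin_until": 16, "price": 150},
    {"id": "H2", "checkin_until": 20, "price": 400},  # distractor: breaks the budget
]

# Each constraint links two entities; together they act as the edges
# of a very small constraint graph.
constraints = [
    lambda f, h: f["arrives"] <= h["checkin_until"],  # flight must land before check-in closes
    lambda f, h: f["price"] + h["price"] <= 500,      # shared budget across both entities
]

def is_valid(flight, hotel):
    """A candidate is accepted if it satisfies every constraint."""
    return all(check(flight, hotel) for check in constraints)

def valid_solutions():
    """Enumerate every (flight, hotel) pair that satisfies all constraints."""
    return {(f["id"], h["id"]) for f, h in product(flights, hotels) if is_valid(f, h)}

def pass_at_1(outcomes):
    """Fraction of tasks solved on the first attempt (outcomes are booleans)."""
    return sum(outcomes) / len(outcomes)

print(valid_solutions())
print(pass_at_1([True, False, True, True]))
```

In this toy instance two of the six candidate pairs are valid, so a checker that compares against a single reference answer would wrongly reject one correct agent. Checking constraints directly is one way to accept any valid solution, which is the property the abstract asks of the evaluation.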
Problem

Research questions and friction points this paper is trying to address.

LLM agents
task dependencies
user simulation
evaluation benchmark
realistic scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

CRAB-Bench
RUSE
task dependency
realistic user simulation
LLM agent evaluation