CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of large language model agents struggle to capture complex task dependencies, non-ideal user behavior, and the validity of multiple solutions. This work proposes CRAB-Bench, a new benchmark, together with RUSE, a user simulation engine. CRAB-Bench uses constraint graphs to generate tasks over multiple interdependent entities with structured distractors, while RUSE models personalized, non-cooperative user interactions grounded in human behavioral research. Together they enable, for the first time, joint modeling of complex tasks and realistic user behavior, with evaluation that accepts multiple valid solutions. Experiments on four frontier LLM agents show that the best model reaches only 61% pass@1 on CRAB-Bench, that switching to RUSE causes further drops of up to 57%, and that agents often mask errors through implicit corrections rather than admitting mistakes. These results highlight how fragile current agents remain in realistic scenarios.
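For readers unfamiliar with the pass@1 figure quoted above, the following is a minimal sketch of how such a score is commonly computed. It uses the standard unbiased pass@k estimator from code-generation evaluation; this page does not state how CRAB-Bench computes its score, so the estimator, the function names, and the task ids are all assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of sampled attempts, c: number of successful attempts.
    For k = 1 this reduces to the plain success rate c / n.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average pass@1 over tasks.

    results maps a (hypothetical) task id to per-attempt success flags.
    """
    scores = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes for two tasks, two attempts each.
example = {"task_a": [True, False], "task_b": [False, False]}
score = benchmark_pass_at_1(example)
```

Under this reading, a 61% pass@1 means that a single attempt solves about 61 out of 100 tasks on average.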

📝 Abstract

Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions. Experiments on four frontier LLM agents show that the best model achieves only 61% pass@1 on CRAB-Bench, and switching to RUSE causes further drops of up to 57%, concentrated in task-solving ability rather than conversational quality. Information Disclosure is the most damaging behavioral dimension, and agents interacting with RUSE are less likely to admit mistakes, instead masking errors through implicit corrections.
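The idea of a constraint graph over interdependent entities, with distractors and more than one valid solution, can be illustrated with a toy sketch. Everything below is hypothetical: the flight and hotel entities, their attributes, the two constraints, and the `grade` function are invented for illustration and are not taken from CRAB-Bench, whose actual task generator is not described on this page.

```python
from itertools import product

# Hypothetical candidates per entity; some are distractors that look
# attractive in isolation but violate a cross-entity constraint.
candidates = {
    "flight": [
        {"id": "F1", "arrives": 14, "price": 300},
        {"id": "F2", "arrives": 20, "price": 180},  # cheap, but arrives late
        {"id": "F3", "arrives": 13, "price": 320},
    ],
    "hotel": [
        {"id": "H1", "checkin_until": 18, "price": 200},
        {"id": "H2", "checkin_until": 23, "price": 450},  # late check-in, over budget
    ],
}

# Edges of the constraint graph: (entities involved, predicate over them).
constraints = [
    (("flight", "hotel"), lambda f, h: f["arrives"] <= h["checkin_until"]),
    (("flight", "hotel"), lambda f, h: f["price"] + h["price"] <= 550),
]


def is_valid(assignment: dict) -> bool:
    """True if the assignment satisfies every constraint edge."""
    return all(pred(*(assignment[e] for e in ents)) for ents, pred in constraints)


def valid_solutions() -> set:
    """Enumerate all candidate combinations and keep the valid id tuples."""
    names = list(candidates)
    found = set()
    for combo in product(*(candidates[n] for n in names)):
        assignment = dict(zip(names, combo))
        if is_valid(assignment):
            found.add(tuple(assignment[n]["id"] for n in names))
    return found


def grade(answer_ids: tuple) -> bool:
    """Accept any valid solution instead of comparing to one reference answer."""
    return tuple(answer_ids) in valid_solutions()
```

In this toy task, 2 of the 6 combinations are valid, and `grade` accepts either one. Grading against the constraint set rather than a single gold answer is what lets an evaluation accommodate multiple valid solutions.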

Problem

Research questions and friction points this paper is trying to address.

LLM agents

task dependencies

user simulation

evaluation benchmark

realistic scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

CRAB-Bench

RUSE

task dependency