Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing large language model (LLM) evaluation benchmarks, which often rely on idealized user assumptions and fail to capture the ambiguity, non-cooperative behaviors, and dynamic intentions characteristic of real-world interactions. To bridge this gap, we introduce RUT-Bench, the first tool-augmented evaluation benchmark that systematically models the diversity of real users, encompassing both ideal and non-ideal behaviors in single- and multi-turn dialogues. RUT-Bench incorporates high-fidelity dialogue simulation and user experience–oriented evaluation metrics for non-ideal interactions. Comprehensive evaluation of 19 state-of-the-art LLMs reveals a significant performance drop under non-ideal inputs, with all models achieving task success rates below 40%, underscoring their insufficient robustness in realistic scenarios.

📝 Abstract

Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated, idealized user assumptions and lack experience-oriented evaluation. As a result, they fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations of 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLM achieves an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at https://github.com/TorresYangX/RUT-Bench.
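
To make the two headline quantities concrete (overall success rate, and the drop from ideal to non-ideal inputs), here is a minimal sketch of how such numbers could be computed from per-task outcomes. It is not RUT-Bench's actual scoring code: the function name `success_rates`, the behavior labels `"ideal"` and `"non_ideal"`, and the toy outcomes are all assumptions made for illustration, and the benchmark's real metrics may be defined differently.

```python
from collections import defaultdict


def success_rates(results):
    """Return the success rate per behavior label, plus an 'overall' rate.

    `results` is an iterable of (behavior, succeeded) pairs. The labels are
    hypothetical and not taken from the RUT-Bench codebase.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for behavior, succeeded in results:
        # Count each task once under its own label and once under 'overall'.
        for key in (behavior, "overall"):
            totals[key] += 1
            wins[key] += int(succeeded)
    return {key: wins[key] / totals[key] for key in totals}


# Toy, made-up outcomes purely for illustration (not results from the paper).
toy_results = [
    ("ideal", True), ("ideal", True), ("ideal", False), ("ideal", True),
    ("non_ideal", False), ("non_ideal", True),
    ("non_ideal", False), ("non_ideal", False),
]

rates = success_rates(toy_results)
# Performance drop when moving from ideal to non-ideal user inputs.
drop = rates["ideal"] - rates["non_ideal"]
print(rates, drop)
```

Reporting the per-behavior rates alongside the overall rate is what makes a robustness gap visible; a single aggregate number would hide it.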

Problem

Research questions and friction points this paper is trying to address.

large language models

tool-use evaluation

real-world user interactions

non-ideal user behavior

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

real-world user simulation

tool-use evaluation

non-ideal user behavior