AI Summary
Current evaluations of large language models (LLMs) predominantly rely on static chat interfaces, which fail to capture how models perform under interactive interface features that support user-driven adjustments, such as real-time control over linguistic complexity. This work brings the human-centered design principle of direct manipulation into LLM evaluation by proposing a multi-response generation benchmark framework with controllable linguistic complexity. Specifically, GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 were prompted to generate responses at five distinct complexity levels for each of 98 scientific questions, and their adaptability was assessed with quantifiable complexity measures. Results indicate that while models show some capacity to modulate linguistic complexity, their consistency remains limited: even the best-performing model, Claude Sonnet 4.5, shifted reliable complexity measures in the intended direction in only 46% of cases.
Abstract
Evaluations of large language models (LLMs) in scientific information-seeking tasks have become increasingly use-centric, for example through live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. Based on a formative study with $16$ participants, we propose a new evaluation framework that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from the human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating $5$ responses at different levels of language complexity for each of $98$ scientific queries. While models vary complexity across responses, most changes remain inconsistent: the best-performing model (Claude Sonnet 4.5) shifts reliable complexity measures in the correct direction only $46\%$ of the time. Our findings hold with increased sample size and alternative complexity levels.
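To make the idea of "shifting a complexity measure in the correct direction" concrete, the following is a minimal sketch of one way such a check could be computed for a single query. It is not the paper's metric: the abstract does not specify which complexity measures are used or how the $46\%$ figure is aggregated. The proxy `mean_sentence_length` and the function `directional_agreement` are hypothetical names introduced here purely for illustration, and comparing adjacent levels is an assumption.

```python
import re
from typing import Callable, Sequence


def mean_sentence_length(text: str) -> float:
    """Hypothetical complexity proxy: average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def directional_agreement(
    responses: Sequence[str],
    measure: Callable[[str], float] = mean_sentence_length,
) -> float:
    """Fraction of adjacent level pairs whose score strictly increases.

    `responses` must be ordered from the simplest requested complexity
    level to the most complex one (an assumption of this sketch).
    """
    scores = [measure(r) for r in responses]
    pairs = list(zip(scores, scores[1:]))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(later > earlier for earlier, later in pairs) / len(pairs)


# Five toy responses to one query, ordered from level 1 to level 5.
example = [
    "Cats sleep.",
    "Cats sleep a lot.",
    "Cats sleep.",
    "Cats sleep for many hours each day.",
    "Cats, being crepuscular predators, sleep for many hours each day.",
]
print(directional_agreement(example))
```

In this toy case the third response is shorter than the second, so one of the four adjacent transitions moves in the wrong direction. A real evaluation would substitute validated complexity measures for the word-count proxy and aggregate over all queries.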