Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of large language models rely predominantly on static chat interfaces, which do not capture how models perform behind interactive interface features that let users adjust a response, such as real-time control over linguistic complexity. This work brings direct manipulation, a principle from the human-centered design literature, into LLM evaluation by proposing a benchmark framework in which a model generates multiple responses to one query at controllable levels of linguistic complexity. Specifically, GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 were prompted to generate responses at five complexity levels for each of 98 scientific questions, and their adaptability was assessed with quantitative complexity measures. Results indicate that models do vary linguistic complexity across responses, but inconsistently: even the best-performing model, Claude Sonnet 4.5, shifted reliable complexity measures in the intended direction only 46% of the time.

📝 Abstract

Evaluations of large language models (LLMs) in scientific information-seeking tasks have become increasingly use-centric, for example through live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework, based on a formative study with 16 participants, that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from the human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for each of 98 scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best-performing model (Claude Sonnet 4.5) shifting reliable complexity measures in the correct direction only 46% of the time. Our findings hold with increased sample size and alternative complexity levels.
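The abstract does not spell out which complexity measures are used or how the "correct direction" rate is aggregated. As a rough illustration of the idea only, the sketch below scores five responses to one query with an assumed proxy (mean words per sentence) and reports the fraction of adjacent complexity levels where the score rises as intended. The function names, the proxy metric, and the example responses are all hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Sequence


def mean_sentence_length(text: str) -> float:
    """Average words per sentence: a crude, assumed proxy for complexity."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def directional_consistency(
    responses: Sequence[str],
    metric: Callable[[str], float] = mean_sentence_length,
) -> float:
    """Fraction of adjacent level pairs where the metric strictly increases.

    `responses` must be ordered from the simplest requested level to the
    most complex one.
    """
    scores = [metric(r) for r in responses]
    pairs = list(zip(scores, scores[1:]))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(later > earlier for earlier, later in pairs) / len(pairs)


# Hypothetical responses to one query, ordered from level 1 to level 5.
RESPONSES = [
    "Plants eat light.",
    "Plants use light to make food.",
    "Plants make sugar. They use light.",
    "Plants convert light energy into chemical energy stored in sugar.",
    "Photosynthesis converts light energy into chemical energy by fixing "
    "carbon dioxide into sugars.",
]

if __name__ == "__main__":
    # Level 3 is simpler than level 2 under this proxy, so 3 of 4 steps rise.
    print(directional_consistency(RESPONSES))  # 0.75
```

Under this reading, a model that always moved the metric in the requested direction would score 1.0 on every query. A benchmark-level figure such as the reported 46% would come from aggregating over queries and measures, in whatever way the paper actually defines.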

Problem

Research questions and friction points this paper is trying to address.

language model evaluation

interactive interfaces

language complexity

scientific information seeking

response variability

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive evaluation

language complexity

large language models