🤖 AI Summary
Existing LLM evaluation benchmarks suffer from annotation bias, high human evaluation costs, and susceptibility to cheating. Method: We propose StructTest, a novel benchmark targeting models’ comprehension of compositional instructions and their ability to generate structured outputs (e.g., HTML, code, mathematical expressions). It introduces a rule-based, deterministic evaluation framework: outputs are automatically validated via syntactic and semantic parsing, eliminating reliance on human annotators and model-based scorers while avoiding answer leakage, and it employs domain-diverse instruction templates with compositional instruction engineering for scalable, cross-task assessment. Results: Evaluated on 17 state-of-the-art LLMs, even top-tier models such as GPT-4o and DeepSeek-V3/R1 achieve only moderate accuracy. StructTest thus demonstrates strong discriminative power, high difficulty, and robustness against contamination, establishing it as a rigorous proxy metric for structured reasoning capability.
📝 Abstract
The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains including Summarization, Code, HTML, and Math, and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for top-performing models like DeepSeek-V3/R1 and GPT-4o, establishing it as a robust proxy for measuring reasoning capabilities. We believe StructTest offers a critical and complementary approach to achieving objective and comprehensive model evaluation.
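The rule-based, deterministic evaluation described above can be sketched as follows. This is an illustrative example, not code from the StructTest benchmark: the function name, the "exactly N bullet points, each at most K words" instruction, and the parsing rules are all assumptions chosen to mirror the kind of compositional, format-following check a deterministic evaluator would perform on a Summarization task.

```python
import re

def check_bullet_summary(output: str, num_bullets: int, max_words: int) -> bool:
    """Deterministically verify a compositional instruction such as
    'summarize into exactly num_bullets bullet points, each at most
    max_words words'. Purely rule-based: the model output either parses
    and satisfies every constraint, or it fails -- no human annotators,
    no model-based scorer, and no fixed target answer to leak.
    (Hypothetical checker, not the benchmark's actual evaluator.)"""
    # Parse: keep only lines formatted as markdown bullets ("- ...").
    bullets = [line.strip() for line in output.splitlines()
               if line.strip().startswith("- ")]
    # Constraint 1: exact bullet count.
    if len(bullets) != num_bullets:
        return False
    # Constraint 2: per-bullet word limit (words = whitespace-split tokens).
    return all(len(re.sub(r"^-\s*", "", b).split()) <= max_words
               for b in bullets)

# A compliant output passes; a non-compliant one fails deterministically.
print(check_bullet_summary("- cats sleep\n- dogs bark", 2, 3))   # True
print(check_bullet_summary("- one\n- two\n- three", 2, 3))       # False
```

Because each check is a pure function of the output string, evaluation is reproducible and cheap to extend: adding a new task amounts to writing a new parser-plus-constraint function rather than collecting annotations.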