StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

📅 2024-12-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM evaluation benchmarks suffer from annotation bias, high human evaluation costs, and susceptibility to cheating. Method: We propose StructTest, a novel benchmark targeting models’ comprehension of compositional instructions and their ability to generate structured outputs (e.g., HTML, code, mathematical expressions). It introduces the first rule-based, deterministic evaluation framework: outputs are automatically validated via syntactic and semantic parsing—eliminating reliance on human annotation, model-based scorers, or answer leakage—and employs domain-diverse instruction templates with compositional instruction engineering for scalable, cross-task assessment. Results: Evaluated on 17 state-of-the-art LLMs, even top-tier models such as GPT-4o and DeepSeek-V3/R1 achieve only moderate accuracy. StructTest thus demonstrates strong discriminative power, high difficulty, and robustness against contamination—establishing it as a rigorous proxy metric for structured reasoning capability.

📝 Abstract
The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains including Summarization, Code, HTML, and Math, and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for top-performing models like DeepSeek-V3/R1 and GPT-4o, establishing it as a robust proxy for measuring reasoning capabilities. We believe StructTest offers a critical and complementary approach to achieving objective and comprehensive model evaluation.
Problem

Research questions and friction points this paper is trying to address.

Human annotations are costly and do not scale
Model-based evaluations are susceptible to stylistic biases
Target-answer benchmarks are vulnerable to data contamination and cheating
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs via structured, compositional outputs.
Uses rule-based evaluator for deterministic assessments.
Tests diverse domains: Summarization, Code, HTML, Math.
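The rule-based, deterministic evaluation described above can be illustrated with a minimal sketch. The function below is a hypothetical validator (not the paper's actual evaluator) for a compositional summarization instruction such as "produce exactly 3 bullet points, each starting with '- ' and at most 10 words": each sub-instruction becomes a parsing rule, so the verdict is reproducible and needs no human judge or scoring model.

```python
def check_bullet_summary(output: str, n_bullets: int = 3, max_words: int = 10) -> bool:
    """Return True iff the output satisfies every sub-instruction."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    if len(lines) != n_bullets:          # compositional rule 1: bullet count
        return False
    for ln in lines:
        if not ln.startswith("- "):      # rule 2: bullet syntax
            return False
        if len(ln[2:].split()) > max_words:  # rule 3: per-bullet length bound
            return False
    return True

good = "- Cats sleep a lot\n- Dogs like walks\n- Birds can fly"
bad = "- Only one bullet here"
print(check_bullet_summary(good), check_bullet_summary(bad))  # True False
```

Because every check is plain string parsing, the same pattern extends to the other domains (e.g., validating HTML nesting or the form of a mathematical expression) by swapping in the appropriate parser.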