🤖 AI Summary
AI benchmark documentation frequently suffers from incompleteness and inconsistency, undermining interpretability and comparability across tasks and domains. To address this, we propose the first multi-agent framework for automated benchmark documentation construction. The method comprises three tightly integrated stages: (1) heterogeneous data crawling from diverse sources, including Hugging Face, Unitxt, and academic papers; (2) LLM-driven abstractive summarization; and (3) atomic fact verification powered by FactReasoner. It introduces a closed-loop paradigm ("collaborative extraction → semantic synthesis → entailment verification") that enforces rigorous alignment between the generated documentation and the underlying benchmark specifications. Evaluation shows substantial improvements: +42.3% in documentation completeness and +38.7% in factual accuracy. The framework enhances transparency in AI evaluation, strengthens cross-domain comparability, and improves reproducibility and reuse efficiency for the community.
📝 Abstract
We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.
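The validation phase described above can be illustrated with a minimal sketch. This is not the actual FactReasoner implementation: the fact splitter, the lexical-overlap scorer (standing in for a real NLI/entailment model), and all function names below are hypothetical assumptions for illustration only. The idea is that a generated benchmark card is decomposed into atomic facts, each fact is scored for entailment against the extracted source passages, and the card passes only if every fact clears a threshold.

```python
# Hypothetical sketch of atomic entailment scoring for benchmark-card
# validation. NOT the FactReasoner API: the splitter and scorer below
# are toy stand-ins (a real pipeline would use an LLM fact extractor
# and an NLI model).

def split_atomic_facts(card_text: str) -> list[str]:
    """Naive splitter: treat each sentence of the card as one atomic fact."""
    return [s.strip() for s in card_text.split(".") if s.strip()]

def entailment_score(fact: str, sources: list[str]) -> float:
    """Placeholder scorer: token overlap with the best source passage
    stands in for a learned entailment probability in [0, 1]."""
    fact_tokens = set(fact.lower().split())
    best = 0.0
    for src in sources:
        src_tokens = set(src.lower().split())
        if fact_tokens:
            best = max(best, len(fact_tokens & src_tokens) / len(fact_tokens))
    return best

def validate_card(card_text: str, sources: list[str], threshold: float = 0.5):
    """Score every atomic fact; the card passes only if all facts are
    entailed by some source passage above the threshold."""
    scores = {f: entailment_score(f, sources) for f in split_atomic_facts(card_text)}
    return all(s >= threshold for s in scores.values()), scores
```

For example, a card claim fully supported by an extracted passage would pass, while a claim with no support in any source would be flagged for revision, closing the loop between synthesis and verification.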