Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses reliability problems in two natural language to first-order logic (NL-to-FOL) benchmark datasets, FOLIO and MALLS, where annotation errors, ambiguous sentences, and incorrect inference labels distort model evaluation. It reports a systematic manual audit of the validation split of FOLIO and a subset of the MALLS test set, finds that approximately 39% of FOLIO and 36% of MALLS entries contain incorrect FOL formalizations, and releases corrected versions. Re-evaluating three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) against the corrected ground truths yields accuracy gains of 9 to 22 percentage points. To reduce the cost of relabeling, the study proposes an LLM-based framework that directs human reviewers toward the most error-prone instances; it reaches 90% dataset accuracy after fewer than 24% of instances are reviewed, compared to over 70% with unguided review.

📝 Abstract

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), along with ambiguous NL sentences (16.4% and 48%, respectively) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains of +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
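
The guided-versus-unguided comparison in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the function names, the per-instance `scores` (a hypothetical error-likelihood estimate, for example one derived from LLM judgments), and the ten-instance toy data are all assumptions made for illustration, and the sketch assumes that a reviewed instance always ends up correct.

```python
from typing import List, Sequence


def guided_order(scores: Sequence[float]) -> List[int]:
    """Indices sorted by descending (hypothetical) error-likelihood score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def fraction_to_review(
    is_correct: Sequence[bool], order: Sequence[int], target: float
) -> float:
    """Fraction of instances reviewed, in the given order, before the share of
    correct annotations reaches `target`.

    Assumption: reviewing an instance always fixes it if it was wrong.
    """
    n = len(is_correct)
    correct = sum(is_correct)
    if correct / n >= target:
        return 0.0
    for reviewed, idx in enumerate(order, start=1):
        if not is_correct[idx]:
            correct += 1  # the reviewer repairs a wrong annotation
        if correct / n >= target:
            return reviewed / n
    return 1.0


# Toy data (NOT from the paper): 10 instances, 4 with wrong annotations.
is_correct = [True, False, True, True, False, True, False, True, True, False]
# Hypothetical scores; higher means "more likely to be wrong".
scores = [0.1, 0.9, 0.2, 0.6, 0.8, 0.1, 0.7, 0.3, 0.2, 0.4]

guided = fraction_to_review(is_correct, guided_order(scores), target=0.9)
unguided = fraction_to_review(is_correct, list(range(len(scores))), target=0.9)
print(f"guided: {guided:.0%} reviewed, unguided: {unguided:.0%} reviewed")
```

On this toy data the score-ordered pass reaches the 90% target after reviewing fewer instances than the dataset-order pass. The size of that gap depends entirely on how well the scores rank the truly erroneous instances, which is what the paper evaluates empirically.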

Problem

Research questions and friction points this paper is trying to address.

NL-to-FOL

annotation errors

benchmark validation

natural language inference

dataset auditing

Innovation

Methods, ideas, or system contributions that make the work stand out.

NL-to-FOL

verified annotations

LLM-assisted relabeling