AI Summary
This study addresses the lack of auditable, evidence-based approaches for large-scale structured extraction of biological traits, a gap that has limited the accuracy and reliability of cross-species data. The authors propose a registry-constrained large language model pipeline that combines four mechanisms: a closed-vocabulary trait registry, a verbatim evidence quote for each row, confidence-based filtering, and multi-version data persistence. The framework makes automated trait extraction auditable at large scale: it produced 5,489,881 trait records covering 409,820 of 409,880 species (99.985% coverage), with 81.57% classified as high-confidence. A three-tier validation shows that 90.12% of evidence quotes appear verbatim in the source text, which supports the traceability and verifiability of large-scale trait data.
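As a rough illustration of how three of the four mechanisms could fit together, the following minimal Python sketch gates rows on a closed-vocabulary registry and a confidence label, and checks evidence quotes separately as an audit. Everything named here is hypothetical: the two registry keys, the `TraitRow` fields, and the functions `admit` and `quote_is_verbatim` are assumptions made for illustration, and enumerated value sets stand in for the typed schema. The quote check is kept out of the admission gate because the abstract reports it as a validation result (90.12% of evidence-bearing rows) rather than as a precondition for persistence.

```python
from dataclasses import dataclass

# Hypothetical two-key registry. The registry described in the abstract has
# 39 keys; their names and types are not given, so these are placeholders.
REGISTRY = {
    "growth_habit": {"tree", "shrub", "herb", "vine"},
    "light_requirement": {"full_sun", "partial_shade", "shade"},
}

# Labels that survive filtering; "low" rows are dropped before persistence.
PERSISTED_CONFIDENCE = {"high", "medium"}


@dataclass(frozen=True)
class TraitRow:
    trait: str       # registry key
    value: str       # must belong to the key's closed vocabulary
    evidence: str    # quote the model claims to have copied from the source
    confidence: str  # "high", "medium", or "low"


def admit(row: TraitRow) -> bool:
    """Return True if the row passes the registry and confidence gates."""
    allowed = REGISTRY.get(row.trait)
    if allowed is None or row.value not in allowed:
        return False
    return row.confidence in PERSISTED_CONFIDENCE


def quote_is_verbatim(row: TraitRow, source_text: str) -> bool:
    """Audit check: is the evidence quote an exact substring of the source?"""
    return bool(row.evidence) and row.evidence in source_text
```

A row such as `TraitRow("growth_habit", "shrub", "This shrub grows", "high")` would be admitted and would also pass the audit against a source text containing that exact phrase, while a row with an unregistered key or a `"low"` label would be rejected by `admit`.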
Abstract
We describe a registry-bound large-language-model extraction pipeline producing evidence-grounded structured trait records at scale, on cultivated tropical plant, aquatic, and pet species. Four mechanisms render LLM-derived rows auditable: a versioned 39-key closed-vocabulary trait registry constraining every admitted value to a typed schema; a per-row verbatim evidence quote tying each value to source text; a per-row confidence label (high or medium; low-confidence rows are dropped before persistence); and multi-version preservation. Applied to 409,880 publishable species from the Tropical Species Encyclopedia, the pipeline executed 706,220 runs and persisted 5,489,881 trait records across 409,820 species (99.985%), 81.57% at high confidence. We report three validation layers in descending evidentiary strength: at full population, 90.12% of 5,427,588 evidence-bearing rows have their quote as a verbatim source substring (93.49% excluding one compliance meta-trait); a quote-supports-value audit on n=100 stratified non-red-zone rows yielded 100/100 (lower bound 96.30%); a face-validity review on n=50 red-zone rows yielded 50/50 Accept (lower bound 92.86%). Per-record correctness is not claimed; all records remain pending human curation. The contribution is the four-mechanism framework.