Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a recurring problem in biomedical named entity recognition: large language models readily produce candidate entities that look plausible but do not match the annotation conventions of the target corpus, so they fall short of expert curation standards. The authors propose BioConCal, a framework that first aligns the predictions of eight LLMs into a single candidate table and then applies a supervised scorer to rank those candidates. At inference time the scorer uses only signals that require no gold labels: inter-model agreement, mention-level features, surface-form properties, and document-level context. This moves entity validation from single-model outputs to scoring over a multi-model panel. Across five public biomedical NER datasets, BioConCal raises in-domain AUROC from 0.753 for raw agreement to 0.910. At a validation-selected precision target of 0.95, it selects 1,340 candidates at an empirical test precision of 0.939, compared with 293 for raw agreement.
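The aggregation step described above can be pictured with a small sketch. It is not the authors' pipeline: the function name `build_candidate_table`, the `(doc_id, mention, entity_type)` tuple format, and the case-folding normalization are all assumptions made for illustration. It merges per-model predictions into one row per candidate and attaches the raw agreement fraction that a scorer could use as one feature.

```python
from collections import defaultdict


def build_candidate_table(panel_predictions):
    """Merge per-model predictions into one row per candidate.

    panel_predictions: dict mapping a model name to a list of
    (doc_id, mention, entity_type) tuples. This input format is an
    assumption for the sketch, not the paper's schema.
    """
    n_models = len(panel_predictions)
    voters = defaultdict(set)
    for model, preds in panel_predictions.items():
        for doc_id, mention, entity_type in preds:
            # Hypothetical normalization: lowercase and collapse whitespace
            # so trivially different surface forms share one row.
            key = (doc_id, " ".join(mention.lower().split()), entity_type)
            voters[key].add(model)

    rows = [
        {
            "doc_id": doc_id,
            "mention": mention,
            "type": entity_type,
            "n_votes": len(models),
            # Raw agreement: fraction of panel members proposing the candidate.
            "agreement": len(models) / n_models,
        }
        for (doc_id, mention, entity_type), models in voters.items()
    ]
    # Highest agreement first; remaining keys make the order deterministic.
    rows.sort(key=lambda r: (-r["agreement"], r["doc_id"], r["mention"], r["type"]))
    return rows


# Toy three-model panel (invented data, for illustration only).
panel = {
    "model_a": [("doc1", "BRCA1", "Gene"), ("doc1", "breast cancer", "Disease")],
    "model_b": [("doc1", "brca1", "Gene"), ("doc1", "cancer", "Disease")],
    "model_c": [("doc1", "BRCA1", "Gene"), ("doc1", "breast  cancer", "Disease")],
}
table = build_candidate_table(panel)
for row in table:
    print(row)
```

In this toy panel, "breast cancer" and "cancer" end up as separate rows with different vote counts. Agreement alone cannot say which span the corpus convention prefers, which is the gap a supervised scorer with additional features is meant to close.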

📝 Abstract

Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not a measure of corpus-convention correctness. We introduce a candidate-level benchmark for verifying panel-surfaced candidates, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target, it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.
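The gap between the 0.95 precision target and the 0.939 empirical test precision follows from how such thresholds are chosen: the cut-off is fixed on validation data and then applied unchanged to test data. The sketch below illustrates one plausible selection rule (the lowest threshold whose validation precision meets the target). The abstract does not spell out the exact rule, so the functions `select_threshold` and `precision_and_count` and all toy scores are assumptions.

```python
def select_threshold(scores, labels, target_precision):
    """Return the lowest score threshold whose precision meets the target.

    Candidates with score >= threshold count as selected. Returns None
    when no threshold reaches the target. Labels are 1 (correct) or 0.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(pairs, start=1):
        true_positives += label
        # Tied scores are selected together, so evaluate only at the
        # last element of each run of equal scores.
        if i < len(pairs) and pairs[i][0] == score:
            continue
        if true_positives / i >= target_precision:
            best = score
    return best


def precision_and_count(scores, labels, threshold):
    """Precision and size of the set selected at a fixed threshold."""
    selected = [label for s, label in zip(scores, labels) if s >= threshold]
    if not selected:
        return 0.0, 0
    return sum(selected) / len(selected), len(selected)


# Invented validation and test splits, for illustration only.
val_scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.60, 0.40]
val_labels = [1, 1, 1, 1, 0, 1, 0]
test_scores = [0.97, 0.92, 0.88, 0.86, 0.70, 0.30]
test_labels = [1, 1, 0, 1, 1, 0]

threshold = select_threshold(val_scores, val_labels, target_precision=0.95)
val_precision, val_count = precision_and_count(val_scores, val_labels, threshold)
test_precision, test_count = precision_and_count(test_scores, test_labels, threshold)
print(threshold, val_precision, val_count, test_precision, test_count)
```

On this toy data the threshold meets the target on validation but falls short of it on test, the same kind of shortfall the abstract reports at much smaller magnitude. The reported recalls are also numerically consistent with a simple product, 0.592 × 0.883 ≈ 0.523, which is what one would expect if corpus-level recall is candidate-level recall scaled by the within-panel ceiling; that reading is an inference from the numbers rather than a definition given in the abstract.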

Problem

Research questions and friction points this paper is trying to address.

biomedical NER

annotation conventions

multi-LLM agreement

candidate verification

curator triage

Innovation

Methods, ideas, or system contributions that make the work stand out.

panel-surfaced candidates

multi-LLM agreement

BioConCal