🤖 AI Summary
Existing scientific relation extraction benchmarks predominantly focus on computer science and are ill-suited to empirical, variable-centered disciplines such as psychology. This work introduces variable-centered empirical graph extraction, a task that maps psychology abstracts to typed graphs in which normalized variables serve as nodes and empirical or hierarchical relations serve as edges. It presents EmpiriGraph-Psy, a benchmark of 210 annotated psychology abstracts with a fine-grained annotation scheme covering variable normalization, concept hierarchies, empirical relation types, and validation states. A staged graph-construction pipeline substantially outperforms direct (end-to-end) extraction, with the best configuration achieving a macro-F1 of 0.74. Error analysis shows that moderation relations and concept hierarchies remain the most challenging cases.
📝 Abstract
Existing scientific relation extraction benchmarks mainly target domains such as computer science, where entities are tasks, methods, datasets, materials, or metrics. This leaves a gap in variable-oriented empirical fields such as psychology, where findings are expressed as relations among constructs, measurements, interventions, and outcomes. We introduce variable-centered empirical graph extraction, the task of mapping scientific abstracts to typed graphs whose nodes are normalized variables and whose edges represent empirical and hierarchical relations. To support this task, we construct EmpiriGraph-Psy, a benchmark of 210 psychology abstracts annotated by domain-trained annotators with normalized variables, concept hierarchies, empirical relation types, and validation states. We evaluate frontier and open-weight LLMs using both direct extraction and a staged graph-construction pipeline that separates variable extraction, normalization, hierarchy construction, evidence selection, relation extraction, and edge validation. The staged pipeline substantially outperforms direct extraction, with the best configuration achieving a macro-F1 of 0.74. Error analysis shows that moderation relations and concept hierarchies remain the most challenging cases, highlighting the difficulty of extracting higher-order empirical claims and implicit abstraction structure from scientific abstracts.
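To make the target representation and the reported metric concrete, the following is a minimal sketch of a typed variable graph and a macro-F1 computed over relation types. It is an illustration under stated assumptions, not the benchmark's implementation: the class names, the relation labels (`association`, `moderation`, `is_a`), the validation state `supported`, the example variables, and the exact-match scoring rule are all hypothetical, since the abstract does not specify the label inventory or the matching criterion.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str                    # normalized variable name
    target: str                    # normalized variable name
    relation: str                  # hypothetical label: "association", "moderation", "is_a"
    validation: str = "supported"  # hypothetical validation state


@dataclass
class VariableGraph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, edge: Edge) -> None:
        # Adding an edge also registers both endpoints as nodes.
        self.nodes.update((edge.source, edge.target))
        self.edges.add(edge)


def macro_f1(gold: VariableGraph, pred: VariableGraph) -> float:
    """Unweighted mean of per-relation-type F1, using exact edge matching (an assumption)."""
    labels = sorted({e.relation for e in gold.edges | pred.edges})
    scores = []
    for label in labels:
        g = {e for e in gold.edges if e.relation == label}
        p = {e for e in pred.edges if e.relation == label}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical gold graph for one abstract.
gold = VariableGraph()
gold.add_edge(Edge("stress", "burnout", "association"))
gold.add_edge(Edge("social_support", "burnout", "moderation"))
gold.add_edge(Edge("burnout", "well_being", "is_a"))

# Hypothetical prediction that misses the moderation edge.
pred = VariableGraph()
pred.add_edge(Edge("stress", "burnout", "association"))
pred.add_edge(Edge("burnout", "well_being", "is_a"))

score = macro_f1(gold, pred)
print(round(score, 3))
```

Because macro averaging weights every relation type equally, a single missed moderation edge lowers the score as much as failing an entire frequent class would, which is one reason rare, higher-order relations can dominate the error profile.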