🤖 AI Summary
In graph learning, the relative contributions of molecular structure (topology) and node features (attributes) to model performance have long lacked quantifiable assessment, hindering task-driven modeling decisions. To address this, we propose NNRD (Noise-Noise Ratio Difference), a computable metric quantifying the balance between structural and feature information. NNRD independently perturbs graph topology and node features, then measures the resulting performance degradation to reveal the intrinsic information preference of the data. Evaluated on multiple molecular property prediction tasks, NNRD tracks information loss closely and yields intuitive, interpretable results that characterize dataset-specific properties more expressively than simple average-performance aggregates. By providing an empirically validated measure, NNRD supports task-adaptive selection and optimization of GNN architectures for structure- and feature-aware graph representation learning.
📝 Abstract
Graph learning on molecules makes use of information from both the molecular structure and the features attached to that structure. Much work has been devoted to biasing models towards either structure or features, with the aim that this bias bolsters performance. Identifying which information source a dataset favours, and therefore how best to approach learning on that dataset, remains an open problem. Here we propose Noise-Noise Ratio Difference (NNRD), a quantitative metric indicating whether structure or features carries more useful information. By iteratively noising features and structure independently, each time leaving the other intact, NNRD measures the degradation of information in each. We apply NNRD across a range of molecular tasks and show that it corresponds well to a loss of information, with intuitive results that are more expressive than simple performance aggregates. Our future work will focus on expanding data domains, tasks and types, as well as refining our choice of baseline model.
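The perturb-and-compare procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the perturbation schemes (random feature replacement, random edge rewiring), the noise levels, and the sign convention (feature degradation minus structure degradation) are all illustrative assumptions.

```python
import random

def perturb_features(x, frac, rng):
    """Replace a fraction of feature entries with random noise (assumed scheme)."""
    x = [row[:] for row in x]
    n, d = len(x), len(x[0])
    for _ in range(int(frac * n * d)):
        x[rng.randrange(n)][rng.randrange(d)] = rng.random()
    return x

def perturb_edges(adj, frac, rng):
    """Rewire a fraction of undirected edges to random endpoints (assumed scheme)."""
    adj = {u: set(vs) for u, vs in adj.items()}
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    nodes = list(adj)
    for u, v in rng.sample(edges, int(frac * len(edges))):
        adj[u].discard(v)
        adj[v].discard(u)
        a, b = rng.sample(nodes, 2)
        adj[a].add(b)
        adj[b].add(a)
    return adj

def nnrd(score, graph, levels, rng):
    """Toy NNRD: mean performance drop under feature noise minus the
    mean drop under structure noise, at matched noise levels.
    Positive values suggest the task leans on features; negative, on structure."""
    adj, x = graph
    base = score(adj, x)
    feat_drop = sum(base - score(adj, perturb_features(x, p, rng))
                    for p in levels) / len(levels)
    struct_drop = sum(base - score(perturb_edges(adj, p, rng), x)
                      for p in levels) / len(levels)
    return feat_drop - struct_drop
```

In practice `score` would be a trained baseline GNN's validation metric; here any callable taking an adjacency dict and a feature matrix works, which keeps the sketch runnable on toy graphs.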