🤖 AI Summary
This paper addresses consensus inference for collections of phylogenetic trees drawn from multiple sources, where the trees may have unequal leaf sets and non-binary topologies. We propose a structured feature-selection framework over a partially ordered set (poset) of leaf-labeled trees, in which leaves and edges are the features and the partial order defines what counts as a true or false discovery in a candidate summary tree. The resulting estimation algorithm controls the false discovery rate (FDR) at a nominal level for a broad class of non-parametric generative models, giving the first consensus tree estimator with finite-sample, model-free guarantees. Rather than forcing full resolution, the method collapses poorly supported structure, and it uses the poset structure to assess the stability of each feature in the selected tree. Empirically, we apply the approach to the archaeal origin of eukaryotic cells and use it to quantify uncertainty in deep branching orders.
📝 Abstract
Connected acyclic graphs (trees) are data objects that hierarchically organize categories. Collections of trees arise in a variety of fields, including evolutionary biology, public health, machine learning, social sciences and anatomy. Summarizing a collection of trees by a single representative is challenging, in part due to the high dimension of both the sample space and the parameter space. We frame consensus tree estimation as a structured feature-selection problem, where leaves and edges are the features. We introduce a partial order on leaf-labeled trees, use it to define true and false discoveries for a candidate summary tree, and develop an estimation algorithm that controls the false discovery rate at a nominal level for a broad class of non-parametric generative models. Furthermore, using the partial order structure, we assess the stability of each feature in a selected tree. Importantly, our method accommodates unequal leaf sets and non-binary trees, allowing the estimator to reflect uncertainty by collapsing poorly supported structure instead of forcing full resolution. We apply the method to study the archaeal origin of eukaryotic cells and to quantify uncertainty in deep branching orders. While consensus tree construction has historically been viewed as an estimation task, reframing it as feature selection over a partially ordered set allows us to obtain the first estimator with finite-sample and model-free guarantees. More generally, our approach provides a foundation for integrating tools from multiple testing into tree estimation.
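
To make the feature-selection framing concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the paper's algorithm. It assumes all trees share one leaf set and represents each tree by its set of non-trivial clades, so that set inclusion gives a partial order in which a collapsed tree sits below any tree that refines it. The names `is_coarser`, `false_discovery_proportion`, and `clade_support` are hypothetical, and treating a candidate clade that is absent from a reference tree as a false discovery is a simplifying assumption; the paper's own definitions, which also cover unequal leaf sets, may differ.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

Clade = FrozenSet[str]   # a clade is the set of leaf labels below an edge
Tree = Set[Clade]        # ASSUMPTION: a tree is its set of non-trivial clades


def is_coarser(a: Tree, b: Tree) -> bool:
    """Return True if every clade of `a` also appears in `b`."""
    return a <= b


def false_discovery_proportion(candidate: Tree, reference: Tree) -> float:
    """Fraction of candidate clades that are absent from the reference tree."""
    if not candidate:
        return 0.0  # the fully collapsed (star) tree makes no discoveries
    false_count = sum(1 for clade in candidate if clade not in reference)
    return false_count / len(candidate)


def clade_support(trees: List[Tree]) -> Dict[Clade, float]:
    """Fraction of trees in the collection that contain each clade."""
    counts: Counter = Counter()
    for tree in trees:
        counts.update(tree)
    return {clade: count / len(trees) for clade, count in counts.items()}


# Toy data on leaves A, B, C, D (hypothetical, for illustration only).
truth: Tree = {frozenset("AB"), frozenset("ABC")}
candidate: Tree = {frozenset("AB"), frozenset("CD")}
star: Tree = set()

print(is_coarser(star, truth))                       # star tree is below every tree
print(false_discovery_proportion(candidate, truth))  # one of two clades is false
support = clade_support([truth, candidate, {frozenset("AB")}])
print(support[frozenset("AB")])                      # clade AB appears in all three
```

In this toy setting, collapsing a weakly supported clade moves the candidate down the partial order and can only remove false discoveries, which is the intuition behind reporting an unresolved tree rather than forcing full resolution.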