Phylo2Vec: a vector representation for binary trees

📅 2023-04-25
🏛️ Systematic Biology
📈 Citations: 3
Influential: 1
🤖 AI Summary
Problem: Binary phylogenetic trees are usually handled with separate data structures for manipulation and for representation, and searching tree space is computationally expensive. Method: Phylo2Vec is an encoding that maps any rooted binary phylogenetic tree with n leaves to a unique integer vector of length n − 1, and the mapping is reversible. The encoding is compact, supports fast tree sampling, lets two topologies be compared by comparing their vectors, and allows tree space to be traversed in large or small jumps. Combined with maximum-likelihood evaluation and a simple hill-climbing optimisation scheme, it can move from a random initial tree to a high-likelihood topology. Results: As a proof of concept on five real-world datasets, Phylo2Vec gives a more compact representation than Newick strings, and hill climbing efficiently reaches an optimal tree from a random start.
📝 Abstract
Binary phylogenetic trees inferred from biological data are central to understanding the shared history among evolutionary units. However, inferring the placement of latent nodes in a tree is computationally expensive. State-of-the-art methods rely on carefully designed heuristics for tree search, using different data structures for easy manipulation (e.g., classes in object-oriented programming languages) and readable representation of trees (e.g., Newick-format strings). Here, we present Phylo2Vec, a parsimonious encoding for phylogenetic trees that serves as a unified approach for both manipulating and representing phylogenetic trees. Phylo2Vec maps any binary tree with n leaves to a unique integer vector of length n - 1. The advantages of Phylo2Vec are fourfold: (i) fast tree sampling, (ii) compressed tree representation compared to a Newick string, (iii) quick and unambiguous verification of whether two binary trees are topologically identical, and (iv) a systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill-climbing-based optimisation scheme can efficiently traverse the vastness of tree space from a random to an optimal tree.
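To make advantages (i) and (iii) concrete, here is a minimal Python sketch of sampling and comparing such vectors. It rests on an assumption that this page does not state: that entry i of the vector (0-indexed) ranges over {0, ..., 2i}. Under that assumption the number of valid vectors of length n - 1 is 1 · 3 · 5 · ... · (2n - 3), which matches the known count of rooted binary trees on n labelled leaves. The function names are illustrative and are not taken from the authors' software; decoding a vector back into a tree is not shown.

```python
import random
from math import prod


def sample_vector(n_leaves, rng=random):
    """Draw a random vector of length n_leaves - 1.

    ASSUMPTION (not stated on this page): entry i lies in {0, ..., 2*i}.
    """
    return [rng.randint(0, 2 * i) for i in range(n_leaves - 1)]


def is_valid(v):
    """Check the assumed range constraint 0 <= v[i] <= 2*i for every entry."""
    return all(isinstance(x, int) and 0 <= x <= 2 * i for i, x in enumerate(v))


def count_vectors(n_leaves):
    """Number of valid vectors: the product of (2*i + 1) for i = 0..n_leaves-2."""
    return prod(2 * i + 1 for i in range(n_leaves - 1))


def same_topology(v1, v2):
    """With a unique vector per tree, topology equality is plain vector equality
    (both vectors must use the same leaf labelling)."""
    return list(v1) == list(v2)


if __name__ == "__main__":
    v = sample_vector(6)
    print(v, is_valid(v))
    print(count_vectors(4))  # 15 rooted binary trees on 4 labelled leaves
```

Sampling costs one random draw per entry, and the equality check is a linear scan over n - 1 integers, with no tree traversal or string canonicalisation involved.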
Problem

Research questions and friction points this paper is trying to address.

Efficient representation of binary phylogenetic trees
Fast computation of latent node placement
Unified manipulation and representation of tree structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phylo2Vec encodes trees as integer vectors
Enables fast tree sampling and manipulation
Facilitates efficient tree space traversal
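As an illustration of tree space traversal on integer vectors, the following sketch runs a generic steepest-ascent hill climb in which a move changes a single vector entry. It is not the authors' optimisation scheme. The entry range {0, ..., 2i} is an assumption not stated on this page, and `score` is a hypothetical stand-in for a likelihood evaluation, which in practice would require decoding the vector into a tree and scoring it against sequence data.

```python
def neighbours(v):
    """Yield every vector that differs from v in exactly one entry.

    ASSUMPTION: entry i ranges over {0, ..., 2*i}, so v[0] is fixed at 0.
    """
    for i in range(1, len(v)):
        for x in range(2 * i + 1):
            if x != v[i]:
                yield v[:i] + [x] + v[i + 1:]


def hill_climb(v, score, max_iters=100):
    """Steepest-ascent hill climbing: move to the best-scoring neighbour
    until no neighbour improves on the current vector."""
    best, best_score = list(v), score(v)
    for _ in range(max_iters):
        cand = max(neighbours(best), key=score, default=None)
        if cand is None or score(cand) <= best_score:
            break  # local optimum reached
        best, best_score = cand, score(cand)
    return best, best_score


if __name__ == "__main__":
    # Toy objective (hypothetical): negative Hamming distance to a hidden target.
    target = [0, 2, 1, 5, 3]
    toy_score = lambda u: -sum(a != b for a, b in zip(u, target))
    print(hill_climb([0, 0, 0, 0, 0], toy_score))
```

Changing one late entry can be a small or a large change to the underlying tree depending on the encoding, so a real search would likely combine moves of different sizes; the toy objective here only demonstrates the mechanics.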
Matthew J. Penn
Department of Statistics, University of Oxford, Oxford, United Kingdom
Neil Scheidwasser
University of Copenhagen
Deep learning, speech processing, phylogenetics, animal behavior, public health
M. Khurana
Section of Epidemiology, University of Copenhagen, Copenhagen, Denmark
D. Duchêne
Center for Evolutionary Hologenomics, University of Copenhagen, Copenhagen, Denmark
C. Donnelly
Pandemic Sciences Institute, University of Oxford, Oxford, United Kingdom
S. Bhatt
MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, United Kingdom