$p$-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses alignment-free genomic sequence classification, where the challenge is to integrate hierarchical sequence structure with local compositional information. The authors propose the pVR framework, which uses a $p$-adic distance on $k$-mer prefixes to capture hierarchical positional structure and an $L_1$ distance on $k$-mer frequencies to encode local composition. Together the two distances parameterise a bi-filtered Vietoris–Rips complex from which topological features are extracted. The work combines $p$-adic metrics with multiparameter topological data analysis and provides theoretical guarantees: stability under metric perturbations and invariance to the choice of prime. It also shows that a single $p$-adic filtration is topologically uninformative, whereas the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks, pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to 21 percentage points, and it outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by 6.7 to 11.4 percentage points on three low-sample benchmarks. It underperforms only on a SARS-CoV-2 variant benchmark, and all methods saturate in the large-sample regime.
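
The triviality of a single $p$-adic axis can be motivated by a standard fact about ultrametric spaces. The sketch below is a generic argument, not necessarily the proof given in the paper; the symbols $X$, $d_p$, $r$ and $\sim_r$ are notation introduced here for illustration. The $p$-adic distance satisfies the strong triangle inequality

$$
d_p(x, z) \;\le\; \max\bigl(d_p(x, y),\, d_p(y, z)\bigr),
$$

so for any scale $r \ge 0$ the relation $x \sim_r y \iff d_p(x, y) \le r$ is transitive, and hence an equivalence relation on the point set $X$. The Vietoris–Rips complex at scale $r$ is then a disjoint union of full simplices, one per equivalence class:

$$
\mathrm{VR}_r(X, d_p) \;=\; \bigsqcup_{C \,\in\, X/\sim_r} \Delta^{|C| - 1},
\qquad
H_n\bigl(\mathrm{VR}_r(X, d_p)\bigr) = 0 \quad \text{for all } n \ge 1 .
$$

Only $H_0$, which counts the classes, can change along such a filtration. A second, non-ultrametric axis such as the $L_1$ distance is what allows cycles in degree $1$ and above to appear.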

📝 Abstract

We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structure, and a compositional $L_1$ distance on $k$-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris--Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single $p$-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks ($28$ to $500$ sequences, $3$ to $7$ classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to $21$ percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by $6.7$ to $11.4$ percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.
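
The two distances described above can be illustrated with a minimal Python sketch. It is not taken from the pVR codebase. The nucleotide-to-digit map `DIGITS`, the base-$p$ integer encoding of the prefix, the default prime `p=5`, the default $k$ values, and all function names are assumptions made here for illustration. Under this encoding, two prefixes that agree on their first $v$ bases and differ on the next one are at $p$-adic distance $p^{-v}$.

```python
from collections import Counter
from itertools import product

# Hypothetical digit encoding (an assumption, not the paper's): any injective
# map from nucleotides into 0..p-1 works, which requires a prime p > 3.
DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def prefix_to_int(seq, k, p=5):
    """Encode the first k bases as a base-p integer, first base = lowest digit."""
    return sum(DIGITS[base] * p**i for i, base in enumerate(seq[:k]))


def padic_distance(x, y, p=5):
    """Return |x - y|_p = p^(-v), where p^v is the largest power of p dividing x - y."""
    diff = abs(x - y)
    if diff == 0:
        return 0.0
    v = 0
    while diff % p == 0:
        diff //= p
        v += 1
    return float(p) ** -v


def kmer_frequencies(seq, k):
    """Relative frequency of each of the 4^k k-mers; assumes len(seq) >= k."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(kmer)] / total for kmer in product("ACGT", repeat=k)]


def l1_distance(u, v):
    """L1 (Manhattan) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def bifiltration_edges(seqs, k_prefix=4, k_comp=2, p=5):
    """Map each pair (i, j) to the bi-grade (d_padic, d_l1) at which its edge appears."""
    codes = [prefix_to_int(s, k_prefix, p) for s in seqs]
    freqs = [kmer_frequencies(s, k_comp) for s in seqs]
    edges = {}
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            edges[(i, j)] = (
                padic_distance(codes[i], codes[j], p),
                l1_distance(freqs[i], freqs[j]),
            )
    return edges
```

Calling `bifiltration_edges(["ACGTAC", "ACGGAC", "TCGTAC"])` returns one pair of thresholds per edge. In a bi-filtered Vietoris–Rips complex, an edge is present at parameters $(r_1, r_2)$ when both of its distances are at most the corresponding threshold. Computing homology over the resulting two-parameter family would need a multiparameter persistence library, which this sketch does not cover.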

Problem

Research questions and friction points this paper is trying to address.

genomic sequence classification

alignment-free

topological machine learning

p-adic distance

k-mer

Innovation

Methods, ideas, or system contributions that make the work stand out.

p-adic numbers

bi-filtration

topological data analysis