🤖 AI Summary
Sparse flow decomposition (SFD) for inferring transcript composition and abundance from directed acyclic graph (DAG)-structured RNA-seq data remains computationally challenging.
Method: We propose a continuous optimization framework for SFD based on the geometry of the flow polytope. SFD is formulated as a fitting problem on the conic hull of the flow polytope, then reformulated on the polytope itself for compactness and solved with specific variants of the Frank-Wolfe algorithm. The method converges rapidly and yields parsimonious decompositions without explicit cardinality constraints. It accommodates both exact and inexact flow inputs and can model different priors on the error distribution.
Results: Experiments show transcript reconstruction accuracy that is highly competitive with integer programming and other combinatorial methods, at runtimes one to two orders of magnitude lower. The approach also scales to large graphs and tolerates noisy input flows.
📝 Abstract
Decomposing a flow on a Directed Acyclic Graph (DAG) into a weighted sum of a small number of paths is an essential task in operations research and bioinformatics. This problem, referred to as Sparse Flow Decomposition (SFD), has gained significant interest, in particular for its application in RNA transcript multi-assembly, that is, the identification of the multiple transcripts corresponding to a given gene and their relative abundance. Several recent approaches cast SFD variants as integer optimization problems, motivated by the NP-hardness of the formulations they consider. We propose an alternative formulation of SFD as a fitting problem on the conic hull of the flow polytope. By reformulating the problem on the flow polytope for compactness and solving it using specific variants of the Frank-Wolfe algorithm, we obtain a method that converges rapidly to the minimizer of the chosen loss function while producing a parsimonious decomposition. Our approach subsumes previous formulations of SFD with exact and inexact flows and can model different priors on the error distributions. Computational experiments show that our method outperforms recent integer optimization approaches in runtime and is also highly competitive in terms of reconstruction of the underlying transcripts, despite not explicitly minimizing the solution cardinality.
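To make the idea of fitting a flow over the flow polytope concrete, the following is a minimal sketch and not the authors' implementation. It uses plain Frank-Wolfe with exact line search rather than the specific variants the abstract refers to, and it makes several assumptions that are not stated in the source: nodes are labeled 0..n-1 in topological order with source 0 and sink n-1, the loss is the squared error, and the total flow is taken to be the flow leaving the source, so that the conic problem reduces to a problem on the unit flow polytope. The function names `shortest_path_lmo` and `frank_wolfe_decompose` are hypothetical. The key point is that the linear minimization step over the flow polytope is a shortest-path computation, so every iteration adds at most one path to the decomposition.

```python
import numpy as np

def shortest_path_lmo(n, edges, cost):
    """Min-cost path from node 0 to node n-1 in a DAG (assumes u < v on every edge).

    Returns the path as a tuple of edge indices. Negative costs are fine
    because the graph is acyclic.
    """
    dist = [float("inf")] * n
    pred = [None] * n
    dist[0] = 0.0
    # Relax edges in order of tail node, so each tail is final before it is used.
    for idx in sorted(range(len(edges)), key=lambda i: edges[i][0]):
        u, v = edges[idx]
        if dist[u] + cost[idx] < dist[v]:
            dist[v] = dist[u] + cost[idx]
            pred[v] = idx
    path, node = [], n - 1
    while node != 0:  # walk back from sink to source
        path.append(pred[node])
        node = edges[pred[node]][0]
    return tuple(reversed(path))

def frank_wolfe_decompose(n, edges, flow, iters=200, tol=1e-10):
    """Fit `flow` by a weighted sum of source-sink paths with vanilla Frank-Wolfe.

    Minimizes 0.5 * ||total * y - flow||^2 over the unit flow polytope, where
    `total` (assumption) is the flow leaving the source. Returns {path: weight}.
    """
    flow = np.asarray(flow, dtype=float)
    total = sum(flow[i] for i, (u, _) in enumerate(edges) if u == 0)

    def indicator(path):
        vec = np.zeros(len(edges))
        vec[list(path)] = 1.0
        return vec

    start = shortest_path_lmo(n, edges, -flow)  # start from the heaviest path
    weights = {start: 1.0}                      # convex weights over paths
    y = indicator(start)
    for _ in range(iters):
        grad = total * (total * y - flow)
        path = shortest_path_lmo(n, edges, grad)  # linear minimization oracle
        d = indicator(path) - y
        gap = -grad @ d                           # Frank-Wolfe gap
        if gap <= tol:
            break
        step = min(1.0, gap / (total**2 * (d @ d)))  # exact line search
        weights = {p: (1.0 - step) * w for p, w in weights.items()}
        weights[path] = weights.get(path, 0.0) + step
        y = y + step * d
    return {p: total * w for p, w in weights.items() if w > 1e-9}

# Toy instance: two parallel routes carrying 3 and 2 units of flow.
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
toy_flow = [3.0, 2.0, 3.0, 2.0]
print(frank_wolfe_decompose(4, toy_edges, toy_flow))
```

On this toy instance the sketch recovers the two paths with weights of approximately 3 and 2 after a single Frank-Wolfe step. On larger graphs, plain Frank-Wolfe can converge slowly and accumulate many low-weight paths, which is presumably one reason the paper relies on specialized variants.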