🤖 AI Summary
Generating functional, variable-length biological sequences under strict evolutionary and biophysical constraints remains difficult. This work builds on the Discrete Flow Matching (DFM) framework and proposes a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without changing the flow objective or training procedure. To model variable sequence length, the method introduces a latent edit-based rate parameterization in which edit operations are conditioned on a shared global latent. For controllable generation, it adds latent classifier-free guidance, which steers generation in continuous latent space, and Dirichlet-prior temperature scaling for test-time control over edit operations. The authors report state-of-the-art performance on density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation. The method is also reported to improve the functionality and diversity of the generated sequences.
📝 Abstract
Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.
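As a rough illustration of the two test-time controls named above, the sketch below shows the standard classifier-free guidance extrapolation applied to a latent vector, and generic temperature scaling of a categorical distribution over edit operations. It is not the paper's implementation: the abstract does not specify how the Dirichlet prior enters the scaling, so plain temperature scaling is swapped in here, and the function names `cfg_combine` and `temperature_scale`, the example vectors, and the three edit-operation probabilities are all hypothetical.

```python
import math


def cfg_combine(z_uncond, z_cond, weight):
    """Standard classifier-free guidance extrapolation on latent vectors.

    weight = 0 returns the unconditional latent, weight = 1 the conditional
    one, and weight > 1 extrapolates past the conditional latent.
    (Assumption: the paper's latent guidance may differ in detail.)
    """
    return [u + weight * (c - u) for u, c in zip(z_uncond, z_cond)]


def temperature_scale(probs, temperature):
    """Generic temperature scaling of a categorical distribution.

    temperature < 1 sharpens the distribution, temperature > 1 flattens it.
    (Assumption: stands in for the paper's Dirichlet-prior variant.)
    """
    logits = [math.log(p) / temperature for p in probs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical latents from an unconditional and a conditional pass.
z = cfg_combine([0.0, 1.0], [1.0, 3.0], weight=2.0)

# Hypothetical probabilities for three edit operations
# (for example substitute, insert, delete).
edit_probs = temperature_scale([0.5, 0.3, 0.2], temperature=0.5)
```

With a guidance weight of 2.0 the guided latent lies beyond the conditional one, and a temperature of 0.5 concentrates probability mass on the most likely edit operation.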