Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of standard flow models: they are constrained by their training data distribution and struggle to generate valid yet out-of-distribution designs, such as new molecules. To overcome this, the authors propose Active Flow Expansion (ActFlow), a continued pre-training method that uses verifier feedback in an active learning loop, iteratively generating synthetic data in the representation space of a pretrained flow model and adapting to it, thereby expanding the model's generative support. The authors also present what they describe as, to their knowledge, first-of-their-kind statistical learning guarantees for out-of-distribution flow modeling, viewing the generative model through its generable set and analyzing set expansion as a local-to-global reachability process in representation space. Evaluated on small organic molecules, mid-sized drug-like molecules, therapeutic peptides, and protein sequence design, ActFlow is reported to expand valid coverage far beyond the initial pre-trained model and to significantly outperform widely adopted synthetic flow pre-training methods.
📝 Abstract
Standard flow and diffusion pre-training matches the distribution of available data (e.g., molecules), which often covers only a small fraction of the valid design space. In generative discovery, however, one aims to sample valid new-to-nature designs, which are assigned negligible probability under, and are thus inaccessible to, standard models fitted to the observed data. To overcome this limitation, we depart from data distribution matching and view a generative model through its generable set: the region it covers with non-negligible probability. This allows us to introduce a new learning principle for out-of-distribution flow modeling: enlarging a model's generable set to increase coverage of the valid design space. We propose Active Flow Expansion (ActFlow), a continued pre-training method that employs verifier feedback to expand a pre-trained model over new valid regions by iteratively adapting to synthetic data generated through active exploration in the learned flow representation. Theoretically, we establish, to our knowledge, first-of-their-kind statistical learning guarantees for out-of-distribution flow modeling, analyzing generable set expansion as a local-to-global reachability process over a learned representation. Empirically, we assess ActFlow with suitable out-of-distribution generative modeling metrics across small organic molecules, mid-sized drug-like molecules, therapeutic peptides, and protein sequence design tasks. Results show that ActFlow expands valid coverage far beyond the region modeled by the initial pre-trained model, significantly outperforming widely adopted synthetic flow pre-training methods.
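To make the generate, verify, adapt loop described in the abstract concrete, here is a minimal toy sketch in one dimension. It is not the authors' ActFlow algorithm. Everything in it is an illustrative assumption: the "model" is a set of stored points sampled with Gaussian noise, the `verifier` is a hypothetical oracle that accepts designs in [0, 10], the perturbation scale stands in for exploration in a learned representation, and `coverage` is a simple grid-based proxy for how much of the valid space the generable set reaches.

```python
import random

def verifier(x: float) -> bool:
    """Hypothetical validity oracle: designs in [0, 10] count as valid."""
    return 0.0 <= x <= 10.0

def sample(support, scale, rng):
    """Toy 'model': pick a stored point and perturb it with Gaussian noise."""
    return rng.choice(support) + rng.gauss(0.0, scale)

def coverage(support, radius=0.25, grid_size=101):
    """Fraction of a grid over [0, 10] within `radius` of some support point.

    Assumed proxy for the share of the valid space inside the generable set.
    """
    grid = [10.0 * i / (grid_size - 1) for i in range(grid_size)]
    hit = sum(any(abs(g - s) <= radius for s in support) for g in grid)
    return hit / grid_size

def expand(support, rounds=20, batch=50, explore_scale=0.5, seed=0):
    """Iteratively propose samples, keep verified ones, and 'retrain' on them."""
    rng = random.Random(seed)
    support = list(support)
    for _ in range(rounds):
        # Explore: sample around the current support with some noise.
        proposals = [sample(support, explore_scale, rng) for _ in range(batch)]
        # Verify: keep only proposals the oracle accepts.
        accepted = [x for x in proposals if verifier(x)]
        # Adapt: in this toy, 'continued pre-training' is adding points.
        support.extend(accepted)
    return support

# Initial data covers only [4, 5], a small part of the valid interval [0, 10].
initial = [4.0 + 0.1 * i for i in range(11)]
expanded = expand(initial)
print("coverage before:", coverage(initial))
print("coverage after: ", coverage(expanded))
```

Each round reaches only slightly beyond the current support, yet repeated rounds spread the support across the valid interval. That is one way to picture the local-to-global reachability idea, under the toy assumptions above.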
Problem

Research questions and friction points this paper is trying to address.

out-of-distribution discovery
generative modeling
flow models
design space coverage
molecular generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Flow Expansion
Out-of-Distribution Discovery
Generable Set
Flow Modeling
Verifier-Guided Learning