Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Cell-type-specific gene regulation is critical for synthetic biology and gene therapy, yet existing sequence design methods struggle to simultaneously maximize on-target activity and minimize off-target leakage. Method: We propose the first constrained reinforcement learning (CRL) framework for controllable regulatory DNA sequence design. It integrates an autoregressive genomic language model with biological priors to construct an interpretable reward function guided by transcription factor binding site (TFBS) enrichment, and incorporates both hard and soft constraints optimized via proximal policy optimization (PPO). Contribution/Results: Our method achieves state-of-the-art cell-type specificity on human promoter and enhancer design tasks, outperforming existing generative models and RL baselines. Computationally validated sequences exhibit significant enrichment of authentic TFBSs, high functional compatibility, and strong biological plausibility, demonstrating both superior performance and mechanistic interpretability.

📝 Abstract
Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.
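
The constrained optimization described in the abstract can be written out as a rough sketch. The notation here is an assumption for illustration, not taken from the paper: $\pi_\theta$ is the autoregressive genomic LM used as the policy, $R_{\text{target}}$ is predicted regulatory activity in the target cell type, $R_c$ is predicted activity in off-target cell type $c$, and $\delta_c$ is a tolerated off-target level. One common way to handle such constraints is a Lagrangian relaxation with multipliers $\lambda_c \ge 0$; whether Ctrl-DNA uses exactly this form is not stated in the text above.

```latex
% Constrained design objective (illustrative notation)
\max_{\theta} \; \mathbb{E}_{x \sim \pi_\theta}\!\left[ R_{\text{target}}(x) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \pi_\theta}\!\left[ R_{c}(x) \right] \le \delta_{c}
\quad \text{for every off-target cell type } c .

% Lagrangian relaxation of the same problem
\min_{\lambda \ge 0} \; \max_{\theta} \;
\mathbb{E}_{x \sim \pi_\theta}\!\left[ R_{\text{target}}(x) \right]
- \sum_{c} \lambda_{c} \left( \mathbb{E}_{x \sim \pi_\theta}\!\left[ R_{c}(x) \right] - \delta_{c} \right)
```

In this form, a multiplier $\lambda_c$ grows while off-target activity in cell type $c$ exceeds $\delta_c$ and shrinks toward zero once the constraint is satisfied.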
Problem

Research questions and friction points this paper is trying to address.

Design regulatory DNA for precise cell-type-specific gene expression
Overcome the difficulty transformer-based LMs have in generating novel sequences with reliable cell-type-specific activity
Optimize sequences to maximize target activity and minimize off-target effects
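
The trade-off in the last point, maximizing target activity while limiting off-target activity, can be illustrated with a Lagrangian-penalized reward. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the per-cell-type thresholds, and the dual learning rate are all hypothetical, and the activity scores are assumed to come from some external predictor. Only the reward shaping and multiplier update are shown, not the policy update itself.

```python
from typing import List, Sequence


def lagrangian_reward(
    target: float,
    off_targets: Sequence[float],
    multipliers: Sequence[float],
    thresholds: Sequence[float],
) -> float:
    """Target activity minus multiplier-weighted constraint terms.

    Each off-target cell type c contributes lambda_c * (activity_c - threshold_c),
    so exceeding a threshold lowers the reward (hypothetical formulation).
    """
    penalty = sum(
        lam * (off - thr)
        for lam, off, thr in zip(multipliers, off_targets, thresholds)
    )
    return target - penalty


def update_multipliers(
    multipliers: Sequence[float],
    mean_off_targets: Sequence[float],
    thresholds: Sequence[float],
    lr: float = 0.1,
) -> List[float]:
    """Dual ascent step: raise a multiplier when its constraint is violated
    on average over a batch, lower it otherwise, and clip at zero."""
    return [
        max(0.0, lam + lr * (off - thr))
        for lam, off, thr in zip(multipliers, mean_off_targets, thresholds)
    ]
```

For example, with thresholds of 0.3 for two off-target cell types, a batch whose mean off-target activities are 0.5 and 0.2 would increase the first multiplier and decrease the second, shifting the penalty toward the cell type that is currently leaking.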
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained RL for regulatory DNA design
RL-based refinement of autoregressive genomic LMs
Generated sequences capture cell-type-specific TFBSs
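
To make the TFBS point concrete, the sketch below counts exact matches of a consensus motif on both strands of a sequence. It is only an illustration under simplifying assumptions: the helper names are hypothetical, the GATA core motif is used purely as an example, and real motif analyses typically score position weight matrices rather than exact strings.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_motif_hits(seq: str, motif: str) -> int:
    """Count windows matching the motif on either strand (overlaps allowed).

    A palindromic motif equals its own reverse complement, so each window
    is counted at most once.
    """
    seq = seq.upper()
    motif = motif.upper()
    patterns = {motif, reverse_complement(motif)}
    width = len(motif)
    hits = 0
    for start in range(len(seq) - width + 1):
        if seq[start:start + width] in patterns:
            hits += 1
    return hits


# Illustrative usage: one forward GATA site and one reverse-strand site (TATC).
example_hits = count_motif_hits("AGATAAGGTTATCT", "GATA")
```

Comparing such counts between generated sequences and a background set is one simple way to check whether motifs for a target cell type's transcription factors are enriched.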
Xingyu Chen
University of Toronto, Vector Institute for Artificial Intelligence, University Health Network
Shihao Ma
University of Toronto, Vector Institute
Machine Learning, Computation Biology, AI in healthcare
Runsheng Lin
University of Toronto
Jiecong Lin
Postdoctoral Research Fellow, Harvard Medical School/MGH/BCH and HKU
Computational biology, Deep learning, Regulatory genomics, Genome editing, AI4Science
Bo Wang
University of Toronto, Vector Institute for Artificial Intelligence, University Health Network