Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis

πŸ“… 2025-07-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the challenges of capturing long-range dependencies and enforcing global transition consistency when modeling long nucleic acid sequences with Transformers, this paper proposes CARMANIA. The framework introduces a context-aware Markov regularization mechanism: a transition-matrix (TM) loss, guided by n-gram statistics derived from each input sequence, explicitly constrains the global consistency of sequence state transitions, jointly preserving local contextual information and evolutionary/functional structural patterns. CARMANIA is pretrained self-supervised within a fixed context window, jointly optimizing standard next-token prediction and the TM loss. Evaluated on 40 genomic tasks, the TM loss improves accuracy on 33 of them; CARMANIA outperforms the best long-context baseline by at least 7 percent, with enhancer prediction gaining up to 34 percentage points in MCC. Inference is also roughly 2.5Γ— faster than the baseline.

πŸ“ Abstract
Transformers have revolutionized nucleotide sequence analysis, yet capturing long-range dependencies remains challenging. Recent studies show that autoregressive transformers often exhibit Markovian behavior by relying on fixed-length context windows for next-token prediction. However, standard self-attention mechanisms are computationally inefficient for long sequences due to their quadratic complexity and do not explicitly enforce global transition consistency. We introduce CARMANIA (Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis), a self-supervised pretraining framework that augments next-token (NT) prediction with a transition-matrix (TM) loss. The TM loss aligns predicted token transitions with empirically derived n-gram statistics from each input sequence, encouraging the model to capture higher-order dependencies beyond local context. This integration enables CARMANIA to learn organism-specific sequence structures that reflect both evolutionary constraints and functional organization. We evaluate CARMANIA across diverse genomic tasks, including regulatory element prediction, functional gene classification, taxonomic inference, antimicrobial resistance detection, and biosynthetic gene cluster classification. CARMANIA outperforms the previous best long-context model by at least 7 percent, matches state-of-the-art on shorter sequences (exceeding prior results on 20 out of 40 tasks while running approximately 2.5 times faster), and shows particularly strong improvements on enhancer and housekeeping gene classification tasks, including up to a 34 percent absolute gain in Matthews correlation coefficient (MCC) for enhancer prediction. The TM loss boosts accuracy in 33 of 40 tasks, especially where local motifs or regulatory patterns drive prediction.
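The abstract's core mechanism is aligning predicted token transitions with n-gram statistics computed empirically from each input sequence. As a concrete illustration of the empirical side only, here is a minimal Python sketch (not the authors' code) that estimates a row-stochastic bigram transition matrix from a single nucleotide sequence; the 4-letter vocabulary, the pseudo-count `eps`, the bigram (order-1) choice, and the function name are our assumptions for illustration.

```python
# Minimal sketch: empirical bigram transition matrix from one DNA sequence.
# Vocabulary, smoothing, and n-gram order are illustrative assumptions,
# not necessarily the paper's exact configuration.
import numpy as np

NUCLEOTIDES = "ACGT"
IDX = {nt: i for i, nt in enumerate(NUCLEOTIDES)}

def empirical_transition_matrix(seq: str, eps: float = 1e-8) -> np.ndarray:
    """Row-stochastic 4x4 matrix of P(next | current) from bigram counts."""
    counts = np.full((4, 4), eps)  # small pseudo-count avoids empty rows
    for a, b in zip(seq, seq[1:]):
        counts[IDX[a], IDX[b]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

print(empirical_transition_matrix("ACGTACGTAACCGGTT").round(2))
```

Because the matrix is computed per input sequence, the regularization target adapts to each organism's local transition statistics rather than to a single corpus-wide prior.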
Problem

Research questions and friction points this paper is trying to address.

Improving long-range dependency capture in nucleotide sequence analysis
Enhancing computational efficiency of self-attention for long sequences
Integrating transition-matrix loss for higher-order dependency learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates transition-matrix loss with self-attention
Aligns predicted token transitions with empirical n-gram statistics (see the sketch after this list)
Improves accuracy and speed in genomic tasks
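To show how this alignment could enter training alongside next-token prediction, here is a hedged PyTorch sketch: it derives a model-implied transition matrix by averaging each position's predicted next-token distribution over the current nucleotide, then penalizes divergence from the empirical matrix built above. The averaging scheme, the KL-divergence choice, and the weight `lambda_tm` are our assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a TM loss added to next-token training (assumed form).
import torch
import torch.nn.functional as F

def tm_loss(logits: torch.Tensor, tokens: torch.Tensor, tm_emp: torch.Tensor) -> torch.Tensor:
    """
    logits: (L, 4) next-token logits for one sequence
    tokens: (L,) integer-encoded nucleotides (0..3), aligned with logits
    tm_emp: (4, 4) empirical row-stochastic transition matrix
    """
    probs = logits.softmax(dim=-1)                      # predicted next-token dists
    one_hot = F.one_hot(tokens, num_classes=4).float()  # (L, 4)
    # Average predicted next-token distributions over positions sharing the
    # same current nucleotide -> a model-implied 4x4 transition matrix.
    # (For simplicity we ignore that the final position predicts past the end.)
    counts = one_hot.sum(dim=0).clamp(min=1.0)          # occurrences per nucleotide
    tm_pred = (one_hot.T @ probs) / counts.unsqueeze(1)
    # Align model-implied transitions with the empirical n-gram statistics.
    return F.kl_div(tm_pred.clamp(min=1e-8).log(), tm_emp, reduction="batchmean")

# total = nt_loss + lambda_tm * tm_loss(logits, tokens, tm_emp)
# where lambda_tm is a hypothetical regularization weight.
```

The design intuition, as described in the abstract, is that the next-token term handles local context while the TM term pushes the model's aggregate transition behavior toward sequence-level statistics, capturing higher-order structure without extra attention cost.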
Mohammadsaleh Refahi
Drexel University, Philadelphia, PA
Mahdi Abavisani
Dataminr
AI Β· Machine Learning Β· Computer Vision Β· Natural Language Processing Β· Deep Learning
Bahrad A. Sokhansanj
Drexel University, Philadelphia, PA
James R. Brown
Drexel University, Philadelphia, PA
Gail Rosen
Professor of ECE, Drexel University
Bioinformatics Β· Metagenomics Β· Genomic Signal Processing