🤖 AI Summary
This paper targets two weaknesses in deep learning-based representation of biological sequences (proteins, DNA, SMILES): numerical encodings adapt poorly to deep models, and Chaos Game Representation (CGR) images lose information because they are sparse. It proposes what the authors describe as the first sequence-to-image transformation paradigm grounded in Bézier curves. The method maps discrete sequence elements onto continuous, smooth parametric curves in 2D space, encoding sequential order, local motifs, and long-range dependencies. The resulting images have higher spatial structural density and higher sequence fidelity than CGR images. Combined with convolutional neural networks (CNNs) for end-to-end learning on multi-source biological sequence classification tasks, the approach outperforms both standard CGR and conventional numerical encoding baselines across multiple benchmark datasets. The empirical results support its effectiveness and generalizability in downstream applications, including disease diagnosis and drug discovery.
📝 Abstract
The analysis of sequences (e.g., protein, DNA, and SMILES strings) is essential for disease diagnosis, biomaterial engineering, genetic engineering, and drug discovery. Conventional analytical methods transform sequences into numerical representations so that machine learning or deep learning (DL) models can be applied to sequence characterization. However, their efficacy is limited because DL models tend to perform suboptimally on tabular data. An alternative group of methods converts biological sequences into images using Chaos Game Representation (CGR). A notable drawback of these methods is that they map individual sequence elements onto a relatively small subset of pixels in the generated image. The resulting sparse image may not capture the full sequence information, which can lead to suboptimal predictions. In this study, we introduce a novel approach that transforms sequences into images using Bézier curves for element mapping. Mapping the elements onto a curve enriches the sequence information represented in the image and hence yields better DL-based classification performance. We validate the approach on several sequence datasets and classification tasks, and the results show that the Bézier curve method performs well on all of them.
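To make the density argument concrete, the following is a minimal sketch of evaluating a Bézier curve in its standard Bernstein form and rasterizing it onto a pixel grid. It is not the paper's implementation: the abstract does not say how sequence elements are turned into control points, so the control points `ctrl`, the grid size, the sample count, and the helper names `bezier_point` and `rasterize` are all assumptions chosen for illustration.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] using the Bernstein form."""
    n = len(control_points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        # Bernstein basis weight of the i-th control point
        w = comb(n, i) * (1 - t) ** (n - i) * t ** i
        x += w * px
        y += w * py
    return x, y


def rasterize(control_points, size=32, samples=200):
    """Mark every pixel that the sampled curve passes through.

    Control points are assumed to lie in the unit square; grid size and
    sample count are illustrative choices, not values from the paper.
    """
    image = [[0] * size for _ in range(size)]
    for k in range(samples + 1):
        x, y = bezier_point(control_points, k / samples)
        col = min(size - 1, max(0, int(x * (size - 1) + 0.5)))
        row = min(size - 1, max(0, int(y * (size - 1) + 0.5)))
        image[row][col] = 1
    return image


# Hypothetical control points standing in for encoded sequence elements.
ctrl = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
img = rasterize(ctrl)
lit = sum(map(sum, img))  # number of pixels the curve touches
```

With four control points, a point-per-element mapping would light at most four pixels, whereas the sampled curve touches a connected run of pixels between them. That is the intuition behind the claim that curve-based images are denser than CGR images, under the assumptions stated above.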