A hybrid multi-scale CNN-LSTM deep learning model for the identification of protein-coding regions in DNA sequences

Article type: Research article

Authors

Bioelectric Department, Faculty of Biomedical Engineering, Sahand University of Technology, Tabriz, Iran

Abstract

Identifying the exact location of exons in a DNA sequence is an important research problem in bioinformatics. Previous signal processing techniques are limited in accuracy and robustness when locating exons precisely. To address these limitations, this study proposes a deep learning method that comprises a new preprocessing step, a new mapping method, and a modified, hybrid multi-scale deep neural network. The preprocessing step enables the network to accept genes of any length, which are then encoded using the new mapping method. The multi-scale network combines an embedding layer, a modified CNN, and an LSTM network. The HMR195, BG570, and F56F11.4 datasets are used to compare this work with previous studies. The proposed method achieves accuracies of 0.982, 0.966, and 0.965 on HMR195, BG570, and F56F11.4, respectively. These results indicate that the proposed hybrid multi-scale CNN-LSTM network is effective and outperforms the previous methods it was compared with.
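The abstract does not specify the mapping or the evaluation formula, so the following is only a minimal sketch of how variable-length genes could be prepared for an embedding layer and how a nucleotide-level accuracy could be computed. The index vocabulary, the padding scheme, and all function names (`encode_sequence`, `pad_batch`, `nucleotide_accuracy`) are assumptions made for illustration and are not the authors' method.

```python
from typing import List, Sequence

# Hypothetical vocabulary (an assumption, not the paper's mapping):
# 0 is reserved for padding and 5 for any symbol other than A, C, G, T.
NUCLEOTIDE_TO_INDEX = {"A": 1, "C": 2, "G": 3, "T": 4}
PAD_INDEX = 0
UNKNOWN_INDEX = 5


def encode_sequence(dna: str) -> List[int]:
    """Map each nucleotide to an integer index suitable for an embedding lookup."""
    return [NUCLEOTIDE_TO_INDEX.get(base, UNKNOWN_INDEX) for base in dna.upper()]


def pad_batch(encoded: Sequence[Sequence[int]]) -> List[List[int]]:
    """Right-pad every sequence with PAD_INDEX to the longest length in the batch."""
    max_len = max((len(seq) for seq in encoded), default=0)
    return [list(seq) + [PAD_INDEX] * (max_len - len(seq)) for seq in encoded]


def nucleotide_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of positions whose exon (1) / non-exon (0) label is correct."""
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have the same length")
    if len(actual) == 0:
        raise ValueError("label sequences must not be empty")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


if __name__ == "__main__":
    # Two genes of different lengths end up in one rectangular batch.
    batch = pad_batch([encode_sequence("ATGCCN"), encode_sequence("atg")])
    print(batch)
    # Three of four positions are labelled correctly.
    print(nucleotide_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```

Reserving index 0 for padding is a common convention because many embedding layers can be told to ignore that index. Whether the proposed preprocessing pads, windows, or segments the genes is not stated in the abstract.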

Authors [English]

  • A. Darvish
  • S. Shamekhi
Bioelectric Department, Faculty of Biomedical Engineering, Sahand University of Technology, Tabriz, Iran

Keywords [English]

  • Deep learning
  • DNA sequences
  • CNN
  • LSTM
  • Multi-scale
  • Protein coding region