A Hybrid Model for Human DNA Sequence Classification Using Convolutional Neural Networks and Random Forests

Authors

  • Gregorius Airlangga, Universitas Katolik Indonesia Atma Jaya

DOI:

https://doi.org/10.32493/informatika.v9i2.39355

Keywords:

DNA classification, CNN, Random Forests, Hybrid models, Genomic data analysis

Abstract

Human DNA sequence classification is a fundamental task in genomics, essential for understanding genetic variations and their implications for disease susceptibility, personalized medicine, and evolutionary biology. This study proposes a novel hybrid model that combines Convolutional Neural Networks (CNN) for feature extraction with a Random Forest classifier for final classification. The model was evaluated on a dataset of human DNA sequences and achieved an accuracy of 75.34%. Performance metrics, including precision, recall, and F1-scores across multiple classes, showed significant improvements over traditional models. The CNN component effectively captures local dependencies and patterns within the sequences, while the Random Forest classifier handles complex decision boundaries, resulting in enhanced classification accuracy. Comparative analysis demonstrated the superiority of the hybrid approach: the CNN-LSTM model achieved only 59.47% accuracy, and other RNN-based models such as CNN-GRU and CNN-BiLSTM performed similarly poorly. These results suggest that hybrid models can leverage the strengths of both deep learning and traditional machine learning techniques, offering a more effective tool for DNA sequence classification. Future work will optimize the model architecture and explore larger, more diverse datasets to validate the approach's generalizability and robustness.
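
The feature-extraction stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one-hot encoding of the bases, uses hand-set motif filters (a hypothetical stand-in for learned CNN weights), and applies a single 1D convolution with ReLU and global max pooling. All function names are illustrative. In a hybrid pipeline of the kind described, the resulting feature vectors would then be passed to a Random Forest classifier (for example, scikit-learn's RandomForestClassifier), which is omitted here to keep the sketch dependency-free.

```python
# Illustrative sketch only: hand-set filters stand in for learned CNN weights.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats; unknown bases become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d_relu(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        activations.append(max(0.0, total))
    return activations


def global_max_pool(activations):
    """Keep the strongest response; 0.0 if the sequence is shorter than the kernel."""
    return max(activations) if activations else 0.0


def motif_kernel(motif):
    """Hypothetical filter whose response counts the bases matching the motif."""
    return one_hot(motif)


def extract_features(seq, kernels):
    """Return one pooled activation per kernel, i.e. a fixed-length feature vector."""
    encoded = one_hot(seq)
    return [global_max_pool(conv1d_relu(encoded, k)) for k in kernels]


kernels = [motif_kernel("TATA"), motif_kernel("GCGC")]
features = extract_features("ACGTATAAGC", kernels)
print(features)
```

Each sequence, whatever its length, maps to one value per filter, which is what lets a tree-based classifier consume the output of a convolutional stage directly.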

Published

2024-07-30