A Hybrid Model for Human DNA Sequence Classification Using Convolutional Neural Networks and Random Forests
DOI:
https://doi.org/10.32493/informatika.v9i2.39355
Keywords:
DNA classification, CNN, Random Forests, Hybrid models, Genomic data analysis
Abstract
Human DNA sequence classification is a fundamental task in genomics, essential for understanding genetic variations and their implications for disease susceptibility, personalized medicine, and evolutionary biology. This study proposes a hybrid model that combines Convolutional Neural Networks (CNN) for feature extraction with a Random Forest classifier for final classification. Evaluated on a dataset of human DNA sequences, the model achieved an accuracy of 75.34%. Performance metrics, including precision, recall, and F1-scores across multiple classes, improved significantly over traditional models. The CNN component captures local dependencies and patterns within the sequences, while the Random Forest classifier handles complex decision boundaries, resulting in enhanced classification accuracy. Comparative analysis demonstrated the superiority of the hybrid approach: the CNN-LSTM model achieved only 59.47% accuracy, and other RNN-based models such as CNN-GRU and CNN-BiLSTM reached similarly low accuracy. These results suggest that hybrid models can leverage the strengths of both deep learning and traditional machine learning techniques, offering a more effective tool for DNA sequence classification. Future work will optimize the model architecture and explore larger, more diverse datasets to validate the approach's generalizability and robustness.
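The feature-extraction stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-written motif filters stand in for kernels that a trained CNN would learn, and the function names, filters, and example sequences are all hypothetical. The sketch one-hot encodes each sequence, slides each filter along it (a 1D convolution with ReLU), and keeps the strongest response per filter (global max pooling), which yields a fixed-length feature vector per sequence.

```python
# Illustrative sketch only: hand-written filters stand in for learned CNN kernels.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as 4-element one-hot rows (A, C, G, T).
    Any other symbol (e.g. N) becomes an all-zero row."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def motif_filter(motif):
    """Build a filter that responds most strongly to an exact motif match."""
    return one_hot(motif)

def conv1d_relu(encoded, kernel):
    """Slide the kernel along the sequence (no padding) and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(len(BASES))
        )
        activations.append(max(0.0, total))
    return activations

def extract_features(seq, kernels):
    """Global max pooling: one feature per filter, the strongest activation.
    Sequences shorter than a filter contribute 0.0 for that filter."""
    encoded = one_hot(seq)
    return [max(conv1d_relu(encoded, k), default=0.0) for k in kernels]

# Hypothetical filters and sequences, chosen only for illustration.
kernels = [motif_filter("ATG"), motif_filter("TATA")]
sequences = ["CCATGCC", "GGTATAGG", "CCCCCCC"]
features = [extract_features(s, kernels) for s in sequences]
print(features)
```

In a hybrid setup of the kind the abstract describes, vectors like these would come from the trained convolutional layers and would then be passed, together with the class labels, to a Random Forest implementation such as scikit-learn's `RandomForestClassifier`, which would make the final class decision.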
License
Copyright (c) 2024 Gregorius Airlangga
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.