Application of Traditional Machine Learning Techniques for the Classification of Human DNA Sequences: A Comparative Study of Random Forest and XGBoost
DOI: https://doi.org/10.32493/informatika.v9i1.39353
Keywords: Machine Learning, DNA Sequence Classification, Random Forest, XGBoost, Genomic Data Analysis
Abstract
This study evaluates the performance of two traditional machine learning models, Random Forest and XGBoost, in classifying human DNA sequences into seven functional classes. Using feature vectorization techniques, the research addresses the challenges of analyzing high-dimensional genomic data. Both models were trained and tested on a dataset of annotated human DNA sequences, with an emphasis on generalizability to new, unseen data. The Random Forest model achieved an accuracy of 87.98%, slightly outperforming the XGBoost model, which recorded an accuracy of 87.06%, a difference of 0.92 percentage points. These findings underscore the effectiveness of combining traditional machine learning techniques with careful data preprocessing for predictive modeling in genomics. The study suggests methodologies for future genetic research and potential applications in personalized medicine. The implications of these results for improving classification accuracy, and recommendations for integrating more complex algorithms, are also discussed.
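The abstract does not say which feature vectorization scheme was used. As a minimal sketch, assuming overlapping k-mer counting (a common choice for DNA sequences, not confirmed by the source), the step that turns variable-length sequences into fixed-length feature vectors might look like the following. The function names `kmer_counts` and `vectorize`, the value `k=3`, and the two toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers, skipping windows that contain non-ACGT symbols."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def vectorize(sequences, k=3):
    """Map each sequence to a count vector over all 4**k possible k-mers."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    matrix = []
    for seq in sequences:
        counts = kmer_counts(seq, k)
        matrix.append([counts[kmer] for kmer in vocabulary])
    return vocabulary, matrix


# Toy input (hypothetical); the second sequence contains an ambiguous base "N".
sequences = ["ATGCATGCA", "GGGCCCNTTA"]
vocabulary, X = vectorize(sequences, k=3)
```

Each row of `X` has the same length regardless of sequence length, so it could be passed, together with class labels, to a classifier such as scikit-learn's `RandomForestClassifier` or XGBoost's `XGBClassifier`. Note that the feature dimension grows as 4**k, which is one source of the high dimensionality the abstract mentions.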
License
Copyright (c) 2024 Gregorius Airlangga
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
Jurnal Informatika Universitas Pamulang has adopted CC-BY-NC or an equivalent license as the optimal license for the publication, distribution, use, and reuse of scholarly work.
In developing strategy and setting priorities, Jurnal Informatika Universitas Pamulang recognizes that free access is better than priced access, libre access is better than free access, and libre under CC-BY-NC or the equivalent is better than libre under more restrictive open licenses. We should achieve what we can when we can. We should not delay achieving free in order to achieve libre, and we should not stop with free when we can achieve libre.
Jurnal Informatika Universitas Pamulang is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
YOU ARE FREE TO:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material, for noncommercial purposes only.
- The licensor cannot revoke these freedoms as long as you follow the license terms.