Application of Traditional Machine Learning Techniques for the Classification of Human DNA Sequences: A Comparative Study of Random Forest and XGBoost

Authors

  • Gregorius Airlangga, Atma Jaya Catholic University of Indonesia

DOI:

https://doi.org/10.32493/informatika.v9i1.39353

Keywords:

Machine Learning, DNA Sequence Classification, Random Forest, XGBoost, Genomic Data Analysis

Abstract

This study evaluates the performance of two ensemble machine learning models, Random Forest and XGBoost, in classifying human DNA sequences into seven functional classes. Using feature vectorization techniques, the research addresses the challenges of analyzing high-dimensional genomic data. Both models were trained and tested on a dataset of annotated human DNA sequences, with an emphasis on generalizability to new, unseen data. The Random Forest model achieved an accuracy of 87.98%, slightly outperforming the XGBoost model, which recorded an accuracy of 87.06%. These findings underscore the effectiveness of combining traditional machine learning techniques with careful data preprocessing for predictive modeling in genomics. The study enhances our understanding of genomic function and suggests robust methodologies for future genetic research, with potential applications in personalized medicine. The implications of these results for improving classification accuracy, and recommendations for integrating more complex algorithms, are also discussed.
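The abstract does not say which vectorization scheme was used. As a minimal sketch, assuming overlapping k-mer counting (a common choice for DNA sequence data, not confirmed as the study's method), the step of turning sequences into fixed-length numeric features might look like the following. The function names, the toy sequences, and the value of `k` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(sequence, k=6):
    """Split a DNA string into overlapping k-mers (lowercased)."""
    seq = sequence.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences, k=6):
    """Map every k-mer seen in the given sequences to a column index."""
    seen = sorted({km for s in sequences for km in kmers(s, k)})
    return {km: idx for idx, km in enumerate(seen)}


def vectorize(sequence, vocabulary, k=6):
    """Return a k-mer count vector; k-mers outside the vocabulary are ignored."""
    counts = Counter(kmers(sequence, k))
    vec = [0] * len(vocabulary)
    for km, n in counts.items():
        idx = vocabulary.get(km)
        if idx is not None:
            vec[idx] = n
    return vec


# Toy, made-up sequences (assumption: NOT taken from the study's dataset).
train = ["ATGCATGC", "GGGCATAA"]
vocab = build_vocabulary(train, k=3)
features = [vectorize(s, vocab, k=3) for s in train]
```

Each row of `features` has one column per k-mer in the vocabulary, so the resulting matrix could be passed, together with class labels, to any tabular classifier such as a Random Forest or XGBoost model. The vocabulary grows quickly with `k` (up to 4^k columns), which is one source of the high dimensionality the abstract refers to.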

Published

2024-03-30