Determine the Classification of COVID-19 by Combining the Encoding of Amino Acids with Machine-Learning Models

Anurag Golwalkar, Abhay Kothari

Abstract

This study combines amino acid encoding with machine-learning models to improve the classification accuracy of COVID-19 virus strains. Because the virus continues to evolve, diagnostic tools must be robust and adaptable enough to capture the genetic variation that underlies transmission and virulence. Existing classification methods often overlook amino acid sequences as predictive biomarkers, and this study addresses that gap. Two feature selection methods, Information Gain (IG) and Analysis of Variance (ANOVA), are used to extract the most informative features from the amino acid sequences. This step both exposes the sequences' predictive capacity and reduces computational complexity, making model training and validation more efficient. The dataset, obtained from the National Genomics Data Center (NGDC), comprises amino acid sequences associated with various COVID-19 strains, and the models are evaluated with 10-fold cross-validation. Two machine-learning classifiers are compared: Decision Tree (DT) and Random Forest (RF). With IG-selected features, the RF classifier achieved an accuracy of 98.69%, with similarly high sensitivity, specificity, and precision, whereas the DT classifier reached an overall accuracy of 89.23%. With ANOVA-selected features, RF again performed better, although the margin between the two classifiers was narrower.
This comparative analysis underscores the RF classifier's robustness, attributable to its ensemble nature, which aggregates insights from multiple decision trees to mitigate overfitting and enhance predictive accuracy. The integration of amino acid encoding with RF, informed by targeted feature selection through IG and ANOVA, presents a potent methodology for COVID-19 strain classification.
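
The feature selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the integer encoding, the helper names (`encode`, `entropy`, `information_gain`, `rank_positions`), and the toy sequences and strain labels are all assumptions made for illustration, and the sequences are assumed to be aligned to equal length. The sketch encodes each residue as an integer and ranks sequence positions by Information Gain with respect to the strain label.

```python
import math
from collections import Counter

# Assumed encoding (hypothetical, not taken from the paper): the 20 standard
# amino acids map to 1..20, and any other symbol maps to 0.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ENCODING = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Turn an amino acid string into a list of integer codes."""
    return [ENCODING.get(residue, 0) for residue in sequence.upper()]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy of the labels minus their entropy conditioned on one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_positions(sequences, labels):
    """Return positions sorted by decreasing Information Gain, plus the gains."""
    encoded = [encode(s) for s in sequences]
    columns = list(zip(*encoded))  # one column per aligned sequence position
    gains = [information_gain(col, labels) for col in columns]
    ranking = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return ranking, gains


# Toy data (invented for illustration): only position 1 separates the labels.
sequences = ["MKTA", "MKSA", "MRTA", "MRSA"]
labels = ["strain_a", "strain_a", "strain_b", "strain_b"]
ranking, gains = rank_positions(sequences, labels)
print(ranking[0], gains)
```

In a full pipeline, the top-ranked positions would be kept as features and passed to the DT and RF classifiers under 10-fold cross-validation; that part is omitted here because the abstract does not specify the classifier settings.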

Article Details

How to Cite
Golwalkar, A., & Kothari, A. (2024). Determine the Classification of COVID-19 by Combining the Encoding of Amino Acids with Machine-Learning Models. International Journal on Recent and Innovation Trends in Computing and Communication, 11(8), 613–625. Retrieved from https://www.ijritcc.org/index.php/ijritcc/article/view/10488