TY - JOUR
T1 - Applications of Machine Learning for the Classification of Porcine Reproductive and Respiratory Syndrome Virus Sublineages Using Amino Acid Scores of ORF5 Gene
AU - Kim, Jeonghoon
AU - Lee, Kyuyoung
AU - Rupasinghe, Ruwini
AU - Rezaei, Shahbaz
AU - Martínez-López, Beatriz
AU - Liu, Xin
N1 - Funding Information:
The authors thank swine industry collaborators and producers for the provision of data. This work was made possible by the NSF-funded project #1838207 BIGDATA: IA: A multi-level approach for global optimization of the surveillance and control of infectious disease in the swine industry. The work was partially supported by NSF through Grants IIS-1838207 and OIA-2040680 and by USDA through Award 2020-67021-32855.
Publisher Copyright:
Copyright © 2021 Kim, Lee, Rupasinghe, Rezaei, Martínez-López and Liu.
PY - 2021/7/23
Y1 - 2021/7/23
N2 - Porcine reproductive and respiratory syndrome (PRRS) is an infectious disease of pigs caused by PRRS virus (PRRSV). A modified live-attenuated vaccine has been widely used to control the spread of PRRSV, and the classification of field strains is key to successful control and prevention. Restriction fragment length polymorphism targeting the open reading frame 5 (ORF5) gene is widely used to classify PRRSV strains but has shown unstable accuracy. Phylogenetic analysis is a powerful tool for PRRSV classification with consistent accuracy, but it demands large computational power as the number of sequences increases. Our study aimed to apply four machine learning (ML) algorithms (random forest, k-nearest neighbor, support vector machine, and multilayer perceptron) to classify field PRRSV strains into four clades using amino acid scores based on the ORF5 gene sequence. We used amino acid sequences of the ORF5 gene from 1931 field PRRSV strains collected in the US from 2012 to 2020. Phylogenetic analysis was used to label each field PRRSV strain as one of four clades: Lineage 5 or one of three clades in Lineage 1. We measured the accuracy and time consumption of classification with the four ML approaches across different sizes of sequence data. We found that all four ML algorithms classified a large number of field strains in a very short time (<2.5 s) with very high accuracy (>0.99 area under the receiver operating characteristic curve). Furthermore, the random forest approach identified four key amino acid positions for the classification of field PRRSV strains into the four clades. Our findings provide insight for developing a rapid and accurate classification model using genetic information, which would also make it possible to handle large genome datasets in real time or semi-real time for data-driven decision-making and more timely surveillance.
AB - Porcine reproductive and respiratory syndrome (PRRS) is an infectious disease of pigs caused by PRRS virus (PRRSV). A modified live-attenuated vaccine has been widely used to control the spread of PRRSV, and the classification of field strains is key to successful control and prevention. Restriction fragment length polymorphism targeting the open reading frame 5 (ORF5) gene is widely used to classify PRRSV strains but has shown unstable accuracy. Phylogenetic analysis is a powerful tool for PRRSV classification with consistent accuracy, but it demands large computational power as the number of sequences increases. Our study aimed to apply four machine learning (ML) algorithms (random forest, k-nearest neighbor, support vector machine, and multilayer perceptron) to classify field PRRSV strains into four clades using amino acid scores based on the ORF5 gene sequence. We used amino acid sequences of the ORF5 gene from 1931 field PRRSV strains collected in the US from 2012 to 2020. Phylogenetic analysis was used to label each field PRRSV strain as one of four clades: Lineage 5 or one of three clades in Lineage 1. We measured the accuracy and time consumption of classification with the four ML approaches across different sizes of sequence data. We found that all four ML algorithms classified a large number of field strains in a very short time (<2.5 s) with very high accuracy (>0.99 area under the receiver operating characteristic curve). Furthermore, the random forest approach identified four key amino acid positions for the classification of field PRRSV strains into the four clades. Our findings provide insight for developing a rapid and accurate classification model using genetic information, which would also make it possible to handle large genome datasets in real time or semi-real time for data-driven decision-making and more timely surveillance.
KW - artificial intelligence
KW - classification
KW - k-nearest neighbor
KW - multilayer perceptron
KW - phylogenetic tree
KW - random forest
KW - support vector machine
KW - swine health
UR - http://www.scopus.com/inward/record.url?scp=85112158136&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112158136&partnerID=8YFLogxK
U2 - 10.3389/fvets.2021.683134
DO - 10.3389/fvets.2021.683134
M3 - Article
AN - SCOPUS:85112158136
VL - 8
JO - Frontiers in Veterinary Science
JF - Frontiers in Veterinary Science
SN - 2297-1769
M1 - 683134
ER -