TY - JOUR
T1 - Prediction of tuberculosis using an automated machine learning platform for models trained on synthetic data
AU - Rashidi, Hooman
AU - Khan, Imran
AU - Dang, Luke
AU - Albahra, Samer
AU - Ratan, Ujjwal
AU - Chadderwala, Nihir
AU - To, Wilson
AU - Srinivas, Prathima
AU - Wajda, Jeffery
AU - Tran, Nam
N1 - Publisher Copyright: © 2022 Wolters Kluwer Medknow Publications. All rights reserved.
PY - 2022
Y1 - 2022
N2 - High-quality medical data is critical to the development and implementation of machine learning (ML) algorithms in healthcare; however, security and privacy concerns continue to limit access. We sought to determine the utility of 'synthetic data' in training ML algorithms for the detection of tuberculosis (TB) from inflammatory biomarker profiles. A retrospective dataset (A) comprising 278 patients was used to generate synthetic datasets (B, C, and D) for training models prior to secondary validation on a generalization dataset. ML models trained and validated on Dataset A (real) demonstrated an accuracy of 90%, a sensitivity of 89% (95% CI, 83-94%), and a specificity of 100% (95% CI, 81-100%). Models trained using the optimal synthetic dataset B showed an accuracy of 91%, a sensitivity of 93% (95% CI, 87-96%), and a specificity of 77% (95% CI, 50-93%). Synthetic datasets C and D displayed diminished performance measures (respective accuracies of 71% and 54%). This pilot study highlights the promise of synthetic data as an expedited means for ML algorithm development.
AB - High-quality medical data is critical to the development and implementation of machine learning (ML) algorithms in healthcare; however, security and privacy concerns continue to limit access. We sought to determine the utility of 'synthetic data' in training ML algorithms for the detection of tuberculosis (TB) from inflammatory biomarker profiles. A retrospective dataset (A) comprising 278 patients was used to generate synthetic datasets (B, C, and D) for training models prior to secondary validation on a generalization dataset. ML models trained and validated on Dataset A (real) demonstrated an accuracy of 90%, a sensitivity of 89% (95% CI, 83-94%), and a specificity of 100% (95% CI, 81-100%). Models trained using the optimal synthetic dataset B showed an accuracy of 91%, a sensitivity of 93% (95% CI, 87-96%), and a specificity of 77% (95% CI, 50-93%). Synthetic datasets C and D displayed diminished performance measures (respective accuracies of 71% and 54%). This pilot study highlights the promise of synthetic data as an expedited means for ML algorithm development.
KW - Artificial intelligence
KW - biomarkers
KW - data accessibility
KW - electronic medical record
KW - privacy
KW - simulation
UR - http://www.scopus.com/inward/record.url?scp=85124653923&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85124653923&partnerID=8YFLogxK
U2 - 10.4103/jpi.jpi_75_21
DO - 10.4103/jpi.jpi_75_21
M3 - Article
AN - SCOPUS:85124653923
VL - 13
SP - 10
JO - Journal of Pathology Informatics
JF - Journal of Pathology Informatics
SN - 2229-5089
IS - 1
ER -