A text-mining approach to obtain detailed treatment information from free-text fields in population-based cancer registries: A study of non-small cell lung cancer in California

Frances B. Maguire, Cyllene R. Morris, Arti Parikh-Patel, Rosemary D Cress, Theresa H Keegan, Chin-Shang Li, Patrick S. Lin, Kenneth W Kizer

Research output: Contribution to journal › Article

Abstract

Background: Population-based cancer registries have treatment information for all patients, making them an excellent resource for population-level monitoring. However, specific treatment details, such as drug names, are contained in a free-text format that is difficult to process and summarize. We assessed the accuracy and efficiency of a text-mining algorithm to identify systemic treatments for lung cancer from free-text fields in the California Cancer Registry.

Methods: The algorithm used Perl regular expressions in SAS 9.4 to search for treatments in 24,845 free-text records associated with 17,310 patients in California diagnosed with stage IV non-small cell lung cancer between 2012 and 2014. Our algorithm categorized treatments into six groups that align with National Comprehensive Cancer Network guidelines. We compared results to a manual review (gold standard) of the same records.

Results: Percent agreement ranged from 91.1% to 99.4%. Ranges for other measures were 0.71–0.92 (Kappa), 74.3%–97.3% (sensitivity), 92.4%–99.8% (specificity), 60.4%–96.4% (positive predictive value), and 92.9%–99.9% (negative predictive value). The text-mining algorithm used one-sixth of the time required for manual review.

Conclusion: SAS-based text mining of free-text data can accurately detect systemic treatments administered to patients and save considerable time compared to manual review, maximizing the utility of the extant information in population-based cancer registries for comparative effectiveness research.
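The validation described in the abstract (regular-expression flags compared against manual review) can be illustrated with a minimal sketch. It is not the authors' code: the study used Perl regular expressions in SAS 9.4, whereas this sketch substitutes Python's `re` module, and the pattern, the drug names, the six toy records, and the helper names `flag_platinum` and `agreement_metrics` are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for one treatment group. The drug list is illustrative
# only and is not taken from the study.
PLATINUM = re.compile(r"\b(carboplatin|cisplatin)\b", re.IGNORECASE)


def flag_platinum(text: str) -> bool:
    """Return True if the free-text record mentions a listed drug name."""
    return PLATINUM.search(text) is not None


def agreement_metrics(predicted, reference):
    """Compare algorithm flags with manual-review flags (the gold standard)."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    n = len(pairs)
    observed = (tp + tn) / n
    # Chance agreement for Cohen's kappa, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "agreement": observed,
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Toy free-text records (invented) with invented manual-review labels.
records = [
    "Carboplatin + pemetrexed x4 cycles",
    "pt declined chemo",
    "CISPLATIN/etoposide",
    "erlotinib 150 mg daily",
    "carbo/taxol started 3/2013",                # abbreviation: missed by the pattern
    "no cisplatin given due to renal function",  # negation: flagged anyway
]
manual = [True, False, True, False, True, False]

predicted = [flag_platinum(text) for text in records]
metrics = agreement_metrics(predicted, manual)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The last two toy records show the two error types such a comparison would surface: an abbreviation the pattern does not cover (a false negative) and a negated mention that is still matched (a false positive). The helper does not guard against empty cells, so a real evaluation would need to handle zero denominators.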

Original language: English (US)
Article number: e0212454
Journal: PLoS One
Volume: 14
Issue number: 2
DOI: 10.1371/journal.pone.0212454
State: Published - Feb 1 2019

Fingerprint

Data Mining
lung neoplasms
Non-Small Cell Lung Carcinoma
Registries
Cells
neoplasms
Population
Neoplasms
cells
Therapeutics
Comparative Effectiveness Research
gold
drugs
Names
Monitoring
monitoring
Lung Neoplasms
Pharmaceutical Preparations
Guidelines
methodology

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology (all)
  • Agricultural and Biological Sciences (all)

Cite this

A text-mining approach to obtain detailed treatment information from free-text fields in population-based cancer registries: A study of non-small cell lung cancer in California. / Maguire, Frances B.; Morris, Cyllene R.; Parikh-Patel, Arti; Cress, Rosemary D; Keegan, Theresa H; Li, Chin-Shang; Lin, Patrick S.; Kizer, Kenneth W.

In: PLoS One, Vol. 14, No. 2, e0212454, 01.02.2019.

@article{3d39037f8d3a44c0a427a0010c92e6e0,
title = "A text-mining approach to obtain detailed treatment information from free-text fields in population-based cancer registries: A study of non-small cell lung cancer in California",
abstract = "Background Population-based cancer registries have treatment information for all patients making them an excellent resource for population-level monitoring. However, specific treatment details, such as drug names, are contained in a free-text format that is difficult to process and summarize. We assessed the accuracy and efficiency of a text-mining algorithm to identify systemic treatments for lung cancer from free-text fields in the California Cancer Registry. Methods The algorithm used Perl regular expressions in SAS 9.4 to search for treatments in 24,845 free-text records associated with 17,310 patients in California diagnosed with stage IV non-small cell lung cancer between 2012 and 2014. Our algorithm categorized treatments into six groups that align with National Comprehensive Cancer Network guidelines. We compared results to a manual review (gold standard) of the same records. Results Percent agreement ranged from 91.1{\%} to 99.4{\%}. Ranges for other measures were 0.71–0.92 (Kappa), 74.3{\%}-97.3{\%} (sensitivity), 92.4{\%}-99.8{\%} (specificity), 60.4{\%}-96.4{\%} (positive predictive value), and 92.9{\%}-99.9{\%} (negative predictive value). The text-mining algorithm used one-sixth of the time required for manual review. Conclusion SAS-based text mining of free-text data can accurately detect systemic treatments administered to patients and save considerable time compared to manual review, maximizing the utility of the extant information in population-based cancer registries for comparative effectiveness research.",
author = "Maguire, {Frances B.} and Morris, {Cyllene R.} and Parikh-Patel, Arti and Cress, {Rosemary D} and Keegan, {Theresa H} and Li, Chin-Shang and Lin, {Patrick S.} and Kizer, {Kenneth W}",
year = "2019",
month = "2",
day = "1",
doi = "10.1371/journal.pone.0212454",
language = "English (US)",
volume = "14",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "2",

}

TY - JOUR

T1 - A text-mining approach to obtain detailed treatment information from free-text fields in population-based cancer registries

T2 - A study of non-small cell lung cancer in California

AU - Maguire, Frances B.

AU - Morris, Cyllene R.

AU - Parikh-Patel, Arti

AU - Cress, Rosemary D

AU - Keegan, Theresa H

AU - Li, Chin-Shang

AU - Lin, Patrick S.

AU - Kizer, Kenneth W

PY - 2019/2/1

Y1 - 2019/2/1

AB - Background Population-based cancer registries have treatment information for all patients making them an excellent resource for population-level monitoring. However, specific treatment details, such as drug names, are contained in a free-text format that is difficult to process and summarize. We assessed the accuracy and efficiency of a text-mining algorithm to identify systemic treatments for lung cancer from free-text fields in the California Cancer Registry. Methods The algorithm used Perl regular expressions in SAS 9.4 to search for treatments in 24,845 free-text records associated with 17,310 patients in California diagnosed with stage IV non-small cell lung cancer between 2012 and 2014. Our algorithm categorized treatments into six groups that align with National Comprehensive Cancer Network guidelines. We compared results to a manual review (gold standard) of the same records. Results Percent agreement ranged from 91.1% to 99.4%. Ranges for other measures were 0.71–0.92 (Kappa), 74.3%-97.3% (sensitivity), 92.4%-99.8% (specificity), 60.4%-96.4% (positive predictive value), and 92.9%-99.9% (negative predictive value). The text-mining algorithm used one-sixth of the time required for manual review. Conclusion SAS-based text mining of free-text data can accurately detect systemic treatments administered to patients and save considerable time compared to manual review, maximizing the utility of the extant information in population-based cancer registries for comparative effectiveness research.

UR - http://www.scopus.com/inward/record.url?scp=85061978083&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85061978083&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0212454

DO - 10.1371/journal.pone.0212454

M3 - Article

C2 - 30794610

AN - SCOPUS:85061978083

VL - 14

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 2

M1 - e0212454

ER -