Cross-Modal Data Programming Enables Rapid Medical Machine Learning

Jared A. Dunnmon, Alexander J. Ratner, Khaled Saab, Nishith Khandwala, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew P. Lungren, Daniel L. Rubin, Christopher Ré

Research output: Contribution to journal › Article › peer-review

8 Scopus citations


A major bottleneck in developing clinically impactful machine learning models is a lack of labeled training data for model supervision. Thus, medical researchers increasingly turn to weaker, noisier sources of supervision, such as leveraging extractions from unstructured text reports to supervise image classification. A key challenge in weak supervision is combining sources of information that may differ in quality and have correlated errors. Recently, a statistical theory of weak supervision called data programming has shown promise in addressing this challenge. Data programming now underpins many deployed machine-learning systems in the technology industry, even for critical applications. We propose a new technique for applying data programming to the problem of cross-modal weak supervision in medicine, wherein weak labels derived from an auxiliary modality (e.g., text) are used to train models over a different target modality (e.g., images). We evaluate our approach on diverse clinical tasks via direct comparison to institution-scale, hand-labeled datasets. We find that our supervision technique increases model performance by up to 6 points area under the receiver operating characteristic curve (ROC-AUC) over baseline methods by improving both coverage and quality of the weak labels. Our approach yields models that on average perform within 1.75 points ROC-AUC of those supervised with physician-years of hand labeling and outperform those supervised with physician-months of hand labeling by 10.25 points ROC-AUC, while using only person-days of developer time and clinician work, a time saving of 96%. Our results suggest that modern weak supervision techniques such as data programming may enable more rapid development and deployment of clinically useful machine-learning models.

Machine learning can achieve record-breaking performance on many tasks, but machine learning development is often hindered by insufficient hand-labeled data for model training. This issue is particularly prohibitive in areas such as medical diagnostic analysis, where data are private and require expensive labeling by clinicians. A promising approach to handle this bottleneck is weak supervision, where machine learning models are trained using cheaper, noisier labels. We extend a recent, theoretically grounded weak supervision paradigm, data programming, wherein subject matter expert users write labeling functions to label training data imprecisely rather than hand-labeling data points. We show that our approach allows us to train machine learning models using person-days of effort that previously required person-years of hand labeling. Our methods could enable researchers and practitioners to leverage machine learning models over high-dimensional data (e.g., images, time series) even when labeled training sets are unavailable.

Machine learning (ML) models have achieved record-breaking performance on many tasks, but development is often blocked by a lack of large, hand-labeled training datasets for model supervision. We extend data programming, a theoretically grounded technique for supervision using cheaper, noisier labels, to train medical ML models using person-days of effort that previously required person-years of hand labeling. We find that our weakly supervised models perform similarly to their hand-labeled counterparts and that their performance improves as additional unlabeled data becomes available.
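The cross-modal idea described in the abstract can be illustrated with a minimal sketch: labeling functions vote on a text report, and the combined vote becomes a noisy training label for the paired image. Everything below is an assumption made for illustration and is not taken from the paper: the keyword patterns, the function names, and the sample reports are hypothetical. The votes are also combined by a plain unweighted average, which is a simple stand-in; data programming instead fits a model that estimates the accuracies and correlations of the labeling functions.

```python
import re
from typing import Callable, List, Optional, Tuple

# Label values; ABSTAIN means the labeling function has no opinion.
ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1

# Hypothetical keyword patterns, chosen only for illustration.
FINDING = r"(opacity|effusion|fracture)"


def lf_mentions_finding(report: str) -> int:
    """Vote ABNORMAL if the report names a finding keyword."""
    return ABNORMAL if re.search(rf"\b{FINDING}\b", report, re.I) else ABSTAIN


def lf_explicit_normal(report: str) -> int:
    """Vote NORMAL if the report uses a stock 'normal study' phrase."""
    pattern = r"\b(no acute|unremarkable|within normal limits)\b"
    return NORMAL if re.search(pattern, report, re.I) else ABSTAIN


def lf_negated_finding(report: str) -> int:
    """Vote NORMAL if a finding keyword is explicitly negated."""
    pattern = rf"\bno (evidence of )?{FINDING}\b"
    return NORMAL if re.search(pattern, report, re.I) else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_mentions_finding,
    lf_explicit_normal,
    lf_negated_finding,
]


def weak_label(report: str) -> Optional[float]:
    """Return the fraction of non-abstaining votes that say ABNORMAL.

    Returns None when every labeling function abstains (no coverage).
    This unweighted average is a simplification, not data programming.
    """
    votes = [lf(report) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return None
    return sum(votes) / len(votes)


def build_training_set(
    pairs: List[Tuple[str, str]]
) -> List[Tuple[str, float]]:
    """Attach text-derived weak labels to image IDs, dropping uncovered pairs."""
    labeled = []
    for image_id, report in pairs:
        p = weak_label(report)
        if p is not None:
            labeled.append((image_id, p))
    return labeled
```

In this sketch the report text is used only to produce labels; a downstream image model would be trained on the image IDs and their probabilistic labels, and would never see the text at inference time.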

Original language: English (US)
Article number: 100019
Issue number: 2
State: Published - May 8 2020
Externally published: Yes


Keywords

  • computed tomography
  • DSML 3: Development/Pre-production: Data science output has been rolled out/validated across multiple domains/problems
  • electroencephalography
  • machine learning
  • medical imaging
  • weak supervision
  • X-ray

ASJC Scopus subject areas

  • Decision Sciences(all)

