Effects of imputation on correlation

Implications for analysis of mass spectrometry data from multiple biological matrices

Sandra L. Taylor, L. Renee Ruhaak, Karen Kelly, Robert H Weiss, Kyoungmi Kim

Research output: Contribution to journal (Article)

10 Citations (Scopus)

Abstract

With expanded access to, and decreased costs of, mass spectrometry, investigators are collecting and analyzing multiple biological matrices from the same subject such as serum, plasma, tissue and urine to enhance biomarker discoveries, understanding of disease processes and identification of therapeutic targets. Commonly, each biological matrix is analyzed separately, but multivariate methods such as MANOVAs that combine information from multiple biological matrices are potentially more powerful. However, mass spectrometric data typically contain large amounts of missing values, and imputation is often used to create complete data sets for analysis. The effects of imputation on multiple biological matrix analyses have not been studied. We investigated the effects of seven imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least squares regression, Bayesian principal components analysis, singular value decomposition and random forest), on the within-subject correlation of compounds between biological matrices and its consequences on MANOVA results. Through analysis of three real omics data sets and simulation studies, we found the amount of missing data and imputation method to substantially change the between-matrix correlation structure. The magnitude of the correlations was generally reduced in imputed data sets, and this effect increased with the amount of missing data. Significant results from MANOVA testing also were substantially affected. In particular, the number of false positives increased with the level of missing data for all imputation methods. No one imputation method was universally the best, but the simple substitution methods (Half Minimum and Mean) consistently performed poorly.
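The attenuation described in the abstract can be illustrated for the two simple substitution methods with a small simulation. This is only a sketch under stated assumptions, not the authors' procedure: the two "matrices" (called serum and urine here) are drawn from a bivariate normal on a log-like scale, values are removed completely at random and independently in each matrix, and the function names, sample size, true correlation and missing fraction are all hypothetical choices made for illustration.

```python
import numpy as np


def half_min_impute(x):
    """Replace missing values (NaN) with half of the minimum observed value."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmin(x) / 2.0
    return out


def mean_impute(x):
    """Replace missing values (NaN) with the mean of the observed values."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(x)
    return out


def simulate_attenuation(n=200, rho=0.8, missing_frac=0.3, seed=0):
    """Compare between-matrix correlation before and after imputation.

    Assumptions (illustrative only): bivariate normal abundances with
    correlation `rho`, and values missing completely at random.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    data = rng.multivariate_normal([10.0, 10.0], cov, size=n)
    serum, urine = data[:, 0].copy(), data[:, 1].copy()

    # Correlation in the complete data, before any values are removed.
    results = {"complete": np.corrcoef(serum, urine)[0, 1]}

    # Remove values independently in each matrix.
    serum[rng.random(n) < missing_frac] = np.nan
    urine[rng.random(n) < missing_frac] = np.nan

    # Impute each matrix separately, then recompute the correlation.
    for name, impute in (("half_min", half_min_impute), ("mean", mean_impute)):
        results[name] = np.corrcoef(impute(serum), impute(urine))[0, 1]
    return results


if __name__ == "__main__":
    for method, r in simulate_attenuation().items():
        print(f"{method:>9}: r = {r:.3f}")
```

Under these assumptions both substitution methods yield a smaller between-matrix correlation than the complete data, because a constant is inserted in one matrix regardless of the paired value in the other. Real mass spectrometry data are often missing for abundance-dependent reasons, so the size of the effect in practice may differ from this sketch.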

Original language: English (US)
Pages (from-to): 312-320
Number of pages: 9
Journal: Briefings in Bioinformatics
Volume: 18
Issue number: 2
DOI: 10.1093/bib/bbw010
State: Published - 2017

Fingerprint

Mass spectrometry
Substitution reactions
Biomarkers
Singular value decomposition
Principal Component Analysis
Least-Squares Analysis
Research Personnel
Urine
Tissue
Plasmas
Costs and Cost Analysis
Testing
Serum
Datasets
Costs

Keywords

  • Imputation
  • Mass spectrometry
  • Metabolomics
  • Missing data
  • Multivariate analysis
  • Within-subject correlation

ASJC Scopus subject areas

  • Information Systems
  • Molecular Biology

Cite this

@article{51a4e7da3b10426a826eb89ad140a3ee,
title = "Effects of imputation on correlation: Implications for analysis of mass spectrometry data from multiple biological matrices",
keywords = "Imputation, Mass spectrometry, Metabolomics, Missing data, Multivariate analysis, Within-subject correlation",
author = "Taylor, {Sandra L.} and Ruhaak, {L. Renee} and Karen Kelly and Weiss, {Robert H} and Kyoungmi Kim",
year = "2017",
doi = "10.1093/bib/bbw010",
language = "English (US)",
volume = "18",
pages = "312--320",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford University Press",
number = "2",

}

TY - JOUR

T1 - Effects of imputation on correlation

T2 - Implications for analysis of mass spectrometry data from multiple biological matrices

AU - Taylor, Sandra L.

AU - Ruhaak, L. Renee

AU - Kelly, Karen

AU - Weiss, Robert H

AU - Kim, Kyoungmi

PY - 2017

Y1 - 2017

KW - Imputation

KW - Mass spectrometry

KW - Metabolomics

KW - Missing data

KW - Multivariate analysis

KW - Within-subject correlation

UR - http://www.scopus.com/inward/record.url?scp=85018795057&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85018795057&partnerID=8YFLogxK

U2 - 10.1093/bib/bbw010

DO - 10.1093/bib/bbw010

M3 - Article

VL - 18

SP - 312

EP - 320

JO - Briefings in Bioinformatics

JF - Briefings in Bioinformatics

SN - 1467-5463

IS - 2

ER -