Bayesian Variable Selection in Generalized Linear Models with Missing Varibles

  • Yang, Xiaowei (PI)

Project: Research project

Project Details


DESCRIPTION (provided by applicant): The applicant seeks to address the problem of missing values A major challenge for biomedical research comes from the problems of missing values, which may be caused by subjective (e.g., nonresponse and dropout) and technical reasons (e.g., censoring over/below quantization level). Generalized linear models (GLMs) and Generalized Linear Mixed Models (GLMMs) are popularly applied in biomedical data analysis where a fundamental task is to identify a subset of independent variables (e.g., genetic, proteomic, behavioral, or environmental factors) to interpret or predict a dependent variable (e.g., therapeutic effectiveness and safety). Given an incomplete data set, practitioners may needlessly resort to the strategy of case-deletion where individuals are excluded from consideration if they miss any of the variables targeted for analysis. This method would not only sacrifice useful information, but also give rise to biased estimates because it requires strong assumptions to accept the missingness mechanisms. A more satisfactory solution for missing data problems involves multiple imputation, where several imputations are created for the same set of missing values. Across multiply imputed data sets, however, traditional variable selection methods (based on significance tests or likelihood criteria) often result in models with different selected predictors, thus presenting a problem of combining the models to make final inferences. In this R01 proposal, we aim to develop alternative strategies of variable selection for GLMs with missing values by drawing on a Bayesian framework. One approach called "impute, then select" (ITS) involves initially performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. The second strategy - "simultaneously impute and select" (SIAS) - conducts Bayesian variable selection and missing data imputation simultaneously within one Markov Chain Monte Carlo (MCMC) process. ITS and SIAS offer two generic frameworks within which various Bayesian variable selection algorithms and missing data imputation algorithms can be implemented. The strategies will be extended to handle complex data sets such as those with multi-level design structures and/or large number of variables. The strategies will be developed, evaluated, and implemented into an R library for normal, binomial/multinomial, and Poisson regression models with mixed categorical and continuous explanatory variables. Simulated and practical data sets from studies on childhood autism and drug dependence will be used to address the effectiveness and flexibility of the proposed strategies. PUBLIC HEALTH RELEVANCE: Missing data is the normal circumstance when developing large data sets. This issue comes to the forefront when using large data sets to develop personalized and individualized care. To avoid this loss of data and provide better predictions of risk and benefit, imputation-based Bayesian variable selection strategy provides a powerful analytical tool. The availability of our new method and software package will greatly enhance the capacity and quality of medical research and healthcare delivery
Effective start/end date8/11/118/30/14


  • National Institutes of Health: $229,953.00
  • National Institutes of Health: $192,688.00
  • National Institutes of Health: $96,862.00
  • National Institutes of Health: $95,377.00


  • Medicine(all)


Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.