Ancestry inference using principal component analysis and spatial analysis: A distance-based analysis to account for population substructure

Jinyoung Byun, Younghun Han, Ivan P. Gorlov, Jonathan A. Busam, Michael F Seldin, Christopher I. Amos

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Background: Accurate inference of genetic ancestry is of fundamental interest to many biomedical, forensic, and anthropological research areas. Genetic ancestry memberships may relate to genetic disease risks. In a genome association study, failing to account for differences in genetic ancestry between cases and controls may also lead to false-positive results. Although a number of strategies for inferring and taking into account the confounding effects of genetic ancestry are available, applying them to large studies (tens of thousands of samples) is challenging. The goal of this study is to develop an approach for inferring the genetic ancestry of samples with unknown ancestry among closely related populations and to provide accurate estimates of ancestry for application to large-scale studies.

Methods: In this study we developed a novel distance-based approach, Ancestry Inference using Principal component analysis and Spatial analysis (AIPS), that incorporates an Inverse Distance Weighted (IDW) interpolation method from spatial analysis to assign individuals to population memberships.

Results: We demonstrate the benefits of AIPS in analyzing population substructure, specifically relative to the four most commonly used tools (EIGENSTRAT, STRUCTURE, fastSTRUCTURE, and ADMIXTURE), using genotype data from various intra-European panels and European-Americans. While these commonly used tools performed poorly in inferring ancestry from a large number of subpopulations, AIPS accurately distinguished variations between and within subpopulations.

Conclusions: Our results show that AIPS can be applied to large-scale data sets to discriminate the modest variability among intra-continental populations as well as to characterize inter-continental variation. The method we developed will protect against spurious associations when mapping the genetic basis of a disease. Our approach is a more accurate and computationally efficient method for inferring genetic ancestry in large-scale genetic studies.
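
The abstract does not spell out how AIPS combines principal components with IDW interpolation, so the following is only a minimal sketch of generic inverse distance weighting applied to points in principal-component space, not the authors' implementation. The function name `idw_membership`, the exponent `power=2.0`, the population labels, and the toy PC1/PC2 coordinates are all assumptions made for illustration. Each reference sample contributes a weight of 1/d^power to its own population, and the normalized weights are read as membership scores.

```python
import math
from collections import defaultdict


def idw_membership(query, references, power=2.0):
    """Return {population: score} for one query sample; scores sum to 1.

    query      -- tuple of PC coordinates for the sample of unknown ancestry
    references -- list of (coordinates, population_label) pairs
    power      -- IDW exponent (assumed value; larger means more local)
    """
    if not references:
        raise ValueError("at least one reference sample is required")
    weights = defaultdict(float)
    for coords, label in references:
        dist = math.dist(query, coords)  # Euclidean distance in PC space
        if dist == 0.0:
            # Query coincides with a reference sample: take its label outright.
            return {label: 1.0}
        weights[label] += 1.0 / dist ** power
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Toy, made-up PC1/PC2 coordinates for two hypothetical reference populations.
references = [
    ((0.0, 0.0), "A"),
    ((0.1, 0.0), "A"),
    ((1.0, 1.0), "B"),
    ((1.1, 1.0), "B"),
]

scores = idw_membership((0.2, 0.1), references)
assigned = max(scores, key=scores.get)  # population with the largest score
print(assigned, scores)
```

In this toy setup the query point lies close to the two "A" reference samples, so "A" receives nearly all of the weight. In practice the coordinates would come from a principal component analysis of genotype data, and the choice of exponent and number of components would need to be tuned and validated.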

Original language: English (US)
Article number: 789
Journal: BMC Genomics
Volume: 18
Issue number: 1
DOI: 10.1186/s12864-017-4166-8
State: Published - Oct 16 2017

Keywords

  • Ancestry inference
  • Inverse distance weighted interpolation
  • Principal component analysis
  • Spatial analysis

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

Ancestry inference using principal component analysis and spatial analysis: A distance-based analysis to account for population substructure. / Byun, Jinyoung; Han, Younghun; Gorlov, Ivan P.; Busam, Jonathan A.; Seldin, Michael F; Amos, Christopher I.

In: BMC Genomics, Vol. 18, No. 1, 789, 16.10.2017.

@article{48605b9195154d00af5285fb6c109ff2,
title = "Ancestry inference using principal component analysis and spatial analysis: A distance-based analysis to account for population substructure",
abstract = "Background: Accurate inference of genetic ancestry is of fundamental interest to many biomedical, forensic, and anthropological research areas. Genetic ancestry memberships may relate to genetic disease risks. In a genome association study, failing to account for differences in genetic ancestry between cases and controls may also lead to false-positive results. Although a number of strategies for inferring and taking into account the confounding effects of genetic ancestry are available, applying them to large studies (tens of thousands of samples) is challenging. The goal of this study is to develop an approach for inferring the genetic ancestry of samples with unknown ancestry among closely related populations and to provide accurate estimates of ancestry for application to large-scale studies. Methods: In this study we developed a novel distance-based approach, Ancestry Inference using Principal component analysis and Spatial analysis (AIPS), that incorporates an Inverse Distance Weighted (IDW) interpolation method from spatial analysis to assign individuals to population memberships. Results: We demonstrate the benefits of AIPS in analyzing population substructure, specifically relative to the four most commonly used tools (EIGENSTRAT, STRUCTURE, fastSTRUCTURE, and ADMIXTURE), using genotype data from various intra-European panels and European-Americans. While these commonly used tools performed poorly in inferring ancestry from a large number of subpopulations, AIPS accurately distinguished variations between and within subpopulations. Conclusions: Our results show that AIPS can be applied to large-scale data sets to discriminate the modest variability among intra-continental populations as well as to characterize inter-continental variation. The method we developed will protect against spurious associations when mapping the genetic basis of a disease. Our approach is a more accurate and computationally efficient method for inferring genetic ancestry in large-scale genetic studies.",
keywords = "Ancestry inference, Inverse distance weighted interpolation, Principal component analysis, Spatial analysis",
author = "Jinyoung Byun and Younghun Han and Gorlov, {Ivan P.} and Busam, {Jonathan A.} and Seldin, {Michael F} and Amos, {Christopher I.}",
year = "2017",
month = "10",
day = "16",
doi = "10.1186/s12864-017-4166-8",
language = "English (US)",
volume = "18",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Ancestry inference using principal component analysis and spatial analysis

T2 - A distance-based analysis to account for population substructure

AU - Byun, Jinyoung

AU - Han, Younghun

AU - Gorlov, Ivan P.

AU - Busam, Jonathan A.

AU - Seldin, Michael F

AU - Amos, Christopher I.

PY - 2017/10/16

Y1 - 2017/10/16

AB - Background: Accurate inference of genetic ancestry is of fundamental interest to many biomedical, forensic, and anthropological research areas. Genetic ancestry memberships may relate to genetic disease risks. In a genome association study, failing to account for differences in genetic ancestry between cases and controls may also lead to false-positive results. Although a number of strategies for inferring and taking into account the confounding effects of genetic ancestry are available, applying them to large studies (tens of thousands of samples) is challenging. The goal of this study is to develop an approach for inferring the genetic ancestry of samples with unknown ancestry among closely related populations and to provide accurate estimates of ancestry for application to large-scale studies. Methods: In this study we developed a novel distance-based approach, Ancestry Inference using Principal component analysis and Spatial analysis (AIPS), that incorporates an Inverse Distance Weighted (IDW) interpolation method from spatial analysis to assign individuals to population memberships. Results: We demonstrate the benefits of AIPS in analyzing population substructure, specifically relative to the four most commonly used tools (EIGENSTRAT, STRUCTURE, fastSTRUCTURE, and ADMIXTURE), using genotype data from various intra-European panels and European-Americans. While these commonly used tools performed poorly in inferring ancestry from a large number of subpopulations, AIPS accurately distinguished variations between and within subpopulations. Conclusions: Our results show that AIPS can be applied to large-scale data sets to discriminate the modest variability among intra-continental populations as well as to characterize inter-continental variation. The method we developed will protect against spurious associations when mapping the genetic basis of a disease. Our approach is a more accurate and computationally efficient method for inferring genetic ancestry in large-scale genetic studies.

KW - Ancestry inference

KW - Inverse distance weighted interpolation

KW - Principal component analysis

KW - Spatial analysis

UR - http://www.scopus.com/inward/record.url?scp=85031502194&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85031502194&partnerID=8YFLogxK

U2 - 10.1186/s12864-017-4166-8

DO - 10.1186/s12864-017-4166-8

M3 - Article

C2 - 29037167

AN - SCOPUS:85031502194

VL - 18

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - 1

M1 - 789

ER -