Discovery of tandem and interspersed segmental duplications using high-throughput sequencing

Arda Soylev, Thong Minh Le, Hajar Amini, Can Alkan, Fereydoun Hormozdiari, Bonnie Berger

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Motivation: Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. However, complex SVs are of crucial importance, and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, because their sequencing signatures are similar, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions; likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve the calling accuracy of simpler variants.

Results: We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short-read whole-genome sequencing datasets. We integrated these methods into our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, at 30× coverage, TARDIS achieved 96% sensitivity with only a 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than for state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate for our approach when predicting tandem, direct and inverted interspersed segmental duplications on CHM1 (<5% for the top 50 predictions).

Availability and implementation: TARDIS source code is available at https://github.com/BilkentCompGen/tardis, and a corresponding Docker image is available at https://hub.docker.com/r/alkanlab/tardis/.

Supplementary information: Supplementary data are available at Bioinformatics online.
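The sensitivity and false discovery rate quoted in the abstract are standard call-set metrics. As a minimal sketch of how they are usually computed, the snippet below uses the common definitions TP / (TP + FN) and FP / (TP + FP). The function names and the counts are hypothetical, chosen only to reproduce the reported percentages; they are not taken from the paper, and the authors' exact evaluation procedure may differ.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_discovery_rate(tp: int, fp: int) -> float:
    """Fraction of reported calls that are wrong: FP / (TP + FP)."""
    return fp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts for illustration only (NOT data from the paper):
    # 96 true variants called, 4 missed, 4 spurious calls.
    tp, fn, fp = 96, 4, 4
    print(f"sensitivity = {sensitivity(tp, fn):.2%}")
    print(f"FDR         = {false_discovery_rate(tp, fp):.2%}")
```

Note that the two metrics have different denominators: sensitivity is measured against the truth set, while the false discovery rate is measured against the set of reported calls.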

Original language: English (US)
Pages (from-to): 3923-3930
Number of pages: 8
Journal: Bioinformatics
Volume: 35
Issue number: 20
DOIs: 10.1093/bioinformatics/btz237
State: Published - Oct 15 2019

Fingerprint

Genomic Segmental Duplications
Duplication
Sequencing
High Throughput
Genes
Throughput
Genome
Gene Conversion
Deletion
Inversion
Signature
Experiments
Haploidy
Gene
Human Genome
Bioinformatics
Platinum
Computational Biology
Prediction
Performance Prediction

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Discovery of tandem and interspersed segmental duplications using high-throughput sequencing. / Soylev, Arda; Le, Thong Minh; Amini, Hajar; Alkan, Can; Hormozdiari, Fereydoun; Berger, Bonnie.

In: Bioinformatics, Vol. 35, No. 20, 15.10.2019, p. 3923-3930.

Soylev, Arda ; Le, Thong Minh ; Amini, Hajar ; Alkan, Can ; Hormozdiari, Fereydoun ; Berger, Bonnie. / Discovery of tandem and interspersed segmental duplications using high-throughput sequencing. In: Bioinformatics. 2019 ; Vol. 35, No. 20. pp. 3923-3930.
@article{372f1108386c4e9398d807e1a2b6b160,
title = "Discovery of tandem and interspersed segmental duplications using high-throughput sequencing",
author = "Arda Soylev and Le, {Thong Minh} and Hajar Amini and Can Alkan and Fereydoun Hormozdiari and Bonnie Berger",
year = "2019",
month = "10",
day = "15",
doi = "10.1093/bioinformatics/btz237",
language = "English (US)",
volume = "35",
pages = "3923--3930",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "20",

}

TY - JOUR

T1 - Discovery of tandem and interspersed segmental duplications using high-throughput sequencing

AU - Soylev, Arda

AU - Le, Thong Minh

AU - Amini, Hajar

AU - Alkan, Can

AU - Hormozdiari, Fereydoun

AU - Berger, Bonnie

PY - 2019/10/15

Y1 - 2019/10/15

UR - http://www.scopus.com/inward/record.url?scp=85073184028&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85073184028&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btz237

DO - 10.1093/bioinformatics/btz237

M3 - Article

C2 - 30937433

AN - SCOPUS:85073184028

VL - 35

SP - 3923

EP - 3930

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 20

ER -