aug2013 illumina platinum genomes

© 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners. Platinum Genomes: Identifying variants using a large pedigree Michael A. Eberle GIAB August, 2013


Page 1: Aug2013 illumina platinum genomes


Platinum Genomes: Identifying variants using a large pedigree

Michael A. Eberle

GIAB August, 2013

Page 2: Aug2013 illumina platinum genomes


Platinum Genome project: Improving technology & tools

Create a catalogue of highly accurate whole-genome variant calls within a well characterized pedigree

– SNPs, indels & CNVs
– Including highly confident reference positions
– Provide direct supporting evidence for every variant call

Develop a framework to assess variant callers

Provide a path to improve variant callers by supplying better truth data for measuring sensitivity and precision

– Modifying the SNP filters to maximize accuracy

[Figure: comparison of Truth and Test call sets; labeled regions: Correct, FP, FN]
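As a minimal sketch of the assessment described on this slide, the snippet below scores a test call set against a truth set. The `(chrom, pos, ref, alt)` variant keys, the function name `assess_calls`, and the toy data are assumptions made for illustration; they are not taken from the Platinum Genomes tooling.

```python
def assess_calls(truth, test):
    """Score a test call set against a truth set.

    Both arguments are sets of hashable variant keys. The
    (chrom, pos, ref, alt) tuples used below are an assumed encoding.
    """
    correct = truth & test   # calls present in both sets
    fp = test - truth        # called, but not in the truth set
    fn = truth - test        # in the truth set, but not called
    sensitivity = len(correct) / len(truth) if truth else float("nan")
    precision = len(correct) / len(test) if test else float("nan")
    return {
        "correct": len(correct),
        "fp": len(fp),
        "fn": len(fn),
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Toy data, invented for this example.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
}
test = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr3", 7, "A", "C"),
}

result = assess_calls(truth, test)
print(result)
```

Tightening or loosening a caller's filters moves calls between the Correct, FP and FN groups, so the same function can be rerun after each filter change to see the trade-off.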

Page 3: Aug2013 illumina platinum genomes


NIST GIAB – Pedigree analysis

[Figure: pedigree diagram]
Grandparents: 12889 12890 12891 12892
Parents: 12877 12878
Children: 12879 12880 12881 12882 12883 12884 12885 12887 12886 12888 12893

All 17 members sequenced to at least 50x depth (PCR-Free protocol)

Variants are called across the pedigree using different software & technology

Inheritance information provides high-confidence, direct validation of variant calls

Analysis of SNPs in the parents and 11 children
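The simplest form of inheritance-based validation is a per-child Mendelian check. The sketch below is only that basic check, not the haplotype-based method used in the project; genotypes are assumed to be unphased pairs of allele strings, and the function name is hypothetical.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """Return True if the child's unphased genotype can be formed by
    taking one allele from the father and one from the mother."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((f, m))) == child_sorted
        for f, m in product(father, mother)
    )


# Toy genotypes, invented for this example.
father = ("A", "G")
mother = ("A", "A")

ok_child = mendelian_consistent(father, mother, ("A", "G"))   # G from father, A from mother
bad_child = mendelian_consistent(father, mother, ("G", "G"))  # mother has no G to give
print(ok_child, bad_child)
```

A check like this only looks at one child at a time. Using all eleven children together, as the later slides describe, is far more restrictive.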

Page 4: Aug2013 illumina platinum genomes


Pedigree Analysis – Using haplotypes to detect conflicts

[Figure: haplotypes carried by the parents and children across six positions. Four distinct haplotypes appear (ACAGTA, ATCTGA, GTCGTC, GCATTA); one copy reads ACATTA.]

With a sufficiently large pedigree, all four possible inheritance patterns will be observed and most of the genotypes can be phased into haplotypes
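To make "sufficiently large" concrete, here is a rough calculation. It assumes that, at a given position, each child independently receives one of the four parental haplotype combinations with probability 1/4. The haplotype labels A/B (father) and C/D (mother) follow the labels on the read-count slide; the function name is hypothetical.

```python
from itertools import product
from math import comb

PATERNAL = ("A", "B")
MATERNAL = ("C", "D")

# The four possible inheritance patterns: one paternal plus one maternal haplotype.
PATTERNS = [p + m for p, m in product(PATERNAL, MATERNAL)]


def prob_all_patterns_seen(n_children, n_patterns=4):
    """Probability that every pattern occurs at least once among
    n_children, by inclusion-exclusion over the missing patterns."""
    return sum(
        (-1) ** k
        * comb(n_patterns, k)
        * ((n_patterns - k) / n_patterns) ** n_children
        for k in range(n_patterns + 1)
    )


for n in (4, 6, 11):
    print(n, round(prob_all_patterns_seen(n), 4))
```

Under these assumptions, a family of eleven children shows all four patterns at a little over 83% of positions, compared with under 10% for a family of four.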

Page 5: Aug2013 illumina platinum genomes


Using haplotypes to detect conflicts

[Figure: the same parent and child haplotype diagram (ACAGTA, ATCTGA, GTCGTC, GCATTA), with one call marked "Error at this sample/position"; one copy reads ACATTA where the neighboring copies read ACAGTA.]

Individual GT accuracy is assessed using surrounding genotype calls across the pedigree

Genotypes are parsimoniously phased to minimize the number of conflicts across the pedigree

Facilitates assigning conflicts to sample, imputation of missing data and error correction
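A minimal sketch of how a conflict can be pinned to one sample and position once the transmitted haplotype is known. It assumes the ACATTA copy in the figure was expected to match the ACAGTA haplotype (an assumption based on where it sits in the diagram), and it works on already-phased strings, which simplifies the genotype-level analysis the slide describes.

```python
def mismatch_positions(expected, observed):
    """Return the 0-based positions where an observed haplotype copy
    differs from the parental haplotype it should have inherited."""
    if len(expected) != len(observed):
        raise ValueError("haplotypes must have the same length")
    return [i for i, (e, o) in enumerate(zip(expected, observed)) if e != o]


# Assumed: all five copies below descend from the same parental haplotype.
expected_copy = "ACAGTA"
observed_copies = ["ACAGTA", "ACAGTA", "ACAGTA", "ACATTA", "ACAGTA"]

# Map copy index -> conflicting positions, for copies that disagree.
conflicts = {
    idx: mismatch_positions(expected_copy, hap)
    for idx, hap in enumerate(observed_copies)
    if hap != expected_copy
}
print(conflicts)
```

Note that ACATTA is also a single mismatch away from GCATTA, so the sequence alone does not say which haplotype the copy belongs to. Knowing the inheritance pattern at that location is what resolves the conflict to one sample and one position.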

Page 6: Aug2013 illumina platinum genomes

Analysis of variant calls within the pedigree structure

First step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome
– Identified 709 crossover events between the parents and eleven children

Variants called across the pedigree using multiple callers
– E.g. GATK, Cortex, Isaac & CGI for SNPs

Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes
– At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
– 3^13 (~1.6M) possible genotype combinations
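The 16-versus-3^13 count can be reproduced with a short enumeration. This is a sketch under stated assumptions: the haplotype-pair labels for the two parents and eleven children are borrowed from the read-count slide, each of the four parental haplotypes carries either the reference (0) or alternate (1) allele, and a genotype is the number of alternate alleles (0, 1 or 2). The function name is hypothetical.

```python
from itertools import product

# Father = AB, mother = CD, followed by the eleven children.
SAMPLES = ["AB", "CD",
           "CB", "DA", "CB", "DB", "DA", "CB", "CA", "DB", "CB", "CA", "DA"]


def consistent_genotype_vectors(samples):
    """Enumerate every genotype vector that follows from assigning a
    ref (0) or alt (1) allele to each of the haplotypes A, B, C, D."""
    vectors = set()
    for alleles in product((0, 1), repeat=4):
        alt = dict(zip("ABCD", alleles))
        vectors.add(tuple(alt[h1] + alt[h2] for h1, h2 in samples))
    return vectors


consistent = consistent_genotype_vectors(SAMPLES)
total = 3 ** len(SAMPLES)  # every sample could be called 0, 1 or 2

print(len(consistent), total)

# A call set passes only if its genotype vector is one of the 16.
observed = (1, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1)  # alt on B, C and D
print(observed in consistent)
```

With 13 samples there are 1,594,323 possible genotype vectors, and only 16 of them agree with the inheritance pattern, which is why pedigree consistency is such a strict filter.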

Page 7: Aug2013 illumina platinum genomes

Current state

Homozygous positions (GATK)
– ~2.6B positions identified as homozygous reference across the pedigree

SNPs (GATK, Cortex, Isaac & CGI)
– ~4.7M positions where SNPs agree with transmission of parental chromosomes
– >95% (4.5M) called consistent with transmission by multiple algorithms/technologies
– >98% (4.6M) with supporting evidence from other call sets (i.e. same variant called in at least one of the samples)

Indels (GATK, Cortex & CGI)
– ~640k indels consistent with transmission of parental chromosomes
– Events range in size from 1 to 350bp

CNVs (BreakDancer & Grouper)
– ~772 CNVs - mostly deletions though a couple of duplications
– Events range from 1kb to 322kb though still refining break points

Page 8: Aug2013 illumina platinum genomes


CNVs

Page 9: Aug2013 illumina platinum genomes


Incorporating larger variants

SNPs and small indels work well because the genotypes are highly accurate
– A single genotyping error in any of the 13 samples will almost never be consistent with the haplotype transmission

Developing approaches for other variant types that have lower calling accuracy
– Many CNV callers do not provide GT information
– Accuracy is too low to use pedigree-consistency
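The claim about single genotyping errors can be checked on a toy configuration. The sketch below assumes the haplotype-pair labels from the read-count slide (father AB, mother CD, eleven children), biallelic sites, and genotypes coded as alternate-allele counts. It takes every pedigree-consistent genotype vector, corrupts one sample's genotype, and counts how often the corrupted vector is still consistent. Function names are hypothetical.

```python
from itertools import product

SAMPLES = ["AB", "CD",
           "CB", "DA", "CB", "DB", "DA", "CB", "CA", "DB", "CB", "CA", "DA"]


def consistent_vectors(samples):
    """All genotype vectors implied by a 0/1 allele on haplotypes A-D."""
    vectors = set()
    for alleles in product((0, 1), repeat=4):
        alt = dict(zip("ABCD", alleles))
        vectors.add(tuple(alt[h1] + alt[h2] for h1, h2 in samples))
    return vectors


def single_error_escapes(samples):
    """Count single-genotype errors that still look pedigree-consistent.

    Returns (escapes, tested): the number of corrupted vectors that are
    still consistent, and the number of corruptions tried."""
    consistent = consistent_vectors(samples)
    escapes = tested = 0
    for vec in consistent:
        for i, genotype in enumerate(vec):
            for wrong in (0, 1, 2):
                if wrong == genotype:
                    continue
                tested += 1
                corrupted = vec[:i] + (wrong,) + vec[i + 1:]
                if corrupted in consistent:
                    escapes += 1
    return escapes, tested


escapes, tested = single_error_escapes(SAMPLES)
print(escapes, "of", tested, "single errors remain consistent")
```

In this configuration every inheritance pattern is carried by at least two children, so no single error goes undetected. Real data is less tidy (for example near crossovers), which fits the slide's "almost never".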

Page 10: Aug2013 illumina platinum genomes


Incorporating CNVs into this framework

Make breakpoint calls within each sample using BreakDancer & Grouper
Identify regions of overlap between samples (keeping singletons)
Corroborate based on read counts within the putative CNV events
Refine to breakpoint resolution

[Figure: calls in samples NA12877, NA12878, NA12879, NA12880, NA12881 and NA12882 combined into "Test Regions"]

• Count the uniquely aligned reads within the defined break points for the test regions for each sample & identify events where the read counts are consistent with a deletion or duplication

• For internally-consistent events, follow up with targeted analysis to identify bp resolution of events

• On average ~150x depth for every event
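The overlap step (combining per-sample calls into test regions while keeping singletons) can be sketched as a standard interval merge. The coordinates below are invented, the function name `build_test_regions` is hypothetical, and all calls are assumed to lie on one chromosome. This is not the project's actual merging code.

```python
def build_test_regions(calls_by_sample):
    """Merge overlapping per-sample CNV calls into test regions.

    calls_by_sample maps a sample name to a list of (start, end) calls.
    Calls that overlap or touch are merged; a call seen in only one
    sample is kept as its own region (a singleton)."""
    tagged = sorted(
        (start, end, sample)
        for sample, calls in calls_by_sample.items()
        for start, end in calls
    )
    regions = []
    for start, end, sample in tagged:
        if regions and start <= regions[-1]["end"]:
            # Overlaps the current region: extend it and record the sample.
            regions[-1]["end"] = max(regions[-1]["end"], end)
            regions[-1]["samples"].add(sample)
        else:
            regions.append({"start": start, "end": end, "samples": {sample}})
    return regions


# Invented coordinates for illustration only.
calls = {
    "NA12877": [(1000, 9500)],
    "NA12878": [(50000, 52000)],
    "NA12879": [(1200, 9400)],
    "NA12880": [(1100, 9600), (80000, 83000)],
}

regions = build_test_regions(calls)
for region in regions:
    print(region["start"], region["end"], sorted(region["samples"]))
```

Each resulting region would then be passed to the read-count step for every sample, including samples in which no breakpoint call was made.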

Page 11: Aug2013 illumina platinum genomes

Using read counts to confirm deletions – 8.5kb deletion

[Figure: read counts (axis ticks 0, 500, 1000, 1500, 2000) for 13 samples labeled by haplotype pair: AB CD CB DA CB DB DA CB CA DB CB CA DA. A second axis shows 0, 1, 2, with levels labeled Zero-ploid, Haploid and Diploid; six samples are flagged "A".]

Best solution: A=0; B=1; C=1; D=1

All samples with haplotype A are consistent with haploid based on read counts
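One way to arrive at a "best solution" like the one on this slide is a brute-force fit of per-haplotype copy numbers to the read counts. The sketch below uses the haplotype-pair labels from the slide, but the read counts and the expected count per copy (`PER_COPY`) are invented for illustration, and the least-squares criterion is an assumption rather than the method actually used.

```python
from itertools import product

# Haplotype pairs from the slide: two parents, then eleven children.
SAMPLE_HAPLOTYPES = ["AB", "CD",
                     "CB", "DA", "CB", "DB", "DA", "CB", "CA", "DB", "CB", "CA", "DA"]

# Invented read counts inside the putative deletion, one per sample.
READ_COUNTS = [810, 1590,
               1640, 770, 1560, 1620, 830, 1610, 790, 1580, 1600, 820, 780]

PER_COPY = 800.0  # assumed expected read count for one copy of the region


def fit_copy_numbers(sample_haplotypes, read_counts, per_copy, max_copies=2):
    """Try every copy-number assignment for haplotypes A-D and return
    the one with the smallest squared error against the read counts."""
    best = None
    for cn in product(range(max_copies + 1), repeat=4):
        copies = dict(zip("ABCD", cn))
        err = sum(
            (count - per_copy * (copies[h1] + copies[h2])) ** 2
            for (h1, h2), count in zip(sample_haplotypes, read_counts)
        )
        if best is None or err < best[0]:
            best = (err, copies)
    return best[1]


best = fit_copy_numbers(SAMPLE_HAPLOTYPES, READ_COUNTS, PER_COPY)
print(best)

# Samples predicted to carry a single copy of the region.
haploid = [s for s in SAMPLE_HAPLOTYPES if best[s[0]] + best[s[1]] == 1]
print(haploid)
```

With these toy counts the fit puts the deletion on haplotype A, and the six samples predicted to be haploid are exactly the ones carrying A, mirroring the pattern described on the slide.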

Page 12: Aug2013 illumina platinum genomes


Breakdown of 772 “accurate” CNVs (1kb to 322kb in size)

[Figure: overlap between BreakDancer and Grouper calls; region counts 266, 408 and 98 (total 772)]

Page 13: Aug2013 illumina platinum genomes

Next steps

Assembling breakpoints for the 772 CNVs
– Reassessing the “failed” calls where applicable

Incorporating different calling algorithms / methods
– E.g. SNP inheritance can help identify CNVs that are missed by other methods
– Including mate pair data (~2kb insert size)

Working on different methods to improve our catalogue of ~30bp to 2kb events & incorporating different callers

Assigning error modes for “failed” SNPs
– Many look like cell line mutations & alignment errors

Comparing our call set to other datasets to assess accuracy and completeness
– Other GIAB call sets
– Fosmid data (Jaffe & Kidd)

Page 14: Aug2013 illumina platinum genomes

Acknowledgements

Illumina: Morten Kallberg, Xiaoyu Chen, Han-Yu Chuang, Phil Tedder, Sean Humphray, Elliott Margulies, David Bentley

Oxford: Zamin Iqbal, Gil McVean

This data and more available at www.platinumgenomes.org