Variant Calling Pipeline
Erika Villa
Bioinformatics Core Facility
10/17/2018
Genome
A genome is the entire set of genetic material for an organism: the blueprint of life, containing the information needed to grow, develop, survive and reproduce.
The human genome
~3 billion base pairs of DNA across 23 pairs of chromosomes.
~20,000 protein coding genes
No individuals are genetically identical
But we are more similar than we are different
More than 99 percent of the human genome is the same in all people.
Differences in less than 1 percent of our genome account for the vast diversity in humans across the globe.
Projects that give us insight about human differences
2015: The 1000 Genomes Project found that a typical individual differs from the reference at 4.1-5 million sites (~20 million bp).
2017: dbSNP lists 324 million variants for humans
Exome
The exome is a subset of the genome composed of only exons.
Exons are the coding regions of a gene.
The exons of all our genes make up approximately 1.5% of our genome.
The exome is thought to harbor ~85% of the mutations with large effects on disease.
Some important DNA sequences lie outside the exome, in noncoding DNA, and have important biological functions such as regulating the coding regions of the genome.
Sequencing Approaches
Whole Genome (WGS), Whole Exome (WES), Target Gene Panel
Target Gene Panel
• A gene panel is a gene subset of the exome
• It contains a subset of exons for a select group of genes
• Gene Panels are useful if you need to do deep sequencing > 1000X
• Many clinical tumor tests use gene panels.
What portion of the genome do you want to sequence?
Pros and Cons of WGS vs WES
Whole Genome
• Cost: ~$1300 for 30-40X coverage
• Detectable variants: all variants possible
• Pros: sequence can better predict large structural changes, including CNVs, large indels, etc.
• Pros: more uniform coverage of the protein coding regions
Whole Exome
• Cost: ~$500 for 100X coverage
• Detectable variants: restricted to exonic regions
• In somatic/mosaic conditions you might need > 1000X coverage.
• Pros: generates less data to store and analyze
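The coverage figures above follow from simple arithmetic: mean coverage is roughly (number of sequenced bases) / (size of the target). A minimal sketch, assuming 150 bp paired-end reads (the read length and the function name are illustrative assumptions, not values from the slides); it ignores duplicates and off-target reads, so real runs need more reads than this:

```python
GENOME_BP = 3_000_000_000            # ~3 billion bp, from the slides
EXOME_BP = GENOME_BP * 15 // 1000    # exons are ~1.5% of the genome

def read_pairs_needed(target_bp, coverage, read_length=150):
    """Read pairs needed to reach a mean coverage over target_bp.

    Assumes every base of every read lands on the target (optimistic).
    """
    bases_needed = target_bp * coverage
    bases_per_pair = 2 * read_length
    # Integer ceiling division, so a partial pair counts as a whole one.
    return (bases_needed + bases_per_pair - 1) // bases_per_pair

print(f"WGS at 30X:  {read_pairs_needed(GENOME_BP, 30):,} read pairs")
print(f"WES at 100X: {read_pairs_needed(EXOME_BP, 100):,} read pairs")
```

Under these assumptions the exome needs far fewer reads even at higher depth, which is where the "less data to store and analyze" point comes from.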
Human Reference Genome
To make assertions about genetic variation we rely on a reference
Reference Genome: A representative example of a species' genetic makeup
Curated by Genome Reference Consortium (GRC)
• GRCh37/hg19: 2009, derived from thirteen anonymous volunteers from Buffalo, NY.
• GRCh38/hg38: Dec 2013, includes ALT contigs. More representative of the population.
Gaps in the reference: 2001 (150,000 gaps), 2009 (250 gaps), 2013 (12 gaps)
Build 38 was a significant ‘upgrade’, and due to its accuracy and reputation it is the ‘go to’ reference for many large scale projects
Catalogs of Human Variation
HapMap
• The International HapMap Project: SNP genotyping arrays to develop a haplotype map (HapMap) of the human genome.
1000G
• The 1000 Genomes Project sequenced > 1000 genomes in pure and admixed populations to identify variation in the human genome
ExAC
• ExAC collected the SNP and Indel calls in ~ 26K genomes/exomes and their prevalence in different populations
gnomAD
• The Genome Aggregation Database (gnomAD) is a resource that aggregates and harmonizes exome and genome sequencing data from over 120K exomes and 15K genomes.
Types of variation in Genome
• Single Nucleotide Polymorphisms (SNPs or SNVs)
• Short Insertions/Deletions (Indels)
• Large Structural Variations (SVs)
[Figure: SNVs, indels and SVs shown against a reference sequence, covering substitutions, insertions, deletions and inversions]
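The small variant types listed above can be told apart by comparing the lengths of the reference and alternate alleles. A minimal sketch, assuming VCF-style REF/ALT strings; the function name and the returned labels are illustrative, and inversions and other large structural variants need different evidence (read pairs, split reads) that this does not cover:

```python
def classify_variant(ref, alt):
    """Classify a small variant from its reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"            # single-base substitution
    if len(ref) == len(alt):
        return "substitution"   # several bases replaced, same length
    if len(ref) < len(alt):
        return "insertion"      # alternate allele is longer
    return "deletion"           # alternate allele is shorter

for ref, alt in [("C", "T"), ("A", "ATT"), ("ACT", "A"), ("AC", "GT")]:
    print(ref, alt, classify_variant(ref, alt))
```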
Some SNPs of Interest
EXAMPLES
• Non-synonymous mutations
- Results in Amino Acid change
- Affects the Protein Sequence
- Types of non-synonymous mutations
* Missense
* Nonsense: also described as stop_gained
Diseases can be driven by various types of genetic alterations
Examine Variants and understand features
Type        Codon  Amino acid
Original    GAG    Glutamic Acid
Synonymous  GAA    Glutamic Acid
Missense    GAT    Aspartic Acid
Missense    GTG    Valine
Nonsense    TAG    Stop codon
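The codon examples above can be reproduced by translating both codons with the standard genetic code and comparing the amino acids. A minimal sketch; the function name and labels are illustrative, and cases such as stop-loss or start-loss are deliberately not handled:

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(ref_codon, alt_codon):
    """Label a single-codon change as synonymous, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # stop_gained
    return "missense"       # simplification: stop-loss is not distinguished

for alt in ("GAA", "GAT", "GTG", "TAG"):
    print("GAG ->", alt, CODON_TABLE[alt], classify_codon_change("GAG", alt))
```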
Features used in biological sequence annotation
Effects that we see in variants
In a gene? In an exon? Protein coding change? http://www.sequenceontology.org/
Structural Variants: variation in the structure of an organism's chromosomes. A structural variant typically affects a sequence about 1 kb to 3 Mb long.
1 kb = 10^3 bp
1 Mb = 10^6 bp
1 Gb = 10^9 bp
Alterations in Genome
• A genetic disorder is a genetic problem caused by one or more abnormalities in the genome.
• A single-gene disorder is the result of a single mutated gene.
• Autosomal dominant disorders occur with only one mutated copy of the gene.
• Recessive disorders require that both copies be mutated.
• X-linked dominant disorders are caused by mutations in genes on the X chromosome.
• Mitochondrial inheritance, also known as maternal inheritance, applies to genes encoded by mitochondrial DNA.
Inherited Diseases
Complex Disease
• Complex diseases are caused by a combination of genetic, environmental, and lifestyle factors, most of which have not yet been identified.
• Some examples include Alzheimer's disease, scleroderma, asthma, Parkinson's disease, multiple sclerosis, osteoporosis, connective tissue diseases, kidney diseases, autoimmune diseases, etc
Somatic Mutations
Somatic Mutation Germline Mutation
Somatic mutation: An alteration in DNA that occurs after conception. Somatic mutations can occur in any of the cells of the body except the germ cells and therefore are not passed on to children. Can cause cancer or other diseases.
Somatic Disease
• Acquired diseases are caused by acquired mutations in a gene or group of genes that occur during a person's life.
• These include many cancers, as well as some forms of neurofibromatosis.
Mosaicism
• Mosaicism involves the presence of two or more populations of cells with different genotypes in one individual who has developed from a single fertilized egg.
• Intersex conditions can be caused by mosaicism where some cells in the body have XX and others XY chromosomes
• Other endogenous factors can also lead to mosaicism, including mobile elements, DNA polymerase slippage, and unbalanced chromosomal segregation.
• Exogenous factors include nicotine and UV radiation
Germline and Somatic Workflows for Variant Discovery
BICF and BioHPC
Alignment Workflow
First Step for Germline and Somatic Workflows
Alignment: BWA
Burrows-Wheeler Aligner
“BWA is carefully designed to achieve a good balance between performance and accuracy”
SE and PE reads
Difficulties: ambiguity caused by repeats and sequencing errors.
Human Reference Sequences
- GRCh37/hg19
- GRCh38
Other Organisms Reference Sequences
Available for e.g. Mouse(mm10/GRCm38)
Others not available
Alignment: Deduping
With or without UMI
Why are we so worried about sequence duplication?
• When DNA is sequenced, PCR is used to amplify the sequencing library, ensuring that only DNA with “a known adapter” is sequenced.
• Since PCR has a small error rate, “early errors” can be amplified and could skew your results
• We remove duplicates to remove potential noise.
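The idea behind deduping, with or without UMIs, can be sketched as grouping reads that look like copies of the same original fragment and keeping one per group. This is a simplified illustration, not how the pipeline's tools work internally: the `Read` record, its fields and the "keep the highest mapping quality" rule are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical, minimal read record used only for this illustration.
Read = namedtuple("Read", "name chrom pos strand umi mapq")

def mark_duplicates(reads, use_umi=True):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, start position and strand (and UMI, when
    use_umi is True) are treated as copies of one fragment; the copy with
    the highest mapping quality is kept and the rest are flagged.
    """
    best = {}
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi if use_umi else None)
        kept = best.get(key)
        if kept is None:
            best[key] = read
        elif read.mapq > kept.mapq:
            duplicates.add(kept.name)
            best[key] = read
        else:
            duplicates.add(read.name)
    return duplicates

reads = [
    Read("r1", "chr1", 1000, "+", "AACG", 60),
    Read("r2", "chr1", 1000, "+", "AACG", 50),  # same position and UMI as r1
    Read("r3", "chr1", 1000, "+", "TTGC", 60),  # same position, different UMI
    Read("r4", "chr2", 500, "-", "AACG", 60),
]
print("with UMI:   ", sorted(mark_duplicates(reads, use_umi=True)))
print("without UMI:", sorted(mark_duplicates(reads, use_umi=False)))
```

With UMIs, r3 is kept as an independent molecule that happens to start at the same position; without them it is indistinguishable from a PCR copy and gets flagged.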
Alignment: Indel Realignment
• Why does GATK need Indel Realignment?
• Sometimes, alignment algorithms align reads inconsistently, adding the alignment gaps to different places.
• Indel Realignment uses “known” gold standard indels to realign these gaps
Alignment Workflow: Recalibration
• Why does GATK need Base Recalibration?
• Every base has a quality score, and variant callers rely on these scores
• Quality scores are prone to different types of biases
• Base recalibration detects systematic errors made by the sequencer when it estimates the quality score of each base call
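To make the quality scores above concrete: a Phred score Q encodes the estimated probability P that a base call is wrong, with Q = -10 · log10(P), and FASTQ files store each score as a single character. A minimal sketch, assuming the common Phred+33 encoding; the function names are illustrative. It shows what the scores mean, not how recalibration adjusts them:

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_string]

for q in decode_qualities("II5#"):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

A systematic bias in these scores (for example, Q30 bases that are actually wrong more often than 1 in 1000) is what base recalibration tries to detect and correct.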
Germline Workflow
Germline Union VCF
Variant Callers
• Strelka2: https://github.com/Illumina/strelka
– Sangtae Kim, Konrad Scheffler, Aaron L Halpern, Mitchell A Bekritsky, Eunho Noh, Morten Källberg, Xiaoyu Chen, Doruk Beyter, Peter Krusche, Christopher T Saunders. Strelka2: Fast and accurate variant calling for clinical sequencing applications.
• Speedseq: https://github.com/hall-lab/speedseq
– Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015 Oct;12(10):966-8.
• Platypus: http://www.well.ox.ac.uk/platypus
– Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SR; WGS500 Consortium, Wilkie AO, McVean G, Lunter G. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet. 2014 Aug;46(8):912-8.
• GATK: https://software.broadinstitute.org/gatk/
– Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, DePristo MA. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1-33.
Recommended Filtering for Germline Testing
• Depth > 20
• LOF or Missense (Coding Changes)
• Alt Read Ct > 3
• Mutation Allele Frequency (MAF) > 0.10
• If novel:
- Called by 2+ callers
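The germline thresholds above can be written as a single predicate over an annotated variant. A minimal sketch: the dictionary keys (`depth`, `effect`, `alt_read_count`, `maf`, `novel`, `callers`) are hypothetical names chosen for the example and do not correspond to fields in the workflow's output files.

```python
CODING_EFFECTS = {"LOF", "missense"}

def passes_germline_filter(variant):
    """Apply the recommended germline thresholds to one variant record (a dict)."""
    if variant["depth"] <= 20:
        return False
    if variant["effect"] not in CODING_EFFECTS:
        return False
    if variant["alt_read_count"] <= 3:
        return False
    if variant["maf"] <= 0.10:
        return False
    # Novel variants need support from at least two callers.
    if variant["novel"] and len(variant["callers"]) < 2:
        return False
    return True

known = {"depth": 45, "effect": "missense", "alt_read_count": 20,
         "maf": 0.44, "novel": False, "callers": ["gatk"]}
novel_single_caller = dict(known, novel=True)
print(passes_germline_filter(known))                # True
print(passes_germline_filter(novel_single_caller))  # False
```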
Important Terminology to understand
Different tumor cells can show distinct morphological and phenotypic profiles, e.g. cell morphology and gene expression
Somatic Workflow
Somatic Variant Callers
• Shimmer: https://github.com/nhansen/Shimmer
– Hansen NF, Gartner JJ, Mei L, Samuels Y, Mullikin JC. Shimmer: detection of genetic alterations in tumors using next-generation sequence data. Bioinformatics. 2013 Jun 15;29(12):1498-503.
• Virmid: https://sourceforge.net/projects/virmid/
– Kim S, Jeong K, Bhutani K, Lee J, Patel A, Scott E, Nam H, Lee H, Gleeson JG, Bafna V. Virmid: accurate detection of somatic mutations with sample impurity inference. Genome Biol. 2013 Aug 29;14(8):R90
• VarScan: http://dkoboldt.github.io/varscan/
– Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012 Mar;22(3):568-76.
• Speedseq: https://github.com/hall-lab/speedseq
– Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015 Oct;12(10):966-8.
• MuTect: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php
– http://archive.broadinstitute.org/cancer/cga/mutect
Recommended Filtering for Somatic Mutations
• Depth > 20
• LOF or Missense (Coding Changes)
• MAF(Normal) * 5 < MAF(Tumor)
• In COSMIC in > 5 subjects
- Tumor: Alt Read Ct > 3
- Tumor: MAF > 0.01
• Others
- Tumor: Alt Read Ct > 8
- Tumor: MAF > 0.05
- Tumor: Called by 2+ Callers
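The somatic rules above have two tiers: variants seen in COSMIC in more than 5 subjects get looser tumor thresholds, and everything else gets stricter ones. A minimal sketch of that logic; the dictionary keys are hypothetical names for the example, and reading "Depth" as a single per-variant depth value is an assumption.

```python
CODING_EFFECTS = {"LOF", "missense"}

def passes_somatic_filter(variant):
    """Apply the recommended somatic thresholds to one tumor/normal variant record."""
    if variant["depth"] <= 20:
        return False
    if variant["effect"] not in CODING_EFFECTS:
        return False
    # The tumor allele frequency must exceed five times the normal one.
    if variant["normal_maf"] * 5 >= variant["tumor_maf"]:
        return False
    if variant["cosmic_subjects"] > 5:
        # Previously reported in COSMIC: looser thresholds.
        return variant["tumor_alt_read_count"] > 3 and variant["tumor_maf"] > 0.01
    # All other variants: stricter thresholds plus agreement between callers.
    return (variant["tumor_alt_read_count"] > 8
            and variant["tumor_maf"] > 0.05
            and len(variant["callers"]) >= 2)

hotspot = {"depth": 200, "effect": "missense", "normal_maf": 0.0,
           "tumor_maf": 0.03, "tumor_alt_read_count": 6,
           "cosmic_subjects": 120, "callers": ["mutect"]}
print(passes_somatic_filter(hotspot))                             # True
print(passes_somatic_filter(dict(hotspot, cosmic_subjects=0)))    # False
```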
Annotations
• ClinVar - ClinVar is a freely accessible, public archive that aggregates information about genomic variation and its relationship to human health.
• GWAS Catalog - The GWAS Catalog is a quality-controlled, manually curated, literature-derived collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10^-5.
• Decipher - The DECIPHER database contains data from 27,302 patients who have given consent to broad data sharing; DECIPHER also supports more limited sharing via consortia. Used by the clinical community to share and compare phenotypic and genotypic data
Disease Studies
Cancer Datasets and Annotation
• Clinical Interpretation of Variants in Cancer (CIViC)
• Catalogue of Somatic Mutations in Cancer (COSMIC)
- Gene Fusions
- Gene Census
- Curated Genes
- Drug Resistance (so far 9 genes)
- Genome Wide Screens
• The Cancer Genome Atlas (TCGA)
- Tons of data: RNASeq, CNV, WES, WGS, etc.
Astrocyte - BioHPC Workflow Platform
astrocyte.biohpc.swmed.edu
or
portal.biohpc.swmed.edu: Cloud Services -> Astrocyte Workflow Platform
Standardized Workflow
Simple Web Forms
Online documentation & results visualization
Workflows run on HPC cluster without developer or user needing cluster knowledge
Bioinformatics Core Facility (BICF)
BICF provides bioinformatics, statistical and data management support to researchers on campus.
BICF functions as a conduit between bioinformatics research programs and the clinical and basic science research community at UTSW.
Please email [email protected] with questions or comments about the workflow.
Create New Project
Add Data To Your Project
Adding Data To Your Project
#For NGS experiments, this is recommended
Data to Import
• Design File: tab-delimited .txt file with sample names, Family/Group names, fastq file names
• Fastq Files: One or two fastq files per sample
• Capture Bed file: tab delimited file with target capture region in bed format. (Must contain at least 3 columns specifying chromosome, chromosome start position and chromosome end position)
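The three-column requirement for the capture BED file is easy to check before uploading. A minimal sketch (the function names are illustrative): it relies on BED coordinates being 0-based and half-open, so an interval's length is end minus start, and it treats an empty interval as an error, which is an assumption about capture regions rather than a BED rule. Overlapping intervals are not merged, so the total can overcount.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, checking the first three columns."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected at least 3 tab-separated columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f"line {number}: invalid interval {start}-{end}")
        yield chrom, start, end

def total_target_bp(intervals):
    """Sum interval lengths (BED is 0-based, half-open, so length = end - start)."""
    return sum(end - start for _, start, end in intervals)

example = ["chr1\t100\t200\tgeneA_exon1\n", "chr1\t300\t450\n"]
print(total_target_bp(parse_bed(example)), "bp targeted")
```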
Make A Design File
FamilyID: This ID will be used to call samples in batch
SampleID: This ID will be used to name all workflow produced files. E.g. S0001 will produce S0001.bam
FqR1: Name of the fastq file (not full path)
FqR2: Name of the fastq file R2 (not full path)
Rules for Making Design File
• Use tab as delimiter
- Excel: save as “Text (tab delimited)”
• If no SubjectID, use the same number/character for all rows
• If no FqR2, leave it empty
• For all contents, no “-”
• For all contents, no spaces
• Column names MUST be exactly the same as documented
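The design file rules above can be checked with a few lines of code before uploading. A minimal sketch, assuming the four columns named on the previous slide (FamilyID, SampleID, FqR1, FqR2); the workflow's own documentation may require additional columns, and the function name is illustrative.

```python
import csv
import io

# Columns taken from the "Make A Design File" slide; the documented set may be larger.
REQUIRED_COLUMNS = ["FamilyID", "SampleID", "FqR1", "FqR2"]

def validate_design_file(text):
    """Return a list of problems found in a tab-delimited design file."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    for column in REQUIRED_COLUMNS:
        if column not in header:
            problems.append(f"missing column: {column}")
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for column in header:
            value = row.get(column) or ""
            if "-" in value or " " in value:
                problems.append(f"row {row_number}, {column}: '-' and spaces are not allowed")
    return problems

good = "FamilyID\tSampleID\tFqR1\tFqR2\nF1\tS0001\tS0001.R1.fastq.gz\tS0001.R2.fastq.gz\n"
bad = "FamilyID\tSampleID\tFqR1\nF1\tS-0001\tS 0001.R1.fastq.gz\n"
print(validate_design_file(good))
for problem in validate_design_file(bad):
    print(problem)
```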
Run Workflow in this Project
Select your data files, set up the workflow and submit
Files selected in the example: GM12878.R1.fastq.gz, GM12878.R2.fastq.gz, mydesignfile.txt, mycapturefile.bed
Project is Queued/Running/Complete
Keep Trying: My first attempt below
Make sure you have all the appropriate files selected
BICF Help Desk: Email: [email protected] Hours: 10-11am Daily Location: E4.380
Timeline of Germline workflow
One Sample
Key Files for Germline Pipeline
• VCF file: SNPs/Indels for each sample
• SampleID.germline.vcf.gz
• Coverage Histogram for each sample
• SampleID.coverage_histogram.png
• Cumulative Distribution Plot for all samples
• coverage_cdf.png
• QC for all samples
• SampleID.sequence.stats.txt
• Structural Variants (unfiltered)
• SampleID.sssv.sv.vcf.gz.annot.txt
• Copy Number for each sample
• SampleID.cnvcalls.txt
Key Files Somatic Mutation Pipeline
• VCF file — SNPs/Indels for each sample
• FamilyID.somatic.vcf.gz
• Match Check File
• FamilyID_matched.txt
• QC for tumor normal pairs
• FamilyID.sequence.stats.txt
BAM files can be viewed on http://newbam.iobio.io/
Reference: same as analysis reference
VCF files can be viewed on http://vcf.iobio.io
Thank you
Questions?