Target enrichment with NGS: cardiomyopathy as a case study - BMR Genomics
DESCRIPTION
Seminar on target enrichment performed with Illumina MiSeq. A description of the experiment and the output provided by the bioinformatics analyses. How to use IGV to inspect the alignments and variant calling.
TRANSCRIPT
Target Enrichment
Understanding the output
Andrea Telatin BMR Genomics
Today's menu:
• Disease research applications for TE panels
• bioinformatic analysis of the data…
• …and how to handle the output
using a cardiomyopathy panel as a test case
Why?
Technology Overview
3.3 Gb 50 Mb 0.5 Mb
Gilissen, Genome Biol 2011
Exome Seq Custom Panels
With Custom TE
Finding relevant variants
Spending less
Focus on Your Favourite Genes
Case study
Antonio Puerta
Cardiomyopathies
• Targets the most common causative SNPs for
• ARVC (Arrhythmogenic right ventricular cardiomyopathy)
• Brugada Syndrome
• Long QT
• Hypertrophic cardiomyopathy
The Panel
Cardiomyopathies
• We designed a panel for CMPD
• Platform of choice: Agilent HaloPlex
• Sequencer: Illumina MiSeq (PE 2x150)
• 56 targeted genes (165 regions)
• 500 kb target size
Output at a glance
• Sequenced 44 samples so far
• Average coverage: 232X (±36X)
• Reads on target: 99.6%
• Target > 5X: 95.6%
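The figures above can be illustrated with a minimal sketch of how a mean depth and a "target above 5X" fraction might be computed for a single sample. The per-base depths are invented toy numbers and `coverage_summary` is a hypothetical helper, not the pipeline actually used for the panel; the ±36X on the slide may well be a spread across samples rather than across bases.

```python
from statistics import mean, pstdev

def coverage_summary(depths, min_depth=5):
    """Summarise per-base depth over the target region of one sample."""
    if not depths:
        raise ValueError("no target positions given")
    # Count positions covered by MORE than min_depth reads ("Target > 5X").
    covered = sum(1 for d in depths if d > min_depth)
    return {
        "mean": mean(depths),
        "sd": pstdev(depths),
        "fraction_above_min": covered / len(depths),
    }

# Toy per-base depths for a 10 bp target (made-up numbers).
toy_depths = [0, 3, 250, 240, 230, 260, 220, 245, 235, 255]
summary = coverage_summary(toy_depths)
print(summary)
```

In practice the depths would come from the aligned reads restricted to the target regions; here they are typed in by hand to keep the example self-contained.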
How?
Bioinformatic Analysis
• Target enrichment + Library Preparation
• Sequencing
• Alignment against reference
• Local realignment
• Variant calling
• Variant annotation
• Data mining
Format: SAM
Format: VCF
Sequence alignment
This is a hard example.
That is another easy example.
This is a --hard---- example.
|| ||||| | | |||||||||
That is another easy example.
This is a-- h-ard---- example.
|| ||||| | | |||||||||
That is anothe-r easy example.
This is a hard example.------
|| ||||| | |
That is another easy example.
Gap Cost
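To make the idea of a gap cost concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch scheme). The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; real aligners use more elaborate scoring, for example separate gap-open and gap-extend penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap cost.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row/column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: best of substitution, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

total, top, bottom = needleman_wunsch("This is a hard example.",
                                      "That is another easy example.")
print(top)
print(bottom)
```

Making the gap cost more negative pushes the result towards fewer, longer mismatching stretches; making it less negative lets the aligner open gaps freely, which is the trade-off the slide illustrates.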
To discover more…
• The standard algorithms for sequence alignment are Needleman-Wunsch and Smith-Waterman
• For large sequences the standard is BLAST
• For short reads one of the most popular choices is BWA (uses BWT)
• Interesting CUDA-enabled implementations
2003 Thesis
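Since the BWT (Burrows-Wheeler transform) is mentioned, a naive sketch of the transform and its inverse may help. This is only the textbook definition via sorted rotations, assuming a `$` end marker that does not occur in the text; it is not how BWA builds or queries its index, which relies on far more efficient data structures.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, quadratic memory)."""
    if end in text:
        raise ValueError("end marker must not occur in the text")
    text += end
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, end="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The row ending with the marker is the original text plus marker.
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

print(bwt("banana"))               # last column of the sorted rotations
print(inverse_bwt(bwt("GATTACA")))
```

The transform tends to group identical characters together and is reversible, two properties that make it a useful basis for compact, searchable indexes of a reference genome.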
Sequence alignment
Chromosomes (reference)
Short reads
Challenges
• Millions of reads to be aligned
• Short reads are less likely to be “unique”
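The second point can be checked on a toy example: the shorter the read, the more often its sequence occurs at several places in the reference. The sketch below counts, for a given length k, the fraction of positions whose k-mer occurs exactly once. The reference string is invented, `unique_fraction` is a hypothetical helper, and only the forward strand is considered.

```python
from collections import Counter

def unique_fraction(reference, k):
    """Fraction of k-mer start positions whose sequence occurs exactly once."""
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    if not kmers:
        raise ValueError("k is longer than the reference")
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)

toy_reference = "ACACACGT"  # made-up sequence with a short repeat
for k in (2, 4, 6):
    print(k, unique_fraction(toy_reference, k))
```

On a real genome the same effect appears at a much larger scale: repeats and gene families leave some reads with several equally good placements.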
The SAM/BAM formats
• SAM (Sequence Alignment/Map) is a plain-text format designed for short-read alignments
• It is hard for humans to read, because it was designed for machines
• It has been a major improvement in NGS analyses
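To show what "designed for machines" means, here is a minimal sketch that splits one SAM alignment line into its 11 mandatory tab-separated fields. The record is invented for illustration, and `SamRecord` and `parse_sam_line` are hypothetical names; for real work a dedicated library is a better choice than hand parsing.

```python
from typing import NamedTuple, Tuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 10M
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII encoded)
    tags: Tuple[str, ...]  # optional fields, left unparsed

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)

def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM record has 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))

# Made-up record: first read of a properly paired fragment on chr1.
example = "read_001\t99\tchr1\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar, record.is_reverse)
```

BAM stores the same information in a compressed binary form, which is why it is the format usually loaded into viewers.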
SAM
DATA
Sequence realignment
• Sequence alignment is (mostly) done one sequence at a time
• At the end we can “rethink” the choices made while aligning, looking at the whole picture
Variants?
Errors!
• Once the alignment is “cleaned”, variant calling becomes a little easier.
• Several aspects are involved, far more than merely “counting differences”
• These aspects are complex and interesting… but we are not talking about them today!
The VCF format
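As a rough illustration of the format, the sketch below parses the eight fixed columns of one VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The record is invented and `parse_vcf_line` is a hypothetical helper; header lines, per-sample genotype columns and type conversion of INFO values are deliberately left out.

```python
def parse_vcf_line(line):
    """Split one tab-separated VCF data line into its eight fixed columns."""
    if line.startswith("#"):
        raise ValueError("header line, not a variant record")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 columns")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # Keys without '=' are flags; record them as True.
            info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }

# Made-up record: a G>A / G>T site with depth 232, flagged as known (DB).
example = "chr1\t12345\t.\tG\tA,T\t50\tPASS\tDP=232;DB"
record = parse_vcf_line(example)
print(record)
```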
Annotation
Chromosomes (reference)
Short reads
Genes/Transcripts
G>G Y>. C>W
Amino acid changes
Functional annotation
Disease database
Effect predictors
Literature links
VEP: Variant Effect Predictor
• genes and transcripts affected by the variants
• location of the variants (e.g. upstream, in coding sequence, in non-coding RNA, in regulatory regions)
• consequence of your variants on the protein sequence (e.g. stop gained, missense, stop lost, frameshift)
• known variants that match yours, and associated minor allele frequencies from the 1000 Genomes Project
• SIFT and PolyPhen scores for changes to protein sequence
ANNOVAR
ANNOVAR is an efficient tool to functionally annotate genetic variants.
• Gene-based annotation: identify whether SNPs or CNVs cause protein coding changes and the amino acids that are affected.
• Region-based annotations: identify variants in specific genomic regions, for example, conserved regions among 44 species, predicted transcription factor binding sites,…
• Filter-based annotation: identify variants that are reported in dbSNP, or identify the subset of common SNPs (MAF > 1%) in the 1000 Genomes Project, or identify the subset of non-synonymous SNPs with SIFT score > 0.05, …
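The idea behind filter-based annotation can be sketched in a few lines: look each variant up in a table of known allele frequencies and separate the common ones (MAF > 1%) from the rest. This is not ANNOVAR's interface; the variants, the frequency table and the function `split_by_frequency` are all invented for illustration.

```python
def split_by_frequency(variants, known_maf, threshold=0.01):
    """Separate variants into common (MAF > threshold) and all others.

    Variants absent from the frequency table are treated as not common.
    """
    common, other = [], []
    for variant in variants:
        maf = known_maf.get(variant)
        if maf is not None and maf > threshold:
            common.append(variant)
        else:
            other.append(variant)
    return common, other

# Variants as (chromosome, position, ref, alt) tuples; all values made up.
sample_variants = [
    ("chr1", 1000, "G", "A"),
    ("chr2", 2000, "C", "T"),
    ("chr3", 3000, "A", "G"),
]
# Hypothetical population frequencies; chr3 variant is not in the table.
known_maf = {
    ("chr1", 1000, "G", "A"): 0.25,
    ("chr2", 2000, "C", "T"): 0.002,
}
common, other = split_by_frequency(sample_variants, known_maf)
print("common:", common)
print("rare or novel:", other)
```

For a disease panel the second group is usually the interesting one, since a common variant is unlikely to cause a rare condition.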
Can open: ALIGNMENTS (BAM), ANNOTATIONS (BED), VARIANTS (VCF)
Any questions?
Summarizing!
• Target enrichment: many individuals sequenced on genes of interest
• SAM/BAM formats to store alignments
• The IGV program to visualise tracks (including alignments)
• The VCF format to store genomic variations
• Annotation programs add things to a flat file
Acknowledgments: BMR Genomics
• CEO: Barbara Simionati
• NGS Team Leader: Giorgio Malacrida
• Target Enrichment specialist: Ilena Li Mura
• Variant annotation specialist: Ivano Zara
…and everybody else there, making the whole team special.