Target Enrichment with NGS: Cardiomyopathy as a Case Study - BMR Genomics

Target Enrichment: Understanding the output

Andrea Telatin BMR Genomics

Today's menu:

• Disease research applications for TE panels

• Bioinformatic analysis of the data…

• …and how to handle the output

using a cardiomyopathy panel as a test case

Why? Technology Overview

Whole genome: 3.3 Gb, exome: 50 Mb, custom panel: 0.5 Mb

Gilissen, Genome Biol 2011

Exome Seq Custom Panels

With Custom TE:

• Finding relevant variants

• Spending less

Focus on Your Favourite Genes

Case study

Antonio Puerta

Cardiomyopathies

• Targets most common causative SNPs for

• ARVC (Arrhythmogenic right ventricular cardiomyopathy)

• Brugada Syndrome

• Long QT

• Hypertrophic cardiomyopathy

The Panel

Cardiomyopathies

• We designed a panel for CMPD

• Platform of choice: Agilent HaloPlex

• Sequencer: Illumina MiSeq (PE 2x150)

• 56 targeted genes (165 regions)

• 500 kb target size

The Panel

Output at a glance

• Sequenced 44 samples so far

• Average coverage: 232X (±36X)

• Reads on target: 99.6%

• Target covered at > 5X: 95.6%
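
The figures above are summary statistics over sequencing depth. As a rough illustration of how such numbers can be derived, here is a minimal Python sketch that assumes a plain list of per-base depths over the target. The function name `coverage_summary` and the toy depths are hypothetical and are not the panel's real data; the slide's ± value may well be a spread across samples rather than across bases.

```python
from statistics import mean, pstdev


def coverage_summary(depths, min_depth=5):
    """Summarise per-base depths over the target regions.

    depths: list of integer read depths, one per target base (assumed input).
    min_depth: a base counts as covered when its depth is strictly above this.
    """
    covered = sum(1 for d in depths if d > min_depth)
    return {
        "mean": mean(depths),
        "sd": pstdev(depths),
        "fraction_above_min": covered / len(depths),
    }


# Toy depths for 10 target bases (hypothetical, for illustration only)
toy_depths = [0, 3, 6, 10, 10, 20, 20, 30, 40, 61]
summary = coverage_summary(toy_depths)
print(summary)
```

In practice the per-base depths would come from the alignments restricted to the panel's target regions.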

How? Bioinformatic Analysis

• Target enrichment + Library Preparation

• Sequencing

• Alignment against reference

• Local realignment

• Variant calling

• Variant annotation

• Data mining

Format: SAM

Format: VCF

Sequence alignment

This is a hard example.
That is another easy example.

This is a --hard---- example.
||  |||||   | |     |||||||||
That is another easy example.

This is a-- h-ard---- example.
||  |||||   |  |     |||||||||
That is anothe-r easy example.

This is a hard example.------
||  |||||        |   |
That is another easy example.

Gap Cost

To discover more…

• The standard algorithms for sequence alignment are Needleman-Wunsch (global) and Smith-Waterman (local)

• For searching long sequences against large databases the standard is BLAST

• For short reads one of the most popular choices is BWA (uses the Burrows-Wheeler transform, BWT)

• Interesting CUDA-enabled implementations

2003 Thesis
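
To make the idea of alignment with a gap cost concrete, here is a minimal Python sketch of Needleman-Wunsch global alignment. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap cost -1); these values and the function name `needleman_wunsch` are illustrative choices, not the settings used by any read aligner.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch(
    "This is a hard example.", "That is another easy example."
)
print(top)
print(bottom)
```

Changing the `gap` value changes which of the alternative alignments wins, which is why the gap cost matters. The table has one cell per pair of positions, so this approach is too slow for millions of reads against a genome; that is where index-based tools such as BWA come in.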

Sequence alignment

Chromosomes (reference)

Short reads

Challenges

• Millions of reads to be aligned

• Short reads are less likely to be “unique”

The SAM/BAM formats

• SAM (Sequence Alignment/Map) is a plain-text format designed for short-read alignments; BAM is its compressed binary equivalent

• It is hard for humans to read, because it was designed for machines

• It has been a major improvement for NGS analyses
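
Each SAM alignment line has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The Python sketch below splits one line into those fields and decodes two bits of the FLAG. The example record and the helper name `parse_sam_line` are made up for illustration; for real files a dedicated library is a better choice than hand parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment operations, e.g. 10M
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII encoded)


def parse_sam_line(line):
    """Parse the 11 mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


# Hypothetical record, not taken from the panel data
example = "read1\t99\tchr1\t1000\t60\t10M\t=\t1100\t110\tACGTACGTAC\tIIIIIIIIII"
rec = parse_sam_line(example)

is_paired = bool(rec.flag & 0x1)     # read is part of a pair
is_reverse = bool(rec.flag & 0x10)   # read maps to the reverse strand
print(rec.rname, rec.pos, rec.cigar, is_paired, is_reverse)
```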

SAM DATA

Sequence realignment

• Sequence alignment is (mostly) done one sequence at a time

• At the end we can “rethink” the choices made while aligning, looking at the whole picture

Variants?

Errors!

• Once the alignment is “cleaned”, variant calling becomes a little easier.

• Several aspects are involved, far more than merely “counting differences”

• These aspects are complex and interesting… but we are not talking about them today!

The VCF format
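
A VCF data line is tab-separated, with eight fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. As a rough sketch, the Python below parses one such line into a dictionary. The example line and the function name `parse_vcf_line` are hypothetical; header lines starting with `#` and per-sample genotype columns are deliberately ignored here.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),                # several ALT alleles are allowed
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


# Hypothetical variant line, for illustration only
example = "chr1\t12345\t.\tG\tA\t50\tPASS\tDP=232;DB"
variant = parse_vcf_line(example)
print(variant)
```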

Annotation

Chromosomes (reference)

Short reads

Genes/Transcripts

G>G Y>. C>W

Amino acid changes

Functional annotation

Disease database

Effect predictors

Literature links
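
The amino acid changes on this slide (G>G, Y>., C>W) correspond to a synonymous change, a stop gain and a missense change. A minimal Python sketch of how an annotator could classify a single-codon change with the standard genetic code follows. The codons chosen are illustrative assumptions, `classify_codon_change` is a hypothetical helper, and stop codons are written `*` here rather than `.`.

```python
# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (amino acid change, consequence) for a codon substitution."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "stop gained"
    elif ref_aa == "*":
        kind = "stop lost"
    else:
        kind = "missense"
    return f"{ref_aa}>{alt_aa}", kind


# Illustrative codon pairs (assumed, one per case)
for ref, alt in [("GGA", "GGC"), ("TAC", "TAA"), ("TGC", "TGG")]:
    print(ref, alt, classify_codon_change(ref, alt))
```

Real annotators also need the transcript structure and strand to find the affected codon, which this sketch leaves out.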

VEP: Variant Effect Predictor

• Genes and transcripts affected by the variants

• Location of the variants (e.g. upstream, in coding sequence, in non-coding RNA, in regulatory regions)

• Consequence of your variants on the protein sequence (e.g. stop gained, missense, stop lost, frameshift)

• Known variants that match yours, and associated minor allele frequencies from the 1000 Genomes Project

• SIFT and PolyPhen scores for changes to protein sequence

ANNOVAR

ANNOVAR is an efficient tool to functionally annotate genetic variants.

• Gene-based annotation: identify whether SNPs or CNVs cause protein coding changes and the amino acids that are affected.

• Region-based annotations: identify variants in specific genomic regions, for example, conserved regions among 44 species, predicted transcription factor binding sites,…

• Filter-based annotation: identify variants that are reported in dbSNP, or identify the subset of common SNPs (MAF>1%) in the 1000 Genomes Project, or identify the subset of non-synonymous SNPs with SIFT score>0.05, …
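
Filter-based annotation boils down to splitting variants by a database value. The Python sketch below separates common from rare variants using a minor allele frequency (MAF) threshold of 1%. It works on plain dictionaries with a hypothetical `maf` key and does not call ANNOVAR itself; the variant records are invented for illustration.

```python
def split_by_maf(variants, threshold=0.01):
    """Split variants into (common, rare) by minor allele frequency.

    A variant is common when its MAF is above the threshold. Variants
    with no known frequency (maf is None) are treated as rare here.
    """
    common, rare = [], []
    for v in variants:
        maf = v.get("maf")
        if maf is not None and maf > threshold:
            common.append(v)
        else:
            rare.append(v)
    return common, rare


# Invented records, for illustration only
variants = [
    {"id": "var1", "maf": 0.25},    # frequent in the population
    {"id": "var2", "maf": 0.002},   # rare
    {"id": "var3", "maf": None},    # not found in the database
]
common, rare = split_by_maf(variants)
print([v["id"] for v in common], [v["id"] for v in rare])
```

In a disease panel, the rare and novel variants are usually the ones worth a closer look.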

IGV can open:

• Alignments (BAM)

• Annotations (BED)

• Variants (VCF)

Any questions?

Summarizing

• Target enrichment: many individuals sequenced on genes of interest

• SAM/BAM formats to store alignments

• The IGV program to visualise tracks (including alignments)

• The VCF format to store genomic variations

• Annotation programs add information (affected genes, consequences, known variants) to a flat variant file

Acknowledgments: BMR Genomics

• CEO: Barbara Simionati

• NGS Team Leader: Giorgio Malacrida

• Target Enrichment specialist: Ilena Li Mura

• Variant annotation specialist: Ivano Zara

…and everybody else there, making the whole team special.
