Page 1: Reference genome based sequence variation detectioncbsu.tc.cornell.edu/lab/doc/variation_workshop_2011_v3.pdf · 2014. 10. 31. · Reference genome based sequence variation detection

Reference genome based sequence variation detection

Computational Biology Service Unit (CBSU)

Cornell Center for Comparative and Population Genomics (3CPG)

Center for Vertebrate Genomics (CVG)

CBSU/3CPG/CVG Joint Workshop Series 

Two different data analysis strategies

• Assembly
• Alignment

De novo Assembly

ACGGTACCTAAACCGGTACCTAAACCGGA

ACGAGCAACACGGTACCTA

TACCTAAACCGGACCCGGAAAGAC

ACGGTAGCTAAACCGGTAGCTAAACCGGA

ACGAGCAACACGGTAGCTA

TAGCTAAACCGGACCCGGAAAGAC

......ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC.....

......ACGAGCAACACGGTAGCTAAACCGGACCCGGAAAGAC.....
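The two assembled sequences on this slide differ at a single base. A minimal Python sketch of that comparison follows. It assumes both contigs have the same length and are already in register; a real comparison of assemblies would need a proper alignment first. The function name `find_mismatches` is illustrative only.

```python
def find_mismatches(seq_a: str, seq_b: str):
    """Return (1-based position, base_a, base_b) for every differing column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length for this toy comparison")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# The two contigs shown on the slide (flanking dots removed).
contig_1 = "ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC"
contig_2 = "ACGAGCAACACGGTAGCTAAACCGGACCCGGAAAGAC"

print(find_mismatches(contig_1, contig_2))  # [(16, 'C', 'G')]
```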

Reference Alignment

ACGGTACCTAAACCGGTACCTAAACCGGA

ACGAGCAACACGGTACCTA

TACCTAAACCGGACCCGGAAAGAC

ACGGTAGCTAAACCGG

TAGCTAAACCGGA

ACGAGCAACACGGTAGCTA

TAGCTAAACCGGACCCGGAAAGAC

ACGGTACCTAAACCGGTACCTAAACCGGA

ACGAGCAACACGGTACCTA

TACCTAAACCGGACCCGGAAAGAC

ACGGTAGCTAAACCGGTAGCTAAACCGGA

ACGAGCAACACGGTAGCTA

TAGCTAAACCGGACCCGGAAAGAC

Reference Genome: C

Chr    Position   Ref  Coverage Depth     Genotypes                Gene
chr1   24515167   C    5     11    3      T()     C()     T()
chr1   45396856   G    13    7     9      C()     G()     C()
chr1   68417006   G    43    18    6      A()     G()     A()
chr1   90162621   A    15    99    255    M(AC)   A()     A()
chr1   90162696   G    17    134   255    G()     R(GA)   G()
chr1   90162750   C    19    108   176    Y(CT)   Y(CT)   C()
chr1   90162816   G    30    72    106    G()     K(GT)   K(GT)
chr1   90162975   G    162   48    255    G()     R(GA)   G()
chr1   90163027   C    100   6     255    C()     Y(CT)   Y(CT)
chr1   90163136   A    152   17    176    A()     R(AG)   R(AG)
chr1   90163167   C    132   25    218    C()     M(CA)   M(CA)
chr1   90163191   T    91    19    227    T()     Y(TC)   Y(TC)
chr1   90164490   A    173   16    103    A()     M(AC)   M(AC)
chr1   90164557   A    100   66    137    A()     R(AG)   A()
chr1   90164612   A    62    48    107    A()     R(AG)   R(AG)
chr1   90164677   A    88    37    64     R(AG)   A()     R(AG)
chr1   90165817   T    88    35    56     Y(TC)   Y(TC)   T()
…      …          …    …     …     …      …       …       …
…      …          …    …     …     …      …       …       …
chr17  72952985   C    23    26    31     T()     Y(TC)   T()
chr18  7355152    G    23    34    3      A()     G()     A()
chr18  7355177    A    16    29    3      C()     A()     C()
chr18  25274226   T    28    35    22     C()     Y(CT)   C()
chr18  34475963   A    25    12    25     G(KT)   R(GA)   G()
chr18  38133671   G    69    63    21     C(SG)   G()     G()
chr18  65363507   G    14    29    3      T(KG)   G()     T()
chr18  65363509   T    18    31    3      G(KT)   T()     G()
chr18  71606111   C    9     32    5      A()     C()     A()
chr19  46381078   A    8     12    6      G(RA)   A()     G()
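The genotype entries in this table appear to use IUPAC nucleotide ambiguity codes (R, Y, M, K, S) for heterozygous calls; that reading is an assumption, since the slide does not define the notation. Under that assumption, a small Python lookup shows which alleles each code stands for. The names `IUPAC`, `alleles`, and `is_heterozygous` are illustrative.

```python
# Standard IUPAC codes for one base or an unordered pair of bases.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def alleles(code: str) -> set:
    """Return the set of bases represented by a one-letter IUPAC code."""
    return IUPAC[code.upper()]


def is_heterozygous(code: str) -> bool:
    """A two-base ambiguity code is read here as a heterozygous call."""
    return len(alleles(code)) == 2


for code in ("C", "Y", "R", "M", "K"):
    print(code, sorted(alleles(code)), is_heterozygous(code))
```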

With a limited number of individuals, whole-genome/exome sequencing does not always reveal the causative mutations.

Sequence a mapping population

Reference genome based sequence variation detection

FASTQ files
    Step 1: Alignment
SAM/BAM files
    Step 2: Call SNP/INDELs
VCF file

Reference genome based sequence variation detection

Step 3: Filter SNP/INDELs

Step 4: Annotate SNP/INDELs

Reference genome based sequence variation detection

Step 1: Alignment

• BWA: Li H. and Durbin R. (2009) Bioinformatics, 25:1754-60

Step 2: Call SNP/INDELs

• SAMtools (Li H. et al. Bioinformatics, 25, 2078-9) or GATK + Picard (Broad Institute)

Reference genome based sequence variation detection

Step 3: Filtering

• GATK
• Write your own code

Step 4: Annotation

• Annovar: http://www.openbioinformatics.org/annovar/

Standard file formats

• FASTQ
• SAM/BAM
• VCF

FASTQ file:

@20F75AAXX:5:1:335:1565
ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG
+20F75AAXX:5:1:335:1565
]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R
@20F75AAXX:5:1:466:1056
GGAAGCAACAGCTAATACATGAATGGATATCGATCG
+20F75AAXX:5:1:466:1056
[]]]]][]]]Y]]]][Y[[[[[[[[[[Y[Y[YW[[[
@20F75AAXX:5:1:256:1724
GCCCAACAAAGACCGGTCACCAAAGACAGATGATTC
+20F75AAXX:5:1:256:1724
]][]][]][[[[]L[[[[][[[Z[[[[[S[[ZW[[[
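Each FASTQ record is four lines: an `@` header, the sequence, a `+` separator, and one quality character per base. The sketch below reads such records and converts the quality characters to numbers. The ASCII offset of 64 is an assumption that fits older Illumina output like the reads shown here; current Sanger-style FASTQ uses an offset of 33. The function name `read_fastq` is illustrative.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO, offset: int = 64) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of input (or a blank line) stops the loop
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, [ord(c) - offset for c in qual]


# First record from the slide.
example = io.StringIO(
    "@20F75AAXX:5:1:335:1565\n"
    "ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG\n"
    "+20F75AAXX:5:1:335:1565\n"
    "]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R\n"
)
for read_id, seq, quals in read_fastq(example):
    print(read_id, len(seq), min(quals), max(quals))
```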

SAM file:

HWI-EAS83_20F7TAAXX:1:1:379:338 16 4 157555988 25 36M * 0 0 AGAAAACTGCAAAGCACGAGTCTAGCAGATACCCTT h?DhhhLDPOhhhhhhhhhhhhhhhhhhhhhhhhhh XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:2C32G0
HWI-EAS83_20F7TAAXX:1:1:98:170 16 4 28122708 37 36M * 0 0 GCACCCTTTAACTCGGGCTAACTATCTTGCTTCACC VbINbYZh_hUhQhd\^hfhhhhhhhhhhhhhhhhh XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33G2
HWI-EAS83_20F7TAAXX:1:1:582:80 4 * 0 0 * * 0 0 ATGGCTGCCTCGCAGAATCGAAAGTTAGTGCCGCAC hfhhhhahh`hhAVhEhahQKHKQA_IIPPF@DhEV
HWI-EAS83_20F7TAAXX:1:1:169:517 16 3 170277940 25 36M * 0 0 AAAACCATATCTGCTGGAAACTCTGCTTCCACAAGC CDhKDBhDhFaGghMhahhhhPhhhhhhhhhhhhhh XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0T0C34

Information encoded in SAM file

• Sequence (forward strand of the reference genome)
• Quality score
• Alignment information (position, strand, mismatches, gaps)
• Ambiguous alignments
• Paired-end information
• Read group
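To make the SAM fields concrete, here is a minimal Python sketch that splits one alignment line into its mandatory columns and optional tags. It assumes the usual tab-delimited layout (the slide shows the columns separated by spaces) and only interprets two FLAG bits: 0x4 (read unmapped) and 0x10 (read aligned to the reverse strand). The function name `parse_sam_record` is illustrative.

```python
def parse_sam_record(line: str) -> dict:
    """Parse one SAM alignment line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    flag = int(fields[1])
    tags = {}
    for item in fields[11:]:
        name, kind, value = item.split(":", 2)  # e.g. NM:i:2
        tags[name] = int(value) if kind == "i" else value
    return {
        "qname": fields[0],
        "flag": flag,
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": tags,
    }


# First record from the slide, rebuilt with tab separators.
record = "\t".join([
    "HWI-EAS83_20F7TAAXX:1:1:379:338", "16", "4", "157555988", "25", "36M",
    "*", "0", "0",
    "AGAAAACTGCAAAGCACGAGTCTAGCAGATACCCTT",
    "h?DhhhLDPOhhhhhhhhhhhhhhhhhhhhhhhhhh",
    "XT:A:U", "NM:i:2", "MD:Z:2C32G0",
])
aln = parse_sam_record(record)
print(aln["rname"], aln["pos"], aln["is_reverse"], aln["tags"]["NM"])
```

In this record FLAG 16 means the read aligned to the reverse strand, NM:i:2 reports two mismatches, and MD:Z:2C32G0 locates them.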

BAM is a compressed SAM file

• BAM file is several times smaller than SAM;

• BAM file can be indexed and queried;

• Most software operates directly on BAM;

• BAM format can potentially replace FASTQ format.

VCF file - variant call format

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF  ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G    A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T    A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A    G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T    .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
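A VCF data line has eight fixed columns, a FORMAT column, and then one column per sample. The Python sketch below splits one such line, turns the INFO field into a dictionary, and pairs each sample column with the FORMAT keys. It assumes tab-delimited input and does no type conversion beyond POS and QUAL; the function name `parse_vcf_line` is illustrative.

```python
def parse_vcf_line(line: str, sample_names) -> dict:
    """Parse one VCF data line (not a header line) into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = cols[:9]
    info_dict = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        info_dict[key] = value if sep else True  # flags such as DB carry no value
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, cols[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
        "samples": samples,
    }


# First data line from the slide, rebuilt with tab separators.
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
rec = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
print(rec["pos"], rec["alt"], rec["info"]["DP"], rec["samples"]["NA00003"]["GT"])
```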

Alignment with BWA

Commonly used parameters:

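Alignment step (aln):

-n: maximum edit distance if given as an integer, or the fraction of missing alignments (assuming a 2% uniform base error rate) if given as a float (default 0.04)

-o: maximum number of gap opens (default 1)

Write SAM file step (samse or sampe):

-n: maximum number of alignments to report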
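As a sketch of how these parameters fit together for single-end reads, the Python helper below builds the two BWA command lines (`bwa aln`, then `bwa samse`) as argument lists without running them. All file names are hypothetical placeholders, and the value 3 for the `samse -n` option is an assumption about its default. Both commands write to standard output, so the caller would redirect it to the `.sai` and `.sam` files.

```python
def bwa_single_end_commands(ref, fastq, sai, max_diff=0.04, max_gap_opens=1, max_hits=3):
    """Return argument lists for 'bwa aln' and 'bwa samse' (nothing is executed)."""
    aln = ["bwa", "aln", "-n", str(max_diff), "-o", str(max_gap_opens), ref, fastq]
    samse = ["bwa", "samse", "-n", str(max_hits), ref, sai, fastq]
    return aln, samse


# Hypothetical file names, for illustration only.
aln_cmd, samse_cmd = bwa_single_end_commands("ref.fa", "reads.fastq", "reads.sai")
print(" ".join(aln_cmd) + " > reads.sai")
print(" ".join(samse_cmd) + " > reads.sam")
```

To actually run a command, something like `subprocess.run(aln_cmd, stdout=out_handle, check=True)` with an opened output file would be one option.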

- Converting SAM to BAM
- Index BAM

SAMtools: view; index

Picard: SamFormatConverter; BuildBamIndex

*** If you want to use the Broad GATK software to call SNPs, do not use SAMtools; always use Picard to process SAM and BAM files.

BAM files can be visualized with the IGV software.

Clean up the BAM file

• Mark possible PCR duplicates

• Base quality score recalibration

• Local realignment around indels

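** Marking PCR duplicates: among reads that appear to be copies of the same fragment, only one copy is kept. (In practice, tools such as Picard MarkDuplicates identify duplicates by identical alignment start position and strand rather than by read sequence alone.)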
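A toy Python sketch of duplicate marking follows. It assumes duplicates are reads that share chromosome, start position, and strand, and it keeps the read with the highest summed base quality in each group. This is a simplification of what tools like Picard do (they also consider mate positions and clipping); the record layout and the name `mark_duplicates` are invented for illustration.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    reads: list of dicts with keys 'name', 'chrom', 'pos', 'strand', 'qual_sum'.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])  # representative to keep
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Invented example reads.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 900},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 950},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "qual_sum": 800},
]
print(mark_duplicates(reads))  # {'r1'}
```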

Base quality score recalibration:

• Phred quality score: Q20 corresponds to a 1% error rate.

• Illumina quality scores range from 0 to 62 and need to be recalibrated to reflect the actual error rate.
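The Phred scale relates a quality score Q to an error probability P by Q = -10 * log10(P), which is where "20 corresponds to 1%" comes from. A short Python sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score implied by an error probability (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error(q))
```

So Q10 is a 10% error rate, Q20 is 1%, and Q30 is 0.1%.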

Multi‐sample SNP and INDEL calling

• Use the Unified Genotyper (GATK) or mpileup (SAMtools) to call SNPs and INDELs from multiple samples.

• Set the variant calling thresholds:
  Emission threshold: Q10 (>10x coverage), Q3 (<10x coverage)
  Confidence threshold: Q30 (>10x coverage), Q4 (<10x coverage)
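The threshold choice on this slide can be written as a small lookup. The Python sketch below assumes that ">10x" and "<10x" refer to sequencing coverage, that exactly 10x is treated like the low-coverage case (the slide does not say), and that a call below the emission threshold is dropped while a call between the two thresholds is reported as low confidence. Function names and return labels are illustrative.

```python
def calling_thresholds(mean_coverage: float) -> tuple:
    """Return (emission_threshold, confidence_threshold) as Phred-scaled values."""
    if mean_coverage > 10:
        return 10, 30
    return 3, 4  # assumption: exactly 10x is handled like low coverage


def classify_call(qual: float, mean_coverage: float) -> str:
    """Label a variant call by comparing its quality to the two thresholds."""
    emit, confident = calling_thresholds(mean_coverage)
    if qual < emit:
        return "not emitted"
    if qual < confident:
        return "emitted, low confidence"
    return "confident call"


print(classify_call(29, mean_coverage=30))
print(classify_call(29, mean_coverage=5))
```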

Filtering

• Read depth (DP)

• Allele frequency (AF)

• Number of samples with data (NS)
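For the "write your own code" route, a filter on these three INFO fields can be very short. The Python sketch below parses a VCF INFO string and applies minimum cutoffs for DP, AF, and NS. The cutoff values are arbitrary placeholders, not recommendations from the workshop, and the function names are illustrative.

```python
def parse_info(info_field: str) -> dict:
    """Turn 'NS=3;DP=14;AF=0.5;DB' into {'NS': '3', 'DP': '14', 'AF': '0.5', 'DB': True}."""
    out = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        out[key] = value if sep else True
    return out


def passes_filters(info: dict, min_dp: int = 10, min_af: float = 0.05, min_ns: int = 3) -> bool:
    """Keep a variant only if DP, AF, and NS are present and meet the cutoffs."""
    if not all(key in info for key in ("DP", "AF", "NS")):
        return False
    max_af = max(float(x) for x in info["AF"].split(","))  # one AF per ALT allele
    return (
        int(info["DP"]) >= min_dp
        and max_af >= min_af
        and int(info["NS"]) >= min_ns
    )


for field in ("NS=3;DP=14;AF=0.5;DB;H2", "NS=3;DP=11;AF=0.017"):
    print(field, passes_filters(parse_info(field)))
```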

• SAM ‐> BAM

• Flag possible PCR duplicates

• Quality score calibration

• INDEL realignment

• Call variants on multiple samples

• Filtering

SAMtools    GATK/Picard

* SAMtools mpileup has a built-in realignment tool.

** Limited filtering function. Poor documentation.

GATK Documentation: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2

SAMtools Variant Calling Documentation: http://samtools.sourceforge.net/mpileup.shtml

Practical aspects

1. Experimental Design.

2. Computational Resource at Cornell.

Whole genome sequencing vs. targeted sequencing

Target enrichment by array-based or in-solution capture technology (e.g., exome sequencing).

Whole genome sequencing vs. Genotyping by Sequencing (GBS)

[Figure: ApeKI sites in Line 1, Line 2, and Line 3]

Ed Buckler Lab (http://www.maizegenetics.net/gbs-overview)

Advantages of GBS over whole genome sequencing

1. Reduced cost by multiplexing.

2. Possible to map markers that are not in the reference genome.

To identify causative mutations in a mutant strain, it is necessary to use both sequencing and genetic linkage analysis.

Mapping and mutation identification with a pooled F2 population

[Figure: crossing scheme (parental cross X, F1, F2), with asterisks marked on the chromosomes]

Using SHOREmap for mapping and mutation identification

SHOREmap: Schneeberger K. et al. (2009) Nat Methods 6(8):550-1.
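The general idea behind mapping with a pooled F2 population is that, in a pool of mutant F2 individuals, the region linked to the causative mutation carries almost only alleles from the mutant parent. The Python sketch below illustrates that idea with windowed allele frequencies. It is a toy illustration, not SHOREmap's actual algorithm; the marker counts, the window size, and the function names are all invented.

```python
def window_allele_frequency(markers, window_size):
    """Frequency of mutant-parent alleles per fixed-size window.

    markers: iterable of (position, mutant_parent_reads, mapping_parent_reads).
    Returns {window_start: frequency}, ordered by window start.
    """
    counts = {}
    for pos, mutant, mapping in markers:
        start = (pos // window_size) * window_size
        m, total = counts.get(start, (0, 0))
        counts[start] = (m + mutant, total + mutant + mapping)
    return {start: m / total for start, (m, total) in sorted(counts.items()) if total > 0}


def candidate_window(freqs):
    """Window whose pooled reads are most enriched for mutant-parent alleles."""
    return max(freqs, key=freqs.get)


# Invented read counts at five markers along one chromosome.
markers = [(100, 5, 5), (900, 6, 4), (1500, 9, 1), (1800, 10, 0), (2500, 7, 3)]
freqs = window_allele_frequency(markers, window_size=1000)
print(freqs)
print(candidate_window(freqs))
```

Candidate causative mutations would then be sought among the new variants inside the most enriched window.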

Alternative approach: test for enrichment of new mutations

Zuryn et al. (2010) A Strategy for Direct Mapping and Identification of Mutations by Whole-Genome Sequencing. Genetics 186: 427–430

Computational Resource at Cornell

CBSU / 3CPG BioHPC Laboratory (625 Rhodes Hall)

Office hours: 1:00 to 3:00 PM every Monday.

Email [email protected] to get a BioHPC lab account.

Training workshops

• Linux for Biologists

• Programming workshop (Perl)