snp discovery in deer (cervus elaphus) using the illumina genome analyser … · 2010. 2. 4. ·...

John McEwan AgResearch PAG Jan 2010 SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser IIx


Page 1: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

John McEwan, AgResearch

PAG Jan 2010

SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser IIx


Summary

• 4.1M SNPs

• 8 lanes

• ~1 cent/SNP

• 9X with 7 animals

• 100bp paired-end reads (PER)

• Sufficient for SNP chip


Deer SNPs… lessons learned

• Illumina GA IIx, 100bp PER, ~500bp insert, 3Gbp x 7 animals

• Select animals to span genetic diversity

• 1 flow cell 7 lanes

– WGS … more even coverage

– 100bp reads > match to related genome

– 8X coverage … >98% of sequence at a depth of 4 or greater

– Low coverage SNPs: vital to track read source (animal)

– Better info on flanking sequence

– PER = better assembly (by simulation)

– Forms basis for draft sequence of a genome

– Sheep ~$2M in 2007 for 3X; ~$50K in 2009 for 9X

– Started Sept 2009, sequenced late Oct with Illumina


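Sequencing

Animals: Wob 1, War 1, Red 1, Eas 1 M, Elk 1, Elk 2, Hun 1

Labels on the diagram: 1x (six times) and 2x (once); pairing with animals not recoverable

Pipeline: Sequencing → Repeat mask → Blast UMD3 → Assemble with Velvet → Meld against bovine scaffold → Detect SNPs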


Sequence

• 8 lanes

• 100bp PER

• 284.3M reads

• 28.4Gbp

• High % full length

• Not trimmed
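
The yield and fold coverage quoted on the slides can be checked with a few lines of arithmetic. This is a minimal sketch: the read count and read length come from the slide, while the ~3 Gbp genome size is an assumption made here for illustration, not a measured value for deer.

```python
# Figures from the slide.
N_READS = 284.3e6          # total reads across the 8 lanes
READ_LENGTH_BP = 100       # read length in bp

# Assumption for illustration only: a ruminant-sized genome of ~3 Gbp.
ASSUMED_GENOME_BP = 3.0e9


def total_bases(n_reads: float, read_length_bp: int) -> float:
    """Total sequenced bases, assuming every read is full length."""
    return n_reads * read_length_bp


def fold_coverage(bases: float, genome_bp: float) -> float:
    """Average fold coverage of a genome of the given size."""
    return bases / genome_bp


yield_gbp = total_bases(N_READS, READ_LENGTH_BP) / 1e9
coverage = fold_coverage(total_bases(N_READS, READ_LENGTH_BP), ASSUMED_GENOME_BP)
print(f"yield ~{yield_gbp:.1f} Gbp, coverage ~{coverage:.1f}X")
```

Under that genome-size assumption the result lands between 9X and 10X, in line with the "9X with 7 animals" summary figure.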


Masking

• Used Repeatmasker

• Used Ruminantia db

• Supplemented with:

– >10 identical reads assembly

– Multiple blast hit assembly

– Sped up sequence matching

– Greatly reduced output size

• Optimal sensitivity settings for masking and for mapping need to be different!
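
The ">10 identical reads" supplement to the repeat library can be pictured as a simple counting step. The sketch below is an illustration only: the function name `overrepresented_reads` and the exact-match definition of "identical" are assumptions, and the slides do not say how the selected reads were then assembled.

```python
from collections import Counter


def overrepresented_reads(reads, min_copies=11):
    """Return sequences seen at least `min_copies` times (i.e. >10 copies).

    Hypothetical helper: 'identical' is taken to mean an exact string match.
    """
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_copies}


# Toy input: one sequence with 11 copies, one with 10, one singleton.
example_reads = ["ACGT"] * 11 + ["TTTT"] * 10 + ["GGGG"]
candidates = overrepresented_reads(example_reads)
print(candidates)
```

Only the sequence with more than 10 copies is returned; such sequences would be candidates for the supplementary repeat set.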


Mapping: deer reads to UMD3

• Used Megablast

• Options

-D 3 -t 21 -W 11 -q -3 -r 2 -G 5 -E 2 -s 56 -N 2 -F "m D" -U T

• Optimised for speed with maximal specificity & % unique hits

• ~10% more hits if Blastn W9 hits (sensitive blast) are also added

• Used unique hits, and hits where ehit1/ehit2 = 1e-20
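
The hit-selection rule can be sketched as a per-read filter on E-values. This is one reading of the slide, not the author's code: it assumes ehit1 and ehit2 are the best and second-best E-values for a read and that the 1e-20 figure acts as an upper bound on their ratio. The function name `keep_read` is hypothetical.

```python
def keep_read(evalues, max_ratio=1e-20):
    """Decide whether a read's best hit is treated as a unique placement.

    evalues: E-values of all hits for one read.
    Assumption: a read is kept if it has a single hit, or if the best
    E-value is at least 1/max_ratio times smaller than the second best.
    """
    if not evalues:
        return False
    ranked = sorted(evalues)
    if len(ranked) == 1:
        return True
    best, second = ranked[0], ranked[1]
    if second == 0.0:
        # Both E-values underflow to zero: the hits cannot be told apart.
        return False
    return best / second <= max_ratio


print(keep_read([1e-40]))         # single hit
print(keep_read([1e-45, 1e-20]))  # best hit far better than the runner-up
print(keep_read([1e-30, 1e-25]))  # two similar hits
```

A ratio test of this kind trades some sensitivity for specificity, which matches the emphasis on unique hits in the bullets above.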

[Chart: values 48, 44, 8 and 52; labels not recoverable]


Mapping Specificity

• High specificity

• P~0.0009-0.004

• Some animal diffs?


Est distance between mate pair ends

~200bp insert sizes

0.02-0.03% had mate pairs in the wrong orientation when both matched the same chromosome → the blast criteria are very specific
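
The orientation check works as a specificity test because a correctly mapped pair from a standard paired-end library should face inward on opposite strands. The sketch below shows how such a check and the insert-size estimate could be computed. The `Hit` record, its 1-based inclusive coordinates and the status labels are assumptions for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    chrom: str
    start: int   # leftmost reference coordinate (1-based, inclusive)
    end: int     # rightmost reference coordinate (inclusive)
    strand: str  # "+" or "-"


def pair_status(a: Hit, b: Hit) -> str:
    """Classify a read pair by chromosome and relative orientation."""
    if a.chrom != b.chrom:
        return "different_chromosome"
    left, right = sorted((a, b), key=lambda h: h.start)
    # Inward-facing pair: left read on "+", right read on "-".
    if left.strand == "+" and right.strand == "-":
        return "consistent"
    return "wrong_orientation"


def estimated_insert(a: Hit, b: Hit) -> int:
    """Outer distance between the two read ends on the reference."""
    return max(a.end, b.end) - min(a.start, b.start) + 1


r1 = Hit("chr1", 1000, 1099, "+")
r2 = Hit("chr1", 1101, 1200, "-")
print(pair_status(r1, r2), estimated_insert(r1, r2))
```

The fraction of same-chromosome pairs reported as `wrong_orientation` gives a rough upper bound on mis-mapping, which is how the 0.02-0.03% figure is used here.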

Page 12: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Velvet assembly criteria selection

• Varied kmer length

• N50 length

• % assembly coverage

• Non chimeric %

• cf CAP3

• Chose default kmer=31

Page 13: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Velvet assembly

• 1Mbp regions assembled

• Divide and conquer approach

• Many small contigs

• 58.6% of UMD3 length

• UMD3 is 59.5% unique!

• N50 & coverage affected by insert length

• Better for SNP oligo design

Results

Stage     Metric            Value
Start     N sequences (M)   284.3
Blast     N sequences (M)   147.3
Assembly  Contigs (M)       3.2
Assembly  Bases (Gbp)       1.562
Assembly  N50 (bp)          813
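
For reference, N50 is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and toy lengths are illustrative and are not taken from the deer assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths: total 20 bp, so the running sum first reaches 10 at the 5 bp contig.
toy_n50 = n50([2, 3, 4, 5, 6])
print(toy_n50)
```

Because N50 depends on how far contigs can extend across repeats, it is sensitive to insert length, as the Velvet slide notes.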

Page 14: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Meld Process

[Figure: deer contigs are aligned (BLAST) to the bovine reference scaffold to build the MELD sequence, e.g. CTAGTGCATGCTGCactTGCTataTGTGCtagNNgcATATTGCTGNNTGCTAT. Reused from "Figure 2. Creation of MELDed ovine sequence using bovine as a guide genome", with deer contigs in place of ovine contigs.]
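
One simple way to picture the meld step is to lay contigs onto reference coordinates and fill uncovered positions with N. The sketch below does only that. It is an assumption-laden illustration, not the actual meld procedure: it ignores indels, takes placements as already oriented to the reference strand, lets the first-placed contig win where contigs overlap, and does not attempt to reproduce the upper/lower-case convention in the figure. The function name `meld` and its inputs are hypothetical.

```python
def meld(ref_length, placements):
    """Build a reference-ordered sequence from placed contigs.

    placements: (ref_start, contig_sequence) pairs, 0-based, already
    oriented to the reference strand (hypothetical input format).
    Positions not covered by any contig are left as 'N'.
    """
    scaffold = ["N"] * ref_length
    for ref_start, seq in sorted(placements):
        for offset, base in enumerate(seq):
            pos = ref_start + offset
            # Keep the first base written at a position (first contig wins).
            if 0 <= pos < ref_length and scaffold[pos] == "N":
                scaffold[pos] = base
    return "".join(scaffold)


melded = meld(12, [(6, "TTGA"), (0, "ACGT")])
print(melded)
```

Ordering and orienting contigs against a related genome in this way is what lets many short contigs be treated as a longer, gapped sequence for SNP flank design.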

Page 15: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Meld and overall assembly stats

• Reduce contigs 34%

• Reduce length 8%

• Increase N50 26%

• 53.8% coverage of UMD3

• Optimised for SNP discovery

Page 16: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

% Assembly Refseq Coverage

• Masked Bov refseqs

• Mapped deer assembly

• ~13% not mapped

• 80% refseqs >40% unique coverage

• Seq matched 66%

• Conservative

Page 17: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

SNP detection

• SNP Detection Criteria

– Stacking: reads with the same start base collapsed

– Depth: >3 (98% of sequence) and <17 reads deep

– MAF: minor allele present in at least 2 reads

– SNP Class:

A: both alleles present in 2 or more animals

B: at least 1 allele present in 2 or more animals

C: alleles present in one animal

– SNP quality:

• discarded if 10bp flanking sequence has variants

– Previous expts get ~93% conversion rate on SNP chip
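
The stacking, depth, minor-allele and class criteria can be combined into a single per-site call. The sketch below is a minimal illustration under stated assumptions: reads are collapsed when they share animal and start position, only biallelic sites are considered, and class C is treated as "neither allele seen in 2 or more animals". The `Obs` record and function names are hypothetical, and the flanking-variant filter is not included.

```python
from collections import Counter, defaultdict
from typing import NamedTuple, Optional


class Obs(NamedTuple):
    animal: str      # source animal of the read
    read_start: int  # start position of the read on the reference
    base: str        # base the read shows at the candidate site


def collapse_stacks(observations):
    """Keep one read per (animal, read start); assumed stacking rule."""
    seen, kept = set(), []
    for obs in observations:
        key = (obs.animal, obs.read_start)
        if key not in seen:
            seen.add(key)
            kept.append(obs)
    return kept


def call_snp(observations, min_depth=4, max_depth=16, min_minor_reads=2) -> Optional[str]:
    """Return SNP class 'A', 'B' or 'C', or None if the site fails a filter."""
    reads = collapse_stacks(observations)
    if not (min_depth <= len(reads) <= max_depth):  # depth >3 and <17
        return None
    counts = Counter(o.base for o in reads)
    if len(counts) != 2:  # biallelic sites only (assumption)
        return None
    (major, _), (minor, minor_n) = counts.most_common(2)
    if minor_n < min_minor_reads:
        return None
    animals = defaultdict(set)
    for o in reads:
        animals[o.base].add(o.animal)
    n_major, n_minor = len(animals[major]), len(animals[minor])
    if n_major >= 2 and n_minor >= 2:
        return "A"
    if n_major >= 2 or n_minor >= 2:
        return "B"
    return "C"


site = [
    Obs("Elk 1", 10, "A"), Obs("Elk 1", 10, "A"),  # stacked pair, collapsed to one
    Obs("Elk 1", 20, "G"), Obs("Red 1", 15, "A"),
    Obs("Red 1", 30, "G"), Obs("Hun 1", 12, "A"),
]
print(call_snp(site))
```

Tracking the animal for every read is what makes the class assignment possible at this low per-animal coverage.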

Page 18: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Read Depth distribution at SNP calls

• ~Poisson

• Little genome bias?

Page 19: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Illumina Deer SNP Results

• A = both alleles seen in 2 deer

• SNP chip real estate

– Infinium 2 SNP: 1 probe, 50bp, no G/C or A/T

– Infinium 1 SNP: 2 probes, 50bp

• 38% removed by proximity filter

• 5% removed by depth filter

• Leaves 4.1M SNPs, ~1/349bp

• ~90% pass design (0.8 threshold)

• ~1.98M Class A Infinium 2 SNPs
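
The SNP density figure can be cross-checked against the assembly statistics from the earlier slides. This is a rough sketch: it assumes the density is expressed over the melded assembly length, which the slides do not state explicitly.

```python
# Values read off the slides.
VELVET_BASES_BP = 1.562e9      # Velvet assembly size
MELD_LENGTH_REDUCTION = 0.08   # "Reduce length 8%"
N_SNPS = 4.1e6                 # SNPs left after the proximity and depth filters

# Assumption: SNP density is reported per base of melded assembly.
melded_bp = VELVET_BASES_BP * (1 - MELD_LENGTH_REDUCTION)
bp_per_snp = melded_bp / N_SNPS
print(f"melded assembly ~{melded_bp / 1e9:.2f} Gbp, one SNP per ~{bp_per_snp:.0f} bp")
```

Under that assumption the spacing comes out close to the ~1/349bp quoted above.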

Page 20: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Estimated Minor allele frequency

• Bias to high MAF

• SNP chip results will be similar

• Average MAF =0.3

Page 21: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

SNP density across genome

[Figure: SNP number per Mbp (log scale, 1 to 10000) against position on chromosome 1 (0 to 160 Mbp); series: A/C, A/G, A/T, G/C, Total]

Page 22: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

SNP specificity

• Large % fixed differences

• Important when selecting SNPs

• Reflects estimated genetic divergence

SNP            Freq
Elk only       0.04
Europe only    0.50
Both           0.15
Fixed dif      0.30
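
The four categories in the table can be derived from which alleles are seen in each group of animals. The sketch below assumes two groups (elk and European animals) and takes the observed allele sets per group as input. The function name and the "unclassified" fallback are illustrative. At low coverage an unseen allele is not proof of absence, so a rule like this will tend to overcall fixed differences.

```python
def classify_snp(elk_alleles: set, europe_alleles: set) -> str:
    """Assign a SNP to a category from the alleles observed in each group."""
    elk_poly = len(elk_alleles) > 1
    europe_poly = len(europe_alleles) > 1
    if elk_poly and europe_poly:
        return "both"
    if elk_poly:
        return "Elk only"
    if europe_poly:
        return "Europe only"
    # Each group shows a single allele, and the two groups differ.
    if elk_alleles and europe_alleles and elk_alleles != europe_alleles:
        return "fixed dif"
    return "unclassified"


print(classify_snp({"A"}, {"A", "G"}))
print(classify_snp({"A"}, {"G"}))
```

Separating fixed differences from SNPs that segregate within a group matters when choosing markers for a chip intended for one population.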

Page 23: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Summary

• Sequenced 7 animals to ~1X coverage each

– selected to span genetic diversity

– ≥4X depth over 99% of the genome

– 100bp PER

• Used a mixture of assisted and de novo assembly

– Optimised to provide high quality sequence for SNP discovery

– Ordered and orientated contigs via related genome

• SNP calling routine

– corrects for “stacking” artifacts and repetitive regions

– traces animal origin of reads for high quality calls

• Results

– 4.1M SNPs, 2.4M class A

– Suitable to create a high density Illumina SNP array

– Cost ~1 cent/SNP identified

Page 24: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Acknowledgements

• Cindy Lawley, Illumina

• Kimberly Gietzen

• Nan Leng

• Rudi Brauning, AgResearch

• Paul Fisher, AgResearch

• Jason Archer

• Matt Bixley

• Jamie Ward

• Geoff Nicoll, Landcorp
