solanaceae comparative genomics prl lunchtime seminar 2009

Brett Whitty, Buell Lab April 10, 2013

Page 1: Solanaceae comparative genomics   prl lunchtime seminar 2009

Brett Whitty, Buell Lab

April 10, 2013

Page 2: Solanaceae comparative genomics   prl lunchtime seminar 2009

The Solanaceae

• Capsicum annuum
• Nicotiana benthamiana
• Nicotiana langsdorffii x N. sanderae
• Nicotiana tabacum
• Solanum tuberosum
• Solanum lycopersicum
• Petunia x hybrida
• Solanum melongena
• …and more

Tobacco genome is ~4,500Mb, ~1Gb euchromatin

• Sequenced at NCSU, funded by Philip Morris USA ($17.6M) 2004
• Methyl-filtration strategy used to sequence gene-rich regions
• 90% coverage of coding regions (theoretical)
• 856Mb (in 953,214 assemblies) of m-f reads released late 2008
• http://www.pngg.org/tgi/

Tomato genome is ~950Mb, ~220Mb euchromatin

• International genome sequencing project started in 2004
• Target is 12 finished chromosomes
• Sequencing is 41% complete
• U.S. effort (chr. 1 & 10) is currently unfunded
• http://www.sgn.cornell.edu/about/tomato_sequencing.pl

Potato genome is ~840Mb, ~220Mb euchromatin

• International genome sequencing consortium formed in 2006
• Target is 12 finished chromosomes
• Sequencing is 20% complete
• Our lab has been working on chromosome 6
• http://www.potatogenome.net

Page 3: Solanaceae comparative genomics   prl lunchtime seminar 2009
Page 4: Solanaceae comparative genomics   prl lunchtime seminar 2009
Page 5: Solanaceae comparative genomics   prl lunchtime seminar 2009

• An integrated resource for publicly available sequence data for the Solanaceae

• Leverage partial genomic and transcriptomic sequence data to provide bioinformatics tools and data that add value to, and improve the usability of, the available data for the Solanaceae community

• Provide consistent annotation of sequence data

• Provide comparative bioinformatics analyses and displays

http://solanaceae.plantbiology.msu.edu

Page 6: Solanaceae comparative genomics   prl lunchtime seminar 2009

Solanaceae Genomic Sequence Resources

We retrieve any new Solanaceae BAC sequences from GenBank on a weekly basis

This includes sequences from our Potato chr. 6 sequencing project submitted by the sequencing center

We purposefully rely on public sequence databases as the primary repository for sequence data to support and encourage data accessibility
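
The weekly GenBank check described above could be sketched roughly as follows. This only builds an NCBI E-utilities ESearch URL for records published in the last 7 days; the search term (`Solanaceae[Organism] AND BAC[Title]`), the `retmax` value and the function name are assumptions for illustration, not the query the resource actually used.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def weekly_bac_query_url(taxon="Solanaceae", days=7, retmax=500):
    """Build an ESearch URL for nucleotide records published in the last `days` days."""
    params = {
        "db": "nuccore",
        # Assumed search term; the real selection criteria are not given in the slides.
        "term": f"{taxon}[Organism] AND BAC[Title]",
        "reldate": days,      # restrict to the last N days
        "datetype": "pdat",   # ...measured by publication date
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(weekly_bac_query_url())
```

Fetching the returned identifiers and downloading the records would be a separate step, run on a weekly schedule.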

Page 7: Solanaceae comparative genomics   prl lunchtime seminar 2009

[Figure: "Number of Solanaceae BACs in Resource Databases by Species"; series: Release 1, Release 2, Release 3; y-axis: 0 to 1,200]

Page 8: Solanaceae comparative genomics   prl lunchtime seminar 2009

[Figure: "Total Length (in Mbp) of Solanaceae BAC Sequence by Species"; series: Release 1, Release 2, Release 3; y-axis: Megabases, 0 to 130]

Page 9: Solanaceae comparative genomics   prl lunchtime seminar 2009

A Brief History of TIGR Gene Indices/TAs

TIGR Gene Indices

• 2005: John Quackenbush leaves TIGR → Harvard Gene Indices
• 2006: Plant group creates TIGR TAs → TIGR Plant TAs
• 2007: Robin Buell leaves JCVI

Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP. 2007. The TIGR Plant Transcript Assemblies database. Nucleic Acids Res 35:D846-51.

Page 10: Solanaceae comparative genomics   prl lunchtime seminar 2009

PlantGDB-assembled Putative Unique Transcripts (PUTs)

http://www.plantgdb.org/prj/ESTCluster/

Goal of assembly is to provide the closest approximation of a representative transcript set

Available for any plant species with >10,000 ESTs in GenBank; builds for species with <10k ESTs are done on request

Currently 11 Sol species have PUTs

Page 11: Solanaceae comparative genomics   prl lunchtime seminar 2009

[Figure: "Number of Solanaceae Transcript Assemblies in Resource Databases by Species"; y-axis: 0 to 120,000; bar labels in chart order: 15,278, 18,037, 6,791, 7,612, 114,191, 9,884, 7,110, 4,024, 48,945, 3,718, 70,344]

Page 12: Solanaceae comparative genomics   prl lunchtime seminar 2009

[Figure: "Length (in Mbp) of Solanaceae Transcript Assembly Sequence by Species"; y-axis: Megabases, 0 to 80; bar labels in chart order: 8.1, 11.6, 3.4, 3.1, 71.7, 5.3, 5.6, 2.6, 33.9, 1.9, 49.9]

Page 13: Solanaceae comparative genomics   prl lunchtime seminar 2009
Page 14: Solanaceae comparative genomics   prl lunchtime seminar 2009
Page 15: Solanaceae comparative genomics   prl lunchtime seminar 2009

Annotation of Solanaceae BACs

Our annotation pipeline is run on all Solanaceae genomic sequences publicly available in GenBank

The MAKER gene annotation pipeline software is used to produce gene models by incorporating transcript and protein evidence with ab initio gene finder predictions; these supplement any gene models already annotated on the assemblies in the public data, where present

Total Sol BACs   Total Length (bp)   Public Models*   MAKER Models
1,810            210,050,443         1,135            29,234

*number of models annotated in BAC GenBank records; does not include public models released by ITAG through SGN

Page 16: Solanaceae comparative genomics   prl lunchtime seminar 2009

Annotation of Solanaceae BACs (2)

Other computational analyses are performed, including:

• Alignment of PlantGDB-assembled Solanaceae transcripts (PUTs) to the genomic sequence using exonerate

• Alignment of UniProt's SwissProt & UniRef protein databases to the genomic sequence using exonerate

• BLASTP of Solanaceae gene models against model dicot proteomes (Arabidopsis, Grape, Medicago, Poplar)

• InterProScan search on the models to identify functional domains

• Repeat feature prediction (using RepeatMasker)

• ncRNA feature prediction (using tRNAscan-SE and RNAmmer)

• …and additional computational analyses

Page 17: Solanaceae comparative genomics   prl lunchtime seminar 2009
Page 18: Solanaceae comparative genomics   prl lunchtime seminar 2009

Model Dicot vs. Solanaceae Comparative Genome Browsers

We have created browsers for the public genome releases of Arabidopsis (TAIR8), Grape (v1) and Poplar (v1.1) using the Generic Genome Browser (GBrowse)

Browser tracks:

• Model genome public annotation (gene models, repeat regions, etc.)

• All Solanaceae PUTs (11 species) aligned to the genomic sequence using exonerate’s est2genome model with a cutoff of 70% identity/70% of the length of the PUT

• All Solanaceae PUTs aligned to the model genome’s proteome using TBLASTN with a cutoff of 70% identity/70% of the length of the PUT; alignments are displayed relative to the position of each gene model in the genome
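
As a rough illustration of the 70% identity/70% of PUT length cutoff, a filter might look like the sketch below. The function name, argument names and example alignments are hypothetical; how identity and aligned length are actually taken from the exonerate or TBLASTN output is not stated in the slides.

```python
def passes_cutoff(percent_identity, aligned_length, put_length,
                  min_identity=70.0, min_coverage=70.0):
    """Return True if an alignment meets both the identity and PUT-coverage cutoffs."""
    if put_length <= 0:
        raise ValueError("put_length must be positive")
    # Coverage = share of the PUT's length included in the alignment, in percent.
    coverage = 100.0 * aligned_length / put_length
    return percent_identity >= min_identity and coverage >= min_coverage


# Made-up alignments: (PUT id, % identity, aligned length, PUT length)
alignments = [
    ("PUT-1", 92.5, 800, 1000),   # passes both cutoffs
    ("PUT-2", 85.0, 300, 1000),   # fails: only 30% of the PUT is aligned
    ("PUT-3", 65.0, 950, 1000),   # fails: identity below 70%
]
kept = [name for name, ident, aln, length in alignments
        if passes_cutoff(ident, aln, length)]
print(kept)  # ['PUT-1']
```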

Page 19: Solanaceae comparative genomics   prl lunchtime seminar 2009
Page 20: Solanaceae comparative genomics   prl lunchtime seminar 2009

Comparative Mapping to Model Dicot Genomes by BLAST Best Hit

PUTs were mapped to Arabidopsis, Grape and Poplar genes by best TBLASTX hit with an E-value cutoff of 1e-10

(# hit / % hit = number / percentage of PUTs w/BLAST hit)

                                          Arabidopsis        Grape              Poplar
PUTs Species                 Total PUTs   # hit    % hit     # hit    % hit     # hit    % hit
Capsicum annuum              15,278       10,292   67.36%    10,481   68.60%    10,589   69.31%
Nicotiana benthamiana        18,037       10,644   59.01%    10,884   60.34%    11,036   61.19%
N. langsdorffii x sanderae   6,791        4,026    59.28%    4,032    59.37%    4,155    61.18%
Nicotiana sylvestris         7,612        4,743    62.31%    4,917    64.60%    4,965    65.23%
Nicotiana tabacum            89,461       35,736   39.95%    37,546   41.97%    37,866   42.33%
Petunia x hybrida            9,884        6,271    63.45%    6,405    64.80%    6,500    65.76%
Solanum chacoense            7,110        5,038    70.86%    5,062    71.20%    5,163    72.62%
Solanum habrochaites         4,024        3,214    79.87%    3,255    80.89%    3,271    81.29%
Solanum lycopersicum         48,945       34,275   70.03%    34,855   71.21%    35,134   71.78%
Solanum pennellii            3,718        2,676    71.97%    2,732    73.48%    2,747    73.88%
Solanum tuberosum            70,344       45,125   64.15%    46,376   65.93%    46,993   66.80%
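
A minimal sketch of best-hit selection at the 1e-10 cutoff, assuming the TBLASTX results are in the standard 12-column BLAST tabular format (E-value in column 11, bit score in column 12). The tie-break on bit score, the function name and the example records are assumptions, not taken from the slides.

```python
def best_hits(tabular_lines, max_evalue=1e-10):
    """Map each query to its best hit: lowest E-value, ties broken by higher bit score."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # not significant at the chosen cutoff
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up records for two PUTs against hypothetical genes.
example = [
    "PUT-1\tgene_A\t80.0\t200\t40\t0\t1\t600\t1\t200\t1e-50\t180",
    "PUT-1\tgene_B\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-12\t70",
    "PUT-2\tgene_C\t40.0\t90\t54\t0\t1\t270\t1\t90\t1e-04\t45",
]
hits = best_hits(example)
total_puts = 2
print(hits)
print(f"{100.0 * len(hits) / total_puts:.2f}% of PUTs with a hit")
```

The "% w/BLAST hit" columns in the table correspond to the last line: PUTs with at least one hit passing the cutoff, divided by total PUTs for the species.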

Page 21: Solanaceae comparative genomics   prl lunchtime seminar 2009

[Figure: "Percentage of Solanaceae PUTs with Arabidopsis, Grape and Poplar TBLASTX Hits (at E <= 1e-10)"; y-axis: 0% to 100%; series: Arabidopsis, Grape, Poplar]

Page 22: Solanaceae comparative genomics   prl lunchtime seminar 2009

Lineage-Specific Transcript Assemblies

Starting from all Solanaceae PUTs; after each search, only PUTs with no significant hits at E <= 1e-5 go on to the next step:

1. TBLASTX vs. Arabidopsis Genome (TAIR8)
2. TBLASTX vs. Grape Genome (v1)
3. TBLASTX vs. Poplar Genome (v1.1)
4. TBLASTX vs. All PUTs Excluding Solanaceae
5. BLASTX vs. Non-Solanaceae UniProt UniRef100

Result: Putative Lineage-Specific Transcript Assemblies (no significant hits at E <= 1e-5)
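
The successive filtering can be sketched as repeated set subtraction. This is an illustration only: it assumes the best E-value per PUT has already been collected for each database, and the function name, data layout and example values are hypothetical.

```python
def lineage_specific(all_puts, hits_by_database, max_evalue=1e-5):
    """Drop PUTs with a significant hit in any database; return survivors and per-stage counts.

    hits_by_database is an ordered list of (database name, {put_id: best E-value}).
    """
    remaining = set(all_puts)
    counts = []
    for db_name, best_evalue in hits_by_database:
        # A PUT is removed if its best hit in this database is significant.
        matched = {p for p in remaining
                   if best_evalue.get(p, float("inf")) <= max_evalue}
        remaining -= matched
        counts.append((db_name, len(remaining)))
    return remaining, counts


# Made-up example with four PUTs and the five searches, in order.
puts = ["p1", "p2", "p3", "p4"]
stages = [
    ("Arabidopsis", {"p1": 1e-30}),
    ("Grape", {"p2": 1e-8, "p3": 1e-3}),
    ("Poplar", {}),
    ("non-Solanaceae PUTs", {"p3": 1e-6}),
    ("non-Solanaceae UniRef100", {}),
]
survivors, per_stage = lineage_specific(puts, stages)
print(sorted(survivors))  # ['p4']
print(per_stage)
```

Because each stage only removes PUTs, the final set does not depend on the order of the searches; the order only affects how many sequences each later search has to process.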

Page 23: Solanaceae comparative genomics   prl lunchtime seminar 2009

Lineage-Specific Transcript Assemblies (2)

                                          Putative Lineage   PUT Length         With ESTScan
                                          Specific PUTs      >200bp             Translations
PUT Species                  Total PUTs   #        %         #        %         #        %
Capsicum annuum              15,278       3,262    21.4%     2,648    17.3%     2,012    13.2%
Nicotiana benthamiana        18,037       5,518    30.6%     4,223    23.4%     3,381    18.7%
N. langsdorffii x sanderae   6,791        2,049    30.2%     1,455    21.4%     1,124    16.6%
Nicotiana sylvestris         7,612        1,937    25.4%     1,544    20.3%     1,458    19.2%
Nicotiana tabacum            89,461       42,102   47.1%     35,060   39.2%     29,773   33.3%
Petunia x hybrida            9,884        2,200    22.3%     1,549    15.7%     1,520    15.4%
Solanum chacoense            7,110        1,284    18.1%     1,235    17.4%     843      11.9%
Solanum habrochaites         4,024        434      10.8%     391      9.7%      254      6.3%
Solanum lycopersicum         48,945       9,850    20.1%     7,461    15.2%     5,561    11.4%
Solanum pennellii            3,718        287      7.7%      206      5.5%      179      4.8%
Solanum tuberosum            70,344       17,232   24.5%     15,323   21.8%     12,408   17.6%

Page 24: Solanaceae comparative genomics   prl lunchtime seminar 2009

SNP Identification Using Transcript Assemblies

Input is multiple sequence alignments (MSAs) of PUT member sequences

• provided in the PlantGDB PUTs dataset

• we use vmatch to remap ESTs that are near-identical sub-sequence matches to PUT member ESTs; these are excluded from the PlantGDB assembly process, and from the PlantGDB MSA

SNP-finding script identifies SNPs at positions in alignments with the following criteria:

• minimum read depth of 4

• minimum of 2 reads supporting an alternative base
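
The two criteria could be applied to an alignment column by column, roughly as in this sketch. It is not the lab's SNP-finding script: it assumes the reads are given as equal-frame aligned strings, that gaps and ambiguous bases do not count toward depth, and that "alternative base" means the second most frequent base in the column.

```python
from collections import Counter


def call_snps(aligned_reads, min_depth=4, min_alt_support=2):
    """Return (0-based column, {base: count}) for columns that meet both criteria."""
    snps = []
    width = max((len(read) for read in aligned_reads), default=0)
    for col in range(width):
        # Count only unambiguous bases; gaps ('-') and N are ignored (assumption).
        bases = Counter(
            read[col].upper()
            for read in aligned_reads
            if col < len(read) and read[col].upper() in "ACGT"
        )
        if sum(bases.values()) < min_depth:
            continue  # not enough reads covering this column
        ranked = bases.most_common()
        if len(ranked) > 1 and ranked[1][1] >= min_alt_support:
            snps.append((col, dict(bases)))
    return snps


# Made-up alignment of five member ESTs for one PUT.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",
    "ACCTAC",
    "AC-TAT",
]
snps = call_snps(reads)
print(snps)  # column 2 qualifies; column 5 has only one read with the alternative base
```

Raising `min_depth` or `min_alt_support` trades sensitivity for confidence, since a single sequencing error can no longer produce a call.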

Page 25: Solanaceae comparative genomics   prl lunchtime seminar 2009

SNP Identification on Transcripts (2)

Page 26: Solanaceae comparative genomics   prl lunchtime seminar 2009

SNP Identification on Transcripts (3)

                                          PUTs w/SNP(s)      Total Length of    # of SNP     Avg Depth of     Avg Alternative
PUTs Species                 Total PUTs   #        %         PUTs w/SNP (bp)    Positions    Coverage at SNP  Base Support
Capsicum annuum              15,278       510      3.34%     460,754            1,461        16.9             7.2
Nicotiana benthamiana        18,037       966      5.36%     1,164,075          5,106        18.7             9.7
N. langsdorffii x sanderae   6,791        191      2.81%     143,052            821          31.1             15.6
Nicotiana sylvestris         7,612        17       0.22%     8,756              33           6.4              3.8
Nicotiana tabacum            89,461       2,110    2.36%     2,262,286          10,303       13.7             7.5
Petunia x hybrida            9,884        133      1.35%     114,872            315          8.3              3.4
Solanum chacoense            7,110        13       0.18%     10,399             48           5.4              2.2
Solanum habrochaites         4,024        127      3.16%     157,292            695          33.8             14.0
Solanum lycopersicum         48,945       5,198    10.62%    6,347,780          16,531       29.2             13.5
Solanum pennellii            3,718        99       2.66%     86,679             273          35.6             17.2
Solanum tuberosum            70,344       7,722    10.98%    8,872,526          57,705       19.0             9.7

Page 27: Solanaceae comparative genomics   prl lunchtime seminar 2009

[Figure: "Number of Putative SNPs vs. Minimum Depth of Coverage at SNP Positions"; x-axis: Minimum Depth of Coverage at SNP Position (4, 8, 12, 16, 20, 24, 28, 32, 36, >40); y-axis: Number of Putative SNP Positions (0 to 25,000); series: min 2, 4, 6, 8 and 10 alt allele depth]

Page 28: Solanaceae comparative genomics   prl lunchtime seminar 2009

The Solanaceae Comparative Genomics Resource in 2009/2010

SNP prediction on genomic sequences

Gene-centric views of data and resources

Integration of Tobacco genomic sequence into site resources

Increased annotation quality

Phylogenetic analysis

Comparative synteny displays

“Next generation” Potato genome sequencing?

Page 29: Solanaceae comparative genomics   prl lunchtime seminar 2009

Other Web Resources in the Buell Lab

Please visit http://buell-lab.plantbiology.msu.edu

…also Biofuels Feedstock Genomics Resource (and more?)

Page 30: Solanaceae comparative genomics   prl lunchtime seminar 2009

Acknowledgements

PI: Robin Buell

Bioinformatics/Project Lead: Brett Whitty

Bioinformatics Programmer: Morgan Chaires

Thanks: Kevin Childs, John Hamilton, Mike Geoffroy, Steven Lundback

Funding: Solanaceae Comparative Genomics; Potato Chromosome 6