GEBA Project Summary Dongying Wu

Upload: marcia-stephenson

Post on 03-Jan-2016


Page 1: GEBA  Project Summary

GEBA Project Summary

Dongying Wu

Page 2: GEBA  Project Summary

Phylogenetic Tree Building (Martin Wu)

Concatenate alignments of 31 marker genes, then build a PhyML tree

667 non-GEBA genomes, 53 GEBA genomes

Page 3: GEBA  Project Summary

Phylogenetic Distance (PD)

PD = sum of all the branch lengths

PD{A,B,C} = a + b + c + d

[Figure: example tree with taxa A, B and C and branch lengths a, b, c and d]
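The PD definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the tree layout (A and B joined at an internal node X, which joins C at the root), the branch-length numbers, and the function name `phylogenetic_diversity` are all assumptions, since the original figure did not survive extraction. PD is taken here as the sum of the distinct branches on the paths from the chosen taxa to the root.

```python
def phylogenetic_diversity(parent, taxa):
    """Sum the lengths of all distinct branches between the taxa and the root.

    `parent` maps each node to (parent_node, branch_length); the root has no entry.
    """
    seen = set()
    total = 0.0
    for node in taxa:
        # Walk towards the root, counting each branch only once.
        while node in parent and node not in seen:
            seen.add(node)
            node, length = parent[node][0], parent[node][1]
            total += length
    return total


# Hypothetical tree: A and B meet at X; X and C meet at the root.
# The values of a, b, c, d are made up for illustration.
a, b, c, d = 0.3, 0.2, 0.5, 0.1
tree = {"A": ("X", a), "B": ("X", b), "X": ("root", d), "C": ("root", c)}

pd_abc = phylogenetic_diversity(tree, ["A", "B", "C"])  # a + b + c + d
```

With this rooted convention the shared branch d is counted once, however many taxa sit below it.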

Page 4: GEBA  Project Summary

Phylogenetic Distance Contribution of GEBA genomes

The total tree PD is 88.8; the GEBA genomes add 11.0 to the tree.

53 random non-GEBA taxa (from a pool of 667) contribute 3.15 to the tree PD (standard deviation: 0.68 over 100 samplings)

The 26 GEBA actinobacteria add 4.29 to the total PD (actinobacteria as a whole add 8.128 PD)

26 random non-GEBA actinobacteria (from a pool of 47) contribute 1.37 PD (standard deviation: 0.28 over 100 samplings)
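The random baseline quoted on this slide (draw k taxa from a pool, measure the PD they add, repeat 100 times) could be sketched as below. This is an assumed reading of the procedure, not the original script: the helper names, the parent-map tree encoding, and the use of the sample standard deviation are illustrative choices.

```python
import random
import statistics


def pd_of(parent, taxa):
    """PD as the sum of distinct branches between the taxa and the root."""
    seen, total = set(), 0.0
    for node in taxa:
        while node in parent and node not in seen:
            seen.add(node)
            node, length = parent[node][0], parent[node][1]
            total += length
    return total


def random_pd_gain(parent, base_taxa, pool, k, n_samples=100, seed=0):
    """Mean and standard deviation of the PD added by k random taxa from `pool`."""
    rng = random.Random(seed)
    base_pd = pd_of(parent, base_taxa)
    gains = []
    for _ in range(n_samples):
        picked = rng.sample(pool, k)
        gains.append(pd_of(parent, list(base_taxa) + picked) - base_pd)
    return statistics.mean(gains), statistics.stdev(gains)


# Toy star tree (hypothetical): every leaf hangs off the root with length 1.0,
# so any 3 added taxa contribute exactly 3.0.
star = {f"t{i}": ("root", 1.0) for i in range(10)}
mean_gain, sd_gain = random_pd_gain(star, ["t0", "t1"], [f"t{i}" for i in range(2, 10)], k=3)
```

On a real tree the gain depends on which branches the sampled taxa share with the base set, which is what produces a non-zero standard deviation.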

Page 5: GEBA  Project Summary

Gene Family Classification

227,562 genes from 56 genomes => 17,176,180 links

Blastp: E-value cutoff 1e-10, report up to 10000 hits

Only blastp hits that span 80% of the lengths of both genes are kept as links
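The link criterion on this slide can be written as a small predicate. The sketch below assumes that "span" means the aligned region on each gene divided by that gene's full length; the function and parameter names are hypothetical and are not taken from the project.

```python
def is_link(evalue, aligned_query, aligned_subject, query_len, subject_len,
            max_evalue=1e-10, min_span=0.8):
    """Keep a blastp hit as a link only if it passes the E-value cutoff and the
    aligned region covers at least `min_span` of BOTH genes (assumed definition)."""
    return (
        evalue <= max_evalue
        and aligned_query / query_len >= min_span
        and aligned_subject / subject_len >= min_span
    )


# Made-up hits for illustration.
kept = is_link(1e-50, aligned_query=290, aligned_subject=285, query_len=300, subject_len=310)
partial = is_link(1e-50, aligned_query=290, aligned_subject=200, query_len=300, subject_len=310)
weak = is_link(1e-5, aligned_query=300, aligned_subject=300, query_len=300, subject_len=300)
```

Requiring coverage on both genes keeps domain-only matches between a short gene and a long multi-domain gene from linking otherwise unrelated families.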

Page 6: GEBA  Project Summary

MCL Clustering Algorithm

Links (matrix of sequence identities) -> Expansion -> Inflation (I=2) -> repeat until equilibrium state
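The expansion and inflation loop named on this slide can be illustrated with a toy implementation. This is a simplified sketch of the general MCL idea in pure Python, not the MCL program the project ran: the self-loop handling, convergence test, and cluster read-out are assumptions made for the example.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Alternate expansion (matrix squaring) and inflation until equilibrium."""
    n = len(similarity)
    # Copy the link matrix and add self-loops so no column is empty.
    m = [[1.0 if i == j else float(similarity[i][j]) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        inflated = normalize_columns([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if change < tol:  # equilibrium state reached
            break
    # Each non-empty row of the equilibrium matrix lists the members of one cluster.
    clusters = {tuple(j for j in range(n) if row[j] > 1e-6) for row in m}
    clusters.discard(())
    return sorted(clusters)


# Toy link matrix (made-up identities): genes 0-2 are linked, genes 3-4 are linked.
links = [
    [1.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 1.0],
]
families = mcl(links)
```

The toy input has two disconnected groups, so it only demonstrates the mechanics; on real data a higher inflation value tends to split the graph into smaller families.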

Page 7: GEBA  Project Summary

[Figure: bar chart of the number of families (x-axis, 0 to 50000) by family size in genes/genome (y-axis)]

Family size (genes/genome)    Number of families
50 - 100                      1
20 - 50                       3
10 - 20                       6
5 - 10                        27
1 - 5                         511
20/56 - 1                     1305
10/56 - 20/56                 1588
5/56 - 10/56                  2755
2/56 - 5/56                   10601
1/56                          46689

Page 8: GEBA  Project Summary

Evenness estimation

Genome    Gene distribution ratio for family X    Distance to median
A         0.316                                   0.132
B         0.105                                   0.079
C         0.026                                   0.158
D         0                                       0.184
E         0.184                                   0
F         0.215                                   0.031
G         0.158                                   0.031

Median: 0.184

dist: average distance = 0.087

Evenness = 100 x e^(-4 x dist)
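The evenness score on this slide can be reproduced with a short function. The sketch below takes `dist` to be the average absolute distance of the per-genome ratios from a central value, which is an assumption based on the numbers shown. Note that the slide quotes a median of 0.184, while the ordinary median of the seven listed ratios is 0.158 (giving a dist of about 0.083), so the central value is left as a parameter rather than fixed.

```python
import math
import statistics


def evenness(ratios, center=None):
    """Return (evenness, dist) with evenness = 100 * exp(-4 * dist).

    `dist` is the mean absolute distance of the ratios from `center`;
    if no center is given, the ordinary median is used (an assumption).
    """
    if center is None:
        center = statistics.median(ratios)
    dist = sum(abs(r - center) for r in ratios) / len(ratios)
    return 100.0 * math.exp(-4.0 * dist), dist


# Ratios for family X as listed on the slide, with the slide's central value.
ratios = [0.316, 0.105, 0.026, 0.0, 0.184, 0.215, 0.158]
score, dist = evenness(ratios, center=0.184)

# A perfectly even family has dist = 0 and therefore evenness = 100.
flat_score, flat_dist = evenness([0.2, 0.2, 0.2, 0.2, 0.2])
```

With the slide's central value the average distance comes out close to the quoted 0.087, which corresponds to an evenness of roughly 70.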

Page 9: GEBA  Project Summary

Universality: ratio of genomes that a family appears in

Evenness: even distribution of gene family members across genomes

Size: number of members in a gene family
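Size and universality follow directly from a family's membership list. A minimal sketch, assuming each family is stored as a list with the source genome of every member gene (the function name and data layout are hypothetical):

```python
def family_stats(member_genomes, total_genomes):
    """Return (size, universality) for one gene family.

    `member_genomes` holds the source genome of each member gene,
    so a genome appears once per gene it contributes.
    """
    size = len(member_genomes)
    universality = len(set(member_genomes)) / total_genomes
    return size, universality


# Made-up family: four genes drawn from three of 56 genomes.
size, universality = family_stats(["g01", "g01", "g02", "g05"], total_genomes=56)
```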

Page 10: GEBA  Project Summary

Family size

Page 11: GEBA  Project Summary

Large families:

famID   size (per genome)   function
F2669   4210 (75/genome)    ABC-type transport system ATP-binding protein
F2670   1542 (27/genome)    multi-sensor hybrid histidine kinase
F2671   1367 (24/genome)    short chain dehydrogenase
F2672   1157 (20/genome)    acyl-CoA synthetase
F2673   782 (14/genome)     serine/threonine protein kinase
F2674   755 (13/genome)     two-component system response regulator (LuxR family)
F2675   735 (13/genome)     two-component system response regulator (winged helix family)
F2676   614 (11/genome)     drug resistance transporter
F2677   606 (11/genome)     transcriptional regulator, LacI family
F2678   568 (10/genome)     two-component system sensor histidine kinase
F2679   543 (10/genome)     sugar ABC transporter, permease component

Page 12: GEBA  Project Summary

Low universality large families:

famID   size   organism number   family function                         taxonomy
F2682   461    7                 outer membrane protein                  Bacteroidetes; Proteobacteria
F2699   303    6                 outer membrane protein                  Bacteroidetes
F2736   180    6                 anti-sigma factor                       Bacteroidetes; Proteobacteria
F2760   153    6                 transcriptional regulator, AraC family  Bacteroidetes; Proteobacteria
F2772   147    5                 RNA polymerase ECF-type sigma factor    Bacteroidetes (Sphingobacteriales)
F2801   129    11                DNA-binding protein                     Actinobacteria (Actinobacteridae)
F2827   114    3                 FtsX transmembrane transport protein    Bacteroidetes (Sphingobacteriales)
F2867   103    3                 hypothetical protein                    Actinobacteria (Coriobacteriaceae)

Page 13: GEBA  Project Summary

3 out of 9 largest families have very low evenness value ( < 5)

short chain dehydrogenase
acyl-CoA synthetase
two-component system response regulator (LuxR)

[Figure: one panel for each of the three families above, plotted across the 56 genomes numbered below]

56 Halobacteria Halorhabdus_utahensis
55 Halobacteria Halomicrobium_mukohataei
54 Halobacteria Halogeometricum_borinquense
53 Aminanaerobia Thermanaerovibrio_acidaminovorans
52 Deferribacteres Dethiosulfovibrio_peptidovorans
51 Deinococci Meiothermus_silvanus
50 Deinococci Meiothermus_ruber
49 Chloroflexi Thermobaculum_terrenum
48 Chloroflexi Sphaerobacter_thermophilus
47 Actinobacteria Conexibacter_woesei
46 Actinobacteria Atopobium_parvulum
45 Actinobacteria Slackia_heliotrinireducens
44 Actinobacteria Eggerthella_lenta
43 Actinobacteria Cryptobacterium_curtum
42 Actinobacteria Acidimicrobium_ferrooxidans
41 Actinobacteria Kribbella_flavida
40 Actinobacteria Catenulispora_acidiphila
39 Actinobacteria Stackebrandtia_nassauensis
38 Actinobacteria Geodermatophilus_obscurus
37 Actinobacteria Nakamurella_multipartita
36 Actinobacteria Actinosynnema_mirum
35 Actinobacteria Saccharomonospora_viridis
34 Actinobacteria Tsukamurella_paurometabola
33 Actinobacteria Gordonia_bronchialis
32 Actinobacteria Streptosporangium_roseum
31 Actinobacteria Thermobispora_bispora
30 Actinobacteria Thermomonospora_curvata
29 Actinobacteria Nocardiopsis_dassonvillei
28 Actinobacteria Kytococcus_sedentarius
27 Actinobacteria Brachybacterium_faecium
26 Actinobacteria Beutenbergia_cavernae
25 Actinobacteria Cellulomonas_flavigena
24 Actinobacteria Xylanimonas_cellulosilytica
23 Actinobacteria Jonesia_denitrificans
22 Actinobacteria Sanguibacter_keddieii
21 Firmicutes Anaerococcus_prevotii
20 Firmicutes Alicyclobacillus_acidocaldarius
19 Firmicutes Veillonella_parvula
18 Firmicutes Desulfotomaculum_acetoxidans
17 Fusobacteria Sebaldella_termitidis
16 Fusobacteria Leptotrichia_buccalis
15 Fusobacteria Streptobacillus_moniliformis
14 Spirochaetes Brachyspira_murdochii
13 Bacteroidetes Planctomyces_limnophilus
12 Bacteroidetes Rhodothermus_marinus
11 Bacteroidetes Capnocytophaga_ochracea
10 Bacteroidetes Chitinophaga_pinensis
09 Bacteroidetes Pedobacter_heparinus
08 Bacteroidetes Spirosoma_linguale
07 Bacteroidetes Dyadobacter_fermentans
06 Epsilonproteobacteria Sulfurospirillum_deleyianum
05 Deferribacteres Denitrovibrio_acetiphilus
04 Deltaproteobacteria Haliangium_ochraceum
03 Deltaproteobacteria Desulfomicrobium_baculatum
02 Deltaproteobacteria Desulfohalobium_retbaense
01 Gammaproteobacteria Kangiella_koreensis

Page 14: GEBA  Project Summary

phylum specific family

26/56 Actinobacteria

Gene number    Probability that all genes come from Actinobacteria by chance

1 0.4643

2 0.2157

3 0.1001

4 0.0465

5 0.0216

6 0.0100

7 0.0047

8 0.0022

9 0.0010

10 0.0005
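The probabilities in this table appear to follow (26/56)^n, where n is the number of genes in the family. The sketch below makes that assumption explicit: each gene is treated as coming independently from a uniformly chosen genome, 26 of the 56 being Actinobacteria. Under that assumption it reproduces the listed values to within about 0.0002; the function name is hypothetical.

```python
from fractions import Fraction


def chance_all_from_phylum(n_genes, phylum_genomes=26, total_genomes=56):
    """Probability that n independently placed genes all fall in one phylum,
    assuming each gene is equally likely to come from any genome."""
    return float(Fraction(phylum_genomes, total_genomes) ** n_genes)


table = {n: round(chance_all_from_phylum(n), 4) for n in range(1, 11)}
```

By this estimate a family of 10 or more genes restricted to Actinobacteria is unlikely to arise by chance, which motivates the size >= 10 cutoff used for phylum-specific families.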

Page 15: GEBA  Project Summary

712 families (size >=10) are phylum specific

[Figure: family size (y-axis, 0 to 350) vs. organism number (x-axis, 0 to 25); the plot carries the labels 42 and 670]

Page 16: GEBA  Project Summary

Phylum-specific families from more than two organisms

Family size Actinobacteria Bacteroidetes Deinococci Firmicutes Fusobacteria Halobacteria Total

10<= x <20 430 37 1 1 5 20 494

20<= x <30 103 9 2 114

30<= x <40 22 5 1 28

40<= x <50 7 1 1 9

50<= x <60 6 6

60<= x <70 4 1 5

70<= x <80 1 1 2

80<= x <90 3 1 4

90<= x <100 2 2

100<= x 3 3 6

Total 581 58 1 1 6 23

Page 17: GEBA  Project Summary

The largest 6 phylum-specific families

F2699 Bacteroidetes=303; outer membrane protein

*F2752 Actinobacteria=160; RNA polymerase, sigma-24 subunit, ECF family

F2772 Bacteroidetes=147; putative ECF-type RNA polymerase sigma factor

F2801 Actinobacteria=129; DNA-binding protein

F2827 Bacteroidetes=114; FtsX-related transmembrane transport protein

F2867 Actinobacteria=103; unknown functions

* From 15 organisms

Page 18: GEBA  Project Summary

Novel gene families: none of the genes in a family has a GenBank hit (E-value cutoff: 1e-5)

31744 novel families (34353 genes)

[Figure: Novel GEBA families; family number (y-axis, log scale, 1 to 100000) vs. family size (x-axis, 0 to 30)]
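The novelty rule on this slide (a family is novel only if none of its genes has a GenBank hit at the cutoff) can be sketched as a filter. The data layout below, a mapping from family ID to gene IDs plus a mapping from gene ID to its best E-value, is an assumption for illustration, and the IDs are made up.

```python
def novel_families(families, best_evalue, cutoff=1e-5):
    """Return the families in which no member gene has a hit at or below `cutoff`.

    `best_evalue` maps gene ID -> best GenBank E-value; genes with no hit are absent.
    """
    novel = {}
    for fam_id, genes in families.items():
        has_hit = any(g in best_evalue and best_evalue[g] <= cutoff for g in genes)
        if not has_hit:
            novel[fam_id] = genes
    return novel


# Hypothetical example: one gene of famA has a strong hit; famC's only hit is too weak.
families = {"famA": ["g1", "g2"], "famB": ["g3"], "famC": ["g4", "g5"]}
best_evalue = {"g1": 1e-30, "g4": 0.5}
novel = novel_families(families, best_evalue)
```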

Page 19: GEBA  Project Summary

[Figure: Novel GEBA families; axes: genome numbers and family size (tick ranges 0 to 6 and 0 to 31)]

Page 20: GEBA  Project Summary

Streptococcus agalactiae “pan-genome”

Tettelin H. et al. PNAS 2005;102:13950-13955

Page 21: GEBA  Project Summary

217,079 genes from 53 GEBA Bacterial genomes

60024 families -> N genomes -> number of families with the selected genomes

A: N from 1 to 53

B: For every N, sample the families 100 times
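The sampling scheme on this slide resembles a family accumulation (rarefaction) curve. The sketch below is one assumed reading of it: for each N, draw N genomes at random 100 times and average the number of distinct families they contain. The function name and data layout are hypothetical.

```python
import random
import statistics


def family_accumulation(genome_families, n_samples=100, seed=0):
    """For N = 1..number of genomes, average the family count over random genome subsets.

    `genome_families` maps genome -> set of family IDs present in that genome.
    """
    rng = random.Random(seed)
    genomes = list(genome_families)
    curve = []
    for n in range(1, len(genomes) + 1):
        counts = []
        for _ in range(n_samples):
            chosen = rng.sample(genomes, n)
            seen = set().union(*(genome_families[g] for g in chosen))
            counts.append(len(seen))
        curve.append((n, statistics.mean(counts)))
    return curve


# Toy input (made up): three genomes sharing some families.
toy = {"g1": {"a", "b"}, "g2": {"b", "c"}, "g3": {"c", "d"}}
curve = family_accumulation(toy)
```

A curve that keeps rising as N grows indicates that each added genome still brings new families, which is the comparison the later pan-genome slides draw between datasets.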

Page 22: GEBA  Project Summary

[Figure: Bacteria from GEBA project; gene family number, including families with single members (y-axis, 0 to 70000) vs. genome number (x-axis, 0 to 80)]

[Figure: New Genome families (y-axis, 0 to 3000) vs. number of genomes (x-axis, 0 to 60)]

Page 23: GEBA  Project Summary

Actinobacteria: (73 genomes, including 26 GEBA genomes)

Streptococcus agalactiae (8 strains)

Enterobacteriaceae: (40 genomes)

9 Escherichia coli
10 Yersinia pestis
11 Salmonella enterica
3 Shigella flexneri

Bacteria: (53 GEBA genomes)

Page 24: GEBA  Project Summary

[Figure: gene family number, including families with single members (y-axis, 0 to 70000) vs. genome number (x-axis, 0 to 80); series: S. agalactiae, Enterobacteriaceae, Actinobacteria, Bacteria from GEBA project]

Page 25: GEBA  Project Summary

[Figure: total gene number (y-axis, 0 to 350000) vs. genome number (x-axis, 0 to 80); series: S. agalactiae, Enterobacteriaceae, Actinobacteria, Bacteria from GEBA project]

Page 26: GEBA  Project Summary

[Figure: gene family number (y-axis, 0 to 70000) vs. total gene number (x-axis, 0 to 350000); series: S. agalactiae, Enterobacteriaceae, Actinobacteria, Bacteria from GEBA project]

Page 27: GEBA  Project Summary

Calculate the PD (Phylogenetic Diversity) of a sub-tree

Page 28: GEBA  Project Summary

[Figure: phylogenetic diversity (y-axis, 0 to 18) vs. genome number (x-axis, 0 to 80); series: Enterobacteriaceae, Actinobacteria, Bacteria from GEBA project]

Page 29: GEBA  Project Summary

[Figure: gene family number (y-axis, 0 to 70000) vs. phylogenetic diversity (x-axis, 0 to 18); series: Enterobacteriaceae, Actinobacteria, Bacteria from GEBA project]

Page 30: GEBA  Project Summary

How far down the road GEBA has to go in terms of PD coverage

232812 Bacterial/Archaeal ss-rRNA from Greengenes

45997 clusters

MCL 99% Identity at 80% span

42426 Greengenes Bacterial/Archaeal ss-rRNA

667 Combo Bacterial ss-rRNA
50 Combo Archaeal ss-rRNA
56 GEBA ss-rRNA

Retrieve alignments from greengenes

QuickTree

Distance tree for all representatives

Filter out ss-rRNA from genome projects (99% identity cutoff)

Filter out 18751 low-quality sequences:
short sequences (<=1200 nt)
low-quality sequences
duplicates
chimeras

Trim by the greengenes mask

Page 31: GEBA  Project Summary

74437 non-environmental Bacterial/Archaeal ss-rRNA from Greengenes

10397 clusters

MCL 99% Identity at 80% span

9946 Greengenes Bacterial/Archaeal ss-rRNA

667 Combo Bacterial ss-rRNA
50 Combo Archaeal ss-rRNA
56 GEBA ss-rRNA

Retrieve alignments from greengenes

QuickTree

Distance tree for non-environmental representatives

Filter out ss-rRNA from genome projects (99% identity cutoff)

Filter out low-quality sequences:
short sequences (<=1200 nt)
low-quality sequences
duplicates
chimeras

Trim by the greengenes mask

Page 32: GEBA  Project Summary

GEBA

Pre-GEBA

Greengenes

* Start from Haemophilus influenzae Rd KW20

** In each group, the taxa are sorted by their PD contributions in descending order

Page 33: GEBA  Project Summary

[Figure: phylogenetic diversity (y-axis, 0 to 1200) vs. organism numbers (x-axis, 0 to 40000), with an inset covering organism numbers 0 to 1200 and phylogenetic diversity 0 to 100; series: GEBA genomes, pre-GEBA genomes, organisms from the greengenes database, organisms from the greengenes database (excluding environmental samples)]

Page 34: GEBA  Project Summary

The slopes of the linear regression lines represent the PD contribution of the genomes (each window contains 50 genomes)
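The windowed regression described on this slide can be sketched as follows. The assumptions are that the input is the cumulative PD after adding each genome in order, and that the window slides one genome at a time; neither detail is stated in the slides, and the function name is hypothetical. The slope of an ordinary least-squares line within a window is then the average PD gained per genome in that window.

```python
def window_slopes(cumulative_pd, window=50):
    """Least-squares slope of cumulative PD against genome index in each window."""
    xs = range(window)
    x_mean = (window - 1) / 2
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slopes = []
    for start in range(len(cumulative_pd) - window + 1):
        ys = cumulative_pd[start:start + window]
        y_mean = sum(ys) / window
        # The slope does not depend on where the x-axis starts, so 0..window-1 is used.
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        slopes.append(sxy / sxx)
    return slopes


# Synthetic curve (made up): every genome adds exactly 0.048 PD.
curve = [0.048 * i for i in range(120)]
slopes = window_slopes(curve, window=50)
```

On real data the slope falls as the window moves down the sorted list, because later genomes add less PD.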

Page 35: GEBA  Project Summary

Non-environmental Tree

[Figure: slope (y-axis, 0 to 0.1) vs. window position (x-axis, 0 to 10000; window size 50 genomes) for the non-environmental Greengenes representatives, with the level 0.048 marked]

[Figure: slope (y-axis, 0 to 0.12) vs. window position (x-axis, 0 to 700; window size 50 genomes) for the pre-GEBA genomes]

GEBA genomes: 0.048

Only the top 150 PD contributors out of 717 pre-GEBA genomes have an average PD contribution greater than the GEBA genomes.

Genome sequencing efforts to date have covered only 11.5% of the phylogenetic diversity in this study.

We can pick an additional 550 organisms and still have an average PD contribution greater than or equal to that of the 56 GEBA genomes

To increase PD coverage to 50%, we need to sequence at least 1520 more genomes

Page 36: GEBA  Project Summary

All-representative Tree

[Figure: slope (y-axis, 0 to 0.18) vs. window position (x-axis, 0 to 45000; window size 50 genomes) for the Greengenes representatives, with the level 0.0515 marked]

Current genome sequences only cover 2.2% of the PD

We can pick an additional 4400 organisms and still have an average PD contribution greater than or equal to that of the 56 GEBA genomes

To cover 50% of the phylogenetic diversity, we have to sequence 9218 more genomes

Page 37: GEBA  Project Summary

[Figure: rbcL phylogenetic tree with bootstrap support values and a scale bar of 0.5; clades labeled I, II, III, IV and V. Leaf labels, in extraction order:]

Oligotropha carboxidovorans
Burkholderia xenovorans LB400
Ralstonia eutropha
Rhodobacter sphaeroides
Cylindrotheca closterium
2500650649 Thermomonospora curvata
2500714725 Meiothermus silvanus
Microcystis aeruginosa
Arabidopsis thaliana
Synechococcus elongatus
2500680479 Acidimicrobium ferrooxidans
Methylococcus capsulatus
Staphylothermus marinus
Archaeoglobus fulgidus
Thermococcus kodakarensis
2500517881 Halomicrobium mukohataei
2500153872 Halogeometricum borinquense
Methanocaldococcus jannaschii
Pyrococcus furiosus
Rhodopseudomonas palustris
Rhodospirillum rubrum
Thiobacillus denitrificans
Hydrogenovibrio marinus
Burkholderia xenovorans LB400
Roseovarius sp HTCC2601
Rhizobium leguminosarum
Thermomicrobium roseum
2500516270 Dyadobacter fermentans
2500403940 Pedobacter heparinus
Burkholderia ambifaria
Pseudomonas putida
Roseobacter sp MED193
Xanthobacter autotrophicus
Jannaschia sp CCS1
2500697965 Geodermatophilus obscurus
Arthrobacter aurescens
Chlorobium phaeobacteroides
Chlorobium tepidum
Rhodopseudomonas palustris
2500706456 Nakamurella multipartita
Archaeoglobus fulgidus
2500712642 Meiothermus silvanus
2500546384 Meiothermus ruber
2500583064 Rhodothermus marinus
2500348490 Veillonella parvula
2500608982 Planctomyces limnophilus
Microcystis aeruginosa
2500621325 Alicyclobacillus acidocaldarius
Bacillus licheniformis
Bacillus subtilis

Page 38: GEBA  Project Summary

Methylococcus_capsulatus
2500680479_Acidimicrobium_ferrooxidans
Arabidopsis_thaliana
Synechococcus_elongatus
Microcystis_aeruginosa
2500714725_Meiothermus_silvanus
Cylindrotheca_closterium
2500650649_Thermomonospora_curvata
Rhodobacter_sphaeroides
Ralstonia_eutropha
Oligotropha_carboxidovorans
Burkholderia_xenovorans_LB400
Hydrogenovibrio_marinus
Thiobacillus_denitrificans
Rhodospirillum_rubrum
Rhodopseudomonas_palustris
Methanocaldococcus_jannaschii
2500153872_Halogeometricum_borinquense
2500517881_Halomicrobium_mukohataei
Pyrococcus_furiosus
Thermococcus_kodakarensis
Archaeoglobus_fulgidus
Staphylothermus_marinus
2500608982_Planctomyces_limnophilus
2500348490_Veillonella_parvula
2500583064_Rhodothermus_marinus
2500546384_Meiothermus_ruber
2500712642_Meiothermus_silvanus
Rhodopseudomonas_palustris
Chlorobium_tepidum
Chlorobium_phaeobacteroides
Microcystis_aeruginosa
2500621325_Alicyclobacillus_acidocaldarius
Bacillus_subtilis
Bacillus_licheniformis
Archaeoglobus_fulgidus
2500706456_Nakamurella_multipartita
Arthrobacter_aurescens
Rhizobium_leguminosarum
Roseovarius_sp_HTCC2601
Burkholderia_xenovorans_LB400
Roseobacter_sp_MED193
Pseudomonas_putida
Burkholderia_ambifaria
Jannaschia_sp_CCS1
Xanthobacter_autotrophicus
2500697965_Geodermatophilus_obscurus
2500403940_Pedobacter_heparinus
2500516270_Dyadobacter_fermentans
Thermomicrobium_roseum

RRRRRRRRRRRRRRRRRRRRRRRPPPPPFFFPPPPMYRRRRRRRRRRRRR

HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

EEEEEEEEEEEEEEEEEEEEEEEHHHHHEEEEEEEEEEEEEEEEEEEEEE

DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK

IIIII

I

FFFFFFFFFFFFFFFFLLLLYYYLVV

VLLL

LLFFFFFFFFLFFF

DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGYGGGGGGGGGGGGGG

KKKKKKKKKKKKKKKKKKKKKKKKMVQQNNNACVVDCNASNSSSSSNSSS

KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK

NNNNNNNNNNNNNNNNNNNNNNN-----GGG------NGGNNNNNNNNNN

TTTTTTTTTTTTTTTTTTTTTTTVFTMMQQQTTSSTTTTTTTTTTTTTTT

EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEGGGGEGEDDEEEEEEEEEE

HHHHHHHHHHHHHHHHHHHHHHHVIIIIIIILILLVLHQQHHHHHHHHHH

FSSSSSSSSSSSSSSSSSSSSSSAGAAAGGGSSSSSGSGCSSSSSSSSSS

KKKKKKKKKKKKKKKKKKKKKKKRRRRRRRRSSSSKGKKKKKKKKKKKKK

GGGGGGGGGGGGGGGGGGGGGGGGGGGGRRRTGGGGGGGGGGGGGGGGGG

GGGGGGSGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

GGGGGGGGGGGGGGGGGGGGGGGGGGGGSSSGGGGGGGGGGGGGGGGGGG

III

I

II

IV

V

rbcL Active sites

Catalytic

RuBP binding

Page 39: GEBA  Project Summary

Calvin cycle

[Figure: Calvin cycle pathway diagram]

Metabolites: Glycerate-3-P, P-glyceroyl-P, GAP, DHAP, Fructose-1,6-P, Fructose-6-P, Xylulose-P, Ribulose-5-P, Ribulose-1,5-P, CO2

Genes: rbcL, pgk, gap, tpiA, glpX, tktA, rpe

Page 40: GEBA  Project Summary

Organism phylum rpe prk rbcL rbcS pgk

Thermomonospora_curvata_DSM_43183 Actinobacteria x x I x x

Meiothermus_silvanus_DSM_0994 Deinococci x x I,IV x x

Acidimicrobium_ferrooxidans Actinobacteria x x I x x

*Halogeometricum_borinquense_DSM_11551 Halobacteria x III x

Halomicrobium_mukohataei_DSM_12286 Halobacteria x III x

Alicyclobacillus_acidocaldarius_subsp Firmicutes x x IV x

Meiothermus_ruber_DSM_01279 Deinococci x x IV x

Nakamurella_multipartita_DSM_44233 Actinobacteria x x IV

Planctomyces_limnophilus_DSM_03776 Bacteroidetes x IV x

Rhodothermus_marinus_DSM_4252 Bacteroidetes x x IV x

Veillonella_parvula_DSM_02008 Firmicutes x IV x

Geodermatophilus_obscurus_DSM_43160 Actinobacteria x x V x

Pedobacter_heparinus_DSM_02366 Bacteroidetes x x V x

Dyadobacter_fermentans_DSM_18053 Bacteroidetes x x V x

Calvin Cycle

* Finished genome