Finding Transcription Modules from large gene-expression data sets Ned Wingreen – Molecular Biology Morten Kloster, Chao Tang – NEC Laboratories America

Post on 21-Dec-2015


Page 1

Finding Transcription Modules from large gene-expression data sets

Ned Wingreen – Molecular Biology

Morten Kloster, Chao Tang – NEC Laboratories America

Page 2

Outline

• Introduction – transcription, regulation, gene chips, and transcription modules.

• Iterative Signature Algorithm (ISA).

• Advantages of Progressive Iterative Signature Algorithm (PISA).

• PISA applied to yeast data.

Page 3

Transcription regulation

http://doegenomestolife.org

Page 4

Gene chips

DNA microarray

Page 5

Gene-expression profile

Expression matrix $E_{gc}$, with genes $g = 1, 2, \ldots, N_g$ and conditions $c = 1, 2, \ldots, N_c$.

But data very noisy…

Page 6

Transcription module

[Figure: network diagram linking conditions (C1, C2, C3), transcription factors (TF1, TF2, TF3, TF4), and genes (G1, G2, G3, G4, G5, G6, G7).]

A Transcription Module: a set of conditions and a set of genes connected by a transcription factor.

Page 7

A gene can be in multiple transcription modules.

[Figure: gene-expression matrix with genes g1 … gNg as rows and conditions c1 … cNc as columns.]

Signature of a transcription module

Page 8

Iterative Signature Algorithm (ISA)

Barkai group (2002, 2003)

Transcription Module (TM):

$$C_m(G_m) = \{\, c \in C : \langle E^G_{gc} \rangle_{g \in G_m} > t_C \,\}$$

$$G_m(C_m) = \{\, g \in G : \langle E^C_{gc} \rangle_{c \in C_m} > t_G \,\}$$

Gene vector and condition vector:

$$m^G = (m^G_1, m^G_2, \ldots, m^G_{N_g})^T, \qquad m^C = (m^C_1, m^C_2, \ldots, m^C_{N_c})^T$$

Thresholding:

$$m^C(n+1) = f_{t_C}\big( (E^G)^T \, m^G(n) \big)$$

$$m^G(n+1) = f_{t_G}\big( E^C \, m^C(n+1) \big)$$

[Figure: gene-expression matrix with genes g1 … gNg as rows and conditions c1 … cNc as columns.]

Thresholding on both genes and conditions reduces noise.
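The alternating update of the condition vector and the gene vector can be sketched as below. This is only an illustration under stated assumptions, not the Barkai group's implementation: the slide does not define the threshold function f, so `threshold` assumes a z-score cut followed by rescaling to unit length. The function names, the toy matrix, and the threshold values are hypothetical, and plain Python lists stand in for a real array library.

```python
from math import sqrt
from statistics import mean, pstdev


def threshold(scores, t):
    """Assumed form of f_t (not given on the slide): keep entries more than
    t standard deviations above the mean, zero the rest, rescale to unit length."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    kept = [s if (s - mu) / sigma > t else 0.0 for s in scores]
    norm = sqrt(sum(x * x for x in kept))
    return [x / norm for x in kept] if norm > 0 else kept


def isa(E_G, E_C, m_G, t_C, t_G, max_iter=50):
    """Alternate m^C = f_tC((E^G)^T m^G) and m^G = f_tG(E^C m^C) until the
    set of selected genes stops changing (or max_iter is reached)."""
    n_genes, n_conds = len(E_G), len(E_G[0])
    m_C = [0.0] * n_conds
    for _ in range(max_iter):
        # Condition scores: sum over genes, weighted by the gene vector.
        s_C = [sum(E_G[g][c] * m_G[g] for g in range(n_genes))
               for c in range(n_conds)]
        m_C = threshold(s_C, t_C)
        # Gene scores: sum over conditions, weighted by the condition vector.
        s_G = [sum(E_C[g][c] * m_C[c] for c in range(n_conds))
               for g in range(n_genes)]
        new_m_G = threshold(s_G, t_G)
        converged = [x != 0 for x in new_m_G] == [x != 0 for x in m_G]
        m_G = new_m_G
        if converged:
            break
    genes = [g for g, x in enumerate(m_G) if x != 0]
    conds = [c for c, x in enumerate(m_C) if x != 0]
    return genes, conds


# Toy data (hypothetical): genes 0-2 respond to conditions 0-1, the rest are flat.
E = [[2.0, 2.0, 0.0, 0.0, 0.0]] * 3 + [[0.0] * 5] * 3
# Seed with one module gene (0) and one unrelated gene (3).
genes, conds = isa(E, E, m_G=[1.0, 0, 0, 1.0, 0, 0], t_C=1.0, t_G=0.5)
```

In this toy case the same matrix is passed for both E^G and E^C; with real data the two would be normalized differently.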

Page 9

Limitations of ISA

• Lots of spurious modules (millions…).

• Weak modules may be absorbed by strong ones.

• ISA does not make use of identified modules to find new ones.

[Figure: gene-expression matrix with genes g1 … gNg as rows and conditions c1 … cNc as columns.]

Page 10

Progressive Iterative Signature Algorithm (PISA)

[Figure: gene-expression matrix with genes g1 … gNg as rows and conditions c1 … cNc as columns.]

Page 11

Advantages of PISA over ISA

• Removing found modules reveals “hidden” modules, and reduces noise for unrelated modules.

• No positive feedback.

• Improved thresholding for genes.

• Combines coregulated and counter-regulated genes.

Page 12

Example of PISA vs. ISA

[Figure: diagram with transcription factors TF1 and TF2, genes G1 and G2, and labels A and B.]

Page 13

The gene-score threshold

• Goal: fewer than one gene included in the module by mistake.

• Require: a threshold that is insensitive to the (unknown) module size.

Gene scores along the condition vector for some module

Page 14

Eliminating false modules

For scrambled data, preliminary modules either have few genes or few contributing conditions.

[Figure label: True positives]

Page 15

PISA applied to yeast data

• Applied PISA to a dataset containing almost all available microarray data for S. cerevisiae: >6000 genes, ~1000 conditions.

• Found ~140 different modules, including all “good” modules found by ISA.

• Found some unknown modules.

• Found many “good” small modules that ISA could not find / separate from the spurious modules.

• ~2600 genes in at least one module, ~900 genes in more than one module.

Page 16

Some modules found by PISA

Page 17

Example: Zinc module

Zinc module found by PISA: ZRT1, YNL254C, INO1, ZAP1, YOL154W, ADH4, ZRT3, ZRT2, YOR387C

ZAP1-regulated genes during zinc starvation (Lyons et al., PNAS 97, 7957-7962 (2000)): ZRT1, ZAP1, ZRT2, YNL254C, YOL154W, ZRT3, ADH4, RAD27, ZRC1, …

Page 18

Comparison with other databases

“Gold standard”: Gene Ontology (Genome Res. 11, 1425-1433 (2001))

Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))

Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))

Page 19

Correlations

[Figure: correlation map between modules; legend: anticorrelated, correlated. Modules appear in four groups, with the number of genes in parentheses.]

• Oxidative stress response (69); De novo purine biosyn (32); Lysine biosyn (11); Biotin syn & transport (6); Arg biosyn (6); aa biosyn (96)

• Oxidative stress response (69); aryl alcohol dehydrogenase (6); proteolysis (27); trehalose & hexose metabolism/conversion (21); COS genes (11); heat shock (52); repair of disulfide bonds (26)

• Mating genes for type a (15); Mating type a signaling genes (6); Mating (110); Mating factors/receptors: a/α difference (26)

• rRNA processing (117); Ribosomal proteins (126); Histone (19); Fatty acid syn ++ (22); Cell cycle G2/M (31); Cell cycle M/G1 (35); Cell cycle G1/S (66)

Page 20

Summary

• Data from gene chips can be used to identify transcription modules (TMs).

• Iterative approach (ISA) is promising.

• PISA improves on ISA by taking out found TMs.

– PISA also improves gene thresholding, avoids positive feedback, and improves signal to noise by grouping coregulated and counter-regulated genes.

– PISA very effective for finding “secondary modules”.

http://cn.arxiv.org/abs/q-bio/0311017

Page 21

Future Directions

• Input to experiment:

– new modules and new genes in old modules.

– what kinds of experiments give the most informative data?

• Improve PISA:

– better pre/post-processing of data.

• Apply PISA to other organisms.

• Combine PISA with other data (experimental, bioinformatic) to systematically identify TMs, and reconstruct the transcription network.

Page 22

De novo purine biosynthesis

Number of genes: 32
Average number of contributing conditions: 14.6
Consistency: 0.59
Best ISA overlap: 0.59 at $t_G$ = 5.0; frequency 16

Page 23

Galactose induced genes

Number of genes: 23
Average number of contributing conditions: 18.1
Consistency: 0.55
Best ISA overlap: 0.74 at $t_G$ = 3.2; frequency 686

Page 24

Hexose transporters

Number of genes: 10
Average number of contributing conditions: 33.7
Consistency: 0.59
Best ISA overlap: 0.6 at $t_G$ = 3.8; frequency 41

Page 25

Peroxide shock

Number of genes: 69
Average number of contributing conditions: 23.9
Consistency: 0.50
Best ISA overlap: 0.34 at $t_G$ = 3.4; frequency (1)

Page 26

Implementation of PISA

• Normalization of gene-expression data

• Iterative algorithm to find preliminary modules (modified ISA)

– avoiding positive feedback

– gene-score threshold

• Orthogonalization

• Finding consistent modules

Page 27

Normalization of expression data

Gene-score matrix EG:

Condition-score matrix EC:

removes reference-condition bias

normalizes total RNA levels

makes gene scores comparable

makes condition scores comparable

Page 28

Iterative algorithm: modified ISA (mISA)

Start with a random set of genes $G_I$.

Produce condition-score vector $s^C$.

Produce gene-score vector $s^G$, using “leave-one-out” scoring to avoid positive feedback.

From $s^G$, calculate gene vector $m^G$ for next iteration.

Page 29

Orthogonalization

After finding each converged preliminary module (sG, sC), remove the component along sC from all genes.

[Figure: vector diagram with condition-score vectors s1C and s2C and a projected vector s’.]
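Removing the component along sC amounts to a vector projection. The sketch below is one plausible reading, not the authors' code: it assumes each gene's expression profile is a row of the matrix and subtracts its projection onto sC, i.e. $E_g \to E_g - \frac{E_g \cdot s^C}{s^C \cdot s^C}\, s^C$. The function name and the toy numbers are hypothetical.

```python
def remove_component(E, s_C):
    """Return a copy of E in which every row (one gene's profile across
    conditions) has had its projection onto s_C subtracted."""
    norm_sq = sum(x * x for x in s_C)
    if norm_sq == 0:
        # Nothing to remove for a zero condition-score vector.
        return [row[:] for row in E]
    result = []
    for row in E:
        coeff = sum(e * s for e, s in zip(row, s_C)) / norm_sq
        result.append([e - coeff * s for e, s in zip(row, s_C)])
    return result


# Toy example (hypothetical): two genes, three conditions.
E = [[3.0, 1.0, 0.0],
     [0.0, 2.0, 2.0]]
s_C = [1.0, 0.0, 0.0]
E_perp = remove_component(E, s_C)
```

After this step every row of `E_perp` is orthogonal to `s_C`, so a module along that direction cannot be found again in later iterations.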

Page 30

Why does scrambled data yield large modules?

Long tails of expression data lead to single-condition modules.

Page 31

Finding consistent modules

• Repeat PISA runs many times (~30).

• Tabulate preliminary modules.

• A preliminary module contributes to a module if:

– the preliminary module contains > 50% of the genes in the module,

– these genes constitute > 20% of the preliminary module.

• A gene is included in a module if it appears in > 50% of the contributing modules, always with the same gene-score sign.
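The two membership rules above can be written down directly. The sketch below assumes a preliminary module is stored as a dictionary from gene name to gene-score sign (+1 or -1) and that a candidate gene set for the module is already available. How that candidate set is first formed is not stated on the slide, and all names here are hypothetical.

```python
def contributes(prelim, module_genes):
    """A preliminary module contributes if it contains > 50% of the module's
    genes and those shared genes make up > 20% of the preliminary module."""
    shared = set(prelim) & module_genes
    return (len(shared) > 0.5 * len(module_genes)
            and len(shared) > 0.2 * len(prelim))


def consensus_genes(contributing):
    """Keep genes that appear in > 50% of the contributing preliminary
    modules, always with the same gene-score sign."""
    counts, signs = {}, {}
    for prelim in contributing:
        for gene, sign in prelim.items():
            counts[gene] = counts.get(gene, 0) + 1
            signs.setdefault(gene, set()).add(sign)
    n = len(contributing)
    return {g for g, k in counts.items()
            if k > 0.5 * n and len(signs[g]) == 1}


# Toy example (hypothetical genes and signs).
module = {"a", "b", "c"}
prelims = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "d": -1},
    {"a": 1, "c": -1, "e": 1},   # gene c appears here with the opposite sign
    {"x": 1, "y": 1, "a": 1},    # shares only one gene with the module
]
flags = [contributes(p, module) for p in prelims]
kept = consensus_genes([p for p, ok in zip(prelims, flags) if ok])
```

Here the fourth preliminary module is rejected, and gene c is dropped from the consensus because its sign is inconsistent.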

Page 32

Comparison with other databases

Gene Ontology (Genome Res. 11, 1425-1433 (2001))

Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))

Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))

p value:

$$p = 1 - \sum_{i=0}^{n-1} \frac{\binom{c}{i} \binom{N_g - c}{m - i}}{\binom{N_g}{m}}$$

$N_g$: number of genes in organism

$m$: number of genes in module

$c$: number of genes in GO category

$n$: number of genes in both module and GO category
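With the symbols defined on this slide, the p value is the hypergeometric probability of seeing at least n shared genes by chance. A minimal sketch using only the standard library follows; the function name and the small test numbers are hypothetical.

```python
from math import comb


def go_p_value(N_g, m, c, n):
    """Probability that a random module of m genes (out of N_g) shares at
    least n genes with a GO category of c genes:
    p = 1 - sum_{i=0}^{n-1} C(c, i) * C(N_g - c, m - i) / C(N_g, m)."""
    total = comb(N_g, m)
    below = sum(comb(c, i) * comb(N_g - c, m - i) for i in range(n))
    # Subtract as exact integers first to limit floating-point cancellation.
    return (total - below) / total


# Tiny example: 4 genes, a module of 2, a category of 2, both genes shared.
p = go_p_value(N_g=4, m=2, c=2, n=2)
```

Only one of the six possible two-gene modules matches the category exactly, so `p` equals 1/6 here.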

Page 33

Correlation of modules

$$\mathrm{Corr}(m_1, m_2) = m'^{C}_1 \cdot m'^{C}_2, \qquad m'^{C} \equiv \frac{m^C}{|m^C|}$$

[Figure: gene-expression matrix with genes g1 … gNg as rows and conditions c1 … cNc as columns.]
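Read this way, the correlation of two modules is the dot product of their condition vectors after each is scaled to unit length. The sketch below follows that reading. The function name is hypothetical, and returning 0 for an all-zero vector is an assumption made here for convenience.

```python
from math import sqrt


def module_correlation(m1_C, m2_C):
    """Dot product of two condition vectors after normalizing each to unit
    length; positive for correlated modules, negative for anticorrelated."""
    n1 = sqrt(sum(x * x for x in m1_C))
    n2 = sqrt(sum(x * x for x in m2_C))
    if n1 == 0 or n2 == 0:
        return 0.0  # assumption: an empty condition vector correlates with nothing
    return sum(a * b for a, b in zip(m1_C, m2_C)) / (n1 * n2)


same = module_correlation([1.0, 2.0], [2.0, 4.0])       # parallel vectors
opposite = module_correlation([1.0, 0.0], [-3.0, 0.0])  # antiparallel vectors
unrelated = module_correlation([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
```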