InCoB 2007, Aug. 29, 2007
Motif-directed Network Component Analysis for Regulatory Network Inference
Chen Wang, Lily Chen, Yue Wang, (Jason) Jianhua Xuan*
Virginia Tech, USA
Po Zhao, Eric Hoffman
Children’s National Medical Center, USA
Robert Clarke
Georgetown University Medical Center, USA
Outline
• Background & Motivation
• Proposed Approach
– Motif-directed network component analysis (mNCA)
– Stability analysis
• Experimental Results
– Muscle regeneration
• Conclusion & Discussion
Background & Motivation
• High-throughput biological data (e.g., microarray and proteomic data) provide a great opportunity to study genome systems.
– Identify gene modules, interactions and pathways.
• Gene regulatory network modeling
– Clustering or biclustering
– Decomposition
• The whole gene population is regulated by a few key transcription factors (TFs).
• TFs and their interactions can form a skeleton of the regulatory networks.
Background
• However, decomposition methods that rely on microarray data alone often yield results that are difficult to interpret biologically.
– Independent Component Analysis (ICA), and
– Non-negative Matrix Factorization (NMF).
• Network Component Analysis (NCA)
– An integrative approach
– Microarray gene expression data
– Protein binding data (i.e., ChIP-on-chip data): network connections (topology)
• Available in the yeast model system
Motivation
• Limitations of NCA:
– ChIP-on-chip data are often not available for species like mouse and human;
– When different data sources are integrated, consistency is often not guaranteed;
– ChIP-on-chip data come from biological experiments, which might contain false positives leading to incorrect network inference.
• Proposed solution: motif-directed network component analysis (mNCA)
– Motif information derived from DNA sequence provides the initial network topology.
– Because motif information contains false positives, stability analysis procedures are developed to combat the inconsistency between motif information and microarray data.
Motivation - Pathway Building
• Emery-Dreifuss Muscular Dystrophy (EDMD)
Bakay, M, et al., Brain (129), 2006
Network Component Analysis (NCA)
[Figure: NCA network diagram with layers labeled TF, Connection, and mRNA]
Mathematical Formulation of NCA
• A linear model:
$$E_{N \times M} = A_{N \times L}\, T_{L \times M}$$
e.g., CHRNG = a1 MYOD1 + a2 MYOG
A: the connection strengths
T: transcription factor activities (TFAs)
• Criterion to infer TFAs and regulation relationships according to both expression and topology:
$$\min_{A,\,T} \; \left\| E_{N \times M} - A_{N \times L}\, T_{L \times M} \right\|^2 \quad \text{s.t.} \quad A \in Z_0$$
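One way to minimize the criterion above is alternating least squares, sketched below. The slides do not state which optimizer was used, so this is an assumption. The sketch also assumes E has N genes by M samples, A has N genes by L TFs, and the constraint on A means that entries which are zero in the initial topology `A0` stay zero. The function name `nca_als` and its parameters are illustrative only.

```python
import numpy as np

def nca_als(E, A0, n_iter=50):
    """Alternating least squares for E ~ A @ T, keeping the zero pattern of A0 fixed.

    Assumption: the topology constraint is encoded by the zeros of A0.
    """
    mask = A0 != 0                       # allowed TF-gene connections
    A = A0.astype(float).copy()
    for _ in range(n_iter):
        # Step 1: fix A, solve for the TFAs T (L x M) by least squares.
        T, *_ = np.linalg.lstsq(A, E, rcond=None)
        # Step 2: fix T, refit only the allowed strengths of each gene.
        for g in range(E.shape[0]):
            idx = np.flatnonzero(mask[g])
            if idx.size == 0:
                continue                 # gene with no connected TF
            coef, *_ = np.linalg.lstsq(T[idx].T, E[g], rcond=None)
            A[g, idx] = coef
    return A, T
```

Each step solves a least-squares problem exactly, so the residual cannot increase from one iteration to the next. A and T are only identifiable up to a scaling per TF, which is one reason the later slides compare TFAs by absolute correlation rather than by raw values.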
Illustration of NCA
[Figure: E = A T, shown as plots. Panels: Microarray data (gene expression, E), Regulation strength (A), Transcription Factor Activities (TFAs, T)]
mNCA - Motif Information
• Transcription Factors (TFs)
– Proteins that bind to the promoter regions of genes
– Activate or inhibit gene expression
• Motif (DNA sequence motif)
– Common pattern in binding sites for a TF
– Short sequences (5-25 bp)
– Up to 1000 bp (or farther) from the gene
– Inexactly repeating patterns
[Figure: binding sites for a TF in Gene 1, Gene 2, Gene 3, Gene 4, and Gene 5]
Motif Representation
• Consensus sequence
MyoD (M00001): SRACAGGTGKYG
• Position weight matrices (PWMs)
MyoD (M00001): [figure]
• Sequence logo:
– graphical depiction of a profile
– conservation of elements in a motif
MyoD (M00001): [figure]
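To illustrate the consensus representation, the sketch below expands the IUPAC ambiguity codes in the MyoD consensus (S = C or G, R = A or G, K = G or T, Y = C or T) and scans a sequence for exact matches on the forward strand. This is a simplified stand-in and not the PWM-based scoring that a matrix search performs. The helper names `consensus_to_regex` and `find_sites` are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))

def find_sites(consensus, sequence):
    """Return 0-based start positions of consensus matches (forward strand only)."""
    pattern = consensus_to_regex(consensus)
    k = len(consensus)
    seq = sequence.upper()
    # Test every window so that overlapping sites are also reported.
    return [i for i in range(len(seq) - k + 1)
            if pattern.fullmatch(seq, i, i + k)]

sites = find_sites("SRACAGGTGKYG", "TTGAACAGGTGTCGAA")
```

A consensus match is all-or-nothing, whereas a PWM yields a graded score per position. A graded score is what makes a thresholded "motif score" possible.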
Motif Identification
• Input:
– Promoter region of a gene g (2000 bp upstream)
– Muscle-specific binding site s
• Match™ search algorithm
– Minimize false positives
[Kel, A.E., et al., Nucleic Acids Res, 2003. 31(13): p. 3576-9.]
• Output:
– Initial connection strength (motif score) $A^{0}_{gs}$
– $A^{0}_{gs}$: average of the matrix similarity and core similarity scores
Stability Analysis for mNCA
• The information sources:
– mRNA microarray data (specific but noisy)
– Motif information (general, with false positives)
• The questions we want to answer:
– Which TFs play a relevant role in the experiment?
– Which genes are regulated by a particular TF? (downstream targets)
• Stability analysis: if small perturbations are applied,
– A bad TFA estimate tends to be altered easily, even destroyed;
– A good TFA estimate tends to keep its activity pattern throughout the perturbation.
Testing Stability by Perturbations
• Method 1: Thresholding the motif score
– A TF-gene connection is deleted if its motif score is below a cut-off threshold. Setting different cut-off thresholds changes the number of connections and hence the network topology.
• Method 2: Deleting/inserting connections
– A small percentage (e.g., 10%) of TF-gene connections is altered randomly, either by deleting existing connections or by inserting new ones.
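The two perturbation methods can be sketched on an initial connection matrix `A0` (genes by TFs, zero meaning no connection). The function names are hypothetical. Two details are assumptions that the slides do not specify: how the altered fraction is split between deletions and insertions, and the strength given to an inserted connection (here the mean existing score).

```python
import numpy as np

def threshold_topology(A0, cutoff):
    """Method 1: delete connections whose motif score is below the cut-off."""
    return np.where(A0 >= cutoff, A0, 0.0)

def random_perturb(A0, frac=0.10, rng=None):
    """Method 2: for each TF, alter about `frac` of its connections at random."""
    rng = np.random.default_rng() if rng is None else rng
    A = A0.astype(float).copy()
    fill = A0[A0 != 0].mean()            # assumed strength for inserted connections
    for s in range(A.shape[1]):
        on = np.flatnonzero(A0[:, s] != 0)    # existing connections of TF s
        off = np.flatnonzero(A0[:, s] == 0)   # candidate new connections
        n_change = int(round(frac * on.size))
        # Assumed split: a random number of deletions, the rest insertions.
        n_del = int(rng.integers(0, n_change + 1))
        n_ins = min(n_change - n_del, off.size)
        if n_del:
            A[rng.choice(on, size=n_del, replace=False), s] = 0.0
        if n_ins:
            A[rng.choice(off, size=n_ins, replace=False), s] = fill
    return A
```

Each perturbed matrix would then be passed to the NCA estimation step, giving one TFA estimate per perturbation.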
Understanding of Stability Analysis
• Obtain the confidence measure of an estimate:
[Figure: TFA estimates under perturbation and their comparison]
– e.g., absolute correlation coefficient 0.92: highly confident
– e.g., absolute correlation coefficient 0.52: less confident
Stability Measurement
• Stability measurements from perturbations:
Stability measurements of the j-th TFA:
$$\left\{ \, \left| \operatorname{correlation}\left[ TFA_j(i),\, TFA_j(k) \right] \right| \, \right\}_{i \neq k}$$
[Figure: boxplot of the stability measurements, marking the 75% quantile, the median, and the 25% quantile]
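A minimal sketch of the stability measurement, assuming the TFA estimates from all perturbations are stacked into an array of shape (perturbations, TFs, samples). For each TF it collects the absolute correlation between every pair of perturbation runs (pairs with i < k, which cover all i ≠ k once) and summarizes them with the quartiles a boxplot would show. The function names are illustrative.

```python
import numpy as np

def stability_measurements(tfa_runs):
    """Return |correlation| between all pairs of runs, per TF.

    tfa_runs: array of shape (n_perturbations, n_tfs, n_samples).
    Output:   array of shape (n_tfs, n_pairs).
    """
    n_runs, n_tfs, _ = tfa_runs.shape
    iu = np.triu_indices(n_runs, k=1)            # index pairs with i < k
    out = np.empty((n_tfs, iu[0].size))
    for j in range(n_tfs):
        corr = np.corrcoef(tfa_runs[:, j, :])    # each row is one run of TF j
        out[j] = np.abs(corr[iu])
    return out

def boxplot_summary(values):
    """25% quantile, median, and 75% quantile for each TF."""
    q25, med, q75 = np.percentile(values, [25, 50, 75], axis=1)
    return q25, med, q75
```

Taking the absolute value makes the measure insensitive to sign flips of a TFA between runs, so a stable activity pattern scores close to 1 even if its sign or scale changes.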
Experimental Results
• Dataset description:
Staged skeletal muscle degeneration/regeneration was induced by injection of cardiotoxin (CTX). Over a time range of up to 40 days, 27 time points were sampled, with two mice profiled as duplicates at each time point.
The time-course microarray data set was acquired with Affymetrix’s Murine Genome U74v2 Set in an expression profiling study at Children’s National Medical Center (CNMC). We obtained expression measurements of 7570 probesets in each sample.
[Figure: sampling timeline from day 0.5 to day 40; labeled time points include days 1, 2, 3, 4, 5, …, 10, 11, 12, 13, 14, 16, 20, 30]
Muscle Related TFs
• 24 muscle-related TF binding sites from TRANSFAC:
YY1, Tal-1alpha:E47, NF-Y, alpha-CP1, Sp1, Hand1:E47, MEF-2, USF, USF2, Tal-1beta:E47, Ebox, myogenin, E2A, NKX25, Nkx2-5, TATA, TBX5, MyoD, SRF, TBP, GATA-4, GATA, E47, E12
Muscle Related TFs
• Some muscle-related TF binding sites from TRANSFAC: [figure]
Stability Analysis (Method I)
• Thresholding the motif score:
– The motif-score threshold was swept over a range of values so that the number of connections varied gradually between 12,000 and 18,000, altering more than 30% of the topology.
[Figure: stability measurement (y-axis, 0 to 1) against transcription factor index (x-axis, 1 to 24). TF labels as listed on the x-axis: Tal-1alpha:E47, YY1, NF-Y, MyoD, TBX5, TATA, Nkx2-5, NKX25, E2A, myogenin, Ebox, Tal-1beta:E47, USF2, USF, MEF-2, Hand1:E47, Sp1, alpha-CP1, SRF, TBP, GATA-4, GATA, E47, E12. Annotated TFs: YY1, myogenin, MyoD.]
Stability Analysis (Method II)
• Deleting or inserting connections:
– For each transcription factor, 10% of connections were altered randomly regardless of the motif score, by deleting existing connections or inserting new connections, to test the stability of TFA estimates.
[Figure: stability measurement (y-axis, 0 to 1) against transcription factor index (x-axis, 1 to 24). TF labels as listed on the x-axis: Tal-1alpha:E47, YY1, NF-Y, MyoD, TBX5, TATA, Nkx2-5, NKX25, E2A, myogenin, Ebox, Tal-1beta:E47, USF2, USF, MEF-2, Hand1:E47, Sp1, alpha-CP1, SRF, TBP, GATA-4, GATA, E47, E12. Annotated TFs: YY1, myogenin, MyoD.]
Stable TFA Estimates
• The most stable TFA, YY1:
– Observed expression shows almost no change;
– Estimated TFA is related to muscle regeneration.
[Figure, two panels: YY1’s gene expression (probe id: 98767_at), log expression ratio vs. time (days); estimated YY1’s TFA, log TFA ratio vs. time (days)]
YY1’s TFA Estimate
• The difference between YY1’s mRNA level and protein level is supported by biological experiments.
Walowitz, JL, et al., “Proteolytic Regulation of the Zinc Finger Transcription Factor YY1, a Repressor of Muscle-restricted Gene Expression,” J Biol Chem, Vol. 273, Issue 12, 6656-6661, March 20, 1998.
[Figure: YY1 expression level and YY1 protein level]
YY1 – A Repressor in Muscle Regeneration
• Underlying regulation mechanism:
[Diagram: Calpain II, YY1, YY1 targets]
[Figure, two panels: Calpain II’s gene expression (probe id: 101040_at), log expression ratio vs. time (days); estimated YY1’s TFA, log TFA ratio vs. time (days)]
Stable TFA Estimates
• Some other stable TFAs: myogenin & MyoD
[Figure: for MyoD (probe id: 102986_at) and myogenin (probe id: 103053_at), expression (log expression ratio vs. time in days) and estimated TFA (log TFA ratio vs. time in days)]
Identifying TF’s Downstream Targets
• Stability analysis:
– Similarly, we can test the stability of the regulation strength A under small perturbations, and hence rank the most likely targets of a specific TF.
• Ranking downstream targets by frequency count (confidence measure):
– Perform multiple independent perturbations by deleting each connection with some probability.
– Count how many times a TF-gene pair falls in the top-ranked group (defined by a preset threshold), based on its regulation strength A.
Stability Analysis of MyoD’s Targets
• MyoD’s downstream targets ranking:
– 1000 independent perturbations are carried out.
– Each connection is deleted with a probability (e.g., 0.3).
– The top-ranking threshold is set to 100 in this case: if a gene’s regulation strength by MyoD is in the top 100, the gene is counted once.
[Figure: frequency count (y-axis, 0 to 700) against sorted downstream targets’ index (x-axis, 0 to 500)]
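The frequency-count ranking can be sketched as follows, with defaults taken from the numbers on the slide (1000 perturbations, deletion probability 0.3, top 100). The re-estimation step is passed in as a callable `estimate(E, A_init)` that returns a strength matrix. That callable is a placeholder for the NCA fit and is not defined here. Ranking by absolute strength is also an assumption.

```python
import numpy as np

def target_frequency(E, A0, tf, estimate, n_perturb=1000,
                     p_delete=0.3, top_k=100, rng=None):
    """Count how often each gene ranks in the top_k targets of TF column `tf`."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(A0.shape[0], dtype=int)
    for _ in range(n_perturb):
        # Delete each connection independently with probability p_delete.
        keep = rng.random(A0.shape) >= p_delete
        A_hat = estimate(E, A0 * keep)           # placeholder for the NCA fit
        strength = np.abs(A_hat[:, tf])          # assumed: rank by |strength|
        top = np.argsort(-strength)[:top_k]
        top = top[strength[top] > 0]             # deleted connections never count
        counts[top] += 1
    return counts
```

Sorting `counts` in descending order gives a curve like the one on the slide. Genes with a high count stay among the strongest targets however the topology is perturbed.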
MyoD’s Downstream Targets
• MyoD’s downstream genes from Ingenuity Pathway Analysis:
Top 100 genes: 16 genes directly related to MyoD, and several key muscle regeneration TFs (MYC, MYOG, and MEF2C)
YY1’s Downstream Targets
• YY1’s downstream genes from Ingenuity Pathway Analysis:
Conclusions
• A new computational approach, motif-directed network component analysis (mNCA), has been developed to integrate motif information and microarray data for regulatory network inference.
– Motif information is used to derive the initial topology for mNCA.
– Because motif information contains many false positives, stability analysis procedures have been developed to extract stable TFAs and the TFs’ downstream targets.
• The experimental results demonstrate that mNCA can help reveal key regulators in muscle regeneration.
Future Work – New Hypothesis & Validation
• Integrative approaches to pathway building
[Figure: pathway diagram. Nodes: Calpain II, YY1, MyoD, c-Myc, PAX2, DYS, …, myogenin, MYL4, TNNC1, MYBPH, …, DES, CYBB, MCM5, …, RRM1. Legend: interaction from database and knowledge; interaction derived from computational methods]
Acknowledgement
• NIH Grants:
– NS2925-13A, CA 096483 & CA109872
• DoD/CDMRP Grant
– BC030280
Thank you very much!