Comparative Study of Granger Causality Algorithms for Gene Regulatory Networks
TRANSCRIPT
1
INVESTIGATION OF IMAGE PROCESSING ALGORITHMS FOR MEDICAL APPLICATION
Zhafir Aglna Tijani, U1120208F
A final year project presentation in partial fulfilment of the requirements for the degree of Bachelor of Engineering
2
Background and Theory
Implementation
Result and Discussion
Conclusion
Outline
3
Background and Theory
• Problems and Objectives
• Gene Regulatory Network
• Granger Causality
• 3 Methods of Granger Causality
• Project Focus
Implementation
Result and Discussion
Conclusion
Outline
4
“It is more pragmatic to cure the cause of a disease at its source than to treat the
actual disease”
Gene
5
• The interaction between genes is called a Gene Regulatory Network
• Discovering this network remains challenging because of its complexity
• Efficient computational tools are required

To find an effective and efficient means to discover unknown Gene
Regulatory Networks
Objective
6
Modelling of GRN
• Nodes and edges depict the relations between genes
• Data obtained from DNA microarray
• Prominent method: Granger Causality
http://img.medicalxpress.com/newman/gfx/news/hires/2013/1-novelnoninva.jpg
7
Granger Causality
• Method for time series analysis
• Uses a Vector Auto-regression (VAR) model to calculate causality from time series data
Granger (1969)
Time Series A → Time Series B
U_t = Σ_{k=1}^{p} A_k U_{t-k} + ε_t

F_{Y→X} ≡ ln( |Σ'_{xx}| / |Σ_{xx}| )
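The VAR model and causality measure on this slide can be sketched in Python (a simplified pairwise estimate via ordinary least squares, not the MATLAB toolboxes used in the project; in this bivariate sketch, residual sums of squares stand in for the covariance determinants):

```python
import numpy as np

def granger_f(x, y, p=2):
    """F_{y->x} = ln(RSS_restricted / RSS_full): how much the past of y
    improves the prediction of x beyond x's own past (lag order p)."""
    n = len(x)
    target = x[p:]
    # Lagged regressors: column k-1 holds the series shifted by lag k
    x_lags = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    y_lags = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])

    def rss(regressors):
        design = np.column_stack([np.ones(len(target)), regressors])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return resid @ resid

    return np.log(rss(x_lags) / rss(np.hstack([x_lags, y_lags])))

# Toy data: x is driven by the past of y, so F_{y->x} should exceed F_{x->y}
rng = np.random.default_rng(0)
n = 500
y = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + 0.5 * rng.standard_normal()
```

Because the restricted model is nested in the full one, the measure is always non-negative; it is clearly positive only when the past of y genuinely helps predict x.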
8
Granger Causality
“If past values of A and B can predict the future value of B better than past values of B alone, then time series A Granger-causes time series B”
Granger (1969)
Time Series A → Time Series B
9
MVGC (Barnett et al., 2013), Lasso (Arnold et al., 2007), Copula (Liu and Bahadori, 2012)
3 Methods of Implementing Granger Causality
“These 3 methods have been implemented independently, but never compared under the same conditions.”
10
Main Focus of the Project
• Comparative Study of Algorithms
• Focus on the Performance of 3 Algorithms
• Finding Strength and Weaknesses
• Utilizing Control Variables and Metrics Performance
11
Background and Theory
Implementation
Result and Discussion
Conclusion
• Control Variables
• Causality Graph and Matrix
• Edge Analysis
• Performance Metrics
• Data for Analysis
Outline
12
Implementation
Time Series Input → GC Algorithm → Causality Matrix and Graph → Edge Analysis → Data for Discussion
• Implementation using MATLAB 2010b
• Based on existing toolboxes:
  • MVGC Toolbox (Barnett, 2013)
  • Lasso Granger
  • Copula Granger (Liu and Bahadori, 2012)
  • GLMnet
Program Flow
13
Implementation
Control Variables
• Based on a set of equations
• Linear time series dataset
• Generated by specifying the number of time points
• Advantages:
  • Provides a ground truth network: the actual causality of the time series
  • The ground truth can be compared with the algorithm output to measure the performance of the algorithms
• 2 types of dataset: 3-VAR and 5-VAR time series
• 8 different numbers of time points: 200, 400, 800, 1200, 1600, 2400, 3200, 4000
Synthetic Time Series Dataset
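The generation step above can be sketched as follows (a hypothetical 3-VAR(1) system in Python rather than the project's MATLAB code; the coefficient matrix A is an illustrative assumption whose off-diagonal pattern doubles as the ground-truth network):

```python
import numpy as np

# Hypothetical coefficient matrix: A[i, j] != 0 means variable j drives
# variable i, so the off-diagonal pattern of A is the ground-truth network.
A = np.array([[0.5, 0.0, 0.0],
              [0.4, 0.5, 0.0],
              [0.0, 0.4, 0.5]])

def simulate_var(A, n_points, noise=1.0, seed=0):
    """Generate a linear VAR(1) time series with n_points samples."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    u = np.zeros((n_points, d))
    for t in range(1, n_points):
        u[t] = A @ u[t - 1] + noise * rng.standard_normal(d)
    return u

series = simulate_var(A, 200)  # smallest of the 8 dataset sizes
# Ground truth: off-diagonal nonzero coefficients are the true causal links
ground_truth = ((A != 0) & ~np.eye(3, dtype=bool)).astype(int)
```

Because the data are generated from known equations, the recovered network can be scored exactly against `ground_truth`, which is the advantage the slide describes.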
14
3 Granger Causality Algorithms
15
Causality Matrix
• 1 represents: a link exists between variables
• 0 represents: no link exists
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
• The output of a GC algorithm is the causality matrix
• It depicts the Granger causality between time series
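Reading the matrix off as a set of directed edges might look like this (a sketch; the convention of rows as targets and columns as sources is an assumption here, and the toolboxes may differ):

```python
import numpy as np

# The 5x5 causality matrix from the slide
C = np.array([[0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [1, 0, 0, 1, 0]])

# Entry C[i, j] = 1 is read as "variable j Granger-causes variable i",
# so each nonzero entry becomes a (source, target) edge of the graph.
edges = [(j, i) for i, j in zip(*np.nonzero(C))]
print(edges)  # [(0, 1), (0, 2), (0, 3), (4, 3), (0, 4), (3, 4)]
```

The edge list is what gets drawn as the causality graph: variable 0 drives most of the network, and variables 3 and 4 drive each other.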
16
Edge Analysis
• The raw output of each algorithm is converted to a binary matrix by masking with a threshold of 0.0001
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
• Edge Analysis measures the performance of an algorithm by comparing its output with the benchmark
• Benchmark = Ground Truth
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1
Ground Truth (first matrix), Lasso Method (second matrix)
17
Edge Analysis
For the above example:
• TP: 4
• TN: 6
• FP: 13
• FN: 2
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
• Using parameters from the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1
Ground Truth (first matrix), Lasso Method (second matrix)
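The edge counting above can be reproduced directly (a Python sketch; the two matrices are the ground truth and the Lasso output from the slide):

```python
import numpy as np

def edge_counts(predicted, truth):
    """Compare a binary causality matrix against the ground truth and
    return the confusion-matrix counts (TP, TN, FP, FN)."""
    tp = int(np.sum((predicted == 1) & (truth == 1)))
    tn = int(np.sum((predicted == 0) & (truth == 0)))
    fp = int(np.sum((predicted == 1) & (truth == 0)))
    fn = int(np.sum((predicted == 0) & (truth == 1)))
    return tp, tn, fp, fn

truth = np.array([[0, 0, 0, 0, 0],
                  [1, 0, 0, 0, 0],
                  [1, 0, 0, 0, 0],
                  [1, 0, 0, 0, 1],
                  [1, 0, 0, 1, 0]])
lasso = np.array([[0, 0, 1, 0, 1],
                  [1, 0, 1, 1, 1],
                  [1, 1, 1, 0, 1],
                  [0, 1, 0, 1, 1],
                  [1, 1, 1, 0, 1]])
print(edge_counts(lasso, truth))  # (4, 6, 13, 2)
```

The four counts match the slide: 4 true links recovered, 2 missed, 13 spurious links, and 6 correctly absent links.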
18
7 Performance Metrics
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
False Positive Rate = FP / (TN + FP)
False Discovery Rate = FP / (TP + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 Score = 2TP / (2TP + FP + FN)
• Calculated from the values of TP, TN, FP, and FN
• Used in past research on similar topics
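As a sketch, the seven formulas translate one-to-one into code (applied here to the TP/TN/FP/FN counts from the edge-analysis example):

```python
def performance_metrics(tp, tn, fp, fn):
    """The 7 performance metrics, computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (tn + fp),
        "false_discovery_rate": fp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
    }

m = performance_metrics(tp=4, tn=6, fp=13, fn=2)
print(round(m["accuracy"], 2))  # 0.4
```

On the example counts, accuracy is (4 + 6) / 25 = 0.4 and the F1 score is 8 / 23, i.e. about 0.35.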
19
Data for Analysis
• The result of Granger causality depends on the generated time series
• A few samples were not sufficient, since the generated time series differed on each run
• The experiment was iterated 2000 times
• The mean value of each performance metric is the basis for the comparative study
Lasso, 1st Iteration:
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1

Lasso, 2nd Iteration:
0 0 1 0 0
0 0 1 1 0
1 1 1 0 1
0 1 0 1 1
0 1 0 1 1
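The averaging loop might be sketched as below. `run_once` is a hypothetical placeholder for one full pass (generate time series, run the GC algorithm, do edge analysis, compute metrics); here it returns noisy dummy scores only so the loop is runnable:

```python
import numpy as np

def run_once(seed):
    """Placeholder for one generate -> algorithm -> edge-analysis pass.
    Returns a dummy noisy accuracy score; a real run would return all
    seven metrics for each of the three algorithms."""
    rng = np.random.default_rng(seed)
    return {"accuracy": 0.4 + 0.05 * rng.standard_normal()}

# Each iteration uses a fresh synthetic time series (here: a fresh seed)
n_iterations = 2000
accuracies = [run_once(s)["accuracy"] for s in range(n_iterations)]
mean_accuracy = float(np.mean(accuracies))  # value used for comparison
```

Averaging over 2000 independently generated datasets smooths out the run-to-run variation visible in the two Lasso iterations above, so the comparison reflects typical rather than one-off behaviour.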
20
Background and Theory
Implementation
Result and Discussion
Conclusion
Outline
• Performance Metrics Scores
• Specific Results:
  • 5-VAR Accuracy
  • 3-VAR and 5-VAR F1 Score
• Overall Score Result
21
Scores of Metrics
• Bar chart to represent the score of each performance metrics on 3 methods
• X axis : Number of Time Points• Y axis : Score of Metrics
• 7 Metrics Performance• 2 Scenario : 3-VAR and 5-VAR
[Bar chart: VAR5 F1 Score vs. Number of Time Points (200–4000), for MVGC, LASSO, COPULA]
22
[Bar chart: VAR5 Specificity vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 Sensitivity vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 Precision vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 False Positive Rate vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 False Discovery Rate vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 Accuracy vs. Number of Time Points, for MVGC, LASSO, COPULA]
23
[Bar chart: VAR5 Accuracy vs. Number of Time Points, for MVGC, LASSO, COPULA]
5-VAR Accuracy
• Accuracy: proportion of true results among the total links available
• MVGC: increases as the number of time points increases; small score range (around 0.1)
• Lasso: increases as the number of time points increases; two extreme scores, wide score range
• Copula: optimized at around 400 time points; poor performance at higher numbers of time points
24
3-VAR and 5-VAR F1 Score
• F1 Score• Statistical Significance based on
Harmonic mean of Precision and Recall
• MVGC• Consistent Pattern, Increases as time
point increases
• Lasso• Contrast Pattern• Heavily affected by number of
variables
• Copula• Unique Pattern• Has a certain point / range where
performance is optimized
[Bar chart: VAR3 F1 Score vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 F1 Score vs. Number of Time Points, for MVGC, LASSO, COPULA]
25
Overall Performance
Metric                 Best Performance   Average Performance   Worst Performance
Sensitivity            Lasso              Copula                MVGC
Specificity            MVGC               Lasso                 Copula
Precision              MVGC               Lasso                 Copula
False Positive Rate    MVGC               Lasso                 Copula
False Discovery Rate   MVGC               Lasso                 Copula
Accuracy               MVGC               Lasso                 Copula
F1 Score               MVGC               Lasso                 Copula

3-Variable Time Series
• Overall performance is based on the average score across all time points
• MVGC outperforms the other two methods in the 3-VAR scenario
• Lasso scores were good at high numbers of time points
• Copula has a certain range (around 200 to 800 time points) where its scores were high, but outside that range its scores were lower than the other methods'
26
Metric                 Best Performance   Average Performance   Worst Performance
Sensitivity            MVGC               Copula                Lasso
Specificity            Lasso              Copula                MVGC
Precision              MVGC               Copula                Lasso
False Positive Rate    Lasso              Copula                MVGC
False Discovery Rate   MVGC               Copula                Lasso
Accuracy               Copula             MVGC                  Lasso
F1 Score               MVGC               Copula                Lasso

5-Variable Time Series
• MVGC shows consistency in both the 5-VAR and 3-VAR scenarios
• Copula provides the best accuracy of the three methods, especially at 200 to 800 time points
• Lasso scores highest at high numbers of time points, but its scores at low numbers of time points were low
27
Background and Theory
Implementation
Result and Discussion
Conclusion
Outline
• Conclusion
• Future Works
28
Conclusion
• The 3 GC methods (MVGC, Lasso, and Copula) can be compared using 7 performance metrics
• MVGC is consistent in most conditions
• Lasso has an advantage in handling high numbers of time points
• Copula has a certain range in which its performance is optimized
• Even though the overall scores favour MVGC over the other methods, the results are still conditional
29
Suggestions for Future Work
• Granger causality algorithms for non-linear data
  • Non-linear data provide a better representation of Gene Regulatory Networks
• Application to real datasets
  • Granger causality analysis may be applied to real datasets
• Other algorithms for GRN (Dynamic Bayesian Network)
  • DBN is another prominent method in this area
30
Thank You
31
Q & A