Comparative Study of Granger Causality Algorithms for Gene Regulatory Networks
TRANSCRIPT
1
INVESTIGATION OF IMAGE PROCESSING ALGORITHMS FOR MEDICAL APPLICATION
Zhafir Aglna Tijani, U1120208F
A final year project presentation in partial fulfilment of the requirements for the degree of Bachelor of Engineering
2
Background and Theory
Implementation
Result and Discussion
Conclusion
Outline
3
Background and Theory
• Problems and Objectives
• Gene Regulatory Network
• Granger Causality
• 3 Methods of Granger Causality
• Project Focus
Implementation
Result and Discussion
Conclusion
Outline
4
“It is more pragmatic to cure the cause of a disease at its source than to treat the
actual disease”
Gene
5
• The interaction between genes is called a Gene Regulatory Network
• Discovering this network remains challenging because of its complexity
• Efficient computational tools are required

To find an effective and efficient means to discover unknown Gene
Regulatory Networks
Objective
6
Modelling of GRN
• Nodes and edges depict the relations between genes
• Data obtained from DNA microarray
• Prominent method: Granger Causality
http://img.medicalxpress.com/newman/gfx/news/hires/2013/1-novelnoninva.jpg
7
Granger Causality
• Method for time series analysis
• Uses a Vector Auto-regression (VAR) model to calculate causality from time series data
Granger (1969)
Time Series A → Time Series B
U_t = Σ_{k=1}^{p} A_k U_{t-k} + ε_t

F_{Y→X} ≡ ln( |Σ'_{xx}| / |Σ_{xx}| )
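The VAR model and causality measure on this slide can be sketched in Python (a simplified pairwise estimate via ordinary least squares, not the MATLAB toolboxes used in the project; in this bivariate sketch, residual sums of squares stand in for the covariance determinants):

```python
import numpy as np

def granger_f(x, y, p=2):
    """F_{y->x} = ln(RSS_restricted / RSS_full): how much the past of y
    improves the prediction of x beyond x's own past (lag order p)."""
    n = len(x)
    target = x[p:]
    # Lagged regressors: column k-1 holds the series shifted by lag k
    x_lags = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    y_lags = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])

    def rss(regressors):
        design = np.column_stack([np.ones(len(target)), regressors])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return resid @ resid

    return np.log(rss(x_lags) / rss(np.hstack([x_lags, y_lags])))

# Toy data: x is driven by the past of y, so F_{y->x} should exceed F_{x->y}
rng = np.random.default_rng(0)
n = 500
y = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + 0.5 * rng.standard_normal()
```

Because the restricted model is nested in the full one, the measure is always non-negative; it is clearly positive only when the past of y genuinely helps predict x.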
8
Granger Causality
“If past values of A and B can predict the future value of B better than past values of B alone, then time series A Granger-causes time series B”
Granger (1969)
Time Series A → Time Series B
9
MVGC (Barnett et al., 2013), Lasso (Arnold et al., 2007), Copula (Liu and Bahadori, 2012)
3 Methods of Implementing Granger Causality
“These 3 methods have been implemented independently, but never compared under the same conditions.”
10
Main Focus of the Project
• Comparative Study of Algorithms
• Focus on the Performance of 3 Algorithms
• Finding Strength and Weaknesses
• Utilizing Control Variables and Metrics Performance
11
Background and Theory
Implementation
Result and Discussion
Conclusion
• Control Variables
• Causality Graph and Matrix
• Edge Analysis
• Performance Metrics
• Data for Analysis
Outline
12
Implementation
Time Series Input → GC Algorithm → Causality Matrix and Graph → Edge Analysis → Data for Discussion
• Implementation using MATLAB 2010b
• Based on existing toolboxes:
  • MVGC Toolbox (Barnett, 2013)
  • Lasso Granger
  • Copula Granger (Liu and Bahadori, 2012)
  • GLMnet
Program Flow
13
Implementation
Control Variables
• Based on a set of equations
• Linear time series dataset
• Generated by specifying the number of time points
• Advantages:
  • Provides a ground truth network: the actual causality of the time series
  • The ground truth can be compared with the algorithm output to measure the performance of the algorithms
• 2 types of dataset: 3-VAR and 5-VAR time series
• 8 different numbers of time points: 200, 400, 800, 1200, 1600, 2400, 3200, 4000
Synthetic Time Series Dataset
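The generation step above can be sketched as follows (a hypothetical 3-VAR(1) system in Python rather than the project's MATLAB code; the coefficient matrix A is an illustrative assumption whose off-diagonal pattern doubles as the ground-truth network):

```python
import numpy as np

# Hypothetical coefficient matrix: A[i, j] != 0 means variable j drives
# variable i, so the off-diagonal pattern of A is the ground-truth network.
A = np.array([[0.5, 0.0, 0.0],
              [0.4, 0.5, 0.0],
              [0.0, 0.4, 0.5]])

def simulate_var(A, n_points, noise=1.0, seed=0):
    """Generate a linear VAR(1) time series with n_points samples."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    u = np.zeros((n_points, d))
    for t in range(1, n_points):
        u[t] = A @ u[t - 1] + noise * rng.standard_normal(d)
    return u

series = simulate_var(A, 200)  # smallest of the 8 dataset sizes
# Ground truth: off-diagonal nonzero coefficients are the true causal links
ground_truth = ((A != 0) & ~np.eye(3, dtype=bool)).astype(int)
```

Because the data are generated from known equations, the recovered network can be scored exactly against `ground_truth`, which is the advantage the slide describes.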
14
3 Granger Causality Algorithms
15
Causality Matrix
• 1 represents: a link exists between variables
• 0 represents: no link exists
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
• The output of a GC algorithm is the causality matrix
• It depicts the Granger causality between time series
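Reading the matrix off as a set of directed edges might look like this (a sketch; the convention of rows as targets and columns as sources is an assumption here, and the toolboxes may differ):

```python
import numpy as np

# The 5x5 causality matrix from the slide
C = np.array([[0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [1, 0, 0, 1, 0]])

# Entry C[i, j] = 1 is read as "variable j Granger-causes variable i",
# so each nonzero entry becomes a (source, target) edge of the graph.
edges = [(j, i) for i, j in zip(*np.nonzero(C))]
print(edges)  # [(0, 1), (0, 2), (0, 3), (4, 3), (0, 4), (3, 4)]
```

The edge list is what gets drawn as the causality graph: variable 0 drives most of the network, and variables 3 and 4 drive each other.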
16
Edge Analysis
• The raw output of each algorithm is converted to a binary matrix by masking with a threshold of 0.0001
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
• Edge Analysis measures the performance of an algorithm by comparing its output with the benchmark
• Benchmark = Ground Truth
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1
Ground Truth (first matrix), Lasso Method (second matrix)
17
Edge Analysis
For the above example:
• TP: 4
• TN: 6
• FP: 13
• FN: 2
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
• Using parameters from the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1
Ground Truth (first matrix), Lasso Method (second matrix)
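The edge counting above can be reproduced directly (a Python sketch; the two matrices are the ground truth and the Lasso output from the slide):

```python
import numpy as np

def edge_counts(predicted, truth):
    """Compare a binary causality matrix against the ground truth and
    return the confusion-matrix counts (TP, TN, FP, FN)."""
    tp = int(np.sum((predicted == 1) & (truth == 1)))
    tn = int(np.sum((predicted == 0) & (truth == 0)))
    fp = int(np.sum((predicted == 1) & (truth == 0)))
    fn = int(np.sum((predicted == 0) & (truth == 1)))
    return tp, tn, fp, fn

truth = np.array([[0, 0, 0, 0, 0],
                  [1, 0, 0, 0, 0],
                  [1, 0, 0, 0, 0],
                  [1, 0, 0, 0, 1],
                  [1, 0, 0, 1, 0]])
lasso = np.array([[0, 0, 1, 0, 1],
                  [1, 0, 1, 1, 1],
                  [1, 1, 1, 0, 1],
                  [0, 1, 0, 1, 1],
                  [1, 1, 1, 0, 1]])
print(edge_counts(lasso, truth))  # (4, 6, 13, 2)
```

The four counts match the slide: 4 true links recovered, 2 missed, 13 spurious links, and 6 correctly absent links.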
18
7 Performance Metrics
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
False Positive Rate = FP / (TN + FP)
False Discovery Rate = FP / (TP + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 Score = 2TP / (2TP + FP + FN)
• Calculated from the values of TP, TN, FP, and FN
• Used in past research on similar topics
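As a sketch, the seven formulas translate one-to-one into code (applied here to the TP/TN/FP/FN counts from the edge-analysis example):

```python
def performance_metrics(tp, tn, fp, fn):
    """The 7 performance metrics, computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (tn + fp),
        "false_discovery_rate": fp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
    }

m = performance_metrics(tp=4, tn=6, fp=13, fn=2)
print(round(m["accuracy"], 2))  # 0.4
```

On the example counts, accuracy is (4 + 6) / 25 = 0.4 and the F1 score is 8 / 23, i.e. about 0.35.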
19
Data for Analysis
• The result of Granger causality depends on the generated time series
• A few samples were not sufficient, since the generated time series differed on each run
• The experiment was iterated 2000 times
• The mean value of each performance metric is the basis for the comparative study
Lasso, 1st Iteration:
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1

Lasso, 2nd Iteration:
0 0 1 0 0
0 0 1 1 0
1 1 1 0 1
0 1 0 1 1
0 1 0 1 1
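The averaging loop might be sketched as below. `run_once` is a hypothetical placeholder for one full pass (generate time series, run the GC algorithm, do edge analysis, compute metrics); here it returns noisy dummy scores only so the loop is runnable:

```python
import numpy as np

def run_once(seed):
    """Placeholder for one generate -> algorithm -> edge-analysis pass.
    Returns a dummy noisy accuracy score; a real run would return all
    seven metrics for each of the three algorithms."""
    rng = np.random.default_rng(seed)
    return {"accuracy": 0.4 + 0.05 * rng.standard_normal()}

# Each iteration uses a fresh synthetic time series (here: a fresh seed)
n_iterations = 2000
accuracies = [run_once(s)["accuracy"] for s in range(n_iterations)]
mean_accuracy = float(np.mean(accuracies))  # value used for comparison
```

Averaging over 2000 independently generated datasets smooths out the run-to-run variation visible in the two Lasso iterations above, so the comparison reflects typical rather than one-off behaviour.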
20
Background and Theory
Implementation
Result and Discussion
Conclusion
Outline
• Performance Metrics Scores
• Specific Results:
  • 5-VAR Accuracy
  • 3-VAR and 5-VAR F1 Score
• Overall Score Result
21
Scores of Metrics
• Bar chart to represent the score of each performance metrics on 3 methods
• X axis : Number of Time Points• Y axis : Score of Metrics
• 7 Metrics Performance• 2 Scenario : 3-VAR and 5-VAR
[Bar chart: VAR5 F1 Score vs. Number of Time Points (200–4000), for MVGC, LASSO, COPULA]
22
[Bar chart: VAR5 Specificity vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 Sensitivity vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 Precision vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 False Positive Rate vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 False Discovery Rate vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 Accuracy vs. Number of Time Points, for MVGC, LASSO, COPULA]
23
[Bar chart: VAR5 Accuracy vs. Number of Time Points, for MVGC, LASSO, COPULA]
5-VAR Accuracy
• Accuracy: proportion of true results among the total links available
• MVGC: increases as the number of time points increases; small score range (around 0.1)
• Lasso: increases as the number of time points increases; two extreme scores, wide score range
• Copula: optimized at around 400 time points; poor performance at higher numbers of time points
24
3-VAR and 5-VAR F1 Score
• F1 Score• Statistical Significance based on
Harmonic mean of Precision and Recall
• MVGC• Consistent Pattern, Increases as time
point increases
• Lasso• Contrast Pattern• Heavily affected by number of
variables
• Copula• Unique Pattern• Has a certain point / range where
performance is optimized
[Bar chart: VAR3 F1 Score vs. Number of Time Points, for MVGC, LASSO, COPULA]
[Bar chart: VAR5 F1 Score vs. Number of Time Points, for MVGC, LASSO, COPULA]
25
Overall Performance
Metric                 Best Performance   Average Performance   Worst Performance
Sensitivity            Lasso              Copula                MVGC
Specificity            MVGC               Lasso                 Copula
Precision              MVGC               Lasso                 Copula
False Positive Rate    MVGC               Lasso                 Copula
False Discovery Rate   MVGC               Lasso                 Copula
Accuracy               MVGC               Lasso                 Copula
F1 Score               MVGC               Lasso                 Copula

3-Variable Time Series
• Overall performance is based on the average score across all time points
• MVGC outperforms the other two methods in the 3-VAR scenario
• Lasso scores were good at high numbers of time points
• Copula has a certain range (around 200 to 800 time points) where its scores were high, but outside that range its scores were lower than the other methods'
26
Metric                 Best Performance   Average Performance   Worst Performance
Sensitivity            MVGC               Copula                Lasso
Specificity            Lasso              Copula                MVGC
Precision              MVGC               Copula                Lasso
False Positive Rate    Lasso              Copula                MVGC
False Discovery Rate   MVGC               Copula                Lasso
Accuracy               Copula             MVGC                  Lasso
F1 Score               MVGC               Copula                Lasso

5-Variable Time Series
• MVGC shows consistency in both the 5-VAR and 3-VAR scenarios
• Copula provides the best accuracy of the three methods, especially at 200 to 800 time points
• Lasso scores highest at high numbers of time points, but its scores at low numbers of time points were low
27
Background and Theory
Implementation
Result and Discussion
Conclusion
Outline
• Conclusion
• Future Works
28
Conclusion
• The 3 GC methods (MVGC, Lasso, and Copula) can be compared using 7 performance metrics
• MVGC is consistent in most conditions
• Lasso has an advantage in handling high numbers of time points
• Copula has a certain range in which its performance is optimized
• Even though the overall scores favour MVGC over the other methods, the results are still conditional
29
Suggestions for Future Work
• Granger causality algorithms for non-linear data
  • Non-linear data provide a better representation of Gene Regulatory Networks
• Application to real datasets
  • Granger causality analysis may be applied to real datasets
• Other algorithms for GRN (Dynamic Bayesian Network)
  • DBN is another prominent method in this area
30
Thank You
31
Q & A