collaboration with statistician? 矩陣視覺化於探索式資料分析

72
1 August 30, 2014 中央研究院 人文社會科學館國際會議廳 Collaboration with Statistician? 矩陣視覺化於探索式資料分析 陳君厚 中央研究院 統計科學研究所

Post on 14-Nov-2014

3.105 views

Category:

Technology


26 download

DESCRIPTION

陳君厚 (Chun-Houh Chen) 中央研究院統計科學研究所研究員兼副所長 中華機率統計學會理事長 國際統計計算學會亞洲分會主席 美國加州大學洛杉磯分校數學博士,專長為 Bioinformatics, Data/Information Visualization, Dimension Reduction。

TRANSCRIPT

Page 1: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

1 August 30, 2014

中央研究院 人文社會科學館國際會議廳

Collaboration with Statistician? 矩陣視覺化於探索式資料分析

陳君厚 中央研究院 統計科學研究所

Page 2: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

2

http://escience.washington.edu/blog/new-phd-tracks-big-data

Page 3: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

3

Q: 君厚, Big Data 這麼熱, 我們是否應該排進統計系所學程?

A: Big Data? 現在的統計系所畢業學生能夠處理 Data 嗎?

Page 4: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

4

胡海國 教授 台灣大學 醫學院 精神科主任

主要合作研究者

楊泮池 教授 台灣大學 醫學院 院長

陳章榮 研究員 美國 FDA 毒理研究所

李御賢 教授 長庚紀念醫院 核心實驗室 銘傳大學

白果能 研究員 中央研究院 生物醫學科學研究所

吳漢銘 教授 淡江大學 數學系

周正中 教授 中正大學 生命科學系

楊永正 教授 陽明大學 生物醫學資訊研究所

林文昌 研究員 中央研究院 生物醫學科學研究所

黃奇英 教授 陽明大學 臨床醫學研究所

楊欣洲 研究員 中央研究院 生物醫學科學研究所

Page 5: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

中研院賴明昭副院長推廣跨領域合作 (November 10, 2004)

I. Collaboration with Statistician?

合作不一定是資料分析 與統計學家合作很可能是需要資料分析

Page 6: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

陳老師,我的同事用Factor Analysis上 同樣的Journal只花了三分之一的時間

Luxury Research vs. Necessity Research

Senior Researcher vs. Young Investigator

(Established) vs. (Struggling)

0 10.2 0.4 0.6 0.8

Corr. -1 0 1

Corr. -1 0 1

Corr. -1 0 1

Corr. -1 0 1

-.1 .1

-0.2 0.2 -0.4 0.4

Page 7: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Chun-houh, can you create some powerful statistical/bioinformatics methods so we can get our experiments published in Nature/Science?

Sir, can you conduct some meaningful biological/medical experiments so we can get our methods published in Nature/Science?

Page 8: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Mutual Trust (XX 人在中研院 YY 會議中說) 統計所的陳君厚說你們 ZZ 所的生物晶片 都有問題

Mutual Understanding You can remove Figure1 together with my name from the paper

P3G9P1P6

G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15

G4G16G8P7S2G14S1S3P4

G3G6

P5G2G1

Negative Disorg. Host./Excit.

De l./Hall.

15 10 5 0Average Euc lidean Distance

GONEG

GWNEG

G2

G3

G4

G1

PANSS Score

1 2 3 4 5 6 7

P3G9P1P6

G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15

G4G16G8P7S2G14S1S3P4

G3G6

P5G2G1

Average Correlation1 0.20.8 0.6 0.4

Correlation Coefficient

-1 -0.5 0 0.5 1

Negative�Symptoms

Disorganized�Thought

Hostility /�Excitement

Delusion /�Hallucination

PANSS Score

1 2 3 4 5 6 7

Correlation Coefficient

-1 -0.5 0 0.5 1

G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6

G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6

Negative�Symptoms

Disorganized�Thought

Hostility /�ExcitementDelusion /�HallucinationAnxiety�Symptoms

RMG�(n=61)

MBG�(n=50)

PDHG2�(n=38)

PDHG1�(n=14)

0 5 10Average Euc lidean Distance

0.2 10.4 0.6 0.8Average Correlation Negative Disorg. Host./

Excit.De l./Hall. Anxiety

Page 9: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

9

II. 矩陣視覺化於探索式資料分析

Matrix Visualization: Approaching Statistics and Statistical Approach

矩陣視覺化: 趨近統計與統計趨勢

Page 10: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Lab 309 (???) for Information Visualization

Dr. 田銀錦 Postdoc. Fellow

張勝傑 張文宗 陳柏旭 鐘雅齡 黃建勳 林香誼 劉勝宗 曾聖澧 葉紫君 吳怡真 林倩如 歐陽智聞 . . .

10

Prof. 吳漢銘 Dept. Math. Tamkang U.

Mr. 高君豪 Ph.D. student

Prof. 須上英 Dept. Stat. Nat’l Taipei U.

Dr. 何孟如 Postdoc. Fellow

Ms. 石佳鑫 Research Assistant

Page 11: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

11

Data analysis

With the goal of •  discovering useful information •  suggesting conclusions •  supporting decision making

A process of •  inspecting data •  cleaning data •  transforming data •  modeling data

了解資料 探索式資料分析 (Exploratory Data Analysis)

資料視覺化

Page 12: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

allow the data to speak for themselves before standard assumptions or formal modeling

Matrix Visualization as an EDA tool for assisting formal mathematical modeling

The greatest value of a picture is when it forces us to notice what we never expected to see.

It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.

Exploratory Data Analysis EDA, John Tukey (1977)

1915~ 2000

12

Page 13: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

John W. Tukey在探索式資料分析 (Exploratory Data Analysis, EDA) 書中開宗明義地提到: It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it. 學習你可以做什麼,有助於在資料分析的過程中達到事半功倍的效果。EDA的作用在於從「看」資料獲得資料所傳達的訊息,所著重的是簡單的算術與容易建構的圖、表。透過 EDA 對於圖表中所顯露之型樣 (pattern) 做一初步的認知與描述,再進一步以人類的心智 (mind) 對所接收的訊息做全面的分析與判斷,以探索潛藏於資料中的訊息。強調的是探索式的分析而非嚴謹的模式確認。

Page 14: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

14

資料視覺化需要統計嗎?

Page 15: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

I. SetosaI. VerginicaI. Versicolor

Species name

0

20

40

60

80

Pet É Pet É Sep É Sep É

0

20

40

60

80

Pet É Pet É Sep É Sep É

20

40

60

80

30 60 90 120

Series

Data

nscores

Petal width

I.S I.V I.V

10

20

30

40

50

Species name

Graphics/Visualization for high dimensional data? P>5 p>10 p>100 p>10000

15

Page 16: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Recent Review Articles for MV The History of the Cluster Heat Map

Leland WILKINSON and Michael FRIENDLY

The American Statistician, May 2009, Vol. 63, No. 2 179

REVIEW Seriation and Matrix Reordering Methods: An

Historical Overview by Innar Liiv

Statistical Analysis and Data Mining 3: 70–91, 2010

Figure 2. Shaded matrix display from Loua (1873), available online at http://books.google.com/books/. This was designed as a summary of 40 separate maps of Paris, showing the characteristics (e.g., national origin, professions, age, social classes) of 20 districts, using a color scale ranging from white (low) through yellow and blue to red (high).

Figure 3. Sorted shaded display from Brinton (1914). The data are ranks of U.S. states on each of 10 educational features assessed in 1910. The matrix has been sorted by the row-marginal ranks.

Figure 5. Sorted shaded display from Czekanowski (1909), reproduced in Hage and Harary (1995).

Figure 9. Cluster heat map from Wilkinson (1994). The data are social statistics (i.e., urbanization, literacy, life expectancy for females, GDP, health expenditures, educational expenditures, military expenditures, death rate, infant mortality, birth rate, and ratio of birth to death rate) from a United Nations survey of world countries. The variables were standardized before the hierarchical clustering was performed.

Matrix Visualization (MV): reorderable matrix, heatmap, color histogram, data image 16

Page 17: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

1100004000005000000000000000002223302220034220330032111010001030000002000000000000000000000000000002555000100000000000111100000000333330053151214444205555455005551550550010000300003002210000000020000000000000010000000000000000000010000000000000000000202002202000020000022020100000323120000021322122200000000003000000000000000002200000000000000000000031100020000131300002503300000020202003043331300031551004400000344040005500000000444430444143554133300000000003000000000000000000000000000000000000000020200000030122032000000000000020101000000030200020501000500450000000134000000000200120000003204423110005005000042000000020000020001000100000003141030000000030155033000000400000000000000000000000040333555000100000044033003040440000333230300000023322220000000002000000000000000003210000000000000000000011100020000010000000503000000020022200200300302034554000550000444040020002000000200200000003204044330000000000001000000000000000001000200222200121120042400030040030000010402200023022322002000330301222333000400000200000244043340020302300000003304000452310003000003000000400330000000003000000002245430244400030000440030020233200000043433334433232222231321000100220230020000000000000200220333020424133331110000005500000000240300000000000000000010001000200000000000000000000300000000000000000000000000020444000400440000000332433033000433334333003444244441110000013001112000030112222101011121221100121222111111111111115000005001111111155555005511555555551331000000300500000010043441110101101111000101112110000000004000000000000230004400000000000000000002000000030010030000022103202200032323322202233303321221100110001100000321032212000332232223233333133214410034002002000000050000000000000000000002020002000000012000020000000000003000000000002200200000200440000401000000000015001001000011110210020322023002200004000004000000210000000002000010010003020221040400000020055300000000002000000000000010031212101555505400000344440005002000000000000000100000002100000000000000000000000000000001101200020013020321044140040000441140000100000000011101010000000223200222000500000400010000100000000323240000022222204005500053004003000000050000000000000000000000000000043400030000040000000000000000044434033311032444422333000000044530000000033030000423333333021333333211002000000000200000020000000002003200002103221121033200030000041330001210001000020102222211032323310333000200300300000011022022000312022222010322022223330003004003000000210220200002220220110002020220055511150551115555522335544101142424433454455445545000000300200000000120032023000222123112001313222114000000003303333330030555500432011120203123111002000000000000000000000002200043000000000000020000030110000302003300000232032030000433333333023333333325500002000002330100000342100402333213344403431115121110150111155552201000000000021234031111021335500555000101000010000200022020040333421221222505523445550000000003000000220000300001000000032000000031144400350001033332200402102211001000000100030202321330003300303300200003043300030022002000002000000320000000005000000002030332014300002100000003000003300000000020020000000300000020000000000000000200100111021201220000000022155111150201023024111404111555550050005003341000200040343300000000000000000001044400240040030033320505434204040022400000240414445505000500000000000000000000000333134331011111134025550005000055550000000000000002222100210003331230030330000000050000045000000000031243313304323303410551001300310330000025000000000333220231231224113230000000000000000000000210023010000000000000000001222100120000030000000200000000000000000000230210222200000000300000000032000000043000000000000000000430000000003000000000130000002000000000000012000003133300320023210300202303323000020002200022043220022000000210100401000010033011000210010013012101032005040045004005000000050000000000000000003300000000020000200000000000000100000000043434033202224012330511000300000030005030033200000333440431545144114125500053001300000000200000000003000230000000035540000000000000000002002000000000022201001200022200211330003000440200000014220000210000000000000201000301000001302000000004300432430404443403341224440242033300020030000200000003302044200000000000000000032330000000000300000223232000000322223223102302022110000000013100000000221210003320010010110011110001100000000030000000023200000033000000000000000000030000000230200200000023000000220000000011000221200100000000004000000002320000003200000020000021010001100000000020000000000100000022110000100000000220010444000300300300000333003000230200002000001200000020000000003002000002220200003220000010000000000001100000003040000000033310000043000000000000230000032222000000000010000000022110000212210111010320000214140024000013000000250300200002200000220003243222320200000031000000000400000243000000000000030000030

Data Matrix

50 題精神症狀量表 95

位精

神醫

學患

Data Map

5

0

嚴重

正常

50 題精神症狀量表

95 位

精神

醫學

患者

Page 18: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Generalized Association Plots (GAP) for MV of continuous data

2.permutation

4.summary

3.partition GAP

Data Matrix Continuous

ordinal Binary nominal

1.presentation

18

Approaching Statistics & Statistical Approach 1100004000005000000000000000002223302220034220330032111010001030000002000000000000000000000000000002555000100000000000111100000000333330053151214444205555455005551550550010000300003002210000000020000000000000010000000000000000000010000000000000000000202002202000020000022020100000323120000021322122200000000003000000000000000002200000000000000000000031100020000131300002503300000020202003043331300031551004400000344040005500000000444430444143554133300000000003000000000000000000000000000000000000000020200000030122032000000000000020101000000030200020501000500450000000134000000000200120000003204423110005005000042000000020000020001000100000003141030000000030155033000000400000000000000000000000040333555000100000044033003040440000333230300000023322220000000002000000000000000003210000000000000000000011100020000010000000503000000020022200200300302034554000550000444040020002000000200200000003204044330000000000001000000000000000001000200222200121120042400030040030000010402200023022322002000330301222333000400000200000244043340020302300000003304000452310003000003000000400330000000003000000002245430244400030000440030020233200000043433334433232222231321000100220230020000000000000200220333020424133331110000005500000000240300000000000000000010001000200000000000000000000300000000000000000000000000020444000400440000000332433033000433334333003444244441110000013001112000030112222101011121221100121222111111111111115000005001111111155555005511555555551331000000300500000010043441110101101111000101112110000000004000000000000230004400000000000000000002000000030010030000022103202200032323322202233303321221100110001100000321032212000332232223233333133214410034002002000000050000000000000000000002020002000000012000020000000000003000000000002200200000200440000401000000000015001001000011110210020322023002200004000004000000210000000002000010010003020221040400000020055300000000002000000000000010031212101555505400000344440005002000000000000000100000002100000000000000000000000000000001101200020013020321044140040000441140000100000000011101010000000223200222000500000400010000100000000323240000022222204005500053004003000000050000000000000000000000000000043400030000040000000000000000044434033311032444422333000000044530000000033030000423333333021333333211002000000000200000020000000002003200002103221121033200030000041330001210001000020102222211032323310333000200300300000011022022000312022222010322022223330003004003000000210220200002220220110002020220055511150551115555522335544101142424433454455445545000000300200000000120032023000222123112001313222114000000003303333330030555500432011120203123111002000000000000000000000002200043000000000000020000030110000302003300000232032030000433333333023333333325500002000002330100000342100402333213344403431115121110150111155552201000000000021234031111021335500555000101000010000200022020040333421221222505523445550000000003000000220000300001000000032000000031144400350001033332200402102211001000000100030202321330003300303300200003043300030022002000002000000320000000005000000002030332014300002100000003000003300000000020020000000300000020000000000000000200100111021201220000000022155111150201023024111404111555550050005003341000200040343300000000000000000001044400240040030033320505434204040022400000240414445505000500000000000000000000000333134331011111134025550005000055550000000000000002222100210003331230030330000000050000045000000000031243313304323303410551001300310330000025000000000333220231231224113230000000000000000000000210023010000000000000000001222100120000030000000200000000000000000000230210222200000000300000000032000000043000000000000000000430000000003000000000130000002000000000000012000003133300320023210300202303323000020002200022043220022000000210100401000010033011000210010013012101032005040045004005000000050000000000000000003300000000020000200000000000000100000000043434033202224012330511000300000030005030033200000333440431545144114125500053001300000000200000000003000230000000035540000000000000000002002000000000022201001200022200211330003000440200000014220000210000000000000201000301000001302000000004300432430404443403341224440242033300020030000200000003302044200000000000000000032330000000000300000223232000000322223223102302022110000000013100000000221210003320010010110011110001100000000030000000023200000033000000000000000000030000000230200200000023000000220000000011000221200100000000004000000002320000003200000020000021010001100000000020000000000100000022110000100000000220010444000300300300000333003000230200002000001200000020000000003002000002220200003220000010000000000001100000003040000000033310000043000000000000230000032222000000000010000000022110000212210111010320000214140024000013000000250300200002200000220003243222320200000031000000000400000243000000000000030000030

Page 19: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

1.  Data Matrix (n * p)

(w/ Color coding)

Continuous Ordinal Binary

Nominal

2. Proximity Matrix for Subject (n * n)

Continuous Ordinal Binary

Nominal

3. Proximity (Variable p * p)

Continuous Ordinal Binary

Nominal

Some essential elements in a GAP MV procedure

4. Permutation (variable)

4. Permutation (subject)

19

Page 20: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Statistical Approach Identify Global Trend: Singular Value Decomposition

20

Chen 2002, Statistica Sinica Rank 2 Elliptical

R2E

SVD

SVD1

(c) Correlation-1 0 +1

-8 1:1 +8(a) Expression

(d)

(b) Correlation-1 0 +1

Alter O. et al 2000, PNAS

SVD2

Page 21: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

21

Eisen et al. (1998)

Tree seriation & flipping of intermediate nodes

(a)

A E D C B

D A

E C B

(b)

(c)

A B E D C

1 flip 3 flips 5 flips many flips

2n-1=25-1=16

Different Seriations (Ordering of Terminal Nodes or Leaves) Generated from Identical Tree Structure

ideal model

external and internal references for guiding flipping mechanism

Statistical Approach: Identify Local Clusters

Page 22: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

(c) Correlation-1 0 +1

-8 1:1 +8(a) Expression

(d)

(b) Correlation-1 0 +1

(c) Correlation-1 0 +1

-8 1:1 +8(a) Expression (b) Correlation

-1 0 +1

(d) (e)(c) Correlation-1 0 +1(c) Correlation

-1 0 +1

-8 1:1 +8(a) Expression

-8 1:1 +8(a) Expression (b) Correlation

-1 0 +1(b) Correlation

-1 0 +1

(d) (e)

( c) Correl at i on

- 1 0 +1

( d)

- 8 1: 1 +8( a) Expressi on ( b) Correl ati on

- 1 0 +1

GAP Elliptical (R2E) Seriation Hierarchical Tree Seriation Tree guided by (R2E) 22

Approaching Statistics & Statistical Approach

HCT R2E HCTR2E + =

Page 23: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Admission 6 month

0.2 10.4 0.6 0.8

Absolute Random Error Coefficient

0 1comforting the aggravating patient�assistant to the aggravating patient�transport of the aggravating patient to service setting �financial aid�general psychological/practical supportcoping with medical team�understanding diagnosis and treatment�identifying early signs of relapse�understanding mental health laws�general social acceptance�occupational therapy�sheltered working facilities�advice on intimate relationship for patient�lifelong custodial care for patient

Need �cluster for�assistant�to patient�care

Need �cluster for�accessing �to relevant�information

Need �cluster �for�societal �support

Need �cluster�for�burden�release

Admission Hwu et al. Schizophrenia Research (2002) Symptom Patterns and Subgrouping of Schizophrenic Patients: Significance of Negative Symptoms Assessed on Admission

0 10.2 0.4 0.6 0.8

Corr. -1 0 1

Corr. -1 0 1

Corr. -1 0 1

Corr. -1 0 1

-.1 .1

-0.2 0.2 -0.4 0.4

Psychiatry Research (1998) Lin, Chen et al. Psychopathological Dimensions in Schizophrenia: A Correlational Approach to Items of the SANS and SAPS

Genes, Brain and Behavior (2009) Lin et al. Clustering by neurocognition for fine-mapping of the schizophrenia susceptibility loci on chromosome 6p

6 month Liu et al. J. of the Formosan Med. Ass. (2012) Medium-term course and outcome of schizophrenia depicted by the sixth-month subtype after an acute episode

GAP for Heritable (Genetic) Disease: Schizophrenia (National Taiwan University)

P3G9P1P6

G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15

G4G16G8P7S2G14S1S3P4

G3G6

P5G2G1

Negative Disorg. Host./Excit.

De l./Hall.

15 10 5 0Average Euc lidean Distance

GONEG

GWNEG

G2

G3

G4

G1

PANSS Score

1 2 3 4 5 6 7

P3G9P1P6

G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15

G4G16G8P7S2G14S1S3P4

G3G6

P5G2G1

Average Correlation1 0.20.8 0.6 0.4

Correlation Coefficient

-1 -0.5 0 0.5 1

Negative�Symptoms

Disorganized�Thought

Hostility /�Excitement

Delusion /�Hallucination

PANSS Score

1 2 3 4 5 6 7

Correlation Coefficient

-1 -0.5 0 0.5 1

G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6

G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6

Negative�Symptoms

Disorganized�Thought

Hostility /�ExcitementDelusion /�HallucinationAnxiety�Symptoms

RMG�(n=61)

MBG�(n=50)

PDHG2�(n=38)

PDHG1�(n=14)

0 5 10Average Euc lidean Distance

0.2 10.4 0.6 0.8Average Correlation Negative Disorg. Host./

Excit.De l./Hall. Anxiety

J. of the Formosan Med. Ass.(2008) Yeh et al. Factors Related to Perceived Needs of Chief Caregivers of Patients with Schizophrenia

PLoS ONE (2011) Lai et al. MicroRNA expression aberration as potential peripheral blood biomarkers for schizophrenia

Schizophrenia Research (2013) Liu et al. Development of a brief self-report questionnaire for screening putative pre-psychotic states.

Page 24: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

BMC Genomics 9 (2008) Genomics and proteomics of immune modulatory effects of a butanol fraction of Echinacea purpurea in human dendritic cells Wang et al.

Phytochemistry 70 (2009) Anti-diabetic properties of three common Bidens pilosa variants in Taiwan Chien et al.

Journal of Nutritional Biochemistry 21 (2010) Comparative metabolomics approach coupled with cell- and gene-based assays for species classification and anti-inflammatory bioactivity validation of Echinacea plants Hou et al.

GAP for Comparative Metabolome: Chinese Herbal Medicine Drs. Ning-Sun Yang, Lie-Fen Shyur, Wen-Chin Yang Agricultural Biotechnology Research Center (ABRC) of Academia Sinica

BMC Complementary and Alternative Medicine 13 (2013) Morus alba and active compound oxyresveratrol exert anti-inflammatory activity via inhibition of leukocyte migration involving MEK/ERK signaling. Chen et al.

紫錐菊

咸豐草

白桑

Page 25: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Journal of Clinical Oncology 23 (2005) Tumor-Associated Macrophages in Cancer Progression Chen J. J. et al.

The New England Journal of Medicine 356 (2007) A Five-Gene Signature and Clinical Outcome in Non–Small-Cell Lung Cancer Chen H. Y. et al.

BMC Genomics 6 (2005)Molecular signature of clinical severity in recovering patients with (SARS-CoV) Lee Y. S. et al. (Chang Gung Hospital)

Open Access Scientific Reports 1 (2006) In silico Therapeutic Drug Screening for Reversing the Lung Adenocarcinoma Overexpressed Gene Signatures. Kuo Y. L. et al. (Nat’l Yang-Ming Univ.)

Cancer Research 66 (2006) Non–Small Cell Lung Cancer with Tumor Cell Invasiveness Sher Y. P. et al.

GAP for Cancer Study: Non–Small Cell Lung Cancer (National Taiwan University)

GAP for Infectious Disease: SARS b Simple Match Between Pathways

a PPI to Pathway c Simple Match Between PPIs

F13A1,HSPB1!MAPK14,EGFR!

EGFR,HSPB1!STAT1,PDGFRB!PDGFRB,CRKL!

HCK,CRKL!ITGAV,PTK2!FLT1,CRKL!

CRKL,MAPK1!CRKL,RAF1!

MAPK3,PTPN11!STAT5A,SHC1!

CRK,SRC!GAB1,SOS1!

CRK,SHC1!PXN,PTPN11!

PDGFRB,PTPN11!PDGFRB,PLCG1!

PLCG1,PTK2!CRKL,GAB1!

CRKL,PTPN11!BAD,YWHAZ!

BAD,RAF1!PTK2,PTEN!PXN,PTEN!

CRKL,PIK3R1!AKT1,HSPB1!AKT1,PDPK1!

MAPK14,AKT1!PIK3R1,SHC1!

PIK3R1,SRC!HCK,SOS1!

CRKL,SOS1!PDGFRB,RAF1!

FLT1,PTPN11!HCK,PLCG1!FLT1,PLCG1!CRKL,EGFR!

CRK,KDR!CRKL,PTK2!FLT1,PTK2!

MAPK14,MAPK3!BAD,MAPK8!

AKT1,SMAD4!FLT1,HCK!

HCK,PIK3CB!CTNNB1,FLT1!

PIK3R1,PXN!FLT1,PIK3R1!

AKT1,PAK1!AKT1,NOS3!AKT1,MDM2!PTK2,YES1!

PXN,MAPK8!CRK,FLT1!

MAPK3,MAPK1!PDGFRB,SLC9A3R1!

EGFR,HCK!MCM7,CDC6!CDC6,MCM6!

PLK1,PKMYT1!E2F1,CDC6!

CCNB1,PKMYT1!CDK7,E2F1!

PLK1,CCNB1!CCNB1,CDC25A!

CCNA2,CCNB1!

M1!

M2!

H1!

H2!

B1!

B2!

A1!

A2!

P1!

P2!

P3!P4!P5!

Signalling to RAS!Signaling by EGFR!PDGFR-alpha signaling pathway!PDGFR-beta signaling pathway!Signaling events activated by Hepatocyte Growth Factor Receptor (c-Met)!IGF1 pathway!Signaling events mediated by VEGFR1 and VEGFR2!role of pi3k subunit p85 in regulation of actin organization and cell migration!PI3K/AKT signalling!akt signaling pathway!mTOR signaling pathway!Hedgehog signaling events mediated by Gli proteins!PPAR signaling pathway - Homo sapiens (human)!Canonical Wnt signaling pathway!Complement and coagulation cascades - Homo sapiens (human)!Unwinding of DNA!Activation of the pre-replicative complex!cdk regulation of dna replication!sonic hedgehog receptor ptc1 regulates cell cycle!Cyclin A/B1 associated events during G2/M transition!E2F mediated regulation of DNA replication!

a Both Positive!

Mahlavu Only!

Huh7 Only!

Not on the pathway!

b,c 1 0

Color legends

GAP for Protein-Protien

Interaction C-Y F. Huang,

Nat’l Yang-Ming Univ.

Molecular and Cellular Proteomics 12 (2013) An analysis of protein-protein interactions in cross-talk pathways reveals CRKL as a novel prognostic marker in hepatocellular carcinoma. Liu et al.

Page 26: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

X1 x2 x3 x4 0

1

Scatter-plot Matrix (SM) Parallel Coordinates Plot (PCP)

Mosaic Plot

Visualization for Binary Data

26

Page 27: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

1. Data Matrix

2. Subject Proximity

3. Variable Proximity

Essential elements in a GAP MV procedure?

1. Data Matrix

2. Subject Proximity

3. Variable Proximity

Continuous Binary

Correlation Covariance polychoric Correlation . . .

Euclidean Distance Manhattan Distance Correlation … ?

27

Matrix Visualization for Binary Data

Page 28: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

28

Commonly used similarity coefficients for binary data

Tzeng et al. (BMEI 2009) (IEEE Xplore Digital Library)

Page 29: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

CGMIM performs automated text-mining of OMIM to identify genetically-related cancers Online Mendelian In Man (OMIM) is a computerized database of information about genes and heritable traits in human populations OMIM is maintained on the Internet by the National Center for Biotechnology Information at the US National Institutes of Health CGMIM considers 21 anatomic sites based on the major cancers identified by the National Cancer Institute of Canada CGMIM compares each OMIM entry name and alternative name with a list of gene names assigned by HUGO (HUman Genome Organization). CGMIM produces the number of genes for which an OMIM entry mentions each pair of cancers, as well as a ratio of the observed and expected number of genes for the combination

http://www.bccrc.ca/ccr/CGMIM/ CGMIM Online Binary GAP Example

29

Page 30: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

CGMIM All Data (1948 genes * 21 Sites) Original Order

21 Cancer Sites

1948 Related Genes

Jaccard: a/(a+b+c)

30

Page 31: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

21 Cancer Sites

1948 Related Genes

CGMIM All Data (1948 genes * 21 Sites) Single_Tree_GrandPa_Guide

Jaccard: a/(a+b+c)

31

Page 32: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

21 Cancer Sites

768 Related Genes

CGMIM 768 genes at least at 2 Sites Original Order

Jaccard: a/(a+b+c)

32

Page 33: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

21 Cancer Sites

768 Related Genes

CGMIM 768 genes at least at 2 Sites GAP_Elliptical_Order

Jaccard: a/(a+b+c)

33

Page 34: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Matrix visualization of nominal data (GAP approach)

Example: Classification of Animals Data

Shizuhiko Nishisato 2006 34

Approaching Statistics & Statistical Approach

Page 35: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

35

票種 code 合計

早鳥/社會 0 351

早鳥/學生 1 121

邀請/⼀一般 2 48

邀請/媒體 3 25

邀請/貴賓 4 22

黑客松 5 90

邀請/講師 6 32 邀請/贊助 7 12

  總計 701

性別 code 合計

女 0 166

男 1 535

  總計 701 業別 code 合計

無法認定 0 56

資訊業 1 203

研究機構 2 41

科技業 3 33

顧問公司 4 13

金融業 5 18

通訊業 6 21

自由業 7 11

傳播業 8 17

學術 9 242

政府機關 10 11

醫藥業 11 8

電子業 12 4

公益團體 13 8

通路商 14 5

其他 15 10

  總計 701

餐飲 code 合計

素食 0 47

葷食 1 654

  總計 701

活動 code 合計

黑客松 0 91

資料分析 1 196

演講議程 2 414

  總計 701

通知 code 合計

否 0 61

是 1 640

  總計 701

電郵 code 合計

.edu 0 49

.org 1 15

gmail 2 500

hotmail 3 44

yahoo 4 10

MSN 5 3

.com 6 64

其他 7 16

  總計 701

報名資料 可能變數

報名序 1~701 票種

業別

通知

活動

電郵

… … … … …

0 1 1 1 1 0 9 1 1 2 0 1 1 0 6 0 1 2 1 2 0 9 2 1 2 1 9 2 1 2 1 9 2 1 2 1 9 1 1 2 1 9 1 1 2 0 3 2 1 3 1 9 2 1 3 0 3 2 1 6 0 11 2 1 2 1 9 1 1 2 1 9 2 1 2 0 1 2 0 2 1 9 2 1 2 1 9 2 1 2 0 3 2 1 6 0 2 2 1 1 0 1 2 1 2 … … … … …

701 人

7 變數 報 名

葷素

性別

… … …

139 1 1

140 1 1

141 0 0

142 1 1

143 1 1

144 1 0

145 1 1

146 1 1

147 1 0

148 1 1

149 1 1

150 1 1

151 1 1

152 1 1

155 1 1

157 1 1

158 1 0

159 0 1

160 1 0

162 1 1

163 1 0

… … …

Page 36: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

1. Data Matrix

2. Subject Proximity

3. Variable Proximity

Essential elements in a GAP MV procedure?

1. Data Matrix

2. Subject Proximity

3. Variable Proximity Continuous Nominal

Correlation Covariance polychoric Correlation . . .

Euclidean Distance Manhattan Distance Correlation … ?

type measurements

Matching proportion

χ 2

?

?

36

Approaching Statistics & Statistical Approach

Page 37: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Shizuhiko  Nishisato,  2006    Classifica5on  of  Animals    

35  animals  were  sorted  into  piles  of  similar  animals  by  15  students  

Animal/Subject S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 Alligator 8 1 6 9 6 1 3 4 1 4 3 4 3 6 4 Bear 6 3 2 6 6 1 3 4 4 5 5 1 4 2 2 Camel 4 3 9 3 1 4 3 5 4 2 5 1 7 7 8 Cat 6 3 7 4 0 1 1 2 3 3 1 1 6 3 5 Cheetah 3 3 7 4 0 1 3 5 4 3 6 1 6 2 2 Chiken 7 2 4 1 2 5 7 1 5 1 1 3 8 4 1 Chimpanzee 5 3 5 7 5 2 4 4 2 2 6 4 3 6 6 Cow 1 3 9 6 1 1 3 5 3 4 1 1 4 5 8 Crane 7 2 4 5 2 5 5 1 5 1 2 3 8 4 1 Crow 7 2 4 5 2 5 5 1 5 1 2 3 8 4 1 Dog 6 3 7 10 0 2 1 2 3 3 1 1 4 3 5 Duck 7 2 4 1 2 5 5 1 5 1 2 3 8 4 1 Elephant 4 3 6 3 1 4 3 5 4 5 3 1 7 7 2 Fox 6 3 7 4 0 1 6 2 3 3 3 1 4 3 5 Frog 8 1 3 2 3 3 2 3 1 4 4 2 1 1 3 Giraffe 1 3 8 3 1 4 3 5 4 2 5 1 7 7 8 Goat 3 3 9 6 1 4 6 5 3 3 1 1 5 3 5 Hawk 7 2 4 5 2 5 5 1 5 1 3 3 8 4 1 Hippopotamus 4 3 6 6 6 4 3 3 4 4 5 1 7 7 2 Horse 6 3 9 6 1 2 3 5 3 3 1 1 5 5 8 Leopard 1 3 7 4 0 1 3 5 4 3 3 1 6 2 2 Lion 5 3 7 4 6 1 3 5 4 3 3 1 7 2 2 Lizard 2 1 3 2 3 3 2 3 1 4 4 2 2 1 3 Monkey 6 3 5 7 5 2 4 4 2 2 6 4 3 6 6 Ostrich 3 2 4 1 2 5 3 1 5 1 5 3 8 7 8 Pig 1 3 9 6 1 1 6 5 3 3 1 1 5 5 5 Pigeon 7 2 4 5 2 5 5 1 5 1 2 1 8 4 1 Rabbit 6 3 1 6 0 4 6 2 3 3 1 1 5 3 5 Racoon 6 3 7 10 4 1 6 2 3 3 3 1 4 3 5 Rhinoceros 4 3 5 6 6 4 3 5 4 4 5 1 7 7 2 Snake 8 1 3 9 6 3 2 3 1 4 4 2 2 1 3 Sparrow 7 2 4 5 2 5 5 1 5 2 2 3 8 4 1 Tiger 5 3 7 4 0 1 3 5 4 3 3 1 6 2 2 Tortoise 8 1 3 9 3 3 2 3 1 5 4 2 1 1 3 Turkey 7 2 4 1 2 5 7 1 5 1 1 3 8 4 1

What about 3500 samples 1500 variables

?

A typical nominal data

37

Page 38: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Alligator    Bear    Camel    Cat    Cheetah    Chicken    Cow    Crane    Chimpanzee    Crow  

Dog    Duck    Elephant    Fox    Frog    Giraffe    Goat    Hawk    Hippopotamus    Horse  

Leopard    Lion    Lizard    Ostrich    Pig    Pigeon    Rabbit    Racoon    Rhinoceros    Snake  

Sparrow    Tiger    Tortoise    Turkey    Monkey  

Page 39: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

39

S12

Mammalia Reptilia Aves Primates ? Mammalia Reptilia Aves

Uni-variate Display Bar-Chart Pie-Chart

S2

20

15

10

5

0 3 4 2 1 3 2 1

Page 40: Collaboration with Statistician? 矩陣視覺化於探索式資料分析
Page 41: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

41 3. Mammal 1. Reptile 2. Bird

4. P

rimat

e?

1. M

amm

al

2. R

eptil

e 3.

Bird

Bi-variate Display Mosaic Display

Page 42: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Conventional multivariate visualization for this data

2D Mosaic Display

5D Mosaic Display

Parallel Coordinate Plot

Scatter-plot Matrix

Scatter-plot

42

Page 43: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

GAP Categorical MV Solution with 3 matrix maps

(original orders)

43

Alliga Bear Camel Cat Cheeta Chicke Chimpa Cow Crane Crow Dog Duck Elepha Fox Frog Giraff Goat Hawk Hippop Horse Leopar Lion Lizard Monke Ostric Pig Pigeon Rabbit Racoo Rhinoc Snake Sparro Tiger Tortoi Turkey

Page 44: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

44

GAP Categorical MV Solution with 3 matrix maps (R2E orders)

Page 45: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

45

11 6 7

15 14 12 13 9 4 3 5 8

10 2 1

Os

tr

ic

Tu

rk

ey

C

hic

ke

P

ige

on

H

aw

k

Sp

ar

ro

D

uc

k

Cr

ow

C

ra

ne

L

iza

rd

F

ro

g

To

rt

oi

Sn

ak

e

Al

lig

a

Hip

po

p

Be

ar

R

hin

oc

E

le

ph

a

Lio

n

Tig

er

L

eo

pa

r

Co

w

Fo

x

Ra

co

o

Ch

ee

ta

C

at

D

og

P

ig

Ra

bb

it

Ho

rs

e

Go

at

C

am

el

G

ira

ff

M

on

ke

C

him

pa

11 6 7

15 14 12 13 9 4 3 5 8

10 2 1

Mammalia Reptilia Aves Primates

Page 46: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

D0_Lion    D1_Elephant    D2_Camel    D3_Hawk    D4_Fox  

A0_Dog    A1_Alligator    A2_Chimpanzee    A3_Cow    A4_Crow    A5_Pigeon    A6_Cheetah    A7_Chiken    A8_Bear    A9_Cat  

B0_Rabbit    B1_Frog    B2_Goat    B3_Tiger    B4_Rhinoceros    B5_Giraffe    B6_Duck    B7_Sparrow    B8_Hippopotamus    B9_Monkey  

C0_Turkey    C1_Pig    C2_Crane    C3_Leopard    C4_Ostrich    C5_Lizard    C6_Horse    C7_Racoon    C8_Tortoise    C9_Snake  

Page 47: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Alligator    Bear    Camel    Cat    Cheetah    Chicken    Cow    Crane    Chimpanzee    Crow  

Dog    Duck    Elephant    Fox    Frog    Giraffe    Goat    Hawk    Hippopotamus    Horse  

Leopard    Lion    Lizard    Ostrich    Pig    Pigeon    Rabbit    Racoon    Rhinoceros    Snake  

Sparrow    Tiger    Tortoise    Turkey    Monkey  

?

Page 48: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

48

Univariate frequency breakdown Bivariate table

與會人士類別性資料

Page 49: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

49

票種 code 合計

早鳥票/社會人士 0 351

早鳥票/學生 1 121

邀請票/⼀一般 2 48

邀請票/媒體 3 25

邀請票/貴賓 4 22

g0v 黑客松票 5 90

邀請票/講師 6 32 邀請票/贊助 7 12

  總計 701

性別 code 合計

女 0 166

男 1 535

  總計 701

業別 code 合計

Missing、無法認定 0 56

資訊業 1 203

研究機構 2 41

科技業 3 33

顧問公司 4 13

金融業 5 18

通訊業 6 21

自由業 7 11

傳播業 8 17

學術 9 242

政府機關 10 11

醫藥業 11 8

電子業 12 4

公益團體 13 8

通路商 14 5

製造業、食品業、運輸業、服務業 15 10

  總計 701

餐飲需求 code 合計

素食 0 47

葷食 1 654

  總計 701

參加活動 code 合計

g0v 黑客松 0 91

資料分析上手課程 1 196

演講議程 2 414

  總計 701

日後通知 code 合計

否 0 61

是 1 640

  總計 701

電郵 code 合計

.edu 0 49

.org 1 15

gmail 2 500

hotmail 3 44

yahoo 4 10

MSN 5 3

.com 6 64

未定、其他 7 16

  總計 701

報名資料可能變數

報名序 1~701

Page 50: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

50

701

5 變數

Optimization 票種

業別

通知

活動

電郵

… … … … …

0 1 1 1 1 0 9 1 1 2 0 1 1 0 6 0 1 2 1 2 0 9 2 1 2 1 9 2 1 2 1 9 2 1 2 1 9 1 1 2 1 9 1 1 2 0 3 2 1 3 1 9 2 1 3 0 3 2 1 6 0 11 2 1 2 1 9 1 1 2 1 9 2 1 2 0 1 2 0 2 1 9 2 1 2 1 9 2 1 2 0 3 2 1 6 0 2 2 1 1 0 1 2 1 2 … … … … …

葷素

性別

報名

… … …

1 1 139

1 1 140

0 0 141

1 1 142

1 1 143

1 0 144

1 1 145

1 1 146

1 0 147

1 1 148

1 1 149

1 1 150

1 1 151

1 1 152

1 1 155

1 1 157

1 0 158

0 1 159

1 0 160

1 1 162

1 0 163

… … …

3 共變數

Page 51: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

51

其他

.com

MSN

Yahoo

Hotmail

Gmail

.org

.edu

演講議程

資料分析

黑客松

其他

通路商

公益團體

電子業

醫藥業

政府機關

學術

傳播業

自由業

通訊業

金融業

顧問公司

科技業

研究機構

資訊業

無法認定

邀請贊助

邀請講師

黑客松

邀請貴賓

邀請媒體

邀請一般

早鳥學生

早鳥社會

票種 電郵

通知 活動

業別 李育杰

701 人

5 變數

Page 52: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

52

報 名 序 女素

票 業 活 通 電 種 別 動 知 郵

票種 業別 活動 通知 電郵

5 變數

701

Original Orders (報名序)

Page 53: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

53

邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會

其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定

其他 .com MSN yahoo hotmail gmail .org .edu

通知 是否

活 動

演講議程 資料分析 g0v黑客松

報 名 序

票 種

業 別

電 郵

女素

票 業 活 通 電 種 別 動 知 郵

票種 業別 活動 通知 電郵

5 變數 Original Orders (報名序)

701

Page 54: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

票 通 活 電 業 種 知 動 郵 別

54

其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定

其他 .com MSN yahoo hotmail gmail .org .edu

通知 是否

活 動

演講議程 資料分析 g0v黑客松

報 名 序

票 種

業 別

電 郵

Random Orders

票種 通知 活動 電郵 業別 女素

5 變數

701

邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會

Page 55: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

55

票種 業別 活動 通知 電郵 報名 女男 素葷

通知 活動

票種

業別

電郵 其他 .com MSN yahoo hotmail gmail .org .edu

是否

演講議程 資料分析 黑客松

邀請/贊助 邀請/講師

黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會

其他 通路商

公益團體 電子業 醫藥業

政府機關 學術

傳播業 自由業 通訊業 金融業

顧問公司 科技業

研究機構 資訊業

無法認定

701 人

沈澱圖 (pie-chart, bar-chat) (可觀察單一變數各類別之比例,但失去變數間連結)

166

100

Page 56: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

56

其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定

報 名 序

票 種

業 別

Elliptical Seriations (R2E)

電郵 通知 活動 票種 業別

電 通 活 票 業 郵 知 動 種 別 其他

.com MSN yahoo hotmail gmail .org .edu

通知 是否

活 動

演講議程 資料分析 g0v黑客松

電 郵

5 變數

701

女素

邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會

Page 57: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

業 電 通 活 票 別 郵 知 動 種

57

其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定

報 名 序

業別 電郵 通知 活動 票種

票 種

業 別

Hierarchical Clustering Tree (HCT)

其他 .com MSN yahoo hotmail gmail .org .edu

通知 是否

活 動

演講議程 資料分析 g0v黑客松

電 郵

5 變數

701

女素

邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會

Page 58: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

電 通 活 票 業 郵 知 動 種 別

58

其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定

報 名 序

票 種

業 別

Hierarchical Clustering Tree (HCT)

其他 .com MSN yahoo hotmail gmail .org .edu

通知 是否

活 動

演講議程 資料分析 g0v黑客松

電 郵 電郵

通知 活動 票種 業別

5 變數

701

女素

邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會

Page 59: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

59

電郵 通知 活動 票種 業別

票 種

其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定

業 別

電 郵

其他 .com MSN yahoo hotmail gmail .org .edu

通 知

是否

活 動

演講議程 資料分析 g0v黑客松

報名 Orders: HCT-R2E Hierarchical Clustering Tree with Flips Guided Elliptical Seriation

邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會

女素

Page 60: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Data: 160 international organization membership pattern (variables) for 230 countries/regions (subjects) 0. non-member □ 1. member ■ 2. observer 3. associate member 4. guest 5. dialogue partner

http://www.faqs.org/docs/factbook/index.html

CIA Political Map of the World

230 countries (regions)

CIA

60

160 international organization

Matrix Visualization with cartography links Approaching Statistics & Statistical Approach

Page 61: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

. . . . . . 160 maps (?)

Draw one membership map for each organization (variable)?

1 2 3

4 5 6

7 8 9

158 159 160 61

Page 62: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Data: Ranks of 5 Candidates (扁宋連許李) on 360 Townships 2000 總統大選資料

Is it possible to visualize information structure for all 5 candidates in a single MAP?

A B C D E

D E

A B C

? Rank 1 2 3 4 4.5 5

���������������������

��"

#"

��"

��"

��"

��"

��"

��"

��"

��"

��"

��"

�"

�"�!" �" �"��"

ABCDE

Cartography Coloring Scheme with Categorical GAP (CartoGAP) - 2

扁 宋 連

許 扁宋連 許李

Page 63: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

A B C

E

A B C D E

D

Cartography Coloring Scheme with GAP (CartoGAP)-2 (B). CateGAP Color Map for Each Individual Variable

(C). Final Single CateGAP Cartography Color MAP for Complete Information Visualization

扁 宋 連

扁宋連 許李

Page 64: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

From physical maps to conceptual maps

64

Chromosome Map

Semiconductor Wafer Quality

Control

Macro Biodiversity

Micro Biodiversity

Page 65: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

65

Matrix Visualization for Symbolic Data (Analysis)

for Big Data?

Page 66: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Fig.  1.  Diagram  for  related  conven5onal  data  matrix  and  symbolic  (interval  type)  data  table  with  their  corresponding  proximity  matrices  for  samples/concepts  and  variables.

1.1  Symbolic  Data  Analysis  (SDA)    and    1.2  Matrix  Visualiza<on  (MV)

Page 67: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

67

58 variables

1899 Level 4 Cities

市區町村

58 variables (interval)

151 Level 2

Areas地域

continuous Data ↓

Rank Data

(1~1899)

merged (interval of ranks)

data

10 Level 1 Regions

covariate

Example: Japan Minryoku 2010 Data (with Junji Nakano, ISM)

Level 1 Level 2 Level 3 Level 4 Region (10) Area (151) District (821) City (1899)

Page 68: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Japan Minryoku 2010 Data

Page 69: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

12 displaying modes for MV of interval data

58 interval variables

151

regi

ons

(con

cept

s)

Min Mid Max Length

Sufficient Sediment Row Condition Col Condition

Length < 949 len<949, 949<mid 1746 < length 900<mid<1000

Page 70: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Statisticians, Data Analysts, & Bioinformaticists A statistician is someone who wants to get exactly the right answer, even if it’s the answer to the wrong question. A data analyst is someone who is willing to settle for an approximate answer, as long as it’s the answer to the right question. A bioinformaticist is someone who is willing to settle for answers of unknown accuracy, to questions that have not been clearly articulated, as long as the results can be graphed in color.

David B. Allison, Ph.D. Department of Biostatistics University of Alabama at Birmingham

Page 71: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

12. MV for Color Blind people Types of color blind Monochromacy Dichromacy Protanopia and deuteranopia Hereditary tritanopia Anomalous Trichromacy

http://www.vischeck.com/examples/

To act passively to prevent from using color systems that are difficult for color blind people to understand. or To work actively in assisting people with visual impairments to have better visualization of data/information.

“I believe there are more mathematics/statistics blind people than color blind people” 71

Approaching Statistics & Statistical Approach

Page 72: Collaboration with Statistician? 矩陣視覺化於探索式資料分析

Thank You!

72