Transcript
Page 1: (Open) data analysis for decision support: challenges and essentials - With examples from Open Coesione

(Open) data analysis for decision support: challenges and essentials

Antonio Vetrò Technische Universität München, Germany

01 September 2014, Matera (Italy), RENA Summer School

@phisaz

With examples from Open Coesione

With material from a joint work with:
Lorenzo Canova, Marco Torchiano (PoliTO - Politecnico di Torino)
Federico Morando, Raimondo Iemma (NEXA Center for Internet and Society - PoliTO)
Aline Pennisi (Ministero dell’Economia e delle Finanze)

Feedback from:
Andrea Milan (United Nations University)
Daniel Méndez Fernández (Technische Universität München)

Page 4:

RENA Summer School 2014

Deciding and implementing together

Monitoring together

Planning together

Page 5:

Outline

• Data analysis: a philosophical perspective, empiricism

• Data analysis challenges: examples with Open Data

Page 9:

Data

Knowledge Representation: World, Model, and Formal Theory

World (observation), Model (simulation), Theory (deduction)

approximation: {good, sufficient, insufficient}

interpretation: {true, false}

Source: Klaus Mainzer, Modern Aspects of Philosophy of Science in Informatics and its Applications, Lehrstuhl für Philosophie und Wissenschaftstheorie, Carl von Linde-Akademie, Munich Center for Technology in Society, Technische Universität München

Figure: techrepublic.com

Page 10:

Data analysis: a philosophical perspective, empiricism

Diagram elements: Observations / Evaluations; Questions / Hypotheses; Theory / System of theories; Pattern building; Falsification / support; Theory building; Study population; Deductive logic; Inductive logic

See also: Runeson et al., Case Study Research in Software Engineering: Guidelines and Examples

Page 11:

Data analysis: a philosophical perspective, empiricism

Each empirical method…
• has a specific purpose
• relies on a specific data type
• has a specific setting

Purpose
• Exploratory
• Descriptive
• Explanatory / confirmatory
• Improving

Data type
• Qualitative
• Quantitative

Page 16:

Data analysis: a philosophical perspective

Diagram elements: Observations / Evaluations; (Tentative) Hypotheses; Theory / System of theories; Pattern building; Falsification / support; Theory building; Study population; Deductive logic; Inductive logic

Methods placed on the diagram:
– Formal / conceptual analysis
– Grounded theory
– Exploratory: case/field studies, experiments, data analysis
– Survey / interview research
– Confirmatory: case studies, experiments, data analysis, …
– Ethnographic studies

Deciding and implementing together

Monitoring together

Planning together

See also: Vessey et al., A unified classification system for research in the computing disciplines

Page 17:

Outline

• Data analysis: a philosophical perspective, empiricism

• Data analysis challenges: examples with Open Data

Page 19:

Opportunities

Lab

Mike Lemansky, Open Data

Page 22:

& challenges

Page 26:

Open Coesione

Dataset “progetti” - FSE 2007/2013
Snapshot 31/12/2013
Focus on funding and dates, 22/74 columns

> colnames(subsetProgetti)
 [1] "FINANZ_UE"                        "FINANZ_STATO_FONDO_DI_ROTAZIONE"
 [3] "FINANZ_STATO_FSC"                 "FINANZ_STATO_PAC"
 [5] "FINANZ_STATO_ALTRI_PROVVEDIMENTI" "FINANZ_REGIONE"
 [7] "FINANZ_PROVINCIA"                 "FINANZ_COMUNE"
 [9] "FINANZ_ALTRO_PUBBLICO"            "FINANZ_STATO_ESTERO"
[11] "FINANZ_PRIVATO"                   "FINANZ_DA_REPERIRE"
[13] "FINANZ_TOTALE_PUBBLICO"           "DPS_DATA_INIZIO_PREVISTA"
[15] "DPS_DATA_FINE_PREVISTA"           "DPS_DATA_INIZIO_EFFETTIVA"
[17] "DPS_DATA_FINE_EFFETTIVA"          "DPS_FLAG_CUP"
[19] "DPS_FLAG_PRESENZA_DATE"           "DPS_FLAG_COERENZA_DATE_PREV"
[21] "DPS_FLAG_COERENZA_DATE_EFF"       "DATA_AGGIORNAMENTO"

Page 27:

Gallery of challenges: Guided Tour

Milepost5, 850 NE 81st Ave, Portland, OR 97213, http://milepost5.net/galleries/

Page 28:

Challenge #1: Errors in data

Page 33:

43!

Errors can be introduced by:

- the source (observation, sensor)

- manual insertion

- ETL*

Be careful before claiming errors: they might be “just” accuracy problems.

* extraction, transformation, and loading

Page 34:

Challenge #2: accuracy

Page 41:

Unione Europea       22.50
Fondo di Rotazione   21.76
Regione               0.74
--------------------------
                     45.00

43!

» Always refer to raw data

» If that is not possible, estimate the accuracy in your analysis (e.g., about 5% in the example above)
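The accuracy estimate in the example above can be reproduced in a few lines. This is only an illustrative sketch: it assumes that the flagged value 43 is the total reported for the same project and that the three amounts are its components; the variable names are hypothetical.

```python
from decimal import Decimal

# Funding components as listed on the slide.
components = {
    "Unione Europea": Decimal("22.50"),
    "Fondo di Rotazione": Decimal("21.76"),
    "Regione": Decimal("0.74"),
}

# Assumption: 43 is the total reported elsewhere for the same project.
reported_total = Decimal("43")

# Decimal avoids binary floating-point noise when summing amounts.
computed_total = sum(components.values(), Decimal("0"))

# Relative gap between the reported figure and the sum of the parts.
relative_gap = abs(computed_total - reported_total) / computed_total

print(f"sum of components: {computed_total}")
print(f"relative gap: {float(relative_gap):.1%}")
```

Under these assumptions the gap is between 4% and 5% of the total, in line with the "about 5%" quoted on the slide.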

Page 42:

Challenge #3: missing data

Page 53:

Completeness (chart, axis label: Value)

pcc = percentage of complete cells
pcrp = percentage of complete rows

Legend:
- sub: analysis on 22 attributes
- int: analysis on the whole dataset
- general: both sub and int
- NA: NAs belong to the domain

Chart annotations:
- No dates
- NA in “finanziamenti”
- NA in “finanziamenti”, “pagamenti”, missing values in Ateco and other columns
- In 89% of projects dates are present
- Codes and descriptions
- Ateco + other descriptions
- No rows are complete
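The two completeness metrics named on this slide can be computed as in the sketch below. It is a minimal illustration, not the authors' implementation: the three-row table is invented, `None` stands for a missing cell, and the function names simply mirror the slide's abbreviations.

```python
def pcc(rows):
    """Percentage of complete cells: non-missing cells over all cells."""
    cells = [value for row in rows for value in row.values()]
    complete = sum(value is not None for value in cells)
    return 100.0 * complete / len(cells)


def pcrp(rows):
    """Percentage of complete rows: rows without any missing cell."""
    complete = sum(all(v is not None for v in row.values()) for row in rows)
    return 100.0 * complete / len(rows)


# Hypothetical records: column names come from the dataset, values are invented.
projects = [
    {"FINANZ_UE": 1000.0, "DPS_DATA_INIZIO_EFFETTIVA": "2010-01-15"},
    {"FINANZ_UE": None, "DPS_DATA_INIZIO_EFFETTIVA": "2011-03-01"},
    {"FINANZ_UE": 250.0, "DPS_DATA_INIZIO_EFFETTIVA": None},
]

print(f"pcc  = {pcc(projects):.1f}%")
print(f"pcrp = {pcrp(projects):.1f}%")
```

Whether a missing value should count as incomplete depends on the domain: in the "NA" variant of the legend, NAs are treated as legitimate values, so they would not lower the score.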

Page 54:

What to do with missing data

1. Understand the domain, e.g., NA or 0?
2. Find the motivation (e.g., a missing start date is OK if the project hasn’t started yet)
3. Understand how much they impact your analysis
4. You might also:
– exclude rows with missing values
– use imputation techniques:
  – mean substitution
  – regression substitution
  – group mean substitution
  – hot deck imputation
  – multiple imputation

Source: A. Mockus, Missing data in software engineering, Guide to Advanced Empirical Software Engineering, 2008
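Three of the options listed above (excluding rows, mean substitution, group mean substitution) can be sketched as follows. The records and the grouping by a `region` field are invented for illustration, and the function names are hypothetical; regression, hot deck and multiple imputation are not covered here.

```python
from statistics import mean

# Hypothetical funding amounts per project; None marks a missing value.
records = [
    {"region": "A", "funding": 100.0},
    {"region": "A", "funding": None},
    {"region": "A", "funding": 300.0},
    {"region": "B", "funding": 10.0},
    {"region": "B", "funding": None},
]


def exclude_missing(rows):
    """Listwise deletion: keep only rows that have a funding value."""
    return [r for r in rows if r["funding"] is not None]


def mean_substitution(rows):
    """Replace each missing value with the mean of all observed values."""
    overall = mean(r["funding"] for r in exclude_missing(rows))
    return [
        dict(r, funding=overall if r["funding"] is None else r["funding"])
        for r in rows
    ]


def group_mean_substitution(rows, key="region"):
    """Replace each missing value with the mean observed in its own group."""
    groups = {}
    for r in exclude_missing(rows):
        groups.setdefault(r[key], []).append(r["funding"])
    means = {group: mean(values) for group, values in groups.items()}
    return [
        dict(r, funding=means[r[key]] if r["funding"] is None else r["funding"])
        for r in rows
    ]


for row in group_mean_substitution(records):
    print(row)
```

In this sketch a group with no observed value at all would raise a `KeyError`, which is a reminder to check first how the missing values are distributed.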

Page 56:

Challenge #4: outliers

» Outliers can point to interesting facts

Page 57:

Challenge #4: outliers

» … or to something which deserves a second look

Page 62:

(chart, axis label: Value)

pcvc = percentage of cells with correct value

Chart annotations:
- ca. 50000 fundings < 1 €
- ca. 210000 < 5 €
- ca. 360000 < 55 €
- ca. 430000 < 89 €
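Counts like the ones annotated on this chart can be obtained by counting the values below a set of thresholds. The sketch below is only an illustration with invented amounts; the function name `count_below` is hypothetical.

```python
from bisect import bisect_left


def count_below(values, thresholds):
    """For each threshold, count how many values are strictly below it."""
    ordered = sorted(values)
    # bisect_left returns the index of the first element >= threshold,
    # which equals the number of elements strictly below it.
    return {t: bisect_left(ordered, t) for t in thresholds}


# Hypothetical funding amounts in euro.
amounts = [0.2, 0.9, 3.0, 4.5, 40.0, 60.0, 88.0, 1200.0]
print(count_below(amounts, thresholds=(1, 5, 55, 89)))
```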

Page 63:

What to do with outliers

1. Retention
– Check the distribution of the data: if it is heavy-tailed, keep them but don’t apply techniques which require normality
2. Exclusion
– Remove them if you think they are measurement errors or exceptional cases
3. Sensitivity analysis
– compare results with and without outliers
– reason on the motivations
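The sensitivity analysis in point 3 can be sketched as below. The 1.5 × IQR rule used to flag outliers is a common convention chosen here for illustration, not something the slides prescribe; the data and the function names are invented.

```python
from statistics import mean, median, quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the k * IQR rule (k=1.5 is a convention)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def split_outliers(values, k=1.5):
    """Return (kept, flagged) according to the k * IQR fences."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged


# Hypothetical funding amounts with one extreme value.
funding = [120.0, 135.0, 150.0, 160.0, 170.0, 180.0, 200.0, 50000.0]
kept, flagged = split_outliers(funding)

# Compare the same statistics with and without the flagged values.
for label, data in (("with outliers", funding), ("without outliers", kept)):
    print(f"{label}: mean={mean(data):.1f} median={median(data):.1f}")
print("flagged:", flagged)
```

The mean moves a lot between the two runs while the median barely changes, which is the kind of difference a sensitivity analysis is meant to surface before deciding on retention or exclusion.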

Page 65:

Challenge #5: Drawing proper conclusions

» Knowledge is more than statistical significance

» Context and domain knowledge are fundamental

» Consider both qualitative and quantitative aspects

» Triangulate data with other sources

Page 70:

Summing up and additional suggestions

Challenges: watch out for

- Errors

- Data accuracy

- Missing data

- Outliers

- Drawing proper conclusions

Verify data collection

- sampling

- reference time and context

- appropriateness to goals

- transformations

Check first “what the data looks like”: most programs (Excel, SPSS, STATA, R, …) offer predefined functions

Keep track of:

- modifications and reasons

- different versions

- raw data
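A first look at the data, as suggested above, can be as simple as the summary below. This is a minimal sketch using only Python's standard library; the function name `describe` and the sample column are hypothetical, and the tools listed on the slide provide richer built-in equivalents.

```python
from statistics import mean, median, stdev


def describe(values):
    """Basic summary of a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median(present),
        "mean": mean(present),
        "max": max(present),
        "stdev": stdev(present),
    }


# Hypothetical column of funding amounts.
column = [0.5, 120.0, None, 135.0, 150.0, 50000.0]
for name, value in describe(column).items():
    print(f"{name:>8}: {value}")
```

A large distance between mean and median, or a minimum far below the typical value, is a cue to look for the errors, missing data and outliers discussed earlier.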

Page 71:

Interesting readings

Page 73:

Gallery of challenges: End of Guided Tour

