(open) data analysis for decision support: challenges and essentials - with examples from open...

Post on 02-Jul-2015

DESCRIPTION

This slide set was presented at the 2014 RENA Summer School on Good Government and Open Citizenship. It uses examples from an open dataset on EU funding in Italy to show the essentials and challenges of using open data to support decisions.

TRANSCRIPT

(Open) data analysis for decision support: challenges and essentials

Antonio Vetrò Technische Universität München, Germany

01 September 2014, Matera (Italy), RENA Summer School

@phisaz

vetro@in.tum.de

With examples from Open Coesione

With material from joint work with:
Lorenzo Canova, Marco Torchiano (PoliTO - Politecnico di Torino)
Federico Morando, Raimondo Iemma (NEXA Center for Internet and Society - PoliTO)
Aline Pennisi (Ministero dell’Economia e delle Finanze)

Feedback from:
Andrea Milan (United Nations University)
Daniel Méndez Fernández (Technische Universität München)

RENA Summer School 2014

Deciding and implementing together

Monitoring together

Planning together

Outline

• Data analysis: a philosophical perspective, empiricism

• Data analysis challenges: examples with Open Data

Data

Source: Klaus Mainzer, Modern Aspects of Philosophy of Science in Informatics and its Applications, Lehrstuhl für Philosophie und Wissenschaftstheorie, Carl von Linde-Akademie, Munich Center for Technology in Society, Technische Universität München

Knowledge Representation: World, Model, and Formal Theory

World / Model / Theory

observation / simulation / deduction

approximation: {good, sufficient, insufficient}

interpretation: {true, false}

Figure: techrepublic.com

Data analysis: a philosophical perspective, empiricism

Observations / Evaluations

Questions / Hypotheses

Theory/System of theories

Pattern building

Falsification / support

Theory building

Study population

Deductive logic

Inductive logic

See also: Runeson et al., Case Study Research in Software Engineering: Guidelines and Examples

Each empirical method…
• has a specific purpose
• relies on a specific data type
• has a specific setting

Purpose
• Exploratory
• Descriptive
• Explanatory / confirmatory
• Improving

Data type
• Qualitative
• Quantitative

Data analysis: a philosophical perspective

Observations / Evaluations

(Tentative) Hypotheses

Theory / System of theories

Pattern building

Falsification / support

Theory building

Study population

Deductive logic

Inductive logic

See also: Vessey et al., A unified classification system for research in the computing disciplines

Formal / conceptual analysis

Grounded theory

Exploratory – case/field studies – experiments – data analysis

Survey / interview research

Confirmatory – case studies – experiments – data analysis – …

Ethnographic studies

Opportunities

Mike Lemansky, Open Data

& challenges

Open Coesione

Dataset “progetti” (projects) - FSE 2007/2013
Snapshot 31/12/2013
Focus on funding and dates, 22/74 columns

> colnames(subsetProgetti)
 [1] "FINANZ_UE"                        "FINANZ_STATO_FONDO_DI_ROTAZIONE"
 [3] "FINANZ_STATO_FSC"                 "FINANZ_STATO_PAC"
 [5] "FINANZ_STATO_ALTRI_PROVVEDIMENTI" "FINANZ_REGIONE"
 [7] "FINANZ_PROVINCIA"                 "FINANZ_COMUNE"
 [9] "FINANZ_ALTRO_PUBBLICO"            "FINANZ_STATO_ESTERO"
[11] "FINANZ_PRIVATO"                   "FINANZ_DA_REPERIRE"
[13] "FINANZ_TOTALE_PUBBLICO"           "DPS_DATA_INIZIO_PREVISTA"
[15] "DPS_DATA_FINE_PREVISTA"           "DPS_DATA_INIZIO_EFFETTIVA"
[17] "DPS_DATA_FINE_EFFETTIVA"          "DPS_FLAG_CUP"
[19] "DPS_FLAG_PRESENZA_DATE"           "DPS_FLAG_COERENZA_DATE_PREV"
[21] "DPS_FLAG_COERENZA_DATE_EFF"       "DATA_AGGIORNAMENTO"
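A comparable column selection can be sketched in Python with the standard library only. This is a minimal illustration, not the authors' procedure: the `;` delimiter, the name prefixes, the `ID_PROGETTO` column and the sample rows are all assumptions made for the example, and the flag columns are left out for brevity.

```python
import csv
import io

# Hypothetical prefixes, chosen to match the funding and date columns listed above.
KEEP_PREFIXES = ("FINANZ_", "DPS_DATA_")


def select_columns(csv_text, prefixes=KEEP_PREFIXES, delimiter=";"):
    """Return the rows reduced to the columns whose name starts with one of `prefixes`."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    kept = [name for name in reader.fieldnames if name.startswith(prefixes)]
    return [{name: row[name] for name in kept} for row in reader]


# Tiny made-up sample, not real OpenCoesione data.
sample = (
    "ID_PROGETTO;FINANZ_UE;FINANZ_REGIONE;DPS_DATA_INIZIO_PREVISTA\n"
    "P1;22.50;0.74;20100101\n"
    "P2;;1.00;\n"
)
rows = select_columns(sample)
print(rows[0])
```

Empty strings are kept as they are, so missing values stay visible for the later completeness checks.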

Milepost5 850 NE 81st Ave Portland, OR 97213 http://milepost5.net/galleries/

Gallery of challenges: Guided Tour

Challenge #1: Errors in data

[Figure: screenshot with callout “43!”]

Errors can be introduced by:

- the source (observation, sensor)
- manual insertion
- ETL (extraction, transformation, and loading)

Be careful before claiming errors: they might be “just” accuracy problems.
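One way to look for candidate errors is a simple consistency rule, for instance that an end date should not precede the start date. The sketch below is in the spirit of the DPS_FLAG_COERENZA_DATE_* columns, but their exact definition is not given in these slides, so the rule, the field names (`id`, `start`, `end`) and the sample records are assumptions.

```python
from datetime import date


def incoherent_dates(projects):
    """Return the ids of projects whose end date precedes their start date."""
    flagged = []
    for p in projects:
        start, end = p["start"], p["end"]
        # A missing date is a completeness issue, not an inconsistency, so skip it here.
        if start is not None and end is not None and end < start:
            flagged.append(p["id"])
    return flagged


# Made-up records for illustration only.
projects = [
    {"id": "P1", "start": date(2010, 1, 1), "end": date(2011, 6, 30)},
    {"id": "P2", "start": date(2012, 5, 1), "end": date(2011, 5, 1)},  # end before start
    {"id": "P3", "start": None, "end": date(2013, 1, 1)},  # missing start date
]
suspects = incoherent_dates(projects)
print(suspects)
```

A flagged record is only a candidate: it still needs a look at the source before being called an error.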

Challenge #2: accuracy

[Figure: screenshot with callout “43!”]

European Union (Unione Europea)       22.50
Revolving Fund (Fondo di Rotazione)   21.76
Region (Regione)                       0.74
                                      -----
                                      45.00

» Always refer to the raw data
» If that is not possible, estimate the accuracy of your analysis (e.g., about 5% in the example above)
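The "about 5%" can be reproduced with a short calculation. It assumes that the highlighted 43 is a displayed total and that 45.00 is the sum of the three funding sources; the slides do not say so explicitly.

```python
from decimal import Decimal

# Funding sources as shown in the breakdown above.
components = {
    "Unione Europea": Decimal("22.50"),
    "Fondo di Rotazione": Decimal("21.76"),
    "Regione": Decimal("0.74"),
}
displayed_total = Decimal("43")  # assumption: the highlighted value is a total

component_sum = sum(components.values())
gap = component_sum - displayed_total
relative_gap = gap / component_sum  # discrepancy relative to the summed components

print(component_sum, gap, f"{float(relative_gap):.1%}")
```

Dividing by the displayed total instead of the component sum changes the figure only slightly; either way the discrepancy is between 4% and 5%.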

Challenge #3: missing data

[Figure: chart “Completeness”; axis label: Value]

Legend:
- sub: analysis on 22 attributes
- int: analysis on the whole dataset
- general: both sub and int
- NA: NAs belong to the domain
- pcc: percentage of complete cells
- pcrp: percentage of complete rows

Annotations on the chart:
- NA in “finanziamenti” (funding)
- NA in “finanziamenti”, “pagamenti” (payments), missing values in Ateco and other columns
- No dates
- Codes and descriptions
- Ateco + other descriptions
- No rows are complete
- In 89% of projects dates are present
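The two completeness measures can be computed as in the sketch below. The definitions follow the legend (percentage of complete cells, percentage of complete rows). Treating `None` and the empty string as missing is an assumption, and the sample rows are made up.

```python
def completeness(rows, missing=(None, "")):
    """Return (pcc, pcrp): percentage of complete cells and of complete rows.

    Assumes `rows` is a non-empty list of dicts with the same keys.
    """
    total_cells = sum(len(row) for row in rows)
    complete_cells = sum(1 for row in rows for v in row.values() if v not in missing)
    complete_rows = sum(1 for row in rows if all(v not in missing for v in row.values()))
    pcc = 100.0 * complete_cells / total_cells
    pcrp = 100.0 * complete_rows / len(rows)
    return pcc, pcrp


# Made-up rows for illustration only.
rows = [
    {"FINANZ_UE": 22.5, "FINANZ_REGIONE": 0.74, "DPS_DATA_INIZIO_EFFETTIVA": "2010-01-01"},
    {"FINANZ_UE": 10.0, "FINANZ_REGIONE": None, "DPS_DATA_INIZIO_EFFETTIVA": ""},
    {"FINANZ_UE": 5.0, "FINANZ_REGIONE": 1.0, "DPS_DATA_INIZIO_EFFETTIVA": None},
    {"FINANZ_UE": 7.0, "FINANZ_REGIONE": 2.0, "DPS_DATA_INIZIO_EFFETTIVA": "2012-03-15"},
]
pcc, pcrp = completeness(rows)
print(f"pcc = {pcc:.1f}%, pcrp = {pcrp:.1f}%")
```

The row-level measure drops much faster than the cell-level one, which is why a dataset can have a high pcc and still contain no complete rows.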

What to do with missing data

1. Understand the domain, e.g., NA or 0?
2. Find the motivation (e.g., a missing start date is fine if the project hasn’t started yet)
3. Understand how much the missing values affect your analysis
4. You might also:
   – exclude rows with missing values
   – use imputation techniques:
     – mean substitution
     – regression substitution
     – group mean substitution
     – hot deck imputation
     – multiple imputation

Source: A. Mockus, Missing data in software engineering, Guide to Advanced Empirical Software Engineering, 2008
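Three of the options listed above (excluding rows, mean substitution, group mean substitution) can be sketched as follows. The field names `region` and `funding` and the values are hypothetical. The group variant assumes every group has at least one observed value.

```python
from statistics import mean


def drop_incomplete(rows, key):
    """Exclude rows where `key` is missing."""
    return [r for r in rows if r[key] is not None]


def mean_substitution(rows, key):
    """Replace missing `key` values with the mean of the observed ones."""
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def group_mean_substitution(rows, key, group):
    """Replace missing `key` values with the mean observed within the same group."""
    observed = {}
    for r in rows:
        if r[key] is not None:
            observed.setdefault(r[group], []).append(r[key])
    means = {g: mean(vals) for g, vals in observed.items()}
    return [{**r, key: means[r[group]] if r[key] is None else r[key]} for r in rows]


# Made-up rows for illustration only.
rows = [
    {"region": "A", "funding": 10.0},
    {"region": "A", "funding": None},
    {"region": "B", "funding": 30.0},
    {"region": "B", "funding": 50.0},
]
n_kept = len(drop_incomplete(rows, "funding"))
overall_fill = mean_substitution(rows, "funding")[1]["funding"]
group_fill = group_mean_substitution(rows, "funding", "region")[1]["funding"]
print(n_kept, overall_fill, group_fill)
```

The same missing cell is filled very differently by the overall mean and by the group mean, which is a reminder to record which technique was applied and why.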

Challenge #4: outliers

» Outliers can point to interesting facts
» … or to something which deserves a second look

[Figure: chart; axis label: Value; pcvc = percentage of cells with correct value]

Annotations on the chart:
- ca. 50000 fundings < 1€
- ca. 210000 < 5€
- ca. 360000 < 55€
- ca. 430000 < 89€
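Counts of this kind come from comparing each amount with a set of thresholds. The sketch below uses the thresholds from the annotations, but the amounts are invented and the function name is hypothetical.

```python
def count_below(values, thresholds):
    """Map each threshold to the number of values strictly below it."""
    return {t: sum(1 for v in values if v < t) for t in thresholds}


# Made-up amounts in euros, not taken from the dataset.
amounts = [0.0, 0.5, 3.0, 40.0, 60.0, 1200.0, 250000.0]
counts = count_below(amounts, thresholds=(1, 5, 55, 89))
print(counts)
```

The counts are cumulative, so each threshold includes the values already counted under the smaller ones.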

What to do with outliers

1. Retention
   – Check the distribution of the data: if it is heavy-tailed, keep the outliers but don’t apply techniques that require normality
2. Exclusion
   – Remove them if you think they are measurement errors or exceptional cases
3. Sensitivity analysis
   – compare results with and without outliers
   – reason about the motivations
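A sensitivity analysis can be sketched by computing the same statistics with and without the flagged values. The slides do not name a rule for flagging outliers, so the 1.5 × IQR fences used here are an assumption, as are the sample amounts.

```python
from statistics import mean, median, quantiles


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def sensitivity(values):
    """Compare mean and median with and without the flagged outliers."""
    flagged = iqr_outliers(values)
    kept = [v for v in values if v not in flagged]
    return {
        "with": (mean(values), median(values)),
        "without": (mean(kept), median(kept)),
        "flagged": flagged,
    }


# Made-up amounts for illustration only.
amounts = [10, 12, 11, 13, 12, 14, 11, 500]
report = sensitivity(amounts)
print(report)
```

In this sample the mean changes a lot when the flagged value is removed while the median stays the same, which is the kind of difference that step 3 asks you to reason about.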

Challenge #5: Drawing proper conclusions

» Knowledge is more than statistical significance
» Context and domain knowledge are fundamental
» Consider both qualitative and quantitative aspects
» Triangulate data with other sources

Summing up and additional suggestions

Challenges: watch out for

- Errors
- Data accuracy
- Missing data
- Outliers
- Drawing proper conclusions

Verify data collection

- sampling
- reference time and context
- appropriateness to goals
- transformations

First check what the data looks like: most programs (Excel, SPSS, STATA, R, …) offer predefined functions

Keep track of:

- modifications and reasons
- different versions
- raw data
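As an illustration of a first look at the data, a minimal summary of one numeric column might look like the sketch below. The function name and the sample values are hypothetical. The packages mentioned above provide richer built-in equivalents.

```python
from statistics import mean, median


def describe(values):
    """Summarise a numeric column, counting missing values (None) separately."""
    present = sorted(v for v in values if v is not None)
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": present[0],
        "median": median(present),
        "mean": mean(present),
        "max": present[-1],
    }


# Made-up funding amounts; None marks a missing value.
funding = [0.5, 3.0, None, 40.0, 60.0, None, 1200.0]
summary = describe(funding)
print(summary)
```

A large gap between mean and median, or a high missing count, is a cue to revisit the outlier and missing-data steps before any further analysis.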

Interesting readings

Gallery of challenges: End of Guided Tour