(Open) data analysis for decision support: challenges and essentials
Antonio Vetrò, Technische Universität München, Germany
01 September 2014, Matera (Italy), RENA Summer School
@phisaz
With examples from Open Coesione
With material from a joint work with:
Lorenzo Canova, Marco Torchiano (PoliTO - Politecnico di Torino)
Federico Morando, Raimondo Iemma (NEXA Center for Internet and Society - PoliTO)
Aline Pennisi (Ministero dell’Economia e delle Finanze)
Feedback from:
Andrea Milan (United Nations University)
Daniel Méndez Fernández (Technische Universität München)
RENA Summer School 2014
Deciding and implementing together
Monitoring together
Planning together
Outline
• Data analysis : a philosophical perspective, empiricism
• Data analysis challenges: examples with Open Data
Data
Knowledge Representation: World, Model, and Formal Theory
World, Model, Theory
observation, simulation, deduction
approximation: {good, sufficient, insufficient}
interpretation: {true, false}
Source: Klaus Mainzer, Modern Aspects of Philosophy of Science in Informatics and its Applications, Lehrstuhl für Philosophie und Wissenschaftstheorie, Carl von Linde-Akademie, Munich Center for Technology in Society, Technische Universität München
Figure: techrepublic.com
Data analysis: a philosophical perspective, empiricism
[Figure: diagram with the elements Theory / System of theories, Questions / Hypotheses, Study population, Observations / Evaluations; steps: Theory building, Pattern building, Falsification / support; Deductive logic and Inductive logic]
See also: Runeson et al., Case Study Research in Software Engineering: Guidelines and Examples
• Each empirical method…
  • has a specific purpose
  • relies on a specific data type
  • has a specific setting
Purpose
  • Exploratory
  • Descriptive
  • Explanatory / confirmatory
  • Improving
Data Type
  • Qualitative
  • Quantitative
Data analysis: a philosophical perspective
[Figure: diagram with the elements Theory / System of theories, (Tentative) Hypotheses, Study population, Observations / Evaluations; steps: Theory building, Pattern building, Falsification / support; Deductive logic and Inductive logic]
Research methods placed on the figure:
• Formal / conceptual analysis
• Grounded theory
• Exploratory: case/field studies, experiments, data analysis
• Survey / interview research
• Confirmatory: case studies, experiments, data analysis, …
• Ethnographic studies
See also: Vessey et al., A unified classification system for research in the computing disciplines
Deciding and implementing together
Monitoring together
Planning together
Opportunities
Mike Lemansky, Open Data
& challenges
Open Coesione
Dataset “progetti” - FSE 2007/2013
Snapshot 31/12/2013
Focus on funding and dates, 22/74 columns
> colnames(subsetProgetti)
 [1] "FINANZ_UE"                        "FINANZ_STATO_FONDO_DI_ROTAZIONE"
 [3] "FINANZ_STATO_FSC"                 "FINANZ_STATO_PAC"
 [5] "FINANZ_STATO_ALTRI_PROVVEDIMENTI" "FINANZ_REGIONE"
 [7] "FINANZ_PROVINCIA"                 "FINANZ_COMUNE"
 [9] "FINANZ_ALTRO_PUBBLICO"            "FINANZ_STATO_ESTERO"
[11] "FINANZ_PRIVATO"                   "FINANZ_DA_REPERIRE"
[13] "FINANZ_TOTALE_PUBBLICO"           "DPS_DATA_INIZIO_PREVISTA"
[15] "DPS_DATA_FINE_PREVISTA"           "DPS_DATA_INIZIO_EFFETTIVA"
[17] "DPS_DATA_FINE_EFFETTIVA"          "DPS_FLAG_CUP"
[19] "DPS_FLAG_PRESENZA_DATE"           "DPS_FLAG_COERENZA_DATE_PREV"
[21] "DPS_FLAG_COERENZA_DATE_EFF"       "DATA_AGGIORNAMENTO"
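A minimal sketch of how a funding-and-dates subset like this could be selected, written in Python rather than the R session shown on the slide. The prefix rule is only inferred from the 22 column names listed; the sample row, the column OTHER_COLUMN and the semicolon delimiter are assumptions for illustration, not properties of the real file.

```python
import csv
import io

# Prefixes inferred from the 22 listed column names (an assumption,
# not necessarily the selection rule used for the slides).
KEEP_PREFIXES = ("FINANZ_", "DPS_DATA_", "DPS_FLAG_")
KEEP_EXACT = {"DATA_AGGIORNAMENTO"}


def is_selected(column_name):
    """Return True for funding, date and flag columns."""
    return column_name.startswith(KEEP_PREFIXES) or column_name in KEEP_EXACT


def subset_rows(csv_text, delimiter=";"):
    """Read CSV text and keep only the selected columns of each row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [
        {name: value for name, value in row.items() if is_selected(name)}
        for row in reader
    ]


# Tiny made-up sample; OTHER_COLUMN is hypothetical.
SAMPLE = (
    "OTHER_COLUMN;FINANZ_UE;DPS_DATA_INIZIO_PREVISTA;DATA_AGGIORNAMENTO\n"
    "abc;22.50;20100101;20131231\n"
)

subset = subset_rows(SAMPLE)
print(sorted(subset[0]))
```

Selecting by name rather than by position keeps the subset stable if a later snapshot reorders or adds columns.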
Gallery of challenges: Guided Tour
Milepost5, 850 NE 81st Ave, Portland, OR 97213, http://milepost5.net/galleries/
Challenge #1: Errors in data
[Figure: highlighted value 43 !]
Errors can be introduced by:
- the source (observation, sensor)
- manual insertion
- ETL (extraction, transformation, and loading)
Be careful before claiming errors: they might be “just” accuracy problems
Challenge #2: accuracy
[Figure: highlighted value 43 !]
Unione Europea        22.50
Fondo di Rotazione    21.76
Regione                0.74
                      -----
                      45.00
» Always refer to raw data
» If that is not possible, estimate the accuracy of the analysis (e.g., about 5% in the example above)
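The "about 5%" estimate can be reproduced with a short calculation. This sketch assumes that 43 is the total displayed for the project and that the three amounts are its components. The slide does not state the units or how the displayed total was derived.

```python
# Component amounts as shown on the slide.
components = {
    "Unione Europea": 22.50,
    "Fondo di Rotazione": 21.76,
    "Regione": 0.74,
}
# Assumption: the highlighted value is the displayed total.
displayed_total = 43.0

component_sum = round(sum(components.values()), 2)
relative_gap = abs(component_sum - displayed_total) / displayed_total

print(f"sum of components: {component_sum:.2f}")
print(f"relative gap: {relative_gap:.1%}")
```

The gap of 2 on a displayed total of 43 is roughly 4.7%, consistent with the "about 5%" on the slide.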
Challenge #3: missing data
Completeness
pcc = percentage of complete cells
pcrp = percentage of complete rows
sub: analysis on 22 attributes
int: analysis on the whole dataset
general: both sub and int
NA: NAs belong to the domain
[Figure: completeness chart, axis: Value, with the annotations below]
- No dates
- NA in “finanziamenti”
- NA in “finanziamenti”, “pagamenti”; missing values in Ateco and other columns
- In 89% of projects dates are present
- Codes and descriptions; Ateco + other descriptions
- No rows are complete
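A minimal Python sketch of the two completeness measures, pcc and pcrp. The slide gives only their names, so the formulas are one plausible reading. The set of values treated as missing and the sample rows are assumptions.

```python
def completeness(rows, missing=(None, "", "NA")):
    """Return (pcc, pcrp) as percentages for a non-empty list of rows."""
    total_cells = sum(len(row) for row in rows)
    complete_cells = sum(
        1 for row in rows for cell in row if cell not in missing
    )
    complete_rows = sum(
        1 for row in rows if all(cell not in missing for cell in row)
    )
    pcc = 100.0 * complete_cells / total_cells
    pcrp = 100.0 * complete_rows / len(rows)
    return pcc, pcrp


# Made-up rows: (funding, start date, end date).
rows = [
    ("22.50", "20100101", "20101231"),
    ("NA", "20110101", ""),
    ("0.74", "", ""),
    ("10.00", "20120101", "20121231"),
]

pcc, pcrp = completeness(rows)
print(f"pcc = {pcc:.1f}%, pcrp = {pcrp:.1f}%")
```

Where NA is a legitimate value of the domain, it can be left out of `missing`, for example `completeness(rows, missing=(None, ""))`.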
What to do with missing data
1. Understand the domain (e.g., NA or 0?)
2. Find the reason (e.g., a missing start date is OK if the project hasn’t started yet)
3. Understand how much the missing values affect your analysis
4. You might also:
- exclude rows with missing values
- use imputation techniques:
  - mean substitution
  - regression substitution
  - group mean substitution
  - hot deck imputation
  - multiple imputation
Source: A. Mockus, Missing data in software engineering, Guide to Advanced Empirical Software Engineering, 2008
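Two of the listed imputation techniques, mean substitution and group mean substitution, can be sketched in a few lines of Python. The data and the grouping by region are made up. `None` is assumed to mark a missing value, and a group with no observed value at all would raise a `KeyError` in this simplified version.

```python
from collections import defaultdict
from statistics import mean


def mean_substitution(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def group_mean_substitution(records):
    """Replace a missing value with the mean of its own group.

    records is a list of (group, value) pairs; value may be None.
    """
    by_group = defaultdict(list)
    for group, value in records:
        if value is not None:
            by_group[group].append(value)
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    return [
        (group, group_means[group] if value is None else value)
        for group, value in records
    ]


# Made-up funding amounts; None marks a missing value.
amounts = [10.0, None, 30.0, 50.0]
print(mean_substitution(amounts))

# Made-up (region, amount) pairs.
records = [("A", 10.0), ("A", None), ("B", 100.0), ("B", 300.0), ("B", None)]
print(group_mean_substitution(records))
```

Both variants shrink the variance of the imputed column, which is one reason to report how many values were filled in.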
Challenge #4: outliers
» Outliers can point to interesting facts
» … or to something which deserves a second look
[Figure: chart of pcvc, axis: Value, with the annotations below]
pcvc = percentage of cells with correct value
- ca. 50000 fundings < 1€
- ca. 210000 < 5€
- ca. 360000 < 55€
- ca. 430000 < 89€
What to do with outliers
1. Retention
- Check the distribution of the data: if it is heavy-tailed, keep the outliers but don’t apply techniques which require normality
2. Exclusion
- Remove them if you think they are measurement errors or exceptional cases
3. Sensitivity analysis
- compare results with and without outliers
- reason about the causes
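A sensitivity analysis can be sketched as computing the same statistics with and without the flagged values. The slides do not say how outliers are identified. The 1.5 × IQR fence used here is a common convention brought in as an assumption, and the amounts are made up.

```python
from statistics import mean, median, quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: quartiles widened by k times the IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def sensitivity(values):
    """Compare mean and median with and without flagged outliers."""
    low, high = iqr_bounds(values)
    kept = [v for v in values if low <= v <= high]
    return {
        "with_outliers": {"mean": mean(values), "median": median(values)},
        "without_outliers": {"mean": mean(kept), "median": median(kept)},
        "flagged": [v for v in values if not low <= v <= high],
    }


# Made-up funding amounts with one very small and one very large value.
amounts = [0.5, 900, 950, 1000, 1050, 1100, 1150, 1200, 50000]
report = sensitivity(amounts)
print(report)
```

In this sample the median is the same in both runs while the mean changes by a large factor. That kind of difference is worth explaining before deciding on retention or exclusion.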
Challenge #5: Drawing proper conclusions
» Knowledge is more than statistical significance
» Context and domain knowledge are fundamental
» Consider both qualitative and quantitative aspects
» Triangulate data with other sources
Summing up and additional suggestions
Challenges: watch out for
- Errors
- Data accuracy
- Missing data
- Outliers
- Drawing proper conclusions
Verify data collection
- sampling
- reference time and context
- appropriateness to goals
- transformations
First check what the data looks like
Most programs (Excel, SPSS, STATA, R, …) offer predefined functions
Keep track of:
- modifications and reasons
- different versions
- raw data
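As an illustration of a first look at a column before any analysis, here is a minimal Python sketch using only the standard library. The helper `first_look` and the sample amounts are hypothetical. The tools named on the slide offer richer built-in summaries.

```python
from statistics import median


def first_look(values):
    """Summarise a numeric column; None marks a missing value."""
    observed = sorted(v for v in values if v is not None)
    return {
        "count": len(values),
        "missing": len(values) - len(observed),
        "min": observed[0],
        "median": median(observed),
        "max": observed[-1],
    }


# Made-up funding amounts.
amounts = [22.50, 21.76, 0.74, None, 45.00]
summary = first_look(amounts)
print(summary)
```

Even this small summary touches three of the challenges at once: missing values, suspiciously small amounts, and the range of the data.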
Interesting readings
Gallery of challenges: End of Guided Tour