An Overview of Data Completeness Assessment Techniques
Simon Razniewski
Free University of Bozen-Bolzano, Italy
Background
• Diplom (~Master) from TU Dresden, Germany, 2010
• PhD from Free University of Bozen-Bolzano, Italy, 2014
  – Spent some time at UCSD and AT&T Labs-Research
• Now Assistant Professor in Bozen-Bolzano

Bolzano
• Trilingual province (German, Italian, Ladin)
• Autonomous for 43 years
• University founded in 1997
• 3500 students
Background (2)
• PhD centered on formal approaches to data completeness
• Other research interests:
  – Data currency (see WebDB 2015 paper)
  – Process mining
  – Data-driven (machine learning) approaches to data completeness
  – …
• Presentation today: Joint work with Werner Nutt, Divesh Srivastava and Flip Korn
Data Completeness
• Data quality commonly distinguishes dimensions
  – Correctness
  – Timeliness
  – Completeness
• (In-)completeness is an issue in many settings, e.g.
  – Data from multiple sources
  – Optional data
  – Human-intensive workflows
• Aspects of incompleteness
  – Schema
  – Records
  – Values

Continent
Name       Population (billion)
Africa     1
America    Null
Asia       5
Australia  0.03

Continent
Name       Population (billion)  Area (million km²)
Africa     1                     30
America    Null                  16
Asia       5                     43
Australia  0.03                  3
Europe     0.7                   4

Focus today on records, for values see [Razniewski & Nutt, CIKM 2012]
What can one research?
• How to avoid incompleteness
  – Information systems design
  – Process design
• How to deal with incompleteness
  – Statistical procedures to predict missing data
  – Missing value imputation
• How to understand incompleteness
  – How to describe it
  – How to reason about it
Motivation: Data warehouse of a telecommunication company

Warnings
day  week  ID    message
Mon  1     tw37  high voltage
Fri  1     tw37  high voltage
Wed  2     tw37  overheat
Tue  1     tw59  auto restart
Fri  1     tw59  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart

Maintenance
ID    resp  reason
tw37  A     disk failure
tw59  D     software crash
tw83  B     unknown
tw91  C     update failure
tw91  C     network error

Teams
name  specialization
A     hardware
B     hardware
C     network
C     software
D     network

Admin John knows
• Team table is complete (HR says so)
• Maintenance is complete for teams A, B and C
  – Their reporting systems export data automatically
• Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2
  – Potential data loss due to a system failure on Tuesday
  – Data after Wednesday may not be fully loaded
John wants to know
“Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team.”

SELECT *
FROM Warnings W
  JOIN Maintenance M ON W.ID = M.ID
  JOIN Teams T ON M.resp = T.name
WHERE W.week = 2 AND T.specialization = 'hardware'

W.day  W.week  W.ID  W.message     M.ID  M.resp  M.reason      T.name  T.specialization
Wed    2       tw37  overheat      tw37  A       disk failure  A       hardware
Mon    2       tw83  high voltage  tw83  B       unknown       B       hardware
Tue    2       tw83  auto restart  tw83  B       unknown       B       hardware

Is this all that hardware teams have done?
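To make John's query reproducible, the following is a minimal sketch that loads the three example tables into an in-memory SQLite database and runs the join. Using Python's `sqlite3` module is an assumption made purely for illustration; the talk does not say which DBMS the warehouse runs on. The join condition uses the column name `resp` as it appears in the Maintenance table.

```python
import sqlite3

# In-memory database holding the example tables from the slides.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Warnings (day TEXT, week INTEGER, ID TEXT, message TEXT);
CREATE TABLE Maintenance (ID TEXT, resp TEXT, reason TEXT);
CREATE TABLE Teams (name TEXT, specialization TEXT);
""")
cur.executemany("INSERT INTO Warnings VALUES (?, ?, ?, ?)", [
    ("Mon", 1, "tw37", "high voltage"), ("Fri", 1, "tw37", "high voltage"),
    ("Wed", 2, "tw37", "overheat"), ("Tue", 1, "tw59", "auto restart"),
    ("Fri", 1, "tw59", "overheat"), ("Mon", 2, "tw83", "high voltage"),
    ("Tue", 2, "tw83", "auto restart"),
])
cur.executemany("INSERT INTO Maintenance VALUES (?, ?, ?)", [
    ("tw37", "A", "disk failure"), ("tw59", "D", "software crash"),
    ("tw83", "B", "unknown"), ("tw91", "C", "update failure"),
    ("tw91", "C", "network error"),
])
cur.executemany("INSERT INTO Teams VALUES (?, ?)", [
    ("A", "hardware"), ("B", "hardware"), ("C", "network"),
    ("C", "software"), ("D", "network"),
])

# John's query; rows are sorted only to make the output order deterministic.
rows = sorted(cur.execute("""
    SELECT W.day, W.week, W.ID, W.message, M.resp
    FROM Warnings W
      JOIN Maintenance M ON W.ID = M.ID
      JOIN Teams T ON M.resp = T.name
    WHERE W.week = 2 AND T.specialization = 'hardware'
""").fetchall())

for row in rows:
    print(row)
conn.close()
```

The query returns the three rows shown on the slide. Nothing in the result says whether further week 2 warnings are missing, which is the question the rest of the talk addresses.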
John reasons
“Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team.”

Schema:
Warnings(day, week, ID, message)
Maintenance(ID, resp, reason)
Teams(name, specialization)

• Warnings is complete for Week 1 and Monday and Wednesday of Week 2
• Maintenance is complete for teams A, B and C
• Team is complete

The query result definitely contains all warnings from
– Monday for team A
– Monday for team B
– Monday for team C
– Wednesday for team A
– Wednesday for team B
– Wednesday for team C
John looks at the data
The query result definitely contains all warnings from
– Monday for team A
– Monday for team B
– Monday for team C
– Wednesday for team A
– Wednesday for team B
– Wednesday for team C

Teams
name  specialization
A     hardware
B     hardware
C     network
C     software
D     network

• There are no other hardware teams than A and B

The query result is fully complete for Monday and Wednesday
Questions
“Warnings are complete for Week 1”
1. How can we formally describe complete parts of a database?

“The query result contains all warnings from Monday of week 2 for team A”
2. How can we use database completeness information to identify complete parts of query answers?
Related work

Publication               Description language                  Focus of the work
Motro, TODS 1989          Views                                 Schema-level reasoning
Levy, VLDB 1996           LC statements, similar to views       Schema-level reasoning
Fan & Geerts, PODS 2009   Various query languages (CQ-Datalog)  Master data management, where an upper bound database exists
Lang et al., SIGMOD 2014  Columns/operators                     Distributed databases on the web, operational failures during query execution
Formalism: Patterns
• “We have all warnings from week 1”: pattern (*, 1, *, *)
• “We have all warnings from Monday of week 2”: pattern (Mon, 2, *, *)

Warnings
day  week  ID    message
Mon  1     tw37  high voltage
Fri  1     tw37  high voltage
Wed  2     tw37  overheat
Tue  1     tw59  auto restart
Fri  1     tw59  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart
Completeness patterns:
*    1     *     *
Mon  2     *     *

• Less expressive than previous formalisms
• Can be expressed in the same schema as the data
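A minimal sketch of how such patterns could be represented and checked against records. Patterns are plain tuples over the table's columns with the string `"*"` as the wildcard; the helper names `matches` and `is_covered` are hypothetical and not taken from the talk or the paper.

```python
WILDCARD = "*"  # assumed wildcard marker, written as on the slides

def matches(pattern, record):
    """Return True if the record falls under the completeness pattern."""
    return len(pattern) == len(record) and all(
        p == WILDCARD or p == value for p, value in zip(pattern, record)
    )

def is_covered(record, patterns):
    """Return True if some pattern declares the record's part of the table complete."""
    return any(matches(p, record) for p in patterns)

# Patterns over Warnings(day, week, ID, message) from the slide.
warnings_patterns = [("*", 1, "*", "*"), ("Mon", 2, "*", "*")]

# A Friday warning of week 1 lies in a part declared complete ...
print(is_covered(("Fri", 1, "tw59", "overheat"), warnings_patterns))
# ... while a Tuesday warning of week 2 does not.
print(is_covered(("Tue", 2, "tw83", "auto restart"), warnings_patterns))
```

Because a pattern has the same arity as the table, it can be stored next to the data in a table of the same schema, which is the point of the second bullet above.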
John’s knowledge expressed by patterns

Team table is complete

Teams
name  specialization
A     hardware
B     hardware
C     network
C     software
D     network
Completeness patterns:
*     *

Maintenance is complete for teams A, B and C

Maintenance
ID    resp  reason
tw37  A     disk failure
tw59  D     software crash
tw83  B     unknown
tw91  C     update failure
tw91  C     network error
Completeness patterns:
*     A     *
*     B     *
*     C     *

Warnings is complete for all of Week 1, and Monday and Wednesday of Week 2

Warnings
day  week  ID    message
Mon  1     tw37  high voltage
Fri  1     tw37  high voltage
Wed  2     tw37  overheat
Tue  1     tw59  auto restart
Fri  1     tw59  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart
Completeness patterns:
*    1     *     *
Mon  2     *     *
Wed  2     *     *
John’s conclusions expressed by patterns
“Give me all warnings in week 2 that are generated by objects in maintenance with a hardware team.”

The query result contains all warnings from
• Monday for team A
• …

W.day  W.week  W.ID  W.message     M.ID  M.resp  M.reason      T.name  T.specialization
Wed    2       tw37  overheat      tw37  A       disk failure  A       hardware
Mon    2       tw83  high voltage  tw83  B       unknown       B       hardware
Tue    2       tw83  auto restart  tw83  B       unknown       B       hardware
Completeness patterns:
Mon    *       *     *             *     A       *             A       *
Mon    *       *     *             *     B       *             B       *
Mon    *       *     *             *     C       *             C       *
Wed    *       *     *             *     A       *             A       *
Wed    *       *     *             *     B       *             B       *
Wed    *       *     *             *     C       *             C       *
How to compute the completeness patterns for queries?
Queries are computed by relational algebra
Here: Select, project, equijoin

Schema reasoning:
– Apply algebra operators to completeness patterns (analogous to query result computation)

Algebra tree for John’s query:
$\sigma_{week=2}(Warnings) \bowtie_{W.ID=M.ID} \big(Maintenance \bowtie_{M.resp=T.name} \sigma_{spec=hw}(Teams)\big)$
Reasoning about selections
$\sigma_{spec=hw}(Teams)$

Input: Teams
name  specialization
A     hardware
B     hardware
C     network
C     software
D     network
Completeness patterns:
*     *

Result of the selection
name  specialization
A     hardware
B     hardware
Completeness patterns:
*     *

Rule 1: Statements with * survive
Reasoning about selections (2)
$\sigma_{week=2}(W)$

Input: Warnings
day  week  ID    message
Mon  1     tw37  high voltage
Fri  1     tw37  high voltage
Wed  2     tw37  overheat
Tue  1     tw59  auto restart
Fri  1     tw59  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart
Completeness patterns:
*    1     *     *
Mon  2     *     *
Wed  2     *     *

Result of the selection
day  week  ID    message
Wed  2     tw37  overheat
Mon  2     tw83  high voltage
Tue  2     tw83  auto restart
Completeness patterns after the selection:
Mon  2     *     *
Wed  2     *     *
Completeness patterns after promotion of the selected constant:
Mon  *     *     *
Wed  *     *     *

Rule 2: Irrelevant constants are ignored (the pattern for week 1 is dropped)
Rule 3: Selected constants survive and are promoted (the constant 2 becomes *)
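The three selection rules can be sketched as a single function over tuple patterns. This is one reading of the rules as stated on the slides, not the authors' implementation; `select_patterns` and the positional column index are hypothetical.

```python
WILDCARD = "*"  # assumed wildcard marker, written as on the slides

def select_patterns(patterns, column, value):
    """Apply an equality selection `column = value` to completeness patterns."""
    result = []
    for p in patterns:
        if p[column] == WILDCARD:
            # Rule 1: a pattern with * in the selected column survives unchanged.
            result.append(p)
        elif p[column] == value:
            # Rule 3: the selected constant survives and is promoted to *,
            # since every record in the selection result carries that value.
            result.append(p[:column] + (WILDCARD,) + p[column + 1:])
        # Rule 2: a pattern with a different constant is irrelevant and dropped.
    return result

# Warnings(day, week, ID, message), selection week = 2 (column index 1).
warnings_patterns = [("*", 1, "*", "*"), ("Mon", 2, "*", "*"), ("Wed", 2, "*", "*")]
selected_warnings = select_patterns(warnings_patterns, 1, 2)
print(selected_warnings)

# Teams(name, specialization), selection specialization = 'hardware' (column index 1).
selected_teams = select_patterns([("*", "*")], 1, "hardware")
print(selected_teams)
```

On the slide data this yields the promoted patterns (Mon, *, *, *) and (Wed, *, *, *) for Warnings, and leaves the Teams pattern (*, *) untouched.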
Reasoning about joins
$M \bowtie_{M.resp=T.name} \sigma_{spec=hw}(T)$

Input: $\sigma_{spec=hw}(T)$
name  specialization
A     hardware
B     hardware
Completeness patterns:
*     *

Input: Maintenance
ID    resp  reason
tw37  A     disk failure
tw59  D     software crash
tw83  B     unknown
tw91  C     update failure
tw91  C     network error
Completeness patterns:
*     A     *
*     B     *
*     C     *

Rule 1: Constants join with equal constants
Rule 2: Wildcards join with anything
Rule 3: Constants can be promoted

Result of the join
M.ID  M.resp  M.reason      T.name  T.specialization
tw37  A       disk failure  A       hardware
tw83  B       unknown       B       hardware
Completeness patterns after the join:
*     A       *             A       *
*     B       *             B       *
*     C       *             C       *
Completeness patterns after promotion:
*     A       *             *       *
*     B       *             *       *
*     C       *             *       *
*     *       *             A       *
*     *       *             B       *
*     *       *             C       *
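A minimal sketch of Rules 1 and 2 for an equijoin of two pattern sets. It is an illustrative reading of the slides rather than the published algorithm: `join_patterns` is a hypothetical name, columns are addressed by position, and the promotion step of Rule 3 is deliberately left out.

```python
WILDCARD = "*"  # assumed wildcard marker, written as on the slides

def join_patterns(left, right, left_col, right_col):
    """Equijoin two sets of completeness patterns on left_col = right_col."""
    result = []
    for l in left:
        for r in right:
            a, b = l[left_col], r[right_col]
            # Rule 1: constants join only with equal constants.
            # Rule 2: a wildcard joins with anything.
            if a == WILDCARD or b == WILDCARD or a == b:
                # The join condition forces both columns to be equal, so a
                # constant on either side is written into both join columns.
                joined_value = a if a != WILDCARD else b
                new_l = l[:left_col] + (joined_value,) + l[left_col + 1:]
                new_r = r[:right_col] + (joined_value,) + r[right_col + 1:]
                result.append(new_l + new_r)
    return result

# Maintenance(ID, resp, reason) joined with Teams(name, specialization)
# on M.resp = T.name, i.e. column 1 on the left and column 0 on the right.
maintenance_patterns = [("*", "A", "*"), ("*", "B", "*"), ("*", "C", "*")]
teams_patterns = [("*", "*")]
joined = join_patterns(maintenance_patterns, teams_patterns, 1, 0)
for pattern in joined:
    print(pattern)
```

The output corresponds to the patterns listed after the join, for example (*, A, *, A, *). Promoting one of the two equal constants to a wildcard would be a separate step.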
Algorithmic completeness
Proven: Extended algebra gives all conclusions that hold on the schema level (reasoning only with the yellow metadata)
• Independent of the algebra tree chosen
Looking at the data
$M \bowtie_{M.resp=T.name} \sigma_{spec=hw}(T)$

Input: $\sigma_{spec=hw}(T)$
name  specialization
A     hardware
B     hardware
Completeness patterns:
*     *

Input: Maintenance
ID    resp  reason
tw37  A     disk failure
tw59  D     software crash
tw83  B     unknown
tw91  C     update failure
tw91  C     network error
Completeness patterns:
*     A     *
*     B     *
*     C     *

Result with schema-level patterns
M.ID  M.resp  M.reason      T.name  T.specialization
tw37  A       disk failure  A       hardware
tw83  B       unknown       B       hardware
Completeness patterns:
*     A       *             *       *
*     B       *             *       *
*     C       *             *       *
*     *       *             A       *
*     *       *             B       *
*     *       *             C       *

There cannot be other hardware teams than A and B

Result after looking at the data
M.ID  M.resp  M.reason      T.name  T.specialization
tw37  A       disk failure  A       hardware
tw83  B       unknown       B       hardware
Completeness patterns:
*     *       *             *       *

Database instance allows for more promotion! (for details see paper)
So much for the theory, but…

1. How can we implement this?

2. How fast is this?
  – In comparison with query evaluation

3. How can we manage large sets of statements?
How can we implement this?
• Ideally, a plugin inside a DBMS– Promotion procedure benefits from fast access to data
• So far: Separate Java program
• Schema-level algebra can also be encoded in SQL
  – Could compile normal queries into metadata queries
How fast is this? (1)
• Synthetic data
• Wikipedia has around 1000 lists declared as complete (using a template or in natural language)
http://en.wikipedia.org/wiki/List_of_places_in_Carmarthenshire_%28categorised%29
How fast is this? (2)
• Manually extracted some and grouped them by topic
  – Recurrent topics: Sports teams, political assemblies, geographical features, songs, operas and other pieces of art
• Generated one table each about cities, schools and countries

city (completeness patterns)
name  country   state             county
*     USA       Virginia          *
*     Germany   *                 *
*     Ukraine   *                 *
*     Bulgaria  *                 *
*     USA       New York          *
*     UK        Carmarthenshire   *
*     USA       West Virginia     Hampshire County
*     Czech     Moravian-Silesia  Nový Jičín
*     Slovenia  *                 *
How fast is this? (3)

SELECT *
FROM country, city, school
WHERE country.capital = city.name AND city.state = school.state

• SQL runtime: 2040 ms (25891 records)
• Completeness pattern runtime: 900 ms (46 patterns)

Median over 7 join queries:
• SQL runtime: 2040 ms
• Completeness pattern runtime: 460 ms
How can we manage large sets of patterns?

Redundancies in workflows may lead to redundant patterns
– Introduce overhead and restrict comprehensibility
– Should be identified and removed

Example: John reports first that all data for Monday of week 2 is complete, later, that the data for the whole week 2 is complete
(Monday, 2)
(*, 2)

Trivial?
(Monday, *, hardware)   (Wednesday, *, software)
(Tuesday, 2, software)  (*, *, hardware)
(Monday, 2, *)          (*, 2, software)
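To see which of the six patterns are redundant, here is a small sketch of the entailment check between two patterns and a naive filter built on it. The names `subsumes` and `remove_redundant` are hypothetical; the filter compares every pair of patterns and is meant only to make the example concrete.

```python
WILDCARD = "*"  # assumed wildcard marker, written as on the slides

def subsumes(general, specific):
    """Return True if every record matching `specific` also matches `general`."""
    return all(g == WILDCARD or g == s for g, s in zip(general, specific))

def remove_redundant(patterns):
    """Drop every pattern that is subsumed by a different pattern in the set."""
    unique = list(dict.fromkeys(patterns))  # remove exact duplicates, keep order
    return [
        p for p in unique
        if not any(q != p and subsumes(q, p) for q in unique)
    ]

patterns = [
    ("Monday", "*", "hardware"), ("Wednesday", "*", "software"),
    ("Tuesday", 2, "software"), ("*", "*", "hardware"),
    ("Monday", 2, "*"), ("*", 2, "software"),
]
minimal = remove_redundant(patterns)
for pattern in minimal:
    print(pattern)
```

Two patterns are dropped: (Monday, *, hardware) is covered by (*, *, hardware), and (Tuesday, 2, software) is covered by (*, 2, software). The other four remain.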
Minimization of sets of patterns: Options

• Option 1: Pairwise comparison

• Option 2: Use of index structures for quick entailment checking (similar problem studied in theorem proving/AI)
  – Path indexes
  – Discrimination trees

• Option 3: Hashing
  – Store all statements in a hashmap
  – For each statement, all generalizations are generated (exponentially many!)
  – A statement is most general if none of its generalizations exists in the hashmap
  – Generalizations of (Mon, 1, sw): (*, 1, sw), (Mon, *, sw), (Mon, 1, *), (*, *, sw), (Mon, *, *), (*, 1, *), (*, *, *)

• Options can be combined with sorting by number of wildcards
  – Example order: (*, *, *), (Mon, *, *), (*, 2, sw), (Tue, 1, hw)
  – Later statements cannot entail earlier statements
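Option 3 combined with sorting can be sketched as follows. This is an illustrative reconstruction from the bullet points, not the evaluated implementation: a Python `set` stands in for the hashmap, and `generalizations` and `minimize` are hypothetical names.

```python
from itertools import product

WILDCARD = "*"  # assumed wildcard marker, written as on the slides

def generalizations(pattern):
    """All strict generalizations: replace a non-empty subset of constants by *."""
    constant_positions = [i for i, v in enumerate(pattern) if v != WILDCARD]
    result = []
    for mask in product([False, True], repeat=len(constant_positions)):
        if not any(mask):
            continue  # skip the pattern itself
        general = list(pattern)
        for position, replace in zip(constant_positions, mask):
            if replace:
                general[position] = WILDCARD
        result.append(tuple(general))
    return result  # 2^k - 1 patterns for k constants

def minimize(patterns):
    """Keep only the most general patterns, processing most-wildcards first."""
    ordered = sorted(set(patterns), key=lambda p: -sum(v == WILDCARD for v in p))
    kept = set()
    for p in ordered:
        # Later patterns have no more wildcards than earlier ones and so cannot
        # entail them; only the already kept patterns need to be probed.
        if not any(g in kept for g in generalizations(p)):
            kept.add(p)
    return kept

print(generalizations(("Mon", 1, "sw")))
print(minimize([
    ("Monday", "*", "hardware"), ("Wednesday", "*", "software"),
    ("Tuesday", 2, "software"), ("*", "*", "hardware"),
    ("Monday", 2, "*"), ("*", 2, "software"),
]))
```

Each membership probe is a constant-time hash lookup on average, but the number of probes per pattern grows exponentially with its number of constants. Sorting means redundant patterns are never stored in the set.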
Minimization of sets of patterns - Results
(Pairwise comparison and path indexes failed immediately)
Time/space tradeoff:
• Unsorted discrimination trees fastest
• Sorted hashing/discrimination trees most space efficient
Summary
• Completeness patterns are a natural way to describe complete parts of databases and query answers
  – Can be expressed in the same schema
• Modified the relational algebra to manipulate completeness patterns
  – Selection and projection easy
  – Join may be expensive (in theory; in practice, usually not)
• Current work
  – Correctness and completeness patterns
  – Column-level patterns
Open Questions
• Automated ways to get large sets of statements
  – Sensor networks
  – Web extraction (e.g. from Wikipedia)
  – Streams (e.g. transit data)
• What can be said if an answer is not guaranteed to be complete
  – Probabilistic completeness assessment based on historical data
  – Error bounds
• Algorithmic completeness of promotion
References
• Technical part today based on:
  – Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, Simon Razniewski, Flip Korn, Werner Nutt and Divesh Srivastava, SIGMOD 2015
• Other relevant papers:
  – Spatial data completeness: Adding Completeness Information to Query Answers over Spatial Data, Simon Razniewski and Werner Nutt, SIGSPATIAL 2014
  – Completeness over processes: Verification of Query Completeness over Processes, Simon Razniewski, Marco Montali and Werner Nutt, BPM 2013
  – Completeness of values: Completeness of Queries over SQL Databases, Werner Nutt and Simon Razniewski, CIKM 2012
Acknowledgment
This research has been supported by the project “MAGIC”, funded by the Province of Bozen-Bolzano, Italy