Data cleaning Data cleaning: An overview Error detection (Chapter 3) Data repairing (Chapter 3)

Download Data cleaning Data cleaning: An overview Error detection (Chapter 3) Data repairing (Chapter 3)

Post on 13-Jan-2016

42 views

Category:

Documents

0 download

Embed Size (px)

DESCRIPTION

TDD: Research Topics in Distributed Databases. Data cleaning Data cleaning: An overview Error detection (Chapter 3) Data repairing (Chapter 3) Taking record matching and data repairing together (Chapter 7) Certain fixes (Chapter 7). A platform for improving data quality. profiling. - PowerPoint PPT Presentation

TRANSCRIPT

<ul><li><p> Data cleaning</p><p>Discovering data quality rules (Chapter 3)Error detection (Chapter 3)Data repairing (Chapter 3)Taking record matching and data repairing together (Chapter 7)Certain fixes (Chapter 7)</p><p>TDD: Topics in Distributed Databases*</p></li><li><p>*</p><p>A platform for improving data qualityDirty dataClean DataMaster dataBusiness rulesdependenciesValidationautomatically discover rulesstandardizationprofilingdata accuracy</p><p>error detectingvalidatingdata enrichmentdata accuracymonitoringdata repairingcertain fixesrecord matchingA practical system is already in use*</p></li><li><p> Data cleaning</p><p>Discovering data quality rules Error detectionData repairingRecord matching and its interaction with data repairingCertain fixesTDD: Research Topics in Distributed Databases*</p></li><li><p>Profiling: Discovering conditional dependenciesSeveral effective algorithms for discovering conditional dependencies, developed by researchers in the UK, Canada and the US (e.g., AT&amp;T)Input: sample data DOutput: a cover of conditional dependencies that hold on D Automatic discovery of data quality rules Where do dependencies (data quality rules) come from?Manual design: domain knowledge analysisBusiness rules DiscoveryExpensiveInadequate*</p></li><li><p>*Assessing the quality of conditional dependenciesInput: a set of conditional functional dependenciesOutput: a maximum satisfiable subset of dependenciesAutomated methods for reasoning about data quality rules Approximation algorithmsEfficient: low polynomial timePerformance guarantee: provable within a small distanceComplexity: the MAXSC problem for CFDs is NP-completeTheorem: there is an -approximation algorithm for MAXSCthere exist constant such that for the subset m found by the algorithm has a bound: card(m) &gt; card(OPT())MAXSC: are the dependencies discovered dirty themselves?*</p></li><li><p> Data cleaning</p><p>Discovering data quality rules Error detectionData repairingRecord matching and its interaction with data repairingCertain fixesTDD: Research Topics in Distributed Databases*</p></li><li><p>*Detecting violations of CFDs*1[CC=44, zip] [street]2[CC=44, AC=131] [city=Edi]Input: A set of CFDs, and a database DOutput: All tuples in D that violate at least one CFD in Automatically check whether the data is dirty or clean *</p><p>tidnametitleCCACphnstrcityzipt1MikeMTS441311234567MayfieldNYCEH4 8LEt2RickDMTS441313456789ChrichtonNYCEH4 8LEt3PhilDMTS441312909229ChrichtonEdiEH4 8LE</p></li><li><p>Detecting CFD violationsInput: a set of CFDs and a database DBOutput: the set of tuples in DB that violate at least one CFD in Approach: automatically generate SQL queries to find violationsComplication 1: consider (R: X Y, Tp), the pattern tableau may be large (recall that each tuple in Tp is in fact a constraint)Goal: the size of the SQL queries is independent of TpTrick: treat Tp as a data table</p><p>CINDs can be checked along the same lines</p><p>*</p></li><li><p>Single CFD: step 1A pair of SQL queries, treating Tp as a data tableSingle-tuple violation (pattern matching)Multi-tuple violations (traditional FDs) (cust(country, area-code, phone street, city, zip), Tp)Single-tuple violation: Qcselect * from R t, Tp tpwhere t[country] tp[country] AND t[area-code] tp[area-code] AND t[phone] tp[phone] (t[street] tp[street] OR t[city] tp[city] OR t[zip] tp[zip])) : not matching; t[A1] tp[A1]: (t[A1] = tp[A1] OR tp[A1] = _) Testing pattern tuples*</p></li><li><p>Single CFD: step 2Multi-tuple violations (the semantics of traditional FDs): Qvselect distinct t.country, t.area-code, t.phone from R t, Tp tpwhere t[country] tp[country] AND t[area-code] tp[area-code] AND t[phone] tp[phone]group by t.country, t.area-code, t.phone having count(distinct street, city, zip) &gt; 1 </p><p>Tp is treated as a data table (cust(country, area-code, phone street, city, zip), Tp)The semantics of FDs*</p></li><li><p>Multiple CFDsComplication 2: if the set has n CFDs, do we use 2n SQL queries, and thus 2n passes of the database DB?Goal: 2 SQL queries no matter how many CFDs are in the size of the SQL queries is independent of TpTrick: merge multiple CFDs into oneGiven (R: X1 Y1, Tp1), (R: X2 Y2, Tp2) Create a single pattern table: Tm = X1 X2 Y1 Y2, Introduce @, a dont-care variable, to populate attributes of pattern tuples in X1 X2, etc (tp[A] = @)Modify the pair of SQL queries by using Tm*</p></li><li><p>Handling multiple CFDsCFD2: (zipstate, T2)CFD1: (areastate, T1)CFD3: (area,zipstate, T3)CFDM:(area,zipstate, TM)Qc: select * from R t, TM tp where t[area] tp[area] AND t[zip] tp[zip] AND t[state] tp[state]Qv: select distinct area, zip from Macro group by area, zip having count(distinct state) &gt; 1Macro: select (case tp[area] when @ then @ else t[area] end) as area . . . from R t, TM tp where t[area] tp[area] AND t[zip] tp[zip] AND tp[state] =_Dont care*</p><p>zipstate07974NJ90291CA01202_</p><p>areastate__212NY</p><p>areazipstate48095120CA31090995CA</p><p>areazipstateCFD2:@07974NJCFD2:@90291CACFD2:@01202_CFD1:_@_CFD1:212@NYCFD3:48095120CACFD3:31090995CA</p></li><li><p>*To find violations of 1, it is necessary to ship data (part of t1 or t2) from one site to anotherDetecting errors in horizontally partitioned data1[CC=44, zip] [street]2[CC=44, AC=131] [city=Edi]Partitioned by titleHorizontal partitiondistributed data*</p><p>tidnametitleCCACphnstrcityzipt1MikeMTS441311234567MayfieldNYCEH4 8LE</p><p>tidnametitleCCACphnstrcityzipt2RickDMTS441313456789ChrichtonNYCEH4 8LEt3PhilDMTS441312909229ChrichtonEdiEH4 8LE</p></li><li><p>To find violations of 2, it is necessary to ship dataDetecting errors in vertically partitioned data1[CC=44, zip] [street]2[CC=44, AC=131] [city=Edi]vertical partitionphoneaddress*</p><p>tidnametitleCCACphnstrcityzipt1MikeMTS441311234567MayfieldNYCEH4 8LEt2RickDMTS441313456789ChrichtonNYCEH4 8LEt3PhilDMTS441312909229ChrichtonEdiEH4 8LE</p><p>tidnametitleCCACphnt1MikeMTS441311234567t2RickDMTS441313456789t3PhilDMTS441312909229</p><p>tidstrcityzipt1MayfieldNYCEH4 8LEt2ChrichtonNYCEH4 8LEt3ChrichtonEdiEH4 8LE</p></li><li><p> The error detection problem for CFDs is NP-complete for horizontally partitioned data, andvertically partitioned data,when either minimum data shipment or minimum response time is concerned. Error detection in distributed dataIn contrast, error detection in centralized data is trivial distributed dataHeuristic (approximation) algorithmsRead: Detecting Inconsistencies in Distributed DataError detection in distributed data is far more intriguing than its centralized counterpart*</p></li><li><p> Data cleaning</p><p>Discovering data quality rules Error detectionData repairingRecord matching and its interaction with data repairingCertain fixesTDD: Research Topics in Distributed Databases*</p></li><li><p>*DependenciesData repairing: Fixing the errors identifiedInput: a set of conditional dependencies, and a database DBOutput: a candidate repair DB such that DB satisfies , and cost(DB, DB) is maximalThe most challenging part of data cleaningAccuracy ModelrepairingHow to define?*</p></li><li><p>Example: networking service providerAssume the billing and maintenance departments have separate databases.Internally consistent,yet containing errors.Goal: reconcile and improve data quality of, for example, integrated customer and billing data. BillingMaintenanceCustSitesDevicesCustEquip*</p></li><li><p>Service provider example, continued.*</p></li><li><p>Constraints and Violations (1)Consider inclusion and functional dependencies, (so each x is a violation of one or the other).A functional dependency (FD) says that the values of some fields determine the values of others. cust[zip] -&gt; cust[state] cust[phno] -&gt; cust[name, street, city, state, zip] t1 and t2 violate*</p></li><li><p>Constraints and Violations (2)An inclusion dependency (IND) says that the values of some fields should appear in some others.equipphno serno eqmfct eqmodel instdate t9949-2212L32400LUze300Oct-01*</p></li><li><p>Constraint RepairFind a repair to suggestA repair is a database which does not violate constraintsA good repair is also similar to the original databaseConstraint 1: FD: R[A] -&gt; R[B, C]R: *</p></li><li><p>Problem StatementInput: a relational database D, and a set C of integrity constraints (FDs, INDs)Question: find a good repair D of Drepair: D satisfies Cgood: D is close to the original data in D changes are minimal: what metrics (referred to as cost) should we use?changes: value modification, tuple insertion to avoid loss of information (tuple deletion can be expressed via modification)We want to compute D efficiently, as suggested fix to the users*</p></li><li><p>Repair ModelFor the duration of constraint repair, each input tuple is uniquely identified, t1, t2, This value of attribute A of tuple t in the input database is D(t,A)The output of the algorithm is a repaired database, D', so D'(t,A) is the value of t.A in the current proposed repair.For example, D = D' below, except D(t3,C) D'(t3,C).D: *</p></li><li><p>Cost Model ComponentsDistanceWeightConfidence placed by the user in the accuracy of a tuple or attribute.*</p></li><li><p>Intuition: Tuple/Attribute WeightsSimple model of data reliabilityExample: Billing department may be more reliable about address information, weight = 2, but less about equipment, weight = 1Example 2: Download table of zip codes from post-office web site, weight = 100In the absence of other information, all weights default to 1.*</p></li><li><p>Attribute-Level Cost Model If D(t.A) = v and D'(t.A) = v', then Cost(t.A) = dist(v, v') * weight(t.A) Cost(D): the sum of Cost(t.A) for all changed tuples t and attributes AExample: (if we model delete as dist(x,null))**</p></li><li><p>Problem and complexityInput: a relational database D, and a set C of integrity constraints Question: find a repair D of D such that cost(D) is minimal</p><p>Complexity: Finding an optimal repair is NP-complete in size of databaseIntractable even for a small, constant number of FDs alone.Intractable even for a small, constant number of INDs aloneBy contrast, in delete-only model, repair with either INDs or FDs alone is in PTime, while repair with both is CoNP hardWhat should we do?*Data complexity</p></li><li><p>Heuristic approach to value-based repairIn light of intractability, we turn to heuristic approaches. However, most have problems.Straightforward constraint-by-constraint repair algorithms fail to terminate (fixing individual constraints one by one)consider R1(A, B), R2(B, C), with FD: R1[A] R1[B]IND: R2[B] R1[B]D(R1): {(1, 2), (1, 3)}, D(R2) = {(2, 1), (3, 4)}</p><p>Also, user must be involved since any single decision can be wrong.One approach: Equivalence-Class-Based Repair.*</p></li><li><p>Equivalence Classes</p><p>An equivalence class is a set of "cells" defined by a tuple t and an attribute A.An equivalence class eq has a target value targ(eq) drawn from eq -- to be assigned after all the equivalence classes are foundFor example, targ(EQ1) = "Alice Smith" or "Ali Stith"*</p></li><li><p>Equivalence Classes ContinuedIn the repair, give each member of the equivalence class the same target value: that is for all (t,A) in eq, D'(t,A) = targ(eq) Target value is chosen from the set of values associated with eq in the input: {D(ti,Ai)}A given eq class has an easily computed cost w.r.t. a particular target value. For example, if weights are all 1, we might have,Cost({"A","A","B"},"A") = dist("B","A") = 1Cost({"A","A","B"},"B") = 2*dist("A","B") = 2</p><p>Separate The decision of which attribute values need to be equivalentThe decision of exactly what value targ(eq) should be assigned*</p></li><li><p>Resolving FD violationsTo resolve a tuple t violating FD R[A]-&gt;R[B], violations, we compute the set of tuples V from R that match t on A, and union equivalence classes of the attributes in B among all tuples in V changing RHSChanging the left-hand side to an existing value may not resolve the violationChanging left-hand side to a new value is arbitrary loss of e.g., keysWhy?*</p></li><li><p>Resolving IND violationsTo resolve a tuple t violating IND R[A] S[B], we pick a tuple s from S and union equivalence classes between t.A and s.B for attributes A in A and B in B.How to repair data using CFDs? CINDs?*</p></li><li><p> Data cleaning</p><p>Discovering data quality rules Error detectionData repairingRecord matching and its interaction with data repairingChapter 6, Foundations of Data Quality ManagementCertain fixesTDD: Research Topics in Distributed Databases*</p></li><li><p>Record matchingInput: large, unreliable data sourcesOutput: tuples that refer to the same real-world entityMatching dependenciesRecord matchingRead: Interaction between Record Matching and Data Repairing*</p></li><li><p>Error detection and data enrichment via matching There is interaction between data repairing and record matching3. card[tel] = trans[phn] card[address] trans[post] 1. card[LN,address] = trans[LN,post] card[FN] trans[FN] card[X] trans[Y]2. card[LN, tel] = trans[LN, phn] card[FN] trans[FN] card[X] trans[Y]inconsistentenrich21 Match*</p><p>FNLNpostphnwhenwhereamountM.Smith10 Oak St, EDI, EH8 9LEnull1pm/7/7/09EDI $3,500MaxSmithPO Box 25, EDI32567772pm/7/7/09NYC$6,300</p><p>FNLNaddresstelDOBgenderMarkSmith10 Oak St, EDI, EH8 9LE325677710/27/97M</p></li><li><p>Unifying repairing and matching1. tran([AC = 020] -&gt; [city = Ldn])3. tran[LN, city, St, post] = card[LN,city, St, zip] tran[FN] card[FN] -&gt; tran[FN,phn] card[FN, tel] 4. tran([city, phn] -&gt; [St, AC, post])2. tran([FN = Bob] -&gt; [FN = Robert])Repairing and matching operations should be interleaved*</p><p>FNLNStCityACZiptelRobertBrady5 Wren StLdn020WC1H 9SE3887644</p><p>FNLNStCityACpostphnitemBobBrady5 Wren StEdi020WC1H 9SE3887834watchRobertBradyNullLdn020WC1E 7HX3887644necklace</p></li><li><p> Data cleaning</p><p>Data cleaning: An overviewError detectionData repairingRecord matching and its interaction with data repairingCertain fixesTDD: Research Topics in Distributed Databases*</p></li><li><p>Dependencies can detect errors, but do not tell us how to fix themHow to fix the errors detected?*[AC=020] [city=Ldn][AC=131] [city=Edi]020EdiLdnHeuristic methods may not x the erroneous t[AC], and worse still, may mess up the correct attribute t[city]</p><p>131*</p><p>BobBrady0200791724852501 Elm St.EdiEH7 4AHCD</p><p>FNLNACphntypestrcityzipitemBobBrady0200791724852501 Elm St.EdiEH7 4AHCD</p><p>BobBrady0200791724852501 Elm St.EdiEH7 4AHCD</p></li><li><p>Certain fixes: 100% correct. The need for this is evident when repairing critical dataEvery update guarantees to fix an error;The repairing process does not introduce new errors.</p><p>The quest for certain fixesEditing rules are a departure from data dependencies Editing rules instead of dependencies Editing rules tell us which attributes to change and how to change them; dynamic semantics Dependencies have static semantics: violation or notEditing rules are defined on a master relation and an input relation correcting errors with master data values Dependencies are only defined on input relations*</p></li><li><p>Editing rules and master data Input relation RMaster relation Dmt1s1s2certain131certaintype=2Robert501 Elm Row1 home phone2 mobile phoneEditing rules: 1: ((zip, zip) (AC, str, city), tp1 = ( ))2: ((phn, Mphn) (FN, LN), tp2[type] = (2))</p><p>Master data is a single repository of high-quality data that provides various applications with a synchronized, consistent view of its core business entities.Certain regions: validated either by users or inferenceRepairing: interact with...</p></li></ul>