![Page 1: Sequential Dependencies Flip Korn, AT&T Lukasz Golab, AT&T Howard Karloff, AT&T Avishek Saha, University of Utah Divesh Srivastava, AT&T](https://reader035.vdocuments.mx/reader035/viewer/2022062408/56649f005503460f94c15d5b/html5/thumbnails/1.jpg)
Sequential Dependencies
Flip Korn, AT&T
Lukasz Golab, AT&T
Howard Karloff, AT&T
Avishek Saha, University of Utah
Divesh Srivastava, AT&T
Data Quality
• Cost to business: $600 billion
• Problems prevalent in measurement data
– Equipment failure
– Calibration/systemic errors
– Configuration errors
– Management errors
• Our goal: detection, not cleaning
Data Quality
• Principled approach: integrity constraints
– Assert semantics
– Deviations = quality issues
– Allow approx for real-world data
– Condition tableau discovery
• Domain often gives rise to semantics
– missing, extraneous, out-of-order
Sequential Dependency Definition
• Sequential dependency $X \to_g Y$
– $(y_i - y_{i-1}) \in g$, y's sorted w.r.t. x-values
– extension of functional dependency, $g = (0,\infty)$
• $X \to Y$: $\forall t_1, t_2:\ t_1[X] < t_2[X] \Rightarrow t_1[Y] < t_2[Y]$
• Example #1: date $\to_{[20,\infty)}$ price
– Prices increasing by at least 20 units
• Example #2: poll# $\to_{[4,6]}$ time
– Consecutive polls within 4-6 mins
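The definition above can be checked directly on a small table. The following is a minimal sketch, assuming rows are (x, y) pairs with distinct x-values and a closed gap interval [lo, hi]; the function name `satisfies_sd` and the sample data are hypothetical, not taken from the talk.

```python
import math

def satisfies_sd(rows, lo, hi):
    """Return True if, after sorting rows by x, every consecutive
    y-gap lies in the closed interval [lo, hi]."""
    ys = [y for _, y in sorted(rows, key=lambda r: r[0])]
    return all(lo <= b - a <= hi for a, b in zip(ys, ys[1:]))

# Example #2 style: poll number -> time, gaps must be 4 to 6 minutes.
polls = [(1, 0), (2, 5), (3, 9), (4, 15)]      # gaps: 5, 4, 6
late_poll = [(1, 0), (2, 5), (3, 12)]          # gaps: 5, 7

# Example #1 style: date -> price, gaps of at least 20 units.
prices = [(1, 100), (2, 120), (3, 150)]        # gaps: 20, 30

print(satisfies_sd(polls, 4, 6))               # True
print(satisfies_sd(late_poll, 4, 6))           # False
print(satisfies_sd(prices, 20, math.inf))      # True
```

An open lower bound such as (0,∞) is not modeled here; for integer data it could be approximated with lo = 1.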
Approx Sequential Dependency
[Figure: example sequence with tableau columns Start and End; $g = (0,\infty)$, confidence = 67%]
Conditional Sequential Dependencies
[Figure: tableau with columns Start and End over intervals [1,6], [2,11], [7,12]; confidence ≥ 80%]
Conditional Sequential Dependencies
Example #1: $g = (0,\infty)$
Conditional Sequential Dependencies
Example #2: $g = [9,11]$, $g = [20,\infty)$
Contributions
• Introduce sequential dependencies (SDs)
– algorithm for computing confidence
• Tableau Discovery for CSDs
– problem definition
– fast approximation algorithm
Approx Sequential Dependency
• Confidence: (N - OPS)/N, where OPS = min # of insertions + deletions
– Edit distance
• Ex: <5,9,12,25,31,30,34,40> with [4,6]
– del 12, ins 15, ins 20, del 31 ⇒ conf = 4/8
• Doesn't overpenalize for rare drops
– E.g., <5, 10, 15, 20, 30, 35, 40, 45> with [5,5]
• Penalize large gaps
– E.g., [3,5] with gap of 6 vs. 1000
Approx Sequential Dependency
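• How to compute OPS for $g = [G_1, G_2]$?
• dcost(d) = # of insertions (or $\infty$) needed to extend <0> into a valid sequence ending in d
– E.g., [4,6]: dcost(6) = 1, dcost(7) = $\infty$, dcost(8) = 2
– dcost(d) = $\lceil d/G_2 \rceil$ when $\lceil d/G_2 \rceil \le \lfloor d/G_1 \rfloor$; else $\infty$
• Let T(i) := min OPS to turn <a_1, a_2, …, a_i> into a valid sequence <…, v> ending in v = a_i
• Suppose T(1), T(2), …, T(i-1) already computed
• $T(i) = \min_j \{\, T(j) + (i-1-j) + [\mathrm{dcost}(a_i - a_j) - 1] \,\}$
– $O\!\left(\frac{G_2}{G_2 - G_1}\, N \log N\right)$ algorithm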
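The recurrence above can be sketched as a plain quadratic dynamic program. This is a minimal illustration, not the faster algorithm the slide refers to; it assumes integer values and integer bounds 0 < G1 ≤ G2, and the names `dcost` and `sd_confidence` are chosen here for illustration.

```python
import math

def dcost(d, g1, g2):
    """Fewest steps, each of size in [g1, g2], that sum to d (inf if impossible).
    The last step lands on d itself, so dcost - 1 elements are truly inserted."""
    if d <= 0:
        return math.inf
    k = -(-d // g2)                      # ceil(d / g2)
    return k if k <= d // g1 else math.inf

def sd_confidence(seq, g1, g2):
    """Confidence (N - OPS) / N, with OPS the minimum number of
    insertions plus deletions that makes every gap fall in [g1, g2]."""
    n = len(seq)
    if n == 0:
        return 1.0
    T = [0] * n                          # T[i]: min ops on seq[:i+1], keeping seq[i]
    for i in range(n):
        best = i                         # delete every element before seq[i]
        for j in range(i):
            c = dcost(seq[i] - seq[j], g1, g2)
            if c < math.inf:
                # keep seq[j], delete the i-1-j elements in between, insert c-1
                best = min(best, T[j] + (i - 1 - j) + (c - 1))
        T[i] = best
    ops = min(T[i] + (n - 1 - i) for i in range(n))   # delete the tail after seq[i]
    return (n - ops) / n

print(sd_confidence([5, 9, 12, 25, 31, 30, 34, 40], 4, 6))    # 0.5
print(sd_confidence([5, 10, 15, 20, 30, 35, 40, 45], 5, 5))   # 0.875
```

The first call reproduces the 4/8 example (delete 12 and 31, insert 15 and 20); the second shows that a single missing value costs only one insertion.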
Tableau Discovery
• Assume underlying SD given
– Data often suggest ordering semantics
• Good tableau = small set of intervals
– Each interval satisfies confidence threshold
– Union satisfies support threshold
• Find maximal time intervals [i,j] s.t.
– Confidence satisfied in [i,j]
• Can we do better than testing all [i,j]’s?
Tableau Discovery: Candidates
• Relax constraint: confidence ≥ ĉ/(1+ε)
• For any interval I, there exists J s.t.
• (a) I ⊆ J and
• (b) |J| ≤ (1+ε)|I|
• conf(J) ≥ conf(I)/(1+ε)
[Figure: interval I contained in a slightly longer interval J]
Tableau Discovery: Candidates
• Test just enough intervals:
• (a) lengths 1, (1+δ), (1+δ)^2, …
• (b) starting points δ, δ(1+δ), δ(1+δ)^2, …
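One way to read the scheme above is sketched below: intervals of length (1+δ)^h are started at every multiple of δ(1+δ)^h. This reading, the real-valued half-open intervals, the clipping to [0, n), and the name `candidate_intervals` are all assumptions made for illustration; the talk's exact construction may differ.

```python
def candidate_intervals(n, delta):
    """Yield half-open intervals (start, end) inside [0, n).
    Level h uses length (1+delta)**h and start spacing delta*(1+delta)**h."""
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    length = 1.0
    while True:
        step = delta * length
        k = 0
        while k * step < n:
            start = k * step
            yield (start, min(start + length, n))   # clip at the right edge
            k += 1
        if length >= n:                             # last level spans everything
            break
        length *= 1 + delta

cands = list(candidate_intervals(400, 0.25))
print(len(cands), "candidates versus", 400 * 401 // 2, "integer intervals")
```

Under this reading, any interval I of length at least 1 is contained in some candidate J with |J| ≤ (1+δ)/(1-δ)·|I|, which corresponds to ε = 2δ/(1-δ) in the relaxation above.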
Tableau Discovery: Candidates
• Processing cost:
– Intervals at level h have length (1+δ)^h
– N/(δ(1+δ)^h) intervals at level h
– log_{1+δ} N total levels
– sum of lengths = O((N/δ) log_{1+δ} N) = O((N/δ^2) lg N)
• Improvement:
– Interval lengths in [A,2A] start at δA, 2δA, 3δA, …
– Prefix property
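The processing-cost bound can be spelled out as follows: each level contributes the same total length N/δ, and there are about log_{1+δ} N levels. The last step assumes 0 < δ ≤ 1, so that ln(1+δ) ≥ δ/2.

```latex
\sum_{h=0}^{\log_{1+\delta} N}
  \underbrace{\frac{N}{\delta(1+\delta)^h}}_{\text{intervals at level } h}
  \cdot
  \underbrace{(1+\delta)^h}_{\text{length}}
  \;=\; \frac{N}{\delta}\,\bigl(\log_{1+\delta} N + 1\bigr)
  \;=\; O\!\left(\frac{N}{\delta}\cdot\frac{\ln N}{\ln(1+\delta)}\right)
  \;=\; O\!\left(\frac{N}{\delta^{2}}\,\log N\right)
```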
Tableau Discovery: Assembly
• Optimal solution in quadratic time
• Greedy partial set cover
• Can implement in linear time
• Constant performance ratio
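The greedy partial set cover step can be illustrated with a deliberately simple version: repeatedly pick the interval that covers the most still-uncovered positions until the support threshold is met. This sketch is a straightforward implementation, not the linear-time one mentioned above; the name `greedy_tableau`, the closed integer intervals over positions 1..n, and the tie-breaking rule (first in input order) are assumptions.

```python
import math

def greedy_tableau(intervals, n, support):
    """Choose closed intervals (i, j) greedily until at least
    support * n of the positions 1..n are covered."""
    need = math.ceil(support * n)
    covered, chosen = set(), []
    remaining = list(intervals)

    def gain(iv):
        return len(set(range(iv[0], iv[1] + 1)) - covered)

    while len(covered) < need and remaining:
        best = max(remaining, key=gain)       # ties: earliest in the list
        if gain(best) == 0:
            break                             # nothing left to gain
        chosen.append(best)
        covered |= set(range(best[0], best[1] + 1))
        remaining.remove(best)
    return chosen

candidates = [(1, 6), (2, 11), (7, 12)]
print(greedy_tableau(candidates, 12, 0.5))    # [(2, 11)]
print(greedy_tableau(candidates, 12, 1.0))    # [(2, 11), (1, 6), (7, 12)]
```

The input is assumed to contain only intervals that already passed the confidence test.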
Summary of Results
• Tableau almost identical at small δ
• Significant speedup at small δ
• “Inflating” ĉ to (1+δ)ĉ works well
Experiments: Sample Tableau
Data: WeatherDates, conf ≥ 0.995, support ≥ 0.5, δ = 0.05
Experiments: Tableau Size
[Figure: tableau size; panels "Gaps in [0,∞)" and "Gaps in [0,5]"; DowJones data, support 0.5]
Experiments: Scalability
[Figure: scalability; panels "Gaps in [0,∞)" and "Gaps in [4,6]"]
Network data: support 0.5, conf 0.99
WeatherDates: support 0.5, conf 0.9
Case Study: Polled Data
conf ≥ 0.995, support ≥ 0.5, δ=0.05
Case Study: Stock Data
conf ≥ 0.995, support ≥ 0.5, δ=0.05
[Figure: Dow Jones 2-week moving average]
Conclusions
• Constraint-driven approach
– Define, discover, detect
• Use whatever semantics available
– Domain knowledge, expectation, etc.
• Model errors carefully
– Confidence measure
• Tableaux useful for summary
The End
Background
• Functional Dependency
– $X \to Y$: $\forall t_1, t_2:\ t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y]$
• Example
– title $\to$ salary
– What happens when data merged?
Background
ssn|name|title|salary
123|alice|manager|50
456|bob|sales|40
789|cathy|manager|50
• title $\to$ salary
ssn|name|company|title|salary
123|alice|ATT|manager|50
456|bob|ATT|sales|40
789|cathy|ATT|manager|50
012|david|IBM|engineer|30
345|emily|IBM|engineer|35
• [title,company] $\to$ salary?
– 100% support, 80% confidence
Hold Tableau (60% support, 100% confidence)
Company|Title|Salary
ATT|*|*
Fail Tableau (40% support, 50% confidence)
Company|Title|Salary
IBM|*|*
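The support and confidence figures on this slide can be reproduced with a short script. It is a sketch under one assumed definition: support is the fraction of rows matching a pattern, and confidence is the largest fraction of the matching rows that can be kept so that the dependency holds. The function names are hypothetical.

```python
from collections import Counter, defaultdict

ROWS = [
    {"ssn": "123", "name": "alice", "company": "ATT", "title": "manager",  "salary": 50},
    {"ssn": "456", "name": "bob",   "company": "ATT", "title": "sales",    "salary": 40},
    {"ssn": "789", "name": "cathy", "company": "ATT", "title": "manager",  "salary": 50},
    {"ssn": "012", "name": "david", "company": "IBM", "title": "engineer", "salary": 30},
    {"ssn": "345", "name": "emily", "company": "IBM", "title": "engineer", "salary": 35},
]

def fd_confidence(rows, lhs, rhs):
    """Largest fraction of rows that can be kept so that lhs -> rhs holds:
    within each lhs group, keep only the most common rhs value."""
    if not rows:
        return 1.0
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[c] for c in lhs)][r[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)

def pattern_stats(rows, company, lhs, rhs):
    """Support and confidence of the FD restricted to one company."""
    match = [r for r in rows if r["company"] == company]
    return len(match) / len(rows), fd_confidence(match, lhs, rhs)

lhs = ["title", "company"]
print(fd_confidence(ROWS, lhs, "salary"))          # 0.8
print(pattern_stats(ROWS, "ATT", lhs, "salary"))   # (0.6, 1.0)
print(pattern_stats(ROWS, "IBM", lhs, "salary"))   # (0.4, 0.5)
```

The IBM pattern fails because david and emily share title and company but differ in salary, so only one of the two rows can be kept.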
CFD Results
• Given FD, discover tableau:
– min tableau size
– subj. to support and confidence constraints
• Hardness:
– global conf: inapproximable
– local conf: NP-hard, fast approx algo