
Page 1

Analysis of Uncertain Data: Smoothing of Histograms

Eugene Fink, Ankur Sarin, Jaime G. Carbonell

Page 2

Density estimation problem

Convert a set of numeric data points to a smoothed approximation of the underlying probability density.

Example points: 11, 12, 17, 18, 19, 21, 22, 26, 27, 29

[Figure: the example points on an axis marked 10, 20, 30]

Page 3

Techniques

• Manual estimates

• Histograms

• Curve fitting

[Figures: one plot per technique, each on an axis marked 10, 20, 30]

Page 4

Generalized histograms

[Figure: example histogram on an axis marked 10, 20, 30]

0.2 chance: [11 .. 12]
0.5 chance: [17 .. 22]
0.3 chance: [26 .. 29]

General form

prob_1: [min_1 .. max_1]
prob_2: [min_2 .. max_2]
…
prob_n: [min_n .. max_n]

• Intervals do not overlap

• Probabilities sum to 1.0
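The general form and its two constraints can be written down as a small data structure. The following is a minimal sketch, not the authors' implementation: the names `Interval` and `is_valid` are hypothetical, and treating intervals that merely share an endpoint as non-overlapping is an assumption.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    """One entry 'prob: [lo .. hi]' of a generalized histogram (hypothetical name)."""
    prob: float
    lo: float
    hi: float


def is_valid(hist: List[Interval], tol: float = 1e-9) -> bool:
    """Check the two constraints: disjoint intervals, probabilities summing to 1."""
    ordered = sorted(hist, key=lambda iv: iv.lo)
    if any(iv.lo > iv.hi or iv.prob < 0 for iv in ordered):
        return False
    for left, right in zip(ordered, ordered[1:]):
        # Assumption: sharing an endpoint does not count as an overlap.
        if right.lo < left.hi:
            return False
    return math.isclose(sum(iv.prob for iv in ordered), 1.0, abs_tol=tol)


# The example from this slide.
example = [Interval(0.2, 11, 12), Interval(0.5, 17, 22), Interval(0.3, 26, 29)]
print(is_valid(example))
```

The slide's example passes both checks; two intervals such as [10 .. 20] and [15 .. 25] would be rejected as overlapping.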

Page 5

Special cases

• Standard histogram

• Set of points

• Weighted points
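One plausible reading of these special cases is sketched below. It assumes that a point is a zero-width interval [x .. x], that a point set gives every point the same probability, and that a standard histogram is a row of adjacent bins. The function names and the `(prob, lo, hi)` tuple layout are illustrative choices, not taken from the slides.

```python
from collections import Counter


def from_points(points):
    """Set of points: each distinct value x becomes [x .. x] with its relative frequency."""
    counts = Counter(points)
    n = len(points)
    return [(c / n, x, x) for x, c in sorted(counts.items())]


def from_weighted_points(weighted):
    """Weighted points: a list of (x, weight) pairs with distinct x; weights are normalized."""
    total = sum(w for _, w in weighted)
    return [(w / total, x, x) for x, w in sorted(weighted)]


def from_standard_histogram(edges, counts):
    """Standard histogram: adjacent bins given by their edges and per-bin counts."""
    total = sum(counts)
    return [(c / total, lo, hi) for c, lo, hi in zip(counts, edges, edges[1:])]


print(from_points([11, 12, 17]))
print(from_weighted_points([(5, 1), (3, 3)]))
print(from_standard_histogram([10, 20, 30], [2, 8]))
```

All three constructors return the same list-of-tuples form, so any later processing step can accept whichever input is available.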

Page 6

Smoothing problem

Given a generalized histogram, construct a coarser approximation of it.

[Figures: three plots, each on an axis marked 10, 20, 30]

Page 7

Input

• Initial distribution: A point set or a fine-grained histogram

• Distance function: A measure of similarity between distributions

• Target size: The number of intervals in an approximation

Page 8

Standard distance measures

• Simple difference: ∫ | p(x) − q(x) | dx

• Kullback-Leibler: ∫ p(x) · log (p(x) / q(x)) dx

• Jensen-Shannon: (Kullback-Leibler (p, (p+q)/2) + Kullback-Leibler (q, (p+q)/2)) / 2
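For piecewise-uniform densities these integrals reduce to finite sums, because both densities are constant between consecutive interval endpoints. The sketch below assumes histograms given as lists of `(prob, lo, hi)` tuples with positive widths, the natural logarithm, and the usual conventions that 0 · log(0/q) = 0 and that Kullback-Leibler is infinite where p > 0 and q = 0. All function names are illustrative.

```python
import math


def density(hist, x):
    """Density at x of a histogram given as a list of (prob, lo, hi) tuples."""
    for prob, lo, hi in hist:
        if lo <= x < hi:
            return prob / (hi - lo)
    return 0.0


def pieces(p, q):
    """Yield (width, p_density, q_density) for each piece on which both are constant."""
    cuts = sorted({e for _, lo, hi in p + q for e in (lo, hi)})
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2
        yield b - a, density(p, mid), density(q, mid)


def kl_term(dp, dq):
    """Integrand p * log(p / q) with the conventions stated in the lead-in."""
    if dp == 0:
        return 0.0
    if dq == 0:
        return math.inf
    return dp * math.log(dp / dq)


def simple_difference(p, q):
    return sum(w * abs(dp - dq) for w, dp, dq in pieces(p, q))


def kullback_leibler(p, q):
    return sum(w * kl_term(dp, dq) for w, dp, dq in pieces(p, q))


def jensen_shannon(p, q):
    total = 0.0
    for w, dp, dq in pieces(p, q):
        m = (dp + dq) / 2  # density of the mixture (p + q) / 2
        total += w * (kl_term(dp, m) + kl_term(dq, m)) / 2
    return total


fine = [(0.2, 11, 12), (0.5, 17, 22), (0.3, 26, 29)]
coarse = [(0.2, 11, 12), (0.8, 17, 29)]
print(simple_difference(fine, coarse))
print(kullback_leibler(fine, coarse), kullback_leibler(coarse, fine))
print(jensen_shannon(fine, coarse))
```

Kullback-Leibler is asymmetric and becomes infinite when the second distribution assigns zero density to a region where the first does not, as happens for `kullback_leibler(coarse, fine)` on the gap [22 .. 26]. Jensen-Shannon stays finite and symmetric in the same situation.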

Page 9

Smoothing algorithm

Repeat:
  Merge two adjacent intervals
Until the histogram has the target size

[Figure: plot on an axis marked 10, 20, 30]

Page 10

Interval merging

Before the merge: prob_1: [min_1 .. max_1] and prob_2: [min_2 .. max_2]

After the merge: prob_1 + prob_2: [min_1 .. max_2]

• For each potential merge, calculate the distance

• Perform the smallest-distance merge
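The merge rule and the greedy loop can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the distance of a merge is taken to be the simple difference ∫ | p(x) − q(x) | dx between the histogram before and after that merge, histograms are lists of `(prob, lo, hi)` tuples, and the names `merge_cost` and `smooth` are hypothetical. The loop rescans all adjacent pairs at every step, so it is quadratic in the number of intervals.

```python
def merge_cost(a, b):
    """Simple difference between two adjacent intervals and their merged interval."""
    (p1, lo1, hi1), (p2, lo2, hi2) = a, b
    width = hi2 - lo1
    if width == 0:
        return 0.0  # both intervals are the same single point
    mass = p1 + p2
    # Mass that the merged, uniform interval places on each original part.
    share1 = mass * (hi1 - lo1) / width
    share2 = mass * (hi2 - lo2) / width
    gap = mass * (lo2 - hi1) / width  # mass moved into the gap between them
    return abs(p1 - share1) + abs(p2 - share2) + gap


def smooth(hist, target_size, cost=merge_cost):
    """Greedily merge the cheapest adjacent pair until target_size intervals remain."""
    hist = sorted(hist, key=lambda iv: iv[1])
    while len(hist) > max(target_size, 1):
        i = min(range(len(hist) - 1), key=lambda k: cost(hist[k], hist[k + 1]))
        (p1, lo1, _), (p2, _, hi2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [(p1 + p2, lo1, hi2)]  # prob_1 + prob_2 on [min_1 .. max_2]
    return hist


fine = [(0.1, 10, 12), (0.1, 12, 14), (0.5, 14, 16), (0.3, 16, 18)]
print(smooth(fine, 3))
print(smooth(fine, 2))
```

In this small input the first two bins have the same density, so merging them changes nothing and costs zero; that merge is performed first. A different distance function can be passed through the `cost` parameter.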

Page 11

Smoothing examples: Normal distribution

[Figure panels: 5000 points; 200 intervals; 50 intervals; 10 intervals]

Page 12

Smoothing examples: Geometric distribution

[Figure panels: 5000 points; 200 intervals; 50 intervals; 10 intervals]

Page 13

Running time

• Theoretical: O(n · log n)

• Practical: O(n)
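The slides do not say how the O(n · log n) bound is obtained. One standard way to reach it for greedy adjacent merging is a priority queue of candidate merges with lazy deletion of outdated entries, sketched below. The merge cost, the `(prob, lo, hi)` tuple layout, and all names are assumptions made for illustration.

```python
import heapq


def merge_cost(a, b):
    """Simple difference between two adjacent intervals and their merged interval."""
    (p1, lo1, hi1), (p2, lo2, hi2) = a, b
    width = hi2 - lo1
    if width == 0:
        return 0.0
    mass = p1 + p2
    parts = (hi1 - lo1, hi2 - lo2)
    moved = sum(abs(p - mass * w / width) for p, w in zip((p1, p2), parts))
    return moved + mass * (lo2 - hi1) / width


def smooth_heap(hist, target_size, cost=merge_cost):
    items = sorted(hist, key=lambda iv: iv[1])
    n = len(items)
    prev = list(range(-1, n - 1))   # left neighbour of each slot, -1 if none
    nxt = list(range(1, n + 1))     # right neighbour of each slot, n if none
    alive = [True] * n
    version = [0] * n               # incremented whenever a slot's interval changes
    heap = [(cost(items[i], items[i + 1]), i, i + 1, 0, 0) for i in range(n - 1)]
    heapq.heapify(heap)
    size = n
    while size > max(target_size, 1) and heap:
        _, i, j, vi, vj = heapq.heappop(heap)
        if not (alive[i] and alive[j]) or version[i] != vi or version[j] != vj:
            continue  # outdated entry: one of the two intervals has changed
        # Merge slot j into slot i.
        items[i] = (items[i][0] + items[j][0], items[i][1], items[j][2])
        alive[j] = False
        version[i] += 1
        nxt[i] = nxt[j]
        size -= 1
        k = nxt[i]
        if k < n:
            prev[k] = i
            heapq.heappush(heap, (cost(items[i], items[k]), i, k, version[i], version[k]))
        h = prev[i]
        if h >= 0:
            heapq.heappush(heap, (cost(items[h], items[i]), h, i, version[h], version[i]))
    return [items[i] for i in range(n) if alive[i]]


fine = [(0.1, 10, 12), (0.1, 12, 14), (0.5, 14, 16), (0.3, 16, 18)]
print(smooth_heap(fine, 2))
```

Each merge pushes at most two new heap entries, so the total number of heap operations is linear in the number of intervals and each costs O(log n), assuming the cost of a single merge is computed in constant time.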

Page 14

Running time

3.4 GHz Pentium, C++ code

(2.5 ± 0.5) · num-points microseconds

[Figure: time (microsec) against number of points; both axes marked 10^2, 10^4, 10^6]

Page 15

Visual smoothing

We convert a piecewise-uniform distribution to a smooth curve by spline fitting.

The user usually prefers a smooth probability density.

[Figure: plot on an axis marked 10, 20, 30]
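The slides do not specify which spline is used or how its knots are chosen. As one possible illustration, the sketch below fits a natural cubic spline through the density height at the midpoint of each interval, with the curve pinned to zero at both ends of the range. It assumes contiguous intervals of positive width given as `(prob, lo, hi)` tuples; the function names are hypothetical. A curve fitted this way may dip below zero and is not renormalized, so it is suited to display only.

```python
import bisect


def density_knots(hist):
    """Knots for the curve: zero at both ends, density height at each interval midpoint."""
    xs = [hist[0][1]] + [(lo + hi) / 2 for _, lo, hi in hist] + [hist[-1][2]]
    ys = [0.0] + [p / (hi - lo) for p, lo, hi in hist] + [0.0]
    return xs, ys


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs[i], ys[i])."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the quadratic coefficients c.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] and c[n] stay 0 (zero second derivative at the ends).
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def spline(x):
        j = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        t = x - xs[j]
        return ys[j] + t * (b[j] + t * (c[j] + t * d[j]))

    return spline


hist = [(0.2, 10, 14), (0.5, 14, 16), (0.3, 16, 18)]
xs, ys = density_knots(hist)
curve = natural_cubic_spline(xs, ys)
print([round(curve(x), 4) for x in xs])
```

The curve passes through every knot by construction; between knots its shape depends on the spline type, so other choices of spline or knot placement would produce a different picture from the same histogram.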

Page 16

Main results

• Density estimation

• Lossy compression of generalized histograms

[Figures: three plots, each on an axis marked 10, 20, 30]

Page 17

Advantages

• Explicit specification of
  - Distance measure
  - Compression level

• Effective representation for automated reasoning