Statistical Learning Methods for Emerging Database Applications
Edward Chang
Associate Professor, Electrical Engineering, UC Santa Barbara
CTO, VIMA Technologies
DASFAA Tutorial, Kyoto, March 27, 2003
Useful Links
Related publications: http://www-db.stanford.edu/~echang/
Software free trial: http://www.imagebeagle.com
Locate objectionable images on your hard drives, before your boss finds them!
Outline
Statistical Learning
Emerging Applications: Data Characteristics
Classical Models
Kernel Methods
- Linear Model View
- Nearest Neighbor View
- Geometric View
Dimension Reduction Methods
Statistical Learning
Program computers to learn! Computers improve performance with experience at some task.
Example:
- Task: playing checkers
- Performance: % of games it wins
- Experience: playing against expert players
Statistical Learning
Task: Ŷ = f(U)
- Represented by some model(s)
- Implies a hypothesis
Performance
- Measured by error functions
Experience (L)
- Characterized by training data
Algorithm (Φ)
Supervised Learning
X: Data
U: Unlabeled pool; L: Labeled pool
G: Labels (regression or classification)
Φ: Learning algorithm
f = Φ(L), Ŷ = f(U)
Learning Algorithms
Linear Model
K-NN
Neural Networks
Decision Trees
Kernel Methods
Etc.
Classical Model
N: Number of training instances (N+, N-)
D: Dimensionality
N >> D, N → ∞ (e.g., PAC learnability)
N- ≈ N+
Emerging DB Applications
N < D
N+ << N-
Examples:
- Information retrieval with relevance feedback
- Gene profiling
Image Retrieval Demo
N < D: N < 50, D = 150
N+ << N-
(ACM SIGMOD 01; ACM MM 01, 02; IEEE CVPR 03)
SVMactive
[Demo screenshots over several slides]
Ranking
Gene Profiling Example: N = 59 cases, D = 4026 genes
Outline
Statistical Learning
Emerging Applications: Data Characteristics
Classical Models (Classification)
Kernel Methods
- Linear Model View
- Nearest Neighbor View
- Geometric View
Dimension Reduction Methods
Linear Model
Y = β0 + Σj=1..p βjXj, i.e., Y = XTβ
RSS(β) = (y − Xβ)T(y − Xβ)   (RSS: residual sum of squares)
β = (XTX)-1XTy (a sketch follows)
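A minimal NumPy sketch of the closed-form solution above, on synthetic data; the data set and variable names are illustrative, not from the tutorial:

```python
# Least squares: beta = (X^T X)^{-1} X^T y, on a small synthetic problem.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                                              # N instances, p features
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # prepend intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)               # linear signal plus noise

# Closed form; requires X^T X to be invertible (N >= D, full rank)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

rss = (y - X @ beta_hat) @ (y - X @ beta_hat)              # residual sum of squares
print(beta_hat, rss)
```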
Linear Model
Maximum Likelihood
Y = β0 + Σj=1..p βjXj = XTβ
Y = XTβ + ε
The noise terms ε are independent: ε ~ N(0, σ2)
P(y | βx) is a normal distribution with mean βx and variance σ2
Linear Model
P(y | βx) ~ N(βx, σ2)
Training: given (x1,y1), (x2,y2), …, (xn,yn), infer P(β | x1,…,xn, y1,…,yn) by Bayes' rule, or by the maximum likelihood estimate
Maximum Likelihood
For what β is P(y1,…,yn | x1,…,xn, β) maximized?
- Π P(yi | βxi) maximized?
- Π exp(-½((yi − βxi)/σ)2) maximized?
- Σ -½((yi − βxi)/σ)2 maximized?
- Σ (yi − βxi)2 minimized?
Least Squares Linear Model
Solution method #1:
RSS(β) = (y − Xβ)T(y − Xβ)
β = (XTX)-1XTy
Solution method #2 (for D > N):
- Gradient descent
- Perceptron
Other Linear Models
LDA: find the projection direction that minimizes the overlap between two Gaussian class distributions
Separating hyperplane
LDA
Separating Hyperplane
Maximum Margin Hyperplane
Linear Model Fits All Data?
How about Joining the Dots?
Ŷ(x) = (1/k) Σ yi over xi ∈ Nk(x), with k = 1 (see the sketch below)
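A minimal sketch of this joining-the-dots estimator in NumPy, assuming Euclidean distance; the tiny data set is illustrative:

```python
# k-NN: y_hat(x) = (1/k) * sum of y_i over the k nearest x_i.
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(d)[:k]               # indices of the k nearest neighbors
    return y_train[nearest].mean()            # average their labels (k=1 "joins the dots")

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 1.0, 0.0])
print(knn_predict(X_train, y_train, np.array([1.4]), k=1))  # exactly follows nearest point
print(knn_predict(X_train, y_train, np.array([1.4]), k=3))  # smoother average of 3 nearest
```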
Linear Models
N ≥ D: Least squares, LDA
D > N: Perceptron, maximum margin hyperplane
Linear Model Fits All?
NN with k = 1
Nearest Neighbor
Four things make a memory-based learner:
1. A distance function
2. K: how many neighbors to consider?
3. A weighting function (optional)
4. How to fit with the local points?
Problems
Fitting noise
Jagged boundaries
Solutions
Fitting noise: pick a larger K?
Jagged boundaries: introduce a kernel as a weighting function
NN with k = 15
3/27/2003 DASFAA Tutorial, Kyoto 39
NN
3/27/2003 DASFAA Tutorial, Kyoto 40
Nearest Neighbor -> Kernel Method
Four things make a memory-based learner:
1. A distance function
2. K: how many neighbors to consider? All of them
3. A weighting function: RBF kernels
4. How to fit with the local points? Predict using the kernel weights
Kernel Method
RBF weighting function
- The kernel width holds the key
- Use cross validation to find the "optimal" width (see the sketch below)
Fitting with the local points: where NN meets the Linear Model
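A hedged sketch of the RBF-weighted predictor with the width chosen by leave-one-out cross validation; plain NumPy, and the specific CV loop, data set, and candidate widths are assumptions for illustration:

```python
# Kernel-weighted prediction: every training point votes, weighted by an RBF
# of its distance; the width is picked by leave-one-out cross validation.
import numpy as np

def rbf_predict(X_train, y_train, x, width):
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * width ** 2))      # RBF weight per training point
    return np.sum(w * y_train) / np.sum(w)    # weighted average of labels

def loo_error(X, y, width):
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i         # leave example i out
        yhat = rbf_predict(X[mask], y[mask], X[i], width)
        errs.append((yhat - y[i]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
widths = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best = min(widths, key=lambda s: loo_error(X, y, s))  # "optimal" width by CV
print(best)
```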
LM vs. NN
Linear Model: f(x) is approximated by a global linear function; more stable, less flexible
Nearest Neighbor: k-NN assumes f(x) is well approximated by a locally constant function; less stable, more flexible
Between LM and NN: the other models…
Decision Theories
Bias & Variance Tradeoff
Bayes Prediction
VC Dimension
PAC Learnability
Variance vs. Bias
MSE(x0) = ET[f(x0) − ŷ0]2
        = ET[ŷ0 − ET(ŷ0)]2 + [ET(ŷ0) − f(x0)]2
Error = VarT(ŷ0) + Bias2(ŷ0)
Outline
Statistical Learning
Emerging Applications: Data Characteristics
Classical Models (Classification)
Kernel Methods
Dimension Reduction Methods
Where Are We and Where Am I Heading To?
LM and NN
Kernel Method from three views:
- LM view
- NN view
- Geometric view
Linear Model View
Y = β0 + Σ βjXj
Separating hyperplane:
max||β||=1 C subject to yi f(xi) ≥ C, i.e., yi(β0 + βTxi) ≥ C
Separating Hyperplane
Maximum Margin Hyperplane
Classifier Margin
Margin: the width of the boundary before hitting a data object
Maximum margin tends to minimize classification variance (no formal theory for this yet)
Separating Hyperplane
M’s Mathematical Representation
Plus-plane: {x : wx + b = +1}
Minus-plane: {x : wx + b = -1}
w ⊥ plus-plane: w(u − v) = 0 if u and v are on the plus-plane
w ⊥ minus-plane
Separating Hyperplane
M
Let x- be any point on the minus-plane
Let x+ be the closest plus-plane point to x-
x+ = x- + λw (why? because the line x+x- ⊥ minus-plane)
M = |x+ − x-|
M
1. wx- + b = -1
2. wx+ + b = 1
3. x+ = x- + λw
4. M = |x+ − x-|
5. w(x- + λw) + b = 1 (from 2 & 3)
6. wx- + b + λww = 1
7. λww = 2
M
1. λww = 2
2. λ = 2/(ww)
3. M = |x+ − x-| = |λw| = λ|w| = 2/|w|
4. Max M: gradient descent, simulated annealing, EM, Newton's method?
Max M
Max M = 2/|w| ⇔ min |w|/2 ⇔ min |w|2/2
subject to yi(xiw + b) ≥ 1, i = 1,…,N
A quadratic criterion with linear inequality constraints (see the sketch below)
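A minimal sketch of that constrained problem, handed to SciPy's generic SLSQP solver purely to illustrate the form; a real SVM would use a dedicated QP or SMO solver, and the toy data is an assumption:

```python
# Primal max-margin problem: minimize |w|^2 / 2 subject to y_i (x_i . w + b) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 2.0],        # class +1
              [-1.0, -1.0], [-2.0, -0.5], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

def objective(theta):                 # theta = (w1, w2, b)
    w = theta[:2]
    return 0.5 * w @ w

constraints = [{'type': 'ineq',       # SLSQP convention: fun(theta) >= 0
                'fun': lambda theta, i=i: y[i] * (X[i] @ theta[:2] + theta[2]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, 2.0 / np.linalg.norm(w))  # margin M = 2 / |w|
```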
Max M
Min |w|2/2 subject to yi(xiw + b) ≥ 1, i = 1,…,N
Lp = minw,b |w|2/2 − Σi=1..N αi[yi(xiw + b) − 1]
Setting the derivatives to zero:
w = Σi=1..N αiyixi
0 = Σi=1..N αiyi
Wolfe Dual
Ld = Σi=1..N αi − ½ Σi,j=1..N αiαjyiyj(xi·xj)
Subject to αi ≥ 0 and αi[yi(xiw + b) − 1] = 0 (the KKT conditions):
- αi > 0 ⇒ yi(xiw + b) = 1 (support vectors)
- αi = 0 ⇒ yi(xiw + b) > 1
Class Prediction
yq = w·xq + b
w = Σi=1..N αiyixi
yq = sign(Σi=1..N αiyi(xi·xq) + b)
Non-separable Classes
Soft margin hyperplane
Basis expansion
Non-separable Case
Soft Margin SVMs
Hard margin: min |w|2/2 subject to yi(xiw + b) ≥ 1, i = 1,…,N
Soft margin: min |w|2/2 + C Σ εi subject to
- xiw + b ≥ +1 − εi if yi = +1
- xiw + b ≤ -1 + εi if yi = -1
- εi ≥ 0
(see the sketch below)
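A hedged sketch using scikit-learn's SVC (the tutorial names no library, so this choice is an assumption): fit a soft-margin linear SVM, then reproduce its decision values by hand from the dual form yq = sign(Σ αi yi (xi·xq) + b):

```python
# Soft-margin linear SVM; C bounds the dual variables: 0 <= alpha_i <= C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=+1.5, size=(30, 2)),
               rng.normal(loc=-1.5, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only, so the
# decision value is sum_i alpha_i y_i (x_i . x_q) + b over support vectors.
scores = clf.dual_coef_ @ (clf.support_vectors_ @ X.T) + clf.intercept_
print(np.allclose(scores.ravel(), clf.decision_function(X)))  # True
print(len(clf.support_vectors_), "support vectors")
```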
Non-separable Case
Wolfe Dual
Ld = Σi=1..N αi − ½ Σi,j=1..N αiαjyiyj(xi·xj)
Subject to C ≥ αi ≥ 0 and Σ αiyi = 0 (KKT conditions)
yq = sign(Σi=1..N αiyi(xi·xq) + b)
Basis Function
3/27/2003 DASFAA Tutorial, Kyoto 69
Harder 1D Example
3/27/2003 DASFAA Tutorial, Kyoto 70
Basis Function
Φ(X) = (x, x2)
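A minimal sketch of this lift: 1-D points that no single threshold separates become linearly separable as (x, x2); the tiny data set is illustrative:

```python
# Basis expansion Phi(x) = (x, x^2) makes a "middle vs. outside" 1-D problem
# linearly separable in the lifted 2-D space.
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, 1, 1])        # middle points form the negative class

phi = np.column_stack([x, x ** 2])        # lift each point to (x, x^2)

# In lifted coordinates the rule x^2 >= 2.0 is a straight line separating the
# classes perfectly, which no single threshold on x alone can do.
pred = np.where(phi[:, 1] >= 2.0, 1, -1)
print(np.array_equal(pred, y))            # True
```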
Harder 1D Example
Some Basis Functions
Φ(X) = Σ γmhm(X), where each hm: Rp → R
Common choices: polynomials, radial basis functions, sigmoid functions
Wolfe Dual with Basis Expansion
Ld = Σi=1..N αi − ½ Σi,j=1..N αiαjyiyj Φ(xi)·Φ(xj)
Subject to C ≥ αi ≥ 0 and Σ αiyi = 0 (KKT conditions)
yq = sign(Σi=1..N αiyi(Φ(xi)·Φ(xq)) + b)
K(xi, xj) = Φ(xi)·Φ(xj): the kernel function!
Quadratic Basis Functions
Φ(X) = {1, xi, xixj}, i, j = 1..p
(p+1)(p+2)/2 terms, i.e., O(p2) features and O(p2) cost per dot product
The dot product is equivalent to (xi·xj + 1)2, computable at O(p) cost
Total cost over all training pairs: O(N2p) (see the numeric check below)
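A quick numeric check of that equivalence; the √2 scaling on the linear terms is the standard bookkeeping needed for the identity to hold exactly, and everything here is illustrative:

```python
# Kernel trick: the explicit quadratic basis expansion (O(p^2) features) and
# the kernel (x . z + 1)^2 (O(p) work) produce identical dot products.
import numpy as np

def phi(x):
    # (1, sqrt(2)*x_i, x_i*x_j over all ordered pairs) -> 1 + p + p^2 features
    return np.concatenate([[1.0], np.sqrt(2.0) * x, np.outer(x, x).ravel()])

rng = np.random.default_rng(3)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(z)           # O(p^2) feature-space dot product
kernel = (x @ z + 1.0) ** 2          # O(p) kernel evaluation
print(np.isclose(explicit, kernel))  # True
```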
Dot Product Saves the Day
With the kernel trick: O(N2p) for any degree
Explicit basis expansion: quadratic O(N2p2), cubic O(N2p3), quartic O(N2p4)
Quiz
What is the signature of a polynomial kernel function of degree d?
(xi·xj + 1)d
Nearest Neighbor View
Z: a set of zero-mean jointly Gaussian random variables
Each Zi corresponds to one example xi
Cov(zi, zj) = K(xi, xj)
yi, the label of zi, is +1 or -1
P(yi | zi) = σ(yi zi)
Training Data
General Kernel Classifier [Jaakkola et al. 99]
MAP classification for xt:
yt = sign(Σ αi yi K(xt, xi)), where K(xi, xj) = Cov(zi, zj) (some similarity function)
Supervised training: compute the αi given X and y and an error function such as
J(α) = -½ Σ αi αj yi yj K(xi, xj) + Σ F(αi)
Leave One Out
3/27/2003 DASFAA Tutorial, Kyoto 81
SVMs
yt = sign(Σ αi yi K(xt, xi))
(yi, xi): training data; αi nonnegative; kernel K positive definite
The αi are obtained by maximizing J(α) = -½ Σ αi αj yi yj K(xi, xj) + Σ F(αi) with F(αi) = αi,
subject to αi ≥ 0 and Σ yiαi = 0
SVMs
3/27/2003 DASFAA Tutorial, Kyoto 83
Important Insight
K(xi, xj) = Cov(zi, zj). To design a kernel is to design a similarity function that produces a positive definite covariance matrix on the training instances.
Basis Function Selection
Three general approaches:
- Restriction methods: limit the class of functions
- Selection methods: scan the dictionary adaptively (boosting)
- Regularization methods: use the entire dictionary but restrict the coefficients (ridge regression)
Overfitting?
Probably not, because:
- There are N free parameters (not D)
- The margin is maximized
Geometrical View
S = wX + b, with |w| = 1 and b = 0
V = {w | yi f(xi) > 0, i = 1..n, |w| = 1} (the version space)
The SVM is the center of the largest sphere contained in V
SVMs
3/27/2003 DASFAA Tutorial, Kyoto 88
BPMs
Bayes objective function:
Ŝt = BayesZ(xt) = argminSi∈S EH|Z=x[l(H(x), Si)]
BPMs [Herbrich et al. 2001]:
Abp = argminh∈H Ex[EH|Z=x[l(H(x), h(x))]]
BPMs
Linear classifier; the input X possesses a spherical Gaussian density
The Bayes point is the center of mass of the version space
BPMs vs. SVMs
BPMs
Use SVMs to find a good h in H, then find the Bayes point via:
- the billiard algorithm [Herbrich et al. 2001]
- the perceptron algorithm [Herbrich et al. 2001]
Billiard Ball Algorithm (R. Herbrich)
Outline
Statistical Learning
Emerging Applications: Data Characteristics
Classical Models (Classification)
Kernel Methods
Dimension Reduction Methods
Dimensionality Curse
D: the data dimension. When D increases:
- Nearest neighbors are no longer local
- All points become nearly equidistant
Sparse High-D Space [C. Aggarwal et al., ICDT 2001]
Hypercube range queries: a query cube of side s in the unit data cube selects a fraction
P = sd
Sparse High-D Space
Spherical Range Queries
P[R ∈ sp(Q, 0.5)] = (0.5)d · πd/2 / Γ(d/2 + 1)
(the fraction of the unit cube covered by a sphere of radius 0.5; see the sketch below)
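A small sketch evaluating both selectivity formulas for data uniform in the unit hypercube; the side s = 0.95 and the dimensions tried are illustrative assumptions:

```python
# Curse of dimensionality: a cube query of side s captures fraction s^d, and a
# radius-0.5 ball centered in the unit cube captures pi^(d/2) * 0.5^d / Gamma(d/2 + 1).
# Both fractions collapse toward zero as d grows.
import math

for d in (2, 10, 50, 100):
    cube = 0.95 ** d                                           # side s = 0.95
    ball = math.pi ** (d / 2) * 0.5 ** d / math.gamma(d / 2 + 1)
    print(f"d={d:3d}  cube(s=0.95): {cube:.3e}  inscribed ball: {ball:.3e}")
```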
Dimensionality Curse
So?
Is the nearest neighbor estimate cursed in high-D spaces?
Yes! When D is large and N is relatively small, the estimate is off!
Are We Doomed?
How does the curse affect classification?
- Similar objects still tend to cluster together
- Classification only makes a binary prediction
Distribution of Distances
3/27/2003 DASFAA Tutorial, Kyoto 105
Some Solutions to High-D
Restricted estimators: specify the nature of the local neighborhood
Adaptive feature reduction: PCA, LDA
Dynamic Partial Function (DPF)
Three Major Paradigms
Preserve the data description in a lower dimensional space: PCA
Maximize discriminability in a lower dimensional space: LDA
Activate only similar channels: DPF
Minkowski Distance
For objects P and Q: D = (ΣM |pi − qi|n)1/n
Assumes similar images are similar in all M features (see the sketch below)
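A minimal sketch of the Minkowski distance, with the usual absolute value (n = 1 gives Manhattan, n = 2 Euclidean); the vectors are illustrative:

```python
# Minkowski distance D = (sum_i |p_i - q_i|^n)^(1/n) over all M features.
import numpy as np

def minkowski(p, q, n=2):
    return np.sum(np.abs(p - q) ** n) ** (1.0 / n)

p = np.array([0.1, 0.4, 0.9])
q = np.array([0.2, 0.1, 0.5])
print(minkowski(p, q, 1), minkowski(p, q, 2))
```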
[Figures: two histograms of feature-distance frequency (log-scale frequency, 1.0E-06 to 1.0E-01, vs. feature distance, 0 to 0.95)]
Weighted Minkowski Distance
D = (ΣM wi|pi − qi|n)1/n
Assumes similar images are similar in the same subset of the M features
[Figures and raw distance matrices: average distance per feature (average distance vs. feature number, 1-144) under four transformations: GIF, scale up/down, cropping, rotation]
Similarity Theories
Objects are similar in all respects (Richardson 1928)
Objects are similar in some respects (Tversky 1977)
Similarity is a process of determining respects, rather than using predefined respects (Goldstone 94)
DPF
Which place is similar to Kyoto?
Partial
Dynamic
Dynamic Partial Function (a sketch follows)
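A hedged sketch of DPF after Li & Chang (2003): sum only the m smallest of the M per-feature differences, so the respects in which two objects are compared are chosen dynamically per pair; m, r, and the 144-feature vectors are illustrative assumptions:

```python
# Dynamic Partial Function: activate only the m most similar channels per pair.
import numpy as np

def dpf(p, q, m=100, r=2):
    diffs = np.abs(p - q)
    active = np.sort(diffs)[:m]             # keep the m smallest feature differences
    return np.sum(active ** r) ** (1.0 / r)

rng = np.random.default_rng(4)
p, q = rng.random(144), rng.random(144)     # e.g. 144 image features
print(dpf(p, q, m=100))
```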
Precision/Recall
Summary
Statistical Learning
Emerging Applications: Data Characteristics
Classical Models (Classification)
Kernel Methods
- Linear Model View
- Nearest Neighbor View
- Geometric View
Dimension Reduction Methods
Emerging DB Applications
N < D
N+ << N-
Examples:
- Information retrieval with relevance feedback
- Gene profiling
- Bioinformatics
Useful Links
Related publications: http://www-db.stanford.edu/~echang/
Software free trial: http://www.imagebeagle.com
Locate objectionable images on your hard drives, before your boss finds them!
References
1. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, N.Y., 2001
2. T. Mitchell, Machine Learning, 1997
3. D. Donoho, High-Dimensional Data Analysis, American Math. Society Lecture, 2000
4. S. Tong and E. Chang, Support Vector Machine Active Learning for Image Retrieval, ACM MM, 2001
5. B. Li and E. Chang, Dynamic Partial Function, ACM Journal, 2003
6. D. Chudova and P. Smyth, Pattern Discovery in Sequences under a Markov Assumption, ACM KDD, 2002
7. R. Herbrich, T. Graepel, and C. Campbell, Bayes Point Machines, Journal of Machine Learning Research, 2001
8. V. Vapnik, The Nature of Statistical Learning Theory, Springer, N.Y., 1995
9. T. Jaakkola and D. Haussler, Probabilistic Kernel Regression Models, Conference on AI and Statistics, 1999
10. A. Moore, Support Vector Machines, Lecture Notes, CMU
11. C. Aggarwal, A. Hinneburg, and D. Keim, On the Surprising Behavior of Distance Metrics in High-Dimensional Space, ICDT 2001