TRANSCRIPT
Support Vector Machines, Kernels, and Development of Representations
Tim Oates
Cognition, Robotics, and Learning (CORAL) Lab
University of Maryland Baltimore County
Outline
- Prediction
- Support vector machines
- Kernels
- Development of representations
Outline
- Prediction: Why might predictions be wrong?
- Support vector machines: Doing really well with linear models
- Kernels: Making the non-linear linear
- Development of representations: Learning representations by learning kernels; Beyond the veil of perception
Supervised ML = Prediction
- Given training instances (x, y)
- Learn a model f
- Such that f(x) = y
- Use f to predict y for new x
- Many variations on this basic theme
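The theme above can be sketched in a few lines. The 1-nearest-neighbor rule here is just an illustrative, hypothetical choice of f, not a method from the talk:

```python
# A minimal instance of the supervised-learning theme: from training
# pairs (x, y), build a model f, then use f to predict y for a new x.
def fit_1nn(pairs):
    def f(x):
        # Predict the label of the closest training instance.
        nearest_x, nearest_y = min(pairs, key=lambda p: abs(p[0] - x))
        return nearest_y
    return f

train = [(0.0, -1), (1.0, 1)]   # (x, y) with y in {-1, 1}
f = fit_1nn(train)
prediction = f(0.9)             # predict y for a new, unseen x
```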
Why Might Predictions Be Wrong?
- True non-determinism
- Partial observability (hard, soft)
- Representational bias
- Algorithmic bias
- Bounded resources
True Non-Determinism
- Flip a biased coin: p(heads) = θ
- Estimate θ from data
- If the estimate is > 0.5 predict heads, else tails
- Lots of ML research on problems like this
  - Learn a model
  - Do the best you can in expectation
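A minimal sketch of this slide, assuming an illustrative true bias of 0.7 (not a value from the talk):

```python
import random

random.seed(0)

# Simulate flips of a coin with true p(heads) = 0.7, estimate the bias
# from the sample, and predict the majority outcome in expectation.
p_true = 0.7
flips = [random.random() < p_true for _ in range(1000)]

p_hat = sum(flips) / len(flips)            # estimate of p(heads)
prediction = "heads" if p_hat > 0.5 else "tails"
```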
Partial Observability
- Something needed to predict y is missing from observation x
- N-bit parity problem
  - x contains N-1 bits (hard PO)
  - x contains N bits but the learner ignores some of them (soft PO)
Representational Bias
- Having the right features (x) is crucial
- TD-Gammon
[Figure: scatter of X and O training instances]
Development of Representations
- The agent (e.g. a robot)
  - Fixed sensor suite
  - Multiple tasks
  - Changing environment
  - Interaction with humans
- Representations
  - Sensors, concepts, beliefs, desires, intentions
  - Expressed solely in terms of observables
  - Hand-coded, thus fixed
Other Reasons for Wrong Predictions
- Algorithmic bias
- Bounded resources
Support Vector Machines
Doing Really Well with Linear Decision Surfaces
Strengths of SVMs
- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find the globally best model
- Efficient algorithms
- Amenable to the kernel trick
Linear Separators
- Training instances
  - x ∈ R^n
  - y ∈ {-1, 1}
- Model parameters
  - w ∈ R^n
  - b ∈ R
- Hyperplane
  - <w, x> + b = 0
  - w1 x1 + w2 x2 + … + wn xn + b = 0
- Decision function
  - f(x) = sign(<w, x> + b)
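The decision function is a one-liner; w and b below are made-up values for a hypothetical separator in R^2:

```python
import numpy as np

def f(w, b, x):
    """Decision function f(x) = sign(<w, x> + b)."""
    return int(np.sign(np.dot(w, x) + b))

# Hypothetical separator: points with x1 + x2 > 1 are labeled +1.
w = np.array([1.0, 1.0])
b = -1.0
pos = f(w, b, np.array([2.0, 2.0]))    # <w, x> + b = 3  -> +1
neg = f(w, b, np.array([0.0, 0.0]))    # <w, x> + b = -1 -> -1
```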
Why Maximize Margin?
- Increasing the margin reduces capacity
- Must restrict capacity to generalize
  - m training instances
  - 2^m ways to label them
  - What if some function class can separate all of these labelings?
  - Then it shatters the training instances
- The VC dimension is the largest m such that the function class can shatter some set of m points
Bounding Generalization Error
- R[f] = risk, test error
- R_emp[f] = empirical risk, training error
- h = VC dimension
- m = number of training instances
- δ = probability that the bound does not hold
- With probability 1 - δ:
  R[f] ≤ R_emp[f] + sqrt( (1/m) ( h (ln(2m/h) + 1) + ln(4/δ) ) )
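The bound R[f] ≤ R_emp[f] + sqrt((1/m)(h(ln(2m/h) + 1) + ln(4/δ))) can be evaluated numerically; the values below are illustrative, not from the talk, and show that a smaller VC dimension h tightens the bound for the same training error:

```python
import math

def vc_bound(r_emp, h, m, delta):
    """Upper bound on test error that holds with probability 1 - delta."""
    complexity = (h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m
    return r_emp + math.sqrt(complexity)

loose = vc_bound(r_emp=0.05, h=100, m=1000, delta=0.05)  # high-capacity class
tight = vc_bound(r_emp=0.05, h=10, m=1000, delta=0.05)   # low-capacity class
```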
The Math
- Training instances
  - x ∈ R^n
  - y ∈ {-1, 1}
- Decision function
  - f(x) = sign(<w, x> + b)
  - w ∈ R^n
  - b ∈ R
- Find w and b that
  - Perfectly classify the training instances (assuming linear separability)
  - Maximize the margin
The Math
- For perfect classification, we want
  - y_i (<w, x_i> + b) ≥ 0 for all i
  - Why?
- To maximize the margin, we want
  - the w that minimizes |w|^2
Dual Optimization Problem
- Maximize over α:
  W(α) = Σ_i α_i - 1/2 Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>
- Subject to
  - α_i ≥ 0
  - Σ_i α_i y_i = 0
- Decision function
  - f(x) = sign(Σ_i α_i y_i <x, x_i> + b)
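The dual-form decision function depends on the data only through inner products. The multipliers below are made up for illustration (a real QP solver would produce them by maximizing W(α)); they do satisfy the constraint Σ_i α_i y_i = 0:

```python
import numpy as np

def dual_decision(alphas, ys, xs, b, x):
    """f(x) = sign(sum_i alpha_i y_i <x, x_i> + b)."""
    s = sum(a * y * np.dot(x, xi) for a, y, xi in zip(alphas, ys, xs))
    return int(np.sign(s + b))

# Hypothetical support vectors and multipliers.
xs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
ys = [1, -1]
alphas = [0.5, 0.5]   # satisfies sum_i alpha_i y_i = 0
b = 0.0
```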
What if Data Are Not Perfectly Linearly Separable?
- Cannot find w and b that satisfy
  y_i (<w, x_i> + b) ≥ 1 for all i
- Introduce slack variables ξ_i
  y_i (<w, x_i> + b) ≥ 1 - ξ_i for all i
- Minimize
  |w|^2 + C Σ_i ξ_i
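For a fixed w and b, the smallest feasible slack for each instance is ξ_i = max(0, 1 - y_i(<w, x_i> + b)), which gives the objective on this slide a direct form. A small sketch with made-up data, where the last point violates the margin:

```python
import numpy as np

def soft_margin_objective(w, b, C, X, y):
    """|w|^2 + C * sum_i xi_i, with the smallest feasible slacks
    xi_i = max(0, 1 - y_i (<w, x_i> + b))."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)
    return float(np.dot(w, w) + C * slacks.sum())

w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])  # last point inside margin
y = np.array([1.0, -1.0, 1.0])
obj = soft_margin_objective(w, b, C=1.0, X=X, y=y)   # |w|^2 + 1 * 0.5
```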
Strengths of SVMs
- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find the globally best model
- Efficient algorithms
- Amenable to the kernel trick …
What if the Surface is Non-Linear?
[Figure: X and O instances arranged so that no line separates them]
Mapping into a New Feature Space
- Rather than run the SVM on x_i, run it on Φ(x_i)
- Find a non-linear separator in input space
- What if Φ(x_i) is really big?
- Use kernels to compute it implicitly!
- Φ : x → Φ(x)
- Example: Φ(x1, x2) = (x1, x2, x1^2, x2^2, x1 x2)
Kernels
- Find a kernel K such that
  K(x1, x2) = <Φ(x1), Φ(x2)>
- Computing K(x1, x2) should be efficient, much more so than computing Φ(x1) and Φ(x2)
- Use K(x1, x2) in the SVM algorithm rather than <x1, x2>
- Remarkably, this is possible
The Polynomial Kernel
- K(x1, x2) = <x1, x2>^2
- x1 = (x11, x12), x2 = (x21, x22)
- <x1, x2> = x11 x21 + x12 x22
- <x1, x2>^2 = x11^2 x21^2 + x12^2 x22^2 + 2 x11 x12 x21 x22
- Φ(x1) = (x11^2, x12^2, √2 x11 x12)
- Φ(x2) = (x21^2, x22^2, √2 x21 x22)
- K(x1, x2) = <Φ(x1), Φ(x2)>
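The identity on this slide can be checked numerically: squaring the ordinary inner product gives exactly the inner product in the explicit 3-dimensional feature space.

```python
import numpy as np

def phi(x):
    """Explicit feature map from the slide: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def K(x1, x2):
    """Implicit computation: K(x1, x2) = <x1, x2>^2."""
    return np.dot(x1, x2) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
implicit = K(x1, x2)                    # never leaves R^2
explicit = np.dot(phi(x1), phi(x2))    # maps into R^3 first
```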
The Polynomial Kernel
- Φ(x) contains all monomials of degree d
- Useful in visual pattern recognition
- Number of monomials
  - 16x16 pixel image
  - 10^10 monomials of degree 5
- Never explicitly compute Φ(x)!
- Variation: K(x1, x2) = (<x1, x2> + 1)^2
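The 10^10 figure follows from counting monomials with repetition, C(n + d - 1, d), which for a 16x16 image and degree 5 is roughly 9.5 x 10^9:

```python
from math import comb

# Monomials of degree d in n variables, with repetition: C(n + d - 1, d).
n = 16 * 16    # pixels in a 16x16 image
d = 5
num_monomials = comb(n + d - 1, d)   # on the order of 10^10
```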
Kernels
- What does it mean to be a kernel?
  - K(x1, x2) = <Φ(x1), Φ(x2)> for some Φ
- What does it take to be a kernel?
  - The Gram matrix G_ij = K(x_i, x_j)
  - Positive definite matrix: Σ_{i,j} c_i c_j G_ij ≥ 0 for all c_i, c_j
  - Positive definite kernel: for all samples of size m, induces a positive definite Gram matrix
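The Gram-matrix condition can be checked on a sample via eigenvalues (all non-negative is equivalent to Σ c_i c_j G_ij ≥ 0 for symmetric G). A sketch on made-up sample points:

```python
import numpy as np

def gram_is_psd(kernel, xs, tol=1e-10):
    """Build G_ij = K(x_i, x_j) on a sample and test positive
    definiteness via the eigenvalues of the (symmetric) matrix G."""
    G = np.array([[kernel(a, b) for b in xs] for a in xs])
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

sample = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
ok = gram_is_psd(lambda a, b: np.dot(a, b), sample)    # dot-product kernel
bad = gram_is_psd(lambda a, b: -np.dot(a, b), sample)  # not a kernel
```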
A Few Kernels
- Dot product kernel: K(x1, x2) = <x1, x2>
- Polynomial kernel: K(x1, x2) = <x1, x2>^d (monomials of degree d)
- Gaussian kernel: K(x1, x2) = exp(-|x1 - x2|^2 / 2σ^2) (radial basis functions)
- Sigmoid kernel: K(x1, x2) = tanh(<x1, x2> + θ) (neural networks)
- Establishing "kernel-hood" from first principles is non-trivial
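The four kernels above are a few lines each; the default parameter values are illustrative:

```python
import numpy as np

def dot_kernel(x1, x2):
    return np.dot(x1, x2)

def poly_kernel(x1, x2, d=2):
    return np.dot(x1, x2) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    diff = x1 - x2
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def sigmoid_kernel(x1, x2, theta=0.0):
    return np.tanh(np.dot(x1, x2) + theta)
```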
The Kernel Trick
“Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2”
SVMs can use the kernel trick
Exotic Kernels
- Strings
- Trees
- Graphs
- The hard part is establishing kernel-hood
Motivation, Again
- The agent (e.g. a robot)
  - Fixed sensor suite
  - Multiple tasks
  - Changing environment
  - Interaction with humans
- Representations
  - Sensors, concepts, beliefs, desires, intentions
  - Expressed solely in terms of observables
  - Hand-coded, thus fixed
An Old Idea
- Constructive induction
  - Search over feature space
  - Generate new features from existing ones: f_new = f1 * f2
  - Test new features by running the learning algorithm
  - Retain those that improve performance
Making an Old Idea New
- Co-chaired the Workshop on Development of Representations at ICML 2002
- Conclusions
  - Work in this area is vitally important
  - Hand-coded representations keep humans in the loop, with limited time and ingenuity
  - TD-Gammon
  - Need a big success to draw more attention from the community
Change of Feature Space
Closure operations produce new kernels from old ones:
- K(x, y) = λ K1(x, y) + (1 - λ) K2(x, y)
- K(x, y) = λ K1(x, y)
- K(x, y) = K1(x, y) K2(x, y)
- K(x, y) = f(x) f(y)
- K(x, y) = K3(Φ(x), Φ(y))
- … and so on
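Two of these closure operations, sketched with the dot-product and degree-2 polynomial kernels as the hypothetical starting points:

```python
import numpy as np

def k1(x, y):
    return np.dot(x, y)          # dot-product kernel

def k2(x, y):
    return np.dot(x, y) ** 2     # degree-2 polynomial kernel

def convex_comb(x, y, lam=0.5):
    """lam*K1 + (1 - lam)*K2 is again a kernel for lam in [0, 1]."""
    return lam * k1(x, y) + (1.0 - lam) * k2(x, y)

def product(x, y):
    """The product of two kernels is again a kernel."""
    return k1(x, y) * k2(x, y)

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
```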
Searching Through Feature Spaces (not Feature Space)
- Each kernel corresponds to a distinct feature space
- Search over kernel space (feature spaces)
  - Start with known kernels
  - Operators generate new (composite) kernels
  - Evaluate according to performance
- Keep the learning algorithm constant, vary the representation
Current Work in the Lab
- Identify/construct datasets where existing kernels are insufficient
- Determine the conditions under which this occurs
- Search kernel space by hand
- Implement automated search methods
- Develop naturally occurring datasets
  - Robotic applications
- Methods for triggering search over kernel space
Beyond the Veil of Perception
- Human perception is limited
  - Sight, hearing, taste, smell, touch
- We overcome this limitation
  - Genes, atoms, gravity, tectonic plates, germs, dark matter, electricity, black holes
  - We can't see, hear, taste, smell, or touch a black hole
  - Most physicists believe they exist
- Theoretical entities
  - Causally efficacious entities of the world that cannot be sensed directly
Our Claim
- Theoretical entities are of fundamental importance to the development of knowledge in individuals
  - Core concepts in which all others ground out
  - Weight, distance, velocity, orientation, animacy
  - Beliefs, desires, intentions - theory of mind
- Robots need algorithms for discovering, validating, and using theoretical entities if they are to move beyond the veil of perception
Discovering Theoretical Entities
- Actively explore the environment
  - Child/robot as scientist
- Posit TEs to explain non-determinism in action outcomes
- Keep those that are mutually predictive
- Flipping a coin
  - Posit a bit somewhere in the universe whose value determines the outcome
  - Flip the coin to determine the current value of that bit
  - Use the bit's value to make other predictions
The Concept "Weight"
- Do you have a weight sensor?
- Are children born with knowledge that objects have a property called weight?
- How does this concept arise?
- We claim that "weight" is a theoretical entity, just like dark matter and black holes
The Crate-Stacking Robot
[Figure: a causal model linking slide (fast/slow), force (high/low), and stack (topple/stable) through a posited "weight" variable (heavy/light)]
What Did the Robot Accomplish?
- Prediction
  - Push a crate and predict the outcome of stacking
- Control
  - Build taller, more stable stacks
- Understanding
  - Acquired a fundamental concept about the world, grounded in action
  - Extended its innate sensory endowment
  - "Weight" cannot be represented as a simple combination of sensor data
Triggering Search for Kernels
- Why might predictions about action outcomes be wrong?
  - Recall the first few slides
- TEs directly address partial observability
- If the representation is wrong, action outcomes that are deterministic may appear to be random
- Use TEs to trigger a search for a new feature space that renders these outcomes (nearly) deterministic
Conclusion
- SVMs find the optimal linear separator
- The kernel trick makes SVMs non-linear learning algorithms
- Development of representations
  - Searching through feature spaces
  - Triggering search with theoretical entities