artificial intelligence ML u4
TRANSCRIPT
8/12/2019

Machine Learning
Introduction
What is learning? Learning is any process by which a system improves performance from experience.
- Herbert Simon
Learning is constructing or modifying representations of what is being experienced.
- Ryszard Michalski
Why learn?
Build software agents that can adapt to their users, to other software agents, or to changing environments:
Personalized news or mail filter
Personalized tutoring
Mars robot
Discover new things or structure that were previously unknown to humans:
Examples: data mining, scientific discovery
ML as a subfield of AI is concerned with the design and development of algorithms and techniques that allow computers to learn. Simulation of intelligence requires features such as knowledge acquisition, inference, and updating or refinement of the knowledge base. Thus we can sum up by saying that learning is an important aspect of intelligence.
Types of Learning Methodologies
Inductive learning: Required rules and patterns are extracted from massive data sets.
Deductive learning: Deducing new knowledge from already existing knowledge.
Applications
Assign an object/event to one of a given finite set of categories:
Medical diagnosis
Credit card applications or transactions
Fraud detection in e-commerce
Spam filtering in email
Recommending books, movies, music
Financial investments
Game playing
Handwritten letters
Machine-Learning Systems
Components of a Learning System
1. Learning component: To make changes or improvements to the system depending on its performance.
2. Performance element: It performs the task of choosing the actions that need to be taken.
3. Critic: The job of the critic is to inform the learning component regarding its performance.
4. Problem generator: It suggests problems or actions that would lead to the generation of new examples or experiences.
5. Sensors and effectors: Both these components are external to the system.
A general model of learning agents
[Figure: a general model of a learning agent]
Major paradigms of machine learning
Rote learning: Learning by memorization. e.g., caching. Steps: Organization, Generalization, Stability of Knowledge.
Learning by taking advice: Taking high-level and abstract advice and then converting it into rules. e.g., expert systems. Steps: Request, Interpret, Operationalize, Integrate, Evaluate.
Learning by parameter adjustment: Steps: Initially start with some estimate of the correct weight settings. Then modify the weights in the program on the basis of accumulated experience. Increase or decrease the weights of features that appear to be good or bad predictors, respectively.
Learning by macro-operators: Similar to rote learning, except we avoid expensive re-computation by using macro-operators that are learnt for subsequent use.
Learning by analogy: Determine a correspondence between two different representations. Identified as CASE-BASED REASONING.
Supervised and unsupervised learning
Supervised learning: Use specific examples to reach general conclusions or extract general rules.
Classification (concept learning)
Regression
Unsupervised learning (clustering): Unsupervised identification of natural groups in data.
1) Neural Network Based Learning
It is a system loosely modeled on the human brain.
The basic computational element (model neuron) is often called a node or unit. It receives input from some other units, or perhaps from an external source. Each input has an associated weight w, which can be modified by the learning methods.
2) Supervised concept learning
Given a training set of positive and negative examples of a concept, construct a description that will accurately classify whether future examples are positive or negative.
That is, learn some good estimate of a function f given a training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each yi is either + (positive) or - (negative), or a probability distribution over {+, -}.
3) Probably Approximately Correct (PAC) Learning
In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 - δ) a system learns a concept with error at most ε.
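The slide only states the (ε, δ) guarantee; as a concrete illustration, here is the classical sample-size bound for a consistent learner over a finite hypothesis space H (this bound is standard but not given on the slide): m ≥ (1/ε)(ln|H| + ln(1/δ)).

```python
import math

def pac_sample_bound(h_size, eps, delta):
    """Classical PAC bound for a consistent learner over a finite
    hypothesis space H: with m >= (1/eps)(ln|H| + ln(1/delta)) examples,
    any consistent hypothesis has error <= eps with prob >= 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. |H| = 2**10 hypotheses, 5% error, 95% confidence
m = pac_sample_bound(2 ** 10, eps=0.05, delta=0.05)  # 199 examples suffice
```

Note how the bound grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.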
4) Reinforcement Learning
Decision making (robot, chess machine).
Basic kinds: Utility Function
Action-Value Function
The inductive learning problem
Extrapolate from a given set of examples to make accurate predictions about future examples.
Supervised versus unsupervised learning:
Learn an unknown function f(X) = Y, where X is an input example and Y is the desired output.
Supervised learning implies we are given a training set of (X, Y) pairs by a teacher.
Unsupervised learning means we are only given the Xs and some (ultimate) feedback function on our performance.
Concept learning or classification:
Given a set of examples of some concept/class/category, determine if a given example is an instance of the concept or not.
If it is an instance, we call it a positive example.
If it is not, it is called a negative example.
Or we can make a probabilistic prediction (e.g., using a Bayes net).
Inductive learning framework
Raw input data from sensors are typically preprocessed to obtain a feature vector, X, that adequately describes all of the relevant features for classifying examples.
Each x is a list of (attribute, value) pairs. For example,
X = [Person:Sue, EyeColor:Brown, Age:Young, Sex:Female]
The number of attributes is fixed.
Each attribute has a fixed, finite number of possible values (or could be continuous).
Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of attributes.
Learning decision trees
Goal: Build a decision tree to classify examples as positive or negative instances of a concept using supervised learning from a training set.
A decision tree is a tree where
each non-leaf node has associated with it an attribute (feature)
each leaf node has associated with it a classification (+ or -)
each arc has associated with it one of the possible values of the attribute at the node from which the arc is directed
Generalization: allow for >2 classes, e.g., {sell, hold, buy}
[Figure: a decision tree splitting first on Color (red/green/blue), then on Shape (round/square) and Size (big/small), down to + and - leaves]
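A decision tree of this kind is easy to walk in code. The tree below is a minimal sketch of the structure described above, not an exact reproduction of the slide's figure (the branch labels are hypothetical):

```python
# Non-leaf nodes are (attribute, {value: subtree}) pairs; leaves are "+"/"-".
tree = ("Color", {
    "red":   ("Size", {"big": "-", "small": "+"}),
    "green": ("Shape", {"round": "+",
                        "square": ("Size", {"big": "-", "small": "+"})}),
    "blue":  "+",
})

def classify(tree, example):
    """Walk from the root: at each non-leaf node, test that node's
    attribute and follow the arc labelled with the example's value."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree  # a leaf classification: "+" or "-"

label = classify(tree, {"Color": "red", "Size": "small"})  # "+"
```

The loop stops as soon as it reaches a leaf, so attributes not on the followed path (e.g., Shape for a red example) are never consulted.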
Decision tree-induced partition example
[Figure: the same decision tree (Color, then Shape and Size) shown together with the partition of the feature space it induces]
Inductive learning and bias
Suppose that we want to learn a function f(x) = y and we are given some sample (x, y) pairs, as in figure (a).
There are several hypotheses we could make about this function, e.g., (b), (c) and (d).
A preference for one over the others reveals the bias of our learning technique, e.g.:
prefer piecewise functions
prefer a smooth function
prefer a simple function and treat outliers as noise
Choosing the best attribute
The key problem is choosing which attribute to split a given set of examples on.
Some possibilities are:
Random: Select any attribute at random.
Least-Values: Choose the attribute with the smallest number of possible values.
Most-Values: Choose the attribute with the largest number of possible values.
Max-Gain: Choose the attribute that has the largest expected information gain, i.e., the attribute that will result in the smallest expected size of the subtrees rooted at its children.
The ID3 algorithm uses the Max-Gain method of selecting the best attribute.
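The Max-Gain criterion can be sketched directly from the entropy definitions (the toy data below is hypothetical, chosen so one attribute is perfectly informative and the other is useless):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p * log2(p) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, labels):
    """Gain(S, A) = H(S) - sum_v (|S_v|/|S|) * H(S_v): the expected
    drop in entropy from splitting S on attribute A."""
    n = len(examples)
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Hypothetical toy data: Size perfectly predicts the label, Color does not.
examples = [{"Size": "big", "Color": "red"}, {"Size": "big", "Color": "blue"},
            {"Size": "small", "Color": "red"}, {"Size": "small", "Color": "blue"}]
labels = ["+", "+", "-", "-"]
# Max-Gain picks Size: gain 1.0 bit, versus 0.0 for Color.
```

ID3 applies this choice recursively: split on the best attribute, then repeat on each child subset.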
Deductive learning
Working on already existing facts and knowledge and simply deducing new knowledge from the existing one:
If A (assertion) then B (conclusion).
1) Probability-based learning (Bayesian Learning)
2) Adaptive dynamic learning
Clustering Algorithms
Exclusive clustering: K-means
Overlapping clustering: Fuzzy C-means (FCM)
Hierarchical clustering
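The exclusive-clustering idea behind K-means can be sketched in a few lines; this toy version assumes 1-D points and fixed initial centers (both hypothetical), whereas real use would rely on a library implementation:

```python
def kmeans(points, centers, iterations=10):
    """Alternate two steps: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = kmeans([1.0, 1.2, 0.8, 9.0, 9.2, 8.8], centers=[0.0, 5.0])
# the two centers converge near the group means, 1.0 and 9.0
```

Because assignment is all-or-nothing, each point belongs to exactly one cluster; Fuzzy C-means instead gives each point a graded membership in every cluster.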
Support Vector Machines
The classifier is a separating hyperplane.
The most important training points are the support vectors; they define the hyperplane.
Quadratic optimization algorithms can identify which training points xi are support vectors, i.e., those with non-zero Lagrangian multipliers αi.
Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
Find α1 .. αN such that
Q(α) = Σi αi - ½ Σi Σj αi αj yi yj (xi · xj) is maximized and
(1) Σi αi yi = 0
(2) 0 ≤ αi ≤ C for all i
f(x) = Σi αi yi (xi · x) + b
Linear Classifiers
Copyright 2001, 2003, Andrew W. Moore
[Figure: 2-D datapoints, with markers denoting +1 and -1]
f(x, w, b) = sign(w · x - b)
How would you classify this data?
Linear Classifiers
[Figure: several candidate separating lines through the same data]
f(x, w, b) = sign(w · x - b)
Any of these would be fine.. ..but which is best?
Classifier Margin
f(x, w, b) = sign(w · x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.
Linear SVM
Estimate the Margin
What is the distance expression for a point x to the line wx + b = 0?
d(x) = |x · w + b| / ||w||2 = |x · w + b| / sqrt(Σi=1..d wi²)
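The distance formula above can be checked numerically; the line and point below are illustrative examples, not from the slides:

```python
import math

def distance_to_hyperplane(x, w, b):
    """d(x) = |x.w + b| / sqrt(sum_i w_i^2)."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))

# For the line 3x + 4y - 10 = 0 (w = (3, 4), b = -10), the origin
# lies at distance |-10| / 5 = 2.
d = distance_to_hyperplane((0.0, 0.0), (3.0, 4.0), -10.0)  # 2.0
```

Dividing by ||w|| is what makes the distance independent of the scale of (w, b), which is exactly the freedom the margin-maximization derivation later fixes.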
Estimate the Margin
What is the expression for the margin?
margin = min over x in D of d(x) = min over x in D of |x · w + b| / sqrt(Σi=1..d wi²)
Maximize Margin
(w*, b*) = argmax over (w, b) of margin(w, b)
         = argmax over (w, b) of min over xi in D of d(xi)
         = argmax over (w, b) of min over xi in D of |xi · w + b| / sqrt(Σi=1..d wi²)
Maximize Margin
argmax over (w, b) of min over xi in D of |xi · w + b| / sqrt(Σi=1..d wi²)
subject to ∀i: yi (xi · w + b) > 0 (every training point classified correctly)
Maximize Margin
Strategy: fix the scale of (w, b) by requiring min over xi in D of yi (xi · w + b) = 1. Then
argmax over (w, b) of min over xi in D of |xi · w + b| / sqrt(Σi=1..d wi²), subject to ∀i: yi (xi · w + b) > 0
becomes
argmin over (w, b) of Σi=1..d wi², subject to ∀i: yi (xi · w + b) ≥ 1
Maximum Margin Linear Classifier
How to solve such a convex optimization problem?
(w*, b*) = argmin over (w, b) of Σk wk²
subject to
y1 (w · x1 + b) ≥ 1
y2 (w · x2 + b) ≥ 1
....
yN (w · xN + b) ≥ 1
Lagrange Multiplier Method
The new objective function is called the Lagrangian for the optimization problem:
Lp = ½ ||w||² - Σi λi (yi (w · xi + b) - 1)   ---(1)
λi: Lagrange multiplier
Partially differentiating Lp w.r.t. 'w' and 'b' we get:
w = Σi λi yi xi and Σi λi yi = 0   ---(2)
Because the Lagrange multipliers are constrained to be non-negative, additional conditions apply.
Linear SVM:
It can be handled only when
λi ≥ 0,
λi [yi (w · xi + b) - 1] = 0.
These are known as the Karush-Kuhn-Tucker (KKT) conditions.
From the above equation 'b' can be calculated.
Substituting the values from eqn. (2) in eqn. (1), we get the dual formulation.
Linear SVM:
[Slides 41-45: figures only, covering the Linear SVM solution and "Support Vector Machine (SVM) for ..."]
SVM Kernel Functions
K(a, b) = (a · b + 1)^d is an example of an SVM kernel function.
Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function.
Radial-basis-style kernel function:
Kernel Tricks
Replacing the dot product with a kernel function K(a, b).
Could K(a, b) = (a - b)³ be a kernel function?
Could K(a, b) = (a - b)⁴ - (a + b)² be a kernel function?
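The kernel trick is easy to verify concretely: for 2-D inputs, the polynomial kernel K(a, b) = (a·b + 1)² equals an ordinary dot product φ(a)·φ(b) in a 6-D feature space. The explicit map φ below is a standard worked example, not something the slides spell out:

```python
import math

def poly_kernel(a, b):
    """K(a, b) = (a.b + 1)^2 for 2-D inputs."""
    return (a[0] * b[0] + a[1] * b[1] + 1) ** 2

def phi(x):
    """Explicit feature map with phi(a).phi(b) == K(a, b):
    phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    r2 = math.sqrt(2)
    return (1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1])

a, b = (1.0, 2.0), (3.0, 0.5)
lhs = poly_kernel(a, b)                              # 25.0
rhs = sum(u * v for u, v in zip(phi(a), phi(b)))    # same value
```

The kernel computes the 6-D inner product without ever building φ, which is what makes very high-dimensional basis functions practical.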
Contd..
[Slides 49-54: figures of an artificial neuron, the bias of a neuron, and activation functions]
Step function
Ramp function
Sigmoid function
http://en.wikipedia.org/wiki/Image:Logistic-curve.png
The Gaussian function is the probability function of the normal distribution. Sometimes also called the frequency curve.
Artificial Neural Networks
[Slides 59-60: figures only]
Perceptron: Neuron Model
(Special form of single-layer feed-forward network)
The perceptron, first proposed by Rosenblatt (1958), is a simple neuron that is used to classify its input into one of two categories.
A perceptron uses a step function that returns +1 if the weighted sum of its input is ≥ 0, and -1 otherwise.
[Figure: inputs x1, x2, ..., xn with weights w1, w2, ..., wn and bias b feed a sum v, followed by the step activation φ(v) producing output y]
Learning Process for Perceptron
Initially assign random weights to inputs between -0.5 and +0.5.
Training data is presented to the perceptron and its output is observed.
If the output is incorrect, the weights are adjusted accordingly using the following formula:
wi ← wi + (a × xi × e), where 'e' is the error produced and 'a' (-1 < a < 1) is the learning rate.
Example: Perceptron to learn OR function
Initially consider w1 = -0.2 and w2 = 0.4.
For training data, say, x1 = 0 and x2 = 0, the target output is 0. Compute y = Step(w1*x1 + w2*x2) = 0. The output is correct, so the weights are not changed.
For training data x1 = 0 and x2 = 1, the target output is 1. Compute y = Step(w1*x1 + w2*x2) = Step(0.4) = 1. The output is correct, so the weights are not changed.
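The whole OR example can be run end to end. This sketch assumes a 0/1 step that is strict at zero (so the all-zero input maps to 0, matching the worked example) and a learning rate a = 1, since the slide's exact 'a' is not fully legible:

```python
def step(v):
    # strict at 0 so that Step(0) = 0, as in the slide's worked example
    return 1 if v > 0 else 0

def train_perceptron(data, w, a=1.0, epochs=10):
    """Apply w_i <- w_i + a * x_i * e with e = target - output."""
    for _ in range(epochs):
        for x, target in data:
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            e = target - y
            w = [wi + a * xi * e for wi, xi in zip(w, x)]
    return w

OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(OR, w=[-0.2, 0.4])
# after training, the perceptron reproduces OR on all four inputs
```

Only the (1, 0) case is initially misclassified (sum = -0.2), so a single corrective update to w1 is enough for convergence.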
Perceptron: Limitations
The perceptron can only model linearly separable functions: those functions which can be drawn in a 2-dim graph such that a single straight line separates the values into two parts.
Boolean functions such as AND and OR are linearly separable.
XOR
These two classes (true and false) cannot be separated using a line. Hence XOR is non-linearly separable.
X1   X2   X1 XOR X2
0    0    false
0    1    true
1    0    true
1    1    false
[Figure: plotting the four points shows that the true and false classes cannot be split by one straight line]
Multi-layer feed-forward (FF) networks
[Figure: a two-layer feed-forward network; the two hidden nodes use weight vectors (1, -1) and (-1, 1), and the output node is used to combine the outputs of the two hidden nodes. Layout: input nodes -> hidden layer -> output layer -> output]
Training Algorithm: Backpropagation
The Backpropagation algorithm learns in the same way as a single perceptron.
It searches for weight values that minimize the total error of the network over the set of training examples (training set).
Backpropagation consists of the repeated application of the following two passes:
Forward pass: In this step, the network is activated on one example and the error of each neuron at the output layer is computed.
Backward pass: In this step the network error is used for updating the weights. The error is propagated backwards from the output layer through the network layer by layer. This is done by recursively computing the local gradient of each neuron.
Contd..
Consider a network of three layers. Let us use i to represent nodes in the input layer, j to represent nodes in the hidden layer, and k to represent nodes in the output layer.
wij refers to the weight of the connection between a node in the input layer and a node in the hidden layer.
The following equation is used to derive the output value Yj of node j:
Yj = 1 / (1 + e^(-Xj))
where Xj = Σ xi · wij - θj, 1 ≤ i ≤ n; n is the number of inputs to node j, and θj is the threshold for node j.
Weight Update Rule
The Backprop weight update rule is based on the gradient descent method:
It takes a step in the direction yielding the maximum decrease of the network error E.
This direction is the opposite of the gradient of E:
wij = wij + Δwij, where Δwij = -η ∂E/∂wij
Iteration of the Backprop algorithm is usually terminated when the sum of squares of errors of the output values for all training data in an epoch is less than some threshold, such as 0.01.
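The update rule Δw = -η ∂E/∂w can be demonstrated on a single sigmoid unit with squared error E = ½(y - t)², using the Yj = 1/(1 + e^(-Xj)) activation defined earlier. The inputs, threshold, and learning rate below are illustrative, not from the slides:

```python
import math

def forward(w, x, theta):
    """Sigmoid unit: Yj = 1 / (1 + exp(-(sum_i x_i w_i - theta)))."""
    xj = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1.0 / (1.0 + math.exp(-xj))

def backprop_step(w, x, theta, t, eta=0.5):
    y = forward(w, x, theta)
    # for sigmoid + squared error: dE/dw_i = (y - t) * y * (1 - y) * x_i
    return [wi - eta * (y - t) * y * (1.0 - y) * xi for wi, xi in zip(w, x)]

w, x, theta, t = [0.3, -0.1], [1.0, 2.0], 0.2, 1.0
for _ in range(200):
    w = backprop_step(w, x, theta, t)
# the squared error 0.5 * (forward(w, x, theta) - t)**2 shrinks toward 0
```

The factor y(1 - y) is the sigmoid's own derivative; in a multi-layer network the backward pass chains such local gradients layer by layer.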
Stopping criteria
Total mean squared error change: Backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.1, 0.01]).
Generalization-based criterion: After each epoch, the ...
Radial Basis Function (RBF) Networks
A neural network is called a Radial Basis Function (RBF) network if its output depends on the distance of the input from a given stored vector.
The RBF neural network has an input layer, a hidden layer and an output layer.
In such RBF networks, the hidden layer uses neurons with RBFs as activation functions.
The outputs of all these hidden neurons are combined linearly at the output node.
These networks have a wide variety of applications such as function approximation, time series prediction, control and regression, and pattern classification tasks for performing complex (non-linear) mappings.
RBF Architecture
One hidden layer with RBF activation functions.
Output layer with linear activation function.
y = w1 φ(||x - t1||) + ... + wm φ(||x - tm||)
where ||x - t|| = distance of x = (x1, ..., xn) from the center t
[Figure: inputs x1..xn feed m RBF units; their outputs are weighted by w1..wm and summed to give y]
Cont...
Here we require weights wi from the hidden layer to the output layer only.
The weights wi can be determined with the help of any of the standard iterative methods described earlier for neural networks.
However, since the approximating function given below is linear w.r.t. wi, it can be directly calculated using the matrix methods of linear least squares, without having to explicitly determine wi iteratively.
It should be noted that the approximate function f(X) is differentiable with respect to wi.
Y = f(X) = Σi=1..N wi φi(||X - ti||)
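Because f(X) is linear in the weights wi, a single least-squares solve recovers them. This sketch uses Gaussian basis functions and hypothetical centers and targets (none of these specifics come from the slides):

```python
import numpy as np

def design_matrix(X, centers, width=1.0):
    """Phi[p, i] = exp(-||X_p - t_i||^2 / (2 * width^2)): one Gaussian
    RBF response per (example, center) pair."""
    d = X[:, None, :] - centers[None, :, :]          # shape (P, m, n)
    return np.exp(-np.sum(d * d, axis=2) / (2 * width ** 2))

# Hypothetical 1-D task built so the true weights are exactly [2, 3].
X = np.linspace(-2, 2, 20).reshape(-1, 1)
centers = np.array([[-1.0], [1.0]])
Phi = design_matrix(X, centers)
w_true = np.array([2.0, 3.0])
y = Phi @ w_true

# One linear least-squares solve, no iterative training needed.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# w recovers [2.0, 3.0] up to numerical precision
```

This is exactly the contrast the slide draws: the FF network's weights need iterative gradient descent, while the RBF output weights fall out of one matrix computation.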
Comparison: RBF NN vs FF NN
1. Both are non-linear layered feed-forward networks.
2. The hidden layer of an RBF NN is non-linear and its output layer is linear; the hidden and output layers of an FF NN are usually non-linear.
3. An RBF NN has one single hidden layer; an FF NN may have more hidden layers.
4. In an RBF NN, the neuron model of the hidden neurons is different from the one of the output nodes; in an FF NN, hidden and output neurons share a common neuron model.
5. The activation function of each hidden neuron in an RBF NN computes the Euclidean distance between the input vector and the center of that unit; the activation function of each hidden neuron in an FF NN computes the inner product of the input vector and the synaptic weight vector of that neuron.
NN DESIGN ISSUES
Data representation
Data representation depends on the problem. In general ...
Network Topology
The number of layers and neurons depends on the specific task.
In practice this issue is solved by trial and error.
Two types of adaptive algorithms can be used:
start from a large network and successively remove some neurons and links until network performance degrades.
begin with a small network and introduce new neurons until performance is satisfactory.
Initialization of weights
In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5.
If some inputs are much larger than others, random initialization may bias the network to give much more importance to larger inputs. In such a case, weights can be initialized as follows:
wij = 1 / (Σi=1..N |xi|)   for weights from the input to the first layer
wjk = 1 / (Σi=1..N |φ(Σ xi wij)|)   for weights from the first to the second layer
Choice of learning rate η
The right value of η depends on the application.
Values between 0.1 and 0.9 have been used in many applications.
Other heuristics adapt η during the training as described in previous slides.
Training
Rule of thumb: the number of training examples should be at least five to ten times the number of weights of the network.
Other rule:
N > |W| / (1 - a)
where |W| = number of weights and a = expected accuracy on the test set.
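The N > |W| / (1 - a) rule is just arithmetic, and a quick check makes its behavior concrete (the numbers below are illustrative):

```python
def min_training_examples(num_weights, accuracy):
    """N > |W| / (1 - a): required training-set size for |W| weights
    and expected test-set accuracy a."""
    return num_weights / (1.0 - accuracy)

# 100 weights at 90% expected accuracy -> about 1000 examples;
# pushing accuracy to 99% multiplies the requirement by ten.
n = min_training_examples(num_weights=100, accuracy=0.9)
```

Note that the requirement blows up as the expected accuracy approaches 1, which is why very accurate networks demand disproportionately large training sets.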
Recurrent Networks
[Slides 90-91: figures only]
Hopfield Networks
[Slides 92-93: figures only]
Activation Algorithm
An active unit is represented by 1 and an inactive unit by 0.
Repeat
Choose any unit randomly. The chosen unit may be active or inactive.
For the chosen unit, compute the sum of the weights on the connections to the active neighbours only, if any.
If the sum > 0 (the threshold is assumed to be 0), then the chosen unit becomes active, otherwise it becomes inactive.
If the chosen unit has no active neighbours then ignore it, and its status remains the same.
Until the network reaches a stable state.
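The steps above can be run on a small example. This sketch uses a hypothetical 3-unit network with symmetric weights, and cycles through the units deterministically instead of choosing them randomly, purely for reproducibility:

```python
# Hypothetical symmetric weights between the three units 0, 1, 2.
W = {(0, 1): 1, (1, 0): 1, (0, 2): -2, (2, 0): -2, (1, 2): 1, (2, 1): 1}
state = [1, 0, 1]                     # 1 = active, 0 = inactive

def update(state, unit):
    """One step of the activation algorithm for the chosen unit."""
    active_neighbours = [j for j in range(len(state))
                         if j != unit and state[j] == 1]
    if not active_neighbours:         # no active neighbours: leave as-is
        return state
    s = sum(W[(unit, j)] for j in active_neighbours)
    state = state.copy()
    state[unit] = 1 if s > 0 else 0   # threshold assumed to be 0
    return state

for _ in range(6):                    # sweep until stable
    for u in range(3):
        state = update(state, u)
# the network settles into a stable state, here [0, 1, 1]
```

Stability means no single-unit update changes the state any more, which is exactly the loop's termination condition on the slide.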
[Worked example (figure): a unit is selected from the current state; the sum of the weights of the active neighbours of the selected unit is calculated, and the unit is activated if the sum > 0 and deactivated otherwise, giving the corresponding new state.]
[Figure: a sequence of network states converging]
Stable Network
Example
Let us now consider a Hopfield network with four units and three training input vectors that are to be learned by the network.
Consider three input examples, namely, X1, X2, and X3, defined as follows:
[Figure: the vectors X1, X2, X3]
W = X1 (X1)T + X2 (X2)T + X3 (X3)T - 3I
[Figure: the resulting 4x4 weight matrix, and the stable positions of the network, which are at Hamming distance 1]
Finally, with the obtained weights and stable states (X1 and X3), we can stabilize any new (partial) pattern to one of those.
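Since the slide's exact vectors are not legible here, the storage-and-recall procedure can still be sketched with hypothetical patterns, using the same Hebbian rule W = Σp Xp Xpᵀ - pI (which zeroes the self-connections) and the bipolar ±1 convention:

```python
import numpy as np

# Two hypothetical bipolar patterns for a 4-unit network.
patterns = [np.array([1, -1, 1, -1]), np.array([1, 1, -1, -1])]
# Hebbian storage: W = sum_p x_p x_p^T - p*I (self-connections zeroed).
W = sum(np.outer(x, x) for x in patterns) - len(patterns) * np.eye(4)

def recall(state, sweeps=8):
    """Asynchronous updates: each unit takes the sign of its net input."""
    s = state.copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            net = W[i] @ s
            if net != 0:
                s[i] = 1 if net > 0 else -1
    return s

noisy = np.array([-1, -1, 1, -1])    # the first pattern with one unit flipped
# recall(noisy) restores the stored pattern [1, -1, 1, -1]
```

This is the behavior the slide describes: a new partial or corrupted pattern is driven to the nearest stored stable state.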