CS2351 AI Unit 5: Learning


    UNIT V: LEARNING


    LEARNING

    Learning from Observation

    Inductive Learning

Decision Trees

Explanation-based Learning

Statistical Learning Methods

    Reinforcement Learning


    Learning

An agent tries to improve its behavior through observation, reasoning, or reflection

learning from experience
  memorization of past percepts, states, and actions
  generalization: identification of similar experiences

forecasting
  prediction of changes in the environment

theories
  generation of complex models based on observations and reasoning


    Learning from Observation

    Learning Agents

    Inductive Learning

    Learning Decision Trees


    Learning Agents

based on previous agent designs, such as reflexive, model-based, and goal-based agents
  those aspects of agents are encapsulated into the performance element of a learning agent

a learning agent has an additional learning element
  usually used in combination with a critic and a problem generator for better learning

most agents learn from examples
  inductive learning


Learning Agent Model

[Diagram: learning agent architecture. Within the Agent, the Performance Element maps Sensors to Effectors; the Critic compares percepts against a Performance Standard and gives Feedback to the Learning Element, which makes Changes to the Performance Element's Knowledge and sets Learning Goals for the Problem Generator. The Agent interacts with the Environment.]


    Forms of Learning

    supervised learning

an agent tries to find a function that matches examples from a sample set

    each example provides an input together with the correct output

    a teacher provides feedback on the outcome

    the teacher can be an outside entity, or part of the environment

    unsupervised learning

the agent tries to learn from patterns without corresponding output values

reinforcement learning
  the agent does not know the exact output for an input, but it receives feedback on the desirability of its behavior
  the feedback can come from an outside entity, the environment, or the agent itself
  the feedback may be delayed, and not follow the respective action immediately


    Feedback

provides information about the actual outcome of actions

supervised learning
  both the input and the output of a component can be perceived by the agent directly
  the output may be provided by a teacher

reinforcement learning
  feedback concerning the desirability of the agent's behavior is available
  not in the form of the correct output
  may not be directly attributable to a particular action
  feedback may occur only after a sequence of actions


    Prior Knowledge

background knowledge available before a task is tackled

can increase performance or decrease learning time considerably

many learning schemes assume that no prior knowledge is available
  in reality, some prior knowledge is almost always available
  but often in a form that is not immediately usable by the agent


    Inductive Learning

tries to find a function h (the hypothesis) that approximates a set of samples defining a function f
  the samples are usually provided as input-output pairs (x, f(x))

supervised learning method

relies on inductive inference, or induction
  conclusions are drawn from specific instances to more general statements


Hypotheses

finding a suitable hypothesis can be difficult
  since the function f is unknown, it is hard to tell if the hypothesis h is a good approximation

the hypothesis space describes the set of hypotheses under consideration
  e.g. polynomials, sinusoidal functions, propositional logic, predicate logic, ...

the choice of the hypothesis space can strongly influence the task of finding a suitable function
  while a very general hypothesis space (e.g. Turing machines) may be guaranteed to contain a suitable function, it can be difficult to find it

Ockham's razor: if multiple hypotheses are consistent with the observations, prefer the simplest one


    Example Inductive Learning 1

input-output pairs displayed as points in a plane

the task is to find a hypothesis (function) that connects the points
  either all of them, or most of them

various performance measures
  number of points connected
  minimal surface
  lowest tension

[Plot: sample points (x, f(x)) in the plane]


    Example Inductive Learning 2

hypothesis is a function consisting of linear segments
  fully incorporates all sample pairs: goes through all points
  very easy to calculate
  has discontinuities at the joints of the segments
  moderate predictive performance

[Plot: piecewise-linear hypothesis through all points (x, f(x))]


    Example Inductive Learning 3

hypothesis expressed as a polynomial function
  incorporates all samples
  more complicated to calculate than linear segments
  no discontinuities
  better predictive power

[Plot: polynomial hypothesis through all points (x, f(x))]


    Example Inductive Learning 4

hypothesis is a linear function
  does not incorporate all samples
  extremely easy to compute
  low predictive power

[Plot: straight-line hypothesis fitted to the points (x, f(x))]
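The trade-offs in Examples 1-4 are easy to reproduce in code. A minimal sketch, using made-up sample pairs and NumPy's polynomial fitting (both the data and the degrees are illustrative assumptions, not from the slides):

```python
import numpy as np

# made-up sample set: input-output pairs (x, f(x)) for an unknown f
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.3, 24.9])

# hypothesis space 1: straight lines (Example 4) -- extremely easy to
# compute, but does not incorporate all samples
h_linear = np.polynomial.Polynomial.fit(x, y, deg=1)

# hypothesis space 2: degree-5 polynomials (Example 3) -- goes through
# all six points, more complicated to calculate
h_poly = np.polynomial.Polynomial.fit(x, y, deg=5)

for name, h in [("linear", h_linear), ("degree-5", h_poly)]:
    max_err = np.max(np.abs(h(x) - y))
    print(f"{name}: max error on the samples = {max_err:.3f}")
```

The degree-5 hypothesis fits the training samples exactly; whether it predicts new points better than the line depends on the unknown f, which is the point of the over-fitting discussion later in this unit.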


    Learning and Decision Trees

based on a set of attributes as input, a predicted output value, the decision, is learned
  called classification learning for discrete values
  regression for continuous values

Boolean or binary classification
  output values are true or false
  conceptually the simplest case, but still quite powerful

making decisions
  a sequence of tests is performed, testing the value of one of the attributes in each step
  when a leaf node is reached, its value is returned


    Boolean Decision Trees

compute yes/no decisions based on sets of desirable or undesirable properties of an object or a situation
  each node in the tree reflects one yes/no decision, based on a test of the value of one property of the object
  the root node is the starting point
  leaf nodes represent the possible final decisions
  branches are labeled with possible values

the learning aspect is to predict the value of a goal predicate (also called goal concept)


    Terminology

example or sample
  describes the values of the attributes and of the goal
  a positive sample has the value true for the goal predicate, a negative sample has false

sample set
  collection of samples used for training and validation

training
  the training set consists of samples used for constructing the decision tree

validation
  the test set is used to determine if the decision tree performs well on samples it was not trained on


    Restaurant Sample Set

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
X1      | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | Yes
X2      | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | No
X3      | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | Yes
X4      | Yes | No  | Yes | Yes | Full | $     | No   | No  | Thai    | 10-30 | Yes
X5      | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | No
X6      | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | Yes
X7      | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | No
X8      | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | Yes
X9      | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | No
X10     | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | No
X11     | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | No
X12     | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | Yes

(Alt through Est are the attributes; WillWait is the goal.)


    Decision Tree Example

[Decision tree diagram: "To wait, or not to wait?". The root tests Patrons?; further tests along the branches include EstWait?, Hungry?, Alternative?, Bar?, Walkable?, and Driveable?; the leaves carry the final Yes/No decisions.]


    Learning Decision Trees

Problem: find a decision tree that agrees with the training set

trivial solution: construct a tree with one branch for each sample of the training set
  works perfectly for the samples in the training set
  may not work well for new samples (generalization)
  results in relatively large trees

better solution: find a concise tree that still agrees with all samples


Ockham's Razor

"The most likely hypothesis is the simplest one that is consistent with all observations."

general principle for inductive learning
  a simple hypothesis that is consistent with all observations is more likely to be correct than a complex one


    Constructing Decision Trees

in general, constructing the smallest possible decision tree is an intractable problem

algorithms exist for constructing reasonably small trees

basic idea: test the most important attribute first
  the attribute that makes the most difference for the classification of an example
  can be determined through information theory (see the sketch below)
  hopefully will yield the correct classification after a small number of tests
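A minimal sketch of the information-theoretic attribute choice, applied to the restaurant sample set above (the helper names are mine, not from the slides):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (in bits) of a list of boolean classifications."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(samples, attr, labels):
    """Expected reduction in entropy from splitting on attr."""
    groups = defaultdict(list)
    for value, label in zip(samples[attr], labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# WillWait labels for X1..X12, plus two of the attributes from the table
labels = [True, False, True, True, False, True,
          False, True, False, False, False, True]
samples = {
    "Pat":  ["Some", "Full", "Some", "Full", "Full", "Some",
             "None", "Some", "Full", "Full", "None", "Full"],
    "Type": ["French", "Thai", "Burger", "Thai", "French", "Italian",
             "Burger", "Thai", "Burger", "Italian", "Thai", "Burger"],
}

for attr in samples:
    print(attr, round(information_gain(samples, attr, labels), 3))
```

Pat (Patrons) has a gain of about 0.54 bits while Type has zero gain, which is why Patrons? ends up at the root of the learned tree.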


    Decision Tree Algorithm

recursive formulation
  select the best attribute to split positive and negative examples
  if only positive or only negative examples are left, we are done
  if no examples are left, no such examples were observed
    return a default value calculated from the majority classification at the node's parent
  if we have positive and negative examples left, but no attributes to split them, we are in trouble
    such samples have identical descriptions but different classifications (noise); a majority vote can be used

(a sketch of the recursion follows below)
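A compact, hypothetical rendering of that recursion in Python (my sketch, not the pseudocode from the missing slide 25):

```python
from collections import Counter

def majority(examples):
    """Most common classification among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, choose):
    """Recursive decision-tree learning.

    examples   -- list of (attribute-dict, label) pairs
    attributes -- attribute names still available for splitting
    default    -- value to return when no examples are left
    choose     -- function picking the 'best' attribute to split on
    """
    if not examples:                       # no such examples were observed
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:                   # only positive or only negative left
        return labels.pop()
    if not attributes:                     # mixed labels, nothing to split: noise
        return majority(examples)
    best = choose(attributes, examples)
    tree = {}
    for value in {ex[best] for ex, _ in examples}:
        subset = [(ex, l) for ex, l in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        # default for a child is the majority classification at this parent
        tree[value] = dtl(subset, rest, majority(examples), choose)
    return (best, tree)                    # internal node: attribute + branches
```

The choose function is the attribute selector; the information_gain helper sketched earlier is one way to implement it, and with it this recursion should place Patrons? at the root for the restaurant samples.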


    Noise and Over-fitting

the presence of irrelevant attributes (noise) may lead to more degrees of freedom in the decision tree
  the hypothesis space is unnecessarily large

over-fitting makes use of irrelevant attributes to distinguish between samples that have no meaningful differences
  e.g. using the day of the week when rolling dice

over-fitting is a general problem for all learning algorithms

decision tree pruning identifies attributes that are likely to be irrelevant
  very low information gain

cross-validation splits the sample data into different training and test sets (sketched below)
  results are averaged
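A minimal k-fold cross-validation sketch; train and error_rate are assumed placeholders for any learner and any error measure:

```python
def cross_validation(samples, k, train, error_rate):
    """Average test error over k train/test splits of the sample data."""
    folds = [samples[i::k] for i in range(k)]   # k disjoint subsets
    errors = []
    for i, test_set in enumerate(folds):
        training_set = [s for j, f in enumerate(folds) if j != i for s in f]
        hypothesis = train(training_set)
        errors.append(error_rate(hypothesis, test_set))
    return sum(errors) / k                      # results are averaged
```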


    Ensemble Learning

Multiple hypotheses (an ensemble) are generated, and their predictions combined
  by using multiple hypotheses, the likelihood of misclassification is hopefully lower
  this also enlarges the hypothesis space

boosting is a frequently used ensemble method
  each example in the training set has an associated weight
  the weights of incorrectly classified examples are increased, and a new hypothesis is generated from this new weighted training set (see the sketch below)
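The reweighting loop can be sketched as a simplified AdaBoost-style procedure; the weak_learner interface and the -1/+1 label convention are assumptions, not from the slides:

```python
import math

def boost(examples, labels, weak_learner, rounds):
    """Generate an ensemble by repeatedly reweighting the training set.

    weak_learner(examples, labels, weights) -> hypothesis h, h(x) in {-1, +1}
    labels are in {-1, +1}.
    """
    n = len(examples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = weak_learner(examples, labels, weights)
        # weighted error of this hypothesis
        err = sum(w for w, x, y in zip(weights, examples, labels) if h(x) != y)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)   # hypothesis weight
        # increase the weights of incorrectly classified examples
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, x, y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, h))
    # ensemble prediction: weighted vote of the hypotheses
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```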


    Explanation-Based Learning

Learning complex concepts using induction procedures typically requires a substantial number of training instances.

But people seem to be able to learn quite a bit from single examples.

An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept.

The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.


EBL

EBL programs accept the following as input:
  a training example
  a goal concept: a high-level description of what the program is supposed to learn
  an operational criterion: a description of which concepts are usable
  a domain theory: a set of rules that describe relationships between objects and actions in a domain

From this, EBL computes a generalization of the training example that is sufficient to describe the goal concept, and that also satisfies the operationality criterion.

Explanation-based generalization (EBG) is an algorithm for EBL with two steps: (1) explain, (2) generalize.

During the explanation step, all the unimportant aspects of the training example are pruned away with respect to the goal concept; this yields the explanation.

The next step is to generalize the explanation as far as possible while still describing the goal concept.


    Statistical Learning Methods


Statistical Learning

Data: instantiations of some or all of the random variables describing the domain; they are evidence

Hypotheses: probabilistic theories of how the domain works

The Surprise candy example: two flavors, in very large bags of 5 kinds, indistinguishable from the outside
  h1: 100% cherry; P(c|h1) = 1, P(l|h1) = 0
  h2: 75% cherry + 25% lime
  h3: 50% cherry + 50% lime
  h4: 25% cherry + 75% lime
  h5: 100% lime


Problem formulation

Given a new bag, the random variable H denotes the bag type (h1 ... h5); Di is a random variable for the i-th candy (cherry or lime); after seeing D1, D2, ..., DN, predict the flavor (value) of DN+1.

Bayesian learning

Calculates the probability of each hypothesis given the data, and makes predictions on that basis
  P(hi|d) = α P(d|hi) P(hi), where d are the observed values of D
  predictions use a likelihood-weighted average over the hypotheses
  the hi are intermediaries between raw data and predictions
  no need to pick one best-guess hypothesis
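Bayesian updating on the candy example can be sketched directly; the prior (0.1, 0.2, 0.4, 0.2, 0.1) over the five bag types is an assumed example value, not from the slides:

```python
# P(lime | h_i) for h1..h5: 0%, 25%, 50%, 75%, 100% lime
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]
prior = [0.1, 0.2, 0.4, 0.2, 0.1]          # assumed prior over bag types

def posterior(num_limes_observed):
    """P(h_i | d) = alpha * P(d | h_i) * P(h_i), after observing only limes."""
    unnorm = [p * (l ** num_limes_observed) for p, l in zip(prior, lime_prob)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_next_lime(num_limes_observed):
    """Likelihood-weighted average over hypotheses: P(D_{N+1} = lime | d)."""
    return sum(p * l for p, l in zip(posterior(num_limes_observed), lime_prob))

for n in range(4):
    print(n, [round(p, 3) for p in posterior(n)],
          round(predict_next_lime(n), 3))
```

After a handful of lime-only observations the posterior mass shifts toward h5, and the prediction P(next = lime) approaches 1.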


    Learning with Complete Data

Parameter learning: find the numerical parameters for a probability model whose structure is fixed

Data are complete when each data point contains values for every variable in the model

Maximum-likelihood parameter learning, discrete model
  with complete data, ML parameter learning decomposes into a separate learning problem for each parameter


Naive Bayes models

The most common Bayesian network model used in machine learning

It assumes that the attributes are conditionally independent of each other, given the class

A deterministic prediction can be obtained by choosing the most likely class

P(C | x1, ..., xn) = α P(C) ∏i P(xi | C)

NBC has no difficulty with noisy data
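A count-based sketch of that prediction rule (maximum-likelihood estimates, no smoothing; the toy data are made up):

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """Estimate P(C) and P(x_i | C) by counting."""
    class_counts = Counter(label for _, label in samples)
    cond_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    for attrs, label in samples:
        for i, v in enumerate(attrs):
            cond_counts[(i, label)][v] += 1
    total = len(samples)

    def predict(attrs):
        # choose the most likely class: argmax_C P(C) * prod_i P(x_i | C)
        best, best_score = None, -1.0
        for c, n in class_counts.items():
            score = n / total
            for i, v in enumerate(attrs):
                score *= cond_counts[(i, c)][v] / n
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

# hypothetical training data: (attribute tuple, class)
data = [(("Some", "$$$"), True), (("Full", "$"), False),
        (("Some", "$"), True), (("None", "$"), False)]
predict = train_naive_bayes(data)
print(predict(("Some", "$")))   # -> True
```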


    Learning with Hidden Variables

Many real-world problems have hidden variables, which are not observable in the data available for learning.

Question: If a variable (e.g. a disease) is not observed, why not construct a model without it?

Answer: Hidden variables can dramatically reduce the number of parameters required to specify a Bayesian network. This in turn reduces the amount of data needed for learning.


EM: Learning mixtures of Gaussians

The unsupervised clustering problem
  if we knew which component generated each xj, we could recover the parameters of the components
  if we knew the parameters of each component, we would know which component ci each xj should belong to
  however, we know neither

EM: expectation and maximization
  pretend we know the parameters of the model, then infer the probability that each xj belongs to each component; re-fit the components to the data, and iterate


E-step: computes the expected value pij of the hidden indicator variables Zij, where Zij is 1 if xj was generated by the i-th component, 0 otherwise

M-step: finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of Zij
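A stripped-down EM sketch for two 1-D Gaussian components; fixing the mixing weights at 0.5 is a simplifying assumption (the full algorithm re-estimates them in the M-step as well):

```python
import math
import random

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iterations=50):
    """EM for a mixture of two 1-D Gaussians, equal mixing weights assumed."""
    mu = [min(data), max(data)]          # crude initialization
    sigma = [1.0, 1.0]
    for _ in range(iterations):
        # E-step: p[i][j] = expected value of the hidden indicator Z_ij
        p = [[0.5 * gaussian(x, mu[i], sigma[i]) for x in data] for i in range(2)]
        for j in range(len(data)):
            norm = p[0][j] + p[1][j]
            p[0][j] /= norm
            p[1][j] /= norm
        # M-step: re-estimate each component from its expected membership
        for i in range(2):
            weight = sum(p[i])
            mu[i] = sum(pij * x for pij, x in zip(p[i], data)) / weight
            var = sum(pij * (x - mu[i]) ** 2 for pij, x in zip(p[i], data)) / weight
            sigma[i] = max(math.sqrt(var), 1e-6)   # avoid component collapse
    return mu, sigma

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(200)] +
        [random.gauss(5, 1) for _ in range(200)])
print(em_two_gaussians(data))
```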


    Instance-based Learning

Parametric vs. nonparametric learning
  parametric learning focuses on fitting the parameters of a restricted family of probability models to an unrestricted data set
  parametric learning methods are often simple and effective, but can oversimplify what's really happening
  nonparametric learning allows the hypothesis complexity to grow with the data
  IBL is nonparametric, as it constructs hypotheses directly from the training data


    Nearest-neighbor models

The key idea: neighbors are similar

Density estimation example: estimate x's probability density by the density of its neighbors

Connections with table lookup, NBC, decision trees, ...

How to define the neighborhood N
  if too small, it contains no data points
  if too big, the density is the same everywhere
  a solution is to define N to contain k points, where k is large enough to ensure a meaningful estimate
  for a fixed k, the size of N varies

The effect of the size of k
  for most low-dimensional data, k is usually between 5 and 10


    K-NN for a given query x

Which data point is nearest to x?
  we need a distance metric D(x1, x2)
  Euclidean distance DE is a popular one
  when each dimension measures something different, it is inappropriate to use DE (why?)
    it is important to standardize the scale for each dimension
    Mahalanobis distance is one solution
  discrete features should be dealt with differently
    Hamming distance

Use k-NN to predict (see the sketch below)

High dimensionality poses another problem (the curse of dimensionality)
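A k-NN prediction sketch with per-dimension standardization (the data and the choice of k are illustrative):

```python
import math
from collections import Counter

def standardize(points):
    """Rescale each dimension to zero mean and unit variance."""
    dims = len(points[0])
    means = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    stds = [math.sqrt(sum((p[d] - means[d]) ** 2 for p in points) / len(points))
            or 1.0 for d in range(dims)]

    def scale(p):
        return tuple((p[d] - means[d]) / stds[d] for d in range(dims))

    return [scale(p) for p in points], scale

def knn_predict(points, labels, query, k=5):
    """Majority vote among the k nearest neighbors (standardized Euclidean)."""
    scaled, scale = standardize(points)
    q = scale(query)

    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    nearest = sorted(zip(scaled, labels), key=lambda pl: dist(pl[0], q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# hypothetical data: (height in cm, income in $1000s) -> class
points = [(160, 30), (165, 32), (180, 120), (185, 110), (175, 40)]
labels = ["A", "A", "B", "B", "A"]
print(knn_predict(points, labels, (170, 35), k=3))   # -> "A"
```

Without the standardization step, the income dimension would dominate the distance, which is exactly the problem the slide points at.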


    Summary

Bayesian learning formulates learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses.

Maximum a posteriori (MAP) selects a single most likely hypothesis given the data.

Maximum likelihood simply selects the hypothesis that maximizes the likelihood of the data (= MAP with a uniform prior).

EM can find local maximum-likelihood solutions for hidden variables.

Instance-based models use the collection of data to represent a distribution.

Nearest-neighbor method


    Reinforcement Learning

In which we examine how an agent can learn from success and failure, reward and punishment.


    Introduction

Learning to ride a bicycle:
  the goal given to the Reinforcement Learning system is simply to ride the bicycle without falling over
  the system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right

Photo: http://www.roanoke.com/outdoors/bikepages/bikerattler.html


    Introduction

Learning to ride a bicycle:
  RL system turns the handlebars to the LEFT; result: CRASH!!! It receives negative reinforcement
  RL system turns the handlebars to the RIGHT; result: CRASH!!! It receives negative reinforcement


    Introduction

Learning to ride a bicycle:
  the RL system has learned that the state of being tilted 45 degrees to the right is bad
  repeat the trial, now using 40 degrees to the right
  by performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over


Passive Learning in a Known Environment

Passive learner: a passive learner simply watches the world going by, and tries to learn the utility of being in various states.

Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.


    Passive Learning in a Known Environment

In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states of the grid world on the slide:

[Figure: grid world with terminal states at [4,2] and [4,3]]


    Passive Learning in a Known Environment

The agent can move {North, East, South, West}

Trials terminate on reaching [4,2] or [4,3]


    Passive Learning in a Known Environment

The object is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i

Utilities can be learned using three approaches (a TD sketch follows below):
  1) LMS (least mean squares)
  2) ADP (adaptive dynamic programming)
  3) TD (temporal difference learning)
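The TD idea in code: after each observed transition i -> j with reward R(i), nudge U(i) toward R(i) + U(j). The trial data below are hypothetical, and handling the terminal state by pinning its utility to its reward is my simplification:

```python
def td_learn(trials, alpha=0.1):
    """Temporal-difference learning of state utilities from observed trials.

    Each trial is a list of (state, reward) pairs ending in a terminal state.
    Update rule: U(i) := U(i) + alpha * (R(i) + U(j) - U(i)) for i -> j.
    """
    U = {}
    for trial in trials:
        for state, _ in trial:
            U.setdefault(state, 0.0)
        terminal, terminal_reward = trial[-1]
        U[terminal] = terminal_reward          # terminal utility is its reward
        for (i, r), (j, _) in zip(trial, trial[1:]):
            U[i] += alpha * (r + U[j] - U[i])
    return U

# hypothetical trial through the grid world, ending in terminal state [4,3]
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
print(td_learn([trial] * 100))
```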


    Active Learning in an Unknown Environment

An active agent must consider:

    what actions to take

    what their outcomes may be

    how they will affect the rewards received


    Active Learning in an Unknown Environment

Minor changes to the passive learning agent:
  the environment model now incorporates the probabilities of transitions to other states, given a particular action
  the agent must maximize its expected utility
  the agent needs a performance element to choose an action at each step


    Learning in Neural Networks

    Neurons and the Brain

    Neural Networks

    Perceptrons

    Multi-layer Networks

    Applications


    Neural Networks

complex networks of simple computing elements

capable of learning from examples
  with appropriate learning methods

a collection of simple elements performs high-level operations
  thought
  reasoning
  consciousness


    Neural Networks and the Brain

brain
  a set of interconnected modules
  performs information processing operations at various levels
    sensory input analysis
    memory storage and retrieval
    reasoning
    feelings
    consciousness

neurons
  basic computational elements
  heavily interconnected with other neurons

[Russell & Norvig, 1995]


    Artificial Neuron Diagram

weighted inputs are summed up by the input function

the (nonlinear) activation function calculates the activation value, which determines the output

[Russell & Norvig, 1995]


    Common Activation Functions

Step_t(x) = 1 if x >= t, else 0

Sign(x) = +1 if x >= 0, else -1

Sigmoid(x) = 1 / (1 + e^(-x))

[Russell & Norvig, 1995]


    Network Structures

in principle, networks can be arbitrarily connected
  occasionally done to represent specific structures
    semantic networks
    logical sentences
  makes learning rather difficult

layered structures
  networks are arranged into layers
  interconnections mostly between two layers
  some networks have feedback connections


Perceptrons

single-layer, feed-forward network
  historically one of the first types of neural networks (late 1950s)

the output is calculated as a step function applied to the weighted sum of inputs

capable of learning simple functions
  linearly separable ones

[Russell & Norvig, 1995]


    Perceptrons and Learning

perceptrons can learn from examples through a simple learning rule
  calculate the error of a unit Err_i as the difference between the correct output T_i and the calculated output O_i:
    Err_i = T_i - O_i
  adjust the weight W_j of the input I_j such that the error decreases:
    W_j := W_j + α · I_j · Err_i
  where α is the learning rate

this is a gradient descent search through the weight space (see the sketch below)
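A runnable sketch of the rule on a linearly separable function (AND); the bias handling and the constants are my choices, not from the slides:

```python
def perceptron_train(samples, alpha=0.1, epochs=50):
    """Perceptron learning rule: W_j := W_j + alpha * I_j * Err.

    samples: list of (inputs, target) with target in {0, 1}.
    A constant bias input of 1 is appended to every input vector.
    """
    n = len(samples[0][0]) + 1
    w = [0.0] * n

    def step(x):
        return 1 if x >= 0 else 0

    for _ in range(epochs):
        for inputs, target in samples:
            x = list(inputs) + [1.0]                     # bias input
            output = step(sum(wi * xi for wi, xi in zip(w, x)))
            err = target - output                        # Err = T - O
            w = [wi + alpha * xi * err for wi, xi in zip(w, x)]
    return w

# AND is linearly separable, so the perceptron can learn it
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_train(and_samples)
for inputs, _ in and_samples:
    x = list(inputs) + [1.0]
    print(inputs, 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0)
```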


    Multi-Layer Networks

research in more complex networks, with more than one layer, was very limited until the 1980s
  learning in such networks is much more complicated
  the problem is to assign the blame for an error to the respective units and their weights in a constructive way

the back-propagation learning algorithm can be used to facilitate learning in multi-layer networks

Diagram Multi-Layer Network

two-layer network
  input units I_k (usually not counted as a separate layer)
  hidden units a_j
  output units O_i
  usually all nodes of one layer have weighted connections to all nodes of the next layer

[Diagram: input units I_k feed hidden units a_j through weights W_kj; hidden units feed output units O_i through weights W_ji]


    Back-Propagation Algorithm

assigns blame to individual units in the respective layers
  essentially based on the connection strength
  proceeds from the output layer to the hidden layer(s)
  updates the weights of the units leading to the layer

essentially performs gradient-descent search on the error surface

relatively simple since it relies only on local information (a sketch follows below)
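A minimal back-propagation sketch for a two-layer sigmoid network (no bias weights, plain gradient descent; the variable names are mine). Calling it repeatedly over a training set of (inputs, targets) pairs performs one epoch per pass:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(w_kj, w_ji, inputs, targets, alpha=0.5):
    """One gradient-descent step for a two-layer sigmoid network.

    w_kj: weights from input units k to hidden units j
    w_ji: weights from hidden units j to output units i
    """
    # forward pass
    hidden = [sigmoid(sum(w_kj[k][j] * inputs[k] for k in range(len(inputs))))
              for j in range(len(w_kj[0]))]
    outputs = [sigmoid(sum(w_ji[j][i] * hidden[j] for j in range(len(hidden))))
               for i in range(len(w_ji[0]))]
    # output-layer error terms: Err_i * g'(in_i); g' = out * (1 - out) for sigmoid
    delta_o = [(t - o) * o * (1 - o) for t, o in zip(targets, outputs)]
    # blame assigned to hidden units, proportional to connection strength
    delta_h = [hidden[j] * (1 - hidden[j]) *
               sum(w_ji[j][i] * delta_o[i] for i in range(len(delta_o)))
               for j in range(len(hidden))]
    # weight updates use only local information
    for j in range(len(hidden)):
        for i in range(len(delta_o)):
            w_ji[j][i] += alpha * hidden[j] * delta_o[i]
    for k in range(len(inputs)):
        for j in range(len(hidden)):
            w_kj[k][j] += alpha * inputs[k] * delta_h[j]
    return outputs
```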


    Capabilities of Multi-Layer Neural Networks

expressiveness
  weaker than predicate logic
  good for continuous inputs and outputs

computational efficiency
  training time can be exponential in the number of inputs
  depends critically on parameters like the learning rate
  local minima are problematic
    can be overcome by simulated annealing, at additional cost

generalization
  works reasonably well for some classes of functions


Capabilities of Multi-Layer Neural Networks (cont.)

sensitivity to noise
  very tolerant: they perform nonlinear regression

transparency
  neural networks are essentially black boxes
  there is no explanation or trace for a particular answer
  tools for the analysis of networks are very limited
  only some limited methods exist to extract rules from networks


    Applications

domains and tasks where neural networks are successfully used
  handwriting recognition
  control problems
    juggling, truck backup problem
  series prediction
    weather, financial forecasting
  categorization
    sorting of items (fruit, characters, phonemes, ...)