UNIT V: LEARNING
LEARNING
Learning from Observation
Inductive Learning
Decision Trees
Explanation-Based Learning
Statistical Learning Methods
Reinforcement Learning
(Slides partly © 2000-2012 Franz Kurfess)
Learning
An agent tries to improve its behavior through observation, reasoning, or reflection
learning from experience
memorization of past percepts, states, and actions
generalization: identification of similar experiences
forecasting
prediction of changes in the environment
theories
generation of complex models based on observations and reasoning
Learning from Observation
Learning Agents
Inductive Learning
Learning Decision Trees
Learning Agents
based on previous agent designs, such as reflexive, model-based, and goal-based agents
those aspects of agents are encapsulated into the performance element of a learning agent
a learning agent has an additional learning element
usually used in combination with a critic and a problem generator for better learning
most agents learn from examples
inductive learning
Learning Agent Model
[Diagram: the agent contains a performance element, a critic, a learning element, and a problem generator; sensors and effectors connect it to the environment. The critic compares percepts against a performance standard and sends feedback to the learning element, which sets learning goals for the problem generator and makes changes to the performance element's knowledge.]
Forms of Learning
supervised learning
an agent tries to find a function that matches examples from a sample set
each example provides an input together with the correct output
a teacher provides feedback on the outcome
the teacher can be an outside entity, or part of the environment
unsupervised learning
the agent tries to learn from patterns without corresponding output values
reinforcement learning
the agent does not know the exact output for an input, but it receives feedback on the desirability of its behavior
the feedback can come from an outside entity, the environment, or the agent itself
the feedback may be delayed, and not follow the respective action immediately
Feedback
provides information about the actual outcome of actions
supervised learning
both the input and the output of a component can be perceived by the agent directly
the output may be provided by a teacher
reinforcement learning
feedback concerning the desirability of the agent's behavior is available
not in the form of the correct output
may not be directly attributable to a particular action
feedback may occur only after a sequence of actions
Prior Knowledge
background knowledge available before a task is tackled
can increase performance or decrease learning time considerably
many learning schemes assume that no prior knowledge is available
in reality, some prior knowledge is almost always available
but often in a form that is not immediately usable by the agent
Inductive Learning
tries to find a function h (the hypothesis) that approximates a set of samples defining a function f
the samples are usually provided as input-output pairs (x, f(x))
supervised learning method
relies on inductive inference, or induction
conclusions are drawn from specific instances to more general statements
Hypotheses
finding a suitable hypothesis can be difficult
since the function f is unknown, it is hard to tell if the hypothesis h is a good approximation
the hypothesis space describes the set of hypotheses under consideration
e.g. polynomials, sinusoidal functions, propositional logic, predicate logic, ...
the choice of the hypothesis space can strongly influence the task of finding a suitable function
while a very general hypothesis space (e.g. Turing machines) may be guaranteed to contain a suitable function, it can be difficult to find it
Ockham's razor: if multiple hypotheses are consistent with the observations, prefer the simplest one
Example Inductive Learning 1
input-output pairs (x, f(x)) displayed as points in a plane
the task is to find a hypothesis (function) that connects the points
either all of them, or most of them
various performance measures
number of points connected
minimal surface
lowest tension
Example Inductive Learning 2
hypothesis is a function consisting of linear segments
fully incorporates all sample pairs
goes through all points
very easy to calculate
has discontinuities at the joints of the segments
moderate predictive performance
Example Inductive Learning 3
hypothesis expressed as a polynomial function
incorporates all samples
more complicated to calculate than linear segments
no discontinuities
better predictive power
Example Inductive Learning 4
hypothesis is a linear function
does not incorporate all samples
extremely easy to compute
low predictive power
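A minimal sketch contrasting two of these hypothesis spaces, assuming NumPy is available; the sample points are made up for illustration. A degree-(n-1) polynomial interpolates all n points (Example 3), while a straight-line fit ignores some of them (Example 4):

    import numpy as np

    # illustrative input-output pairs (x, f(x)); not data from the slides
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([0.1, 0.9, 0.2, 1.1, 0.4, 1.3])

    # Example 3: polynomial hypothesis that passes through all points
    poly = np.polyfit(x, y, deg=len(x) - 1)   # degree n-1 interpolates n points

    # Example 4: linear hypothesis; very simple, but misses some points
    line = np.polyfit(x, y, deg=1)

    # predictions for a new input show the difference in predictive behavior
    x_new = 2.5
    print(np.polyval(poly, x_new), np.polyval(line, x_new))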
Learning and Decision Trees
based on a set of attributes as input, a predicted output value, the decision, is learned
it is called classification learning for discrete values
regression for continuous values
Boolean or binary classification
output values are true or false
conceptually the simplest case, but still quite powerful
making decisions
a sequence of tests is performed, testing the value of one of the attributes in each step
when a leaf node is reached, its value is returned
Boolean Decision Trees
compute yes/no decisions based on sets of desirable or undesirable properties of an object or a situation
each node in the tree reflects one yes/no decision, based on a test of the value of one property of the object
the root node is the starting point
leaf nodes represent the possible final decisions
branches are labeled with possible values
the learning aspect is to predict the value of a goal predicate (also called goal concept)
Terminology
example or sample
describes the values of the attributes and the goal
a positive sample has the value true for the goal predicate, a negative sample has false
sample set
collection of samples used for training and validation
training
the training set consists of samples used for constructing the decision tree
validation
the test set is used to determine if the decision tree generalizes well to new samples
Restaurant Sample Set
Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait (Goal)
X1      | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | Yes
X2      | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | No
X3      | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | Yes
X4      | Yes | No  | Yes | Yes | Full | $     | No   | No  | Thai    | 10-30 | Yes
X5      | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | No
X6      | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | Yes
X7      | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | No
X8      | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | Yes
X9      | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | No
X10     | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | No
X11     | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | No
X12     | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | Yes
Decision Tree Example
[Diagram: a decision tree for the question "To wait, or not to wait?". The root tests Patrons?; depending on its value the tree continues with tests such as EstWait?, Bar?, Hungry?, Alternative?, Walkable?, and Driveable?, with Yes/No leaves as the final decisions.]
Learning Decision Trees
Problem: find a decision tree that agrees with the training set
trivial solution: construct a tree with one branch for each sample of the training set
works perfectly for the samples in the training set
may not work well for new samples (generalization)
results in relatively large trees
better solution: find a concise tree that still agrees with all samples
Ockham's Razor
"The most likely hypothesis is the simplest one that is consistent with all observations."
general principle for inductive learning
a simple hypothesis that is consistent with all observations is more likely to be correct than a complex one
Constructing Decision Trees
in general, constructing the smallest possible decision tree is an intractable problem
algorithms exist for constructing reasonably small trees
basic idea: test the most important attribute first
attribute that makes the most difference for the classification of an example
can be determined through information theory
hopefully will yield the correct classification with a small number of tests
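The "most important" attribute can be measured by information gain. A minimal sketch (entropy in bits, Boolean classification), using the Patrons? split of the 12 restaurant samples above:

    from math import log2

    def entropy(pos, neg):
        """Entropy of a Boolean sample set with pos/neg examples, in bits."""
        total = pos + neg
        e = 0.0
        for count in (pos, neg):
            if count:
                p = count / total
                e -= p * log2(p)
        return e

    def information_gain(pos, neg, splits):
        """splits: one (pos_i, neg_i) pair per attribute value."""
        total = pos + neg
        remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
        return entropy(pos, neg) - remainder

    # Patrons? on the 12 samples: None (0+, 2-), Some (4+, 0-), Full (2+, 4-)
    print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))  # about 0.541 bits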
Decision Tree Algorithm
recursive formulation (see the sketch after this list)
select the best attribute to split positive and negative examples
if only positive or only negative examples are left, we are done
if no examples are left, no such examples were observed
return a default value calculated from the majority classification at the node's parent
if we have positive and negative examples left, but no attributes to split them, we are in trouble
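A sketch of this recursion, assuming examples are dicts with a Boolean 'goal' key; the attribute-selection heuristic is passed in as choose (e.g. highest information gain, as sketched above):

    def decision_tree_learning(examples, attributes, parent_examples, choose):
        """Returns a Boolean leaf or a nested dict {attribute: {value: subtree}}."""
        def majority(exs):
            # default value from the majority classification
            return sum(e['goal'] for e in exs) * 2 >= len(exs)
        if not examples:                        # no examples left:
            return majority(parent_examples)    #   default from the node's parent
        if all(e['goal'] == examples[0]['goal'] for e in examples):
            return examples[0]['goal']          # only positive or only negative left
        if not attributes:                      # mixed examples, no attributes left:
            return majority(examples)           #   "we are in trouble"; use majority
        best = choose(attributes, examples)     # best attribute to split on
        rest = [a for a in attributes if a != best]
        return {best: {v: decision_tree_learning(
                    [e for e in examples if e[best] == v], rest, examples, choose)
                for v in {e[best] for e in examples}}}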
Noise and Over-fitting
the presence of irrelevant attributes (noise) may lead to more degrees of freedom in the decision tree
the hypothesis space is unnecessarily large
overfitting makes use of irrelevant attributes to distinguish between samples that have no meaningful differences
e.g. using the day of the week when rolling dice
over-fitting is a general problem for all learning algorithms
decision tree pruning identifies attributes that are likely to be irrelevant
very low information gain
cross-validation splits the sample data into different training and test sets
results are averaged
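A minimal k-fold cross-validation sketch; train and accuracy are supplied by the caller (hypothetical functions standing in for any learner and its evaluation):

    import random

    def k_fold_cross_validation(samples, k, train, accuracy):
        random.shuffle(samples)
        folds = [samples[i::k] for i in range(k)]   # k roughly equal folds
        scores = []
        for i in range(k):
            test_set = folds[i]
            training_set = [s for j, f in enumerate(folds) if j != i for s in f]
            hypothesis = train(training_set)
            scores.append(accuracy(hypothesis, test_set))
        return sum(scores) / k                      # results are averaged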
Ensemble Learning
Multiple hypotheses (an ensemble) are generated, and their predictions combined
by using multiple hypotheses, the likelihood for misclassification is hopefully lower
also enlarges the hypothesis space
Boosting is a frequently used ensemble method
each example in the training set has a weight associated
the weights of incorrectly classified examples are increased, and a new hypothesis is generated from this new weighted training set
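A sketch of this weight-update scheme, AdaBoost-style. The exact weight formulas below are the standard AdaBoost choices, not given on the slide; labels are assumed to be +1/-1, and learn_weighted is the supplied weak learner:

    import math

    def boosting(examples, labels, rounds, learn_weighted):
        n = len(examples)
        w = [1.0 / n] * n                        # equal weights initially
        ensemble = []                            # (hypothesis weight, hypothesis)
        for _ in range(rounds):
            h = learn_weighted(examples, labels, w)
            err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
            if err == 0.0 or err >= 0.5:         # weak learner perfect or failed
                break
            alpha = 0.5 * math.log((1.0 - err) / err)
            # increase the weights of incorrectly classified examples
            w = [wi * math.exp(alpha if h(x) != y else -alpha)
                 for wi, x, y in zip(w, examples, labels)]
            total = sum(w)
            w = [wi / total for wi in w]         # renormalize to a distribution
            ensemble.append((alpha, h))
        def predict(x):                          # combined, weighted vote
            return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
        return predict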
Explanation-Based Learning
Learning complex concepts using induction procedures typically requires a substantial number of training instances.
But people seem to be able to learn quite a bit from single examples.
An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept.
The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.
EBL programs accept the following as input:
A training example
A goal concept: a high-level description of what the program is supposed to learn
An operational criterion: a description of which concepts are usable
A domain theory: a set of rules that describe relationships between objects and actions in a domain
From this, EBL computes a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion.
Explanation-based generalization (EBG) is an algorithm for EBL and has two steps: (1) explain, (2) generalize.
During the explanation step, all the unimportant aspects of the training example are pruned away with respect to the goal concept; this gives the explanation.
The next step is to generalize the explanation as far as possible while still describing the goal concept.
Statistical Learning Methods
Statistical Learning
Data: instantiations of some or all of the random variables describing the domain; they are evidence
Hypotheses: probabilistic theories of how the domain works
The Surprise candy example: two flavors (cherry and lime) in very large bags of 5 kinds, indistinguishable from the outside
h1: 100% cherry; P(c|h1) = 1, P(l|h1) = 0
h2: 75% cherry + 25% lime
h3: 50% cherry + 50% lime
Problem formulation
Given a new bag, the random variable H denotes the bag type (h1 ... h5); Di is a random variable (cherry or lime); after seeing D1, D2, ..., DN, predict the flavor (value) of DN+1.
Bayesian learning
Calculates the probability of each hypothesis given the data, and makes predictions on that basis
P(hi|d) = α P(d|hi) P(hi), where d are the observed values of D
Predictions use a likelihood-weighted average over hypotheses
hi are intermediaries between raw data and predictions
No need to pick one best-guess hypothesis
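A sketch of this calculation for the candy example. The five bag types (100%, 75%, 50%, 25%, 0% cherry) and the prior below follow the textbook version of the example; only h1-h3 are spelled out on the slide:

    likelihoods = [1.0, 0.75, 0.5, 0.25, 0.0]   # P(cherry | h_i) for h1..h5
    prior = [0.1, 0.2, 0.4, 0.2, 0.1]           # P(h_i), assumed

    def posterior(observations):
        """P(h_i | d) = alpha * P(d | h_i) * P(h_i), for i.i.d. draws d."""
        post = list(prior)
        for obs in observations:                 # obs is 'cherry' or 'lime'
            post = [p * (pc if obs == 'cherry' else 1.0 - pc)
                    for p, pc in zip(post, likelihoods)]
            alpha = 1.0 / sum(post)              # normalization constant
            post = [p * alpha for p in post]
        return post

    def predict_cherry(observations):
        """Likelihood-weighted average over hypotheses, not one best guess."""
        return sum(p * pc for p, pc in zip(posterior(observations), likelihoods))

    print(predict_cherry(['lime'] * 5))          # P(next candy is cherry)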
Learning with Complete Data
Parameter learning: to find the numerical parameters for a probability model whose structure is fixed
Data are complete when each data point contains values for every variable in the model
Maximum-likelihood parameter learning: discrete model
With complete data, ML parameter learning decomposes into separate learning problems, one for each parameter
Naïve Bayes models
The most common Bayesian network model used in machine learning
It assumes that the attributes are conditionally independent of each other, given the class
A deterministic prediction can be obtained by choosing the most likely class
P(C|x1, ..., xn) = α P(C) Πi P(xi|C)
NBC has no difficulty with noisy data
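A minimal sketch of this rule with count-based (maximum-likelihood) parameter estimates; no smoothing, and samples are assumed to be (attribute tuple, class) pairs:

    from collections import Counter, defaultdict

    def train_nbc(samples):
        class_counts = Counter(label for _, label in samples)
        value_counts = defaultdict(Counter)      # (class, attr index) -> counts
        for attrs, label in samples:
            for i, v in enumerate(attrs):
                value_counts[(label, i)][v] += 1
        total = len(samples)

        def predict(attrs):
            best, best_score = None, -1.0
            for c, n in class_counts.items():
                score = n / total                         # P(C)
                for i, v in enumerate(attrs):
                    score *= value_counts[(c, i)][v] / n  # times P(x_i | C)
                if score > best_score:                    # most likely class
                    best, best_score = c, score
            return best
        return predict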
Learning with Hidden Variables
Many real-world problems have hidden variables which are not observable in the data available for learning.
Question: if a variable (e.g. a disease) is not observed, why not construct a model without it?
Answer: hidden variables can dramatically reduce the number of parameters required to specify a Bayesian network. This reduces the amount of data needed for learning.
EM: Learning mixtures of Gaussians
The unsupervised clustering problem
If we knew which component generated each xj, we could estimate the parameters of each component
If we knew the parameters of each component, we would know which component each xj should belong to
However, we know neither
EM: expectation and maximization
Pretend we know the parameters of the model, and then infer the probability that each xj was generated by each component
E-step: computes the expected value pij of the hidden indicator variables Zij, where Zij is 1 if xj was generated by the i-th component, 0 otherwise
M-step: finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of Zij
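A one-dimensional sketch of these two steps for a two-component Gaussian mixture; initial parameter guesses and the fixed iteration count are assumptions:

    import math

    def normal_pdf(x, mu, sigma):
        return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                / (sigma * math.sqrt(2 * math.pi)))

    def em_two_gaussians(data, mu, sigma, weight, iterations=50):
        """mu, sigma, weight: length-2 lists with initial guesses."""
        for _ in range(iterations):
            # E-step: p[j][i] = expected value of the hidden indicator Z_ij
            p = []
            for x in data:
                num = [weight[i] * normal_pdf(x, mu[i], sigma[i]) for i in (0, 1)]
                total = sum(num)
                p.append([v / total for v in num])
            # M-step: parameters that maximize the expected log likelihood
            for i in (0, 1):
                ni = sum(pj[i] for pj in p)
                mu[i] = sum(pj[i] * x for pj, x in zip(p, data)) / ni
                var = sum(pj[i] * (x - mu[i]) ** 2 for pj, x in zip(p, data)) / ni
                sigma[i] = math.sqrt(max(var, 1e-9))   # avoid degenerate sigma
                weight[i] = ni / len(data)
        return mu, sigma, weight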
Instance-based Learning
Parametric vs. nonparametric learning
Parametric learning focuses on fitting the parameters of a restricted family of probability models to an unrestricted data set
Parametric learning methods are often simple and effective, but can oversimplify what's really happening
Nonparametric learning allows the hypothesis complexity to grow with the data
IBL is nonparametric, as it constructs hypotheses directly from the training data
Nearest-neighbor models
The key idea: neighbors are similar
Density estimation example: estimate x's probability density by the density of its neighbors
Connects with table lookup, NBC, decision trees, ...
How to define the neighborhood N
If too small, it contains no data points
If too big, the density is the same everywhere
A solution is to define N to contain k points, where k is large enough to ensure a meaningful estimate
For a fixed k, the size of N varies
The effect of the size of k: for most low-dimensional data, k is usually between 5 and 10
K-NN for a given query x
Which data point is nearest to x?
We need a distance metric, D(x1, x2)
Euclidean distance DE is a popular one
When each dimension measures something different, it is inappropriate to use DE (why?)
Important to standardize the scale for each dimension
Mahalanobis distance is one solution
Discrete features should be dealt with differently
Hamming distance
Use k-NN to predict
High dimensionality poses another problem (the curse of dimensionality)
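A minimal k-NN sketch: Euclidean distance over standardized features, with a majority vote over the k nearest neighbors. Standardization addresses the scale problem noted above; the query should be rescaled with the same means and deviations as the training points:

    from collections import Counter
    import math

    def standardize(points):
        """Rescale each dimension to zero mean and unit standard deviation."""
        dims = list(zip(*points))
        means = [sum(d) / len(d) for d in dims]
        stds = [math.sqrt(sum((v - m) ** 2 for v in d) / len(d)) or 1.0
                for d, m in zip(dims, means)]
        return [[(v - m) / s for v, m, s in zip(p, means, stds)] for p in points]

    def knn_predict(train_points, train_labels, query, k=5):
        d_e = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
        neighbors = sorted(zip(train_points, train_labels),
                           key=lambda pl: d_e(pl[0], query))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]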
Summary
Bayesian learning formulates learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses.
Maximum a posteriori (MAP) selects a single most likely hypothesis given the data.
Maximum likelihood simply selects the hypothesis that maximizes the likelihood of the data (= MAP with a uniform prior).
EM can find local maximum-likelihood solutions for hidden variables.
Instance-based models use the collection of data to represent a distribution.
The nearest-neighbor method is one example.
Reinforcement Learning
In which we examine how an agent can learn from success and failure, reward and punishment.
Introduction
Learning to ride a bicycle:
The goal given to the Reinforcement Learning system is simply to ride the bicycle without falling over
The system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right
Photo: http://www.roanoke.com/outdoors/bikepages/bikerattler.html
Introduction
Learning to ride a bicycle:
RL system turns the handlebars to the LEFT
Result: CRASH!!!
Receives negative reinforcement
RL system turns the handlebars to the RIGHT
Result: CRASH!!!
Receives negative reinforcement
Introduction
Learning to ride a bicycle:
RL system has learned that the state of being tilted 45 degrees to the right is bad
Repeat the trial, now using 40 degrees to the right
By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over
Passive Learning in a Known Environment
Passive learner: a passive learner simply watches the world going by, and tries to learn the utility of being in various states.
Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.
Passive Learning in a Known Environment
In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states of the grid environment described below:
Passive Learning in a Known Environment
The agent can move {North, East, South, West}
Trials terminate on reaching [4,2] or [4,3]
Passive Learning in a Known Environment
the objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i
Utilities can be learned using 3 approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)
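A sketch of the TD update, the third approach above: after each observed transition from state s to s' with reward R(s), the utility estimate moves toward the sampled value. The learning rate alpha and discount gamma are assumed parameters:

    def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
        """Temporal-difference utility update for a passive learner.
        U is a dict mapping states to utility estimates."""
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        U[s] += alpha * (reward + gamma * U[s_next] - U[s])
        return U

Calling td_update for every transition generated by the fixed policy makes U converge toward the expected utilities.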
Active Learning in an Unknown Environment
An active agent must consider:
what actions to take
what their outcomes may be
how they will affect the rewards received
Active Learning in an Unknown Environment
Minor changes to the passive learning agent:
the environment model now incorporates the probabilities of transitions to other states, given a particular action
the agent must maximize its expected utility
the agent needs a performance element to choose an action at each step
Learning in Neural Networks
Neurons and the Brain
Neural Networks
Perceptrons
Multi-layer Networks
Applications
Neural Networks
complex networks of simple computing elements
capable of learning from examples
with appropriate learning methods
collection of simple elements performs high-level operations
thought
reasoning
consciousness
Neural Networks and the Brain
brain
set of interconnected modules
performs information processing operations at various levels
sensory input analysis
memory storage and retrieval
reasoning
feelings
consciousness
neurons
basic computational elements
heavily interconnected with other neurons
[Russell & Norvig, 1995]
Artificial Neuron Diagram
weighted inputs are summed up by the input function
the (nonlinear) activation function calculates the activation value, which determines the output
[Russell & Norvig, 1995]
Common Activation Functions
Step_t(x) = 1 if x >= t, else 0
Sign(x) = +1 if x >= 0, else -1
Sigmoid(x) = 1/(1 + e^(-x))
[Russell & Norvig, 1995]
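These three functions, directly in code (t is the threshold of the step function):

    import math

    def step(x, t=0.0):
        return 1 if x >= t else 0

    def sign(x):
        return 1 if x >= 0 else -1

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))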
Network Structures
in principle, networks can be arbitrarily connected
occasionally done to represent specific structures
semantic networks
logical sentences
makes learning rather difficult
layered structures
networks are arranged into layers
interconnections mostly between two layers
some networks have feedback connections
Perceptrons
single-layer, feed-forward network
historically one of the first types of neural networks
late 1950s
the output is calculated as a step function applied to the weighted sum of inputs
capable of learning simple functions
linearly separable
[Russell & Norvig, 1995]
Perceptrons and Learning
perceptrons can learn from examples through a simple learning rule
calculate the error of a unit Erri as the difference between the correct output Ti and the calculated output Oi: Erri = Ti - Oi
adjust the weight Wj of the input Ij such that the error decreases: Wj := Wj + α * Ij * Erri, where α is the learning rate
this is a gradient descent search through the weight space
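A sketch of this rule in a training loop; the learning rate alpha, the epoch count, and the zero threshold are assumptions:

    def train_perceptron(samples, n_inputs, alpha=0.1, epochs=100):
        """samples: list of (inputs, target) pairs with targets 0 or 1."""
        w = [0.0] * n_inputs
        for _ in range(epochs):
            for inputs, target in samples:
                weighted_sum = sum(wj * ij for wj, ij in zip(w, inputs))
                output = 1 if weighted_sum >= 0 else 0   # step function
                err = target - output                    # Err = T - O
                # W_j := W_j + alpha * I_j * Err (gradient descent in weight space)
                w = [wj + alpha * ij * err for wj, ij in zip(w, inputs)]
        return w

For example, logical AND (a linearly separable function) is learnable when each input vector is extended with a constant 1 that acts as a bias input.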
Multi-Layer Networks
research in more complex networks with more than one layer was very limited until the 1980s
learning in such networks is much more complicated
the problem is to assign the blame for an error to the respective units and their weights in a constructive way
the back-propagation learning algorithm can be used to facilitate learning in multi-layer networks
Diagram Multi-Layer Network
two-layer network
input units Ik
usually not counted as a separate layer
hidden units aj
output units Oi
usually all nodes of one layer have weighted connections to all nodes of the next layer
[Diagram: input units Ik connect to hidden units aj through weights Wkj; hidden units aj connect to output units Oi through weights Wji]
Back-Propagation Algorithm
assigns blame to individual units in the respective layers
essentially based on the connection strength
proceeds from the output layer to the hidden layer(s)
updates the weights of the units leading to the layer
essentially performs gradient-descent search on the error surface
relatively simple since it relies only on local information
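A compact sketch of one back-propagation step for a two-layer sigmoid network, matching the Ik / aj / Oi structure of the earlier diagram; the learning rate alpha is assumed:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def backprop_step(x, target, W_kj, W_ji, alpha=0.5):
        """One gradient-descent step. W_kj: input->hidden weights (one row per
        hidden unit); W_ji: hidden->output weights (one row per output unit)."""
        # forward pass
        a = [sigmoid(sum(w * xk for w, xk in zip(row, x))) for row in W_kj]
        o = [sigmoid(sum(w * aj for w, aj in zip(row, a))) for row in W_ji]
        # blame for output units: error times the derivative of the sigmoid
        delta_o = [(t - oi) * oi * (1 - oi) for t, oi in zip(target, o)]
        # blame propagated back to hidden units through the connection strengths
        delta_a = [aj * (1 - aj) * sum(W_ji[i][j] * delta_o[i]
                   for i in range(len(o))) for j, aj in enumerate(a)]
        # weight updates: hidden->output, then input->hidden
        for i, row in enumerate(W_ji):
            for j in range(len(row)):
                row[j] += alpha * a[j] * delta_o[i]
        for j, row in enumerate(W_kj):
            for k in range(len(row)):
                row[k] += alpha * x[k] * delta_a[j]
        return o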
Capabilities of Multi-Layer Neural Networks
expressiveness
weaker than predicate logic
good for continuous inputs and outputs
computational efficiency
training time can be exponential in the number of inputs
depends critically on parameters like the learning rate
local minima are problematic
can be overcome by simulated annealing, at additional cost
generalization
works reasonably well for some classes of functions
Capabilities of Multi-Layer Neural Networks (cont.)
sensitivity to noise
very tolerant
they perform nonlinear regression
transparency
neural networks are essentially black boxes
there is no explanation or trace for a particular answer
tools for the analysis of networks are very limited
some limited methods to extract rules from networks
Applications
domains and tasks where neural networks are successfully used
handwriting recognition
control problems
juggling, truck backup problem
series prediction
weather, financial forecasting
categorization
sorting of items (fruit, characters, phonemes, ...)