
Page 1:

Machine Learning: Symbol-based (Chapter 9)

9.0 Introduction

9.1 A Framework for Symbol-based Learning

9.2 Version Space Search

9.3 The ID3 Decision Tree Induction Algorithm

9.4 Inductive Bias and Learnability

9.5 Knowledge and Learning

9.6 Unsupervised Learning

9.7 Reinforcement Learning

9.8 Epilogue and References

9.9 Exercises

Additional sources used in preparing the slides:
Jeffrey Ullman’s data mining lecture notes (clustering)
Ernest Davis’ lecture notes (clustering)
Dean, Allen, and Aloimonos’ AI textbook (reinforcement learning)

Page 2:

Unsupervised learning

Page 3:

Conceptual Clustering

The clustering problem

Given

• a collection of unclassified objects, and

• a means for measuring the similarity of objects (distance metric),

find

• classes (clusters) of objects such that some standard of quality is met (e.g., maximize the similarity of objects in the same class.)

Essentially, it is an approach to discovering a useful summary of the data.

Page 4:

Conceptual Clustering (cont’d)

Essentially, it is an approach to discovering a useful summary of the data.

Ideally, we would like to represent clusters along with their semantic explanations. In other words, we would like to define clusters intensionally (i.e., by general rules) rather than extensionally (i.e., by enumeration).

For instance, compare

{ X | X teaches AI at MTU CS}, and

{ John Lowther, Nilufer Onder}

Page 5:

Example: a cholera outbreak in London

Many years ago, during a cholera outbreak in London, a physician plotted the locations of cases on a map. Properly visualized, the data showed that the cases clustered around certain intersections with polluted wells, not only exposing the cause of cholera but also indicating what to do about the problem.

[Map sketch: X marks showing where cholera cases clustered]

Page 6:

Higher dimensional examples

• The observation that customers who buy diapers are more likely than average to also buy beer allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased the sales of all three items.

Page 7:

Higher dimensional examples (cont’d)

• Skycat clustered 2 × 10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum. The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe. Clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.

Page 8:

Skycat software

Page 9:

Higher dimensional examples (cont’d)

• Documents may be thought of as points in a high-dimensional space, where each dimension corresponds to one possible word. The position of a document in a dimension is the number of times the word occurs in a document (or just 1 if it occurs, 0 if not). Clusters of documents in this space often correspond to groups of documents on the same topic.

Query “salsa” submitted to MetaCrawler returns 246 documents in 15 clusters, of which the top are:

Puerto Rico; Latin Music (8 docs)
Follow Up Post; York Salsa Dancers (20 docs)
music; entertainment; latin; artists (40 docs)
hot; food; chiles; sauces; condiments; companies (79 docs)
pepper; onion; tomatoes (41 docs)

Page 10:

Measuring distance

• To decide whether a set of points is close enough to be considered a cluster, we need a distance measure D(x,y) that tells how far apart points x and y are.

• The usual axioms for a distance measure D are:

1. D(x,x) = 0. A point is distance 0 from itself.

2. D(x,y) = D(y,x). Distance is symmetric.

3. D(x,y) ≤ D(x,z) + D(z,y). The triangle inequality.

Page 11:

K-dimensional Euclidean space

The distance between any two points, say a = [a1, a2, …, ak] and b = [b1, b2, …, bk], is given in one of the usual manners:

1. Common distance (“L2 norm”): sqrt( Σ_{i=1}^{k} (ai - bi)^2 )

2. Manhattan distance (“L1 norm”): Σ_{i=1}^{k} |ai - bi|

3. Max of dimensions (“L∞ norm”): max_{i=1..k} |ai - bi|

[Sketches of the three distances between points a and b omitted]
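As a quick sketch (the function names are ours, not from the slides), the three norms are straightforward to compute:

```python
# The three distance measures on k-dimensional points, written out directly.

def l2(a, b):
    """Common (Euclidean) distance: the L2 norm."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def l1(a, b):
    """Manhattan distance: the L1 norm."""
    return sum(abs(x - y) for x, y in zip(a, b))

def linf(a, b):
    """Max of dimensions: the L-infinity norm."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [0, 0], [3, 4]
print(l2(a, b), l1(a, b), linf(a, b))  # 5.0 7 4
```

All three satisfy the distance axioms on the previous slide; they differ only in how much a single large coordinate gap dominates the total.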

Page 12:

Non-Euclidean spaces

Here are some examples where a distance measure makes sense without an underlying Euclidean space.

• Web pages: Roughly a 10^8-dimensional space where each dimension corresponds to one word. Better to use vectors that record only the words actually present in documents a and b.

• Character strings, such as DNA sequences: Better to use a metric based on the LCS (Longest Common Subsequence).

• Objects represented as sets of symbolic, rather than numeric, features: Better to base similarity on the proportion of features that they have in common.

Page 13:

Non-Euclidean spaces (cont’d)

object1 = {small, red, rubber, ball}

object2 = {small, blue, rubber, ball}

object3 = {large, black, wooden, ball}

similarity(object1, object2) = 3 / 4

similarity(object1, object3) = similarity(object2, object3) = 1/4

Note that it is possible to assign different weights to features.
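A minimal sketch of this similarity measure (as on the slide, the score is the fraction of one object's features shared with the other, assuming both objects list the same number of features):

```python
# Similarity of symbolic objects as the proportion of shared features.
object1 = {"small", "red", "rubber", "ball"}
object2 = {"small", "blue", "rubber", "ball"}
object3 = {"large", "black", "wooden", "ball"}

def similarity(x, y):
    """Proportion of features in common; assumes len(x) == len(y)."""
    return len(x & y) / len(x)

print(similarity(object1, object2))  # 0.75
print(similarity(object1, object3))  # 0.25
```

A weighted variant would sum per-feature weights over the intersection instead of counting members.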

Page 14:

Approaches to Clustering

Broadly specified, there are two classes of clustering algorithms:

1. Centroid approaches: We guess the centroid, or central point, of each cluster, and assign points to the cluster of their nearest centroid.

2. Hierarchical approaches: We begin assuming that each point is a cluster by itself. We repeatedly merge nearby clusters, using some measure of how close two clusters are (e.g., distance between their centroids), or how good a cluster the resulting group would be (e.g., the average distance of points in the cluster from the resulting centroid.)

Page 15:

The k-means algorithm

•Pick k cluster centroids.

•Assign points to clusters by picking the closest centroid to the point in question. As points are assigned to clusters, the centroid of the cluster may migrate.

Example: Suppose that k = 2 and we assign points 1, 2, 3, 4, 5, in that order. Outline circles represent points, filled circles represent centroids.

[Figure: five numbered points in the plane]

Page 16:

The k-means algorithm example (cont’d)

[Figure: four snapshots showing points 1-5 being assigned and the two centroids migrating]
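The assign-then-recenter loop can be sketched as follows (a toy example with our own points, not the five-point figure above):

```python
# A bare-bones k-means: assign each point to its nearest centroid, recompute
# each centroid as the mean of its cluster, and stop when nothing moves.

def kmeans(points, centroids, max_iters=100):
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((x - c) ** 2
                                            for x, c in zip(p, centroids[i])))
            clusters[nearest].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:          # converged: no centroid migrated
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids[1])  # (10.0, 10.5)
```

This batch version recomputes centroids after a full pass; the slides' version lets the centroid migrate as each point is assigned, but both settle on the same clusters here.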

Page 17:

Issues

• How to initialize the k centroids? Pick points sufficiently far away from any other centroid, until there are k.

• As computation progresses, one can decide to split one cluster and merge two, to keep the total at k. A test for whether to do so might be to ask whether doing so reduces the average distance from points to their centroids.

• Having located the centroids of k clusters, we can reassign all points, since some points that were assigned early may actually wind up closer to another centroid, as the centroids move about.

Page 18:

Issues (cont’d)

• How to determine k? One can try increasing values of k and choose the smallest k such that increasing it further does not much decrease the average distance of points to their centroids.

[Figure: a scatter of points forming three apparent clusters]

Page 19:

Issues (cont’d)

[Figure: the same scatter grouped as one cluster (k = 1) and as two clusters (k = 2)]

When k = 1, all the points are in one cluster, and the average distance to the centroid will be high.

When k = 2, one of the clusters will be by itself and the other two will be forced into one cluster. The average distance of points to the centroid will shrink considerably.

Page 20:

Issues (cont’d)

[Figure: the scatter grouped into three clusters (k = 3)]

When k = 3, each of the apparent clusters should be a cluster by itself, and the average distance from the points to their centroids shrinks again.

When k = 4, then one of the true clusters will be artificially partitioned into two nearby clusters. The average distance to centroid will drop a bit, but not much.

[Figure: the scatter grouped into four clusters (k = 4), with one true cluster split]

Page 21:

Issues (cont’d)

This failure to drop further suggests that k = 3 is right. This conclusion can be drawn even if the data is in so many dimensions that we cannot visualize the clusters.

[Plot: average radius versus k for k = 1 to 4, flattening out after k = 3]
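A sketch of this heuristic, using farthest-first seeding (slide 17's "pick points sufficiently far away") and a bare-bones k-means; the point coordinates are made up for illustration:

```python
# Run a simple k-means for several k and watch the average distance from
# points to their nearest centroid: it stops improving much once k matches
# the number of apparent clusters.

def d2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def avg_radius(points, k, iters=20):
    # Farthest-first seeding: start anywhere, then repeatedly pick the
    # point farthest from every centroid chosen so far.
    cents = [points[0]]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(d2(p, c) for c in cents)))
    for _ in range(iters):  # standard assign / re-center loop
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: d2(p, cents[i]))].append(p)
        cents = [tuple(sum(x) / len(g) for x in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return sum(min(d2(p, c) for c in cents) ** 0.5 for p in points) / len(points)

# Three well-separated blobs: the drop in average radius flattens after k = 3.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
       (20, 0), (20, 1), (21, 0)]
for k in (1, 2, 3, 4):
    print(k, round(avg_radius(pts, k), 2))
```

The big drops from k = 1 to k = 3, followed by a tiny one at k = 4, reproduce the elbow in the plot above.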

Page 22:

The CLUSTER/2 algorithm

1. Select k seeds from the set of observed objects. This may be done randomly or according to some selection function.

2. For each seed, using that seed as a positive instance and all other seeds as negative instances, produce a maximally general definition that covers all of the positive and none of the negative instances (multiple classifications of non-seed objects are possible.)

Page 23:

The CLUSTER/2 algorithm (cont’d)

3. Classify all objects in the sample according to these descriptions. Replace each maximally general description with a maximally specific description that covers all objects in the category (to decrease the likelihood that classes overlap on unseen objects.)

4. Adjust remaining overlapping definitions.

5. Using a distance metric, select an element closest to the center of each class.

6. Repeat steps 1-5 using the new central elements as seeds. Stop when clusters are satisfactory.

Page 24:

The CLUSTER/2 algorithm (cont’d)

7. If clusters are unsatisfactory and no improvement occurs over several iterations, select the new seeds closest to the edge of the cluster.

Page 25:

The steps of a CLUSTER/2 run

Page 26:

A COBWEB clustering for four one-celled organisms (Gennari et al., 1989)

Note: we will skip the COBWEB algorithm

Page 27:

Related communities

• data mining (in databases, over the web)

• statistics

• clustering algorithms

• visualization

• databases

Page 28:

Reinforcement Learning

• A form of learning where the agent can explore and learn through interaction with the environment.

• The agent learns a policy which is a mapping from states to actions. The policy tells what the best move is in a particular state.

• It is a general methodology: planning, decision making, search can all be viewed in the context of reinforcement learning.

Page 29:

Tic-tac-toe: a different approach

• Recall the minimax approach: the agent knows its current state and generates a two-layer search tree taking into account all possible moves for itself and the opponent. It backs up values from the leaf nodes and takes the best move, assuming that the opponent will also do so.

• An alternative is to start playing directly against an opponent (which does not have to be perfect, but could be). Assume no prior knowledge or lookahead. Assign “values” to states:

1 for a win
0 for a loss or draw
0.5 for anything else

Page 30:

Notice that 0.5 is arbitrary; it cannot differentiate between good moves and bad moves, so the learner has no guidance initially.

It engages in playing. When the game ends, if it is a win, the value 1 will be propagated backwards. If it is a draw or a loss, the value 0 is propagated backwards. Eventually, earlier states will be labeled to reflect their “true” value.

After several plays, the learner will learn the best move given a state (a policy.)

Page 31:

Issues in generalizing this approach

• How will the state values be initialized or propagated backwards?

• What if there is no end to the game (infinite horizon)?

• This is an optimization problem which suggests that it is hard. How can an optimal policy be learned?

Page 32:

A simple robot domain

[Diagram: offices 0, 1, 2, 3 arranged in a ring]

The robot is in one of four states: 0, 1, 2, 3. Each represents an office, and the offices are connected in a ring.

Three actions are available:
+ moves to the “next” state
- moves to the “previous” state
@ remains at the same state


Page 33:

The robot domain (cont’d)

• The robot can observe the label of the state it is in and perform any action corresponding to an arc leading out of its current state.

• We assume that there is a clock governing the passage of time, and that at each tick of the clock the robot has to perform an action.

• The environment is deterministic: there is a unique state resulting from any initial state and action. (Yes, the diagram on the previous page is a state-transition diagram.)

• Each state has a reward, 10 for state 3, 0 for the others.

Page 34:

Compare three policies

a. Every state is mapped to @.

The value of this policy is 0, because the robot will never get to office 3.

b. Every state is mapped to + (policy 0).

The value of this policy is ∞, because the robot will end up in office 3 infinitely often.

c. Every state except 3 is mapped to +, and 3 is mapped to @ (policy 1).

The value of this policy is also ∞, because the robot will end up in (and stay in) office 3 infinitely often.

Page 35:

Compare three policies (cont’d)

So, it is easy to rule case a out, but how can we show that policy 1 is better than policy 0?

POLICY 1

The average reward per tick for state 0 is 10.

The discounted cumulative reward for state 0 is 2.5.

POLICY 0

The average reward per tick for state 0 is 10/4.

The discounted cumulative reward for state 0 is 1.33.

Page 36:

Discounted cumulative reward

Assume that the robot associates a higher value with more immediate rewards and therefore discounts future rewards.

The discount rate (γ) is a number between 0 and 1 used to discount future rewards.

The discounted cumulative reward for a particular state with respect to a given policy is the sum, for n from 0 to infinity, of γ^n times the reward associated with the state reached after the n-th tick of the clock.

Page 37:

Discounted cumulative reward (cont’d)

Take γ = 0.5.

For state 0 with respect to policy 0:
0.5^0 × 0 + 0.5^1 × 0 + 0.5^2 × 0 + 0.5^3 × 10 + 0.5^4 × 0 + 0.5^5 × 0 + 0.5^6 × 0 + 0.5^7 × 10 + …
= 1.25 + 0.078 + … = 1.33 in the limit

For state 0 with respect to policy 1:
0.5^0 × 0 + 0.5^1 × 0 + 0.5^2 × 0 + 0.5^3 × 10 + 0.5^4 × 10 + 0.5^5 × 10 + 0.5^6 × 10 + 0.5^7 × 10 + …
= 2.5 in the limit
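These two limits are easy to confirm numerically (a long truncated sum stands in for the infinite one):

```python
# Numeric check of the two discounted sums with gamma = 0.5. Under policy 0
# the robot, starting in office 0, collects reward 10 at ticks 3, 7, 11, ...;
# under policy 1 it collects 10 at every tick from tick 3 on.
gamma = 0.5

v_policy0 = sum(gamma ** n * 10 for n in range(3, 200, 4))
v_policy1 = sum(gamma ** n * 10 for n in range(3, 200))

print(round(v_policy0, 2), round(v_policy1, 2))  # 1.33 2.5
```

In closed form these are geometric series: 10 · 0.5^3 / (1 − 0.5^4) = 4/3 and 10 · 0.5^3 / (1 − 0.5) = 2.5.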

Page 38:

Discounted cumulative reward (cont’d)

Let
j be a state,
R(j) be the reward for ending up in state j,
π be a fixed policy,
π(j) be the action dictated by π in state j,
f(j,a) be the next state given that the robot starts in state j and performs action a, and
Vi(j) be the estimated value of state j with respect to the policy π after the i-th iteration of the algorithm.

Using a dynamic programming algorithm, one can obtain a good estimate of Vπ, the value function for policy π, as i → ∞.

Page 39:

A dynamic programming algorithm to compute values for states

1. For each j, set V0(j) to 0.

2. Set i to 0.

3. For each j, set Vi+1(j) to R(j) + γ Vi( f(j, π(j)) ).

4. Set i to i + 1.

5. If i is equal to the maximum number of iterations, then return Vi; otherwise, return to step 3.
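Applied to the four-office robot domain with γ = 0.5 and policy 1 (+ everywhere except office 3, @ in office 3), the iteration converges to the values quoted earlier, including 2.5 for office 0. The encoding below is ours:

```python
# Dynamic-programming value estimation for a fixed policy (steps 1-5 above).
gamma = 0.5
R = [0, 0, 0, 10]                    # reward: 10 in office 3, 0 elsewhere

def f(j, a):
    """State-transition function for the ring of four offices."""
    return {'+': (j + 1) % 4, '-': (j - 1) % 4, '@': j}[a]

policy1 = ['+', '+', '+', '@']       # policy 1 from the earlier slides

V = [0.0] * 4                        # step 1: V0(j) = 0 for each j
for _ in range(50):                  # steps 3-5, with a fixed iteration count
    V = [R[j] + gamma * V[f(j, policy1[j])] for j in range(4)]

print([round(v, 2) for v in V])      # [2.5, 5.0, 10.0, 20.0]
```

Office 3 satisfies V(3) = 10 + 0.5 V(3) = 20, and each step away from it halves the value, which is how the 2.5 for office 0 arises.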

Page 40:

Temporal credit assignment problem

• The problem of assigning credit or blame to the actions in a sequence of actions where feedback is available only at the end of the sequence.

• When you lose a game of chess or checkers, the blame for your loss cannot necessarily be attributed to the last move you made, or even the next-to-last move.

• Dynamic programming solves the temporal credit assignment problem by propagating rewards backwards to earlier states, and hence to actions earlier in the sequence of actions determined by a policy.

Page 41:

Computing an optimal policy

Given a method for estimating the value of states with respect to a fixed policy, it is possible to find an optimal policy. We would like to maximize the discounted cumulative reward.

Policy iteration [Howard, 1960] is an algorithm that uses the algorithm for computing the value of a state as a subroutine.

Page 42:

Policy iteration algorithm

1. Let π0 be an arbitrary policy.

2. Set i to 0.

3. Compute Vπi(j) for each j.

4. Compute a new policy πi+1 so that πi+1(j) is the action a maximizing R(j) + γ Vπi( f(j,a) ).

5. If πi+1 = πi, then return πi; otherwise, set i to i + 1, and go to step 3.
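On the four-office robot domain (our encoding; γ = 0.5, reward 10 in office 3), policy iteration converges in a couple of rounds. Because the offices form a ring, the improved policy sends office 0 backwards (-) straight to office 3:

```python
# Policy iteration (steps 1-5 above) on the four-office ring domain.
gamma = 0.5
R = [0, 0, 0, 10]
ACTIONS = '+-@'

def f(j, a):
    """State-transition function for the ring of four offices."""
    return {'+': (j + 1) % 4, '-': (j - 1) % 4, '@': j}[a]

def value(policy, iters=100):
    """Estimate V for a fixed policy via the dynamic-programming update."""
    V = [0.0] * 4
    for _ in range(iters):
        V = [R[j] + gamma * V[f(j, policy[j])] for j in range(4)]
    return V

policy = ['@'] * 4                                  # step 1: arbitrary start
while True:
    V = value(policy)                               # step 3
    new = [max(ACTIONS, key=lambda a: R[j] + gamma * V[f(j, a)])
           for j in range(4)]                       # step 4: greedy improvement
    if new == policy:                               # step 5: fixed point
        break
    policy = new

print(policy)  # ['-', '+', '+', '@']
```

Note the result beats "policy 1" from the earlier slides: office 0's discounted value rises from 2.5 to 10 by taking the one-step - move to office 3 instead of walking + around the ring.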

Page 43:

Policy iteration algorithm (cont’d)

A policy π is said to be the optimal policy if there is no other policy π′ and state j such that Vπ′(j) > Vπ(j) while Vπ′(k) ≥ Vπ(k) for all other states k.

The policy iteration algorithm is guaranteed to terminate in a finite number of steps with an optimal policy.

Page 44:

Comments on reinforcement learning

• A general model where an agent can learn to function in dynamic environments

• The agent can learn while interacting with the environment

• No prior knowledge except the (probabilistic) transitions is assumed

• Can be generalized to stochastic domains (an action might have several different probabilistic consequences, i.e., the state-transition function is not deterministic)

• Can also be generalized to domains where the reward function is not known

Page 45:

Famous example: TD-Gammon (Tesauro, 1995)

• Learns to play Backgammon

• Immediate reward:
+100 if win
-100 if lose
0 for all other states

• Trained by playing 1.5 million games against itself (several weeks)

• Now approximately equal to best human player (won World Cup of Backgammon in 1992; among top 3 since 1995)

• Predecessor: NeuroGammon [Tesauro and Sejnowski, 1989] learned from examples of labelled moves (very tedious for a human expert)

Page 46:

Other examples

• Robot learning to dock on battery charger

• Pole balancing

• Elevator dispatching [Crites and Barto, 1995]: better than industry standard

• Inventory management [Van Roy et al.]: 10-15% improvement over industry standards

• Job-shop scheduling for NASA space missions [Zhang and Dietterich, 1997]

• Dynamic channel assignment in cellular phones [Singh and Bertsekas, 1994]

• Robotic soccer

Page 47:

Common characteristics

• delayed reward

• opportunity for active exploration

• possibility that the state is only partially observable

• possible need to learn multiple tasks with same sensors/effectors

• there may not be an adequate teacher