Module 3 - Decision Trees and an Introduction to Artificial Neural Networks


    Ing. Leonel D. Rozo C., M.Sc., PhD(c)
    [email protected]

    2010

    2. Appropriate problems for decision tree learning

    Instances are represented by attribute-value pairs - Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).

    The target function has discrete output values - The decision tree assigns a boolean classification (e.g., yes or no) to each example.

    Disjunctive descriptions may be required.

    The training data may contain errors.

    The training data may contain missing attribute values.


    3. The basic decision tree learning algorithm

    1. Which attribute should be tested at the root of the tree?

    2. The best attribute is selected and used as the test at the root node of the tree.

    3. A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node.

    4. The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree (a sketch of this recursive procedure follows the list).
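
    As a minimal Python sketch (not the deck's own code), the four steps above can be written as a recursive function. Examples are assumed to be dicts mapping attribute names to values plus a class label under the key target; best_attribute is a hypothetical helper, one concrete version of which is sketched after section 3.1.

        from collections import Counter

        def id3(examples, attributes, target):
            """Top-down tree growing, following steps 1-4 above."""
            labels = [ex[target] for ex in examples]
            if len(set(labels)) == 1:        # all examples agree: return a leaf
                return labels[0]
            if not attributes:               # nothing left to test: majority leaf
                return Counter(labels).most_common(1)[0][0]
            best = best_attribute(examples, attributes, target)  # hypothetical helper
            tree = {best: {}}
            for value in {ex[best] for ex in examples}:
                subset = [ex for ex in examples if ex[best] == value]
                rest = [a for a in attributes if a != best]
                tree[best][value] = id3(subset, rest, target)    # step 4: recurse
            return tree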


    3.1. Which attribute is the best classifier?

    The central choice in the algorithm is selecting which attribute to test at each node in the tree. We would like to select the attribute that is most useful for classifying examples.


    Entropy measures the homogeneity of examples. We define a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples.

    Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is:

    Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

    where p(+) and p(-) are the proportions of positive and negative examples in S.
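
    A minimal Python sketch of this measure (a sketch, not from the original deck; it also covers more than two classes, since each distinct label contributes its own -p log2 p term):

        import math

        def entropy(labels):
            """Entropy of a collection of class labels, e.g. ['+', '+', '-']."""
            n = len(labels)
            return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                        for c in set(labels))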


    Information gain measures the expected reduction in entropy. Information gain is simply the expected reduction in entropy caused by partitioning the examples according to a given attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as:

    Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)

    where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
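
    Continuing the sketch over the same list-of-dicts representation, along with the best_attribute helper assumed by the id3 sketch earlier:

        def information_gain(examples, attribute, target):
            """Gain(S, A): entropy of S minus the weighted entropy of each S_v."""
            gain = entropy([ex[target] for ex in examples])
            for v in {ex[attribute] for ex in examples}:                     # Values(A)
                s_v = [ex[target] for ex in examples if ex[attribute] == v]  # S_v
                gain -= (len(s_v) / len(examples)) * entropy(s_v)
            return gain

        def best_attribute(examples, attributes, target):
            """The helper assumed by id3: the attribute with the highest gain."""
            return max(attributes, key=lambda a: information_gain(examples, a, target))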


    An illustrative example
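
    For instance, a hypothetical collection S of 14 examples, 9 positive and 5 negative, has

    Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940

    or, using the sketch above:

        labels = ["+"] * 9 + ["-"] * 5    # hypothetical collection [9+, 5-]
        print(entropy(labels))            # prints roughly 0.940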


    4. Issues in decision tree learning

    4.1. Avoiding overfitting the data

    The algorithm described before grows each branch of the tree just deeply enough to perfectly classify the training examples. While this is sometimes a reasonable strategy, in fact it can lead to difficulties when:

    There is noise in the data.

    The number of training examples is too small to produce a representative sample of the true target function.


    A hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).


    There are several approaches to avoiding overfitting in decision tree learning. These can be grouped into two classes:

    Approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.

    Approaches that allow the tree to overfit the data, and then post-prune the tree.


    Reduced error pruning

    Consider each of the decision nodes in the tree to be candidates for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.

    Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.

    Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set.
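
    A hedged sketch of that iterative loop; decision_nodes, replace_with_leaf, majority_label, and accuracy are all hypothetical helpers over id3-style trees, named here only to show the control flow:

        def reduced_error_prune(tree, validation, target):
            """Keep pruning while some prune does no worse on the validation set."""
            while True:
                base = accuracy(tree, validation, target)
                # Every decision node, replaced by a majority-class leaf, is a candidate.
                candidates = [replace_with_leaf(tree, node, majority_label(node))
                              for node in decision_nodes(tree)]
                if not candidates:
                    return tree
                best = max(candidates, key=lambda t: accuracy(t, validation, target))
                if accuracy(best, validation, target) < base:
                    return tree      # no prune performs at least as well: stop
                tree = best          # keep the prune that most increases accuracy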


    Rule post-pruning

    i. Infer the decision tree from the training set.

    ii. Convert the learned tree into an equivalent set of rules.

    iii. Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy (sketched right after this list).

    iv. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
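
    A hedged sketch of step iii for a single rule, modeled as a list of preconditions plus a conclusion; rule_accuracy is a hypothetical helper that estimates the rule's accuracy over a set of examples:

        def prune_rule(preconditions, conclusion, examples):
            """Greedily drop any precondition whose removal improves accuracy."""
            improved = True
            while improved:
                improved = False
                for pre in list(preconditions):
                    trimmed = [p for p in preconditions if p != pre]
                    if (rule_accuracy(trimmed, conclusion, examples) >
                            rule_accuracy(preconditions, conclusion, examples)):
                        preconditions = trimmed    # the generalized rule is better
                        improved = True
            return preconditions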


    4.2. Incorporating continuous-valued attributes

    This can be accomplished by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.

    In particular, for an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute A_c that is true if A < c and false otherwise.
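
    One common way to pick the cut point c, sketched under the same assumptions as the earlier code: try midpoints between consecutive sorted values of A and keep the one whose derived boolean attribute yields the highest information gain.

        def best_threshold(examples, attribute, target):
            """Return the cut c maximizing the gain of the attribute A < c."""
            values = sorted({ex[attribute] for ex in examples})
            best_c, best_gain = None, -1.0
            for lo, hi in zip(values, values[1:]):
                c = (lo + hi) / 2                # candidate threshold: a midpoint
                derived = [dict(ex, A_c=(ex[attribute] < c)) for ex in examples]
                g = information_gain(derived, "A_c", target)
                if g > best_gain:
                    best_c, best_gain = c, g
            return best_c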


    Many tasks involving intelligence or pattern recognition are extremely difficult to automate, but appear to be performed very easily by animals.

    For instance, animals recognize various objects and make sense out of the large amount of visual information in their surroundings, apparently requiring very little effort.


    2. History of neural networks

    The amount of activity at any given point in the brain cortex is the sum of the tendencies of all other points to discharge into it, such tendencies being proportionate (William James):

    1. To the number of times the excitement of other points may have accompanied that of the point in question.

    2. To the intensities of such excitements.

    3. To the absence of any rival point functionally disconnected with the first point, into which the discharges may be diverted.


    1954: Gabor invented the "learning filter" that uses gradient descent to obtain optimal weights that minimize the MSE between the observed output signal and a signal generated based upon the past information.

    1958: Rosenblatt invented the perceptron, introducing a learning method for the McCulloch and Pitts neuron model.

    1960: Widrow and Hoff introduced the Adaline.

    1961: Rosenblatt proposed the backpropagation scheme for training multilayer networks.

    1969: The limits of simple perceptrons were demonstrated.


    3. Structure and function of a single neuron

    3.1. Biological neurons

    A typical biological neuron is composed of a cell body, a tubular axon, and a multitude of hair-like dendrites.


    The small gap between an end bulb and a dendrite is called a synapse, across which information is propagated. The axon of a single neuron forms synaptic connections with many other neurons.


    Inhibitory or excitatory signals from other neurons are transmitted to a neuron at its dendrites' synapses. The magnitude of the signal received by a neuron (from another) depends on the efficiency of the synaptic transmission.

    The cell membrane becomes electrically active when sufficiently excited by the neurons making synapses onto this neuron.

    A neuron will fire if sufficient signals from other neurons fall upon its dendrites in a short period of time, called the period of latent summation.


    3.2. Artificial neuron models

    The position on the neuron (node) of the incoming synapse (connection) is irrelevant.

    Each node has a single output value, distributed to other nodes via outgoing links, irrespective of their positions.

    All inputs come in at the same time or remain activated at the same level long enough for the computation of f to occur.


    The next level of specialization is to assume that different weighted inputs are summed.
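
    In code this summing node is a one-liner; a minimal sketch (the bias term is an assumed, though common, convention):

        def net_input(inputs, weights, bias=0.0):
            """Weighted sum of a node's inputs: the quantity passed to f."""
            return sum(w * x for w, x in zip(weights, inputs)) + bias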


    Now it is necessary to establish which function f the neuron has:

    Ramp functions

    Step functions


    Sigmoid functions

    Piecewise linear and Gaussian functions
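
    Hedged Python sketches of these node functions; the particular parameterizations are one common choice, not necessarily the ones pictured in the original deck:

        import math

        def step(net, threshold=0.0):
            """Step function: output jumps from 0 to 1 at the threshold."""
            return 1.0 if net >= threshold else 0.0

        def ramp(net, lo=0.0, hi=1.0):
            """Ramp (piecewise linear): linear between lo and hi, clipped outside."""
            return min(max((net - lo) / (hi - lo), 0.0), 1.0)

        def sigmoid(net):
            """Sigmoid: smooth, bounded, S-shaped."""
            return 1.0 / (1.0 + math.exp(-net))

        def gaussian(net, mu=0.0, sigma=1.0):
            """Gaussian: the response peaks when the net input is near mu."""
            return math.exp(-((net - mu) ** 2) / (2 * sigma ** 2))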


    4. Neural net architectures

    A single node is insufficient for many practical problems, and networks with a large number of nodes are frequently used. The way nodes are connected determines how computations proceed and constitutes an important early design decision by a neural network developer.

    Fully connected networks


    Layered networks

    Acyclic networks


    Feedforward networks

    Modular networks


    5. Neural learning

    Correlation learning

    When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
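
    This postulate is commonly formalized as the Hebbian rule delta_w = eta * x * y; a minimal sketch, with the learning rate eta as an assumed parameter:

        def hebbian_update(w, x, y, eta=0.1):
            """Correlation learning: the weight between input x and output y
            grows in proportion to how strongly the two are active together."""
            return w + eta * x * y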


    Competitive learning

    Another principle for neural computation is that when an input pattern is presented to a network, different nodes compete to be "winners" with high levels of activity. The competitive process involves self-excitation and mutual inhibition among nodes, until a single winner emerges.

    The connections between input nodes and the winner node are then modified, increasing the likelihood that the same winner continues to win in future competitions.

    The converse of competition is cooperation, found in some neural network models.
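
    A hedged winner-take-all sketch: rather than simulating the self-excitation and mutual-inhibition dynamics, it declares the node whose weight vector lies closest to the input the winner, then moves that vector toward the input so similar inputs keep winning there:

        def competitive_update(weight_vectors, x, eta=0.1):
            """Pick the winning node and strengthen its connections to x."""
            def dist2(w):
                return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            winner = min(range(len(weight_vectors)),
                         key=lambda j: dist2(weight_vectors[j]))
            weight_vectors[winner] = [wi + eta * (xi - wi)
                                      for wi, xi in zip(weight_vectors[winner], x)]
            return winner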


    Feedback-based weight adaptation

    If increasing a particular weight leads to diminished performance or larger error, then that weight is decreased as the network is trained to perform better.

    The amount of change made at every step is very small in most networks, to ensure that a network does not stray too far from its partially evolved state, and so that the network withstands some mistakes made by the teacher, feedback, or performance evaluation mechanism.
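
    A hedged sketch of this idea for a single weight; error_fn is a hypothetical evaluation of the network's error for a given weight vector, and the step is kept small for exactly the reason given above:

        def adapt_weight(i, weights, error_fn, step=0.01):
            """Nudge weight i in whichever direction reduces the error."""
            base = error_fn(weights)
            trial = list(weights)
            trial[i] += step
            if error_fn(trial) > base:   # increasing it hurt: decrease it instead
                trial[i] -= 2 * step
            return trial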


    6. What can neural networks be used for?

    Classification


    Clustering

    Clustering requires grouping together objects that are similar to each other.


    Pattern association

    In pattern association, another important task that can be performed by neural networks, the presentation of an input sample should trigger the generation of a specific output pattern.


    Function approximation

    Many computational models can be described as functions mapping some numerical input vectors to numerical outputs. The outputs corresponding to some input vectors may be known from training data, but we may not know the mathematical function describing the actual process that generates the outputs from the input vectors.


    Forecasting

    There are many real-life problems in which future events must be predicted on the basis of past history. An example task is that of predicting the behavior of stock market indices.


    Control applications

    Control addresses the task of determining the values for input variables in order to achieve desired values for output variables.


    7. Evaluation of networks

    Quality of results

    The performance of a neural network is frequently gauged in terms of an error measure.

    Euclidean distance

    Manhattan or Hamming distance

    In classification problems, another possible error measure is the fraction of misclassified samples.
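
    Minimal sketches of these measures, for a desired output vector d and an observed output vector o:

        import math

        def euclidean(d, o):
            """Euclidean distance between desired and observed outputs."""
            return math.sqrt(sum((di - oi) ** 2 for di, oi in zip(d, o)))

        def manhattan(d, o):
            """Manhattan distance; on binary vectors this is the Hamming distance."""
            return sum(abs(di - oi) for di, oi in zip(d, o))

        def misclassified_fraction(desired, predicted):
            """Fraction of misclassified samples, for classification problems."""
            return sum(d != p for d, p in zip(desired, predicted)) / len(desired)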


    8. Real applications of neural networks
