
K236: Basis of Data Analytics
Lecture 10: Classification and prediction

Neural Networks

Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Schedule of K236

1. Introduction to data science (1)  6/9
2. Introduction to data science (2)  6/13
3. Data and databases  6/16
4. Review of univariate statistics  6/20
5. Review of linear algebra  6/23
6. Data mining software  6/27
7. Data preprocessing  6/30
8. Classification and prediction (1)  7/4
9. Knowledge evaluation  7/7
10. Classification and prediction (2)  7/11
11. Classification and prediction (3)  7/14
12. Mining association rules (1)  7/18
13. Mining association rules (2)  7/21
14. Cluster analysis  7/25
15. Review and examination (the date is not fixed)  7/27

Outline

1. About neural networks

2. Backpropagation algorithms

3. More about classification and prediction methods

Connectionists

Yann LeCun, Geoff Hinton, Yoshua Bengio


How the human brain learns

• A neuron collects signals from others through structures called dendrites.

• The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.

• At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.


A neuron model

The diagram shows the biological inspiration. The activation function is a non-linear function applied to the weighted input sum to produce the output of the artificial neuron (in the case of Rosenblatt's Perceptron, the function is just a thresholding operation).

http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/


ALVINN drives 70 mph on highways

[Figure: the ALVINN network: 30×32 pixels as inputs, 30×32 weights into each of the four hidden units, 4 hidden units, and 30 outputs for steering; the system drives at 70 mph on a public highway.]

Tom Mitchell, Machine Learning, 1997

Perceptron

    o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and -1 otherwise

In vector notation:

    o(x) = 1 if w · x > 0, and -1 otherwise

The perceptron is a neural network algorithm for supervised learning of binary classifiers (+1 and -1) [Rosenblatt, 1957].

["Mark 1 perceptron" machine]

How to learn the weight vector w from training data? Two rules:

• Perceptron training rule

• Delta rule (gradient descent)


Perceptron training rule

• Begin with random weights, then iteratively apply the perceptron to each training example, modifying the weights whenever it misclassifies an example.

• Revise the weight w_i associated with input x_i by

    w_i ← w_i + Δw_i,   where   Δw_i = η (t - o) x_i

  - t = c(x) is the target output (the label of the training example)
  - o is the perceptron output
  - η is a small constant (e.g., 0.1) called the learning rate (it moderates the degree to which the weights change)

• One can prove that the rule converges if the training data are linearly separable and η is sufficiently small (the delta rule below, based on gradient descent, also handles data that are not linearly separable).
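As an illustration of the perceptron training rule above, here is a minimal NumPy sketch (not from the slides): the AND data, the learning rate of 0.1, and the epoch limit are illustrative choices, and the threshold w_0 is folded into the weight vector via a constant input.

```python
import numpy as np

# Toy linearly separable data (logical AND), labels in {+1, -1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1], dtype=float)

# Fold the threshold w_0 into the weight vector by adding a constant input x_0 = 1
Xb = np.hstack([np.ones((len(X), 1)), X])

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=Xb.shape[1])   # small random initial weights
eta = 0.1                                      # learning rate

for epoch in range(100):
    errors = 0
    for x, target in zip(Xb, t):
        o = 1.0 if w @ x > 0 else -1.0         # perceptron output
        if o != target:
            w += eta * (target - o) * x        # w_i <- w_i + eta * (t - o) * x_i
            errors += 1
    if errors == 0:                            # converged: every example classified correctly
        break

print("learned weights:", w)
```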

Gradient descent

Gradient descent is an optimization algorithm for finding a local minimum of a function F(x).

If a multi-variable function F(x) is defined and differentiable in a neighborhood of a point a, then F decreases fastest when one moves from a in the direction of the negative gradient of F at a, i.e., -∇F(a).

It follows that if b = a - γ∇F(a) for a step size γ small enough, then F(b) ≤ F(a).

https://en.wikipedia.org/wiki/Gradient_descent
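A minimal sketch of the update b = a - γ∇F(a) on a one-dimensional example; the function F(x) = (x - 3)² + 2, the starting point, and the step size are illustrative choices, not from the lecture.

```python
# Gradient descent on F(x) = (x - 3)^2 + 2, whose gradient is F'(x) = 2 * (x - 3).
def grad_F(x):
    return 2.0 * (x - 3.0)

x = 10.0        # starting point a
gamma = 0.1     # step size
for _ in range(100):
    x = x - gamma * grad_F(x)   # b = a - gamma * grad F(a)

print(x)        # converges toward the minimiser x = 3
```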

Gradient descent

Consider a simpler linear unit, where

    o = w_0 + w_1 x_1 + ... + w_n x_n

Let's learn the w_i's that minimize the squared error

    E(w) = (1/2) Σ_{d ∈ D} (t_d - o_d)²

where D is the set of training examples.

Gradient:

    ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]

Training rule:

    Δw = -η ∇E(w),   i.e.,   Δw_i = -η ∂E/∂w_i

where

    ∂E/∂w_i = Σ_d (t_d - o_d)(-x_{i,d}),   so   Δw_i = η Σ_d (t_d - o_d) x_{i,d}
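A sketch of batch gradient descent for this linear unit under the squared error above; the synthetic regression data and the learning rate are illustrative, and the gradient is scaled by 1/|D| only to keep the step size stable.

```python
import numpy as np

# Synthetic regression data: t = 2*x1 - x2 + 0.5 + noise (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = 2 * X[:, 0] - X[:, 1] + 0.5 + 0.01 * rng.normal(size=100)

Xb = np.hstack([np.ones((len(X), 1)), X])   # constant input for w_0
w = np.zeros(Xb.shape[1])
eta = 0.05

for epoch in range(500):
    o = Xb @ w                              # linear unit outputs o_d
    grad = -(t - o) @ Xb                    # dE/dw_i = sum_d (t_d - o_d)(-x_{i,d})
    w = w - eta * grad / len(X)             # delta rule step, scaled by 1/|D| for stability

print("learned weights (w0, w1, w2):", w)   # approaches (0.5, 2.0, -1.0)
```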

Neural networks and non-linearly separable problems

Structure      | Types of decision regions
Single-layer   | Half plane bounded by a hyperplane
Two-layer      | Convex open or closed regions
Three-layer    | Arbitrary (complexity limited by the number of nodes)

[Figure: for each structure, illustrations of the exclusive-OR problem, classes with meshed regions, and the most general region shapes attainable.]


Neural networks

• Advantages
  - Prediction accuracy is generally high
  - Robust: works when training objects contain errors
  - Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - Fast evaluation of the learned target function

• Criticism
  - Long training time
  - Difficult to understand the learned function (weights)
  - Not easy to incorporate domain knowledge

• Reborn with advanced computers and deep learning.

• The backpropagation algorithm performs learning on a multilayer feed-forward neural network.

• The input units correspond to the attributes measured for the training objects.

• The second layer consists of hidden units.

• The output layer emits the network's prediction for a given sample.

• Weights are associated with the arcs.

A multilayer feed-forward neural network

[Figure: an input layer with units x_1, x_2, ..., x_n, a hidden layer, and an output layer, fully connected by weighted arcs.]

A training sample, X = {x_1, ..., x_n}, is fed to the input layer. Weighted connections exist between each layer, where w_ij denotes the weight of the connection from a unit i in one layer to a unit j in the next layer.

A hidden or output layer unit

The inputs to unit j are the outputs from the previous layer. These are multiplied by their corresponding weights to form a weighted sum, which is added to the bias θ_j associated with unit j. A nonlinear activation function is then applied to the net input.

[Figure: unit j receives inputs x_0, x_1, ..., x_n from the previous layer, weighted by w_0j, w_1j, ..., w_nj; the weighted sum Σ_i w_ij x_i plus the bias θ_j is passed through the (non-linear) activation function f to produce the output.]
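A small sketch of the computation performed by one hidden or output unit, matching the net input Σ_i w_ij x_i plus the bias θ_j followed by a sigmoid activation; the numeric values are illustrative.

```python
import numpy as np

def unit_output(inputs, weights, bias):
    """Net input I_j = sum_i w_ij * x_i + theta_j, squashed by the sigmoid."""
    net = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-net))

# Illustrative values: three inputs coming from the previous layer
inputs  = np.array([1.0, 0.5, -0.2])
weights = np.array([0.2, -0.3, 0.4])
bias    = 0.1
print(unit_output(inputs, weights, bias))
```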

Defining a network topology

• Before training, the user must decide on the network topology: the number of layers and the number of units in each layer.

• Typically, input values are normalized so as to fall between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one input unit per domain value (see the sketch after this list).

• One output unit may be used to represent two classes. If there are more than two classes, then one output unit per class is used.

• There is no clear rule as to the "best" number of hidden layer units. Network design is a trial-and-error process.
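A hedged sketch of the two input-encoding conventions mentioned above: min-max normalization into [0.0, 1.0] for numeric attributes, and one input unit per domain value (one-hot encoding) for discrete attributes. The attribute values are made up for illustration.

```python
import numpy as np

# Min-max normalisation of a numeric attribute into [0.0, 1.0]
ages = np.array([23.0, 35.0, 51.0, 60.0])
ages_norm = (ages - ages.min()) / (ages.max() - ages.min())

# One input unit per domain value of a discrete attribute (one-hot encoding)
colours = ["red", "green", "blue", "green"]
domain = sorted(set(colours))
one_hot = np.array([[1.0 if c == v else 0.0 for v in domain] for c in colours])

print(ages_norm)
print(one_hot)
```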


Network training

• The ultimate objective of training:
  - obtain a set of weights that makes almost all the objects in the training dataset classified correctly

• Steps:
  - Initialize the weights with random values
  - Feed the input objects into the network one by one
  - For each unit:
    • compute the net input to the unit as a linear combination of all the inputs to the unit
    • compute the output value using the activation function
    • compute the error

Outline

1. About neural networks

2. Backpropagation algorithms

3. More about classification and prediction methods

Backpropagation

Basic idea:

• Learns by iteratively processing a set of training objects, comparing the network's prediction for each object with its known class label (the error).

• Weights are modified to minimize the mean squared error. The modifications are made in the "backward" direction: from the output layer, through each hidden layer, down to the first hidden layer.

• Input: the training objects, the learning rate l, a multilayer feed-forward network.

• Output: a neural network trained to classify the objects.

Backpropagation

• Initialize the weights
  - as small random numbers (e.g., from -1.0 to 1.0, or from -0.5 to 0.5)
  - each unit j is associated with a bias θ_j

• Propagate the inputs forward
  - the net input and output of each hidden and output unit are computed
  - the net input is computed by I_j = Σ_i w_ij O_i + θ_j
  - the output is computed, for example, by the sigmoid function O_j = 1 / (1 + e^{-I_j})

• Backpropagate the error
  - the error is computed and propagated backwards by updating the weights (w_ij = w_ij + Δw_ij) and biases (θ_j = θ_j + Δθ_j) to reflect the error of the network's prediction

• Terminating condition
  - all Δw_ij in the previous epoch were sufficiently small, or
  - the percentage of objects misclassified in the previous epoch is below some threshold, or
  - a prespecified number of epochs has expired


Back-propagation

[Figure: (1) the input vector X feeds (2) the input nodes; these connect with weights w_ij to (3) the hidden nodes, which connect with weights w_jk to (4) the output nodes, producing (5) the output vector.]

• Input units (2):   O_j = I_j

• Hidden and output units (3), (4):
    I_j = Σ_i w_ij O_i + θ_j
    O_j = 1 / (1 + e^{-I_j})

• Output units (4):   Err_j = O_j (1 - O_j)(T_j - O_j)

• Hidden units (3):   Err_j = O_j (1 - O_j) Σ_k Err_k w_jk

• Weight and bias updates (3), (4):
    w_ij = w_ij + (l) Err_j O_i
    θ_j = θ_j + (l) Err_j

Back-propagation algorithm (Han's book)

Initialize all weights and biases in the network
while the stopping condition is not satisfied {
  for each training object X {
    /* Propagate the inputs forward */
    for each input layer unit j
      O_j = I_j                              /* the output of an input unit is its actual input value */
    for each hidden or output layer unit j {
      I_j = Σ_i w_ij O_i + θ_j               /* compute the net input of unit j w.r.t. the previous layer */
      O_j = 1 / (1 + e^{-I_j}) }             /* compute the output of each unit j */
    /* Backpropagate the errors */
    for each unit j in the output layer
      Err_j = O_j (1 - O_j)(T_j - O_j)       /* compute the error */
    for each unit j in the hidden layers, from the last to the first hidden layer
      Err_j = O_j (1 - O_j) Σ_k Err_k w_jk   /* compute the error w.r.t. the next higher layer k */
    for each weight w_ij in the network {
      Δw_ij = (l) Err_j O_i                  /* weight increment */
      w_ij = w_ij + Δw_ij }                  /* weight update */
    for each bias θ_j in the network {
      Δθ_j = (l) Err_j                       /* bias increment */
      θ_j = θ_j + Δθ_j }                     /* bias update */
  } }

Remark on mining with neural networks

• Neural networks effectively address a weakness of the symbolic AI approach in knowledge discovery (the growth of the hypothesis space).

• Extracting or making sense of the numeric weights associated with the interconnections of neurons, so as to come up with a higher level of knowledge, has been and will continue to be a challenging problem.

Deep learning

• Autoencoders
• Recurrent neural networks (RNN)
• Deep generative models
• TensorFlow
• etc.

What drives success in deep learning:
• Flexible models
• Data, data and data
• Distributed and parallel computing


Outline

1. About neural networks

2. Backpropagation algorithms

3. More about classification and prediction methods

" Map the data from X into a (high-dimensional) vector space, the feature space F, by applying the feature map f on the data points x.

" Find a linear (or other easy) pattern in F using a well-known algorithm (that works on the Gram matrix).

" By applying the inverse map, the linear pattern in F can be found to correspond to a complex pattern in X.

" This implicitly done by only making use of inner products in F (kernel trick) determined by a kernel function

Kernel methods: the scheme

x1 x2

…xn-1 xn

!(x)!(x1)

!(x2)

!(xn-1)!(xn)

...

inverse map !-1

k(xi,xj) = !(xi).!(xj)

Input3space3X Feature3space3F

kernel function k: XxX # R kernel-based algorithm on KGram matrix Knxn= {k(xi,xj)} -
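A sketch of the kernel trick in action for the degree-2 polynomial kernel k(x, y) = (x·y)², together with the standard explicit quadratic feature map on R²; both the kernel and the five random points are illustrative choices, used only to show that the Gram matrix can be built without ever computing φ.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map on R^2 (standard variant of the map on the next slide)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def k(x, y):
    """Polynomial kernel of degree 2: k(x, y) = (x . y)^2, evaluated in input space."""
    return float(np.dot(x, y)) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                     # five illustrative points in the input space

# Gram matrix built from the kernel alone (no explicit features needed)
K = np.array([[k(xi, xj) for xj in X] for xi in X])

# The same matrix obtained via explicit inner products in the feature space F
Phi = np.array([phi(xi) for xi in X])
K_explicit = Phi @ Phi.T

print(np.allclose(K, K_explicit))               # True: the kernel trick in action
```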

Examples: kernel function and kernel PCA

    Φ: X = R^2 → H = R^3,   (x_1, x_2) ↦ (x_1^2, x_2^2, x_1 x_2 + x_2 x_1)

Maximum margin

[Figure: linearly separable data, with dots denoting the +1 and -1 examples, separated by the hyperplane f(x, w, b) = sign(w·x + b).]

• The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).

• Support vectors are those data points that the margin pushes up against.

1. Maximizing the margin is good according to intuition and PAC theory.

2. It implies that only the support vectors are important; the other training examples are ignorable.

3. Empirically it works very, very well.
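A hedged sketch using scikit-learn's SVC with a linear kernel on toy data: the large C value approximates the hard-margin LSVM described above, and the support_vectors_ attribute exposes the support vectors. The data points are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, labels in {+1, -1} (illustrative)
X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.5], [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)     # a large C approximates the hard-margin LSVM
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("decision function f(x) = sign(w.x + b) with w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
```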


Probabilistic graphical models: an overview

• A probabilistic graphical model (graphical model) is a way of representing probabilistic relationships between random variables (it brings graph theory and probability theory together in a powerful formalism for multivariate statistical modeling).

• It provides a powerful tool for modeling and solving problems related to uncertainty and complexity:

    Probability theory + Graph theory

Graphical models: an overview

• Probability theory (about random phenomena: variables, stochastic processes, and events) ensures consistency and provides interface models to data.

• Graph theory (a model of pairwise relations between objects) gives an intuitively appealing interface for humans. "The graphical language allows us to encode in practice the property that variables tend to interact directly only with very few others." (Koller's book)

• Modularity: a complex system is built by combining simpler parts.

[Figure: the ALARM network, an example from the domain of monitoring intensive-care patients: an ICU alarm network with 37 nodes and 509 parameters (nodes include PULMEMBOLUS, INTUBATION, VENTMACH, HYPOVOLEMIA, LVFAILURE, CATECHOL, BP, and others). ICU: intensive care unit.]

Issues:
• Representation
• Learning
• Inference
• Applications

Graphical models: learning

• From the input: fully or partially observable data cases?

• The learning steps:
  - Structure learning: is there an edge between any two nodes?
  - Parameter learning: the quantitative dependencies between variables (the parameters).

[Figure: a small Bayes net with nodes Burglary, Earthquake, Alarm, Newscast, and Call, shown together with a table of observed cases over the variables B, E, A, C, N.]

Z. Ghahramani, "Graphical model: Parameter learning"

[Figure: structure learning illustrated on three variables X1, X2, Y: the true structure, a structure with a spurious edge, and a structure with a missing edge.]

Graphical models: inference

• Computational inference problems:
  1. Computing the likelihood of observed data.
  2. Computing the marginal distribution p(x_A) over a particular subset A ⊆ V of nodes.
  3. Computing the posterior distribution of latent variables.
  4. Computing a mode of the density (i.e., an element x̂ in the set argmax_{x ∈ X^m} p(x)).

• Example: What is the most probable disease?

[Figure: a bipartite network linking diseases to symptoms.]
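A minimal sketch of exact inference by enumeration on a tiny Burglary-Earthquake-Alarm fragment like the network sketched earlier; all conditional probabilities are made-up illustrative numbers, and the marginal and posterior are computed by summing the joint distribution.

```python
from itertools import product

# Illustrative (made-up) parameters for a tiny Bayes net: B -> A <- E
P_B = {1: 0.01, 0: 0.99}
P_E = {1: 0.02, 0: 0.98}
P_A_given = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)

def joint(b, e, a):
    pa = P_A_given[(b, e)]
    return P_B[b] * P_E[e] * (pa if a == 1 else 1.0 - pa)

# Marginal P(A = 1): sum the joint over the latent variables B and E
p_alarm = sum(joint(b, e, 1) for b, e in product((0, 1), repeat=2))

# Posterior P(B = 1 | A = 1) by enumeration: joint over E, divided by the marginal
p_b_given_a = sum(joint(1, e, 1) for e in (0, 1)) / p_alarm

print(p_alarm, p_b_given_a)
```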


Graphical models: instances of graphical models

[Figure: a taxonomy of probabilistic graphical models. Directed models (Bayes nets) include dynamic Bayesian networks (DBNs), the hidden Markov model (HMM), the naive Bayes classifier, mixture models, the Kalman filter model, and LDA; undirected models (Markov random fields, MRFs) include conditional random fields and MaxEnt models.]

Murphy, ML for life sciences

Homework

• Use the 'Neural Network' classifier of WEKA (or other tools if you can) to analyze the 'labor' dataset:
  - Compare the results on the original 'labor' data and on your 'labor' data after filling in the missing values.
  - Compare the results with some different network topologies.

• Hints:
  - Choose 'Functions', then 'MultilayerPerceptron'.
  - You can design the network topology by:
    • adjusting the parameter 'hiddenLayers'
    • using the GUI
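If WEKA is not at hand, one of the "other tools" could be scikit-learn; a hedged sketch of the topology-comparison part of the exercise is below. MLPClassifier's hidden_layer_sizes plays the role of WEKA's 'hiddenLayers' parameter, and the breast-cancer dataset is only a stand-in for the 'labor' ARFF file, which would need to be loaded and have its missing values filled first.

```python
from sklearn.datasets import load_breast_cancer     # stand-in for the 'labor' dataset
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# Compare a few network topologies, analogous to adjusting WEKA's 'hiddenLayers' parameter
for hidden in [(5,), (10,), (10, 5)]:
    clf = make_pipeline(
        MinMaxScaler(),                              # normalize inputs into [0.0, 1.0]
        MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=0),
    )
    scores = cross_val_score(clf, X, y, cv=10)       # 10-fold cross-validated accuracy
    print(hidden, scores.mean().round(3))
```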