k236: basis of data analytics - japan advanced institute ...bao/k236/k236-l10-print.pdf ·...
Post on 16-Mar-2018
K236: Basis of Data Analytics
Lecture 10: Classification and prediction
Neural Networks
Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai
Schedule of K236
1. Introduction to data science (1) 6/9
2. Introduction to data science (2) 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and Examination (the date is not fixed) 7/27
Outline
1. About neural networks
2. Backpropagation algorithms
3. More about classification and prediction methods
Connectionists
Yann LeCun, Geoff Hinton, Yoshua Bengio
How the human brain learns
• A neuron collects signals from others through structures called dendrites.
• The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.
• At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.
A neuron model
[Figure: diagram of an artificial neuron, showing the biological inspiration.] The activation function is a non-linear function applied to the weighted input sum to produce the output of the artificial neuron (in the case of Rosenblatt's Perceptron, the function is just a thresholding operation).
http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/
ALVINN drives 70 mph on highways
[Figure: the ALVINN network. 30×32 pixels as inputs; 4 hidden units, with 30×32 weights into each one of the four hidden units; 30 outputs for steering. Drives 70 mph on a public highway.]
Tom Mitchell, Machine Learning, 1997
Perceptron
$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$
In vector notation:
$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$$
The perceptron is the neural network algorithm for supervised learning of binary classifiers (+1 and −1) [Rosenblatt, 1957].
[Figure: the "Mark 1 perceptron" machine]
How to learn the vector $\vec{w}$ from training data? Two rules:
• Perceptron training rule
• Delta rule (gradient descent)
Perceptron training rule
• Begin with random weights, then iteratively apply the perceptron to each training example, modifying the weights whenever it misclassifies an example.
• Revise the weight $w_i$ associated with $x_i$ by $w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta (t - o) x_i$
  - $t = c(\vec{x})$ is the target output (label of the training example)
  - $o$ is the perceptron output
  - $\eta$ is a small constant (e.g., 0.1) called the learning rate (it moderates the degree to which the weights change)
• One can prove that it converges if the training data is linearly separable and $\eta$ is sufficiently small (the delta rule, based on gradient descent, is designed to handle data that is not linearly separable).
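As an illustration of the training rule, here is a minimal sketch in plain Python. The function names and the AND data set are assumptions made for this example, and the weights start at zero rather than at random values so that the run is reproducible.

```python
def perceptron_output(w, x):
    """Return +1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i to every misclassified example."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # assumption: zeros instead of random initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:
                mistakes += 1
                w[0] += eta * (t - o)  # x_0 is fixed at 1
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if mistakes == 0:  # every example is classified correctly
            break
    return w


# Hypothetical, linearly separable data: logical AND with labels in {+1, -1}.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(AND_DATA)
predictions = [perceptron_output(w, x) for x, _ in AND_DATA]
```

When the example is already classified correctly, t − o is zero, so the rule leaves the weights unchanged; the explicit `if o != t` only makes that visible.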
Gradient descent
Gradient descent is an optimization algorithm to find a local minimum of a function $F(x)$.
If a multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, $-\nabla F(\mathbf{a})$.
It follows that, if $\mathbf{b} = \mathbf{a} - \eta \nabla F(\mathbf{a})$ for $\eta$ small enough, then $F(\mathbf{b}) \le F(\mathbf{a})$.
https://en.wikipedia.org/wiki/Gradient_descent
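The update b = a − η∇F(a) can be tried on a small example. The sketch below is only an illustration: the quadratic function F, its gradient, the step size and the helper names are all assumptions chosen so that the minimum, at (1, −2), is known in advance.

```python
def gradient_descent(grad, a, eta=0.1, steps=200):
    """Repeat a <- a - eta * grad(a) for a fixed number of steps."""
    for _ in range(steps):
        g = grad(a)
        a = [ai - eta * gi for ai, gi in zip(a, g)]
    return a


def F(p):
    """Hypothetical test function with its minimum at (1, -2)."""
    x, y = p
    return (x - 1) ** 2 + 2 * (y + 2) ** 2


def grad_F(p):
    """Gradient of F: (dF/dx, dF/dy)."""
    x, y = p
    return [2 * (x - 1), 4 * (y + 2)]


start = [0.0, 0.0]
one_step = gradient_descent(grad_F, start, steps=1)  # F decreases after one step
minimum = gradient_descent(grad_F, start)            # close to (1, -2)
```

With a step size that is too large the iterates can overshoot and F may increase, which is why the statement requires η to be small enough.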
Gradient descent
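Consider a simpler linear unit, where
$$o = w_0 + w_1 x_1 + \cdots + w_n x_n$$
Let's learn $w_i$'s that minimize the squared error
$$E[\vec{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples.
Gradient:
$$\nabla E[\vec{w}] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$$
Training rule:
$$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$$
i.e.,
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} = -\eta \sum_{d} (t_d - o_d)(-x_{id}) = \eta \sum_{d} (t_d - o_d)\, x_{id}$$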
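The rule Δw_i = η Σ_d (t_d − o_d) x_id for a linear unit can be sketched as a batch update in plain Python. The data set, learning rate and function name below are assumptions for illustration; the targets follow t = 1 + 2x, so the weights should approach w_0 = 1 and w_1 = 2.

```python
def train_linear_unit(examples, eta=0.05, epochs=2000):
    """Batch gradient descent on E[w] = 1/2 * sum_d (t_d - o_d)^2."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in examples:
            xs = (1.0,) + tuple(x)  # x_0 = 1 carries the weight w_0
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi  # accumulate over all d in D
        w = [wi + di for wi, di in zip(w, delta)]
    return w


# Hypothetical training set generated from t = 1 + 2x.
LINE_DATA = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w_line = train_linear_unit(LINE_DATA)
```

Unlike the perceptron rule, the error here uses the unthresholded output o, so the weights keep moving toward the least-squares fit even when the data is not linearly separable.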
Neural networks and non-linearly separable problems
Structure and types of decision regions:
• Single-layer: half plane bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary (complexity limited by the number of nodes)
[Figure: for each structure, the decision regions for classes A and B are drawn for the exclusive-OR problem, for classes with meshed regions, and for the most general region shapes.]
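The exclusive-OR problem cannot be solved by a single threshold unit, but two layers are enough. The sketch below wires such a network by hand; the weights, thresholds and the 0/1 output coding are assumptions chosen for illustration, not values learned from data.

```python
def step(s):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if s > 0 else 0


def xor_net(x1, x2):
    """Two-layer threshold network computing XOR with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)        # hidden unit 1 fires if x1 OR x2
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2 fires if x1 AND x2
    return step(h_or - h_and - 0.5)   # output fires for OR but not AND


xor_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

Each hidden unit draws one line in the input plane; the output unit combines the two half planes into the band between them, which is a convex region.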
Neural networks
• Advantages
  - Prediction accuracy is generally high
  - Robust, works when training objects contain errors
  - Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - Fast evaluation of the learned target function
• Criticism
  - Long training time
  - Difficult to understand the learned function (weights)
  - Not easy to incorporate domain knowledge
• Reborn with advanced computers and deep learning.
A multilayer feed-forward neural network
• The backpropagation algorithm performs learning on a multilayer feed-forward neural network.
• The inputs correspond to the attributes measured for each training object.
• The second layer consists of hidden units.
• The output layer emits the network's prediction for the given sample.
• Weights are associated with the arcs.
[Figure: input layer (x1, x2, …, xn), hidden layer and output layer, with weighted connections between the layers and the unit outputs marked.]
A training sample, X = {x1, …, xn}, is fed to the input layer. Weighted connections exist between each layer, where $w_{ij}$ denotes the weight on the connection between a unit j in one layer and a unit i in the previous layer.
A hidden or output layer unit
The inputs to unit j are the outputs from the previous layer. These are multiplied by their corresponding weights in order to form a weighted sum, which is added to the bias $\theta_j$ associated with unit j. A nonlinear activation function $f$ is applied to the net input.
[Figure: inputs x0, x1, …, xn (outputs from the previous layer) with weights w0j, w1j, …, wnj feed a weighted sum, followed by the nonlinear activation function f, which produces the output.]
$$\text{net input} = \sum_{i=0}^{n} w_{ij} x_i + \theta_j, \qquad \text{output} = f(\text{net input})$$
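The computation of a single hidden or output unit can be sketched as follows. The sigmoid is used here as the nonlinear activation function, which is one common choice; the function name and the numeric values are assumptions for illustration.

```python
import math


def unit_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))


# Illustrative values: three inputs, three weights and a bias for one unit.
# The net input is 0.2*1 + 0.4*0 - 0.5*1 - 0.4 = -0.7.
out = unit_output([1.0, 0.0, 1.0], [0.2, 0.4, -0.5], -0.4)
```

A net input of zero gives an output of exactly 0.5; large positive or negative net inputs push the output toward 1 or 0.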
Defining a network topology
• Before training, the user must decide on the network topology: the number of layers and the number of units in each layer.
• Typically, input values are normalized so as to fall between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one input unit per domain value.
• One output unit may be used to represent two classes. If there are more than two classes, then one output unit per class is used.
• There is no clear rule as to the "best" number of hidden layer units. Network design is a trial-and-error process.
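The two input encodings mentioned above can be sketched in a few lines. Min-max scaling is assumed here as the normalization method (the slide does not name one), and the function names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly so that they fall between 0.0 and 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, domain):
    """One input unit per domain value: 1.0 for the observed value, else 0.0."""
    return [1.0 if value == d else 0.0 for d in domain]


scaled = min_max_normalize([10, 20, 30])
encoded = one_hot("b", ["a", "b", "c"])
```

A discrete attribute with three domain values therefore occupies three input units, so the size of the input layer is fixed by the encoding, not by the raw number of attributes.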
Network training
• The ultimate objective of training
  - Obtain a set of weights that classifies almost all the objects in the training dataset correctly
• Steps
  - Initialize weights with random values
  - Feed the input objects into the network one by one
  - For each unit
    • Compute the net input to the unit as a linear combination of all the inputs to the unit
    • Compute the output value using the activation function
    • Compute the error
Outline
1. About neural networks
2. Backpropagation algorithms
3. More about classification and prediction methods
Backpropagation: basic idea
• Learns by iteratively processing a set of training objects, comparing the network's prediction for each object with its known class label (the difference is the error).
• Weights are modified to minimize the mean squared error. The modifications are made in the "backward" direction: from the output layer, through each hidden layer, down to the first layer.
• Input: training objects (objects), the learning rate $l$, a multilayer feed-forward network (network)
• Output: a neural network trained to classify the objects
Backpropagation
• Initialize the weights
  - As small random numbers (from −1.0 to 1.0, or −0.5 to 0.5)
  - Each unit j is associated with a bias $\theta_j$
• Propagate the inputs forward
  - The input and output of each hidden unit and output unit are computed
  - The input is computed by $I_j = \sum_i w_{ij} O_i + \theta_j$
  - Example of the output computed by the sigmoid function: $O_j = \frac{1}{1 + e^{-I_j}}$
• Backpropagate the error
  - The error is computed and propagated backwards by updating the weights ($w_{ij} = w_{ij} + \Delta w_{ij}$) and biases ($\theta_j = \theta_j + \Delta \theta_j$) to reflect the error of the network's prediction.
• Terminating condition
  - All $\Delta w_{ij}$ in the previous epoch were smaller than some specified threshold, or
  - The percentage of objects misclassified in the previous epoch is below some threshold, or
  - A specified number of epochs has expired
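Back-propagation
[Figure: (1) input vector $x_i$, (2) input nodes, (3) hidden nodes, (4) output nodes, (5) output vector; weights $w_{ij}$ and $w_{jk}$ connect successive layers.]
(1) Input nodes: $O_j = I_j$
(3), (4) Net input: $I_j = \sum_i w_{ij} O_i + \theta_j$
(3), (4) Output: $O_j = \frac{1}{1 + e^{-I_j}}$
(4) Error at an output node: $Err_j = O_j (1 - O_j)(T_j - O_j)$
(3) Error at a hidden node: $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
(3), (4) Weight update: $w_{ij} = w_{ij} + (l)\, Err_j\, O_i$
(3), (4) Bias update: $\theta_j = \theta_j + (l)\, Err_j$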
Back-propagation algorithm (Han's book)
Initialize all weights and biases in network
while stop condition is not satisfied {
  for each training object X {
    /* Propagate inputs forward */
    for each input layer unit j
      $O_j = I_j$   /* output of an input unit is its actual input value */
    for each hidden or output layer unit j {
      $I_j = \sum_i w_{ij} O_i + \theta_j$   /* compute net input of unit j wrt previous layer */
      $O_j = \frac{1}{1 + e^{-I_j}}$ }   /* compute the output of each unit j */
    /* Backpropagate the errors */
    for each unit j in the output layer
      $Err_j = O_j (1 - O_j)(T_j - O_j)$   /* compute the error */
    for each unit j in hidden layers, from the last to the first hidden layer
      $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$   /* compute error wrt next higher layer k */
    for each weight $w_{ij}$ in network {
      $\Delta w_{ij} = (l)\, Err_j\, O_i$   /* weight increment */
      $w_{ij} = w_{ij} + \Delta w_{ij}$ }   /* weight update */
    for each bias $\theta_j$ in network {
      $\Delta \theta_j = (l)\, Err_j$   /* bias increment */
      $\theta_j = \theta_j + \Delta \theta_j$ }   /* bias update */
  }
}
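One pass of the algorithm for a single training object might look like the following plain-Python sketch. The data layout (a list of weight matrices `W[i][j]` and bias lists per layer), the function names, the small 3-2-1 network and its initial values are all assumptions made for illustration; the update formulas follow the pseudocode above.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, weights, biases):
    """Return the outputs O of every layer; outputs[0] is the input itself."""
    outputs = [list(x)]
    for W, theta in zip(weights, biases):
        prev = outputs[-1]
        # W[i][j] is w_ij: from unit i in the previous layer to unit j
        outputs.append([
            sigmoid(sum(W[i][j] * prev[i] for i in range(len(prev))) + theta[j])
            for j in range(len(theta))
        ])
    return outputs


def backprop_step(x, target, weights, biases, l=0.9):
    """Update weights and biases in place for one training object."""
    outputs = forward(x, weights, biases)
    # Output layer: Err_j = O_j (1 - O_j) (T_j - O_j)
    errs = [o * (1 - o) * (t - o) for o, t in zip(outputs[-1], target)]
    for layer in range(len(weights) - 1, -1, -1):
        W, theta, prev = weights[layer], biases[layer], outputs[layer]
        prev_errs = None
        if layer > 0:
            # Hidden layer: Err_i = O_i (1 - O_i) sum_j Err_j w_ij,
            # computed with the weights as they were before this update
            prev_errs = [
                prev[i] * (1 - prev[i])
                * sum(errs[j] * W[i][j] for j in range(len(errs)))
                for i in range(len(prev))
            ]
        for j, err in enumerate(errs):
            for i in range(len(prev)):
                W[i][j] += l * err * prev[i]  # w_ij = w_ij + (l) Err_j O_i
            theta[j] += l * err               # theta_j = theta_j + (l) Err_j
        errs = prev_errs


# Hypothetical 3-2-1 network with illustrative initial values.
weights = [
    [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]],  # input (3 units) -> hidden (2 units)
    [[-0.3], [-0.2]],                        # hidden (2 units) -> output (1 unit)
]
biases = [[-0.4, 0.2], [0.1]]
x, target = [1, 0, 1], [1]
before = forward(x, weights, biases)[-1][0]
backprop_step(x, target, weights, biases)
after = forward(x, weights, biases)[-1][0]
```

After the single update the network output for this object moves toward the target of 1. A full training run would wrap `backprop_step` in a loop over all training objects and epochs until one of the terminating conditions holds.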
Remark on mining with neural networks
• Neural networks effectively address a weakness of the symbolic AI approach in knowledge discovery (growth of the hypothesis space).
• Extracting or making sense of the numeric weights associated with the interconnections of neurons, in order to arrive at a higher level of knowledge, has been and will continue to be a challenging problem.
Deep learning
- Autoencoder
- Recurrent neural networks (RNN)
- Deep generative models
- TensorFlow
- etc.
What drives success in deep learning
• Flexible models
• Data, data and data
• Distributed and parallel computing
Outline
1. About neural networks
2. Backpropagation algorithms
3. More about classification and prediction methods
Kernel methods: the scheme
• Map the data from X into a (high-dimensional) vector space, the feature space F, by applying the feature map φ to the data points x.
• Find a linear (or other easy) pattern in F using a well-known algorithm (that works on the Gram matrix).
• By applying the inverse map, the linear pattern in F can be found to correspond to a complex pattern in X.
• This is done implicitly by making use only of inner products in F (the kernel trick), determined by a kernel function.
[Figure: points x1, x2, …, xn-1, xn in the input space X are mapped by φ to φ(x1), φ(x2), …, φ(xn-1), φ(xn) in the feature space F; the inverse map φ⁻¹ leads back to X.]
Kernel function k: X × X → R, with k(xi, xj) = φ(xi) · φ(xj)
Gram matrix K (n × n) = {k(xi, xj)}; the kernel-based algorithm works on K.
Examples: kernel function and kernel PCA
$$\phi: X = \mathbb{R}^2 \to H = \mathbb{R}^3$$
$$(x_1, x_2) \mapsto (x_1, x_2, x_1^2 + x_2^2)$$
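To make the kernel scheme concrete, the sketch below assumes the feature map (x1, x2) ↦ (x1, x2, x1² + x2²) from R² to R³ and checks that the inner product in feature space can be computed directly from the input points. The function names and sample points are hypothetical.

```python
def phi(x):
    """Assumed feature map from R^2 to R^3."""
    x1, x2 = x
    return (x1, x2, x1 * x1 + x2 * x2)


def kernel(x, y):
    """k(x, y) = phi(x) . phi(y), computed explicitly in feature space."""
    return sum(a * b for a, b in zip(phi(x), phi(y)))


def kernel_closed_form(x, y):
    """Same value without building phi: x.y + |x|^2 * |y|^2."""
    dot = x[0] * y[0] + x[1] * y[1]
    return dot + (x[0] ** 2 + x[1] ** 2) * (y[0] ** 2 + y[1] ** 2)


def gram_matrix(points):
    """Gram matrix K with K[i][j] = k(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(1, 0), (0, 2), (1, 1)]
K = gram_matrix(points)
```

The third feature-space coordinate is the squared distance from the origin, so a circular boundary in X becomes a plane in F; a kernel-based algorithm only ever needs the entries of K, never the mapped points themselves.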
Maximum margin
[Figure: two point markers, one denotes +1 and the other denotes −1, separated by a linear boundary with its margin.]
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).
Support vectors are those data points that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training examples are ignorable.
3. Empirically it works very well.
Probabilistic graphical models: An overview
• A probabilistic graphical model (graphical model) is a way of representing probabilistic relationships between random variables. It brings graph theory and probability theory together in a powerful formalism for multivariate statistical modeling.
• It provides a powerful tool for modeling and solving problems related to uncertainty and complexity.
Probability theory + graph theory
Graphical models: An overview
• Probability theory (about random phenomena: variables, stochastic processes, and events) ensures consistency and provides interface models to data.
• Graph theory (a model of pairwise relations between objects) provides an intuitively appealing interface for humans. "The graphical language allows us to encode in practice: the property that variables tend to interact directly only with very few others" (Koller's book).
• Modularity: a complex system is built by combining simpler parts.
[Figure: the ICU alarm network; node labels include CATECHOL, ARTCO2, VENTALV, DISCONNECT, MINVOLSET, ANAPHYLAXIS, PVSAT, LVFAILURE, HYPOVOLEMIA, CVP and BP.]
Example from the domain of monitoring intensive-care patients: an ICU alarm network with 37 nodes and 509 parameters. ICU: intensive care unit.
Issues:
- Representation
- Learning
- Inference
- Applications
Graphical models: Learning
• Form of the input: fully or partially observable data cases?
• The learning steps:
  - Structure learning: is there an edge between any two nodes?
  - Parameter learning: quantitative dependencies between variables (parameters).
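[Figure: a Bayesian network with nodes Burglary (B), Earthquake (E), Alarm (A), Call (C) and Newscast (N), shown next to a table of data cases over B, E, A, C, N.]
Z. Ghahramani, "Graphical model: Parameter learning"
[Figure: three structures over X1, X2 and Y: the true structure, a structure with a spurious edge, and a structure with a missing edge.]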
Graphical models: Inference
• Computational inference problems
  1. Computing the likelihood of observed data.
  2. Computing the marginal distribution $p(x_A)$ over a particular subset $A \subset V$ of nodes.
  3. Computing the posterior distribution of latent variables.
  4. Computing a mode of the density (i.e., an element $\hat{x}$ in the set $\arg\max_{x \in \mathcal{X}^m} p(x)$).
• Example: What is the most probable disease?
[Figure: a two-layer network with disease nodes connected to symptom nodes.]
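For a very small model, the inference problems listed above can be solved by direct enumeration. The sketch below uses a hypothetical two-node network (one disease variable, one symptom) with made-up probabilities chosen purely for illustration; real networks need more efficient algorithms than enumeration.

```python
# Hypothetical model: P(disease) and P(fever = yes | disease).
# All numbers are invented for this example.
PRIOR = {"flu": 0.1, "cold": 0.2, "none": 0.7}
P_FEVER = {"flu": 0.9, "cold": 0.3, "none": 0.05}


def joint(disease, fever):
    """p(disease, fever) = p(disease) * p(fever | disease)."""
    p = P_FEVER[disease]
    return PRIOR[disease] * (p if fever else 1.0 - p)


def likelihood(fever):
    """Likelihood of the observed symptom: marginalize out the disease."""
    return sum(joint(d, fever) for d in PRIOR)


def posterior(fever):
    """Posterior distribution of the latent disease given the symptom."""
    z = likelihood(fever)
    return {d: joint(d, fever) / z for d in PRIOR}


def most_probable_disease(fever):
    """Mode of the posterior: the most probable disease."""
    post = posterior(fever)
    return max(post, key=post.get)


p_fever = likelihood(True)
diagnosis = most_probable_disease(True)
```

Enumeration sums over every value of the unobserved variables, so its cost grows exponentially with the number of nodes; the graph structure is what lets practical inference algorithms avoid that.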
Graphical models: Instances of graphical models
[Figure: taxonomy of models. Probabilistic models contain graphical models, which divide into directed models (Bayes nets, DBNs) and undirected models (MRFs). Instances shown: hidden Markov model (HMM), naïve Bayes classifier, mixture models, Kalman filter model, conditional random fields, MaxEnt and LDA.]
Murphy, ML for life sciences
Homework
• Use the 'Neural Network' classifier of WEKA (or other tools as you can) to analyze the 'labor' dataset:
  - Compare the results on the original 'labor' data and your 'labor' data after filling the missing values.
  - Compare the results with some different network topologies.
• Hints:
  - Choose 'Functions' then 'MultilayerPerceptron'
  - You can design the network topology by:
    • Adjusting the parameter 'hiddenLayers'
    • Using the GUI