
Neural Network and Deep Learning

Donglin Zeng, Department of Biostatistics, University of North Carolina


Early history of deep learning

- Deep learning dates back to the 1940s: it was known as cybernetics in the 1940s-60s, connectionism in the 1980s-90s, and under the current name starting in 2006.

- Deep learning has also gone by the name artificial neural networks (ANNs), one of the machine learning methods originally aimed at understanding brain function (although it has little to do with actual biological function; Hinton and Shallice, 1991). The first wave consisted of simple linear models relating input to output (McCulloch and Pitts, 1943), from a neuroscientific perspective.


Early history: continued

- Deep learning gradually deviated from neuroscience due to limited knowledge about brain function, but it retains the common belief that much of the mammalian brain might solve different tasks using a single algorithm (true or false?), including language processing, vision, motion planning, and speech recognition.

- The second wave of deep learning (neural networks) started in the 1980s and lasted until the mid-1990s, driven by the emerging movement called connectionism (parallel distributed processing). However, kernel machines (SVMs) and graphical models later became dominant.


Recent development in deep learning

- The explosive revival of deep learning started with Hinton et al. (2006), which outperformed other machine learning methods.

- Why deep learning has become more exciting than ever:
  – huge amounts of training data, especially data with repetitive structures (images, speech), have lessened the concern about statistical generalizability, the point most criticized by the statistical learning community;
  – more powerful computers and better software infrastructure enable computation with highly complex models containing parallelizable components.


Deep learning applications

- Deep learning has been successful in recent applications:
  – In the ImageNet Large Scale Visual Recognition Challenge of 2012, it improved the top-5 error rate from 26.1% to 15.3%, and further down to 3.6% in 2015.
  – It also had a dramatic impact on speech recognition, resulting in a sudden drop in error rates, with some cut in half.
  – It showed superhuman performance in traffic sign classification and image segmentation.
  – It has been combined with reinforcement learning to reach human-level performance (e.g., DeepMind).

- Deep learning has also been used successfully in other applications, e.g., predicting how molecules interact, searching for subatomic particles, and automatically parsing microscope images to construct a 3-D brain map.


Neural Networks

- Neural networks go by other names: deep feedforward networks, feedforward neural networks, or multilayer perceptrons (MLPs).

- They are feedforward because information flows through the functions from input X to output Y, and there are no feedback connections.

- A typical network is

$$X \to f^{(1)}(X) \to f^{(2)}(f^{(1)}(X)) \to \cdots \to Y.$$

- $f^{(1)}$ is the first layer and $f^{(2)}$ the second layer; the overall length of the chain is the depth of the model; all the layers between X and Y are hidden layers (see the sketch below).

- The architecture of an ANN is closely related to a directed acyclic graph or a structural equation model.
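
As a concrete illustration of this chain, here is a minimal numpy sketch composing two layers; the layer sizes, the tanh activation, and the random weights are arbitrary choices for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b, activation):
    """Build a layer function z -> activation(W z + b)."""
    return lambda z: activation(W @ z + b)

# Hypothetical sizes: 4 inputs -> 3 hidden units -> 1 output.
f1 = layer(rng.normal(size=(3, 4)), np.zeros(3), np.tanh)      # f^(1): first layer
f2 = layer(rng.normal(size=(1, 3)), np.zeros(1), lambda z: z)  # f^(2): linear output

x = rng.normal(size=4)
y = f2(f1(x))  # information flows forward only: X -> f^(1) -> f^(2) -> Y
```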


ANN example

[Figure: diagram of an example ANN architecture]


Single-layer neural network

- Let $Z_1, \ldots, Z_m$ be the variables in the hidden layer (hidden units), with $Z_k = h_k(X)$.

- Furthermore, we use $g(Z_1, \ldots, Z_m)$ to predict Y.
- Question: how do we specify the link functions (also called activation functions) $h_1, \ldots, h_m$ and g?
- The key motivation: we want these functions to be simple and conveniently computed, since, by increasing the number of Z's, simple functions are sufficient to approximate any possibly nonlinear relationship between X and Y.

- Universal approximation theorem: a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function can approximate any Borel measurable function, provided that the network is given enough hidden units.
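
The theorem is nonconstructive, but a small numpy sketch conveys the flavor: fix randomly placed sigmoid hidden units and fit only the linear output layer by least squares; the fit to a smooth target improves as the number of hidden units grows. The target function, unit count, and weight scales below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = np.sin(2 * x)                        # arbitrary smooth target to approximate

m = 50                                   # number of hidden units (assumption)
alpha = rng.normal(scale=3.0, size=m)    # random hidden-layer weights
b = rng.uniform(-3, 3, size=m)           # random hidden-layer biases

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
Z = sigmoid(np.outer(x, alpha) + b)      # 200 x m matrix of hidden-unit values

# Fit only the linear output layer by least squares.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
max_err = np.max(np.abs(Z @ beta - y))   # shrinks as m grows
```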


Activation functions

- Commonly used functions for the h's (implemented in the sketch below):
  – linear function: $h(x) = \beta_0 + \beta^T x$
  – sigmoid function: $h(x) = \max\{0, \min\{1, \beta_0 + \beta^T x\}\}$
  – logit sigmoid function: $h(x) = \exp\{\beta_0 + \beta^T x\} / [1 + \exp\{\beta_0 + \beta^T x\}]$
  – rectified linear function: $h(x) = \max(0, \beta_0 + \beta^T x)$
  – hyperbolic tangent function: $h(x) = \tanh(\beta_0 + \beta^T x)$

- The choice of g can also be one of these link functions, depending on the type of output Y.
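
A minimal sketch of the five activations listed above, applied elementwise to the linear score $u = \beta_0 + \beta^T x$; the Python names are mine, not the slides'.

```python
import numpy as np

def linear(u):        return u
def hard_sigmoid(u):  return np.clip(u, 0.0, 1.0)       # max{0, min{1, u}}
def logit_sigmoid(u): return 1.0 / (1.0 + np.exp(-u))   # exp(u) / [1 + exp(u)]
def relu(u):          return np.maximum(0.0, u)         # rectified linear
def tanh_act(u):      return np.tanh(u)                 # hyperbolic tangent

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # u plays the role of beta0 + beta^T x
for h in (linear, hard_sigmoid, logit_sigmoid, relu, tanh_act):
    print(h.__name__, h(u))
```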


Network architecture

- One main feature of the architecture is characterized by depth (the number of hidden layers) and width (the number of hidden units in each layer).

- Empirically, greater depth results in better performance compared to a shallow network with large width.

- Another feature of the architecture is how to connect two neighboring layers:
  – the input layer can be connected to a subset of the units in the output layer;
  – connection coefficients can be shared across some units in the input layer (convolutional networks);
  – the advantage over a fully connected network is the reduced number of parameters and less computation.

- Remark: the architecture of a network is task-specific!


Computation algorithm for ANN

- Despite the complexity of the network architecture, estimates of the parameters can be obtained effectively, due to the simplicity of the activation functions and the so-called forward- and backward-propagation algorithm.

- Suppose $Z_k = \sigma_k(X^T \alpha_k)$, $k = 1, \ldots, m$, and

$$E[Y \mid X] = g(\beta_1 Z_1 + \cdots + \beta_m Z_m + \beta_0) \equiv f(X).$$

- Based on n observations, we wish to minimize

$$\sum_{i=1}^n \left\{ Y_i - g\big(\beta_0 + \beta_1 \sigma_1(X_i^T \alpha_1) + \cdots + \beta_m \sigma_m(X_i^T \alpha_m)\big) \right\}^2$$

if Y is continuous, or

$$-\sum_{i=1}^n Y_i \log g\big(\beta_0 + \beta_1 \sigma_1(X_i^T \alpha_1) + \cdots + \beta_m \sigma_m(X_i^T \alpha_m)\big)$$

if Y is binary.
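
A minimal numpy sketch of these two criteria, assuming for simplicity a single sigmoid $\sigma$ shared by all hidden units (the slide allows each $\sigma_k$ to differ); the function names are hypothetical.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def fitted(X, alpha, beta, beta0, sigma=sigmoid, g=sigmoid):
    """f(X_i) = g(beta0 + sum_k beta_k * sigma(X_i^T alpha_k)).
    X is n x p; column k of the p x m matrix alpha holds alpha_k."""
    Z = sigma(X @ alpha)            # n x m matrix of hidden units
    return g(beta0 + Z @ beta)

def squared_error(Y, f):            # criterion for continuous Y (use identity g)
    return np.sum((Y - f) ** 2)

def neg_log_term(Y, f):             # the slide's criterion for binary Y
    return -np.sum(Y * np.log(f))
```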


Gradient-descent algorithm

- The key is to compute the gradient with respect to all parameters.

- At the (r + 1)st iteration, from the chain rule,

$$\beta_k^{(r+1)} = \beta_k^{(r)} - \gamma_r \sum_{i=1}^n \delta_i Z_{ik}, \qquad \alpha_{kl}^{(r+1)} = \alpha_{kl}^{(r)} - \gamma_r \sum_{i=1}^n s_{ik} X_{il},$$

where $\gamma_r$ is the step size of the descent algorithm (called the learning rate), $s_{ik} = \sigma_k'(X_i^T \alpha_k)\, \beta_k \delta_i$, and

$$\delta_i = -2\,\{Y_i - f(X_i)\}\, g'(\beta_1 Z_{i1} + \cdots + \beta_m Z_{im} + \beta_0)$$

for continuous Y, and

$$\delta_i = -\{Y_i / f(X_i)\}\, g'(\beta_1 Z_{i1} + \cdots + \beta_m Z_{im} + \beta_0)$$

for binary Y.
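
A minimal sketch of one such update for the continuous-Y case, assuming a common sigmoid $\sigma$ and an identity output g (so g' = 1); the function name and the array layout (alpha[l, k] playing the role of $\alpha_{kl}$) are my choices.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def gradient_step(X, Y, alpha, beta, beta0, lr):
    """One forward/backward pass; X is n x p, alpha is p x m, beta has length m."""
    Z = sigmoid(X @ alpha)                     # forward pass: n x m hidden units
    f = beta0 + Z @ beta                       # fitted values f(X_i)
    delta = -2.0 * (Y - f)                     # delta_i = -2 (Y_i - f(X_i)), g' = 1
    s = Z * (1.0 - Z) * np.outer(delta, beta)  # s_ik = sigma'(X_i^T alpha_k) beta_k delta_i
    beta_new = beta - lr * Z.T @ delta         # beta_k -= gamma * sum_i delta_i Z_ik
    beta0_new = beta0 - lr * np.sum(delta)
    alpha_new = alpha - lr * X.T @ s           # alpha_kl -= gamma * sum_i s_ik X_il
    return alpha_new, beta_new, beta0_new
```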


Remarks on computation

- The update of the parameters can be carried out in a two-pass algorithm: in the forward pass, we use the current parameters to compute f(·); in the backward pass, we compute $\delta_i$ and then $s_{ik}$.

- Because each hidden unit passes information to, and receives information from, only the units that share a connection with it, this algorithm can be implemented efficiently on a parallel-architecture computer.

- Using the chain rule, we can also develop a recursive computation for the Hessian matrix.


Improving estimation in ANN

- Parameter regularization: L2 or L1 penalization.
- Data augmentation: create fake data and add it to the training sample (bootstrap samples, added noise?). One particularly effective technique for structured data is to introduce perturbations in data augmentation: for example, for object recognition, we translate images by a few pixels in each direction to improve generalization; rotation and scaling are also effective (see the sketch after this list).

- Multitask learning: it assumes that some factors are shared across two or more tasks.

- Early stopping, bagging, or other ensemble methods.
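
A minimal sketch of translation-based augmentation for 2-D images; the helper names and the one-pixel shifts are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np

def translate(image, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, filling exposed pixels with 0."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return shifted

def augment(images, labels, shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Append shifted copies of each image (n x h x w array), keeping the labels."""
    extra = np.array([translate(im, dx, dy) for im in images for dx, dy in shifts])
    extra_labels = np.array([lab for lab in labels for _ in shifts])
    return np.concatenate([images, extra]), np.concatenate([labels, extra_labels])
```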


Alternative optimization algorithms

- stochastic gradient descent (a minimal sketch follows)
- adaptive learning rates
- conjugate gradient method
- Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm
- coordinate descent
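
As one example, stochastic gradient descent simply applies the gradient_step sketch from the gradient-descent slide to random mini-batches instead of the full sample; the batch size and epoch count below are arbitrary.

```python
import numpy as np

def sgd(X, Y, alpha, beta, beta0, lr=0.01, batch=32, epochs=10, seed=0):
    """Mini-batch stochastic gradient descent using gradient_step (defined above)."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    for _ in range(epochs):
        order = rng.permutation(n)                  # fresh random order each epoch
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            alpha, beta, beta0 = gradient_step(X[idx], Y[idx], alpha, beta, beta0, lr)
    return alpha, beta, beta0
```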


Convolution Networks

- They are also known as convolutional neural networks (CNNs).
- They involve convolution and pooling operations in the neural network.
- They have been quite successful at processing data with a grid-like topology, such as time-series data and image data.


Convolution

- Recall the convolution operation

$$s(t) = \int x(a)\, w(t - a)\, da,$$

where w is called the convolution kernel.
- In the discrete version this is $\sum_a x(a)\, w(t - a)$; for a 2-D image,

$$s(i, j) = \sum_{m,n} x(m, n)\, w(i - m, j - n).$$

- Discrete convolution can be viewed as multiplication by a kernel matrix with constrained entries (a Toeplitz matrix, or a doubly block-circulant matrix). A direct implementation of the 2-D sum is sketched below.
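
A direct loop-based numpy sketch of the 2-D formula above, keeping only the "valid" positions where the flipped kernel fits entirely inside the image; the function name is mine. For a sanity check, scipy.signal.convolve2d(x, w, mode='valid') should agree with it up to floating-point error.

```python
import numpy as np

def conv2d(x, w):
    """s(i, j) = sum_{m,n} x(m, n) w(i - m, j - n), 'valid' positions only."""
    kh, kw = w.shape
    H, W = x.shape
    wf = w[::-1, ::-1]   # flip the kernel: true convolution, not cross-correlation
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * wf)
    return out
```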


Features in CNN

- Sparse connectivity: convolution can be treated as another network layer from the input data x to the transformed data s. Since the kernel matrix is much smaller than the input matrix, the connections are sparse.

- Parameter sharing: the parameters for the same input unit are the same for different units in the layer of s.

- Equivariance: if the input changes, the output should change in the same way. For example, if every image pixel is shifted by one unit to the right, the output of the ANN should shift by one pixel too (use this invariance property to improve prediction).


Pooling operation

- It adds a pooling layer that replaces the input layer data with summary statistics of the nearby outputs (for example, the maximum value in a neighborhood, or some weighted average); a max-pooling sketch follows this list.

- Pooling increases robustness to small translations of the input.

- Pooling results in data reduction and thus improves computational efficiency.
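
A minimal max-pooling sketch over non-overlapping windows; the window size and stride are conventional defaults, not values from the slides.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Replace each size x size neighborhood (stepped by `stride`) with its maximum."""
    H, W = x.shape
    out = np.empty(((H - size) // stride + 1, (W - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x))  # 2 x 2 summary: a 4 x 4 input is reduced by a factor of 4
```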


Convolution diagram

[Figure: diagram of the convolution operation]


Pooling diagram

[Figure: diagram of the pooling operation]


Additional deep learning topics

- Recurrent neural networks: the input data arrive in a sequence $X^{(1)}, Y^{(1)}, X^{(2)}, \ldots$, and learning shares parameters across different parts of the model.

- Applications of recurrent neural networks include computer vision, speech recognition, natural language processing, and contextual bandits in reinforcement learning.

- Autoencoders: an autoencoder is a neural network trained to copy its input to its output ($X \to X$), so it is useful for dimension reduction or feature learning. Variants include sparse autoencoders and denoising autoencoders (see the sketch after this list).

- Representation learning
- Graphical models
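
A minimal linear autoencoder sketch (input X, code Z, reconstruction) trained by gradient descent on the squared reconstruction error; the code dimension, learning rate, and names are arbitrary illustrative choices.

```python
import numpy as np

def train_autoencoder(X, k=2, lr=0.01, epochs=500, seed=0):
    """Encoder We (p -> k) and decoder Wd (k -> p) minimizing ||X We Wd - X||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    We = rng.normal(scale=0.1, size=(p, k))   # encoder weights
    Wd = rng.normal(scale=0.1, size=(k, p))   # decoder weights
    for _ in range(epochs):
        Z = X @ We                            # codes: learned low-dim features
        R = Z @ Wd - X                        # reconstruction residual
        grad_Wd = Z.T @ R / n                 # gradients of mean squared error
        grad_We = X.T @ (R @ Wd.T) / n
        Wd -= lr * grad_Wd
        We -= lr * grad_We
    return We, Wd
```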
