Neural networks. Overview
Oleksandr Baiev, PhD
Senior Engineer
Samsung R&D Institute Ukraine
Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
Canonical/Typical tasks
Solutions in general
$x_j = (x_1, x_2, x_3, x_4, \dots, x_i, \dots)$, $x_j \in X$
$y_j = (y_1, y_2, \dots, y_k, \dots)$, $y_j \in Y$ (here $j$ is the index of the sample in the dataset)
$F: X \to Y$

Classification (one-hot targets):
$y_1 = (1,0,0)$ sample of class “0”
$y_2 = (0,0,1)$ sample of class “2”
$y_3 = (0,1,0)$ sample of class “1”
$y_4 = (0,1,0)$ sample of class “1”

Regression (scalar targets):
$y_1 = 0.3$, $y_2 = 0.2$, $y_3 = 1.0$, $y_4 = 0.65$
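The one-hot classification targets above can be produced mechanically from integer class labels. A minimal NumPy sketch (the function name `one_hot` is illustrative, not from the slides):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot target vectors."""
    targets = np.zeros((len(labels), num_classes))
    targets[np.arange(len(labels)), labels] = 1.0  # one "1" per row
    return targets

# Classes of the four samples from the slide: 0, 2, 1, 1.
y = one_hot([0, 2, 1, 1], num_classes=3)
```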
What are artificial neural networks? Is it biology?
Simulating biological neural networks (synapses, axons, chains, layers, etc.) is a good abstraction for understanding the topology.
A biological NN is only inspiration and illustration. Nothing more!
What are artificial neural networks? Let’s imagine a black box!
inputs, params → F → outputs
General form: $outputs = F(inputs, params)$
Steps: 1) choose the “form” of $F$; 2) find the params.
What are artificial neural networks? It’s simple math!
Output of the $i$-th neuron:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i$
$y_i = f(s_i)$
where the $w_{ij}$ and $b_i$ are the free parameters and $f$ is the activation function.
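The per-neuron formula translates directly to code. A minimal NumPy sketch (the sigmoid default matches the activation used on the following slides; `neuron_output` is an illustrative name):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def neuron_output(x, w, b, f=sigmoid):
    """y_i = f(s_i) with s_i = sum_j w_ij x_j + b_i."""
    s = np.dot(w, x) + b  # weighted sum plus bias
    return f(s)

y = neuron_output(x=np.array([1.0, 2.0]), w=np.array([0.5, -0.5]), b=0.0)
```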
What are artificial neural networks? It’s simple math!
activation: $y = f(wx + b) = \mathrm{sigmoid}(wx + b)$
What are artificial neural networks? It’s simple math!
$n$ inputs, $m$ neurons in the hidden layer.

Output of the $i$-th neuron:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i$, $y_i = f(s_i)$

Output of the $k$-th layer:
1) $S_k = W_k X_k + B_k = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix}_k \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}_k + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}_k$
2) $Y_k = f_k(S_k)$, with $f_k$ applied element-wise.

Form of $F$: a superposition of such functions (cf. the Kolmogorov-Arnold superposition theorem).
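In matrix form a whole layer is one multiply-add plus an element-wise nonlinearity. A sketch with the shapes from the slide (W is m×n, X has n entries, B has m entries; tanh stands in here for a generic f):

```python
import numpy as np

def layer_forward(X, W, B, f=np.tanh):
    """S_k = W_k X_k + B_k, then Y_k = f_k(S_k) applied element-wise."""
    S = W @ X + B
    return f(S)

# m = 3 neurons, n = 2 inputs: W is (3, 2), X is (2,), B is (3,).
W = np.zeros((3, 2))
B = np.zeros(3)
X = np.ones(2)
Y = layer_forward(X, W, B)
```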
How to find the parameters W and B?
Supervised learning.
Training set (pairs of variables and responses): $\{(X; Y)_i\},\ i = 1..N$
Find: $(W^*, B^*) = \operatorname{argmin}_{W,B} L(F(X), Y)$

Cost function (loss, error):
logloss: $L(F(X), Y) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log f_{i,j}$, where $y_{i,j}$ is “1” if the $i$-th sample is of class $j$, else “0”, and the outputs are previously scaled: $f_{i,j} = f_{i,j} / \sum_j f_{i,j}$
rmse: $L(F(X), Y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (F(X_i) - Y_i)^2}$

These are just examples. The cost function depends on the problem (classification, regression) and on domain knowledge.
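Both cost functions are a few lines of NumPy. A sketch under the slide’s conventions (each row of F holds the scaled per-class outputs for one sample; function names are illustrative):

```python
import numpy as np

def logloss(F, Y):
    """Cross-entropy: -(1/N) sum_i sum_j y_ij * log f_ij."""
    return -np.mean(np.sum(Y * np.log(F), axis=1))

def rmse(F, Y):
    """Root mean squared error over N samples."""
    return np.sqrt(np.mean((F - Y) ** 2))
```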
Training, or the optimization algorithm
So, we have the model cost $L$ (the error of prediction),
and we want to update the weights in order to minimize $L$:
$w^* = w + \alpha \Delta w$
In accordance with gradient descent: $\Delta w = -\nabla L$
This is clear for a network with only one layer (we have predicted outputs and targets, so we can evaluate $L$).
But how do we find $\Delta w$ for the hidden layers?
Meet “Error Back Propagation”
Find $\Delta w$ for each layer, from the last to the first, as the influence of the weights on the cost:
$\Delta w_{i,j} = \frac{\partial L}{\partial w_{i,j}}$
and, by the chain rule:
$\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial f_j} \frac{\partial f_j}{\partial s_j} \frac{\partial s_j}{\partial w_{i,j}}$
Error Back Propagation: details
$\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial f_j} \frac{\partial f_j}{\partial s_j} \frac{\partial s_j}{\partial w_{i,j}}$, so define $\delta_j = \frac{\partial L}{\partial f_j} \frac{\partial f_j}{\partial s_j}$:
$\delta_j = L'(F(X), Y) \, f'(s_j)$ for the output layer
$\delta_j = \left( \sum_{l \in \text{next layer}} \delta_l w_{j,l} \right) f'(s_j)$ for the hidden layers
$\Delta w_{i,j} = -\alpha \, \delta_j \, x_i$
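The two delta rules can be sketched directly. A minimal version assuming sigmoid activations (so $f'(s) = f(s)(1 - f(s))$); the function names are illustrative:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def hidden_deltas(s, W_next, delta_next):
    """delta_j = (sum_l delta_l * w_jl) * f'(s_j) for one hidden layer.

    s          : pre-activations of this layer, shape (m,)
    W_next     : weights of the next layer, shape (m_next, m)
    delta_next : deltas of the next layer, shape (m_next,)
    """
    f_prime = sigmoid(s) * (1.0 - sigmoid(s))  # sigmoid derivative
    return (W_next.T @ delta_next) * f_prime

def weight_update(delta, x, alpha):
    """Delta w_ij = -alpha * delta_j * x_i, as an outer product."""
    return -alpha * np.outer(delta, x)
```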
Gradient Descent in real life
Recall gradient descent: $w^* = w + \alpha \Delta w$
$\alpha$ is a “step” coefficient; in ML terms, the learning rate. Typical: $\alpha = 0.01..0.1$
Recall the cost function: $L = \frac{1}{N} \sum_{i=1}^{N} \dots$
It sums over all samples, and what if $N = 10^6$ or more?
GD modification: update $w$ for each sample.
Gradient Descent: Stochastic & Minibatch
“Batch” GD ($L$ over the full set): needs a lot of memory
Stochastic GD ($L$ for each sample): fast, but fluctuates
Minibatch GD ($L$ over subsets): less memory & fewer fluctuations
The minibatch size depends on the HW. Typical: minibatch = 32..256
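Minibatch GD only needs a shuffled index and a slice per step. A sketch of the batching part (names are illustrative):

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    """Shuffle the training set once per epoch and yield it in mini-batches."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # last batch may be smaller
        yield X[batch], Y[batch]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1).astype(float)
Y = np.arange(10).astype(float)
batches = list(minibatches(X, Y, batch_size=4, rng=rng))
```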
Termination criteria
By epoch count: a maximum number of iterations over the whole data set. Typical: epochs = 50..200
By the value of the gradient: a gradient equal to 0 means a minimum, but a small gradient => very slow learning
When the cost has not changed for several epochs: if the error does not change, the training procedure is not converging
Early stopping: stop when the “validation” score starts to increase even while the “train” score continues to decrease
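Early stopping can be sketched as a scan over per-epoch validation scores (lower is better here; `patience` is an assumed hyperparameter, not from the slides):

```python
def early_stopping(val_scores, patience=3):
    """Return the epoch at which to stop: when the validation score has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, score in enumerate(val_scores):
        if score < best:
            best, best_epoch = score, epoch   # new best validation score
        elif epoch - best_epoch >= patience:
            return epoch                      # no improvement for too long
    return len(val_scores) - 1                # ran out of epochs

# Validation error decreases, then starts to climb: stop shortly after the turn.
stop = early_stopping([1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 1.2])
```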
What about the “form” of F? Network topology
“Shallow” networks: 1-2 hidden layers => not enough parameters => poor separation ability
“Deep” networks: NNs with 2..10 layers
“Very deep” networks: NNs with >10 layers
Deep learning. Problems
• Big networks => too much separating ability => overfitting
• The vanishing gradient problem during training
• A complex error surface => local minima
• The curse of dimensionality => memory & computation: a layer of $m^{(i)}$ neurons fed by $m^{(i-1)}$ inputs has $\dim W^{(i)} = m^{(i-1)} \cdot m^{(i)}$ weights
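The dim W formula makes the memory cost easy to estimate. A sketch that also counts the bias terms (the layer sizes below are illustrative, not from the slides):

```python
def num_parameters(layer_sizes):
    """Total parameters of a fully connected net:
    dim W_i = m_(i-1) * m_i weights plus m_i biases per layer."""
    total = 0
    for m_prev, m in zip(layer_sizes, layer_sizes[1:]):
        total += m_prev * m + m
    return total

# e.g. a 784-input net with two hidden layers of 1000 and 10 outputs.
n = num_parameters([784, 1000, 1000, 10])
```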
Additional methods: Conventional
• Momentum (dampens oscillations on the error surface):
$\Delta w^{(t)} = -\alpha \nabla L(w^{(t)}) + \beta \Delta w^{(t-1)}$, where the second term is the momentum. Typical: $\beta = 0.9$
• LR decay (take smaller steps near the optimum):
$\alpha^{(t)} = k \alpha^{(t-1)}$, $0 < k < 1$. Typical: apply LR decay with $k = 0.1$ every 10..100 epochs
• Weight decay (prevents weight growth and smooths $F$):
$L^* = L + \lambda \lVert w^{(t)} \rVert$; L1 or L2 regularization is often used. Typical: L2 with $\lambda = 0.0005$
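The three tricks compose into one update rule. A sketch, with L2 weight decay folded into the gradient (a common formulation; the slide leaves the exact norm unspecified, and all names are illustrative):

```python
import numpy as np

def sgd_step(w, grad, velocity, alpha, beta=0.9, lam=0.0005):
    """One SGD update with momentum and L2 weight decay."""
    grad = grad + lam * w                       # L2 weight-decay term
    velocity = -alpha * grad + beta * velocity  # momentum accumulator
    return w + velocity, velocity

def lr_decay(alpha, epoch, k=0.1, every=10):
    """Multiply the learning rate by k every `every` epochs."""
    return alpha * (k ** (epoch // every))
```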
Additional methods: Contemporary
Dropout/DropConnect
– ensembles of networks
– $2^N$ networks in one: for each example, hide neuron outputs at random ($P = 0.5$)
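Inverted dropout is the usual implementation: zero each output with probability p during training and rescale the survivors so the expected activation is unchanged at test time. A sketch (the rescaling is standard practice, not stated on the slide):

```python
import numpy as np

def dropout(y, p=0.5, rng=None, train=True):
    """Zero each neuron output with probability p (training only),
    scaling survivors by 1/(1-p) so the expected output is unchanged."""
    if not train:
        return y                      # identity at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(y.shape) >= p   # keep with probability 1 - p
    return y * mask / (1.0 - p)
```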
Additional methods: Contemporary
Data augmentation: more data covering all available cases:
– affine transformations, flips, crops, contrast, noise, scaling
– pseudo-labeling
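A sketch of the simplest of these augmentations, assuming 32×32 inputs cropped to 28×28 (all sizes, probabilities, and noise levels here are illustrative):

```python
import numpy as np

def augment(image, rng):
    """Random horizontal flip, random crop, and additive Gaussian noise."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                            # horizontal flip
    r, c = rng.integers(0, 4), rng.integers(0, 4)
    out = out[r:r + 28, c:c + 28]                     # 28x28 crop from 32x32
    out = out + rng.normal(0.0, 0.01, out.shape)      # additive noise
    return out

rng = np.random.default_rng(0)
patch = augment(np.zeros((32, 32)), rng)
```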
Additional methods: Contemporary
New activation functions:
– Linear: $y_i = f(s_i) = a s_i$
– ReLU: $y_i = \max(s_i, 0)$
– Leaky ReLU: $y_i = s_i$ if $s_i > 0$, else $a s_i$. Typical: $a = 0.01$
– Maxout: $y_i = \max(s_{1,i}, s_{2,i}, \dots, s_{k,i})$. Typical: $k = 2..3$
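All four activations are one-liners in NumPy (the leaky-ReLU default follows the slide’s typical value):

```python
import numpy as np

def linear(s, a=1.0):
    return a * s

def relu(s):
    return np.maximum(s, 0.0)

def leaky_relu(s, a=0.01):
    return np.where(s > 0, s, a * s)  # small slope for negative inputs

def maxout(s_list):
    """Element-wise max over k pre-activation vectors s_1..s_k."""
    return np.max(np.stack(s_list), axis=0)
```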
Additional methods: Contemporary
Pre-training:
– train layer-by-layer,
– re-train an existing (“other”) network
Sources
• Geoffrey Hinton’s course “Neural Networks for Machine Learning” [http://www.coursera.org/course/neuralnets]
• Ian Goodfellow, Yoshua Bengio and Aaron Courville “Deep Learning” [http://www.deeplearningbook.org/]
• http://neuralnetworksanddeeplearning.com
• CS231n: Convolutional Neural Networks for Visual Recognition [http://cs231n.stanford.edu/]
• CS224d: Deep Learning for Natural Language Processing [http://cs224d.stanford.edu/]
• Schmidhuber “Deep Learning in Neural Networks: An Overview”
• kaggle.com competitions and forums