Practical deep learning
Markus Koskela
Mats Sjöberg
CSC – IT Center for Science Ltd, Espoo
February 13–14, 2019
All original material (C) 2019 by CSC – IT Center for Science Ltd.
Licensed under a Creative Commons Attribution-ShareAlike 4.0 Unported License,
http://creativecommons.org/licenses/by-sa/4.0
All other material copyrighted by their respective authors.
Agenda
Up-to-date agenda and lecture slides can be found at https://tinyurl.com/y83ctvug
Exercise materials are at GitHub: https://github.com/csc-training/intro-to-dl/
Wireless accounts for CSC-guest network behind the badges. Alternatively, use the eduroam
network with your university accounts or the LAN cables on the tables.
Accounts to Taito-GPU cluster delivered separately.
Day 1: Notebooks
9:00-10:30 Lecture 1: Introduction to deep learning
10:30-10:45 Break
10:45-11:00 Exercise 1: Introduction to Notebooks, Keras fundamentals
Jupyter notebook: keras-test-setup.ipynb
11:00-11:30 Lecture 2: Multi-layer perceptron networks
11:30-12:00 Exercise 2: Classification with MLPs
Jupyter notebook: keras-mnist-mlp.ipynb
12:00-13:00 Lunch
13:00-14:00 Lecture 3: Image data, convolutional neural networks
14:00-14:30 Exercise 3: Image classification with CNNs
Jupyter notebook: keras-mnist-cnn.ipynb
14:30-14:45 Break
14:45-15:30 Lecture 4: Text data, embeddings, 1D CNN, recurrent neural networks, attention
15:30-16:00 Exercise 4: Text sentiment classification with CNNs, RNNs
Jupyter notebooks: keras-imdb-cnn.ipynb, keras-imdb-rnn.ipynb
Day 2: Taito-GPU
9:00-9:45 Lecture 5: Introduction to PyTorch
9:45-10:15 Lecture 6: GPUs, batch jobs, using Taito-GPU
10:15-10:30 Break
10:30-12:00 Exercise 5: Image classification: dogs vs. cats; traffic signs
12:00-13:00 Lunch
13:00-14:00 Exercise 6: Text categorization and labeling: 20 newsgroups; Ted talks
14:00-14:45 Lecture 7: Cloud, GPU utilization, multi-GPU
14:45-15:00 Break
15:00-16:00 Exercise 7: Using multiple GPUs
Lecture 1: Introduction to deep learning
About this course
• Introduction to deep learning
– basics of ML assumed
– mostly high-school math
– much of theory, many details skipped
• 1st day: lectures + small-scale exercises using notebooks.csc.fi
• 2nd day: mid-scale experiments using GPUs at Taito-GPU
• Slides at: https://tinyurl.com/y83ctvug
• Other materials (and link to Gitter) at GitHub: https://github.com/csc-training/intro-to-dl/
• Focus on text and image classification, no fancy stuff
• Using Python, Keras, and PyTorch
Further resources
• This course is largely “inspired by”: “Deep Learning with Python” by François Chollet
• Recommended textbook: “Deep learning” by Goodfellow, Bengio, Courville
• Lots of further material available online, e.g.:
– http://cs231n.stanford.edu/
– http://course.fast.ai/
– https://developers.google.com/machine-learning/crash-course/
– www.nvidia.com/dlilabs
– http://introtodeeplearning.com/
– https://github.com/oxford-cs-deepnlp-2017/lectures
• Academic courses
What is artificial intelligence?
Artificial intelligence is the ability of a computer to perform tasks
commonly associated with intelligent beings.
What is machine learning?
Machine learning is the study of algorithms that learn from
examples and experience instead of relying on hard-coded rules
and make predictions on new data.
What is deep learning?
Deep learning is a subfield of machine learning focusing on
learning data representations as successive layers of increasingly
meaningful representations.
Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
“Traditional” machine learning: handcrafted features → learned classifier → “cat”
Deep, “end-to-end” learning: learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”
From: Wang & Raj: On the Origin of Deep Learning (2017)
Demotivational slide
“All of these AI systems we see, none of them is ‘real’ AI” – Josh Tenenbaum
“Neural networks are … neither neural nor even networks.” – François Chollet, author of Keras
Main types of machine learning
• Supervised learning (e.g., labeling images as “cat” or “dog”)
• Unsupervised learning (illustration by Chire [CC BY-SA 3.0], from Wikimedia Commons)
• Self-supervised learning (image from https://arxiv.org/abs/1710.10196)
• Reinforcement learning (animation from https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html)
Fundamentals of machine learning
Data
• Humans learn by observation and unsupervised learning
– model of the world / common sense reasoning
• Machine learning needs lots of (labeled) data to compensate
• Tensors: generalization of matrices to n dimensions (or rank, order, degree)
– 1D tensor: vector
– 2D tensor: matrix
– 3D, 4D, 5D tensors
– numpy.ndarray(shape, dtype)
• Training – validation – test split (+ adversarial test)
• Minibatches
– small sets of input data used at a time
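The split and the minibatches can be sketched in a few lines of plain Python. This is only an illustration: the function names, the 60/20/20 proportions, and the batch size are assumptions, not values from the course material.

```python
import random

def split_dataset(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def minibatches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

data = list(range(100))                   # stand-in for 100 labeled samples
train, val, test = split_dataset(data)
batches = list(minibatches(train, batch_size=32))
print(len(train), len(val), len(test))    # 60 20 20
print([len(b) for b in batches])          # [32, 28]
```

The test part is set aside first and never touched during training or model selection.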
Model – learning/training – inference
• parameters 𝜃 and hyperparameters
• http://playground.tensorflow.org/
Optimization
• Mathematical optimization: “the selection of a best element (with regard to some criterion) from some set of available alternatives” (Wikipedia)
• Main types: finite-step, iterative, heuristic
• Learning as an optimization problem
– cost function: loss + regularization
By Rebecca Wilson (originally posted to Flickr as Vicariously) [CC BY 2.0], via Wikimedia Commons
Optimization
Image from: Li et al. “Visualizing the Loss Landscape of Neural Nets”, arXiv:1712.09913
Gradient descent
• Derivative and minima/maxima of functions
• Gradient: the derivative of a multivariable function
• Gradient descent:
• (Mini-batch) stochastic gradient descent (and its variants)
Image from: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
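A minimal sketch of plain gradient descent, assuming the usual update rule theta ← theta − learning_rate · gradient. The loss function, starting point, and learning rate below are made up for illustration.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta

# Hypothetical loss f(a, b) = (a - 3)^2 + (b + 1)^2 with its minimum at (3, -1)
def grad_f(theta):
    a, b = theta
    return [2 * (a - 3), 2 * (b + 1)]

theta = gradient_descent(grad_f, [0.0, 0.0])
print(theta)   # very close to [3, -1]
```

In the mini-batch stochastic variant the gradient is estimated from one minibatch per step instead of the full training set.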
Over- and underfitting, generalization, regularization
• Models with lots of parameters can easily overfit to training data
• Generalization: the quality of an ML model is measured on new, unseen samples
• Regularization: any method* to prevent overfitting
– simplicity, sparsity, dropout, early stopping
– *) other than adding more data
By Chabacano [GFDL or CC BY-SA 4.0], from Wikimedia Commons
Deep learning
Anatomy of a deep neural network
• Layers
• Input data and targets
• Loss function
• Optimizer
Layers
• Data processing modules
• Many different kinds exist
– densely connected
– convolutional
– recurrent
– pooling, flattening, merging, normalization, etc.
• Input: one or more tensors; output: one or more tensors
• Usually have a state, encoded as weights
– learned, initially random
• When combined, form a network or a model
Input data and targets
• The network maps the input data X to predictions Y′
• During training, the predictions Y′ are compared to true targets Y using the loss function
Loss function
• The quantity to be minimized (optimized) during training
– the only thing the network cares about
– there might also be other metrics you care about
• Common tasks have “standard” loss functions:
– mean squared error for regression
– binary cross-entropy for two-class classification
– categorical cross-entropy for multi-class classification
– etc.
• https://lossfunctions.tumblr.com/
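The three “standard” losses named above can be written out directly. This is a plain-Python sketch for a handful of values; the function names are illustrative and are not the Keras implementations.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Regression: average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    """y_true holds 0/1 labels, p_pred the predicted probability of class 1."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)

def categorical_cross_entropy(one_hot, probs):
    """Single sample: one_hot is the target, probs the predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

print(mean_squared_error([1, 2], [1, 4]))                        # 2.0
print(binary_cross_entropy([1, 0], [0.9, 0.1]))                  # about 0.105
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))     # about 0.223
```

All three are zero for a perfect prediction and grow as the prediction moves away from the target.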
Optimizer
• How to update the weights based on the loss function
• Learning rate
• Stochastic gradient descent, momentum, and their variants
– RMSProp is usually a good first choice
– more info: http://ruder.io/optimizing-gradient-descent/
Animation from: https://imgur.com/s25RsOr
Deep learning frameworks
• Actually tools for defining static or dynamic general-purpose computational graphs
• Automatic differentiation
• Seamless CPU / GPU usage
– multi-GPU, distributed
• Python/numpy or R interfaces
– instead of C, C++, or CUDA
• Open source
[Figure: computational graph combining the inputs x, y and the constant 5 through ✕ and + nodes]
Deep learning frameworks
• Keras is a high-level neural networks API
– we will use TensorFlow as the compute backend
– https://keras.io/
• PyTorch is:
– a GPU-based tensor library
– an efficient library for dynamic neural networks
– https://pytorch.org/
[Figure: software stack. High-level APIs: Keras, TF Estimator, torch.nn, Gluon, Lasagne. Frameworks: TensorFlow, Theano, CNTK, PyTorch, MXNet, Caffe. Libraries: CUDA, cuDNN (GPUs); MKL, MKL-DNN (CPUs)]
Lecture 2: Multi-layer perceptron networks
Neuron as a linear classifier
By User:ZackWeinberg, based on PNG version by User:Cyc [CC BY-SA 3.0], via Wikimedia Commons
A non-linear classifier?
Activation function
• A smooth (differentiable) nonlinear function that is applied after the inner product with the weights
• Common functions:
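As an illustration, here are three widely used activation functions (sigmoid, tanh, ReLU) applied after the inner product with the weights. Whether these are exactly the functions shown on the original slide is an assumption; the neuron helper is a made-up name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

tanh = math.tanh

def neuron(inputs, weights, bias, activation):
    """Inner product with the weights, plus bias, then the nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

x = [1.0, 2.0]
w = [0.5, -1.0]
print(neuron(x, w, 0.5, relu))      # 0.0, since z = -1.0 is clipped
print(neuron(x, w, 0.5, sigmoid))   # about 0.269
print(neuron(x, w, 0.5, tanh))      # about -0.762
```

Strictly speaking ReLU is not differentiable at z = 0; in practice a fixed value is used for the derivative at that point.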
Neural network
• An (artificial) neural network is a collection of neurons
• Usually organized in layers
– input layer
– one or more hidden layers (sizes, activation functions are hyperparameters)
– output layer (size typically equals number of classes in classification; activation function should be compatible with training labels)
By Glosser.ca [CC BY-SA 3.0], via Wikimedia Commons
Backpropagation
• Neural networks are trained with gradient descent, starting from a random weight initialization
• Backpropagation is the algorithm for computing the gradients for a neural network
• Based on the chain rule of calculus
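A hand-worked sketch of the chain rule on a tiny network, y = w2 · tanh(w1 · x) with a squared-error loss. The network, the weight values, and the numeric check are assumptions chosen for illustration only.

```python
import math

def loss_and_grads(w1, w2, x, target):
    # Forward pass
    h = math.tanh(w1 * x)          # hidden activation
    y = w2 * h                     # prediction
    loss = (y - target) ** 2
    # Backward pass: multiply local derivatives along the path (chain rule)
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h                   # dy/dw2 = h
    dloss_dh = dloss_dy * w2                   # dy/dh = w2
    dloss_dw1 = dloss_dh * (1 - h ** 2) * x    # dh/dw1 = (1 - tanh^2) * x
    return loss, dloss_dw1, dloss_dw2

def numeric_grad(f, w, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, target = 0.5, -0.3, 2.0, 1.0
loss, g1, g2 = loss_and_grads(w1, w2, x, target)
n1 = numeric_grad(lambda w: loss_and_grads(w, w2, x, target)[0], w1)
n2 = numeric_grad(lambda w: loss_and_grads(w1, w, x, target)[0], w2)
print(abs(g1 - n1) < 1e-5, abs(g2 - n2) < 1e-5)   # True True
```

Frameworks automate exactly this bookkeeping for networks with millions of weights.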
Multilayer perceptron (MLP) / Dense network
• Classic feedforward neural network
• Densely connected: all inputs from the previous layer connected
• In Keras: keras.layers.Dense(units, activation=None)
or: keras.layers.Dense(units) followed by keras.layers.Activation(activation)
Dropout
• randomly setting a fraction (rate) of the input units to 0 at each update during training
• helps to prevent overfitting (regularization)
• In Keras: keras.layers.Dropout(rate)
Image from Srivastava et al (2014), JMLR 15: 1929-1958
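A plain-Python sketch of dropout on a list of activations. It uses the “inverted” form, where the kept units are scaled by 1/(1 − rate) during training so that nothing needs to change at inference time; Keras documents the same scaling for its Dropout layer. The function name and the rng argument are illustrative.

```python
import random

def dropout(inputs, rate, training=True, rng=random):
    """Zero each unit with probability `rate`; scale the survivors by 1/(1 - rate)."""
    if not training or rate == 0.0:
        return list(inputs)        # dropout is switched off at inference time
    keep = 1.0 - rate
    return [x / keep if rng.random() >= rate else 0.0 for x in inputs]

rng = random.Random(0)
out = dropout([1.0] * 10000, rate=0.25, rng=rng)
zero_fraction = out.count(0.0) / len(out)
mean = sum(out) / len(out)
print(round(zero_fraction, 2), round(mean, 2))   # roughly 0.25 and 1.0
```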
Lecture 3: Images and convolutional neural networks
Computer vision
Computer vision = giving computers the ability to understand visual information
Examples:
○ A robot that can move around obstacles by analysing the input of its camera(s)
○ A computer system finding images of cats among millions of images on the Internet
From picture to pixels
An image has to be digitized for computer processing
It is turned into millions of “pixel” elements
Each is a set of numbers quantifying the color of that element
0.49411765 0.49411765 0.4745098 0.49019608 0.4745098
0.49411765 0.49411765 0.5058824 0.49411765 0.49803922
0.49803922 0.49411765 0.4862745 0.47058824 0.49411765
0.5019608 0.49803922 0.49803922 0.49019608 0.50980395
0.50980395 0.5058824 0.52156866 0.50980395 0.5058824
Picture source: https://pixabay.com/en/kitty-cat-kid-cat-domestic-cat-2948404/
From pixels to … understanding?
0.49411765 0.49411765 0.4745098 0.49019608 0.4745098
0.49411765 0.49411765 0.5058824 0.49411765 0.49803922
0.49803922 0.49411765 0.4862745 0.47058824 0.49411765
0.5019608 0.49803922 0.49803922 0.49019608 0.50980395
0.50980395 0.5058824 0.52156866 0.50980395 0.5058824
There’s a cat among some flowers in the grass
● This is easy for humans
● But for AI it’s actually one of the harder problems!
● How do you transform that grid of numbers into understanding… or even something useful?
Image understanding
• Humans are so good at vision that it’s not even considered intelligence
Convolutional neural networks
Deep learning for Computer vision
Convolutional neural network (CNN, ConvNet)
● Dense or fully-connected: each neuron connected to all neurons in previous layer
● CNN: only connected to a small “local” set of neurons
● Radically reduces number of network connections
[Figure: dense layer vs. convolutional layer]
Convolution for image data
● Image represented as 2D grid of values
● Each output neuron connected to small 2D area in the image
● Output value = weighted sum of inputs
● Idea: nearby pixels are related ⇒ we can learn local relationships of pixels
Image source: https://mlnotebook.github.io/post/CNN1/
[Figure: 3✕3 image area, 3✕3 weights (conv. kernel), output neuron]
Convolution for image data
● We repeat for each output neuron
● Weights stay the same (shared weights)
● Border effect: without padding output area is smaller
● Outputs form a “feature map”
[Figure: image input, 3✕3 weights (conv. kernel), feature map]
Image source: https://mlnotebook.github.io/post/CNN1/
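The sliding weighted sum with shared weights can be sketched in plain Python. The image and the vertical-edge kernel are hand-picked assumptions; in a CNN the kernel values are learned. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding; returns one feature map."""
    H, W = len(image), len(image[0])
    h, w = len(kernel), len(kernel[0])
    out = []
    for i in range(H - h + 1):          # border effect: output is smaller
        row = []
        for j in range(W - w + 1):
            # Same weights at every position: weighted sum of a local area
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(h) for b in range(w)))
        out.append(row)
    return out

image = [[0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1]]
kernel = [[-1, 0, 1],     # hypothetical kernel that responds to vertical edges
          [-1, 0, 1],
          [-1, 0, 1]]
feature_map = conv2d_valid(image, kernel)
print(feature_map)   # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

A 5✕5 input with a 3✕3 kernel gives a 3✕3 feature map; by the same rule (input size − kernel size + 1) a 256✕256 image with a 5✕5 kernel gives 252✕252.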
A real example
Image from: http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx
Side note: color images
● Example: 256 ✕ 256 color image with 3 color channels (red, green, and blue) ⇒ single image is a 3D tensor: 256 ✕ 256 ✕ 3
● Example: 5 ✕ 5 convolution is actually also a 3D tensor: 5 ✕ 5 ✕ 3
● Slides over width and height, but covers the full color depth
Convolution for image data
● We can repeat for different sets of weights (kernels)
● Each learns a different “feature”
● Typically: edges, corners, etc
● Each outputs a feature map
[Figure: image 256✕256✕3 → K kernels, each 5✕5(✕3) → K feature maps, each 252✕252✕1]
Convolution for image data
● We stack the feature maps into a single tensor
● Depth of output tensor = number of kernels K
● Tensor is the output of the entire convolutional layer
[Figure: image 256✕256✕3 → K kernels, each 5✕5(✕3) → output tensor 252✕252✕K]
Convolution in layers: intuition
● We can then add another convolutional layer
● This operates on the previous layer’s output tensor (feature maps)
● Features layered from simple to more complex
[Figure: learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”]
Image from lecture by Yann LeCun, original from Zeiler & Fergus (2013)
Image datasets
• Color image mini-batches are 4D tensors: width ✕ height ✕ color channels ✕ samples
• Plenty of big datasets for training exist, e.g., ImageNet with 1.2 million images in 1000 classes
• Data augmentation for small datasets: generate more training data by transforming existing data
• E.g., shifting, rotation, cropping, scaling, adding noise, etc.
Convolutional layers
• Input: tensor of size N × W_i × H_i × C_i
• Hyperparameters:
– K: number of filters
– w, h: kernel size
– padding: how to handle image borders
– activation function
• Output: tensor of size N × W_o × H_o × K
• In Keras: keras.layers.Conv2D(filters, kernel_size, padding, activation)
(there are also Conv1D and Conv3D)
Pooling layers
• Used to reduce the spatial resolution
– independently on each channel
– reduce complexity and number of parameters
• MAX operator most common
– sometimes also AVERAGE
• In Keras: keras.layers.MaxPooling2D(pool_size), keras.layers.AveragePooling2D(pool_size)
Image from http://cs231n.github.io/convolutional-networks/
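A minimal sketch of max pooling on a single channel, assuming non-overlapping windows (stride equal to the pool size). The function name and the input values are illustrative.

```python
def max_pool2d(feature_map, pool_size=2):
    """Keep the maximum of each non-overlapping pool_size x pool_size window."""
    H, W = len(feature_map), len(feature_map[0])
    return [[max(feature_map[i + a][j + b]
                 for a in range(pool_size) for b in range(pool_size))
             for j in range(0, W - pool_size + 1, pool_size)]
            for i in range(0, H - pool_size + 1, pool_size)]

fm = [[1, 3, 2, 4],
      [5, 6, 7, 8],
      [3, 2, 1, 0],
      [1, 2, 3, 4]]
print(max_pool2d(fm))   # [[6, 8], [3, 4]]
```

The 4✕4 map shrinks to 2✕2; with several channels the same operation is applied to each channel separately.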
Other layers
• Flatten
– flattens the input into a vector (typically before dense layers)
• Dropout
– similar as with dense layers
• In Keras: keras.layers.Flatten(), keras.layers.Dropout(rate)
Typical architecture
1. Input layer = image pixels
2. Convolution
3. ReLU
4. Pooling
(steps 2–4: repeat one or more times)
5. One or more fully connected layers (+ReLU)
6. Final fully connected layer to get to the number of classes we want
7. Softmax to get probability distribution over classes
AlexNet
VGG
Inception / GoogLeNet
ResNet
DenseNet
Large-scale CNNs with pre-trained weights
• For many applications, an existing CNN can be re-used instead of training a new model from scratch: extract features from a suitable layer or fine-tune the top layers with new data
• Keras contains several models trained with ImageNet:
– Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet, NASNet
[Figure: pre-trained layers produce the extracted features; the top layers are re-initialized and trained]
Some selected applications
• Object detection: https://pjreddie.com/darknet/yolo/
• Semantic segmentation: https://www.youtube.com/watch?v=qWl9idsCuLQ
• Self-driving cars: https://www.youtube.com/watch?v=mCj_C1NOVxw
• Human pose estimation: https://www.youtube.com/watch?v=pW6nZXeWlGM
• Video recognition: https://valossa.com/
• Digital pathology: https://www.aiforia.com/
Lecture 4: Text, embeddings, 1D CNN, recurrent neural networks, attention
Representations for text
Sequence data
By Mogrifier5 [CC BY-SA 3.0], from Wikimedia Commons
By Der Lange 11/6/2005, http://commons.wikimedia.org/w/index.php?title=File:Spike-waves.png&action=edit&section=2
Text data
• sequence of words (or characters)
• main representations:
– one-hot encoding
– word embedding
[Figure: raw text → (preprocessing) → cleaned text → (tokenization) → tokens → (vectorization) → one-hot encoding or word embedding]
One-hot encoding and bag-of-words
• dimensionality equals the number of distinct tokens in dictionary
– 1000’s or 10000’s
• tokens are independent of each other
• bag-of-words loses the ordering of tokens
– lots of important applications: IR etc.
– n-grams
[Figure: cleaned text “The cat is in the moon.” → tokens [“the”, “cat”, “is”, “in”, “the”, “moon”] → dictionary {“a”: 1, “aardvark”: 2, “aardwolf”: 3, …} → one-hot encoding → bag of words]
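The pipeline from cleaned text to one-hot vectors and a bag of words, sketched in plain Python. For brevity the dictionary is built from the example sentence alone and starts at index 0; a real dictionary would cover the whole corpus. All function names are illustrative.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def build_dictionary(tokens):
    """Map each distinct token to an index, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def one_hot(token, dictionary):
    vec = [0] * len(dictionary)
    vec[dictionary[token]] = 1
    return vec

def bag_of_words(tokens, dictionary):
    """Sum of the one-hot vectors: counts per token, ordering lost."""
    counts = [0] * len(dictionary)
    for tok in tokens:
        counts[dictionary[tok]] += 1
    return counts

tokens = tokenize("The cat is in the moon.")
dictionary = build_dictionary(tokens)
print(tokens)                             # ['the', 'cat', 'is', 'in', 'the', 'moon']
print(dictionary)                         # {'cat': 0, 'in': 1, 'is': 2, 'moon': 3, 'the': 4}
print(one_hot("cat", dictionary))         # [1, 0, 0, 0, 0]
print(bag_of_words(tokens, dictionary))   # [1, 1, 1, 1, 2]
```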
Word embeddings
• dense vector representations
– dimensionality typically much lower than in one-hot ⇒ bag-of-words not needed
– learned based on context of words
• semantics
– similar words have similar vectors
– directions in the vector space map to semantic relationships
• context-free and contextual embeddings
• either learn from data or use a pre-trained embedding
(Context-free) word embeddings
Images from: https://www.tensorflow.org/tutorials/word2vec
Image from: http://wiki.fast.ai/index.php/Lesson_5_Notes
Standalone word embedding algorithms
• unsupervised learning, no annotation needed
• popular context-free algorithms include:
– word2vec (CBoW and skip-gram)
– GloVe
– fastText
• recently proposed contextual algorithms include:
– ELMo
– BERT
• to learn a task-based embedding or to use a pre-trained one?
– pre-trained embeddings encode general semantic relationships
– need to handle OOV (out-of-vocabulary) words
– task-based embeddings may sometimes be better if enough data
Word sequence embedding
• usually a fixed-size matrix or sequence
• in Keras: keras.layers.Embedding(input_dim, output_dim, input_length, trainable, weights)
[Figure: cleaned text “The cat is in the moon.” → tokens [“the”, “cat”, “is”, “in”, “the”, “moon”] → padding / truncate [“the”, “cat”, “is”, “in”, “the”, “moon”, ∅, ∅, ∅, ∅] → learned embedding → sequence embedding: 10 × N matrix or sequence of length 10]
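Padding to a fixed length followed by an embedding lookup can be sketched as follows. The vocabulary indices, the embedding dimension N = 4, and the random table are assumptions for illustration; in a trained model the table holds learned vectors.

```python
import random

def pad_or_truncate(indices, length, pad_value=0):
    """Return exactly `length` indices: cut long sequences, pad short ones at the end."""
    trimmed = indices[:length]
    return trimmed + [pad_value] * (length - len(trimmed))

def embed(indices, table):
    """Look up one N-dimensional vector per token index."""
    return [table[i] for i in indices]

vocab = {"<pad>": 0, "the": 1, "cat": 2, "is": 3, "in": 4, "moon": 5}
tokens = ["the", "cat", "is", "in", "the", "moon"]
indices = pad_or_truncate([vocab[t] for t in tokens], length=10)

rng = random.Random(0)
N = 4   # embedding dimension (hypothetical)
table = [[rng.uniform(-1, 1) for _ in range(N)] for _ in vocab]
matrix = embed(indices, table)

print(indices)                        # [1, 2, 3, 4, 1, 5, 0, 0, 0, 0]
print(len(matrix), len(matrix[0]))    # 10 4
```

Both occurrences of “the” map to the same row of the table, which is what makes this a context-free embedding.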
Deep learning for sequences
• first layer is usually an embedding
• then there are three main approaches (that can also be combined):
– 1D convolutional layers
– recurrent layers
– attention
• last layers are often dense
(1) CNNs for sequences
• a fixed-length embedded sequence is a matrix
– can be considered as an image ⇒ CNNs can be applied
• 1D convolution
– as we want to process the full embedding each time
• simple and cheap approach for simple tasks
• in Keras: keras.layers.Conv1D(filters, kernel_size, padding, activation), keras.layers.MaxPooling1D(pool_size), keras.layers.GlobalMaxPooling1D()
[Figure: 1D convolution sliding over the embedded tokens]
(2) Recurrent neural networks
• MLPs and CNNs expect fixed-sized input, not sequences
• RNNs have memory and recurrent connections, i.e. loops
• last output contains a representation of the whole sequence
• learning by backpropagation through time
– vanishing or exploding gradients!
By François Deloche [CC BY-SA 4.0], from Wikimedia Commons
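The recurrent loop can be sketched with a single scalar hidden state: each step combines the current input with the previous state. The weights below are arbitrary assumptions; a real RNN layer uses weight matrices and learns them.

```python
import math

def rnn_forward(sequence, w_in, w_rec, bias=0.0):
    """Simple RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias)."""
    h = 0.0                  # initial memory
    states = []
    for x in sequence:       # works for any sequence length
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states

a = rnn_forward([1.0, 0.0], w_in=1.0, w_rec=0.5)
b = rnn_forward([0.0, 1.0], w_in=1.0, w_rec=0.5)
print(a[-1], b[-1])   # different: the last state depends on the order of the inputs
```

The last state summarizes the whole sequence; the repeated multiplication by w_rec across timesteps is where vanishing or exploding gradients come from.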
Recurrent neural networks
Image from: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Long short-term memory (LSTM) network
• specialized architecture to solve the vanishing gradient problem
– additional “conveyor belt” dataflow to carry information across timesteps
– “forget”, “input”, and “output” gates
By François Deloche [CC BY-SA 4.0], from Wikimedia Commons
LSTM layer
• simple RNNs do not usually work in practice ⇒ use LSTM or its variants (e.g. GRU)
• can also be used bidirectionally
• cuDNN kernels may be >20 times faster on GPUs
• in Keras:
keras.layers.LSTM(units, return_sequences)
keras.layers.CuDNNLSTM(units, return_sequences)
keras.layers.GRU(units, return_sequences)
keras.layers.CuDNNGRU(units, return_sequences)
keras.layers.Bidirectional(layer, merge_mode)
Language models and text generation
• RNNs can be trained to predict the next word and then used to generate novel text (or music, etc.)
Image from: https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%204%20-%20Language%20Modelling%20and%20RNNs%20Part%202.pdf
Encoder–decoder (seq2seq) networks
Image from: https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-2/
“You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!”
-- Ray Mooney
(3) Attention
Problem: the final encoder output vector is a bottleneck
Solution: attention
• allows the model to focus on the relevant part of the input sequence
• all encoder output vectors are passed to the decoder and are weighted using a learned alignment
Image from: https://arxiv.org/abs/1409.0473
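A minimal sketch of the weighting step. Note the swap: the cited paper learns the alignment scores with a small neural network, whereas this sketch uses a plain dot product between the decoder state and each encoder output as a simplifying assumption. All names and values are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """Weight every encoder output by its softmaxed score against the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per input position
weights, context = attend([0.0, 2.0], encoder_outputs)
print([round(w, 3) for w in weights])   # the first position gets the smallest weight
print([round(c, 3) for c in context])
```

The context vector is recomputed for every decoder step, so the decoder is no longer limited to the single final encoder vector.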
Attention is all you need
• Self-attention: relating different positions of a sequence in forming its representation
Transformer: image from https://arxiv.org/abs/1706.03762
BERT: image from https://arxiv.org/abs/1810.04805
Some applications
• text classification and annotation
• author identification
• chatbots
• reading comprehension / QA
• image & video captioning
• speech recognition
• handwritten text recognition
Lecture 5: Introduction to PyTorch
Deep learning frameworks: arXiv mentions
Software frameworks for deep learning
• TensorFlow most popular, but not so easy to use and debug
– Keras is an easy-to-use neural network “front end” for TensorFlow
• PyTorch is a Python version of Torch (Lua-based)
– Getting a lot of traction recently, especially in research
Keras versus PyTorch
We’ll discuss two main differences between Keras and PyTorch:
• Static versus dynamic computational graphs
• Sequential versus functional style
… although these days both frameworks support all modes and styles
Computational graphs
• Any mathematical computation can be expressed as a computational graph
• Neural networks are just a (huge) number of simple computations
• With the graph it is easy to automatically calculate the gradients backwards for each node (backpropagation!)
• Both Keras and PyTorch work in this way
https://en.wikipedia.org/wiki/Automatic_differentiation
Keras: static computational graph
• In Keras (and TF) the graph is static
– First define a fixed graph with inputs being undefined or abstract (variables)
– Then “execute” the graph with specific inputs
• This can be cumbersome and hard to debug
• In theory fast, as the graph can be optimized during compilation
PyTorch: dynamic computational graph
• In PyTorch the graph is defined dynamically
– You define concrete tensors, e.g., x = torch.tensor(42.)
– Then just write the calculations, e.g., z = x*y + 5*x + 5
– The computational graph is generated “on the fly” in the background
• Easy to debug, feels more like normal Python coding
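The “on the fly” graph can be imitated in plain Python, without PyTorch. The Value class below is a hypothetical stand-in for a tensor: every arithmetic operation records its inputs and local derivatives, and backward() walks the recorded graph applying the chain rule. The value y = 2.0 is an assumption; the source only gives x = 42.

```python
class Value:
    """Scalar that records how it was computed, so gradients can flow backwards."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))
    __rmul__ = __mul__

    def backward(self):
        # Order the recorded graph so every node comes after its inputs
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        # Walk from the output back to the inputs, applying the chain rule
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad

x, y = Value(42.0), Value(2.0)
z = x * y + 5 * x + 5          # the graph is built while this line runs
z.backward()
print(z.data, x.grad, y.grad)  # 299.0 7.0 42.0
```

The gradients match the hand calculation: dz/dx = y + 5 = 7 and dz/dy = x = 42.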
Keras: sequential style
• Keras models typically defined in a sequential style
• Each layer is added in sequence to a list
Example: 2-layer MLP:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x_train, y_train)
Keras: functional style
• Keras also supports a functional style
• Each step is written as a function of some input (or output from a previous step)
Example: 2-layer MLP:
from keras.models import Model
from keras.layers import Input, Dense

# Input holds no value, just an “open slot” for a future value
inputs = Input(shape=(100,))
x = Dense(64, activation='relu')(inputs)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x_train, y_train)
PyTorch: functional with subclassing
• Network defined as a Python class
• We have to handle the training loop manually
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # nn.Linear needs both sizes; 100 inputs assumed, as in the Keras examples
        self.fc1 = nn.Linear(100, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, inputs):
        x = F.relu(self.fc1(inputs))
        predictions = self.fc2(x)  # raw scores; CrossEntropyLoss applies log-softmax itself
        return predictions

net = Net()
optimizer = optim.RMSprop(net.parameters())
criterion = nn.CrossEntropyLoss()

# num_epochs and batch_loader are assumed to be defined elsewhere
for i in range(num_epochs):
    for (x_train, y_train) in batch_loader:
        optimizer.zero_grad()
        outputs = net(x_train)
        loss = criterion(outputs, y_train)
        loss.backward()     # backpropagation
        optimizer.step()    # weight updates
Sequential versus functional
These days Keras and PyTorch support all styles!
… but some are more supported than others
        | Sequential     | Functional                             | Func. with classes
Keras   | Yes, canonical | Supported                              | Supported, but limited
PyTorch | Supported      | In theory, but does not make sense...  | Yes, canonical
Common neural network modules in PyTorch
torch.nn.Linear(in_features, out_features, bias=True)
torch.nn.Dropout(p=0.5, ...)
torch.nn.Conv2d(in_channels, out_channels, kernel_size, ...)
torch.nn.Embedding(num_embeddings, embedding_dim, ...)
torch.nn.LSTM(input_size, hidden_size, num_layers=1, dropout=0,
bidirectional=False, ...)
torch.nn.GRU(input_size, hidden_size, num_layers=1, dropout=0,
bidirectional=False, ...)
Keras or PyTorch?
We’ll provide examples of how to do things with PyTorch; it’s up to you whether you learn PyTorch or stick with Keras.
• PyTorch allows more control and customization, easier experimentation with new architectures
• Keras is easier if you just want to apply deep learning, and not do research
Useful PyTorch links:
https://pytorch.org/tutorials/
https://pytorch.org/docs/stable/index.html
Practical deep learning
Lecture 6: GPUs, batch jobs, using Taito-GPU
GPU computing
• CPUs are optimized for latency whereas GPUs are optimized for throughput
• CSC’s GPU nodes with P100’s:
| | #cores | max clock speed | memory |
|---|---|---|---|
| 2 x Xeon CPUs | 2 x 14 | 3.30 GHz | 512 GB |
| 4 x P100 GPUs | 4 x 3584 | 1.48 GHz | 4 x 16 GB |
CSC’s solutions
• Research administration
• Computing and software
• Data management and analytics for research
• Support and training for research
• Solutions for managing and organizing education
• Solutions for learners and teachers
• Solutions for educational and teaching cooperation
• Hosting services tailored to customers’ needs
• Identity and authorisation
• Management and use of data
ICT platforms, Funet network and data center functions are the base for our solutions
CSC’s computing resources
• Supercomputer (Sisu)
• Supercluster (Taito)
• Cloud services (cPouta, ePouta)
• Accelerated computing (GPUs, Pouta and Taito-GPU)
• Grid (FGCI)
• International resources
– Extremely large computing (PRACE)
– Nordic resources (NEIC)
Taito GPU
The P100 nodes consist of 20 Dell PowerEdge C4130 servers with:
• 2x Xeon E5-2680 v4 CPUs with 14 cores each running at 2.4GHz
• 512 GB of DDR4 memory
• 4x P100 GPUs connected in pairs to each CPU
• 2x800GB of SATA SSD scratch space
The K80 nodes consist of 12 Dell PowerEdge C4130 servers with:
• 2x Xeon E5-2680 v3 CPUs with 12 cores each running at 2.5GHz
• 256 GB of DDR4 memory
• 2x K80 GPU cards each with 2 GPUs for a total of 4 GPUs per node; these are all connected to the first CPU
• 850GB of HDD scratch space
DL2021 – new data management and computing infrastructure
Phase 1 computing cluster (700+ nodes) - summer 2019:
● New Intel Cascade Lake CPU architecture supporting VNNI instructions for AI inference workloads
● Includes 80 “AI specific nodes” with 320 GPUs
- 4 NVIDIA V100 (32 GB) GPUs / node, NVLink
- 3.2 TB local NVMe disk
- Extremely fast network (InfiniBand 200 Gbps)
Taito compute nodes are used via a queuing system
Do not use the login node for heavy computation!
Batch jobs
Steps for running a batch job:
1. Write a batch job script
2. Make sure you have all the input files where the program can find them
3. Submit your job (sbatch batch_job_file.sh)
4. Wait (or check progress: tail slurm-jobid.out)
5. Look at the results, e.g., standard output in slurm-jobid.out
You have to specify the necessary resources:
– resources need to be sufficient for the job
– requested resources consume BUs (billing units) and affect time spent in queue
⇒ realistic resource requests give best results
Example batch job script on Taito-GPU
#!/bin/bash
#SBATCH -p gpu --gres=gpu:p100:1
#SBATCH -t 1:00:00 --mem=8G -c 4
srun python my_python_program.py
Relevant sbatch options
| Option | Description |
|---|---|
| -J, --job-name | name of job |
| -c, --cpus-per-task | number of processors per task |
| -p partition | specify partition (gpu, gputest, gpulong) |
| --gres=gpu:type:number | request number of GPUs of type (k80, p100) |
| -t, --time | time limit in DD-HH:MM:SS |
| --mem | the real (host) memory required per node |
| -o, --output | file for script’s standard output |
| -e, --error | file for script’s standard error |
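When many similar jobs are submitted, it can be convenient to generate the batch script from a few resource parameters. The sketch below is only an illustration: `render_batch_script` is a hypothetical helper, and its defaults simply reproduce the example script on Taito-GPU using the options listed above.

```python
def render_batch_script(command, partition="gpu", gpu_type="p100", gpus=1,
                        time_limit="1:00:00", mem="8G", cpus=4):
    """Return the text of a batch job script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -p {partition} --gres=gpu:{gpu_type}:{gpus}",
        f"#SBATCH -t {time_limit} --mem={mem} -c {cpus}",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"

script = render_batch_script("python my_python_program.py")
print(script)
```

Writing the returned text to a file gives a script that can be submitted with sbatch as usual.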
Managing batch jobs
| Command | Description |
|---|---|
| sbatch batch_job_file.sh | submit a job |
| sbatch --options batch_job_file.sh | submit a job with additional options |
| scancel jobid | delete a job |
| squeue -l | show all jobs in all queues (partitions) |
| squeue -l -p partition | show all jobs in partition |
| squeue -l -u username | show all jobs for a single user |
| squeue -l -j jobid | show status of a single job |
| sinfo | check all available queues |
| seff jobid | show CPU, mem and GPU utilization |
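The same commands can be driven from Python, for example to submit a job and then poll its status. This is a minimal sketch: the function names are hypothetical, `submit` and `queue_status` only work on a machine where Slurm is installed, and the parser assumes sbatch prints its usual "Submitted batch job N" line.

```python
import re
import subprocess

def parse_job_id(sbatch_output):
    """Extract the numeric job id from sbatch's 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))

def submit(script_path):
    """Submit a batch job script and return its job id (requires Slurm)."""
    result = subprocess.run(["sbatch", script_path], check=True,
                            capture_output=True, text=True)
    return parse_job_id(result.stdout)

def queue_status(job_id):
    """Return the output of 'squeue -l -j jobid' (requires Slurm)."""
    result = subprocess.run(["squeue", "-l", "-j", str(job_id)], check=True,
                            capture_output=True, text=True)
    return result.stdout

# The parser can be tried without a cluster:
print(parse_job_id("Submitted batch job 123456\n"))  # 123456
```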
| Directory or storage area | Intended use | Default quota/user | Storage time | Backup |
|---|---|---|---|---|
| $HOME * | Initialization scripts, source codes, small data files. Not for running programs or research data. | 50 GB | Data will be deleted 90 days after closing the account | Yes |
| $USERAPPL | Users' own application software. | 50 GB | Data will be deleted 90 days after closing the account | Yes |
| $WRKDIR * | Temporary data storage. | 5 TB | 90 days | No |
| $TMPDIR | Temporary users' files. | | 2 days | No |
| project | Common storage for project members. A project can consist of one or more user accounts. | On request. | Data will be deleted 90 days after closing the project | No |
| HPC Archive * | Long term storage. | 2.5 TB | Permanent | Two copies maintained |
| IDA | Long term storage. | On request | Permanent | Part of the Open Science and Research services. |
| Pouta Object Storage | Storage and sharing | 1 TB | Permanent | |
See https://research.csc.fi/csc-guide-directories-and-data-storage-at-csc for more information.
Module system
• Different software packages have different, possibly conflicting, requirements
• Most commonly used module commands:
– module help: Show available options
– module load modulename: Load the given environment module
– module load modulename/version: Load a specific version of the module
– module list: List the loaded modules
– module avail: List modules that are available to load
– module spider: List all existing modules
– module spider name: Search the list of existing modules
– module swap module1 module2: Replaces a module, including compatible versions of other loaded modules
– module unload modulename: Unload the given environment module
– module purge: Unload all modules
Mlpython: collections of GPU-optimized ML frameworks
• E.g., python-env/2.7.10-ml or python-env/3.6.3-ml
• GPU-optimized versions of ML frameworks, including:
- TensorFlow
- Keras
- PyTorch
• Usage, e.g.:
module purge
module load python-env/3.6.3-ml
• See https://research.csc.fi/-/mlpython for more information
TensorBoard
• Tool to visualize TF graphs, plot quantitative metrics, and show additional data like images
• Operates by reading TF event files, which contain summary data that can be generated while running TF (or Keras, PyTorch, etc.)
• Instructions in the exercises if you want to try
TensorBoard at Taito-GPU
Practical deep learning
Lecture 7: Cloud environment, GPU utilization, using multiple GPUs
Data analytics in the cloud
• Cloud environment allows flexible data analytics
– Pouta (OpenStack) lets you run and manage your own VMs
▪ GPU nodes and IO intensive nodes
▪ ePouta for sensitive data
– Rahti (OpenShift) lets you run and manage your own containers (under development, in limited beta)
• Pouta Object Storage for shared data storage
• Good for: web applications, big data frameworks, installing custom software, building computing infrastructure
GPU utilization
• a GPU cannot be shared among users
– running multiple parallel processes is possible (in theory) but cumbersome
⇒ GPU jobs should be optimized to utilize the GPU as efficiently as possible
• standard solution: increase mini-batch size
• monitor your GPU usage:
seff jobid
ssh gxxx nvidia-smi [dmon]
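For logging GPU usage from inside a job, nvidia-smi can also be queried in a machine-readable CSV form. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the sample numbers are made up, `gpu_stats` only works on a node with NVIDIA drivers, and the parser assumes every queried field is reported as a plain integer.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse one 'utilization, used, total' CSV line per GPU into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append({"util_percent": util,
                      "mem_used_mib": used,
                      "mem_total_mib": total})
    return stats

def gpu_stats():
    """Query all visible GPUs (requires nvidia-smi on the node)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_stats(result.stdout)

# Made-up sample output for two GPUs, to try the parser without a GPU:
sample = "97, 15021, 16280\n12, 803, 16280\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

A GPU that stays at low utilization for most of the job is a hint that the mini-batch size or the data loading needs attention.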
Using multiple GPUs for model training
• Model and data parallelism
• Single-node multi-GPU and distributed training
• All the main frameworks offer some level of support
– TensorFlow and PyTorch are good choices for distributed training
– high-level APIs such as Keras may not be optimal
– external tools: Horovod, Gloo
[Figure: model parallelism versus data parallelism]
Model parallelism in AlexNet
Model parallelism in Google’s NMT
Image from https://arxiv.org/pdf/1609.08144.pdf
MPI allreduce
• In data parallelism, we need to gather all gradients and send the mean of the gradients back to all GPUs
Images from: https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication
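Why the mean of the gradients is the right quantity can be checked on a toy problem: when every GPU gets an equally sized shard of the mini-batch, the mean of the per-shard gradients equals the gradient over the whole mini-batch. The sketch below is plain Python with a hypothetical one-parameter model y = w · x and a stand-in `allreduce_mean` function; no real GPUs or MPI are involved.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w for y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def allreduce_mean(values):
    """Toy allreduce: every worker ends up with the mean of all values."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
num_gpus = 4
shard = len(xs) // num_gpus

# Each "GPU" computes the gradient on its own shard of the mini-batch.
local_grads = [grad_mse(w, xs[i * shard:(i + 1) * shard],
                        ys[i * shard:(i + 1) * shard])
               for i in range(num_gpus)]
synced = allreduce_mean(local_grads)   # what every GPU holds afterwards
full = grad_mse(w, xs, ys)             # gradient on the whole mini-batch
print(synced[0], full)
```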
Horovod and ring-allreduce
• Horovod is a Python framework for distributed deep learning
– supports TensorFlow, Keras, and PyTorch
• uses Nvidia’s NCCL 2, which provides a highly optimized version of ring-allreduce
• uses MPI, which launches all tasks and transparently sets up the distributed infrastructure for communication between tasks
– readily compatible with Slurm!
Image from: https://eng.uber.com/horovod/
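The idea behind ring-allreduce can be simulated in a few lines of plain Python. This is a sketch of the general algorithm, not of Horovod's or NCCL's actual implementation: each worker's gradient vector is split into one chunk per worker, partial sums travel around the ring (scatter-reduce), and the finished chunks travel around once more (allgather). The function name and the test data are made up for illustration.

```python
def ring_allreduce_mean(worker_grads):
    """Simulate ring-allreduce; returns the averaged vector held by each worker."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    bounds = [(k * length) // n for k in range(n + 1)]  # chunk boundaries
    data = [list(g) for g in worker_grads]

    def chunk(k):
        return range(bounds[k], bounds[k + 1])

    # Phase 1, scatter-reduce: in each step worker i sends one chunk to
    # worker i+1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        messages = [((i - step) % n, None) for i in range(n)]
        messages = [(k, [data[i][j] for j in chunk(k)])
                    for i, (k, _) in enumerate(messages)]
        for i, (k, payload) in enumerate(messages):
            dst = (i + 1) % n
            for j, value in zip(chunk(k), payload):
                data[dst][j] += value

    # Phase 2, allgather: the fully summed chunks are passed around the ring
    # and overwrite the partial sums on the other workers.
    for step in range(n - 1):
        messages = [((i + 1 - step) % n, None) for i in range(n)]
        messages = [(k, [data[i][j] for j in chunk(k)])
                    for i, (k, _) in enumerate(messages)]
        for i, (k, payload) in enumerate(messages):
            dst = (i + 1) % n
            for j, value in zip(chunk(k), payload):
                data[dst][j] = value

    return [[value / n for value in vec] for vec in data]

# Four workers, each with a gradient vector of six values.
grads = [[float(w * 10 + j) for j in range(6)] for w in range(4)]
result = ring_allreduce_mean(grads)
print(result[0])
```

Each worker sends and receives only one chunk per step, so the amount of data a single worker transfers stays roughly constant as the number of workers grows.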
GPU topology
CSC’s P100 nodes:
[Figure: two CPUs, each connected through its own PCIe switch to a pair of GPUs (GPU0 and GPU1; GPU2 and GPU3)]
NVIDIA DGX-1:
[Figure: GPU topology of the NVIDIA DGX-1]
GPU topology
NVIDIA DGX-2:
[Figure: GPU topology of the NVIDIA DGX-2]
Using multiple GPUs
1. request multiple GPUs (and CPUs) with sbatch:
--gres=gpu:type:number request number of GPUs of type (k80, p100)
-c, --cpus-per-task number of processors per task
2. modify your code to utilize multiple GPUs
– if you use some existing code, there might already be an option for this
– a single process may not be able to feed GPUs fast enough ⇒ use multiple CPU cores for data processing
– IO easily becomes the bottleneck (especially with spinning disks, network filesystem, lots of small files)
Using multiple CPUs for ETL (extract, transform, load)
• In Keras: multiprocessing, workers
hist = model.fit_generator(generator, ..., workers=N, use_multiprocessing=True/False)
• In PyTorch: workers (multiple processes)
train_loader = torch.utils.data.DataLoader(..., num_workers=N)
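A reasonable starting value for N is the number of CPU cores requested with `-c`. The sketch below assumes the job runs under Slurm, which exports `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is given; `workers_from_slurm` is a hypothetical helper, and the best value in practice still depends on the data pipeline.

```python
import os

def workers_from_slurm(default=1):
    """Number of data-loading workers, based on the CPUs Slurm allocated."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    if value is None:          # not running under Slurm, or -c was not given
        return default
    return max(1, int(value))

num_workers = workers_from_slurm()
print(num_workers)
```

The result can then be passed as `workers=num_workers` in Keras or `num_workers=num_workers` in PyTorch.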
Using multiple GPUs in Keras
• Keras/TF supports single-node multi-GPU data parallelism with keras.utils.multi_gpu_model(model, gpus):
with tf.device('/cpu:0'):
    _model = Sequential(...)
    _model.add(...)
    _model.add(...)
model = multi_gpu_model(_model, gpus=2)
model.compile(...)
• Notes:
– batch_size is split among GPUs (each gets batch_size/gpus of data)
– to save a multi-GPU model, use .save() with the template model (_model above)
Using multiple GPUs in PyTorch
• PyTorch supports single-node multi-GPU data parallelism by wrapping your model with torch.nn.DataParallel()
model = MyModel(...)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)
• Notes:
– batch_size is split among GPUs (each gets batch_size/gpus of data)
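Because the mini-batch is divided among the GPUs, the per-GPU batch shrinks as GPUs are added unless batch_size is scaled up accordingly. The sketch below only illustrates the arithmetic with a hypothetical helper; how a framework distributes a batch that is not evenly divisible may differ from this even split.

```python
def per_gpu_batch_sizes(batch_size, num_gpus):
    """Split batch_size into num_gpus parts whose sizes differ by at most one."""
    base, extra = divmod(batch_size, num_gpus)
    return [base + 1 if i < extra else base for i in range(num_gpus)]

print(per_gpu_batch_sizes(128, 4))  # [32, 32, 32, 32]
print(per_gpu_batch_sizes(130, 4))  # [33, 33, 32, 32]
```

Choosing batch_size as a multiple of the number of GPUs keeps the load balanced.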