Machine Learning and Data Mining I


Page 1: Machine Learning and Data Mining I

DS 4400

Alina Oprea

Associate Professor, CCIS

Northeastern University

April 4, 2019

Machine Learning and Data Mining I

Page 2: Machine Learning and Data Mining I

Logistics

• Final project presentations

– Thursday, April 11

– Tuesday, April 16 in ISEC 655

– 8-minute slot: 5 min presentation and 3 min questions

• Final report due on Tuesday, April 23

– Template in Piazza

– Schedule on Piazza

2

Page 3: Machine Learning and Data Mining I

Training

• Training data $(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})$

• One training example $x^{(i)} = (x_1^{(i)}, \dots, x_d^{(i)})$, label $y^{(i)}$

• One forward pass through the network
  – Compute prediction $\hat{y}^{(i)}$

• Loss function for one example
  – $L(\hat{y}, y) = -[(1 - y)\log(1 - \hat{y}) + y \log \hat{y}]$

• Loss function for training data
  – $J(W, b) = \frac{1}{N} \sum_{i} L(\hat{y}^{(i)}, y^{(i)}) + \lambda R(W, b)$

3

Cross-entropy loss
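As a concrete illustration, here is a minimal NumPy sketch of the cross-entropy loss above and the regularized objective $J(W, b)$. The function names and the choice of a squared-L2 penalty for $R(W, b)$ are assumptions made for illustration; the slides do not fix them.

```python
import numpy as np

def cross_entropy(y_hat, y):
    # L(y_hat, y) = -[(1 - y) log(1 - y_hat) + y log(y_hat)]
    return -((1 - y) * np.log(1 - y_hat) + y * np.log(y_hat))

def training_loss(y_hat, y, weights, lam=0.01):
    # J(W, b) = (1/N) * sum_i L(y_hat_i, y_i) + lambda * R(W, b)
    # R is taken here to be the squared L2 norm of the weights (an assumption).
    data_term = np.mean(cross_entropy(y_hat, y))
    reg_term = lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + reg_term
```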

Page 4: Machine Learning and Data Mining I

Mini-batch Gradient Descent

• Initialization
  – For all layers $\ell$: set $W^{[\ell]}, b^{[\ell]}$ at random

• Backpropagation
  – Fix learning rate $\alpha$
  – For all layers $\ell$ (starting backwards), for all batches $b$ of size $B$ with training examples $(x^{(i_b)}, y^{(i_b)})$:
    • $W^{[\ell]} = W^{[\ell]} - \alpha \sum_{i=1}^{B} \frac{\partial L(\hat{y}^{(i_b)}, y^{(i_b)})}{\partial W^{[\ell]}}$
    • $b^{[\ell]} = b^{[\ell]} - \alpha \sum_{i=1}^{B} \frac{\partial L(\hat{y}^{(i_b)}, y^{(i_b)})}{\partial b^{[\ell]}}$

4
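A minimal sketch of this per-batch update in NumPy, assuming a backward(...) routine that returns the gradients of the batch loss with respect to each layer's W and b; that routine is a hypothetical placeholder, not something defined on the slides.

```python
def minibatch_step(W, b, batch_x, batch_y, backward, alpha=0.1):
    # backward is assumed to return, for each layer l, the gradient of
    # sum_i L(y_hat_i, y_i) over the batch w.r.t. W[l] and b[l].
    dW, db = backward(W, b, batch_x, batch_y)
    for l in range(len(W)):
        W[l] -= alpha * dW[l]   # W[l] <- W[l] - alpha * dL/dW[l]
        b[l] -= alpha * db[l]   # b[l] <- b[l] - alpha * dL/db[l]
    return W, b
```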

Page 5: Machine Learning and Data Mining I

Training NN with Backpropagation

5

Given training set $(x_1, y_1), \dots, (x_N, y_N)$
Initialize all parameters $W^{[\ell]}, b^{[\ell]}$ randomly, for all layers $\ell$
Loop  (each iteration is one EPOCH)

  Update weights via gradient step

  • $W_{ij}^{[\ell]} = W_{ij}^{[\ell]} - \alpha \frac{\Delta_{ij}^{[\ell]}}{N}$

  • Similar for $b_{i}^{[\ell]}$

Until weights converge or maximum number of epochs is reached
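A sketch of this outer training loop. The helpers compute_gradients (which accumulates the Δ terms over all N examples via backpropagation) and has_converged are hypothetical placeholders; the slides do not specify them.

```python
def train(W, b, X, Y, compute_gradients, has_converged, alpha=0.1, max_epochs=100):
    N = len(X)
    for epoch in range(max_epochs):                        # one pass over the data = one epoch
        delta_W, delta_b = compute_gradients(W, b, X, Y)   # Delta accumulated over all N examples
        for l in range(len(W)):
            W[l] -= alpha * delta_W[l] / N                 # W[l] <- W[l] - alpha * Delta[l] / N
            b[l] -= alpha * delta_b[l] / N                 # similar update for the biases
        if has_converged(W, b):                            # stop early once the weights converge
            break
    return W, b
```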

Page 6: Machine Learning and Data Mining I

Gradient Descent Variants

6

Page 7: Machine Learning and Data Mining I

Training Neural Networks

• Randomly initialize weights

• Implement forward propagation to get prediction $\hat{y}^{(i)}$ for any training instance $x^{(i)}$

• Compute loss function $L(\hat{y}^{(i)}, y^{(i)})$

• Implement backpropagation to compute partial derivatives $\frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial W^{[\ell]}}$ and $\frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial b^{[\ell]}}$

• Use gradient descent with backpropagation to compute parameter values that optimize loss

• Can be applied to both feed-forward and convolutional nets

7
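To make the forward-propagation step concrete, here is a minimal NumPy sketch of one forward pass through a feed-forward network; the use of sigmoid activations throughout is an illustrative assumption, not something the slides prescribe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    # W, b hold one weight matrix / bias vector per layer.
    a = x
    for W_l, b_l in zip(W, b):
        a = sigmoid(W_l @ a + b_l)   # activation of the next layer
    return a                          # y_hat, the prediction for input x
```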

Page 8: Machine Learning and Data Mining I

Neural Network Architectures

8

Feed-Forward Networks
• Neurons from each layer connect to neurons from the next layer

Convolutional Networks
• Includes convolution layers for feature reduction
• Learns hierarchical representations

Recurrent Networks
• Keep hidden state
• Have cycles in the computational graph
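As a rough illustration of these three families, here is a sketch in PyTorch (my choice of framework; the slides do not name one) of how each architecture is typically declared. The layer sizes are arbitrary assumptions.

```python
import torch.nn as nn

# Feed-forward: fully connected layers, each layer feeding the next
feed_forward = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# Convolutional: convolution + pooling layers extract features hierarchically
# (sizes assume 28x28 single-channel inputs, e.g. MNIST-like images)
convolutional = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 13 * 13, 10),
)

# Recurrent: keeps a hidden state that is fed back at every time step
recurrent = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
```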

Page 9: Machine Learning and Data Mining I

Outline

• Recurrent Neural Networks (RNNs)
  – One-to-one, one-to-many, many-to-one, many-to-many
  – Blog by Andrej Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

• Unsupervised learning

• Dimensionality reduction
  – PCA

• Clustering

9

Page 10: Machine Learning and Data Mining I

RNN Architectures

10

Page 11: Machine Learning and Data Mining I

RNN Architectures

11

Page 12: Machine Learning and Data Mining I

RNN Architectures

12

Page 13: Machine Learning and Data Mining I

RNN Architectures

13

Page 14: Machine Learning and Data Mining I

Recurrent Neural Network

14

Page 15: Machine Learning and Data Mining I

RNN: Computational Graph

15

Page 16: Machine Learning and Data Mining I

RNN: Computational Graph

16

Page 17: Machine Learning and Data Mining I

One-to-Many

17

Page 18: Machine Learning and Data Mining I

Many-to-Many

18

Page 19: Machine Learning and Data Mining I

Many-to-One

19

Page 20: Machine Learning and Data Mining I

Example: Language Model

20

Page 21: Machine Learning and Data Mining I

Example: Language Model

21

Page 22: Machine Learning and Data Mining I

Example: Language Model

22

Page 23: Machine Learning and Data Mining I

Training RNNs

23

Page 24: Machine Learning and Data Mining I

Training RNNs

24

Page 25: Machine Learning and Data Mining I

Writing poetry

25

Page 26: Machine Learning and Data Mining I

Writing poetry

26

Page 27: Machine Learning and Data Mining I

Writing geometry proofs

27

Page 28: Machine Learning and Data Mining I

Writing geometry proofs

28

Page 29: Machine Learning and Data Mining I

Example RNN: LSTM

29

Capture long-term dependencies by using “memory cells”
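For reference, a minimal NumPy sketch of one LSTM step, showing how the memory cell $c_t$ carries long-term information while the hidden state $h_t$ is re-computed each step. The gate equations are the standard LSTM formulation; the parameter layout (dictionaries W and bias with one entry per gate) is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, bias):
    # W, bias hold the parameters of the four gates: input (i), forget (f),
    # output (o), and candidate (g); z stacks the input and previous hidden state.
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W["i"] @ z + bias["i"])   # how much new information to write
    f = sigmoid(W["f"] @ z + bias["f"])   # how much of the old cell to keep
    o = sigmoid(W["o"] @ z + bias["o"])   # how much of the cell to expose
    g = np.tanh(W["g"] @ z + bias["g"])   # candidate values to write
    c_t = f * c_prev + i * g              # memory cell: long-term state
    h_t = o * np.tanh(c_t)                # hidden state: short-term output
    return h_t, c_t
```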

Page 30: Machine Learning and Data Mining I

LSTM

30

Hidden State

Memory Cell

Page 31: Machine Learning and Data Mining I

LSTM vs Standard RNN

31

RNN

LSTM

Page 32: Machine Learning and Data Mining I

Summary RNNs

• RNNs maintain state and have flexible design

– One-to-many, many-to-one, many-to-many

• Applicable to sequential data

• LSTM maintains both short-term and long-term memory

• Better and simpler architectures are a topic of active research

32

Page 33: Machine Learning and Data Mining I

Unsupervised Learning

33

Page 34: Machine Learning and Data Mining I

Unsupervised Learning

• Different learning tasks

• Dimensionality reduction
  – Project the data to a lower-dimensional space
  – Example: PCA (Principal Component Analysis)

• Feature learning
  – Find feature representations
  – Example: Autoencoders

• Clustering
  – Group similar data points into clusters
  – Example: k-means, hierarchical clustering

34
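A quick sketch of two of these tasks with scikit-learn (my choice of library, not one named on the slides); the data here is random and only stands in for an unlabeled dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(200, 50)            # 200 unlabeled points in 50 dimensions

# Dimensionality reduction: project the data onto 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the projected points into 3 clusters
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)
```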

Page 35: Machine Learning and Data Mining I

Supervised vs Unsupervised Learning

35

• Supervised learning: standard metrics for evaluation

• Unsupervised learning: difficult to evaluate

Page 36: Machine Learning and Data Mining I

How Can We Visualize High-Dimensional Data?

36

Page 37: Machine Learning and Data Mining I

Data Visualization

37

Page 38: Machine Learning and Data Mining I

Principal Component Analysis

38

Page 39: Machine Learning and Data Mining I

The Principal Components

39

Page 40: Machine Learning and Data Mining I

2D Gaussian Data

40

Page 41: Machine Learning and Data Mining I

1st PCA Axis

41

Page 42: Machine Learning and Data Mining I

2nd PCA Axis

42

Page 43: Machine Learning and Data Mining I

PCA Algorithm

43

$\Sigma x = \lambda x$  (the principal components are eigenvectors $x$ of the data covariance matrix $\Sigma$, with eigenvalues $\lambda$)
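A minimal NumPy sketch of the computation this equation refers to: center the data, form the covariance matrix $\Sigma$, and keep the eigenvectors with the largest eigenvalues as principal components. This is a sketch under those assumptions, not the exact implementation from lecture.

```python
import numpy as np

def pca(X, d):
    # X: N x D data matrix; returns the N x d projection onto the top d components.
    X_centered = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(X_centered, rowvar=False)             # covariance matrix Sigma (D x D)
    eigvals, eigvecs = np.linalg.eigh(cov)             # solves Sigma x = lambda x
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]    # d eigenvectors with largest eigenvalues
    return X_centered @ top
```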

Page 44: Machine Learning and Data Mining I

PCA

44

Page 45: Machine Learning and Data Mining I

PCA

45

Page 46: Machine Learning and Data Mining I

PCA

46

Page 47: Machine Learning and Data Mining I

Visualizing data

47

Page 48: Machine Learning and Data Mining I

PCA for image compression

[Figure: image reconstructions using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, alongside the original image]

48

Page 49: Machine Learning and Data Mining I

Summary: PCA

• PCA creates a lower-dimensional feature representation
  – Linear transformation

• Can be used for visualization

• Can be used with supervised or unsupervised learning
  – Very common to use classification after the PCA transformation

• Main drawback
  – No interpretability of resulting features

49
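A sketch of the "classification after PCA" pattern as a scikit-learn pipeline; the library, the classifier, and the number of components are my illustrative choices, not part of the slides.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Reduce to 10 principal components, then train a classifier on the new features
model = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
# model.fit(X_train, y_train); model.predict(X_test)
```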

Page 50: Machine Learning and Data Mining I

Clustering

• Goal: Automatically segment data into groups of similar points

• Question: When and why would we want to do this?

• Useful for:
  – Automatically organizing data
  – Understanding hidden structure in data and the data distribution
  – Detecting similar points in data and generating representative samples

50

Page 51: Machine Learning and Data Mining I

Clustering Examples

• Social networks
  – Group Facebook users according to their interests and profiles

• Image search
  – Retrieve images similar to an input image

• NLP
  – Topic discovery in articles

• Medicine
  – Patients with similar diseases and symptoms

• Cyber security
  – Machines with the same malware infection
  – New attacks have no labels

51

Page 52: Machine Learning and Data Mining I

Setup

• Data: points in $d$ dimensions, each point $x_n = \langle x_{n,1}, \dots, x_{n,d} \rangle$

• Output: an assignment from each point to a cluster index

52

Page 53: Machine Learning and Data Mining I

What is a good distance function?

Partition this data into k groups

53

Euclidean distance: $d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$
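The same distance as a one-line NumPy sketch:

```python
import numpy as np

def euclidean_distance(x, y):
    # d(x, y) = sqrt(sum_j (x_j - y_j)^2)
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))
```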

Page 54: Machine Learning and Data Mining I

K-means Algorithm

• Fix a number of desired clusters k

• Key insight: describe each cluster by its mean value (called cluster representative)

• Algorithm
  – Select k cluster means at random
  – Assign points to the “closest cluster”
  – Re-compute cluster means based on the new assignment
  – Refine the assignment iteratively until convergence

54
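A compact NumPy sketch of this algorithm: random initialization from the data, hard assignments by Euclidean distance, and mean updates until the means stop changing. It is a simplified sketch; edge cases such as empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]     # select k cluster means at random
    for _ in range(max_iters):
        # assign each point to the closest cluster mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # re-compute cluster means based on the new assignment
        new_means = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):                    # converged
            break
        means = new_means
    return assign, means
```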

Page 55: Machine Learning and Data Mining I

Example: Start

55

Page 56: Machine Learning and Data Mining I

56

Example: Iteration 1

Page 57: Machine Learning and Data Mining I

57

Example: Iteration 2

Page 58: Machine Learning and Data Mining I

58

Example: Iteration 3

Page 59: Machine Learning and Data Mining I

59

Example: Iteration 4

Page 60: Machine Learning and Data Mining I

60

Example: Iteration 5

Page 61: Machine Learning and Data Mining I

Acknowledgements

• Slides made using resources from:

– Yann LeCun

– Andrew Ng

– Eric Eaton

– David Sontag

– Andrew Moore

• Thanks!