
Page 1

Convolution and Pooling

Benjamin Roth, Nina Poerner

CIS LMU München

November 26, 2019

Benjamin Roth, Nina Poerner (CIS LMU München) Convolution and Pooling November 26, 2019 1 / 27

Page 2

Convolutional Neural Networks (CNNs)

Technique from Computer Vision (e.g., object recognition in images)

Alternative to RNNs for many (not all) NLP tasks

General idea: Filter bank with N learnable filters


Page 3

Convolution with one filter

For now, assume we have only one filter

Move filter over input with step size (stride) s (here: 1)

At every position, multiply filter and input entries together (elementwise), and sum the results into a single value

[Figure: a 3 × 4 Input convolved with a 2 × 2 Filter yields a 2 × 3 Output; the individual numbers are garbled in this transcript.]

Filter size: 2 × 2

Input size: 3 × 4
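The sliding-window operation above can be sketched in NumPy. This is a minimal illustration: the input and filter values below are made up, not the ones from the slide.

```python
import numpy as np

def conv2d_single_filter(x, f, stride=1):
    """Slide filter f over input x; at each position, multiply
    elementwise and sum the products into a single output value."""
    h, w = x.shape
    fh, fw = f.shape
    out = np.zeros(((h - fh) // stride + 1, (w - fw) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i * stride:i * stride + fh, j * stride:j * stride + fw]
            out[i, j] = np.sum(window * f)
    return out

x = np.arange(12).reshape(3, 4)       # illustrative 3 x 4 input
f = np.array([[1., 0.], [0., -1.]])   # illustrative 2 x 2 filter
y = conv2d_single_filter(x, f)
print(y.shape)  # (2, 3): a 2 x 2 filter fits a 3 x 4 input 2 x 3 times
```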


Page 4

Building an edge detector filter

Assume that -1 means black and +1 means white

We want to build a filter that can detect diagonal edges where the upper left side is dark and the lower right side is bright

= a filter that calculates a high positive number on windows that look like this:

[Figure: example windows with −3 entries in the dark region and +3 entries in the bright region, forming a diagonal edge; the layout is garbled in this transcript.]

How would the filter react to this window?

In CNNs, the filters are not manually chosen, but learned with gradient descent


Page 5

Convolution with one filter: Tensor sizes

Most images are not 2D but 3D
- 3rd dimension is # channels, e.g., RGB values
- image height × image width × # channels

As a consequence, each filter is also 3D
- filter height × filter width × # channels

The operation stays the same, with an additional summation over the channel dimension
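A minimal sketch of the multi-channel case; the only change from the single-channel version is that the elementwise product is also summed over the channel axis. All shapes below are hypothetical.

```python
import numpy as np

def conv2d_multichannel(x, f):
    """x: (height, width, channels); f: (fh, fw, channels).
    Same sliding operation, with an extra sum over the channel axis."""
    h, w, c = x.shape
    fh, fw, fc = f.shape
    assert c == fc, "filter must have as many channels as the input"
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # np.sum over the window collapses height, width AND channels
            out[i, j] = np.sum(x[i:i + fh, j:j + fw, :] * f)
    return out

x = np.ones((5, 6, 3))   # e.g., a tiny RGB image
f = np.ones((2, 2, 3))
out = conv2d_multichannel(x, f)
print(out.shape)  # (4, 5); each entry sums 2 * 2 * 3 = 12 products
```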


Page 6

Convolution with N filters

Apply N different filters of the same size → N matrices with the same size

Stack the N matrices on top of each other → 3D tensor, where the last dimension is N

Also known as a feature map

Feature map is slightly smaller than input (why?)

Because a filter of size k fits into an input of size h only h − k + 1 times

... unless we pad the input with zeros
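The h − k + 1 rule, with optional zero padding, as a small helper (the sizes in the example are hypothetical):

```python
def output_size(h, k, pad=0, stride=1):
    # a filter of size k fits a padded input of size h + 2*pad
    # exactly (h + 2*pad - k) // stride + 1 times
    return (h + 2 * pad - k) // stride + 1

print(output_size(32, 3))         # 30: slightly smaller than the input
print(output_size(32, 3, pad=1))  # 32: padding with zeros preserves the size
```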


Page 7

Convolution with N filters: Tensor sizes

Tensor sizes:
- Input 3D: input height × input width × # channels (if this is the first layer, otherwise # filters of previous layer)
- Parameter tensor 4D: filter height × filter width × # channels × # filters
- Output 3D: input height* × input width* × # filters
- *height and width are slightly reduced by convolution unless we do padding
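The tensor sizes above can be sanity-checked with a few lines of plain arithmetic (all sizes here are hypothetical):

```python
import numpy as np

# Hypothetical sizes for one convolution layer (valid padding, stride 1)
H, W, C, N, fh, fw = 28, 28, 3, 16, 5, 5

x = np.zeros((H, W, C))               # input 3D
params = np.zeros((fh, fw, C, N))     # parameter tensor 4D
out_shape = (H - fh + 1, W - fw + 1, N)  # output 3D: height/width shrink, last dim = # filters

print(out_shape)  # (24, 24, 16)
```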


Page 8

What does convolution do?

Contextualization: Feature vector computed for position (i, j) contains info from (i − k, j − k) to (i + k, j + k) (where k is filter size).

Locality-preserving: In one convolution layer, info can travel no further than k positions

Computer Vision: Many convolutional layers applied one after another

Typical nonlinearity between convolution layers: ReLU

With every layer, feature maps become more complex

Pixels → edges → shapes → small objects → bigger, compositional


Page 9

Convolution

Source: Computer science: The learning machines. Nature (2014).


Page 10

Pooling

Often applied between convolution steps

Divide feature map into “grid”

Combine vectors inside the same grid cell with some operator

Most popular: Average pooling, Max pooling

Max pooling: keep only the maximum value in each dimension

“Feature detector”, “Cat neuron fires”

[Figure: max-pooling example; a grid of feature vectors is reduced to the per-dimension maxima of each grid cell. The numbers are garbled in this transcript.]
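A minimal max-pooling sketch over a 3D feature map; the grid cell size and the feature-map values below are illustrative.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Divide a (h, w, n) feature map into non-overlapping size x size
    grid cells and keep the per-filter maximum inside each cell."""
    h, w, n = x.shape
    x = x[:h - h % size, :w - w % size]  # drop any ragged border
    x = x.reshape(h // size, size, w // size, size, n)
    return x.max(axis=(1, 3))            # max over both within-cell axes

fmap = np.arange(24, dtype=float).reshape(2, 4, 3)  # illustrative feature map
pooled = max_pool2d(fmap)
print(pooled.shape)  # (1, 2, 3): 2x4 grid collapsed to 1x2, filters kept
```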


Page 11

Convolution and Pooling: LeNet

LeCun et al. (1998). Gradient-based learning applied to document recognition.


Page 12

Convolution for NLP

Images have width and height, but text only has “width” (length)

→ We can discard the “height” dimension from our filters

Tensor sizes (in NLP):
- Input 2D: sentence length × # channels (word embedding size, or # filters of previous convolution)
- Parameter tensor 3D: filter length × # channels × # filters
- Output 2D: sentence length* × # filters
- *length slightly reduced unless we do padding
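A 1D convolution over a sequence of word vectors, following the shapes above. The embedding size and filter count here are hypothetical.

```python
import numpy as np

def conv1d(x, filters):
    """x: (length, channels); filters: (width, channels, n_filters).
    Convolution over the length dimension only."""
    length, channels = x.shape
    width, fc, n = filters.shape
    assert channels == fc
    out = np.zeros((length - width + 1, n))
    for i in range(out.shape[0]):
        window = x[i:i + width]  # (width, channels)
        # elementwise product with each filter, summed over width and channels
        out[i] = np.einsum('wc,wcn->n', window, filters)
    return out

x = np.random.randn(10, 50)     # 10 words, 50-dim embeddings (hypothetical)
f = np.random.randn(3, 50, 16)  # filter width 3, 16 filters
print(conv1d(x, f).shape)  # (8, 16): length shrinks to 10 - 3 + 1
```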

Computer vision: 2D convolution (over height and width)

NLP: 1D convolution (over length)

Typically fewer convolutional layers than Computer Vision


Page 13

Pooling for NLP

Pooling between convolutional layers less frequently used than in Computer Vision

After last convolutional layer: “global” pooling step

Calculate max/average over the entire sequence (“pooling over time”)
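"Pooling over time" sketched with NumPy: the whole (length × # filters) feature map collapses to one vector per filter, independent of sequence length. The feature values here are illustrative.

```python
import numpy as np

features = np.array([[0.1, 2.0],
                     [3.0, 0.5],
                     [1.0, 1.5]])     # length 3, 2 filters (illustrative)

global_max = features.max(axis=0)     # max over the entire sequence
global_avg = features.mean(axis=0)    # average over the entire sequence
print(global_max)  # one value per filter, whatever the sequence length
```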


Page 14

Convolution and Pooling for NLP

- What is the unpadded input size (= length)? 6
- What is the padded input size? 8
- How many filters? 3
- How many input channels (= word vector dimensions)? 4
- What is the filter size (= filter width)? 3
- What stride (= step size)? 1
- What is the output size of the convolution operation? 6 × 3
- What is the output size of the pooling operation? 3
- How many parameters have to be learned? 3 × 3 × 4 = 36
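The quiz answers above can be verified with plain arithmetic:

```python
# Sizes from the quiz: length 6, padded to 8, 3 filters of width 3,
# 4 input channels, stride 1.
length, pad, n_filters, channels, width, stride = 6, 1, 3, 4, 3, 1

padded = length + 2 * pad                  # 8
conv_out = (padded - width) // stride + 1  # 6 positions -> output 6 x 3
pool_out = n_filters                       # global pooling: one value per filter
params = width * channels * n_filters      # 3 * 4 * 3 = 36 (ignoring biases)

print(padded, conv_out, pool_out, params)  # 8 6 3 36
```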


Page 15

Convolution and Pooling for NLP

Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of ConvNets for Sentence Classification.


Page 16

Convolution and Pooling for NLP

Source: Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification.


Page 17

# binary classifier, e.g., sentiment polarity
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from keras.models import Sequential

emb_layer = Embedding(input_dim = VOCAB_SIZE, output_dim = EMB_DIM)
conv_layer = Conv1D(filters = NUM_FILTERS, kernel_size = FILTER_WIDTH,
                    activation = "relu")
pool_layer = GlobalMaxPooling1D()
dense_layer = Dense(units = 1, activation = "sigmoid")

model = Sequential(layers = [emb_layer, conv_layer, pool_layer, dense_layer])
model.compile(loss = "binary_crossentropy", optimizer = "sgd")

X, Y = load_data()  # placeholder: supply your own data-loading code
model.fit(X, Y)


Page 18

RNN vs. CNN

Range
- CNN: Cannot capture dependencies with range above k × L (where k is filter width and L is the number of layers)
- RNN: Can capture long-range dependencies

Information transport
- RNN: Must learn to "transport" salient information across many time steps.
- CNN: No information transport across time; salient information "fast-tracked" by global max pooling

Efficiency
- RNN: Sequential data processing → not parallelizable over time
- CNN: Input windows are independent from one another → highly parallelizable over time


Page 19

RNN vs. CNN: Quiz

Given a task description, choose the appropriate architecture!
- Task: predict the number of the main verb (sleep or sleeps)
  - The cats, who were sitting on the map inside the house, [sleep/sleeps?]
  - Which architecture should we use? RNN
- Task: predict the polarity of the review:
  - [... many useless sentences ...] best book ever [... many useless sentences ...]
  - Which architecture should we use? CNN

Task: Machine Translation

Which architecture should we use?
- Intuitively RNN (because MT is all about long-range dependencies), but ...
- Attention gives CNNs the ability to capture long-range dependencies, while maintaining parallel processing (Gehring et al.)
- More about attention: Later in this course
