generating videos with scene dynamics · 2017-05-03 · training: dataset acquisition 2m unlabelled...

68
Generating Videos with Scene Dynamics Presented by Bryan Anenberg and Aojia Zhao By Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

Upload: others

Post on 09-Aug-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generating Videos with Scene Dynamics

Presented by Bryan Anenberg and Aojia Zhao

By Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

Page 2: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

MotivationGoal:

● Model scene dynamics● Make use of unlabeled video data

Why:

● Useful for video understanding and simulation

Approach:

● Two stream Generative Adversarial Network (GAN)

Page 3: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generated Videos of Golf Course

Page 4: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generated Videos of Beach

Page 5: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generated Videos of Train Station

Page 6: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generated Videos of (creepy) Baby

Page 7: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Future GenerationInput Output Input OutputInput Output

Page 8: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Background

● GANS● Fractional Strided Convolution● Spatio-temporal Convolution● Unsupervised Representation Learning with Deep

Convolutional Generative Adversarial Networks [Adford, Metz, Chintala 2016]

Page 9: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Background

● GANS● Fractional Strided Convolution● Spatio-temporal Convolution● Unsupervised Representation Learning with Deep

Convolutional Generative Adversarial Networks [Adford, Metz, Chintala 2016]

Page 10: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generative Adversarial Networks

Page 11: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generative Adversarial Networks

Page 12: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generative Adversarial Networks

Two-player minimax game with value function

Page 13: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Minibatch SGD training of GANs

Update discriminator by gradient ascent

Alternating optimization

Update generator by gradient descent

Page 14: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Background

● GANS● Fractional Strided Convolution● Spatio-temporal Convolution● Unsupervised Representation Learning with Deep

Convolutional Generative Adversarial Networks [Adford, Metz, Chintala 2016]

Page 15: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Fractional Strided Convolution

● To stride by 1/n steps, pad n-1 zero blocks around each block and do unit striding

● Used to upsample

Page 16: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Background

● GANS● Fractional Strided Convolution● Spatio-temporal Convolution● Unsupervised Representation Learning with Deep

Convolutional Generative Adversarial Networks [Adford, Metz, Chintala 2016]

Page 17: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Spatio Temporal Convolution● Analogous to normal 2D convolution over 2D or 3D space● Using a 3D spatial convolution filter to stride over the 3D spatial as well as 4th

temporal dimension

Page 18: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Background

● GANS● Fractional Strided Convolution● Spatio-temporal Convolution● Unsupervised Representation Learning with Deep

Convolutional Generative Adversarial Networks [Adford, Metz, Chintala 2016]

Page 19: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Combining GANs and CNNs

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Adford, Metz, Chintala 2016]

Page 20: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Combining GANs and CNNs

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Lsun bedroom dataset

Page 21: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generating Videos with Scene Dynamics

Page 22: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture

● One stream spatio-temporal generator● Two stream foreground background generator● Foreground-Background Mask● Discriminator

Page 23: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture

● One stream spatio-temporal generator● Two stream foreground background generator● Foreground-Background Mask● Discriminator

Page 24: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: One Stream

Page 25: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: One Stream

● Spatio-temporal provides invariance

● Fractionally stride up samples efficiently

Page 26: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: One Stream

● Spatio-temporal provides invariance

● Fractionally stride up samples efficiently

Projected4X4X2 Filter

Page 27: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: One Stream

● Spatio-temporal provides invariance

● Fractionally stride up samples efficiently

Projected4X4X2 Filter

Fractional Striding4X4X4 Filter

Page 28: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture

● One stream spatio-temporal generator● Two stream foreground background

generator● Foreground-Background Mask● Discriminator

Page 29: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: Two Stream

Same structure as one stream for foreground

Page 30: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: Two Stream

Same structure as one stream for foreground

Eliminates temporal dimension

Page 31: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture

● One stream spatio-temporal generator● Two stream foreground background generator● Foreground-Background Mask● Discriminator

Page 32: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: MaskBounded in range [0, 1]Gives weight to foreground vs background influence at each pixel

Page 33: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

● Mask matrix “cuts” out the foreground objects from the back ground, allowing for motion to occur for just the foreground.

● Sigmoid activation

Generator-Discriminator Architecture: Mask

Page 34: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture

● One stream spatio-temporal generator● Two stream foreground background generator● Foreground-Background Mask● Discriminator

Page 35: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: DiscriminatorRecognizes realistic motionDownsample instead of upsample

Page 36: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: DiscriminatorRecognizes realistic motionDownsample instead of upsample

4X4X4 FiltersNormal Striding

Page 37: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generator-Discriminator Architecture: DiscriminatorRecognizes realistic motionDownsample instead of upsample

4X4X4 FiltersNormal Striding

4X4X2 FilterBinary Classification

Page 38: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Training

● Intuition● Dataset Acquisition● Dataset Processing

Page 39: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Training

● Intuition● Dataset Acquisition● Dataset Processing

Page 40: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Training: Intuition

● Train both discriminator and generator in alternation.

● Discriminator wants to learn what makes a real image/video

● Generator wants to fool discriminator by producing real looking image/video

Video

Page 41: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Training

● Intuition● Dataset Acquisition● Dataset Processing

Page 42: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Training: Dataset Acquisition

● 2M unlabelled videos from Flickr● 5k hours of unfiltered video● Rest filtered video with Places2● 4 filtered categories of videos:

golf course, babies, beaches, train station

● Normalizes to [-1, 1]

Page 43: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Training

● Intuition● Dataset Acquisition● Dataset Processing

Page 44: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Training: Dataset Processing

● Stabilizes data using SIFT and RANSAC to determine homography between frames

● Uses previous frame to fill in missing values, reject overly large changes

Page 45: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Experiments

● Action Classification● Video Generation● Future Generation

Page 46: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Experiments

● Action Classification● Video Generation● Future Generation

Page 47: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Action ClassificationThe Video GAN model learns a useful representation that can be transferred to Action Classification

● Train on 5k hours of unlabeled Flicker videos

● Fine-tune the discriminator on UCF101 action classification dataset

● Replace last layer of discriminator to be a K-way softmax classifier

Page 48: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Action ClassificationThe Video GAN model learns a useful representation that can be transferred to Action Classification

● Train on 5k hours of unlabeled Flicker videos

● Fine-tune the discriminator on UCF101 action classification dataset

● Replace last layer of discriminator to be a K-way softmax classifier

Page 49: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Action ClassificationThe Video GAN model learns a useful representation that can be transferred to Action Classification

● Train on 5k hours of unlabeled Flicker videos

● Fine-tune the discriminator on UCF101 action classification dataset

● Replace last layer of discriminator to be a K-way softmax classifier

Page 50: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Action ClassificationThe Video GAN model learns a useful representation that can be transferred to Action Classification

● Train on 5k hours of unlabeled Flicker videos

● Fine-tune the discriminator on UCF101 action classification dataset

● Replace last layer of discriminator to be a K-way softmax classifier

Page 51: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Experiments

● Action Classification● Video Generation● Future Generation

Page 52: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generating Video

Autoencoder

100D

Latent z code

Page 53: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generating Video● Fit a 256 component Gaussian Mixture Model over the autoencoder 100-dim

outputs● To generate video, sample from GMM a 100-dim latent code ● Latent code passed to generator to generate 32-frame video

Random Noise Gaussian Mixture Model

Page 54: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generating Video

100D vector sampled from GMM

Page 55: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generated Videos of Beach

Page 56: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Generated Videos of (creepy) Baby

Page 57: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Mechanical Turk Evaluation

● 13,000 opinions over 4 categories by 150 workers● Mostly prefers VGAN over autoencoder● Unexpected number prefers VGAN over real

Page 58: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Experiments

● Action Classification● Video Generation● Future Generation

Page 59: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Future GenerationInput Output Input OutputInput Output

Page 60: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Future Generation

Page 61: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Summary: Motivation● Model scene dynamics● Make use of unlabeled video data

Page 62: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Summary: Architecture

● One stream spatial temporal generator

● Two stream foreground background generator

● Discriminator recognizes real and generated videos

Page 63: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Summary: Experimentation and Results● VGAN successful classification of action● Randomly generating video with z sample from GMM● Generating future video with single frame

Page 64: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Questions?

Page 65: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Questions?

Page 66: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Questions?

Page 67: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Questions?

Page 68: Generating Videos with Scene Dynamics · 2017-05-03 · Training: Dataset Acquisition 2M unlabelled videos from Flickr 5k hours of unfiltered video Rest filtered video with Places2

Input Output Input OutputInput OutputQuestions?