
Learning Temporal Transformations From Time-Lapse Videos

Yipin Zhou Tamara L. Berg

University of North Carolina at Chapel Hill
{yipin,tlberg}@cs.unc.edu

Abstract. Based on life-long observations of physical, chemical, and biological phenomena in the natural world, humans can often easily picture in their minds what an object will look like in the future. But, what about computers? In this paper, we learn computational models of object transformations from time-lapse videos. In particular, we explore the use of generative models to create depictions of objects at future times. These models explore several different prediction tasks: generating a future state given a single depiction of an object, generating a future state given two depictions of an object at different times, and generating future states recursively in a recurrent framework. We provide both qualitative and quantitative evaluations of the generated results, and also conduct a human evaluation to compare variations of our models.

Keywords: generation, temporal prediction, time-lapse video

1 Introduction

Before they can speak or understand language, babies have a grasp of some natural phenomena. They realize that if they drop the spoon it will fall to the ground (and their parent will pick it up). As they grow older, they develop understanding of more complex notions like object constancy and time. Children acquire much of this knowledge by observing and interacting with the world.

In this paper we seek to learn computational models of how the world works through observation. Specifically, we learn models for natural transformations from video. To enable this learning, we collect time-lapse videos demonstrating four different natural state transformations: melting, blooming, baking, and rotting. Several of these transformations are applicable to a variety of objects. For example, butter, ice cream, and snow melt, bread and pizzas bake, and many different objects rot. We train models for each transformation – irrespective of the object undergoing the transformation – under the assumption that these transformations have shared underlying physical properties that can be learned.

To model transformations, we train deep networks to generate depictions of the future state of objects. We explore several different generation tasks for modeling natural transformations. The first task is to generate the future state depiction of an object from a single image of the object (Sec 3.1). Here the input is a frame depicting an object at time t, and output is a generated depiction


of the object at time t+k, where k is specified as a conditional label input to the network. For this task we explore two auto-encoder based architectures. The first architecture is a baseline algorithm built from a standard auto-encoder framework. The second architecture is a generative adversarial network where, in addition to the baseline auto-encoder, a discriminator network is added to encourage more realistic outputs.

For our second and third future prediction tasks, we introduce different ways of encoding time in the generation process. In our two-stack model (Sec 3.2) the input is two images of an object at time t and t+m, and the model learns to generate a future image according to the implicit time gap between the input images (i.e. generate a prediction of the object at time t+2m). These models are trained on images with varying time gaps. Finally, in our last prediction task, our goal is to recursively generate the future states of an object given a single input image of the object. For this task, we use a recurrent neural network to recursively generate future depictions (Sec 3.3). For each of the described future generation tasks, we also explore the effectiveness of pre-training on a large set of images, followed by fine-tuning on time-lapse data for improving performance.

Future prediction has been explored in previous works for generating the next frame or next few frames of a video [1–3]. Our focus, in comparison, is to model general natural object transformations and to model future prediction at a longer time scale (minutes to days) than previous approaches.

We evaluate the generated results of each model both quantitatively and qualitatively under a variety of different training scenarios (Sec 4). In addition, we perform human evaluations of model variations and image retrieval experiments. Finally, to help understand what these models have learned, we also visualize common motion patterns of the learned transformations. These results are discussed in Sec 5.

The innovations introduced by our paper include: 1) A new problem of modeling natural object transformations with deep networks, 2) A new dataset of 1463 time-lapse videos depicting 4 common object transformations, 3) Exploration of deep network architectures for modeling and generating future depictions, 4) Quantitative, qualitative, and human evaluations of the generated results, and 5) Visualizations of the learned transformation patterns.

1.1 Related work

Object state recognition: Previous works [4–6] have looked at the problem of recognizing attributes, which has significant conceptual overlap with the idea of object state recognition. For example, "in full bloom" could be viewed as an attribute of flowers. Parikh and Grauman [5] train models to recognize the relative strength of attributes, such as face A is "smiling more" than face B, from ordered sets of images. One way to view our work is as providing methods to train relative state models in the temporal domain. Most relevant to our work, given a set of object transformation terms, such as ripe vs unripe, [7] learns visual classification and regression models for object transformations from a collection of photos. In contrast, our work takes a deep learning approach and learns transformations from video – perhaps a more natural input format for learning temporal information.


Timelapse data analysis: Timelapse data captures changes in time and has been used for various applications. [8] hallucinates an input scene image at a different time of day by making use of a timelapse video dataset exhibiting lighting changes in an example-based color transfer technique. [9] presents an algorithm that synthesizes timelapse videos of landmarks from large internet image collections. In their follow-up work, [10] imports additional camera motion while composing videos to create transformations in time and space.

Future prediction: Future prediction has been applied to various tasks such as estimating the future trajectories of cars [11], pedestrians [12], or general objects [13] in images or videos. In the ego-centric activity domain, [14] encodes the prediction problem as a binary task of selecting which of two video clips is first in temporal ordering. Given large amounts of unlabeled video data from the internet, [15] trains a deep network to predict visual representations of future images, enabling them to anticipate both actions and objects before they appear.

Image/Frame generation: Generative models have attracted extensive attention in machine learning [16–21]. Recently many works have focused on generating novel natural or high-quality images. [22] applies deep structure networks trained on synthetic data to generate 3D chairs. [23] combines variational auto-encoders with an attention mechanism to recurrently generate different parts of a single image. Generative adversarial networks (GANs) have shown great promise for improving image generation quality [24]. GANs are composed of two parts, a generative model and a discriminative model, to be trained jointly. Some extensions have combined the GAN structure with a multi-scale Laplacian pyramid to produce high-resolution generation results [25]. Recently [26] incorporated deep convolutional neural network structures into GANs to improve image quality. [27] proposed a network to generate the contents of an arbitrary image region according to its surroundings. Some related approaches [1–3] have trained generation models to reconstruct input video frames and/or generate the next few consecutive frames of a video. We also explore the use of DCGANs for future prediction, focusing on modeling object transformations over relatively long time scales.

2 Object-centric timelapse dataset

Given the high-level goal of understanding temporal transformations of objects, we require a collection of videos showing temporal state changes. Therefore, we collect a large set of object-centric timelapse videos from the web. Time-lapse videos are ideal for our purposes since they are designed to show an entire transformation (or a portion of a transformation) within a short period of time.

2.1 Data collection

We observe that time-lapse photography is quite popular due to the relative ease of data collection. A search on YouTube for "time lapse" returns over 11 million results.


Rotting: 185     Strawberry: 35, Watermelon: 9, Tomato: 25, Banana: 26, Apple: 23, Peach: 8, Other: 59
Melting: 453     Ice cream: 128, Chocolate: 18, Butter: 9, Snow: 54, Wax: 60, Ice: 184
Baking: 242      Cookies: 55, Bread: 57, Pizza: 59, Cake: 48, Other: 23
Blooming: 583    Flower: 583

Table 1. Statistics of our transformation categories. Some categories contain multiple objects (e.g. ice cream, chocolate, etc. melting) while others apply only to a specific object (e.g. flowers blooming). Values indicate the total number of videos collected for each category.

Anyone with a personal camera, GoPro, or even a cell phone can capture a simple time-lapse video, and many people post and publicly share these videos on the web. We collect our object-based timelapse video dataset by directly querying keywords through the YouTube API. For this paper, we query 4 state transformation categories: Blooming, Melting, Baking and Rotting, combined with various object categories. This results in a dataset of more than 5000 videos. This dataset could be extended to a wider variety of transformations or to more complex multi-object transformations, but as a first step we focus on these 4 as our initial goal set of transformations for learning.

For ease of learning, ideally these videos should be object-centric with a static camera capturing an entire object transformation. However, many videos in the initial data collection stage do not meet these requirements. Therefore, we clean the data using Amazon Mechanical Turk (AMT) as a crowdsourcing platform. Each video is examined by 3 Turkers who are asked questions related to video quality. Videos that are not timelapse, contain severe camera motion, or are not consistent with the query labels are removed. We also manually adjust parts of videos that play backwards (a technique used in some time-lapse videos) or contain more than one round of the specified transformation, and remove irrelevant prologs/epilogs. Finally, our resulting dataset contains 1463 high-quality object-based timelapse videos. Table 1 shows the statistics of transformation categories and their respective object counts. Fig. 1 shows example frames of each transformation category.

2.2 Transformation degree annotation

To learn natural transformation models of objects from videos, we need to first label the degree of transformation throughout the videos. In language, people use text labels to describe different states, for instance, fresh vs rotted apple. However, the transformation from one state to another is really a continuous evolution. Therefore, we represent the degree of transformation with a real number, assigning the start state a value of 0 (not at all rotten) and the end state (completely rotten) a value of 1.


Fig. 1. Example frames from our dataset of each transformation category: Blooming, Melting, Baking, and Rotting. In each column, time increases as you move down the column, showing how an object transforms.

To annotate objects from different videos we could naively assign the first frame a value of 0 and the last frame a value of 1, interpolating in between. However, we observe that some time-lapse videos may not depict entire transformations, resulting in poor alignments.

Therefore, we design a labeling task for people to assign degrees of transformation to videos. Directly estimating a real value for frames turns out to be impractical, as people may have different conceptions of transformation degree. Instead our interface displays reference frames of an object category-transformation pair and asks Turkers to align frames from a target video to the reference frames according to degree of transformation. Specifically, for each object category-transformation pair we select 5 reference frames from a reference video showing transformation degree values of 0, 0.25, 0.5, 0.75, and 1. Then, for the rest of the videos in that object-transformation category, we ask Turkers to select frames visually displaying the same degree of transformation as the reference frames. If the displayed video does not depict an entire transformation, Turkers may align fewer than 5 frames with the reference. Each target video is aligned by 3 Turkers and the median of their responses is used as the annotation label (linearly interpolating degrees between labeled target frames). This provides us with consistent degree annotations between videos.
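For illustration, per-frame degree labels can be derived from the Turker alignments roughly as in the sketch below (median over annotators, then linear interpolation). The frame indices are made up for the example and the exact aggregation details may differ.

```python
import numpy as np

# Hypothetical alignments for one target video: each of 3 Turkers selects the
# frame indices that visually match the reference degrees 0, 0.25, 0.5, 0.75, 1.0.
reference_degrees = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
turker_alignments = np.array([
    [0, 40, 85, 130, 170],
    [0, 38, 90, 128, 168],
    [0, 45, 88, 132, 172],
])

# Median over Turkers gives one aligned frame index per reference degree.
aligned_frames = np.median(turker_alignments, axis=0)

# Linearly interpolate a degree value for every frame of the video.
num_frames = 180
frame_indices = np.arange(num_frames)
frame_degrees = np.interp(frame_indices, aligned_frames, reference_degrees)
```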

3 Future state generation Tasks & Approaches

Our goal is to generate depictions of the future state of objects. In this work, we explore frameworks for 3 temporal prediction tasks. In the first task (Sec 3.1), called pairwise generation, we input an object-centric frame and the model generates an image showing the future state of this object. Here the degree of the future state transformation – how far in the future we want the depiction to show – is controlled by a conditional term in the model. In the second task (Sec 3.2) we have two inputs to the model: two frames from the same video showing a state transformation at two points in time. The goal of this model is to generate a third image that continues the trend of the transformation depicted by the first and second input. We call this "two stack" generation.


Fig. 2. Model architectures of three generation tasks: (a) Pairwise generator; (b) Two stack generator; (c) Recurrent generator.

In the third task (Sec 3.3), called "recurrent generation", the input is a single frame and the goal is to recursively generate images that exhibit future degrees of the transformation in a recurrent model framework.

3.1 Pairwise generation

In this task, we input a frame and generate an image showing the future state. We model the task using an autoencoder-style convolutional neural network. The model architecture is shown in Fig. 2(a), where the input and output size are 64x64 with 3 channels and the encoding and decoding parts are symmetric. Both the encoding and decoding parts consist of 4 convolution/deconvolution layers with kernel size 5x5 and a stride of 2, meaning that at each layer the height and width of the output feature map decreases/increases by a factor of 2, with 64, 128, 256, 512 / 512, 256, 128, 64 channels respectively. Each conv/deconv layer, except the last layer, is followed by a batch normalization operation [28] and a ReLU activation function [29]. For the last layer we use Tanh as the activation function. The size of the hidden variable z (center) is 512. We represent the conditional term encoding the degree of elapsed time between the input and output as a 4-dimensional one-hot vector, representing 4 possible degrees. This is connected with a linear layer (512) and concatenated with z to adjust the degree of the future depiction.
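For concreteness, a rough Keras-style TensorFlow sketch of this generator follows. It is a re-reading of the description above rather than the original network (which targeted an early TensorFlow release); in particular, the handling of the spatial bottleneck and the placement of the decoder channel counts are simplifications.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_pairwise_generator():
    frame = tf.keras.Input(shape=(64, 64, 3))   # object at time t
    degree = tf.keras.Input(shape=(4,))         # one-hot degree of elapsed time

    # Encoder: 4 stride-2 5x5 convolutions, each halving the spatial size.
    x = frame
    for ch in [64, 128, 256, 512]:
        x = layers.Conv2D(ch, 5, strides=2, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    z = layers.Dense(512)(layers.Flatten()(x))  # hidden variable z (512-d)

    # Conditional term: one-hot degree -> linear layer (512), concatenated with z.
    c = layers.Dense(512)(degree)
    h = layers.Concatenate()([z, c])

    # Decoder: mirror of the encoder with stride-2 5x5 deconvolutions.
    # Exact channel counts of the upper decoder layers are simplified here.
    x = layers.Dense(4 * 4 * 512)(h)
    x = layers.Reshape((4, 4, 512))(x)
    for ch in [256, 128, 64]:
        x = layers.Conv2DTranspose(ch, 5, strides=2, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding='same',
                                 activation='tanh')(x)
    return tf.keras.Model([frame, degree], out)
```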


Below we describe experiments with different loss functions and training approaches for this network.

p_mse: As a baseline, we use the pixel-wise mean square error (p_MSE) between the prediction output and the ground truth as the loss function. Previous image generation works [1, 2, 30, 3, 27] postulate that the pixel-wise l2 criterion is one cause of blurriness in generated outputs. We also observe this effect in the outputs produced by this baseline model.

p_mse+adv: Following the success of recent image generation works [24, 26], which make use of generative adversarial networks (GANs) to generate better quality images, we explore the use of this network architecture for temporal generation. For their original purpose, these networks generate images from randomly sampled noise. Here, we use a GAN to generate future images from an image of an object at the current time. Specifically, we use the baseline autoencoder previously described and incorporate an adversarial loss in combination with the pixel-wise MSE. During training this means that in addition to the auto-encoder, called the generator (G), we also train a binary CNN classifier, called the discriminator (D). The discriminator takes as input the output of the generator and is trained to classify images as real or fake (i.e. generated). These two networks are adversaries because G is trying to generate images that can fool D into thinking they are real, and D is trying to determine if the images generated by G are real. Adding D helps to train a better generator G that produces more realistic future state images. The architecture of D is the same as the encoder of G, except that the last output is a scalar connected with a sigmoid function. D also incorporates the conditional term by connecting a one-hot vector with a linear layer, reshaping it to the same size as the input image, and concatenating it with the input in the third dimension. The output of D is a probability, which will be large if the input is a real image and small if the input is a generated image. For this framework, the loss is formulated as a combination of the MSE and adversarial loss:

L_G = L_{p_mse} + λ_{adv} · L_{adv}    (1)

where L_{p_mse} is the mean square error loss, x is the input image, c is the conditional term, G(·) is the output of the generation model, and y is the ground truth future image at time = current time + c.

L_{p_mse} = |y − G(x, c)|^2    (2)

And L_{adv} is a binary cross-entropy loss with α = 1 that penalizes the generator if the generated image does not look like a real image. D[·] is the scalar probability output by the discriminator.

L_{adv} = −α log(D[G(x, c), c]) − (1 − α) log(1 − D[G(x, c), c])    (3)

During training, the binary cross-entropy loss is used to train D on both real and generated images. For details of jointly training adversarial structures, please refer to [24].

p_g_mse+adv: Inspired by [2], which introduces a gradient-based l1 or l2 loss to sharpen generated images, we also evaluate a loss function that is a combination of pixel-wise MSE, gradient-based MSE, and adversarial loss:

L_G = L_{p_mse} + L_{g_mse} + λ_{adv} · L_{adv}    (4)

where L_{g_mse} represents the mean square error in the gradient domain, defined as:

L_{g_mse} = (g_x[y] − g_x[G(x, c)])^2 + (g_y[y] − g_y[G(x, c)])^2    (5)

g_x[·] and g_y[·] are the gradient operations along the x and y axes of images. We could apply different weights to L_{p_mse} and L_{g_mse}, but in this work we simply weight them equally.
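In a modern TensorFlow implementation, the combined generator loss of Eqs. 1-5 could be computed as in the sketch below; this is a rendering of the equations rather than the original code, and the default λ_{adv} shown is the weight reported later in Sec 4.1.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def generator_loss(y_true, y_pred, d_prob_fake, lambda_adv=0.2):
    """Combined loss of Eq. 4: pixel MSE + gradient MSE + weighted adversarial term.

    y_true, y_pred: (batch, H, W, 3) ground-truth and generated future frames.
    d_prob_fake: discriminator probabilities D[G(x, c), c] for the generated frames.
    """
    # Pixel-wise MSE (Eq. 2).
    l_p_mse = tf.reduce_mean(tf.square(y_true - y_pred))

    # Gradient-domain MSE (Eq. 5); tf.image.image_gradients returns (dy, dx).
    dy_t, dx_t = tf.image.image_gradients(y_true)
    dy_p, dx_p = tf.image.image_gradients(y_pred)
    l_g_mse = tf.reduce_mean(tf.square(dx_t - dx_p) + tf.square(dy_t - dy_p))

    # Adversarial term (Eq. 3 with alpha = 1): G wants D to output 1 on its samples.
    l_adv = bce(tf.ones_like(d_prob_fake), d_prob_fake)

    return l_p_mse + l_g_mse + lambda_adv * l_adv
```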


p_g_mse+adv+ft: Since we have limited training data for each transformation, we investigate the use of pre-training for improving generation quality. In this method we use the same loss function as the previous method, but instead of training the adversarial network from scratch, training proceeds in two stages. The first stage is a reconstruction stage where we train the generation model on random static images for a reconstruction task. Here the goal is for the output image to match the input image as well as possible, even though it passes through a bottleneck during generation. In the second stage, the fine-tuning stage, we fine-tune the network for the temporal generation task using our timelapse data. By first pre-training for the reconstruction task, we expect the network to obtain a good initialization, capturing a representation that will be useful for kick-starting the temporal generation fine-tuning.

3.2 Two stack generation

In this scenario, we want to generate an image that shows the future state of an object given two input images showing the object in two stages of its transformation. The output should continue the transformation pattern demonstrated in the input images, i.e. if the input images depict the object at time t and t+m, then the output should depict the object at time t+2m. We design the generation model using two stacks in the encoding part of the model as shown in Fig. 2(b). The structures of the two stacks are the same and are also identical to the encoding part of the pairwise generation model. The hidden variables z1 and z2 are both 512 dimensions, and are concatenated together and fed into the decoding part, which is also the same as in the previous pairwise generation model. The two stacks are sequential, trained independently without shared weights.

Given the blurry results of the baseline for pairwise generation, here we only use three methods: p_mse+adv, p_g_mse+adv, and p_g_mse+adv+ft. The structure of the discriminator is the same. For the fine-tuning method, during reconstruction training, we make the two inputs the same static image. The optimization procedures are the same as for the pairwise generation task, but we do not have a conditional term here (since the time for the future generation is implicitly specified by the input images).
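As a sketch of how the two stacks feed a single decoder, the snippet below wires two pairwise-style encoders and a shared decoder together. Here encoder_a, encoder_b and decoder are assumed helper models (built as in the pairwise sketch), and the way the concatenated 1024-d code enters the decoder is a simplification.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_two_stack_generator(encoder_a, encoder_b, decoder):
    """Assemble the two-stack generator from separately built encoder/decoder pieces.

    encoder_a, encoder_b: Keras models mapping a 64x64x3 frame to a 512-d vector
    (same structure as the pairwise encoder, trained without shared weights).
    decoder: Keras model mapping a 1024-d vector to a 64x64x3 image.
    """
    frame_t = tf.keras.Input(shape=(64, 64, 3))    # object at time t
    frame_tm = tf.keras.Input(shape=(64, 64, 3))   # object at time t + m
    z1 = encoder_a(frame_t)
    z2 = encoder_b(frame_tm)
    h = layers.Concatenate()([z1, z2])             # 1024-d joint code, no conditional term
    out = decoder(h)                               # predicted object at time t + 2m
    return tf.keras.Model([frame_t, frame_tm], out)
```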

3.3 Recurrent generation

In this scenario, we would like to recursively generate future images of an object given only a single image of its current temporal state. In particular, we use a recurrent neural network framework for generation where each time step generates an image of the object at a fixed degree interval in the future. This model structure is shown in Fig. 2(c). After the hidden variable z, we add an LSTM [31] layer. For each time step, the LSTM layer takes both z and the output from the previous time step as inputs and sends a 512-dimensional vector to the decoder. The structure of the encoder and decoder are the same as in the previous scenarios, and the decoding portions for each time step share the same weights.
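A minimal sketch of the recurrent loop is given below. The encoder and decoder stand in for the pairwise components described above, and feeding the LSTM's own hidden output back as the "previous step" signal (rather than re-encoding the generated image) is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def recurrent_generate(encoder, decoder, frame, num_steps=4):
    """Sketch of the recurrent generator: one LSTM step per future image.

    encoder: maps a 64x64x3 frame to a 512-d code z (as in the pairwise model).
    decoder: maps a 512-d vector to a 64x64x3 image; shared across time steps.
    """
    cell = layers.LSTMCell(512)
    z = encoder(frame)                              # 512-d code of the input frame
    batch = tf.shape(z)[0]
    state = [tf.zeros([batch, 512]), tf.zeros([batch, 512])]  # initial LSTM state
    prev = tf.zeros_like(z)                         # assumed zero vector for step 0
    outputs = []
    for _ in range(num_steps):
        # Each step conditions on z and on (a code of) the previous step's output.
        h, state = cell(tf.concat([z, prev], axis=-1), state)
        outputs.append(decoder(h))                  # image at the next degree interval
        prev = h                                    # feed the step's hidden code forward
    return outputs
```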


We evaluate three loss functions in this network: p_mse+adv, p_g_mse+adv and p_g_mse+adv+ft. The structure of the discriminator is again the same, without the conditional term (as in the two-stack model). For fine-tuning, during reconstruction training, we train the model to recurrently output the same static image as the input at each time step.

4 Experiments

In this section, we discuss the training process and parameter settings for all experiments (Sec 4.1). Then, we describe dataset pre-processing and augmentation (Sec 4.2). Finally, we discuss quantitative and qualitative analysis of results for: pairwise generation (Sec 4.3), two-stack generation (Sec 4.4), and recurrent generation (Sec 4.5).

4.1 Training & Parameter settings

Unless otherwise specified, the following training and parameter settings are applied to all models. During training, we apply Adam stochastic optimization [32] with learning rate 0.0002 and minibatches of size 64. The models are implemented using the TensorFlow deep learning toolbox [33]. In the loss functions where we combine the mean square error loss with the adversarial loss (as in Equation (1)), we set the weight of the adversarial loss to λ_adv = 0.2 for all experiments.
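Putting these settings together, a joint generator/discriminator update could look like the sketch below. The alternating-update schedule follows common GAN practice rather than anything detailed here (see [24]), and only the Eq. 1 form of the generator loss is shown.

```python
import tensorflow as tf

# Hyperparameters from Sec. 4.1 (minibatches of 64 are drawn outside this function).
g_opt = tf.keras.optimizers.Adam(learning_rate=0.0002)
d_opt = tf.keras.optimizers.Adam(learning_rate=0.0002)
bce = tf.keras.losses.BinaryCrossentropy()

def train_step(generator, discriminator, x, c, y, lambda_adv=0.2):
    """One joint update of G and D on a minibatch.

    x: input frames, c: one-hot degree conditional, y: ground-truth future frames.
    """
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator([x, c], training=True)
        p_real = discriminator([y, c], training=True)
        p_fake = discriminator([fake, c], training=True)
        # Generator: pixel MSE plus weighted adversarial term (Eq. 1).
        g_loss = (tf.reduce_mean(tf.square(y - fake))
                  + lambda_adv * bce(tf.ones_like(p_fake), p_fake))
        # Discriminator: classify real vs. generated images.
        d_loss = (bce(tf.ones_like(p_real), p_real)
                  + bce(tf.zeros_like(p_fake), p_fake))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss
```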

4.2 Dataset preprocessing & augmentation

Timelapse dataset: Some of the collected videos depict more than one object, or the object is not located in the center of the frames. In order to help the model concentrate on learning the transformation itself rather than object localization, for each video in the dataset we obtain a cropped version of the frames centered on the main object. We randomly split the videos into training and testing sets in a ratio of 0.85 : 0.15. Then, we sample frame pairs (for pairwise generation) or groups of frames (for two-stack and recurrent generation) from the training and testing videos. Frames are resized to 64x64 for generation. To prevent overfitting, we also perform data augmentation on training pairs or groups by incorporating frame crops and left-right flipping.

Reconstruction dataset: This dataset contains static images used to pre-train our models on reconstruction tasks. Initially, we tried training only on objects depicted in the timelapse videos and observed a performance improvement. However, collecting images of specific object categories is a tedious task at large scale. Therefore, we also tried pre-training on random images (scene images or images of random objects) and found that the results were competitive. This implies that the content of the images is not as important as encouraging the networks to learn how to reconstruct arbitrary input images well. We randomly download 50101 images from ImageNet [34] as our reconstruction dataset. The advantage of this strategy is that we are able to use the same group of images for every transformation model and task.
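A minimal sketch of the pair-consistent augmentation described above follows; the crop size and the assumption that raw frames are at least 96x96 are illustrative choices, not values specified here.

```python
import tensorflow as tf

def augment_pair(input_frame, target_frame):
    """Apply the same random crop and left-right flip to an input/target training
    pair, then resize both to 64x64. Crop size (96x96) is an assumed value."""
    # Stack so that the identical random crop/flip is applied to both frames.
    pair = tf.stack([input_frame, target_frame])             # (2, H, W, 3)
    pair = tf.image.random_crop(pair, size=[2, 96, 96, 3])   # assumes frames >= 96x96
    if tf.random.uniform([]) < 0.5:
        pair = tf.image.flip_left_right(pair)
    pair = tf.image.resize(pair, [64, 64])
    return pair[0], pair[1]
```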


Fig. 3. Pairwise generation results for: Blooming (rows 1-3), Melting (rows 4-5), Baking (rows 6-7) and Rotting (rows 8-9). Input (a), p_mse (b), p_mse+adv (c), p_g_mse+adv (d), p_g_mse+adv+ft (e), Ground truth frames (f). Black frames in the ground truth indicate that the video did not depict the transformation that far in the future.

                    Pairwise                      Two stack                     Recurrent
                    PSNR     SSIM    MSE          PSNR     SSIM    MSE          PSNR     SSIM    MSE
p_mse+adv           17.0409  0.5576  0.0227       17.7425  0.5970  0.0185       17.2758  0.5747  0.0211
p_g_mse+adv         17.0660  0.5720  0.0224       17.9157  0.6122  0.0177       17.2951  0.5749  0.0214
p_g_mse+adv+ft      17.4784  0.6036  0.0207       18.6812  0.6566  0.0153       18.3357  0.6283  0.0166

Table 2. Quantitative evaluation of the pairwise generation, two stack generation, and recurrent generation tasks. For PSNR and SSIM larger is better, while for MSE lower is better.

4.3 Pairwise generation

In the pairwise generation task, we input an image of an object and the model outputs a depiction of the future state of the object, where the degree of transformation is controlled by a conditional term. The conditional term is a 4-dimensional one-hot vector which indicates whether the predicted output should be 0, 0.25, 0.5 or 0.75 degrees in the future from the input frame. We sample frame pairs from timelapse videos based on annotated degree value intervals. A 0 degree interval means that the input and output training images are identical. We consider 0 degree pairs as auxiliary pairs, useful for two reasons: 1) They help augment the training data with video frames (which display different properties from still images), and 2) The prediction quality of image reconstruction is highly correlated with the quality of future generation. Pairs from 0 degree transformations can easily be evaluated in terms of reconstruction quality, since the prediction should ideally exactly match the ground truth image.


Fig. 4. Two stack generation results for Blooming (a), Melting (b), Baking (c) and Rotting (d). For each we show the two input frames (col 1-2), and results for: p_mse+adv (col 3), p_g_mse+adv (col 4), p_g_mse+adv+ft (col 5) and ground truth (col 6).

Predictions for future degree transformations are somewhat more subjective. For example, from a bud the resulting generated bloom may look like a perfectly valid flower, but may not match the exact flower shape that this particular bud grew into (an example is shown in Fig. 3 row 1).

We train pairwise generation models separately for each of the 4 transformation categories using the p_mse, p_mse+adv and p_g_mse+adv methods, trained for 12500 iterations on timelapse data from scratch. For the p_g_mse+adv+ft method, the models are first trained on the reconstruction dataset for 5000 iterations with the conditional term fixed as '0 degree' and then fine-tuned on timelapse data for another 5500 iterations. We observe that the fine-tuning training converges faster than training from scratch (example results with 3 different degree conditional terms are shown in Fig. 3). We observe that the baseline suffers from a high degree of blurriness. Incorporating other terms into the loss function improves results, as does pre-training with fine-tuning. Table 2 (cols 2-4) shows evaluations of pairwise-generation reconstruction. For evaluation, we compute the Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM) and Mean Square Error (MSE) values between the output and ground truth. We can see that incorporating gradient loss slightly improves results while pre-training further improves performance. This agrees with the qualitative visual results.
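For reference, the three metrics reported in Table 2 can be computed with standard TensorFlow image ops, assuming outputs and ground truth are scaled to [0, 1]:

```python
import tensorflow as tf

def evaluate_batch(y_true, y_pred):
    """Compute PSNR, SSIM and MSE for a batch of generated images against the
    ground truth, with pixel values assumed to lie in [0, 1]."""
    psnr = tf.reduce_mean(tf.image.psnr(y_true, y_pred, max_val=1.0))
    ssim = tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    return float(psnr), float(ssim), float(mse)
```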

4.4 Two stack generation

For two stack generation, the model generates an image showing the future state of an object given two input depictions at different stages of transformation. As training data, we sample frame triples from videos with neighboring frame degree intervals: 0, 0.1, 0.2, 0.3, 0.4, and 0.5.


Fig. 5. Two stack generation with varying time between input images. For each example we show: the two input images (col 1-2), p_mse+adv (col 3), p_g_mse+adv (col 4), p_g_mse+adv+ft (col 5) and ground truth (col 6). The models are able to vary their outputs depending on elapsed time between inputs.

We train two stack generation models for each of the 4 transformation categories, trained for 12500 iterations for the p_mse+adv and p_g_mse+adv methods. For the p_g_mse+adv+ft method, the models are first pre-trained on the reconstruction dataset for 4000 iterations and then fine-tuned on timelapse data for 6500 iterations (Fig. 4 shows example prediction results). We observe that p_g_mse+adv+ft generates improved results in terms of both image quality and future state prediction accuracy. We further evaluate the reconstruction accuracy of these models in Table 2 (cols 5-7). Furthermore, in this task we expect that the models can not only predict the future state, but also learn to generate the correct time interval based on the input images. Fig. 5 shows input images with different amounts of elapsed time. We can see that the models are able to vary how far in the future to generate based on the input image interval.

4.5 Recurrent generation

For our recurrent generation task, we want to train a model to generate multiple future states of an object given a single input frame. Due to limited data we recursively generate 4 time steps. During training, we sample groups of frames from timelapse videos. Each group contains 5 frames, the first being the input, and the rest having 0, 0.1, 0.2, 0.3 degree intervals from the input. As in the previous tasks, the reconstruction outputs are used for quantitative evaluation.

We train the models separately for the 4 transformation categories. The models for the p_mse+adv and p_g_mse+adv methods are trained for 8500 iterations on timelapse data from scratch. For the p_g_mse+adv+ft method, models are pre-trained on the reconstruction dataset for 5000 iterations, then fine-tuned for another 5500 iterations. Fig. 6 shows prediction results (outputs of the 2nd, 3rd and 4th time steps) of the three methods. Table 2 (cols 8-10) shows the reconstruction evaluation.

5 Additional experiments

Human Evaluations: As previously described, object future state prediction is sometimes not well defined. Many different possible futures may be considered reasonable to a human observer. Therefore, we design human experiments to judge the quality of our generated future states.


Fig. 6. Recurrent generation results for: Blooming (rows 1-2), Melting (rows 3-4), Baking (rows 5-6) and Rotting (rows 7-8). Input (a), p_mse+adv (b), p_g_mse+adv (c), p_g_mse+adv+ft (d), Ground truth frames (e).

For each transformation category and generation task, we randomly pick 500 test cases. Human subjects are shown one (or two, for two stack generation) input image and future images generated by each method (randomly sorted to avoid biases in human selection). Subjects are asked to choose the image that most reasonably shows the future object state.

Results are shown in Table 3, where numbers indicate the fraction of cases where humans selected results from each method as the best future prediction. For pairwise generation (cols 2-5), the p_g_mse+adv+ft method achieves the most human preferences, while the baseline performs worst by a large margin. For both two stack generation (cols 6-8) and recurrent generation (cols 9-11), p_mse+adv and p_g_mse+adv are competitive, but again making use of pre-training plus fine-tuning obtains the largest number of human preferences.

Image retrieval: We also add a simple retrieval experiment on pairwise generation results using pixelwise similarity. We count retrievals within a reasonable distance (20% of the video length) of the ground truth as correct, achieving average top-1/5 accuracies of p_mse+adv: 0.68/0.94; p_g_mse+adv: 0.72/0.95; and p_g_mse+adv+ft: 0.90/0.98.
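A sketch of this retrieval check is given below, under one reading of the setup (nearest frames by pixel-wise L2 distance, with correctness defined as lying within 20% of the video length of the ground-truth frame); the exact distance measure and indexing conventions are assumptions.

```python
import numpy as np

def retrieval_correct(generated, video_frames, gt_index, top_k=5):
    """Return True if any of the top-k retrieved frames lies within 20% of the
    video length of the ground-truth frame index (details assumed)."""
    # Pixel-wise L2 distance between the generated image and every frame of the video.
    dists = np.array([np.mean((generated - f) ** 2) for f in video_frames])
    ranked = np.argsort(dists)[:top_k]
    tolerance = 0.2 * len(video_frames)
    return bool(np.any(np.abs(ranked - gt_index) <= tolerance))
```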

Visualizations: Object state transformations often lead to physical changes in the shape of an object. To further understand what our models have learned, we provide some simple visualizations of motion features computed on generated images. Visualizations are computed on results of the p_g_mse+adv+ft recurrent model, since we want to show the temporal trends of the learned transformations.


              Pairwise                          Two stack                  Recurrent
              BL      ADV     Grad    Ft        ADV     Grad    Ft         ADV     Grad    Ft
Blooming      0.1320  0.2300  0.2880  0.3500    0.3080  0.3240  0.3680     0.3320  0.2860  0.3820
Melting       0.1680  0.2520  0.2760  0.3040    0.3400  0.3120  0.3480     0.3180  0.3220  0.3600
Baking        0.1620  0.2600  0.2840  0.2940    0.3120  0.3200  0.3680     0.2780  0.3540  0.3680
Rotting       0.1340  0.2020  0.2580  0.4060    0.3040  0.2640  0.4320     0.2940  0.2900  0.4160
Average       0.1490  0.2360  0.2765  0.3385    0.3160  0.3050  0.3790     0.3055  0.3130  0.3815

Table 3. Human evaluation results: BL stands for the p_mse method, ADV for p_mse+adv, Grad for p_g_mse+adv, and Ft for p_g_mse+adv+ft.

Fig. 7. Visualization results of learned transformations: x axis flows of blooming (a), y axis flows of melting (b), y axis flows of baking (c) and y axis flows of rotting (d).

For each testing case, we compute 3 optical flow maps in the x and y directions between the input image and the second, third, and fourth generated images. We cluster each using k-means (k=4). Then, for each cluster, we average the optical flow maps in the x and y directions.
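The sketch below shows one way to produce such cluster-averaged flow maps; the choice of Farneback optical flow and clustering the test cases by their flattened flow fields are assumptions, since the exact procedure is not specified.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def flow_visualization(inputs, generated, k=4):
    """For each test case, compute dense optical flow from the input image to one
    generated image, cluster the test cases by their flow fields with k-means (k=4),
    and average the x- and y-direction flow maps within each cluster."""
    flows = []
    for inp, gen in zip(inputs, generated):
        inp_gray = cv2.cvtColor(inp, cv2.COLOR_BGR2GRAY)
        gen_gray = cv2.cvtColor(gen, cv2.COLOR_BGR2GRAY)
        # Farneback dense flow is a stand-in for an unspecified flow method.
        flow = cv2.calcOpticalFlowFarneback(inp_gray, gen_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)                           # (H, W, 2): x and y components
    flows = np.stack(flows)

    labels = KMeans(n_clusters=k, n_init=10).fit_predict(flows.reshape(len(flows), -1))
    # Average x-direction and y-direction flow maps per cluster.
    mean_flow_x = [flows[labels == c][..., 0].mean(axis=0) for c in range(k)]
    mean_flow_y = [flows[labels == c][..., 1].mean(axis=0) for c in range(k)]
    return mean_flow_x, mean_flow_y
```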

Fig. 7 shows the flow visualization: (a) is the x axis flow for the blooming transformation. From the visualization we observe the trend of the object growing spatially. (b) shows the y axis flows for the melting transformation, showing the object shrinking in the y direction. (c) shows baking, consistent with the object inflating up and down. For rotting (d), we observe that the upper part of the object inflates with mold or shrinks due to dehydration.

6 Conclusions

In this paper, we have collected a new dataset of timelapse videos depicting temporal object transformations. Using this dataset, we have trained effective methods to generate one or multiple future object states. We evaluate each prediction task under a number of different loss functions and show improvements using adversarial networks and pre-training. Finally, we provide human evaluations and visualizations of the learned models. Future work includes applying our methods to additional single-object transformations and to more complex transformations involving multiple objects.


References

1. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)
2. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. CoRR (2015)
3. Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video (language) modeling: a baseline for generative models of natural videos. CoRR (2014)
4. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: CVPR (2009)
5. Parikh, D., Grauman, K.: Relative attributes. IJCV (2011)
6. Patterson, G., Hays, J.: SUN attribute database: Discovering, annotating, and recognizing scene attributes. In: CVPR (2012)
7. Isola, P., Lim, J.J., Adelson, E.H.: Discovering states and transformations in image collections. In: CVPR (2015)
8. Shih, Y., Paris, S., Durand, F., Freeman, W.T.: Data-driven hallucination of different times of day from a single outdoor photo. ACM Trans. Graph. (2013)
9. Martin-Brualla, R., Gallup, D., Seitz, S.M.: Time-lapse mining from internet photos. ACM Trans. Graph. (2015)
10. Martin-Brualla, R., Gallup, D., Seitz, S.M.: 3D time-lapse reconstruction from internet photos. In: ICCV (2015)
11. Walker, J., Gupta, A., Hebert, M.: Patch to the future: Unsupervised visual prediction. In: CVPR (2014)
12. Kitani, K.M., Ziebart, B.D., Bagnell, J.A.D., Hebert, M.: Activity forecasting. In: ECCV (2012)
13. Yuen, J., Torralba, A.: A data-driven approach for event prediction. In: ECCV (2010)
14. Zhou, Y., Berg, T.L.: Temporal perception and prediction in ego-centric video. In: ICCV (2015)
15. Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. CoRR (2015)
16. Hinton, G.E., Sejnowski, T.J.: Learning and relearning in Boltzmann machines. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (1986)
17. Smolensky, P.: Information processing in dynamical systems: Foundations of harmony theory. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (1986)
18. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: ICML (2009)
19. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science (2006)
20. Tang, Y., Salakhutdinov, R.R.: Learning stochastic feedforward neural networks. In: NIPS (2013)
21. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. ArXiv e-prints (2013)
22. Dosovitskiy, A., Springenberg, J.T., Brox, T.: Learning to generate chairs with convolutional neural networks. In: CVPR (2015)
23. Gregor, K., Danihelka, I., Graves, A., Rezende, D., Wierstra, D.: DRAW: A recurrent neural network for image generation. In: ICML (2015)
24. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
25. Denton, E.L., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: NIPS (2015)
26. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR (2015)
27. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: Feature learning by inpainting. In: CVPR (2016)
28. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
29. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML (2010)
30. Ridgeway, K., Snell, J., Roads, B., Zemel, R.S., Mozer, M.C.: Learning to generate images with perceptual similarity metrics. CoRR (2015)
31. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. (1997)
32. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR (2014)
33. TensorFlow. http://tensorflow.org (2015)
34. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)