

Fusing Music and Video Modalities Using Multi-timescale Shared Representations

Bing Xu, Department of Information Engineering, The Chinese University of Hong Kong, [email protected]
Xiaogang Wang, Department of Electronic Engineering, The Chinese University of Hong Kong, [email protected]
Xiaoou Tang, Department of Information Engineering, The Chinese University of Hong Kong, [email protected]

ABSTRACT
We propose a deep learning architecture to solve the problem of multimodal fusion of multi-timescale temporal data, using music and video parts extracted from Music Videos (MVs) in particular. We capture the correlations between music and video at multiple levels by learning shared feature representations with Deep Belief Networks (DBNs). The shared representations combine information from multiple modalities for decision-making tasks, and are used to evaluate matching degrees between modalities and to retrieve matched modalities using single or multiple modalities as input. Moreover, we propose a novel deep architecture to handle temporal data at multiple timescales. When processing long sequences of varying length, we extract hierarchical shared representations by concatenating deep representations at different levels, and perform decision fusion with a feed-forward neural network that takes as input the predictions of local and global classifiers trained with shared representations at each level. The effectiveness of our method is demonstrated through MV classification and retrieval.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval models

Keywords
deep learning; multimodal fusion; music video matching

1. INTRODUCTION
Music and videos, as two popular types of media, elicit different human perceptions, with music perceived through hearing and video through vision. However, psychology and cognition studies have shown that the information processing procedures of audio and visual signals in human brains are closely related and integrated [2].


Figure 1: Deep learning multi-timescale shared representations. (a) Raw data of an MV segment. (b) 6-sec-long base-1 segments. Deep representations for music and video are extracted separately. Shared representations are extracted by concatenating the deep representations and feeding them to another layer. (c) Concatenation of 2 base-1 segments gives 1 base-2 segment. (d) Concatenation of 4 base-1 segments gives 1 base-4 segment. (e) To make a prediction on a 24-sec-long time interval, we fuse predictions from 4 adjacent base-1 predictors, 2 adjacent base-2 predictors, and 1 base-4 predictor.

In music videos (MVs), music and video appear in parallel and complement each other: an MV expresses a song through stories and rhythmic dances that complement the music. In this work, we explore the correlations between the musical and visual parts of MVs, and use deep learning to capture such correlations with shared representations extracted across the two modalities.

One commonly recognized match in MVs is between music semantics and video content. For example, sad music is paired with sad faces or weeping girls in videos, and exciting party music is paired with dancing scenes and flashing lights. There is existing work on semantically matching music and images by content [10]. The authors built a database with a large number of music-image pairs extracted from MVs, and used Canonical Correlation Analysis (CCA) to capture the relationship between music and image features. However, they did not consider any temporal information of music or videos.



Since music and images have complex and distinct structures, they simplified the problem by grouping music-image pairs into five clusters and projecting them to a music-attribute space.

In our work, deep learning is used to directly model relationships across modalities instead of simplifying the problem by clustering data pairs as in [10]. We also treat music and videos as temporal data and characterize their dynamic temporal and spatial features. In multimodal data analysis, there have been advances using deep learning approaches [9] [8] [11] [6]. Various kinds of modalities have already been covered in this literature, including text, audio, music, and images. However, none of the existing work focuses on fusing temporal multimodal data. Temporal data are data that represent states over time, such as music and video. Multimodal problems on temporal data usually concern alignment or synchronization among modalities along the time axis, and need to deal with the multi-timescale problem, i.e., data sequences may have varying lengths. Our work provides novel solutions to these issues, demonstrating successful fusion of music and video at multiple levels.

Our work makes two key contributions. First, we propose a deep learning framework that takes well-crafted inter-frame and intra-frame music and video features as input and fuses them at both the semantic and the temporal level. This ensures that the model captures semantic matches and handles the data alignment problem. Second, we propose a novel multi-timescale temporal data fusion approach to deal with variable time lengths, such that our model can fuse modalities both locally and globally along the time axis. The effectiveness of the proposed multimodal deep learning fusion approach is demonstrated through two experiments, MV retrieval and classification.

2. DATASET AND FEATURE DESIGN

2.1 Dataset preparation
We obtained 2,000 top-viewed MVs from YouTube and acquired 40,000 music-video pairs from them. Most of these MVs were published between 2010 and 2014. The MVs are of diversified styles and were performed by more than 600 singers from different countries.

To speed up feature extraction, the video part of every MV is down-sampled to a frame rate of 12 fps and its resolution is scaled to 320 × 240 pixels. The audio is downmixed to one channel with a sample rate of 44 kHz.
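A minimal preprocessing sketch along these lines (not the authors' pipeline), assuming the MVs are stored as local MP4 files, that ffmpeg is available, and that the stated 44 kHz refers to the common 44.1 kHz rate; file names and the output layout are hypothetical:

```python
import subprocess
from pathlib import Path

def preprocess_mv(src: Path, dst_dir: Path) -> Path:
    """Down-sample an MV to 12 fps, 320x240 video and mono 44.1 kHz audio."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", str(src),
            "-r", "12",              # output frame rate: 12 fps
            "-vf", "scale=320:240",  # rescale video to 320x240
            "-ac", "1",              # downmix audio to one channel
            "-ar", "44100",          # audio sample rate: 44.1 kHz
            str(dst),
        ],
        check=True,
    )
    return dst

# Example (hypothetical paths):
# preprocess_mv(Path("raw_mvs/example.mp4"), Path("processed_mvs"))
```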

2.2 Video and music features
We cut music and videos in parallel into small segments of 6 seconds, without overlapping or skipping, and extract both local and global features for each segment.

For videos, we extract image features within each frame as intra-frame features. They include Histograms of Oriented Gradients (HOG), color histograms, and Local Binary Patterns (LBP). We also design inter-frame features that capture temporal variation across frames, including Histograms of Oriented Optical Flow and the distances of HOG, color histograms, and LBP between adjacent frames. We compute statistics of each feature set over all the frames in a segment, namely the mean and standard deviation, to capture global characteristics of the entire segment. The final feature dimension for a video segment is 7443.
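As an illustration of this intra- and inter-frame pooling (a sketch, not the authors' feature code), the following Python snippet uses scikit-image; the bin counts and HOG cell sizes are assumptions, and the optical-flow histograms are omitted for brevity:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog, local_binary_pattern

def frame_descriptor(frame_rgb):
    """Intra-frame features for one 320x240 RGB (uint8) frame:
    HOG + per-channel color histogram + uniform-LBP histogram."""
    gray = (rgb2gray(frame_rgb) * 255).astype(np.uint8)
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(32, 32),
                  cells_per_block=(1, 1))
    color_hist = np.concatenate([
        np.histogram(frame_rgb[..., c], bins=16, range=(0, 255), density=True)[0]
        for c in range(3)
    ])
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist = np.histogram(lbp, bins=np.arange(11), density=True)[0]
    return np.concatenate([hog_vec, color_hist, lbp_hist])

def video_segment_features(frames):
    """Aggregate intra- and inter-frame descriptors over one 6-second segment
    (72 frames at 12 fps) into a fixed-length vector via mean/std pooling."""
    intra = np.stack([frame_descriptor(f) for f in frames])
    inter = np.abs(np.diff(intra, axis=0))  # distances between adjacent frames
    return np.concatenate([intra.mean(0), intra.std(0),
                           inter.mean(0), inter.std(0)])
```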

We use a rich set of mid-level musical features [5], including tempo, attack time, attack slope, spectral characteristics (such as brightness, spread, and roughness), timbre features (such as zero crossings and spectral flux), and tonal features (such as the chromagram and key clarity). A pooling step is performed on the raw feature vectors with a sliding window along the time axis to smooth the signal: the mean, median, maximum, and minimum within each window replace the raw features. As for videos, we compute statistics of each type of musical feature over the entire segment, including the mean, standard deviation, slope, period frequency, period amplitude, and period entropy. The final feature dimension for an audio segment is 1052.
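The mid-level musical features come from the MATLAB MIRtoolbox [5]; a rough Python approximation of the same kind of segment-level pooling, using librosa as a stand-in (an assumption, not the authors' toolchain) and omitting the period statistics, might look like this:

```python
import numpy as np
import librosa

def _pool(x):
    """Mean/median/max/min pooling of a frame-wise feature track."""
    x = np.atleast_2d(x)
    return np.concatenate([x.mean(1), np.median(x, 1), x.max(1), x.min(1)])

def music_segment_features(wav_path, sr=44100):
    """Rough librosa analogue of mid-level musical features: tempo,
    onset (attack) strength, spectral shape, zero-crossing rate, and chroma."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    tracks = [
        librosa.onset.onset_strength(y=y, sr=sr),        # attack proxy
        librosa.feature.spectral_centroid(y=y, sr=sr),   # brightness
        librosa.feature.spectral_bandwidth(y=y, sr=sr),  # spread
        librosa.feature.zero_crossing_rate(y),           # timbre
        librosa.feature.spectral_flatness(y=y),          # roughness proxy
        librosa.feature.chroma_stft(y=y, sr=sr),         # tonal content
    ]
    pooled = np.concatenate([_pool(t) for t in tracks])
    return np.append(pooled, np.ravel(tempo))            # append the tempo estimate
```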

3. MULTIMODAL FUSION

3.1 Deep model on single modality
Given single-modality data $x$ with its initial feature representation $v_x$, we build a multilayer generative deep belief network [4]. A Gaussian-Bernoulli Restricted Boltzmann Machine (RBM) [3] with one layer of latent variables $h^1_x$ is trained using the real-valued $v_x$ as input. We then treat the activations of $h^1_x$ as input to train another RBM on top of it, obtaining latent variables $h^2_x$. More RBM layers can be trained in this manner to make the model deeper, and the whole network can be fine-tuned after the pre-training steps. This configuration yields a Deep Belief Network (DBN) that takes $v_x$ as input and outputs the deep representation $h^n_x$, where $n$ is the number of hidden layers. Deep representations are less modality-specific than raw data representations, and modeling correlations among deep representations of different modalities is much easier [9] [8]. In our experiments we use two layers of stacked RBMs, and the input $v_x$ is normalized to zero mean and unit standard deviation.

3.2 Shared representation for multi-modality
Suppose there are $m$ modalities for data fusion. After obtaining the deep representations $h^{n_1}_{x_1}, h^{n_2}_{x_2}, \ldots, h^{n_m}_{x_m}$ for the modalities $x_1, x_2, \ldots, x_m$ respectively, where $n_i$ is the depth of the DBN for modality $x_i$, all the deep representations are concatenated into a single vector. Using the merged representation as input, we build another RBM on top of it, whose hidden layer $h$ is our final shared representation. This is equivalent to learning a joint distribution over all the modalities, and the representations of the different modalities are fused into one. A graphical illustration of this architecture is shown in Figure 2.
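A minimal sketch of this fusion step, using scikit-learn's BernoulliRBM as a stand-in for the shared RBM (the deep representations are sigmoid activations in [0, 1]); the size of the shared layer is an assumption:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

def shared_representation(h_video, h_music, n_shared=512, seed=0):
    """Fuse per-modality deep representations by concatenating them and
    training one more RBM on top; its hidden activations serve as the
    shared representation h."""
    merged = np.hstack([h_video, h_music])   # concatenate the modalities
    shared_rbm = BernoulliRBM(n_components=n_shared, learning_rate=0.01,
                              n_iter=30, random_state=seed)
    shared_rbm.fit(merged)
    h = shared_rbm.transform(merged)         # P(h = 1 | merged input)
    return shared_rbm, h
```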

3.2.1 Multi-modality decision making
The process of extracting shared representations for multiple modalities can be treated as a preprocessing step. Shared representations capture information from all modalities. Using them in decision-making problems such as classification or regression gives better results than directly using single-modality data or the concatenated initial multimodal features. Any existing classifier or regressor can be applied on top of them. We show applications and experimental results in Section 4.
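For instance, a standard classifier can be trained directly on the shared representation; in the hypothetical sketch below, h_shared is the matrix of shared representations and y a binary label per segment (e.g., Korean MV or not, as in Section 4.2):

```python
from sklearn.linear_model import LogisticRegression

def train_decision_model(h_shared, y):
    """Fit an off-the-shelf classifier on shared representations
    h_shared (n_segments x dim) with hypothetical binary labels y."""
    return LogisticRegression(max_iter=1000).fit(h_shared, y)
```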

3.2.2 Multi-modality retrieval
All the layers in our model are undirected, so it is possible to use one or more modalities to reconstruct the others. Given only $v_p$ as a query, we can search for the matched $v_q$ in a database. The hidden representation $h_p$ is first computed from $v_p$.



[Figure 2: General architecture for extracting multi-timescale shared representations. A DBN per modality maps the video and music features to deep representations; an RBM over their concatenation gives the base-1 shared representation $h$, and RBMs over concatenated deep representations of adjacent segments ($t$ to $t+n$) give the higher-level, base-2 through base-(n+1), shared representations.]

In the shared RBM, we perform Gibbs sampling: we use the conditional distribution $P(h \mid h_p, h_q)$ to obtain $h$, and then sample $P(h_q \mid h)$ to obtain $h_q$, iterating these two steps. Once convergence is reached, $h_q$ can be estimated and propagated back down through its DBN to obtain $v_{q(\mathrm{recon})}$. We evaluate the distances between $v_{q(\mathrm{recon})}$ and all the $v_q$ in the database, and return those with the smallest distances as the retrieval results.
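A numpy sketch of this clamped Gibbs sampling is given below; the weight blocks W_p, W_q and biases b_q, c are notation introduced for the sketch (the shared RBM's weights split by modality), and the mean-field value is used as the final estimate of $h_q$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_missing_modality(h_p, W_p, W_q, b_q, c, n_steps=50, seed=0):
    """Estimate the deep representation h_q of a missing modality given the
    observed modality's deep representation h_p, by alternating Gibbs sampling
    in the shared RBM. W_p / W_q connect each modality's deep representation
    to the shared layer; b_q and c are the corresponding biases."""
    rng = np.random.default_rng(seed)
    h_q = rng.random(W_q.shape[0])            # random initialization of the missing part
    for _ in range(n_steps):
        # Sample the shared units given both modality parts: P(h | h_p, h_q).
        p_h = sigmoid(h_p @ W_p + h_q @ W_q + c)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Resample the missing modality part: P(h_q | h).
        h_q = sigmoid(h @ W_q.T + b_q)        # mean-field value as the running estimate
    return h_q
```

The estimated $h_q$ is then propagated down the target modality's DBN to obtain $v_{q(\mathrm{recon})}$, which is compared (e.g., by Euclidean distance) with the candidates in the database.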

3.3 Multi-timescale fusion
Decision-level fusion is a common strategy in multimodal fusion problems. To deal with temporal data over long timescales, the traditional approach is to fuse modalities on local short segments and average the predictions over all short segments in the time interval [1] [7]. However, this approach ignores valuable global information that may provide better matching.

Our multi-timescale fusion strategy works in the following way: we merge adjacent deep representations within each modality to form hierarchical deep representations (Figure 2), and learn shared representations at each level of the hierarchy. It fuses modalities at varying time scales and is much easier to construct than building scale-varying models from the bottom up. This approach has several advantages. First, the same set of initial features on the shortest time interval can be reused at different time scales; therefore, the deep representations for each modality only need to be computed once, and the computation is efficient. Second, shared representations at different levels of the hierarchy capture different ranges of cross-modality correlations, from local to global, which are complementary to one another. Some long-range characteristics can only be captured by shared representations at higher hierarchy levels.

In a long-range matching scenario, one classifier is trained on the shared representation at each hierarchy level. Classifiers at different levels give decisions over different ranges (local to global). A final decision-fusion step uses a simple feed-forward neural network (FFNN) to combine the predictions of the classifiers at multiple levels; a sketch of the hierarchy construction is given below.

An exemplar illustration and the architecture for learning multi-timescale shared representations are shown in Figures 1 and 2.
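A sketch of this hierarchy construction, reusing scikit-learn's BernoulliRBM as the stand-in shared RBM; it assumes the base-1 deep representations are ordered in time and come from the same parent clip, and the shared-layer size is again an assumption:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

def build_level(h_video, h_music, k, n_shared=512, seed=0):
    """Build base-k shared representations: concatenate the deep representations
    of k adjacent base-1 segments within each modality, merge the two
    modalities, and train one shared RBM for this hierarchy level.
    h_video, h_music: (n_base1_segments, dim) arrays ordered in time."""
    n = (len(h_video) // k) * k                          # drop any incomplete tail
    merged = np.hstack([
        h_video[:n].reshape(-1, k * h_video.shape[1]),   # k adjacent video reps
        h_music[:n].reshape(-1, k * h_music.shape[1]),   # k adjacent music reps
    ])
    rbm = BernoulliRBM(n_components=n_shared, learning_rate=0.01,
                       n_iter=30, random_state=seed)
    shared = rbm.fit_transform(merged)                   # base-k shared representation
    return rbm, shared

# Hierarchy used in Section 4.2: base-1, base-2, base-5 and base-10 levels, e.g.
# levels = {k: build_level(h_video, h_music, k) for k in (1, 2, 5, 10)}
```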

Figure 3: Retrieval precisions using sampling shared rep., CCA with deep rep., CCA with initial features, and random pairing

4. EXPERIMENTS

4.1 Multi-modal retrieval
We construct a video database of 2,000 video segments and use 500 audio segments as queries to evaluate retrieval performance. Note that the MVs containing these segments are not used in training the deep models. We also implement a classic CCA method to evaluate similarity in the same way as [10]. Both CCA and our deep models are trained with 38,000 music-video segment pairs. Figure 3 shows the retrieval performance of our method and CCA. A retrieval is counted as successful if the top N retrieved examples contain the true match, and the precisions at top-N ranks are reported. CCA on deep representations of videos and audio outperforms CCA on the initial features, which shows that deep representations are easier to fuse than initial features. Retrieval with the shared representation achieves a large improvement over CCA on either deep representations or initial features.
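As a concrete reading of this metric (not code from the paper), top-N precision over a set of queries can be computed as:

```python
import numpy as np

def top_n_precision(distances, true_idx, n):
    """Fraction of queries whose true match appears among the N nearest
    database items. distances: (n_queries, n_database) matrix of distances
    between reconstructed and database features; true_idx[i] is the index
    of query i's true match."""
    ranked = np.argsort(distances, axis=1)[:, :n]   # N closest items per query
    hits = [true_idx[i] in ranked[i] for i in range(len(true_idx))]
    return float(np.mean(hits))
```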

4.2 Korean MV classification
Many of YouTube's most-clicked videos are Korean MVs of girl groups and boy bands. Modern Korean MVs are quite distinctive in style: their music has a fast tempo and repeated patterns, and their videos are filled with singing and dancing scenes. In this experiment, we learn a classifier to identify Korean MVs. For comparison, we train an SVM classifier with the concatenated initial music and video features as input.



Data input                                  AP
Video initial feature                       0.3123
Video deep rep.                             0.5012
Music initial feature                       0.2656
Music deep rep.                             0.2975
Video and music initial feature (SVM)       0.4542
Shared rep. base-1                          0.5585
Shared rep. base-2 (12 sec)                 0.5841
Shared rep. base-5 (30 sec)                 0.6479
Shared rep. base-10 (60 sec)                0.6025

Table 1: Average precisions of classifiers trained with different data input

We also demonstrate the effect of multi-timescale fusion of temporal data.

Among the 2,000 MVs in our dataset, around 10% are Korean MVs. We cut each MV into segments of 60 seconds, leading to 4,000 segments in total, of which around 400 are Korean MV segments. We randomly select 2,000 MV segments containing 200 Korean MV segments as training data, leaving the rest as testing data.

To build a multi-timescale fusion model, we cut the data into segments of 6 seconds as our base level, denoted base-1 (40,000 segments in total). Deep representations are extracted for all base-1 segments. To form a multi-timescale hierarchy, we concatenate the hidden representations of adjacent base-1 segments belonging to the same original 60-second segment: two adjacent base-1 segments form one base-2 segment (12 seconds long, 20,000 segments in total), 5 adjacent base-1 segments form one base-5 segment (30 seconds long, 8,000 segments in total), and 10 adjacent base-1 segments form one base-10 segment (60 seconds long, i.e., the original segments, 4,000 in total). One shared RBM is trained at each hierarchy level using the base-1, base-2, base-5, and base-10 segments respectively, and one FFNN classifier is then trained on each level's shared representation individually. We evaluate each classifier using the testing data belonging to its concatenation level.

With base-1 segments, we also train FFNN classifiers using unimodal input, namely the video/music initial features and the video/music deep representations individually. In another comparison, an SVM classifier is trained using the concatenated initial music and video features as input. The classifiers' performances are listed in Table 1. Classifiers using shared representations perform much better than those using unimodal representations. Both shared rep. base-1 and the SVM use multimodal input, yet the former outperforms the SVM by 10%, showing that the shared representations obtained by deep learning are more effective than the initial features. Note that the APs of shared rep. base-2, base-5, and base-10 are provided for reference only; we do not compare them with the others since they use a smaller number of (longer) data segments during training and testing.

To deal with the long 60-second sequences, we extract predictions from all classifiers and concatenate them into a vector. For each original 60-second segment, its prediction vector includes 10 local predictions from the base-1 classifier, 5 from the base-2 classifier, 2 from the base-5 classifier, and 1 from the base-10 classifier. We train an FFNN on these prediction vectors to evaluate our multi-timescale fusion model, as sketched below.
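A sketch of this final decision-fusion step, with scikit-learn's MLPClassifier standing in for the FFNN and its hidden-layer size chosen arbitrarily:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def prediction_vector(p_base1, p_base2, p_base5, p_base10):
    """18-dimensional prediction vector for one 60-second segment:
    10 base-1 scores + 5 base-2 scores + 2 base-5 scores + 1 base-10 score."""
    return np.concatenate([p_base1, p_base2, p_base5, p_base10])

def train_fusion_ffnn(pred_vectors, labels, seed=0):
    """Final decision fusion: a small feed-forward network over the
    concatenated per-level predictions."""
    fusion = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                           random_state=seed)
    fusion.fit(np.asarray(pred_vectors), np.asarray(labels))
    return fusion
```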

Method                              AP
Multi-timescale                     0.7900
Averaging shared rep. base-1        0.6750
Averaging SVM                       0.6130

Table 2: Comparison of multi-timescale representation and averaging of local predictions.

As shown in Table 2, our final classifier (Multi-timescale) gives an average precision of 0.79, outperforming the strategy of averaging the 10 local predictions of either the shared-representation classifier or the SVM.

5. CONCLUSIONS
We propose deep models to explore the correlations between music and video in MVs. Shared representations for music and video are learned with DBNs. A retrieval experiment demonstrates the power of the shared representations on matched music and videos. We also show that shared representations combine information from music and video and perform better than unimodal representations in the classification task. Finally, a novel method is proposed to handle temporal sequences of varying length by constructing multi-timescale shared representations; it significantly outperforms averaging predictions over local intervals.

6. REFERENCES
[1] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 2010.
[2] R. Cytowic. Synesthesia: A Union of the Senses. The MIT Press, 2002.
[3] G. Hinton. A practical guide to training restricted Boltzmann machines. 2010.
[4] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[5] O. Lartillot and P. Toiviainen. MIR in Matlab (II): A toolbox for musical feature extraction from audio.
[6] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. International Conference on Machine Learning (ICML), 2011.
[7] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. Advances in Neural Information Processing Systems (NIPS), 2013.
[8] N. Srivastava and R. Salakhutdinov. Learning representations for multimodal data with deep belief nets. International Conference on Machine Learning (ICML), 2012.
[9] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems (NIPS), 2012.
[10] X. Wu, Y. Qiao, X. Wang, and X. Tang. Bridging music and image: A preliminary study with multiple ranking CCA learning. Proceedings of ACM Multimedia, 2012.
[11] Z. Yuan, J. Sang, Y. Liu, and C. Xu. Latent feature learning in social media network. Proceedings of ACM Multimedia, 2013.
