ACM Multimedia 2012 Grand Challenge: Music Video Generation
DESCRIPTION
These slides present a novel content-based system that utilizes the perceived emotion of multimedia content as a bridge to connect music and video. Specifically, we propose a novel machine learning framework, called Acousticvisual Emotion Gaussians (AVEG), to jointly learn the tripartite relationship among music, video, and emotion from an emotion-annotated corpus of music videos. For a music piece (or a video sequence), the AVEG model is applied to predict its emotion distribution in a stochastic emotion space from the corresponding low-level acoustic (resp. visual) features. Finally, music and video are matched by measuring the similarity between the two corresponding emotion distributions, based on a distance measure such as KL divergence.
TRANSCRIPT
![Page 1: ACM Multimedia 2012 Grand Challenge: Music Video Generation](https://reader038.vdocuments.mx/reader038/viewer/2022100603/559621d21a28ab68708b4604/html5/thumbnails/1.jpg)
The Audiovisual Emotion Gaussians Model for Automatic Generation of Music Video
Ju-Chiang Wang, Yi-Hsuan Yang, I-Hong Jhuo, Yen-Yu Lin, Hsin-Min Wang
Academia Sinica, Taiwan
![Page 2: ACM Multimedia 2012 Grand Challenge: Music Video Generation](https://reader038.vdocuments.mx/reader038/viewer/2022100603/559621d21a28ab68708b4604/html5/thumbnails/2.jpg)
Introduction
• Generate a music video based on the emotional content recognized by machine
• The novel Audiovisual Emotion Gaussians (AVEG) framework learns the tripartite relationship among music, video, and emotion
• Project music pieces and video sequences into the multi-dimensional emotion space (3DES), and perform the cross-modal matching via the predicted emotion distributions
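The matching step can be sketched in a few lines. The description names KL divergence as one possible distance between predicted emotion distributions. The sketch below assumes each clip's emotion distribution is a single Gaussian with a diagonal covariance, which is a simplification chosen for brevity and not necessarily what AVEG predicts. The function names, clip ids, and numbers are hypothetical.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariances.

    Assumption: diagonal covariance, so the divergence is a sum of
    independent one-dimensional terms.
    """
    kl = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        kl += 0.5 * (vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp))
    return kl

def rank_videos(music_emotion, video_emotions):
    """Return video ids ordered from best to worst match (smallest KL first)."""
    mu_m, var_m = music_emotion
    scored = [
        (kl_diag_gaussians(mu_m, var_m, mu_v, var_v), vid)
        for vid, (mu_v, var_v) in video_emotions.items()
    ]
    return [vid for _, vid in sorted(scored)]

# Hypothetical predicted emotion distributions: (means, variances) in a
# three-dimensional emotion space. These values are illustrative only.
music = ([0.6, 0.4, 0.1], [0.05, 0.04, 0.06])
videos = {
    "clip_a": ([0.5, 0.5, 0.0], [0.05, 0.05, 0.05]),
    "clip_b": ([-0.7, -0.2, 0.3], [0.04, 0.06, 0.05]),
}
ranking = rank_videos(music, videos)
```

KL divergence is asymmetric, so ranking videos for a music query and ranking music for a video query can give different orderings unless a symmetrized variant is used.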
![Page 3: ACM Multimedia 2012 Grand Challenge: Music Video Generation](https://reader038.vdocuments.mx/reader038/viewer/2022100603/559621d21a28ab68708b4604/html5/thumbnails/3.jpg)
System Diagram
• Utilize the DEAP dataset (valence, activation, and potency)
– 3D emotion-annotated music videos
• Extend the AEG model to handle video (VEG)
– Wang et al. (2012), “The Acoustic Emotion Gaussians model for emotion-based music annotation and retrieval,” Proc. ACM MM (full paper)
![Page 4: ACM Multimedia 2012 Grand Challenge: Music Video Generation](https://reader038.vdocuments.mx/reader038/viewer/2022100603/559621d21a28ab68708b4604/html5/thumbnails/4.jpg)
Preliminary Result
• Perform the cross-modal retrieval experiment on the 120 music and video clips of DEAP
• Evaluate the NDCG@P for the ranking
• Measure the average Top 1 Relevance Score

| Scenario | P=5 | P=10 | P=15 | P=20 |
|---|---|---|---|---|
| Audio to Video Ranking | 0.8748 | 0.8316 | 0.8221 | 0.8172 |
| Video to Audio Ranking | 0.8737 | 0.8204 | 0.8105 | 0.8073 |
| Random Permutation | 0.8035 | 0.7604 | 0.7441 | 0.7370 |

| Scenario | A to V | V to A | Random |
|---|---|---|---|
| Average Relevance | 0.4881 | 0.4826 | 0.3837 |
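
For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@P can be computed for one ranked list. The slides do not say which DCG variant or relevance definition was used, so the log2 discount and the relevance values below are assumptions for illustration.

```python
import math

def dcg_at_p(relevances, p):
    """Discounted cumulative gain over the top-p items.

    Assumption: the common rel_i / log2(i + 1) discount with 1-based rank i.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:p]))

def ndcg_at_p(ranked_relevances, p):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_p(ideal, p)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_p(ranked_relevances, p) / ideal_dcg

# Hypothetical relevance scores of retrieved clips, in ranked order.
example_ranking = [0.2, 0.9, 0.5, 0.1]
score = ndcg_at_p(example_ranking, p=3)
```

A score of 1.0 means the top-P items are already in the ideal order. Because relevance here is graded, not binary, even a random permutation can score well above zero.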