Post on 23-Jun-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

PICASSO – To Sing you must Close Your Eyes and Draw

Seminar Informatik in den Medien

Eloy Rodríguez Rey


Outline

1) Introduction

2) Framework

3) Approaches

4) Evaluation

5) Conclusion


PICASSO

● What?

– PIcture CAtegorization for Suggesting SOundtracks

● Why?

– Problem of proposing a soundtrack to a picture or a group of pictures

● How?

– Training data (40,000 image/soundtrack samples from 28 movies)

– Three-level algorithm


Previous Works

● Start with a soundtrack and then find the appropriate images

● Focus on impressionist paintings and their emotions

● Suggest music for driving scenery

● Align the video transitions with the transitions in a given music piece


Technical Background

● Low-level features for both image-to-image and song-to-song similarity measures

– Image-to-image similarity

   ● MPEG-7 color

   ● Texture low-level features

– Song-to-song similarity

   ● Spectral shape

   ● Temporal low-level features



Training Database

Figure 1: PICASSO – To Sing you must Close your Eyes and Draw. Stupar, Aleksandar; Michel, Sebastian


Music/Speech Classification

● Naïve Bayes classifier

– 64 speech samples

– 64 music samples

● Low-level features

– Training the classifier

– Classification task

● Marsyas tool

– Feature extraction

– Classification


Music/Speech Classification

● Output of the classifier for each second of the soundtrack:

– Label: “music” or “speech”

– Confidence value

● Only the musical parts of the soundtrack are kept

– Musical parts with a confidence value > 95% and a length of more than 5 seconds
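The filtering rule above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `musical_segments` and the input format (one label/confidence pair per second) are hypothetical, and it assumes the 95% threshold applies to every second of a run rather than to an average.

```python
from itertools import groupby

def musical_segments(labels, min_conf=0.95, min_len=5):
    """Return (start, end) second ranges that are confidently music.

    labels: one (label, confidence) pair per second of soundtrack.
    A run is kept when every second is labelled "music" with confidence
    above min_conf (assumption) and the run is longer than min_len seconds.
    """
    keep = [lab == "music" and conf > min_conf for lab, conf in labels]
    segments, pos = [], 0
    for is_music, run in groupby(keep):
        length = len(list(run))
        if is_music and length > min_len:
            segments.append((pos, pos + length))  # end is exclusive
        pos += length
    return segments
```

Runs of confident music that are 5 seconds or shorter are dropped, as are all speech and low-confidence seconds.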


Scene Detection

● The sequence of screenshots is split at positions where the image-to-image distance is larger than a given threshold

● The sequence of screenshots from one split to the next is considered a scene

● Considerations:

– Short musical parts are eliminated

– Scenes shorter than 5 seconds are discarded

– Scenes longer than 10 seconds are split into multiple parts

Figure 2: PICASSO – To Sing you must Close your Eyes and Draw. Stupar, Aleksandar; Michel, Sebastian
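As a rough sketch of the scene detection rules, the function below cuts a screenshot sequence at large distance jumps and then applies the two length constraints. Several details are assumptions made for illustration: one screenshot per second (so lengths count as seconds), fixed-size chunking of long scenes, and the hypothetical names `split_scenes`, `distance` and `threshold`.

```python
def split_scenes(frames, distance, threshold, min_len=5, max_len=10):
    """Split a screenshot sequence into scenes of min_len..max_len frames.

    Assumes one screenshot per second, so lengths are in seconds.
    """
    if not frames:
        return []
    # 1) Cut wherever consecutive screenshots differ by more than threshold.
    scenes, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if distance(prev, cur) > threshold:
            scenes.append(current)
            current = []
        current.append(cur)
    scenes.append(current)
    # 2) Split long scenes into chunks of at most max_len (assumed strategy),
    #    then discard every chunk shorter than min_len.
    result = []
    for scene in scenes:
        chunks = [scene[i:i + max_len] for i in range(0, len(scene), max_len)]
        result.extend(c for c in chunks if len(c) >= min_len)
    return result
```

In this sketch a leftover chunk shorter than 5 frames is discarded just like a short scene; the slides do not say how such remainders are handled.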


Image Similarity Measure

● The following MPEG-7 feature vectors are used:

– Scalable color

– Color structure

– Color layout

– Edge histogram


Image Similarity Measure

● Color structure: describes all the colors found in the image by aggregating them in a color histogram

Figure 3: https://documentation.apple.com/en/color/usermanual/Art/S02/S0208_RGBHistogram.png


Image Similarity Measure

● All distance calculations are combined in one distance measurement:

1) Calculating the standard score (z-score) for each of the descriptors

2) Summing up all of the standard scores into a single score
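The two steps above could look roughly like this. It is a sketch under one stated assumption: the z-score of a descriptor is computed over the distances from the query to all candidate images. The slides do not say which population the mean and standard deviation are taken over, and the name `combined_distances` is hypothetical.

```python
from statistics import mean, pstdev

def combined_distances(per_descriptor):
    """Combine per-descriptor distances into one score per candidate.

    per_descriptor maps a descriptor name to a list of distances from the
    query to each candidate (same candidate order in every list).
    """
    n = len(next(iter(per_descriptor.values())))
    total = [0.0] * n
    for dists in per_descriptor.values():
        mu, sigma = mean(dists), pstdev(dists)
        for i, d in enumerate(dists):
            # z-score; a constant descriptor (sigma == 0) contributes nothing
            total[i] += (d - mu) / sigma if sigma > 0 else 0.0
    return total
```

Standardizing first keeps a descriptor with a large numeric range (for example a histogram distance in the hundreds) from dominating one whose distances are all below 1.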


Music Similarity Measure

● The following low-level musical descriptors are used:

– MFCC

– Chroma

– Spectral centroid

– Spectral rolloff

– Spectral flux

– Time-domain zero crossings


Music Similarity Measure

● Spectral centroid: center of gravity of a musical signal's spectral representation

Figure 4: http://w3.impa.br/~cicconet/audiofeature/Feature_SpectralCentroid.png
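To make the "center of gravity" idea concrete, the sketch below computes the centroid of one frame as the magnitude-weighted mean of its frequency bins. It uses a direct DFT from the standard library for self-containment; a real system would use an FFT or a tool such as Marsyas, and the function name is only illustrative.

```python
import cmath
import math

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of one frame of samples."""
    n = len(signal)
    num = den = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        # k-th DFT coefficient, computed directly (O(n^2), fine for a demo)
        x_k = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                  for t, s in enumerate(signal))
        mag = abs(x_k)
        num += (k * sample_rate / n) * mag  # bin frequency times magnitude
        den += mag
    return num / den if den else 0.0
```

For a pure tone the centroid sits at the tone's frequency; adding high-frequency content pulls it upward, which is why it is often read as a measure of "brightness".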


Music Similarity Measure [3]

● To calculate the similarity between two songs:

1) The feature vectors of each descriptor are extracted

2) Pairwise similarity between these vectors is calculated and combined

● Pairwise similarity alone is not enough: music also has a time dimension

– Dynamic Time Warping (DTW)


Music Similarity Measure

● Dynamic Time Warping (DTW): enables matching of sequences that vary in speed

Figure 5: http://america.pink/images/1/3/3/9/5/2/7/en/3-dynamic-time-warping.jpg
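A minimal sketch of the textbook DTW recurrence is shown here on sequences of numbers. In the setting of the slides the elements would be per-frame feature vectors and `dist` a vector distance; that mapping, like the function name, is an assumption for illustration.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances
                                    cost[i][j - 1],      # b advances
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because one element may be matched to several elements of the other sequence, a melody played at half speed can still align with the original at zero cost.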


Music Similarity Measure

● The soundtrack sample is compared with the song at three positions (see Figure 6); the sum of these three distances is used as the resulting distance

Figure 6: PICASSO – To Sing you must Close your Eyes and Draw. Stupar, Aleksandar; Michel, Sebastian



Approaches

● Two types of recommendations:

– Single image recommendation

– Multiple images


Single Image Recommendation

● Two phases of K-nearest neighbor searches: first, in the image domain; and second, between musical pieces

– Phase 1. When the query is submitted, its distance to each of the images in the training dataset is calculated

– Phase 2. After the top-K images are found, the list of songs, together with their scores, is retrieved for each image
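The two phases can be sketched as follows. This is a simplification under stated assumptions: each training image is assumed to carry a precomputed mapping from songs to scores (the result of the music-domain search), a higher score is assumed to be better, and the scores of the K neighbors are simply summed. None of these details, nor the name `recommend`, come from the slides.

```python
import heapq
from collections import defaultdict

def recommend(query, training, image_distance, k=3, top_n=5):
    """training: list of (image, {song: score}) pairs; higher score = better."""
    # Phase 1: the k training images closest to the query image.
    nearest = heapq.nsmallest(k, training,
                              key=lambda item: image_distance(query, item[0]))
    # Phase 2: collect each neighbor's song list and merge the scores.
    merged = defaultdict(float)
    for _, songs in nearest:
        for song, score in songs.items():
            merged[song] += score  # assumption: scores are simply summed
    return sorted(merged, key=merged.get, reverse=True)[:top_n]
```

A song that appears in the lists of several neighboring images accumulates score and rises in the final ranking.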


Multiple Images

● Group these images using a clustering algorithm

● Recommend a soundtrack for each of the groups

– Average position

– Least misery
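The two aggregation strategies can be illustrated on ranked song lists, one per image in a group. The sketch assumes both strategies work on rank positions: "average position" takes the mean rank of a song across the images, and "least misery" takes its worst rank. The function name and input format are hypothetical.

```python
def aggregate(rankings, strategy="average"):
    """rankings: one best-first song list per image, all over the same songs."""
    positions = {song: [r.index(song) + 1 for r in rankings]
                 for song in rankings[0]}
    if strategy == "average":
        score = {s: sum(p) / len(p) for s, p in positions.items()}
    elif strategy == "least_misery":
        score = {s: max(p) for s, p in positions.items()}  # worst position
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(score, key=score.get)  # lowest (best) score first
```

Average position favors a song that most images rank highly, while least misery favors a song that no image ranks badly, so the two can pick different winners for the same group.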



Evaluation

● Single image

1) Grade the 1st-ranked recommended soundtrack

2) Grade the 10th-ranked recommended soundtrack

3) Grade a randomly chosen soundtrack

● Multiple images: evaluation of

1) The average position approach

2) The least misery approach

3) A random recommendation

● Dataset is obtained by downloading songs from the music2ten site


Evaluation

Figure 7: PICASSO – To Sing you must Close your Eyes and Draw. Stupar, Aleksandar; Michel, Sebastian

Figure 8: PICASSO – To Sing you must Close your Eyes and Draw. Stupar, Aleksandar; Michel, Sebastian


Evaluation

● The runtime of query processing was also measured

Figure 9: PICASSO – To Sing you must Close your Eyes and Draw. Stupar, Aleksandar; Michel, Sebastian



Conclusion

● Automated approach to recommend a soundtrack for a picture or a series of pictures

● Extraction of knowledge from popular, publicly available movies, and how this information can be used

● PICASSO is based on low-level features for similarity comparison between images and between songs


Media vs. Paper

                    Media            Paper
Level of detail     Lower            Higher
Oriented to         General public   People with prior knowledge
Level of language   Lower            Higher
Size                Shorter          Longer
Citations           Rarely           Always


Thank you!