
Classification, Implicit Segmentation and a Chronological Prediction Model for Cinematic Sound

Pedro Silva

[email protected]

Faculdade de Engenharia da Universidade do Porto
Departamento de Engenharia Informática

Rua Dr. Roberto Frias, s/n
4200–465 Porto

Portugal

Abstract. This report presents work done on classification and segmentation of cinematic sound employing support vector machines (SVM) with sequential minimal optimization (SMO). Speech, music, environmental sound and silence, plus all pairwise combinations, are considered as classes. A model considering simple adjacency rules and probabilistic output from logistic regression is used for segmenting fixed-length parts into auditory scenes. Evaluation of the proposed methods on a 44-film dataset against k-nearest neighbor (KNN), Naive Bayes and standard SVM classifiers shows superior results on all performance metrics. A subsequent meta-study is presented, which proposes optimizations to the building of similar datasets with regard to sample sizes. Finally, the report employs the sound classes resulting from classification as descriptors in a chronological model for predicting the period of production of a given soundtrack. Decision table classification is able to estimate the year of production of an unknown soundtrack with a mean absolute error of approximately 7 years (actual → predicted correlation coefficient = 0.6).

1 Introduction

This report addresses the broad subject of content analysis of cinematic sound. That is, it details a set of techniques for inferring meaning from the use of sound in a cinematic setting. Cinematic sound is defined as sound used either diegetically (produced by agents in the narrative) or non-diegetically (external to the narrative) in a visual presentation. In the field of film theory, there has so far been little effort to systematize this topic through the use of computational methods. As a result, the so-called semantic gap, that is, the gap between low-level machine-computable features and high-level human-perceptible semantics [4], is larger in this field than in others. For example, the questions of the importance of the introduction of sound in film in the late 1920s, and of its subsequent evolution in terms of scene length, music predominance over speech and other factors


across different periods, regions and/or producers, are long-standing. Arguably, automating content analysis of cinematic sound would help address them.

Amongst other approaches [3, 2, 24], [27] has proposed to address this gap through a model of media encoding (i.e. synthesis, production) and decoding (i.e. analysis, criticism) based on a hierarchy of low- to high-level “aesthetic elements”. The method proposed here is also hierarchical, with feature extraction, classification, audio segmentation, meta-feature extraction and meta-classification all building upon one another toward the high level.

Audio is a low-bandwidth medium in the context of digital audio-visual media, in which video comparatively uses more storage and communication bandwidth. Yet, semantically, audio is arguably as complex, and its information potential as high, as video. The computational analysis of cinematic sound offers efficiency to information retrieval and recommendation systems that purport to consider the semantics of audio, to name two examples. To that end, it would be useful to optimize the sample sizes used in content analysis, since the manual creation of labeled datasets is expensive and computational efficiency is essential. Although such research has been done for other kinds of media (namely print media) [22, 17, 9, 18, 20, 19], audio has been neglected. The study reported here proposes to address that need.

Although typically only simple, pure sound classes (e.g. speech and music) are considered in the context of high-accuracy classification (i.e. true positive rate > 90%) [11], this study attempts to expand those into a larger set, including hybrid classes combining the simple ones. [12] have shown that an SVM classifier, together with simple adjacency rules for building auditory scenes (see also [23]), is very effective in achieving high classification and segmentation accuracy. The use of timbral features is well established in the literature as providing the necessary discriminatory power to differentiate not only simple music and speech classes, but also to support high-level classification tasks employing complex classes [25].

The tools resulting from this study are: (a) a method for segmentation and classification of cinematic sound segments into auditory scenes (Sects. 4 and 5), (b) a new visualization tool for representing cinematic sound segmentation: the sound class map, which shows the distribution of sound classes in a single cinematic presentation (Sect. 6), (c) a sampling strategy for minimizing dataset size and complexity while maximizing generalizability (Sect. 7), and (d) a chronological model for predicting the period of production (Sect. 8).

This report is organized as follows: Sect. 2 characterizes the dataset. Section 3 details pre-processing, features, and rationale. Section 4 defines the labeled dataset, sound classes and classifier. Section 5 presents the results of cross-validation on the dataset, features and classifier. Section 6 proposes a model of auditory scenes, and a novel way of visualizing audio segmentation. Section 7 proposes sample size optimizations for similar datasets, and Sect. 8 a chronological model for predicting the period of production of a soundtrack. Section 9 discusses the effectiveness of the chosen feature space, classifier, auditory scene model, sampling strategy and chronological prediction model, and concludes by offering potential improvements to be applied to future research on this topic.


2 Dataset Characterization

2.1 Population

The population is comprised of the top-grossing American feature films from 1970 to 2006 (population size = 104), as reported by the Internet Movie Database as of 19 November 2006. Top-grossing is defined as being in the top-250 list of all-time U.S. box office revenue, without adjustment for inflation. American means that the production was financed by at least one major American production company, and thus excludes independent productions. Feature film is defined as a production at least 40 minutes long released for the theatrical market.

2.2 Sample

Table 1 lists the result of stratified weighted random sampling on the population (sample size = 44), where each stratum represents one decade. There are 9 films for 1970–79, 10 for 1980–89, 12 for 1990–99, and 13 for 2000–06. The imbalance of this distribution reflects the contribution of later decades to the population.

Table 1. Stratified weighted random sample filmography, ordered chronologically.

1. Patton (1970)
2. Harold and Maude (1971)
3. The Godfather (1972)
4. The Sting (1973)
5. The Conversation (1974)
6. One Flew Over the Cuckoo’s Nest (1975)
7. Taxi Driver (1976)
8. Apocalypse Now (1979)
9. Manhattan (1979)
10. The Elephant Man (1980)
11. Raiders of the Lost Ark (1981)
12. Blade Runner (1982)
13. Scarface (1983)
14. Amadeus (1984)
15. Once Upon a Time in America (1984)
16. Platoon (1986)
17. The Princess Bride (1987)
18. Glory (1989)
19. Indiana Jones and the Last Crusade (1989)
20. Goodfellas (1990)
21. Reservoir Dogs (1992)
22. Schindler’s List (1993)
23. Pulp Fiction (1994)
24. The Shawshank Redemption (1994)
25. Se7en (1995)
26. Fargo (1996)
27. Sling Blade (1996)
28. American Beauty (1999)
29. Magnolia (1999)
30. The Straight Story (1999)
31. Toy Story 2 (1999)
32. Memento (2000)
33. Snatch. (2000)
34. The Lord of the Rings: The Fellowship of the Ring (2001)
35. The Lord of the Rings: The Two Towers (2002)
36. Kill Bill: Vol. 1 (2003)
37. The Lord of the Rings: The Return of the King (2003)
38. Mystic River (2003)
39. Pirates of the Caribbean: The Curse of the Black Pearl (2003)
40. Eternal Sunshine of the Spotless Mind (2004)
41. The Incredibles (2004)
42. Sin City (2005)
43. Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan (2006)
44. Little Miss Sunshine (2006)


3 Feature Selection

The features used included 13 Mel-frequency cepstral coefficients (MFCC) [13], the zero-crossing rate (ZCR) [21], and the spectral centroid [6], flux [5] and rolloff (see Fig. 1). The use of these timbral features is well established in the literature as providing the necessary discriminatory power to differentiate not only simple music and speech classes, but also to support broad classification tasks involving, for example, music genre, user preferences and broadcast categories [25]. As such, these features were considered adequate to discriminate hybrid sound classes.

Fig. 1. Spectral features for Federico Fellini’s 8½: (a) Mel-frequency cepstrum, (b) zero-crossing rate, (c) spectral centroid, (d) spectral rolloff, (e) spectral flux. x-axes: soundtrack timeline. y-axes: (a) discrete cosine transform of the logarithm of the magnitude spectrum of the Mel STFT; (b) rate of change of the signal’s sign; (c) center of gravity of the magnitude spectrum of the short-time Fourier transform (STFT); (d) frequency below which 85% of the magnitude distribution is concentrated; (e) rate of change of the power spectrum through a distance comparison of adjacent normalized frames.

The means and variances of each of these features were calculated for each 3 s segment during training. The basic analysis window was 512 samples long with no overlap, keeping a buffer of 40 windows for computing a moving average of the features over 129 frames (which, at a sampling rate of 22050 Hz, represents 512 × 129 ÷ 22050 ≈ 3 s). The resulting feature vector was composed of the buffer and frame means and variances of each of the 17 base features, resulting in a final 68-feature vector for each sound segment.

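To make the pipeline concrete, the following minimal sketch computes a comparable 68-value vector for one segment. librosa and NumPy are assumptions made for illustration only (the paper does not name its extraction tooling), and a simple moving average stands in for the exact 40-window buffer logic.

```python
# Sketch of the 68-dimensional feature vector: frame- and buffer-level means
# and variances of 17 timbral features (13 MFCCs, ZCR, spectral centroid,
# rolloff, flux). librosa is an assumption; the paper's own pipeline differs.
import numpy as np
import librosa

def segment_features(y, sr=22050, win=512):
    # Frame-level features, one column per 512-sample window (no overlap).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, hop_length=win)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=win, hop_length=win)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=win, hop_length=win)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=win,
                                               hop_length=win, roll_percent=0.85)
    # Spectral flux: distance between adjacent normalized magnitude spectra.
    S = np.abs(librosa.stft(y, n_fft=win, hop_length=win))
    S = S / (S.sum(axis=0, keepdims=True) + 1e-12)
    flux = np.pad(np.sum(np.diff(S, axis=1) ** 2, axis=0), (1, 0))[np.newaxis, :]
    feats = np.vstack([mfcc, zcr, centroid, rolloff, flux])  # 17 x n_frames

    # Buffer-level: moving average over 40 windows, then segment statistics.
    kernel = np.ones(40) / 40
    buffered = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"),
                                   1, feats)
    return np.concatenate([feats.mean(axis=1), feats.var(axis=1),
                           buffered.mean(axis=1), buffered.var(axis=1)])  # 68

# e.g. y, _ = librosa.load("segment.wav", sr=22050); vec = segment_features(y)
```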


4 Classification

4.1 Labeled Set

The labeled set was composed of 3560 distinct audio segments randomly sampled from the list in Tab. 1, each 3 s long. These segments were manually classified by three human coders into any of the classes in Sect. 4.2, with an inter-coder agreement coefficient κ = 0.83 [1], an “almost perfect” agreement according to [10]. Conflicts were resolved by majority voting.
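For concreteness, a minimal sketch of the agreement computation and majority-vote resolution follows, on hypothetical coder labels. Averaging pairwise Cohen kappas is one common convention for three coders; the paper cites [1] but does not state which multi-coder variant was used.

```python
# Hypothetical labels for four segments (the study had 3560); sklearn assumed.
from itertools import combinations
from statistics import mode
from sklearn.metrics import cohen_kappa_score  # two-rater (pairwise) kappa

labels_a = ["sp", "mus", "s-m", "sil"]
labels_b = ["sp", "mus", "m-e", "sil"]
labels_c = ["sp", "mus", "s-m", "sil"]
coders = [labels_a, labels_b, labels_c]

# Mean pairwise kappa as a three-coder agreement estimate (cf. 0.83).
kappas = [cohen_kappa_score(x, y) for x, y in combinations(coders, 2)]
print(sum(kappas) / len(kappas))

# Conflict resolution by majority vote, as in the paper.
gold = [mode(seg) for seg in zip(*coders)]  # ["sp", "mus", "s-m", "sil"]
```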

4.2 Classes

The simplest (first-order) classes considered were speech, music, environmental sound, and silence. Although used throughout the literature, first-order classes alone were not enough to build the meta-feature set used in the second part of this study. They were therefore supplemented with hybrid (second-order) classes, corresponding to all possible combinations of two first-order classes. Table 2 describes the working operational definitions for each class used in this study.

Table 2. Sound classes and corresponding working operational definitions.

Speech: Dialog, perceptually unaccompanied by music or environmental sounds, noise, or other transient sounds.

Music: Harmonic composition with a varying degree of subjective dissonance, perceptually unaccompanied by speech, noise, or other environmental sounds.

Environmental sound: Foley (e.g., clothes rustling, footsteps), true environmental sounds (e.g., wind blowing, traffic), and noise in the strictest sense (i.e., white noise and other types of random or pseudo-random inharmonious sounds).

Silence: Perceptual absence of any sound.

Speech with music background: Speech that is perceptually dominant over music.

Speech with environmental sound background: Speech that is perceptually dominant over environmental sound.

Music with environmental sound background: Music that is perceptually dominant over environmental sound.

4.3 Classifier

The classifier used in this study was a fast implementation of a support vector machine (SVM) [26] using sequential minimal optimization (SMO) [15]. A linear kernel was used, since the data was linearly separable in the binary classification problem. Logistic regression models were built during training and prediction to obtain a posterior probability for each instance classification [16]. All classification tasks were done using the WEKA machine learning framework [7], including its specific SMO implementation.
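A rough scikit-learn stand-in for this setup (an assumption; the study used WEKA): a linear-kernel SVM whose decision values are calibrated into posterior probabilities with Platt’s logistic method [16], which is what probability=True enables.

```python
# Linear SVM with Platt-calibrated posterior probabilities; hypothetical data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(200, 68)            # hypothetical 68-feature vectors
y = np.random.randint(0, 7, size=200)  # hypothetical labels for 7 classes

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
clf.fit(X, y)
posteriors = clf.predict_proba(X[:5])  # per-class posterior estimates
```

Much as WEKA’s SMO fits its logistic model during training, SVC fits the calibration internally (via cross-validation) when probability=True is set.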



5 Validation

Ten-fold cross-validation was used to obtain performance metrics for the SMO classifier against ZeroR, Naive Bayes, KNN, and SVM classifiers. Table 3 shows that SMO performed better than its direct alternatives (Δtp > 10%), denoting “substantial” agreement [10] with the baseline labels (κ = 0.70). Table 4(a) shows that most misclassifications shared a first-order constituent; the four starred numbers represent approximately 43% of total misclassifications. Table 4(b) shows that performance was consistently high across classes, with the exception of music with environmental sound, owing to the high rate at which segments of that class were predicted as pure music.

Table 3. Cross-validation summary of SMO classifier performance versus a baseline classifier that predicts the mode (ZeroR), a classifier that simply employs Bayes’ theorem while assuming independence between features (Naive Bayes), a common instance-based classifier that uses the nearest samples in the feature space (k-nearest neighbor), and a support vector machine classifier without the SMO optimizations (standard SVM).

Performance                   ZeroR     Bayes     KNN       SVM       SMO
Correctly classified          32.32%    64.19%    64.34%    65.73%    76.01%
Kappa statistic               0         0.56      0.55      0.57      0.70
Mean absolute error           0.23      0.10      0.10      0.13      0.10
Root mean squared error       0.34      0.30      0.32      0.26      0.22
Relative absolute error       100%      45.23%    44.87%    58.99%    41.97%
Root relative squared error   100%      89.52%    94.08%    76.19%    66.11%

Table 4. Performance of the SMO classifier. env: environmental sound, mus: music, sil: silence, sp: speech, m-e: music over environmental sound, s-e: speech over environmental sound, s-m: speech over music. TP: true positive rate, FP: false positive rate, PR: precision, RE: recall, FM: F-measure, ROC: receiver operating characteristic area.

(a) Confusion matrix (rows: actual class, columns: predicted class; * marks the four cells discussed in the text)

Class   env   mus   m-e   sil    sp   s-e   s-m
env     242     2    22     2     8    32     2
mus       2   254    44     0     4     4    10
m-e      28    58   180     0     2    16    26
sil      10     0     0    36     0     0     0
sp        6     0     0     6   470   *78    26
s-e      28     2    18     6   *92   788   *66
s-m       0     8    26    18    18   *90   382

All     316   324   290    68   594  1008   512

(b) Cross-validation accuracy (rows in the same class order as above; All: weighted average)

Class   TP     FP     PR     RE     FM     ROC
env     0.78   0.03   0.77   0.78   0.77   0.98
mus     0.78   0.03   0.78   0.80   0.79   0.97
m-e     0.58   0.04   0.62   0.58   0.60   0.94
sil     0.78   0.01   0.72   0.78   0.80   0.99
sp      0.80   0.05   0.80   0.80   0.80   0.96
s-e     0.79   0.11   0.78   0.79   0.79   0.93
s-m     0.73   0.05   0.75   0.73   0.74   0.95

All     0.76   0.06   0.76   0.76   0.76   0.95
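The comparison behind Table 3 can be sketched as follows, with scikit-learn analogues standing in for the WEKA classifiers (assumptions: ZeroR as a most-frequent-class dummy, SMO approximated by a linear-kernel SVC) and hypothetical data in place of the labeled set.

```python
# Ten-fold cross-validation over several classifiers; hypothetical data.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 68)            # hypothetical 68-feature vectors
y = np.random.randint(0, 7, size=500)  # hypothetical labels for 7 classes

models = {
    "ZeroR": DummyClassifier(strategy="most_frequent"),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SMO": SVC(kernel="linear"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.2%} correctly classified")
```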


6 Audio Segmentation

An auditory scene is a coherent collection of sound sources, only a few of which are perceptually dominant. In this model, a scene change occurs when the majority of these dominant sources change [23, p. 2441]. There is a natural equivalence between objects (sound sources) and characteristics (sound classes) [14, pp. 26–27]. Each scene was therefore modeled as a collection of adjacent segments that share a common class, under the rules in [12, p. 6]. Further, each scene was annotated with its cumulative posterior probability, based on the estimates of the classifier. Finally, basic descriptive statistics were computed for each soundtrack, namely: average and standard deviation of scene length (in seconds), average and standard deviation of scene frequency (scenes per minute) and, for each class, its coverage coefficient (the proportion of the duration of a soundtrack that the class represents).
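A minimal sketch of this implicit segmentation, assuming each fixed-length part arrives as a (class label, posterior probability) pair. The merge rule shown is the simple shared-class adjacency case, and the per-scene accumulation is illustrated as an average; the exact rules follow [12].

```python
# Merge consecutive 3 s segments sharing a class into auditory scenes,
# carrying an aggregate posterior per scene. Input format is an assumption.
from itertools import groupby

def build_scenes(segments, seg_len=3.0):
    """segments: list of (class_label, posterior) per fixed-length part."""
    scenes, t = [], 0.0
    for label, group in groupby(segments, key=lambda s: s[0]):
        members = list(group)
        length = len(members) * seg_len
        posterior = sum(p for _, p in members) / len(members)  # shown as mean
        scenes.append({"class": label, "start": t,
                       "length": length, "posterior": posterior})
        t += length
    return scenes

scenes = build_scenes([("sp", 0.9), ("sp", 0.8), ("mus", 0.7), ("sil", 0.95)])
# -> two "sp" segments merge into one 6 s speech scene, etc.
```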

Figure 2 exemplifies the result of this process. There are (a) seven time series, one for each sound class; (b) N discrete auditory scene points (the centered, differently shaped marks); (c) individual scene lengths (the horizontal ticks; a single center mark is the basic unit, a three-second scene); and (d) the cumulative posterior probability, denoted by the vertical ticks, varying from a simple point (low probability of the prediction being correct) to a line spanning the distance to the adjacent time series (high probability of the prediction being correct).

Fig. 2. Class segmentation for Federico Fellini’s 8½. Rows (one time series per sound class, top to bottom): Environment, Music, Music/Environment, Silence, Speech, Speech/Environment, Speech/Music. x-axis: timeline (hours:minutes). Each mark represents an auditory scene; horizontal ticks show the scene’s length, and vertical ticks show the cumulative posterior probability that the scene has been correctly classified.


7 Meta-analysis

Figure 3 shows the cross-correlation of changes in class coverage, scene frequency and average scene length from 1970–2006, computed with progressively larger sample sizes per decade. For each decade and for each class, random sampling with replacement was done with sample sizes 1 through 9. Cross-correlation was computed between each sub-sample and the entire sample, and the results were averaged into three final series, one per meta-feature. All three series are highly cross-correlated (cc > 0.94). Assuming that the sample is representative of the population, this is evidence that the anchor weights used for the stratified sampling in Sect. 2.2 can be reduced from 10 to 8 films per decade with no loss of power.
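The sub-sampling experiment can be sketched as follows, under an assumed data layout mapping each decade to the per-film values of one meta-feature; random.choices draws with replacement, and each sub-sample series is correlated against the full-sample series.

```python
# For a sample size k per decade, repeatedly sub-sample with replacement,
# rebuild the per-decade meta-feature series, and correlate it with the
# series from the full sample. Data layout is an assumption.
import random
import numpy as np

def subsample_correlation(per_decade, k, trials=100):
    """per_decade: dict mapping decade -> list of per-film feature values."""
    decades = sorted(per_decade)
    full = np.array([np.mean(per_decade[d]) for d in decades])
    corrs = []
    for _ in range(trials):
        sub = np.array([np.mean(random.choices(per_decade[d], k=k))
                        for d in decades])
        corrs.append(np.corrcoef(sub, full)[0, 1])
    return np.mean(corrs)

# e.g. subsample_correlation({1970: [0.41, 0.38, 0.45],
#                             1980: [0.52, 0.47, 0.50]}, k=8)
```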

Fig. 3. Meta-sampling summary of cross-correlation in class coverage, scene frequency and scene length, 1970–2006, between different sample sizes and the full sample. x-axis: number of random samples per decade; y-axis: mean cross-correlation with the full sample. Error bars show standard error at the 0.95 confidence level.


8 Chronological Prediction Model

The last step was to train a meta-classifier with the task of predicting the year of production of unlabeled instance soundtracks. The classifier used was a rule-based decision table [8] with a greedy hill-climbing search strategy. The features used in this step were the output of the previous stages of classification and segmentation. That is, the feature space extracted from each instance soundtrack comprised the class-specific and global means and standard deviations of scene length, scene frequency and coverage coefficient, for a total of 58 features. The correlation coefficient between predicted and actual years was 0.6, with a mean absolute error of 7.037 years and a root mean squared error of 9.114 years. Figure 4 shows the output of this classifier.
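A rough sketch of this stage: WEKA’s decision table has no direct scikit-learn counterpart, so a shallow decision tree regressor stands in here, over hypothetical meta-feature data.

```python
# Meta-classifier stage sketched with a decision tree regressor standing in
# for the WEKA decision table [8]; data below is hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error

X_meta = np.random.rand(44, 58)                 # hypothetical 58 meta-features
years = np.random.randint(1970, 2007, size=44)  # hypothetical target years

pred = cross_val_predict(DecisionTreeRegressor(max_depth=3), X_meta, years, cv=10)
print("MAE:", mean_absolute_error(years, pred))  # paper reports ~7 years
print("r:", np.corrcoef(years, pred)[0, 1])      # paper reports 0.6
```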

Fig. 4. Scatter plot of actual versus predicted years (x-axis: predicted year; y-axis: actual year). Points show each instance soundtrack, the dotted line shows a hypothetical perfect classifier, the solid line shows a linear regression on the predictions (p < 0.001, r² = 0.36), the dashed line shows a loess nonparametric regression on the predictions, and the dash-dotted lines show smoothing applied to the root-mean-square positive and negative residuals from the loess line to display conditional spread and asymmetry.


9 Conclusion

The combination of spectral characteristics (MFCC, spectral centroid, flux and rolloff) and the zero-crossing rate as features for short segments provided sufficient discriminatory power for selecting between the seven sound classes defined in this study. The use of additional features, namely stereophonic ones (such as apparent source width, ASW), could in principle decrease the rate of misclassification, in particular of hybrid classes involving environmental sound, which proved to be the source of most misclassifications. Environmental sound, aside from having an overly broad definition, is especially heterogeneous. One commonality might be its fundamental binaural decorrelation, whereby most of its defining characteristics would be lost when a soundtrack is pre-processed down to a single channel.

Amongst the various classifiers, SMO provided the best predictive capabilities. At approximately 76% accuracy, the model could certainly see improvements when compared with the state of the art, which tends to hover around 90%. However, the use of three hybrid classes is rare in the literature; as argued in Sect. 5, the majority of errors involved combinations of classes that would not exist in a simple four-class, first-order model, where misclassifications confusing a hybrid class with one of its constituents would disappear.

The use of logistic regression to obtain posterior class probabilities aided in estimating the effectiveness of implicit audio segmentation based on adjacency rules. As a result, the sound class maps exemplified by Fig. 2 provide a novel way of visualizing segmentation results and their correctness.

Under the limited scope of the dataset, this study presented evidence that a sample of eight instances per decade is strongly correlated with the full sample. A tentative conclusion, therefore, is that for the purposes of segmenting and classifying cinematic soundtracks in the 1970–2006 period, sampling eight instances per decade, for a total of 32 data points, is likely justified. It remains to be investigated whether these figures also apply to the full sound film period of 1930 to the present, or to other locales or agents of production.

The specific scope of this study precluded further investigation of production locales and agents, but chronological prediction was scrutinized. With the three meta-features specifically employed (class coverage, scene frequency, and average scene length), it was difficult to build an accurate chronological model, although the system was able to learn how to place an unknown instance soundtrack in the 1970–2006 timeline within an experimental error of approximately 7 years.

In the future, the feature space could be refined to include not only spectral characteristics but also stereophonic ones. The SMO classifier appears to be adequate for the subsequent task of training and predicting unknown instances. The meta-feature space should be expanded: judging from the experimental error and correlation coefficient in the meta-classification part of this study, it is a step in the right direction but can clearly be improved. Specifically, the meta-feature space should most likely include in-soundtrack variations of scene frequency, scene length, and class coverage.


References

1. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960)

2. Dorai, C., Mauthe, A., Nack, F., Rutledge, L., Sikora, T., Zettl, H.: Media semantics: Who needs it and why? In: Proceedings of the Tenth ACM International Conference on Multimedia, pp. 580–583. ACM Press, New York (2002)

3. Dorai, C., Venkatesh, S.: Bridging the semantic gap in content management systems: Computational media aesthetics. In: Proceedings of the First Conference on Computational Semiotics for Games and New Media, pp. 94–99 (2001)

4. Dorai, C., Venkatesh, S.: Bridging the semantic gap with computational media aesthetics. IEEE Multimedia 10(2), 15–17 (2003)

5. Grey, J.M.: Multidimensional perceptual scaling of musical timbres. The Journal of the Acoustical Society of America 61(5), 1270–1277 (May 1977)

6. Grey, J.M., Gordon, J.W.: Perceptual effects of spectral modifications on musical timbres. The Journal of the Acoustical Society of America 63, 1493 (May 1978)

7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explorations 11(1) (2009)

8. Kohavi, R.: The power of decision tables. In: Proceedings of the Eighth European Conference on Machine Learning, pp. 174–189. Springer-Verlag, London (1995)

9. Lacy, S.: Sample size in content analysis of weekly newspapers. Journalism and Mass Communication Quarterly 72(2), 336–345 (1995)

10. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)

11. Lu, L., Zhang, H., Jiang, H.: Content analysis for audio classification and segmentation. IEEE Transactions on Speech and Audio Processing 10(7), 504–516 (2002)

12. Lu, L., Zhang, H., Li, S.: Content-based audio classification and segmentation by using support vector machines. Multimedia Systems 8(6), 482–492 (2003)

13. Mermelstein, P.: Distance measures for speech recognition, psychological and instrumental. In: Chen, C.H. (ed.) Pattern Recognition and Artificial Intelligence, pp. 374–388. Academic, New York (1976)

14. Metz, C., Gurrieri, G.: Aural objects. Yale French Studies 60, 24–32 (1980)

15. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, pp. 41–65. MIT Press, Cambridge (1998)

16. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola, A.J., Bartlett, P., Schuurmans, D., Schölkopf, B. (eds.) Advances in Large Margin Classifiers, pp. 61–74. MIT Press, Cambridge (2000)

17. Riffe, D.: The effectiveness of random, consecutive day and constructed week sampling in newspaper content analysis. Journalism Quarterly 70(1), 133–139 (1993)

18. Riffe, D.: The effectiveness of simple and stratified random sampling in broadcast news content analysis. Journalism and Mass Communication Quarterly 73(1), 159–168 (1996)

19. Riffe, D., Freitag, A.: A content analysis of content analyses: Twenty-five years of Journalism Quarterly. Journalism and Mass Communication Quarterly 74, 515–524 (1997)

20. Riffe, D., Lacy, S., Drager, M.W.: Sample size in content analysis of weekly news magazines. Journalism and Mass Communication Quarterly 73, 635–644 (1996)

21. Saunders, J.: Real-time discrimination of broadcast speech/music. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 993–996. IEEE (1996)

22. Stempel, G.H.: Sample size for classifying subject matter in dailies. Journalism Quarterly 29(2), 333–334 (1952)

23. Sundaram, H., Chang, S.F.: Audio scene segmentation using multiple features, models and timescales. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, pp. 2441–2444 (Jun 2000)

24. Truong, B.T., Venkatesh, S., Dorai, C.: Application of computational media aesthetics methodology to extracting color semantics in film. In: Proceedings of the Tenth ACM International Conference on Multimedia, pp. 339–342. ACM Press, New York (2002)

25. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293–302 (2002)

26. Vapnik, V.N., Kotz, S.: Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York (2006)

27. Zettl, H.: Contextual media aesthetics as the basis for media literacy. Journal of Communication 48(1), 81–95 (1998)