Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, December 7-9, 2000

3D GRAPHICS TOOLS FOR SOUND COLLECTIONS

George Tzanetakis

Computer Science Department, Princeton University

[email protected]

Perry Cook

Computer Science Department and Music Department, Princeton University

[email protected]

ABSTRACT

Most of the current tools for working with sound operate on single soundfiles, use 2D graphics, and offer limited interaction to the user. In this paper we describe a set of tools for working with collections of sounds that are based on interactive 3D graphics. These tools form two families: sound analysis visualization displays and model-based controllers for sound synthesis algorithms.

We describe the general techniques we have used to develop these tools and give specific case studies from each family. Several collections of sounds were used for development and evaluation: a set of musical instrument tones, a set of sound effects, a set of FM radio audio clips belonging to several music genres, and a set of mp3 rock song snippets.

1. INTRODUCTION

There are many existing tools for sound analysis and synthesis. Typically these tools work on single soundfiles and use 2D graphics. Some attempts at using 3D graphics have been made (such as waterfall spectrogram displays), but typically they offer very limited interaction to the user.

In this paper we describe some recently developed tools that are based on interactive 3D graphics. These tools can be grouped into two families. The first family consists of visualization displays for sound analysis. The second family consists of model-based controllers for sound synthesis algorithms. The primary motivation behind this work has been using 3D graphics for experimenting with large collections of sounds rather than single audio files.

Each tool is parametrized by its inputs and options. We have tried to decouple the tools from a specific analysis front-end and synthesis back-end. This is achieved through the use of different mappings of analysis or synthesis data to the tool inputs and a generic client-server architecture. In this way, other researchers can easily integrate these 3D graphics tools with their own applications. The tool options control the visual appearance. Collections of music instrument tones, sound effects, FM radio audio clips belonging to several music genres, and rock mp3 song snippets were used for developing and evaluating these tools.

An additional motivation for the development of these tools is that recent hardware is fast enough to handle audio analysis or synthesis together with 3D graphics rendering in real time. All the described tools run on commodity PCs without any special-purpose hardware for 3D graphics or signal processing.

2. SOUND ANALYSIS VISUALIZATION DISPLAYS

Visualization techniques have been used in many scientific domains. They take advantage of the strong pattern recognition abilities of the human visual system to reveal similarities, patterns and correlations both in space and time. Visualization is especially suited to areas that are exploratory in nature and where there are large amounts of data to be analyzed, such as sound analysis research. Sound analysis is a relatively recent area of research. It refers in general to any technique that helps a computer "understand" sound in a way similar to a human listener. Some examples of sound analysis are classification, segmentation, retrieval, and clustering. The use of 3D graphics and animation allows more information to be displayed than traditional static 2D displays.

Sound analysis is typically based on the calculation of feature vectors that describe the spectral content of the sound. There are two approaches to representing a sound file: it can be represented as a single feature vector (i.e., a point in the high-dimensional feature space) or as a time series of feature vectors (i.e., a trajectory).

As our focus is the development of tools and not the optimal feature front-end, we are not going to describe in detail the specific features we have used. Moreover, an important motivation for building these tools is exploring and evaluating the many possible features that are available. For more details about the supported features refer to [1]. The features used include means and variances of Spectral Centroid (the center of mass of the spectrum, a correlate of brightness), Spectral Flux (a measure of the time variation of the signal), Spectral Rolloff (a measure of the shape of the spectrum) and RMS energy (a correlate of loudness). Other features are Linear Prediction Coefficients (LPC) and Mel Frequency Cepstral Coefficients (MFCC) [2], which are perceptually motivated features used in speech processing and recognition. In addition, features computed directly from MPEG compressed audio can be used [3, 4]. Information about the dynamic range of the features is obtained by statistically analyzing the dataset of interest.
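The exact feature definitions are given in [1]; as a rough illustration of how such frame-based features can be computed from a magnitude spectrum, the following sketch (Python/numpy, not the MARSYAS code; the 85% rolloff threshold and the flux normalization are common conventions assumed here, not taken from the paper) computes centroid, rolloff, flux and RMS for one frame:

```python
# Illustrative sketch of frame-based spectral features, not the MARSYAS
# implementation used in the paper. The 0.85 rolloff threshold and the
# flux normalization are assumed conventions.
import numpy as np

def frame_features(frame, prev_mag, sample_rate):
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Spectral centroid: center of mass of the spectrum (brightness correlate).
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)

    # Spectral rolloff: frequency below which 85% of the magnitude lies.
    cumulative = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cumulative, 0.85 * cumulative[-1])]

    # Spectral flux: change between successive normalized magnitude spectra.
    m = mag / (np.sum(mag) + 1e-12)
    p = prev_mag / (np.sum(prev_mag) + 1e-12)
    flux = np.sum((m - p) ** 2)

    # RMS energy: loudness correlate.
    rms = np.sqrt(np.mean(frame ** 2))
    return np.array([centroid, rolloff, flux, rms]), mag
```

Means and variances of such per-frame values over a file or an analysis window then form the feature vectors that the tools consume.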

Principal components analysis (PCA) is a technique for reducing a high-dimensional feature space to a lower-dimensional one while retaining as much information from the original set as possible. For more details refer to [5]. We use PCA to map a high-dimensional set of features into a lower-dimensional set of tool inputs. The extraction of a principal component amounts to a variance-maximizing rotation of the original variable space. In other words, the first principal component is the axis passing through the centroid of the feature vectors that has the maximum variance, and therefore explains a large part of the underlying feature structure. The next principal component tries to maximize the variance not explained by the first. In this manner, consecutive orthogonal components are extracted.
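As a minimal sketch of this mapping (illustrative only, assuming a feature matrix with one row per sound; not the code used in the paper):

```python
# Minimal PCA sketch: project feature vectors onto their first k principal
# components so they can drive k tool inputs (e.g. three color channels or
# x, y, z coordinates). An illustration, not the paper's implementation.
import numpy as np

def pca_project(features, k=3):
    """features: (n_sounds, n_features) array; returns (n_sounds, k) tool inputs."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the orthogonal, variance-maximizing directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:k].T
    # Rescale each component to [0, 1] so it can be used directly as a tool input.
    mins, maxs = projected.min(axis=0), projected.max(axis=0)
    return (projected - mins) / (maxs - mins + 1e-12)
```

The same three outputs can drive either the three color channels of the TimbreGram or the x, y, z coordinates of the TimbreSpace described below.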


Figure 1: TimbreGrams for speech and classical music (RGB)

Clustering refers to grouping vectors into clusters based on their similarity. For automated clustering we have used the c-means algorithm [6]. In many cases the resulting clusters can be perceptually justified. For example, using 12 clusters for the sound effects dataset of 150 sounds yields reasonable clusters, such as one containing almost all the walking sounds and one containing most of the scraping sounds. Using visualization techniques we can explore different parameters and alternative clustering techniques in an easy and interactive way.
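For reference, a compact sketch of such an iterative clustering loop (a plain k-means/c-means iteration, assumed here as a stand-in for the exact variant of [6]):

```python
# Basic c-means (k-means) sketch: assign each feature vector to the nearest
# centroid, then recompute centroids, until assignments stop changing.
# A stand-in for the algorithm of [6], not the paper's exact code.
import numpy as np

def cmeans(features, n_clusters=12, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(n_clusters):
            members = features[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids
```

The resulting cluster labels can then be used, for instance, to color the points of the TimbreSpace.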

2.1. Case studies

• TimbreGram is a tool for visualizing sounds or collections of sounds where the feature vectors form a trajectory. It consists of a series of vertical color stripes where each stripe corresponds to a feature vector. Time is mapped from left to right. There are three tool inputs corresponding to the three color components for each stripe. The TimbreGram reveals similar sounds through color similarity and time periodicity. For example, the TimbreGrams of two walking sounds with different numbers of footsteps will have the same periodicity in color. PCA can be used to reduce the feature space to the three color inputs. The current options are RGB and HSV color space mapping.

Fig. 1 shows the TimbreGrams of six sound files (each 30 seconds long). Three of them contain speech and three contain classical music. Although color information is lost in the greyscale of the paper, music and speech separate clearly. The bottom right sound file (opera) is light purple and the speech segments are light green. Light and bright colors typically correspond to speech or singing (Fig. 1, left column). Purple and blue colors typically correspond to classical music (Fig. 1, right column). Of course, the mapping of features to colors depends on the specific training set used for the PCA analysis.

• TimbreSpace is a tool for visualizing sound collections where there is a single feature vector for each sound. Each sound (feature vector) is represented as a single point in a 3D space. PCA can be used to reduce the higher-dimensional feature space to the input x, y, z coordinates. The TimbreSpace reveals similarity and clustering of sounds. In addition, coloring of the points based on automatic clustering and bounding-box selection is supported. Typical graphics operations like zooming, rotating and scaling can be used to interact with the data. A list of the nearest points (sounds) to the mouse cursor in 3D is provided, and by clicking on the appropriate entry the user can hear the sound.

Figure 2: GenreGram showing male speaker and sports

Sometimes it is desirable to hear the sound points automatically without having to select them individually. We are exploring the use of principal curves [7] to move sequentially through this space.

• TimbreBall is a tool for visualizing sounds where the feature vectors form a trajectory. This is a real-time animated visualization where each feature vector is mapped to the x, y, z coordinates of a small ball inside a cube. The ball moves in the space as the sound is playing, following the corresponding feature vectors. Texture changes are easily visible as abrupt jumps in the trajectory of the ball. A shadow is provided for better depth perception (see Fig. 3).

• GenreGram is a dynamic real-time audio display targeted towards radio signals. The live radio audio signal is analyzed in real time and is classified into 12 categories: Male, Female, Sports, Classical, Country, Disco, HipHop, Fuzak, Jazz, Rock, Silence and Static. The classification is done using statistical pattern recognition classifiers trained on the collected FM-radio dataset. For each of these categories a confidence measure, ranging from 0.0 to 1.0, is calculated and used to move up or down cylinders corresponding to each category. Each cylinder is texture-mapped with a representative image of its category. The movement is also weighted by a separate classification decision about whether the signal is Music or Speech. Male, Female, and Sports are the Speech categories and the remaining ones are Music. The classification accuracy is about … for the Music/Speech decision, … for the Speech categories, and … for the Music categories. The focus is the development of the display tool; therefore the classification results are not a full-scale evaluation and are provided only as a reference.

In addition to being a nice demonstration tool for real-time automatic audio classification, the GenreGram gives valuable feedback both to the user and to the algorithm designer. Different classification decisions and their relative strengths are combined visually, revealing correlations and classification patterns. Since the boundaries between music genres are fuzzy, a display like this is more informative than a single classification decision. For example, most of the time a rap song will trigger Male Speech, Sports and HipHop. This exact case is shown in Fig. 2.
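A rough sketch of the confidence-to-cylinder mapping described above is given below; the smoothing factor and the exact speech/music weighting rule are illustrative assumptions rather than the paper's implementation:

```python
# Sketch of the GenreGram update: per-category confidences (0.0-1.0) set the
# target heights of 12 cylinders, with Speech categories (Male, Female, Sports)
# weighted by the speech confidence and the remaining (Music) categories by the
# music confidence. The weighting rule and smoothing factor are assumptions.
CATEGORIES = ["Male", "Female", "Sports", "Classical", "Country", "Disco",
              "HipHop", "Fuzak", "Jazz", "Rock", "Silence", "Static"]
SPEECH = {"Male", "Female", "Sports"}

def update_heights(heights, confidences, speech_conf, smoothing=0.8):
    """heights, confidences: dicts keyed by category name; speech_conf in [0, 1]."""
    for cat in CATEGORIES:
        weight = speech_conf if cat in SPEECH else (1.0 - speech_conf)
        target = confidences[cat] * weight
        # Smooth toward the target so the cylinders move rather than jump.
        heights[cat] = smoothing * heights[cat] + (1.0 - smoothing) * target
    return heights
```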


Figure 3: TimbreBall and PuCola Can

3. MODEL-BASED CONTROLLERS FOR SOUND SYNTHESIS

The goal of virtual reality systems is to provide the user with the illusion of a real world. Although graphics realism has improved a lot, sound is still very different from what we experience in the real world. One of the main reasons is that, unlike in the real world, objects in a virtual world do not make sounds as we interact with them. The use of pre-recorded PCM sounds does not allow the type of parametric interaction necessary for real-world sounds. Therefore 3D models that not only look similar to their real-world counterparts but also sound similar are very important.

A number of simple interfaces have been written in Java 3D, with the common theme of providing a user-controlled animated object connected directly to the parameters of the synthesis algorithm. This work is part of the Physically Oriented Library of Interactive Sound Effects (PhOLISE) project [8]. This project uses physical and physically-motivated analysis and synthesis algorithms such as modal synthesis, banded waveguides [9], and stochastic particle models [10] to provide interactive parametric models of real-world sound effects.

Modal synthesis is used for sound-generating objects/systems which can be characterized by relatively few resonant modes, and which are excited by impulsive sources (striking, slamming, tapping, etc.). Banded waveguides are a hybridization of modal synthesis and waveguide synthesis [11], and are good for modeling modal systems which can also be excited by rubbing, scraping, bowing, etc. Stochastic particle model synthesis (PhISEM, Physically Inspired Stochastic Event Modeling) is applicable to sounds caused by systems which can be characterized by interactions of many independent objects, such as ice cubes in a glass, leaves/gravel under walking feet, loose change in a pocket, etc.
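As a small illustration of the simplest of these techniques, a modal model renders an impulsively excited object as a sum of exponentially decaying sinusoids; the sketch below uses made-up mode parameters, not those of any PhOLISE model:

```python
# Minimal modal synthesis sketch: an impulsively excited object rendered as a
# sum of exponentially decaying sinusoids. The frequencies, decay rates and
# amplitudes are invented illustrative values, not a PhOLISE model.
import numpy as np

def modal_strike(freqs_hz, decays, amps, duration=1.0, sample_rate=44100):
    t = np.arange(int(duration * sample_rate)) / sample_rate
    out = np.zeros_like(t)
    for f, d, a in zip(freqs_hz, decays, amps):
        out += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return out / (np.max(np.abs(out)) + 1e-12)  # normalize to [-1, 1]

# Hypothetical three-mode "struck object" example.
signal = modal_strike(freqs_hz=[523.0, 1209.0, 2310.0],
                      decays=[3.0, 6.0, 9.0],
                      amps=[1.0, 0.5, 0.25])
```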

These programs are intended to be demonstration examples of how sound designers in virtual reality, augmented reality, games, and movie production can use parametric synthesis in creating believable real-time interactive sonic worlds.

3.1. Case studies

• PuCola Can, shown in Fig. 3, is a 3D model of a soda can that is slid across various surface textures. The input parameters of the tool are the sliding speed and the surface texture material, and changing them results in the appropriate changes to the sound in real time.

• Gear is a 3D model of a gear that is rotated using a slider. The corresponding sound is a real-time parametric synthesis of the sound of a turning wrench.

Figure 4: Gear and Particles

• Feet is a 3D model (still under development) of feet stepping on a surface. The main input parameters are gait, heel/toe timing, heel and toe material, weight and surface material. A parametric physical model of human walking then drives both the graphics and the synthesis models.

• Particles is a 3D particle model of particles (rocks, ice cubes, etc.) falling on a surface. The density, speed and material input parameters drive both the 3D model and the corresponding physical modeling synthesis algorithms.

4. IMPLEMENTATION AND DATASETS

The sound analysis is performed using MARSYAS [1], an object-oriented framework for audio analysis. The sound synthesis is performed using the Synthesis Toolkit [12]. The developed tools, written in Java 3D, act as clients to the analysis and synthesis servers. The software has been tested on Linux, Solaris, Irix and Windows (95, 98, NT) systems.
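To illustrate the client-server idea only: the message format, port and function in the sketch below are invented for this example and are not the actual MARSYAS/STK protocol, which is not specified in this paper.

```python
# Hypothetical illustration of the generic client-server architecture: a 3D
# control tool sends parameter updates (here as plain text lines) to an
# analysis or synthesis server over TCP. The message format and port are
# assumptions made for this sketch, not the protocol used by MARSYAS or STK.
import socket

def send_parameter(host, port, name, value):
    with socket.create_connection((host, port)) as sock:
        sock.sendall(f"{name} {value}\n".encode("ascii"))

# For example, a slider callback in a controller like Gear might call:
# send_parameter("localhost", 5005, "rotation_speed", 0.7)
```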

Four datasets were used as the basis for building and testing our tools. The first dataset consists of 913 isolated musical instrument notes of bowed, brass and woodwind instruments recorded from the McGill MUMS CDs. The second dataset consists of 150 isolated contact sounds collected from various sound effects libraries. This set of 150 sound effects is composed of everyday "interaction" sounds. By interaction we mean sounds that are caused and changed directly by our gestures, such as hitting, scraping, rubbing, walking, etc. They range in length from 0.4 to 5 seconds, and are the focus of two other projects, one on the synthesis of interaction sounds, and one on the perception of such sounds. The remaining two datasets are 120 FM radio audio clips, representing a variety of music genres, and 1000 rock song snippets (30 seconds long) in mp3 format.

These datasets were used to build the sound analysis visualization tools. Using realistic datasets is the only way to debug and evaluate visualization tools. In addition to feature extraction techniques, clustering and PCA require a training dataset in order to work. Evaluating visualization techniques is difficult and we have not performed any direct user studies on the visualization systems (although we have performed segmentation and similarity retrieval studies on the datasets). We have found that the tools make perceptual sense and in some cases have provided us with insights about the nature of the data and the analysis techniques we can use.


5. FUTURE WORK

The main directions for future work are the development of alternative sound-analysis visualization tools and the development of new model-based controllers for sound synthesis based on the same principles. Of course, new analysis and synthesis algorithms will require adjustment of our existing tools and development of new ones. More formal evaluations of the developed tools are planned for the future. Integrating the model-based controllers into a real virtual environment is another direction of future work.

6. ACKNOWLEDGEMENTS

This work was funded under NSF grant 9984087 and by gifts from Intel and Arial Foundation.

7. REFERENCES

[1] G. Tzanetakis and P. Cook, "Marsyas: A framework for audio analysis," Organised Sound, 2000 (to appear).

[2] M. Hunt, M. Lennig, and P. Mermelstein, "Experiments in syllable-based recognition of continuous speech," in Proc. ICASSP, 1980, pp. 880-883.

[3] D. Pye, "Content-based methods for the management of digital music," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2000.

[4] G. Tzanetakis and P. Cook, "Sound analysis using MPEG compressed audio," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, 2000.

[5] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.

[6] R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.

[7] T. Hermann, P. Meinicke, and H. Ritter, "Principal curve sonification," in Proc. Int. Conf. on Auditory Display (ICAD), 2000.

[8] P. Cook, "Toward physically-informed parametric synthesis of sound effects," in Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 1999. Invited keynote address.

[9] G. Essl and P. Cook, "Measurements and efficient simulations of bowed bars," Journal of the Acoustical Society of America (JASA), vol. 108, no. 1, pp. 379-388, 2000.

[10] P. Cook, "Physically inspired sonic modeling (PhISM): Synthesis of percussive sounds," Computer Music Journal, vol. 21, no. 3, September 1997.

[11] J. O. Smith, "Acoustic modeling using digital waveguides," in Musical Signal Processing, C. Roads, S. T. Pope, A. Piccialli, and G. De Poli, Eds., pp. 221-263, Swets and Zeitlinger, Netherlands, 1997.

[12] P. Cook and G. Scavone, "The Synthesis Toolkit (STK), version 2.1," in Proc. 1999 Int. Computer Music Conference (ICMC), Beijing, China, October 1999.
