Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, December 7-9, 2000

3D GRAPHICS TOOLS FOR SOUND COLLECTIONS

George Tzanetakis

Computer Science Department, Princeton University

[email protected]

Perry Cook

Computer Science Department and Music Department, Princeton University

[email protected]

ABSTRACT

Most of the current tools for working with sound operate on single soundfiles, use 2D graphics, and offer limited interaction to the user. In this paper we describe a set of tools for working with collections of sounds that are based on interactive 3D graphics. These tools form two families: sound analysis visualization displays and model-based controllers for sound synthesis algorithms.

We describe the general techniques we have used to develop these tools and give specific case studies from each family. Several collections of sounds were used for development and evaluation: a set of musical instrument tones, a set of sound effects, a set of FM radio audio clips belonging to several music genres, and a set of mp3 rock song snippets.

1. INTRODUCTION

There are many existing tools for sound analysis and synthesis. Typically these tools work on single soundfiles and use 2D graphics. Some attempts at using 3D graphics have been made (such as waterfall spectrogram displays), but typically they offer very limited interaction to the user.

In this paper we describe some recently developed tools that are based on interactive 3D graphics. These tools can be grouped into two families. The first family consists of visualization displays for sound analysis. The second family consists of model-based controllers for sound synthesis algorithms. The primary motivation behind this work has been using 3D graphics for experimenting with large collections of sounds rather than single audio files.

Each tool is parametrized by its inputs and options. We have tried to decouple the tools from a specific analysis front-end and synthesis back-end. This is achieved through the use of different mappings of analysis or synthesis data to the tool inputs and a generic client-server architecture. In this way, other researchers can easily integrate these 3D graphics tools with their own applications. The tool options control the visual appearance. Collections of music instrument tones, sound effects, FM radio audio clips belonging to several music genres, and rock mp3 song snippets were used for developing and evaluating these tools.

An additional motivation for the development of these tools is that recent hardware is fast enough to handle audio analysis or synthesis together with 3D graphics rendering in real time. All the described tools run on commodity PCs without any special-purpose hardware for 3D graphics or signal processing.

2. SOUND ANALYSIS VISUALIZATION DISPLAYS

Visualization techniques have been used in many scientific domains. They take advantage of the strong pattern recognition abilities of the human visual system to reveal similarities, patterns and correlations both in space and time. Visualization is especially suited to areas that are exploratory in nature and where there are large amounts of data to be analyzed, such as sound analysis research. Sound analysis is a relatively recent area of research. It refers in general to any technique that helps a computer "understand" sound in a way similar to a human listener. Some examples of sound analysis are classification, segmentation, retrieval, and clustering. The use of 3D graphics and animation allows more information to be displayed than traditional static 2D displays.

Sound analysis is typically based on the calculation of feature vectors that describe the spectral content of the sound. There are two approaches to representing a sound file: it can be represented as a single feature vector (i.e., a point in the high-dimensional feature space) or as a time series of feature vectors (i.e., a trajectory).

As our focus is the development of tools and not the optimal feature front-end, we are not going to describe in detail the specific features we have used. Moreover, an important motivation for building these tools is exploring and evaluating the many possible features that are available. For more details about the supported features refer to [1]. The features used include means and variances of Spectral Centroid (the center of mass of the spectrum, a correlate of brightness), Spectral Flux (a measure of the time variation of the signal), Spectral Rolloff (a measure of the shape of the spectrum) and RMS energy (a correlate of loudness). Other features are Linear Prediction Coefficients (LPC) and Mel Frequency Cepstral Coefficients (MFCC) [2], which are perceptually motivated features used in speech processing and recognition. In addition, features computed directly from MPEG compressed audio can be used [3, 4]. Information about the dynamic range of the features is obtained by statistically analyzing the dataset of interest.
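The exact feature definitions are given in [1]; as a rough illustration of how such frame-based features can be computed from a magnitude spectrum, the following sketch (Python/numpy, not the MARSYAS code; the 85% rolloff threshold and the flux normalization are common conventions assumed here, not taken from the paper) computes centroid, rolloff, flux and RMS for one frame:

```python
# Illustrative sketch of frame-based spectral features, not the MARSYAS
# implementation used in the paper. The 0.85 rolloff threshold and the
# flux normalization are assumed conventions.
import numpy as np

def frame_features(frame, prev_mag, sample_rate):
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Spectral centroid: center of mass of the spectrum (brightness correlate).
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)

    # Spectral rolloff: frequency below which 85% of the magnitude lies.
    cumulative = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cumulative, 0.85 * cumulative[-1])]

    # Spectral flux: change between successive normalized magnitude spectra.
    m = mag / (np.sum(mag) + 1e-12)
    p = prev_mag / (np.sum(prev_mag) + 1e-12)
    flux = np.sum((m - p) ** 2)

    # RMS energy: loudness correlate.
    rms = np.sqrt(np.mean(frame ** 2))
    return np.array([centroid, rolloff, flux, rms]), mag
```

Means and variances of such per-frame values over a file or an analysis window then form the feature vectors that the tools consume.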

Principal components analysis (PCA) is a technique for reducing a high-dimensional feature space to a lower-dimensional one while retaining as much information from the original set as possible. For more details refer to [5]. We use PCA to map a high-dimensional set of features into a lower-dimensional set of tool inputs. The extraction of a principal component amounts to a variance-maximizing rotation of the original variable space. In other words, the first principal component is the axis passing through the centroid of the feature vectors that has the maximum variance, and therefore explains a large part of the underlying feature structure. The next principal component tries to maximize the variance not explained by the first. In this manner, consecutive orthogonal components are extracted.
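As a minimal sketch of this mapping (illustrative only, assuming a feature matrix with one row per sound; not the code used in the paper):

```python
# Minimal PCA sketch: project feature vectors onto their first k principal
# components so they can drive k tool inputs (e.g. three color channels or
# x, y, z coordinates). An illustration, not the paper's implementation.
import numpy as np

def pca_project(features, k=3):
    """features: (n_sounds, n_features) array; returns (n_sounds, k) tool inputs."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the orthogonal, variance-maximizing directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:k].T
    # Rescale each component to [0, 1] so it can be used directly as a tool input.
    mins, maxs = projected.min(axis=0), projected.max(axis=0)
    return (projected - mins) / (maxs - mins + 1e-12)
```

The same three outputs can drive either the three color channels of the TimbreGram or the x, y, z coordinates of the TimbreSpace described below.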


Figure 1: TimbreGrams for speech and classical music (RGB)

Clustering refers to grouping vectors into clusters based on their similarity. For automated clustering we have used the c-means algorithm [6]. In many cases the resulting clusters can be perceptually justified. For example, using 12 clusters for the sound effects dataset of 150 sounds yields reasonable clusters, such as one containing almost all the walking sounds and one containing most of the scraping sounds. Using visualization techniques we can explore different parameters and alternative clustering techniques in an easy and interactive way.
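For reference, a compact sketch of such an iterative clustering loop (a plain k-means/c-means iteration, assumed here as a stand-in for the exact variant of [6]):

```python
# Basic c-means (k-means) sketch: assign each feature vector to the nearest
# centroid, then recompute centroids, until assignments stop changing.
# A stand-in for the algorithm of [6], not the paper's exact code.
import numpy as np

def cmeans(features, n_clusters=12, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(n_clusters):
            members = features[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids
```

The resulting cluster labels can then be used, for instance, to color the points of the TimbreSpace.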

2.1. Case studies

• TimbreGram is a tool for visualizing sounds or collections of sounds where the feature vectors form a trajectory. It consists of a series of vertical color stripes where each stripe corresponds to a feature vector. Time is mapped from left to right. There are three tool inputs corresponding to the three color components for each stripe. The TimbreGram reveals similar sounds through color similarity and time periodicity. For example, the TimbreGrams of two walking sounds with different numbers of footsteps will have the same periodicity in color. PCA can be used to reduce the feature space to the three color inputs. The current options are RGB and HSV color space mapping.

Fig. 1 shows the TimbreGrams of six sound files (each 30 seconds long). Three of them contain speech and three contain classical music. Although color information is lost in the greyscale of the paper, music and speech separate clearly. The bottom right sound file (opera) is light purple and the speech segments are light green. Light and bright colors typically correspond to speech or singing (Fig. 1, left column). Purple and blue colors typically correspond to classical music (Fig. 1, right column). Of course, the mapping of features to colors depends on the specific training set used for the PCA analysis.

• TimbreSpace is a tool for visualizing sound collections where there is a single feature vector for each sound. Each sound (feature vector) is represented as a single point in a 3D space. PCA can be used to reduce the higher-dimensional feature space to the input x, y, z coordinates. The TimbreSpace reveals similarity and clustering of sounds. In addition, coloring of the points based on automatic clustering and bounding-box selection is supported. Typical graphics operations like zooming, rotating and scaling can be used to interact with the data. A list of the nearest points (sounds) to the mouse cursor in 3D is provided, and by clicking on the appropriate entry the user can hear the sound.

Figure 2: GenreGram showing male speaker and sports

Sometimes it is desirable to hear the sound points automatically without having to select them individually. We are exploring the use of principal curves [7] to move sequentially through this space.

• TimbreBall is a tool for visualizing sounds where the feature vectors form a trajectory. This is a real-time animated visualization where each feature vector is mapped to the x, y, z coordinates of a small ball inside a cube. The ball moves in the space as the sound is playing, following the corresponding feature vectors. Texture changes are easily visible as abrupt jumps in the trajectory of the ball. A shadow is provided for better depth perception (see Fig. 3).

• GenreGram is a dynamic real-time audio display targeted towards radio signals. The live radio audio signal is analyzed in real time and is classified into 12 categories: Male, Female, Sports, Classical, Country, Disco, HipHop, Fuzak, Jazz, Rock, Silence and Static. The classification is done using statistical pattern recognition classifiers trained on the collected FM-radio dataset. For each of these categories a confidence measure, ranging from 0.0 to 1.0, is calculated and used to move up or down cylinders corresponding to each category. Each cylinder is texture-mapped with a representative image of its category. The movement is also weighted by a separate classification decision about whether the signal is Music or Speech. Male, Female, and Sports are the Speech categories and the remaining ones are Music. The classification accuracy is about … for the Music/Speech decision, … for the Speech categories, and … for the Music categories. The focus is the development of the display tool; therefore the classification results are not a full-scale evaluation and are provided only as a reference.

In addition to being a nice demonstration tool for real-time automatic audio classification, the GenreGram gives valuable feedback both to the user and to the algorithm designer. Different classification decisions and their relative strengths are combined visually, revealing correlations and classification patterns. Since the boundaries between music genres are fuzzy, a display like this is more informative than a single classification decision. For example, most of the time a rap song will trigger Male Speech, Sports and HipHop. This exact case is shown in Fig. 2.
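A rough sketch of the confidence-to-cylinder mapping described above is given below; the smoothing factor and the exact speech/music weighting rule are illustrative assumptions rather than the paper's implementation:

```python
# Sketch of the GenreGram update: per-category confidences (0.0-1.0) set the
# target heights of 12 cylinders, with Speech categories (Male, Female, Sports)
# weighted by the speech confidence and the remaining (Music) categories by the
# music confidence. The weighting rule and smoothing factor are assumptions.
CATEGORIES = ["Male", "Female", "Sports", "Classical", "Country", "Disco",
              "HipHop", "Fuzak", "Jazz", "Rock", "Silence", "Static"]
SPEECH = {"Male", "Female", "Sports"}

def update_heights(heights, confidences, speech_conf, smoothing=0.8):
    """heights, confidences: dicts keyed by category name; speech_conf in [0, 1]."""
    for cat in CATEGORIES:
        weight = speech_conf if cat in SPEECH else (1.0 - speech_conf)
        target = confidences[cat] * weight
        # Smooth toward the target so the cylinders move rather than jump.
        heights[cat] = smoothing * heights[cat] + (1.0 - smoothing) * target
    return heights
```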


Figure 3: TimbreBall and PuCola Can

3. MODEL-BASED CONTROLLERS FOR SOUND SYNTHESIS

The goal of virtual reality systems is to provide the user with the illusion of a real world. Although graphics realism has improved a lot, sound is still very different from what we experience in the real world. One of the main reasons is that, unlike in the real world, objects in a virtual world do not make sounds as we interact with them. The use of pre-recorded PCM sounds does not allow the type of parametric interaction necessary for real-world sounds. Therefore 3D models that not only look similar to their real-world counterparts but also sound similar are very important.

A number of simple interfaces have been written in Java 3D, with the common theme of providing a user-controlled animated object connected directly to the parameters of the synthesis algorithm. This work is part of the Physically Oriented Library of Interactive Sound Effects (PhOLISE) project [8]. This project uses physical and physically-motivated analysis and synthesis algorithms such as modal synthesis, banded waveguides [9], and stochastic particle models [10] to provide interactive parametric models of real-world sound effects.

Modal synthesis is used for sound-generating objects/systems which can be characterized by relatively few resonant modes, and which are excited by impulsive sources (striking, slamming, tapping, etc.). Banded waveguides are a hybridization of modal synthesis and waveguide synthesis [11], and are good for modeling modal systems which can also be excited by rubbing, scraping, bowing, etc. Stochastic particle model synthesis (PhISEM, Physically Inspired Stochastic Event Modeling) is applicable to sounds caused by systems which can be characterized by interactions of many independent objects, such as ice cubes in a glass, leaves/gravel under walking feet, loose change in a pocket, etc.
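As a small illustration of the simplest of these techniques, a modal model renders an impulsively excited object as a sum of exponentially decaying sinusoids; the sketch below uses made-up mode parameters, not those of any PhOLISE model:

```python
# Minimal modal synthesis sketch: an impulsively excited object rendered as a
# sum of exponentially decaying sinusoids. The frequencies, decay rates and
# amplitudes are invented illustrative values, not a PhOLISE model.
import numpy as np

def modal_strike(freqs_hz, decays, amps, duration=1.0, sample_rate=44100):
    t = np.arange(int(duration * sample_rate)) / sample_rate
    out = np.zeros_like(t)
    for f, d, a in zip(freqs_hz, decays, amps):
        out += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return out / (np.max(np.abs(out)) + 1e-12)  # normalize to [-1, 1]

# Hypothetical three-mode "struck object" example.
signal = modal_strike(freqs_hz=[523.0, 1209.0, 2310.0],
                      decays=[3.0, 6.0, 9.0],
                      amps=[1.0, 0.5, 0.25])
```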

These programs are intended to be demonstration examples of how sound designers in virtual reality, augmented reality, games, and movie production can use parametric synthesis in creating believable real-time interactive sonic worlds.

3.1. Case studies

• PuCola Can, shown in Fig. 3, is a 3D model of a soda can that is slid across various surface textures. The input parameters of the tool are the sliding speed and the surface texture material, and changing them results in the appropriate changes to the sound in real time.

• Gear is a 3D model of a gear that is rotated using a slider. The corresponding sound is a real-time parametric synthesis of the sound of a turning wrench.

Figure 4: Gear and Particles

• Feet is a 3D model (still under development) of feet stepping on a surface. The main input parameters are gait, heel/toe timing, heel and toe material, weight and surface material. A parametric physical model of human walking then drives both the graphics and the synthesis models.

• Particles is a 3D particle model of particles (rocks, ice cubes, etc.) falling on a surface. The density, speed and material input parameters drive both the 3D model and the corresponding physical modeling synthesis algorithms.

4. IMPLEMENTATION AND DATASETS

The sound analysis is performed using MARSYAS [1], an object-oriented framework for audio analysis. The sound synthesis is performed using the Synthesis Toolkit [12]. The developed tools, written in Java 3D, act as clients to the analysis and synthesis servers. The software has been tested on Linux, Solaris, Irix and Windows (95, 98, NT) systems.
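To illustrate the client-server idea only: the message format, port and function in the sketch below are invented for this example and are not the actual MARSYAS/STK protocol, which is not specified in this paper.

```python
# Hypothetical illustration of the generic client-server architecture: a 3D
# control tool sends parameter updates (here as plain text lines) to an
# analysis or synthesis server over TCP. The message format and port are
# assumptions made for this sketch, not the protocol used by MARSYAS or STK.
import socket

def send_parameter(host, port, name, value):
    with socket.create_connection((host, port)) as sock:
        sock.sendall(f"{name} {value}\n".encode("ascii"))

# For example, a slider callback in a controller like Gear might call:
# send_parameter("localhost", 5005, "rotation_speed", 0.7)
```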

Four datasets were used as the basis for building and testing our tools. The first dataset consists of 913 isolated musical instrument notes of bowed, brass and woodwind instruments recorded from the McGill MUMS CDs. The second dataset consists of 150 isolated contact sounds collected from various sound effects libraries. This set of 150 sound effects is composed of everyday "interaction" sounds. By interaction we mean sounds that are caused and changed directly by our gestures, such as hitting, scraping, rubbing, walking, etc. They range in length from 0.4 to 5 seconds, and are the focus of two other projects, one on the synthesis of interaction sounds, and one on the perception of such sounds. The remaining two datasets are 120 FM radio audio clips, representing a variety of music genres, and 1000 rock song snippets (30 seconds long) in mp3 format.

These datasets were used to build the sound analysis visualization tools. Using realistic datasets is the only way to debug and evaluate visualization tools. In addition to feature extraction techniques, clustering and PCA require a training dataset in order to work. Evaluating visualization techniques is difficult and we have not performed any direct user studies on the visualization systems (although we have performed segmentation and similarity retrieval studies on the datasets). We have found that the tools make perceptual sense and in some cases have provided us with insights about the nature of the data and the analysis techniques we can use.


5. FUTURE WORK

The main directions for future work are the development of alternative sound-analysis visualization tools and the development of new model-based controllers for sound synthesis based on the same principles. Of course, new analysis and synthesis algorithms will require adjustment of our existing tools and development of new ones. More formal evaluations of the developed tools are planned for the future. Integrating the model-based controllers into a real virtual environment is another direction of future work.

6. ACKNOWLEDGEMENTS

This work was funded under NSF grant 9984087 and by gifts from Intel and Arial Foundation.

7. REFERENCES

[1] G. Tzanetakis and P. Cook, "Marsyas: A framework for audio analysis," Organised Sound, 2000 (to appear).

[2] M. Hunt, M. Lennig, and P. Mermelstein, "Experiments in syllable-based recognition of continuous speech," in Proc. ICASSP, 1980, pp. 880-883.

[3] D. Pye, "Content-based methods for the management of digital music," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2000.

[4] G. Tzanetakis and P. Cook, "Sound analysis using MPEG compressed audio," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, 2000.

[5] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.

[6] R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.

[7] T. Hermann, P. Meinicke, and H. Ritter, "Principal curve sonification," in Proc. Int. Conf. on Auditory Display (ICAD), 2000.

[8] P. Cook, "Toward physically-informed parametric synthesis of sound effects," in Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 1999. Invited keynote address.

[9] G. Essl and P. Cook, "Measurements and efficient simulations of bowed bars," Journal of the Acoustical Society of America (JASA), vol. 108, no. 1, pp. 379-388, 2000.

[10] P. Cook, "Physically inspired sonic modeling (PhISM): Synthesis of percussive sounds," Computer Music Journal, vol. 21, no. 3, September 1997.

[11] J. O. Smith, "Acoustic modeling using digital waveguides," in Musical Signal Processing, C. Roads, S. T. Pope, A. Piccialli, and G. De Poli, Eds., pp. 221-263, Swets and Zeitlinger, Netherlands, 1997.

[12] P. Cook and G. Scavone, "The Synthesis Toolkit (STK), version 2.1," in Proc. 1999 Int. Computer Music Conference (ICMC), Beijing, China, October 1999.
