Audio: Generation & Extraction
Music Composition – which approach?
• Feedforward NN can't store information about the past (or keep track of position in a song)
• RNN as a single-step predictor struggles with composition, too
• Vanishing gradients: error flow vanishes or grows exponentially
• Network can't deal with long-term dependencies
• But music is all about long-term dependencies!
Music
• Long-term dependencies define style:
• Structure spanning bars and notes contributes to metrical and phrasal structure
• How do we introduce structure at multiple levels?
• Eck and Schmidhuber → LSTM
Why LSTM?
• Designed to obtain constant error flow through time
• Protects the error from perturbations
• Uses linear units to overcome the decay problems of RNNs
• Input gate: protects the flow from perturbation by irrelevant inputs
• Output gate: protects other units from perturbation by irrelevant memory contents
• Forget gate: resets the memory cell when its content is obsolete (a minimal sketch of one step follows below)
Hochreiter & Schmidhuber, 1997
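A minimal sketch of one LSTM step, just to make the role of the three gates concrete. The variable names, the single stacked weight matrix, and the toy sizes in the usage example are conveniences of this sketch, not the exact formulation used in the cited papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. The memory cell c has a linear self-connection, which is
    what keeps the error signal from decaying; the gates decide what to write
    (input gate), what to keep (forget gate), and what to expose (output gate)."""
    z = W @ x + U @ h_prev + b           # stacked pre-activations for gates + candidate
    i, f, o, g = np.split(z, 4)          # input gate, forget gate, output gate, candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g               # forget gate resets obsolete content,
                                         # input gate blocks irrelevant inputs
    h = o * np.tanh(c)                   # output gate shields other units
    return h, c

# Tiny usage example with made-up sizes.
H, D = 8, 24
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = lstm_step(np.zeros(D), np.zeros(H), np.zeros(H), W, U, b)
```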
Data Representation
Chords:
Notes:
Eck and Schmidhuber, 2002
• Only quarter notes
• No rests
• Training melodies written by Eck
• Dataset of 4096 segments
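As a rough illustration of this representation, one segment can be encoded as binary note-on/off vectors, one per quarter-note time step. The specific pitch ranges and segment length below are assumptions for the sketch, not the paper's exact values.

```python
import numpy as np

# Hypothetical pitch ranges; the paper splits the input into chord notes and
# melody notes, but the exact MIDI ranges used here are an assumption.
CHORD_NOTES  = list(range(48, 60))   # 12 possible chord pitches
MELODY_NOTES = list(range(60, 72))   # 12 possible melody pitches
NOTE_INDEX = {n: i for i, n in enumerate(CHORD_NOTES + MELODY_NOTES)}

def encode_step(active_midi_notes):
    """One quarter-note time step as a binary on/off vector.
    Rests and shorter/longer note values cannot be expressed, which is
    exactly the limitation listed on the 'Issues' slide."""
    v = np.zeros(len(NOTE_INDEX), dtype=np.float32)
    for n in active_midi_notes:
        v[NOTE_INDEX[n]] = 1.0
    return v

# A 12-bar segment at 4 quarter notes per bar -> a (48, 24) binary matrix.
segment = np.stack([encode_step([48, 52, 55, 64]) for _ in range(48)])
```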
Experiment 1 – Learning Chords
• Objective: show that LSTM can learn/represent chord structure in the absence of melody
• Network:
• 4 cell blocks with 2 cells each, fully connected to each other + the input
• Output layer is fully connected to all cells and to the input layer
• Training & testing: predict the probability of a note being on or off
• Use the network's predictions for ensuing time steps with a decision threshold (see the sketch below)
• CAVEAT: treats outputs as statistically independent. This is untrue! (Issue #1)
• Result: generated chord sequences
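A sketch of the composition loop implied by those bullets: the trained network is used as a single-step predictor, its per-note probabilities are thresholded, and the result is fed back in as the next input. `model.predict` is a hypothetical stand-in for one forward pass of the trained LSTM, not an API from the paper.

```python
import numpy as np

def compose(model, seed_steps, n_steps, threshold=0.5):
    """Generate a sequence by single-step prediction: threshold the network's
    per-note probabilities and feed the result back in as the next input."""
    history = [np.asarray(s, dtype=np.float32) for s in seed_steps]
    for _ in range(n_steps):
        probs = model.predict(history[-1])
        # Issue #1 from the slides: each note is thresholded independently,
        # even though the notes of a chord are clearly not independent.
        history.append((probs > threshold).astype(np.float32))
    return np.stack(history)
```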
Experiment 2 – Learning Melody and Chords
• Can LSTM learn chord & melody structure, and use these structures for composition?
• Network:
• Difference from Experiment 1: chord cell blocks have recurrent connections to themselves + melody; melody cell blocks are only recurrently connected to melody (see the connectivity sketch below)
• Training: predict the probability for a note to be on or off
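One way to picture the restricted connectivity is as a binary mask on the recurrent weight matrix. The sketch below reads the slide as information flowing from chord blocks to melody blocks but not back (chord state updated from chord state only; melody state from both); the block sizes and this reading of the connection direction are assumptions of the sketch.

```python
import numpy as np

# Hypothetical block sizes; the paper's exact layout may differ.
N_CHORD, N_MELODY = 8, 8
N = N_CHORD + N_MELODY

# Recurrent connectivity mask: row = receiving unit, column = sending unit.
mask = np.zeros((N, N), dtype=np.float32)
mask[:N_CHORD, :N_CHORD] = 1.0   # chord blocks receive only from chord blocks
mask[N_CHORD:, :] = 1.0          # melody blocks receive from chord + melody blocks

def masked_recurrent_update(W_rec, h_prev):
    """Apply the connectivity mask before the recurrent matrix-vector product."""
    return (W_rec * mask) @ h_prev
```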
Sample composition
• Training set: http://people.idsia.ch/~juergen/blues/train.32.mp3
• Chord + melody sample: http://people.idsia.ch/~juergen/blues/lstm_0224_1510.32.mp3
Issues
• No objective way to judge the quality of compositions
• Repetition and similarity to the training set
• Considered notes to be independent
• Limited to quarter notes + no rests
• Uses symbolic representations (modified sheet notation) → how could it handle real-time performance music (MIDI or audio)?
• Would allow interaction (live improvisation)
Audio Extraction (source separation)
• How do we separate sources?
• Engineering approach: decompose the mixed audio signal into a spectrogram and assign each time-frequency element to a source
• Ideal binary mask: each element is attributed to the source with the largest magnitude in the source spectrograms
• This is then used to estimate the reference separation (see the sketch below)
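A minimal sketch of the ideal binary mask: given the (normally unavailable) source spectrograms, each time-frequency bin is assigned to the source with the larger magnitude, and masking the mixture spectrogram with the result gives the oracle reference separation.

```python
import numpy as np

def ideal_binary_mask(S_vocal, S_accomp):
    """Each time-frequency bin goes to the source whose spectrogram has the
    larger magnitude there (1 = vocal wins, 0 = accompaniment wins)."""
    return (np.abs(S_vocal) > np.abs(S_accomp)).astype(np.float32)

def apply_mask(S_mix, mask):
    """Masked mixture spectrogram; an inverse STFT of this (and of (1 - mask)
    times the mixture) gives the oracle reference separation."""
    return mask * S_mix
```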
DNN Approach
• Dataset: 63 pop songs (50 for training)
• Binary mask computed by comparing the magnitudes of the vocal and non-vocal spectrograms and assigning the mask a '1' when the vocal had the greater magnitude
DNN
• Trained a feed-forward DNN to predict binary masks for separating the vocal and non-vocal signals of a song
• The spectrogram window was unpacked into a vector
• Probabilistic binary mask: testing used a sliding window, with the model's outputs giving binary-mask predictions in sliding-window format
• Confidence threshold (alpha): thresholding yields the binary vocal mask Mv (see the sketch below)
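A sketch of the test-time procedure those bullets describe: slide a window over the mixture spectrogram, flatten each window into a vector for the feed-forward net, average the overlapping per-bin probabilities, and threshold at alpha to obtain the binary vocal mask Mv. The `model.predict` call and the window length are assumptions of the sketch, not the paper's exact settings.

```python
import numpy as np

def predict_vocal_mask(model, S_mix_mag, win=9, alpha=0.5):
    """Sliding-window mask prediction: average overlapping per-bin
    probabilities, then threshold at alpha to get the binary vocal mask Mv."""
    n_freq, n_frames = S_mix_mag.shape
    prob = np.zeros_like(S_mix_mag)
    count = np.zeros_like(S_mix_mag)
    for t in range(n_frames - win + 1):
        window = S_mix_mag[:, t:t + win]
        p = model.predict(window.reshape(-1)).reshape(n_freq, win)
        prob[:, t:t + win] += p
        count[:, t:t + win] += 1.0
    prob /= np.maximum(count, 1.0)
    return (prob > alpha).astype(np.float32)   # Mv
```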
Separation quality as a function of alpha
SIR (red) = signal-to-interference ratio
SDR (green) = signal-to-distortion ratio
SAR (blue) = signal-to-artefact ratio
SAR and SIR can be interpreted as energetic equivalents of the positive hit rate (SIR) and the false positive rate (SAR)
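These are the standard BSS-Eval measures; one way to compute them is with the mir_eval package, as in the sketch below (the vocal-first stacking order of the sources is just a convention of this example).

```python
import numpy as np
import mir_eval.separation

def evaluate_separation(ref_vocal, ref_accomp, est_vocal, est_accomp):
    """SDR / SIR / SAR for a vocal + accompaniment separation, via the
    standard BSS-Eval measures as implemented in mir_eval."""
    reference = np.stack([ref_vocal, ref_accomp])   # (n_sources, n_samples)
    estimated = np.stack([est_vocal, est_accomp])
    sdr, sir, sar, _perm = mir_eval.separation.bss_eval_sources(reference, estimated)
    return sdr, sir, sar
```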
Like-to-like Comparison
Plots mean SAR as a function of mean SIR for both models
The DNN provides ~3 dB better SAR performance for a given mean SIR: ~5 dB better for vocal signals and only a small advantage for non-vocal signals
The DNN seems to have biased its learning toward making good predictions via correct positive identification of vocal sounds