Audio: Generation & Extraction
Music Composition – which approach?
• Feedforward NN can't store information about the past (or keep track of position in a song)
• RNN as a single-step predictor struggles with composition, too
• Vanishing gradients: error flow vanishes or grows exponentially
• Network can't deal with long-term dependencies
• But music is all about long-term dependencies!
Music
• Long-term dependencies define style:
• Structure spanning bars and notes contributes to metrical and phrasal structure
• How do we introduce structure at multiple levels?
• Eck and Schmidhuber → LSTM
Why LSTM?
• Designed to obtain constant error flow through time
• Protects the error from perturbations
• Uses linear units to overcome the decay problems of RNNs
• Input gate: protects the flow from perturbation by irrelevant inputs
• Output gate: protects other units from perturbation by irrelevant memory contents
• Forget gate: resets the memory cell when its content is obsolete (a minimal sketch of one step follows below)
Hochreiter & Schmidhuber, 1997
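A minimal sketch of one LSTM step, just to make the role of the three gates concrete. The variable names, the single stacked weight matrix, and the toy sizes in the usage example are conveniences of this sketch, not the exact formulation used in the cited papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. The memory cell c has a linear self-connection, which is
    what keeps the error signal from decaying; the gates decide what to write
    (input gate), what to keep (forget gate), and what to expose (output gate)."""
    z = W @ x + U @ h_prev + b           # stacked pre-activations for gates + candidate
    i, f, o, g = np.split(z, 4)          # input gate, forget gate, output gate, candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g               # forget gate resets obsolete content,
                                         # input gate blocks irrelevant inputs
    h = o * np.tanh(c)                   # output gate shields other units
    return h, c

# Tiny usage example with made-up sizes.
H, D = 8, 24
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = lstm_step(np.zeros(D), np.zeros(H), np.zeros(H), W, U, b)
```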
Data Representation
Chords:
Notes:
Eck and Schmidhuber, 2002
• Only quarter notes
• No rests
• Training melodies written by Eck
• Dataset of 4096 segments
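As a rough illustration of this representation, one segment can be encoded as binary note-on/off vectors, one per quarter-note time step. The specific pitch ranges and segment length below are assumptions for the sketch, not the paper's exact values.

```python
import numpy as np

# Hypothetical pitch ranges; the paper splits the input into chord notes and
# melody notes, but the exact MIDI ranges used here are an assumption.
CHORD_NOTES  = list(range(48, 60))   # 12 possible chord pitches
MELODY_NOTES = list(range(60, 72))   # 12 possible melody pitches
NOTE_INDEX = {n: i for i, n in enumerate(CHORD_NOTES + MELODY_NOTES)}

def encode_step(active_midi_notes):
    """One quarter-note time step as a binary on/off vector.
    Rests and shorter/longer note values cannot be expressed, which is
    exactly the limitation listed on the 'Issues' slide."""
    v = np.zeros(len(NOTE_INDEX), dtype=np.float32)
    for n in active_midi_notes:
        v[NOTE_INDEX[n]] = 1.0
    return v

# A 12-bar segment at 4 quarter notes per bar -> a (48, 24) binary matrix.
segment = np.stack([encode_step([48, 52, 55, 64]) for _ in range(48)])
```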
Experiment 1 – Learning Chords
• Objective: show that LSTM can learn/represent chord structure in the absence of melody
• Network:
• 4 cell blocks with 2 cells each, fully connected to each other + the input
• Output layer is fully connected to all cells and to the input layer
• Training & testing: predict the probability of a note being on or off
• Use the network's predictions for ensuing time steps with a decision threshold (see the sketch below)
• CAVEAT: treats outputs as statistically independent. This is untrue! (Issue #1)
• Result: generated chord sequences
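A sketch of the composition loop implied by those bullets: the trained network is used as a single-step predictor, its per-note probabilities are thresholded, and the result is fed back in as the next input. `model.predict` is a hypothetical stand-in for one forward pass of the trained LSTM, not an API from the paper.

```python
import numpy as np

def compose(model, seed_steps, n_steps, threshold=0.5):
    """Generate a sequence by single-step prediction: threshold the network's
    per-note probabilities and feed the result back in as the next input."""
    history = [np.asarray(s, dtype=np.float32) for s in seed_steps]
    for _ in range(n_steps):
        probs = model.predict(history[-1])
        # Issue #1 from the slides: each note is thresholded independently,
        # even though the notes of a chord are clearly not independent.
        history.append((probs > threshold).astype(np.float32))
    return np.stack(history)
```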
Experiment 2 – Learning Melody and Chords
• Can LSTM learn chord & melody structure, and use these structures for composition?
• Network:
• Difference from Experiment 1: chord cell blocks have recurrent connections to themselves + melody; melody cell blocks are only recurrently connected to melody (see the connectivity sketch below)
• Training: predict the probability for a note to be on or off
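One way to picture the restricted connectivity is as a binary mask on the recurrent weight matrix. The sketch below reads the slide as information flowing from chord blocks to melody blocks but not back (chord state updated from chord state only; melody state from both); the block sizes and this reading of the connection direction are assumptions of the sketch.

```python
import numpy as np

# Hypothetical block sizes; the paper's exact layout may differ.
N_CHORD, N_MELODY = 8, 8
N = N_CHORD + N_MELODY

# Recurrent connectivity mask: row = receiving unit, column = sending unit.
mask = np.zeros((N, N), dtype=np.float32)
mask[:N_CHORD, :N_CHORD] = 1.0   # chord blocks receive only from chord blocks
mask[N_CHORD:, :] = 1.0          # melody blocks receive from chord + melody blocks

def masked_recurrent_update(W_rec, h_prev):
    """Apply the connectivity mask before the recurrent matrix-vector product."""
    return (W_rec * mask) @ h_prev
```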
Sample composition
• Training set: http://people.idsia.ch/~juergen/blues/train.32.mp3
• Chord + melody sample: http://people.idsia.ch/~juergen/blues/lstm_0224_1510.32.mp3
Issues
• No objective way to judge the quality of compositions
• Repetition and similarity to the training set
• Considered notes to be independent
• Limited to quarter notes + no rests
• Uses symbolic representations (modified sheet notation) → how could it handle real-time performance music (MIDI or audio)?
• Would allow interaction (live improvisation)
Audio Extraction (source separation)
• How do we separate sources?
• Engineering approach: decompose the mixed audio signal into a spectrogram and assign each time-frequency element to a source
• Ideal binary mask: each element is attributed to the source with the largest magnitude in the source spectrograms
• This is then used to estimate the reference separation (see the sketch below)
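A minimal sketch of the ideal binary mask: given the (normally unavailable) source spectrograms, each time-frequency bin is assigned to the source with the larger magnitude, and masking the mixture spectrogram with the result gives the oracle reference separation.

```python
import numpy as np

def ideal_binary_mask(S_vocal, S_accomp):
    """Each time-frequency bin goes to the source whose spectrogram has the
    larger magnitude there (1 = vocal wins, 0 = accompaniment wins)."""
    return (np.abs(S_vocal) > np.abs(S_accomp)).astype(np.float32)

def apply_mask(S_mix, mask):
    """Masked mixture spectrogram; an inverse STFT of this (and of (1 - mask)
    times the mixture) gives the oracle reference separation."""
    return mask * S_mix
```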
DNN Approach
• Dataset: 63 pop songs (50 for training)
• Binary mask computed by comparing the magnitudes of the vocal and non-vocal spectrograms and assigning the mask a '1' when the vocal had the greater magnitude
DNN
• Trained a feed-forward DNN to predict binary masks for separating the vocal and non-vocal signals of a song
• The spectrogram window was unpacked into a vector
• Probabilistic binary mask: testing used a sliding window, with the model's outputs giving binary-mask predictions in sliding-window format
• Confidence threshold (alpha): thresholding yields the binary vocal mask Mv (see the sketch below)
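A sketch of the test-time procedure those bullets describe: slide a window over the mixture spectrogram, flatten each window into a vector for the feed-forward net, average the overlapping per-bin probabilities, and threshold at alpha to obtain the binary vocal mask Mv. The `model.predict` call and the window length are assumptions of the sketch, not the paper's exact settings.

```python
import numpy as np

def predict_vocal_mask(model, S_mix_mag, win=9, alpha=0.5):
    """Sliding-window mask prediction: average overlapping per-bin
    probabilities, then threshold at alpha to get the binary vocal mask Mv."""
    n_freq, n_frames = S_mix_mag.shape
    prob = np.zeros_like(S_mix_mag)
    count = np.zeros_like(S_mix_mag)
    for t in range(n_frames - win + 1):
        window = S_mix_mag[:, t:t + win]
        p = model.predict(window.reshape(-1)).reshape(n_freq, win)
        prob[:, t:t + win] += p
        count[:, t:t + win] += 1.0
    prob /= np.maximum(count, 1.0)
    return (prob > alpha).astype(np.float32)   # Mv
```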
Separation quality as a function of alpha
SIR (red) = signal-to-interference ratio
SDR (green) = signal-to-distortion ratio
SAR (blue) = signal-to-artefact ratio
SAR and SIR can be interpreted as energetic equivalents of the positive hit rate (SIR) and the false positive rate (SAR)
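These are the standard BSS-Eval measures; one way to compute them is with the mir_eval package, as in the sketch below (the vocal-first stacking order of the sources is just a convention of this example).

```python
import numpy as np
import mir_eval.separation

def evaluate_separation(ref_vocal, ref_accomp, est_vocal, est_accomp):
    """SDR / SIR / SAR for a vocal + accompaniment separation, via the
    standard BSS-Eval measures as implemented in mir_eval."""
    reference = np.stack([ref_vocal, ref_accomp])   # (n_sources, n_samples)
    estimated = np.stack([est_vocal, est_accomp])
    sdr, sir, sar, _perm = mir_eval.separation.bss_eval_sources(reference, estimated)
    return sdr, sir, sar
```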
Like-to-like Comparison
Plots mean SAR as a function of mean SIR for both models
The DNN provides ~3 dB better SAR performance for a given mean SIR: ~5 dB better for vocal signals and only a small advantage for non-vocal signals
The DNN seems to have biased its learning toward making good predictions via correct positive identification of vocal sounds