Data Fusion Techniques and Application
Guangyu Zhou
Reference paper: Zheng Yu: Methodologies for Cross-Domain Data Fusion: An Overview
Agenda
§ Introduction
§ Related work
§ Data fusion techniques & applications
§ Stage-based methods
§ Feature level-based methods
§ Semantic meaning-based data fusion methods
§ Summary
What is data fusion?
§ Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source (Wikipedia)
Why data fusion?
§ In the big data era, we face a diversity of datasets from different sources in different domains, consisting of multiple modalities:
§ Representation, distribution, scale, and density.
§ How to unlock the power of knowledge from multiple disparate (but potentially connected) datasets?
§ Treating different datasets equally or simply concatenating the features from disparate datasets?
§ Use advanced data fusion techniques that can fuse knowledge from various datasets organically in a machine learning and data mining task
Related Work
§ Relation to Traditional Data Integration
Related Work
§ Relation to Heterogeneous Information Network
§ A heterogeneous information network only links objects within a single domain:
§ Bibliographic network: authors, papers, and conferences.
§ Flickr information network: users, images, tags, and comments.
§ Cross-domain data fusion aims to fuse data across different domains:
§ Traffic data, social media, and air quality
§ A heterogeneous network may not be able to find explicit links with semantic meanings between objects of different domains.
Data fusion methodologies
§ Stage-based methods
§ Feature level-based methods
§ Semantic meaning-based data fusion methods
§ multi-view learning-based
§ similarity-based
§ probabilistic dependency-based
§ and transfer learning-based methods.
Stage-based data fusion methods
§ Use different datasets at different stages of a data mining task.
§ Datasets are loosely coupled, without any requirements on the consistency of their modalities.
§ Can be a meta-approach used together with other data fusion methods
Map partition and graph building for taxi trajectory
Friend recommendation
§ Stages
§ I. Detect stay points
§ II. Map to POI vector
§ III. Hierarchical clustering
§ IV. Partial tree
§ V. Hierarchical graph
§ -> comparable (from same tree)
Feature-level-based data fusion
§ Direct Concatenation
§ Treat features extracted from different datasets equally, concatenating them sequentially into a feature vector
§ Limitations:
§ Over-fitting in the case of a small training sample, and the specific statistical property of each view is ignored.
§ Difficult to discover highly non-linear relationships that exist between low-level features across different modalities.
§ Redundancies and dependencies between features extracted from different datasets, which may be correlated.
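Direct concatenation can be sketched in a few lines of NumPy. This is a minimal illustration, not taken from the slides: the `traffic` and `weather` arrays are hypothetical views of the same six objects, and the per-view z-score step is an added assumption (a common precaution when views differ in scale).

```python
import numpy as np

def zscore(view):
    """Standardize each column of one view to zero mean and unit variance."""
    mean = view.mean(axis=0)
    std = view.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (view - mean) / std

def concatenate_views(views):
    """Standardize each view separately, then join them column-wise."""
    return np.hstack([zscore(v) for v in views])

rng = np.random.default_rng(0)
traffic = rng.normal(50.0, 10.0, size=(6, 3))  # hypothetical traffic features
weather = rng.normal(0.5, 0.1, size=(6, 2))    # hypothetical weather features

fused = concatenate_views([traffic, weather])
print(fused.shape)  # (6, 5): one row per object, all features side by side
```

The fused matrix still carries every limitation listed above: nothing in it models redundancy or non-linear relations between the two views.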
Feature-level-based data fusion
§ Direct Concatenation + sparsity regularization:
§ Handles the feature redundancy problem
§ Dual regularization (i.e., zero-mean Gaussian plus inverse-gamma)
§ Regularize most feature weights to be zero or close to zero via a Bayesian sparse prior
§ Allow for the possibility of a model learning large weights for significant features
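The effect of a sparsity-inducing regularizer can be illustrated with a small sketch. Note the substitution: the code below uses an L1 (lasso) penalty solved by iterative soft-thresholding (ISTA) as a simpler stand-in, not the Bayesian Gaussian plus inverse-gamma prior named on the slide. The data and the names `lasso_ista` and `true_w` are made up for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink every weight toward zero by t; weights smaller than t become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, lam=0.1, n_iter=500):
    """Minimize (1/2n)||Xw - y||^2 + lam * ||w||_1 with ISTA."""
    n, d = X.shape
    w = np.zeros(d)
    lipschitz = np.linalg.norm(X, 2) ** 2 / n  # largest curvature of the smooth term
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / lipschitz, lam / lipschitz)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # concatenated features (hypothetical)
true_w = np.zeros(10)
true_w[[0, 3]] = [2.0, -3.0]         # only two features actually matter
y = X @ true_w + 0.01 * rng.normal(size=200)

w = lasso_ista(X, y, lam=0.1)
print(np.round(w, 2))
```

Most weights are driven to zero while the two significant features keep large weights, which is the behavior the slide attributes to the sparse prior.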
Feature-level-based data fusion
§ DNN-Based Data Fusion
§ Using supervised, unsupervised, and semi-supervised approaches, Deep Learning learns multiple levels of representation and abstraction
§ Unified feature representation from disparate datasets
DNN-Based Data Fusion
§ Deep autoencoder models of feature representation between 2 modalities (audio + video)
Multimodal Deep Boltzmann Machine
§ The multimodal DBM is a generative and undirected graphical model.
§ Enables bi-directional search.
§ To learn
Limitations of DNN-based fusion model
§ Performance depends heavily on parameters
§ Finding optimal parameters is a labor-intensive and time-consuming process, given a large number of parameters and a non-convex optimization setting.
§ Hard to explain what the middle-level feature representation stands for.
§ We do not really understand how a DNN turns raw features into a better representation either.
Semantic meaning-based data fusion
§ Unlike feature-based fusion, semantic meaning-based methods understand the insight of each dataset and the relations between features across different datasets.
§ 4 groups of semantic meaning methods:
§ multi-view-based, similarity-based, probabilistic dependency-based, and transfer-learning-based methods.
Data fusion methodologies
§ Stage-based methods
§ Feature level-based methods
§ Semantic meaning-based data fusion methods
§ multi-view learning-based
§ co-training, multiple kernel learning (MKL), subspace learning
§ similarity-based
§ probabilistic dependency-based
§ and transfer learning-based methods.
Multi-View Based Data Fusion
§ Different datasets or different feature subsets about an object can be regarded as different views on the object.
§ Person: face, fingerprint, or signature
§ Image: color or texture features
§ Latent consensus & complementary knowledge
§ 3 subcategories:
§ 1) co-training
§ 2) multiple kernel learning (MKL)
§ 3) subspace learning
Multi-View Based Data Fusion: Co-training
§ Co-training considers a setting in which each example can be partitioned into two distinct views, and makes three main assumptions:
§ Sufficiency: each view is sufficient for classification on its own
§ Compatibility: the target functions in both views predict the same labels for co-occurring features with high probability
§ Conditional independence: the views are conditionally independent given the class label. (Too strong in practice)
Multi-View Based Data Fusion: Co-training
§ Original Co-training
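The co-training loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact original procedure: the per-view learner is a hypothetical nearest-centroid classifier, each round simply pseudo-labels the `per_round` most confident unlabeled examples (rather than fixed numbers of positives and negatives), and the two views are synthetic.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class (0 and 1)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_with_margin(centroids, X):
    """Return predicted labels and a confidence margin for each row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), np.abs(dists[:, 0] - dists[:, 1])

def co_train(view_a, view_b, y, labeled, rounds=10, per_round=2):
    """y is trusted only at the indices in `labeled`; the rest gets pseudo-labels."""
    y = y.copy()
    labeled = set(labeled)
    for _ in range(rounds):
        for view in (view_a, view_b):  # the two views take turns teaching
            unlabeled = np.array(sorted(set(range(len(y))) - labeled), dtype=int)
            if unlabeled.size == 0:
                break
            idx = np.array(sorted(labeled))
            centroids = fit_centroids(view[idx], y[idx])
            labels, margin = predict_with_margin(centroids, view[unlabeled])
            top = np.argsort(-margin)[:per_round]  # most confident predictions
            y[unlabeled[top]] = labels[top]        # add them as pseudo-labels
            labeled.update(unlabeled[top].tolist())
    idx = np.array(sorted(labeled))
    return fit_centroids(view_a[idx], y[idx]), fit_centroids(view_b[idx], y[idx]), y

rng = np.random.default_rng(0)
n = 40
truth = np.repeat([0, 1], n // 2)
view_a = rng.normal(size=(n, 2)) + 4.0 * truth[:, None]  # synthetic view 1
view_b = rng.normal(size=(n, 2)) - 4.0 * truth[:, None]  # synthetic view 2
y = np.full(n, -1)
seed_idx = [0, 1, 20, 21]                                # only four labeled examples
y[seed_idx] = truth[seed_idx]

cent_a, cent_b, y_filled = co_train(view_a, view_b, y, seed_idx)
pred, _ = predict_with_margin(cent_a, view_a)
accuracy = (pred == truth).mean()
print(accuracy)
```

Each view's classifier enlarges the labeled pool that the other view trains on in the next turn, which is where the sufficiency and compatibility assumptions matter.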
Co-training-based air quality inference model
Multi-View Based Data Fusion: MKL
§ 2. Multi-Kernel Learning
§ A kernel is a hypothesis on the data
§ MKL refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm.
§ E.g.: Ensemble and boosting methods, such as Random Forest, are inspired by MKL.
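A linear kernel combination can be sketched as below. This is only a heuristic illustration: the weights come from kernel-target alignment rather than from a jointly optimized MKL objective, the downstream learner is plain kernel ridge regression, and both views (`view_1` informative, `view_2` pure noise) are synthetic assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def alignment(K, y):
    """Kernel-target alignment: cosine similarity between K and y y^T."""
    target = np.outer(y, y)
    return (K * target).sum() / (np.linalg.norm(K) * np.linalg.norm(target))

def combine_kernels(kernels, y):
    """Weight each kernel by its (non-negative) alignment, normalized to sum to 1."""
    scores = np.array([max(alignment(K, y), 0.0) for K in kernels])
    weights = scores / scores.sum()
    return weights, sum(w * K for w, K in zip(weights, kernels))

rng = np.random.default_rng(0)
n = 60
y = np.repeat([-1.0, 1.0], n // 2)
view_1 = rng.normal(size=(n, 2)) + 2.0 * y[:, None]  # informative view
view_2 = rng.normal(size=(n, 2))                     # uninformative view

kernels = [rbf_kernel(view_1, view_1), rbf_kernel(view_2, view_2)]
weights, K = combine_kernels(kernels, y)

# Kernel ridge regression on the combined kernel.
alpha = np.linalg.solve(K + 0.1 * np.eye(n), y)
train_accuracy = (np.sign(K @ alpha) == y).mean()
print(np.round(weights, 2), train_accuracy)
```

The informative view receives most of the weight, so one kernel per view lets the combination decide how much each dataset contributes.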
Multi-View Based Data Fusion: MKL
§ MKL-based framework for forecasting air quality.
Multi-View Based Data Fusion: MKL
§ The MKL-based framework outperforms a single kernel-based model in the air quality forecast example
§ Feature space:
§ The features used by the spatial and temporal predictors do not have any overlaps, providing different views on a station’s air quality.
§ Model:
§ The spatial and temporal predictors model the local factors and global factors respectively, which have significantly different properties.
§ Parameter learning:
§ Decomposing a big model into 3 coupled small ones scales down the parameter space tremendously.
Multi-View Based Data Fusion: subspace learning
§ Obtain a latent subspace shared by multiple views by assuming that input views are generated from this latent subspace
§ Subsequent tasks, such as classification and clustering
§ Lower dimensionality
Multi-View Based Data Fusion: subspace learning
§ E.g.: PCA ->
§ Linear case: Canonical correlation analysis (CCA)
§ maximizing the correlation between 2 views in the subspace
§ Non-linear: Kernel variant of CCA (KCCA)
§ map each (non-linear) data point to a higher-dimensional space in which linear CCA operates.
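Linear CCA can be sketched directly in NumPy. The following is one standard formulation (whiten each view, then take the SVD of the whitened cross-covariance), shown under assumptions: the small ridge term `reg` and the synthetic two-view data driven by a shared latent variable `z` are illustrative choices.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca(X, Y, k=1, reg=1e-6):
    """Return projections A, B and the top-k canonical correlations."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)  # singular values = correlations
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k], s[:k]

rng = np.random.default_rng(0)
z = rng.normal(size=500)                     # shared latent signal
X = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
Y = np.column_stack([rng.normal(size=500), 2.0 * z + 0.1 * rng.normal(size=500)])

A, B, corr = cca(X, Y, k=1)
x_proj = (X - X.mean(axis=0)) @ A[:, 0]
y_proj = (Y - Y.mean(axis=0)) @ B[:, 0]
empirical = np.corrcoef(x_proj, y_proj)[0, 1]
print(corr[0], empirical)
```

The two printed numbers should agree closely: the leading singular value is the correlation that the learned one-dimensional subspace achieves between the two views.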
Multi-View Based Data Fusion
§ Summary of Multi-View Based methods
§ 1) co-training: maximize the mutual agreement on two distinct views of the data.
§ 2) multiple kernel learning (MKL): exploit kernels that naturally correspond to different views and combine kernels either linearly or non-linearly to improve learning.
§ 3) subspace learning: obtain a latent subspace shared by multiple views, assuming that the input views are generated from this latent subspace
Data fusion methodologies
§ Stage-based methods
§ Feature level-based methods
§ Semantic meaning-based data fusion methods
§ multi-view learning-based
§ similarity-based
§ Coupled Matrix Factorization
§ Manifold Alignment
§ probabilistic dependency-based
§ and transfer learning-based methods.
§ Recall: Matrix decomposition by SVD
§ Problems of single matrix decomposition on different datasets:
§ Inaccurate completion of missing values in the matrix.
Similarity-Based: Coupled Matrix Factorization
§ Solution by coupled (context-aware) matrix factorization:
§ Accommodate different datasets with different matrices (distribution, meaning) that share a common dimension with one another.
§ By decomposing these matrices collaboratively, we can transfer the similarity between different objects learned from one dataset to another, and therefore complete the missing values more accurately.
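A minimal sketch of the idea, assuming two matrices that share their row dimension: a sparse matrix X (observed only where `mask` is 1) and a dense context matrix Y. Both are factorized with a shared row factor U by plain gradient descent. The objective, the hyperparameters, and the synthetic data are illustrative assumptions, not the objective of the travel-speed application.

```python
import numpy as np

def coupled_mf(X, mask, Y, rank=2, lam=1.0, reg=0.01, lr=0.005, n_iter=3000, seed=0):
    """Minimize 0.5*||mask*(U V^T - X)||^2 + 0.5*lam*||U W^T - Y||^2 + L2 penalty."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(X.shape[0], rank))  # shared row factors
    V = 0.1 * rng.normal(size=(X.shape[1], rank))  # column factors of X
    W = 0.1 * rng.normal(size=(Y.shape[1], rank))  # column factors of Y
    for _ in range(n_iter):
        err_x = mask * (U @ V.T - X)               # only observed entries count
        err_y = U @ W.T - Y
        grad_u = err_x @ V + lam * err_y @ W + reg * U
        grad_v = err_x.T @ U + reg * V
        grad_w = lam * err_y.T @ U + reg * W
        U, V, W = U - lr * grad_u, V - lr * grad_v, W - lr * grad_w
    return U, V, W

rng = np.random.default_rng(1)
U0 = rng.normal(size=(20, 2))                      # hidden row structure
X_full = U0 @ rng.normal(size=(15, 2)).T           # matrix we want to complete
Y = U0 @ rng.normal(size=(10, 2)).T                # dense context matrix, same rows
mask = (rng.random(X_full.shape) < 0.5).astype(float)

U, V, W = coupled_mf(X_full * mask, mask, Y)
hidden = mask == 0
rmse_hidden = np.sqrt(np.mean((U @ V.T - X_full)[hidden] ** 2))
baseline = np.sqrt(np.mean((X_full[hidden] - X_full[mask == 1].mean()) ** 2))
print(rmse_hidden, baseline)
```

Because U must explain both matrices, the row similarities visible in the dense Y constrain the reconstruction of the missing entries of X; `baseline` is the error of filling in the mean of the observed entries.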
Coupled Matrix Factorization Application
§ Estimate the travel speed on each road segment in an entire city, based on the GPS trajectories of a sample of vehicles
Coupled Matrix Factorization Application
§ Coupled matrix factorization
§ Objective function:
Similarity-Based: Manifold Alignment
§ Utilizes the relationships of instances within each dataset to strengthen the knowledge of the relationships between the datasets, thereby ultimately mapping initially disparate datasets to a joint latent space
§ Maps two datasets (X, Y) to a new joint latent space (f(X); g(Y))
Similarity-Based: Manifold Alignment
§ Preserves 2 similarities:
§ The local similarity within a dataset,
§ The correspondences across different datasets.
§ C, cost function; F, embedding of data; W, similarity matrix; a, the a-th dataset
Similarity-Based: Manifold Alignment
§ Manifold alignment assumes that the disparate datasets to be aligned have the same underlying manifold structure
§ The second loss function is simply the loss function for Laplacian Eigenmaps using the joint adjacency matrix: L = D - W
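The Laplacian Eigenmaps step on a joint adjacency matrix can be sketched as follows. Assumptions made for illustration: within-dataset similarity is a symmetric k-nearest-neighbor graph, known correspondences are given as index pairs weighted by `mu`, and the two synthetic datasets are a 2-D and a 3-D curve driven by the same parameter `t`.

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric 0/1 adjacency matrix linking each point to its k nearest neighbors."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    W = np.zeros_like(d)
    nearest = np.argsort(d, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    for i, nbrs in enumerate(nearest):
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)

def manifold_align(X, Y, pairs, dim=1, mu=1.0, k=3):
    """Embed X and Y jointly via Laplacian Eigenmaps on the joint graph."""
    C = np.zeros((len(X), len(Y)))
    for i, j in pairs:                           # known cross-dataset correspondences
        C[i, j] = 1.0
    W = np.block([[knn_graph(X, k), mu * C], [mu * C.T, knn_graph(Y, k)]])
    D = W.sum(axis=1)
    L = np.diag(D) - W                           # joint graph Laplacian L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(D)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)
    F = d_inv_sqrt[:, None] * vecs[:, 1:dim + 1]  # skip the trivial eigenvector
    return F[:len(X)], F[len(X):]

t = np.linspace(0.0, 1.0, 20)
X = np.column_stack([t, t ** 2])                  # dataset 1: a 2-D curve
Y = np.column_stack([np.cos(t), np.sin(t), t])    # dataset 2: a 3-D curve
pairs = [(i, i) for i in range(0, 20, 4)]         # only 5 known correspondences

fX, gY = manifold_align(X, Y, pairs)
agreement = abs(np.corrcoef(fX[:, 0], gY[:, 0])[0, 1])
print(agreement)
```

Both datasets end up in the same one-dimensional latent space, and `agreement` measures how well points with the same underlying `t` line up even though only a few correspondences were supplied.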
Coupled Matrix Factorization + manifold
§ Example: Infer the fine-grained noise situation by using complaint data together with social media, road network data, and POIs
Probabilistic Dependency-Based Fusion
§ This category of approaches bridges the gap between different datasets through probabilistic dependency, which emphasizes the interaction rather than the similarity between two objects.
§ Two branches of graphical representations of distributions are commonly used:
§ Bayesian Networks
§ Markov Networks (a.k.a. Markov Random Field)
Probabilistic Dependency-Based Fusion Model
§ The graphical structure of the traffic volume inference model based on POIs, road networks, travel speed, and weather.
§ A gray node denotes a hidden variable and white nodes are observations.
§ 𝜃: road hidden variable
§ 𝛼: POI hidden variable
§ 𝑁$: traffic volume hidden variable
Transfer learning-based methods
§ An assumption in many machine learning algorithms is that the training and test data must be in the same feature space and have the same distribution.
§ Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and testing to be different.
§ Examples:
§ A user’s transaction records on Amazon -> applied to travel recommendation.
§ The knowledge learned from one city’s traffic data -> another city.
Taxonomy of Transfer learning
Transfer between the Same Type of Datasets
§ Examples of multi-task transfer learning
Transfer Learning among Multiple Datasets
Comparison of Different Data Fusion Methods
Filling Missing Values (of a sparse dataset), Predict Future, Causality Inference, Object Profiling, and Anomaly Detection.
Thank you!
Q&A