
Page 1

Distributed Model-Based Learning

PhD student: Zhang, Xiaofeng

Page 2

I. Model-Based Learning

• Methods used in data clustering – dimension reduction:

• 1. Linear methods: SVD, PCA, kernel PCA, etc. (see the PCA sketch after this list)

• 2. Pairwise-distance methods: multidimensional scaling (MDS), etc.

• 3. Topographic maps: elastic net, SOM, generative topographic mapping (GTM), etc.

• 4. Manifold learning: LLE, etc.
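To make the linear case concrete, here is a minimal PCA-via-SVD sketch (NumPy assumed; the toy matrix X is invented for illustration, not taken from the talk):

```python
import numpy as np

# Toy data: six points in 3-D (invented for illustration).
X = np.array([[2.5, 2.4, 0.5], [0.5, 0.7, 1.1], [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6], [3.1, 3.0, 0.2], [2.3, 2.7, 0.5]])

Xc = X - X.mean(axis=0)                       # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                             # project onto the top two principal axes
print(Z)                                      # 2-D embedding, e.g. for visualization
```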

Page 3

• Characteristics:

– Cope with incomplete data

– Better at explaining the data

– Visualization

• GTM as an example

– A Gaussian distribution over the dataset:

p(t_i | z_k, W, β) = (β / 2π)^{D/2} exp{ −(β / 2) ‖ y(z_k; W) − t_i ‖² }

where y(z_k; W) maps the latent point z_k into data space and β is the inverse noise variance.
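A minimal sketch of evaluating this per-point density (NumPy assumed; in GTM the mapping is y(z; W) = W·φ(z) for a basis φ, which is simplified here to a plain linear map as an illustrative assumption):

```python
import numpy as np

def gtm_component_density(t_i, z_k, W, beta):
    """p(t_i | z_k, W, beta) for one latent point z_k.

    Assumption for this sketch: the basis phi is the identity,
    so y(z; W) = W @ z rather than W @ phi(z).
    """
    D = t_i.shape[0]
    y = W @ z_k                                  # latent point mapped to data space
    sq_dist = np.sum((y - t_i) ** 2)
    return (beta / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * sq_dist)

# Tiny illustrative example: 2-D latent space, 3-D data space.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
print(gtm_component_density(rng.normal(size=3), np.array([0.5, -0.5]), W, beta=1.0))
```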

Page 4

• Collaborative filtering using GTM:

– Dataset: movie data

– Ratings on movies in [0–1]

– Each color represents a class of movie

– Visualized on a 2-D plane

– Romance vs. Action

– Blue: Action

– Pink: Romance

Page 5

• Centralized GTM in CF:

– Centralized dataset

• Large scale: billions of records

• Expensive to maintain

• Distributed requirements:

• Security concerns: banks, government, the military

• Privacy-sensitive: banks, commercial sites, personal sites

• Scalability

• Expensive to centralize

• Real-time, huge data streams

• A distributed way of learning statistical models is therefore an important issue

Page 6

II. Related Work

• Distributed Information Retrieval

– Globally, building a P2P network

– Locally, routing a query

– Globally, matching the query against a distributed dataset

Page 7

• Distributed Data Mining

– Partitioning of the dataset:

• Horizontal (homogeneous): attributes are the same across partitions

• Vertical (heterogeneous): attributes differ across partitions

– Approaches:

• Distributed KNN

• Density-based methods

• Distributed Bayesian networks

– For example, a global virtual table can be built for the vertical partitioning case, as sketched below
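A minimal sketch of the global-virtual-table idea for vertical partitions (the record IDs and attribute names are invented for illustration):

```python
# Two vertically partitioned sites hold different attributes
# for the same records, keyed by a shared record ID.
site_a = {101: {"age": 34, "income": 52000},
          102: {"age": 27, "income": 41000}}
site_b = {101: {"rating": 0.8},
          102: {"rating": 0.3}}

# The global virtual table joins the partitions on the record ID.
virtual_table = {
    rid: {**site_a[rid], **site_b[rid]}
    for rid in site_a.keys() & site_b.keys()
}
print(virtual_table[101])  # {'age': 34, 'income': 52000, 'rating': 0.8}
```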

Page 8

• Approaches to distributed learning:

– Mediator-based

– Agent-based

– Grid-based

– Middleware-based

– Density-based

– Model-based

Page 9

III. Our Approach

• Problem review:

– Three local models

– Globally merge the local models

– Merge again or not?

– Sparse local data

– An underlying global model

Page 10

• A related approach

– Artificial data

– A Gaussian mixture model over the global dataset

– MCMC sampling to learn the local models

– From the averaged local models, learn the global model

– Privacy cost distribution: a Gaussian distribution

Page 11

• Density-based merging approach

– The combined global model (see the sketch below):

p(x_t) = Σ_{i=1}^{K} α_i p_i(x_t)

– K: the number of components

– p_i(x_t): a Gaussian component

– α_i: the weight of component i, satisfying Σ_{i=1}^{K} α_i = 1
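A minimal sketch of evaluating the combined global model (1-D isotropic Gaussian components; all parameter values are invented for illustration):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density (kept simple for illustration)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def combined_density(x, alphas, means, variances):
    """p(x) = sum_i alpha_i * p_i(x), with the weights summing to 1."""
    return sum(a * gaussian_pdf(x, m, v)
               for a, m, v in zip(alphas, means, variances))

alphas = [0.5, 0.3, 0.2]          # weights from K = 3 local models
means = [0.0, 2.0, -1.5]
variances = [1.0, 0.5, 2.0]
assert abs(sum(alphas) - 1.0) < 1e-9
print(combined_density(0.5, alphas, means, variances))
```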

Page 12

• Merging criteria

– Q = argmax(L_ij) + argmin(Cos_ij)

– L_ij: likelihood measure

– Cos_ij: privacy cost between two models

• Two considerations (see the sketch below):

– Privacy cost

– The likelihood that data are generated by the other model
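The slide's criterion is terse; one plausible reading, sketched below, scores each candidate pair by likelihood minus privacy cost and picks the best pair to merge. The difference-based combination and the toy scoring functions are assumptions, not the author's exact rule:

```python
import itertools

def select_merge_pair(models, likelihood, privacy_cost):
    """Pick the pair (i, j) that best trades likelihood against privacy
    cost.  Combining the two terms as a difference is an assumed
    reading of Q = argmax(L_ij) + argmin(Cos_ij)."""
    best_pair, best_score = None, float("-inf")
    for i, j in itertools.combinations(range(len(models)), 2):
        score = likelihood(models[i], models[j]) - privacy_cost(models[i], models[j])
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair

# Toy usage: models are scalar means; closer models are more likely
# to explain each other's data, privacy cost is a flat stand-in.
models = [0.0, 0.4, 2.0]
print(select_merge_pair(models,
                        likelihood=lambda a, b: -abs(a - b),
                        privacy_cost=lambda a, b: 0.1))   # -> (0, 1)
```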

Page 13

• Steps (a loop sketch follows this list):

• Locally learn the models

• Merge according to the likelihood and privacy control

• Merging stops when no clusters are density-connected

• Learn the parameters of a global GMM (K components, etc.)
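A high-level sketch of the merge loop above (`density_connected` and `merge` are hypothetical stand-ins for the likelihood/privacy tests; the scalar toy models are invented):

```python
def density_based_merge(models, density_connected, merge):
    """Merge density-connected model pairs until none remain, which is
    the stopping rule on the slide."""
    models = list(models)
    while True:
        pair = next(((i, j)
                     for i in range(len(models))
                     for j in range(i + 1, len(models))
                     if density_connected(models[i], models[j])), None)
        if pair is None:
            return models                 # no clusters are density-connected
        i, j = pair
        new = merge(models[i], models[j])
        models = [m for k, m in enumerate(models) if k not in pair] + [new]

# Toy usage: models are scalar means; merging averages them.
print(density_based_merge([0.0, 0.5, 3.0],
                          density_connected=lambda a, b: abs(a - b) < 1.0,
                          merge=lambda a, b: (a + b) / 2))   # -> [3.0, 0.25]
```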

Page 14

• Hierarchical approach

• Six local models

• Merge according to the similarity measure

• Each level can be controlled by the privacy cost

• Bottom-up learning of a hierarchical model

• After a global model is learned, changing the privacy control level can change the model

Page 15

• Model selection (see the sketch below)

– Sim_ij = Dist(Cost(D_i), Cost(D_j)) < Const

• Cost(D_i): transform dataset D_i using the cost function

• Dist(x, y): the distance between the two transformed datasets

• Merge if the distance is smaller than the threshold
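A minimal sketch of this merge test (the mean-based cost transform and the Euclidean distance are illustrative assumptions, not the talk's actual cost function):

```python
import numpy as np

def similar(D_i, D_j, cost, dist, const):
    """Merge test: Sim_ij = Dist(Cost(D_i), Cost(D_j)) < Const."""
    return dist(cost(D_i), cost(D_j)) < const

# Illustrative stand-ins: the cost transform keeps only dataset means,
# the distance is Euclidean; both are assumptions for this sketch.
cost = lambda D: np.mean(D, axis=0)
dist = lambda x, y: np.linalg.norm(x - y)

D1 = np.random.default_rng(1).normal(0.0, 1.0, size=(100, 2))
D2 = np.random.default_rng(2).normal(0.2, 1.0, size=(100, 2))
print(similar(D1, D2, cost, dist, const=0.5))   # merge the models if True
```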

• Steps:

• 1. Learn a local model from each local dataset.

• 2. Based on the predefined privacy control function, merge the local models to form a hierarchical global model.

• 3. Relabel the local models according to the changed privacy level.

Page 16

• Privacy Control by Data Sampling

• Previously, the privacy function itself was controlled

• Here, we instead control the privacy-sensitive parts of the dataset

– D1' = D1 ∪ O_a21(D2) ∪ O_a31(D3) ∪ O_a41(D4)

– D2' = O_a12(D1) ∪ D2 ∪ O_a32(D3) ∪ O_a42(D4)

– …

– O_a12: a sampling operator over the dataset

– New local datasets are reconstructed by sampling from the other local datasets at some privacy control level, as sketched below
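A minimal sketch of the sampling operator O and the reconstruction of D1' (treating the subscript a as a sampling fraction that plays the role of the privacy control level is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def O(D, a):
    """Sampling operator: draw a fraction `a` of dataset D without
    replacement; `a` stands in for the privacy control level."""
    n = max(1, int(a * len(D)))
    idx = rng.choice(len(D), size=n, replace=False)
    return D[idx]

D1, D2, D3, D4 = (rng.normal(size=(50, 2)) for _ in range(4))

# D1' = D1 U O_a21(D2) U O_a31(D3) U O_a41(D4)
D1_prime = np.vstack([D1, O(D2, 0.2), O(D3, 0.2), O(D4, 0.2)])
print(D1_prime.shape)   # (80, 2)
```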

Page 17

• P2P Approach

– A local small world of the network

– A local global model

– Local network information stored in each node

– Trust propagation to connected nodes

– Knowledge passed to connected small worlds

Page 18

• Algorithm (a loop sketch follows this list):

• 1. Learn a global model for each small world of local nodes.

• 2. Pass the global information back to each node in the small world.

• 3. Node_i passes its trust relationship to its connected outer small-world nodes at a certain value.

• 4. The connected nodes merge their local model with the new knowledge into another model.

• 5. Update the connected global model's knowledge and propagate it to all the local models in that small world.

• 6. Sum all the knowledge L3 has collected and update G2; repeat steps 3–6 until the loop criterion is satisfied: the iteration limit is reached or the global model changes little.
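A high-level sketch of steps 1 and 3–6 as a fixed-point loop (all function names and the averaging rule are assumptions; L3 and G2 refer to labels in the original figure, which is not reproduced here):

```python
def p2p_propagate(small_worlds, learn_global, merge, max_iter=20, tol=1e-4):
    """Iteratively exchange model knowledge between connected small worlds.

    `small_worlds` maps a world id to its list of local models;
    `learn_global` and `merge` are hypothetical stand-ins for the
    model-learning and knowledge-merging steps on the slide.
    """
    globals_ = {w: learn_global(ms) for w, ms in small_worlds.items()}  # step 1
    for _ in range(max_iter):                          # repeat steps 3-6
        updated = {w: merge(g, [globals_[v] for v in globals_ if v != w])
                   for w, g in globals_.items()}       # collect neighbor knowledge
        change = max(abs(updated[w] - globals_[w]) for w in globals_)
        globals_ = updated
        if change < tol:                               # global model changes little
            break
    return globals_

# Toy usage: models are scalars; merging averages own and neighbor knowledge.
worlds = {"A": [0.0, 1.0], "B": [4.0], "C": [2.0, 3.0]}
learn_global = lambda ms: sum(ms) / len(ms)
merge = lambda g, neighbors: 0.5 * g + 0.5 * (sum(neighbors) / len(neighbors))
print(p2p_propagate(worlds, learn_global, merge))
```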

Page 19

IV. Model Evaluation

• Effectiveness criteria (see the sketch below)

– Precision: how accurate the model can be

– Recall: how much of the right data the model covers
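A minimal set-based sketch of the two criteria (the example items are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are right.
    Recall: fraction of the right items that are covered."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 5]))
# (0.5, 0.666...)
```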

Page 20

• Efficiency criteria

– Communication cost

• Bandwidth is assumed to be the same

• Proportional only to the partition size

• Maximum data transferred

– Overhead

• Compare the three approaches with the centralized way

– Complexity

• Computational complexity

Page 21

V. Experiment Issues

• Another approach to the dataset: site vectors instead of document vectors

• Pick out meaningful representatives of the local models

• LLE vs. GTM, etc.

• Change the privacy distribution to control the shape of the global model

Page 22

Question & Answer