Differentially Private Data Release for Data Mining
Benjamin C.M. Fung (Concordia University, Montreal, QC, Canada), Noman Mohammed (Concordia University, Montreal, QC, Canada), Rui Chen (Concordia University, Montreal, QC, Canada), Philip S. Yu (University of Illinois at Chicago, IL, USA)

Post on 24-Dec-2015

TRANSCRIPT

  • Slide 1
  • Differentially Private Data Release for Data Mining. Benjamin C.M. Fung (Concordia University, Montreal, QC, Canada), Noman Mohammed (Concordia University, Montreal, QC, Canada), Rui Chen (Concordia University, Montreal, QC, Canada), Philip S. Yu (University of Illinois at Chicago, IL, USA)
  • Slide 2
  • Outline: Overview; Differential privacy; Related Work; Our Algorithm; Experimental results; Conclusion
  • Slide 3
  • Overview: Privacy model; Anonymization algorithm; Data utility
  • Slide 4
  • Contributions: Proposed an anonymization algorithm that provides a differential privacy guarantee: a generalization-based algorithm for differentially private data release. The proposed algorithm handles both categorical and numerical attributes and preserves information for classification analysis.
  • Slide 5
  • Outline: Overview; Differential privacy; Related Work; Our Algorithm; Experimental results; Conclusion
  • Slide 6
  • Differential Privacy [DMNS06]: A non-interactive privacy mechanism $A$ gives $\epsilon$-differential privacy if, for all neighbours $D$ and $D'$ and for any possible sanitized database $D^*$, $\Pr_A[A(D) = D^*] \le \exp(\epsilon) \times \Pr_A[A(D') = D^*]$. $D$ and $D'$ are neighbours if they differ on at most one record.
  • Slide 7
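  • Laplace Mechanism [DMNS06]: The sensitivity of a function $f$ is $\Delta f = \max_{D,D'} \|f(D) - f(D')\|_1$, taken over neighbouring $D$ and $D'$. For a counting query $f$: $\Delta f = 1$. For example, for a single counting query $Q$ over a dataset $D$, returning $Q(D) + \mathrm{Laplace}(1/\epsilon)$ maintains $\epsilon$-differential privacy.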
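The counting-query case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`laplace_noise`, `noisy_count`), the toy age values, and the choice of sampling Laplace noise as the difference of two exponential draws are not from the slides.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with Laplace(sensitivity / epsilon) noise."""
    sensitivity = 1.0  # one record changes a count by at most 1
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [60, 30, 25, 40, 25, 40, 45, 25]  # hypothetical toy records

# Repeating the query only shows that the noise is centred on the true count (4);
# in a real release each repetition would consume additional privacy budget.
answers = [noisy_count(ages, lambda a: a < 40, epsilon=1.0, rng=rng)
           for _ in range(20000)]
mean_answer = sum(answers) / len(answers)
```

A smaller `epsilon` widens the noise, so a single answer drifts further from the true count.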
  • Slide 8
  • Exponential Mechanism [MT07]: Given a utility function $u : (D \times T) \to \mathbb{R}$ for a database instance $D$, the mechanism $A(D, u)$ that returns $t$ with probability proportional to $\exp(\epsilon\, u(D, t) / (2 \Delta u))$ gives $\epsilon$-differential privacy.
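A minimal Python sketch of this selection rule follows, sampling a candidate with probability proportional to exp(ε·u/(2Δu)). The function names, the two candidate attributes, and their utility scores are hypothetical values chosen for illustration; subtracting the maximum score before exponentiating is a numerical-stability choice that leaves the probabilities unchanged.

```python
import math
import random

def selection_probabilities(scores, epsilon, sensitivity):
    """Normalised weights exp(epsilon * u / (2 * sensitivity)) for each score."""
    top = max(scores)  # shift by the max so exp() never overflows
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng):
    """Draw one candidate according to the exponential-mechanism distribution."""
    probs = selection_probabilities([utility(c) for c in candidates],
                                    epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical utility scores for two candidate specializations.
toy_scores = {"Job": 6.0, "Age": 4.0}
candidates = list(toy_scores)
probs = selection_probabilities([toy_scores[c] for c in candidates],
                                epsilon=1.0, sensitivity=1.0)
picked = exponential_mechanism(candidates, toy_scores.get, 1.0, 1.0,
                               random.Random(0))
```

With these numbers the higher-scoring candidate is e times more likely to be picked than the other, but the lower-scoring one can still be chosen.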
  • Slide 9
  • Composition properties: Sequential composition: $\sum_i \epsilon_i$-differential privacy. Parallel composition: $\max(\epsilon_i)$-differential privacy.
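The two composition rules amount to simple budget arithmetic. The sketch below assumes a hypothetical pipeline (four selection steps on the same data, then noisy counts on three disjoint partitions); the function names and the budget values are illustrative, not taken from the slides.

```python
def sequential_epsilon(epsilons):
    """Mechanisms run on the same data: the privacy costs add up."""
    return sum(epsilons)

def parallel_epsilon(epsilons):
    """Mechanisms run on disjoint subsets of the data: the largest cost applies."""
    return max(epsilons)

# Hypothetical budget split.
selection_steps = [0.125] * 4          # four steps over the same records
partition_counts = [0.5, 0.5, 0.5]     # one noisy count per disjoint partition

total_epsilon = sequential_epsilon([
    sequential_epsilon(selection_steps),  # 4 * 0.125
    parallel_epsilon(partition_counts),   # max over disjoint partitions
])
```

Because the partitions are disjoint, the three counts together cost only 0.5 rather than 1.5.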
  • Slide 10
  • Outline: Overview; Differential privacy; Related Work; Our Algorithm; Experimental results; Conclusion
  • Slide 11
  • Two Frameworks: Interactive: multiple questions asked/answered adaptively [figure label: Anonymizer]
  • Slide 12
  • Two Frameworks: Interactive: multiple questions asked/answered adaptively [figure label: Anonymizer]. Non-interactive: data is anonymized and released.
  • Slide 13
  • Related Work: A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ framework. In PODS, 2005. A. Friedman and A. Schuster. Data mining with differential privacy. In SIGKDD, 2010. Is it possible to release data for classification analysis?
  • Slide 14
  • Why the non-interactive framework? Disadvantages of the interactive approach: the database can answer only a limited number of queries, which is a big problem if there are many data miners; it also provides less flexibility to perform data analysis.
  • Slide 15
  • Non-interactive Framework [figure: Laplace noise added to a count, e.g. $0 + \mathrm{Lap}(1/\epsilon)$]
  • Slide 16
  • Non-interactive Framework: For high-dimensional data, the noise is too big. [figure: Laplace noise added to a count, e.g. $0 + \mathrm{Lap}(1/\epsilon)$]
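A rough back-of-the-envelope calculation shows why the noise dominates. The sketch below compares the average cell count of a contingency table with the standard deviation of Laplace(1/ε) noise. The record count 45,222 is the Adult data set size quoted later in the deck; the attribute domain sizes (eight attributes with five values each) are a hypothetical assumption made only for the arithmetic.

```python
import math

def number_of_cells(domain_sizes):
    """Cells in the full contingency table: product of the attribute domain sizes."""
    return math.prod(domain_sizes)

def laplace_std(epsilon, sensitivity=1.0):
    """Standard deviation of Laplace(sensitivity / epsilon) noise."""
    return math.sqrt(2.0) * sensitivity / epsilon

n_records = 45222
domain_sizes = [5] * 8  # hypothetical: 8 attributes, 5 values each

cells = number_of_cells(domain_sizes)
average_count = n_records / cells
noise_std = laplace_std(epsilon=1.0)
noise_to_signal = noise_std / average_count
```

Under these assumptions the table has far more cells than records, so the typical noise added to a cell is larger than the typical true count in it.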
  • Slide 17
  • Non-interactive Framework
  • Slide 18
  • Outline: Overview; Differential privacy; Related Work; Our Algorithm; Experimental results; Conclusion
  • Slide 19
  • Anonymization Algorithm
    Taxonomy tree for Job: Any_Job → Professional (Engineer, Lawyer), Artist (Dancer, Writer)
    Taxonomy tree for Age: [18-65) → [18-40), [40-65); [18-40) → [18-30), [30-40)
    Job | Age | Class | Count
    Any_Job | [18-65) | 4Y4N | 8
    Artist | [18-65) | 2Y2N | 4
    Professional | [18-65) | 2Y2N | 4
    Artist | [18-40) | 2Y2N | 4
    Artist | [40-65) | 0Y0N | 0
    Professional | [18-40) | 2Y1N | 3
    Professional | [40-65) | 0Y1N | 1
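The grouping on this slide can be reproduced with a small Python sketch. The eight raw records below are hypothetical, chosen only so that the generalized counts match the slide; the placement of Engineer and Lawyer under Professional (and Dancer and Writer under Artist) is inferred from the order of the labels, and the function names are illustrative. The sketch shows only the generalize-and-count step, without the noise a differentially private release would add to each count.

```python
from collections import Counter

# Assumed taxonomy: leaf job -> parent category.
JOB_PARENT = {"Engineer": "Professional", "Lawyer": "Professional",
              "Dancer": "Artist", "Writer": "Artist"}

def generalize_age(age, split=40, low=18, high=65):
    """Map an age to [low-split) or [split-high)."""
    return f"[{low}-{split})" if age < split else f"[{split}-{high})"

def group_counts(records):
    """Count records per (generalized job, generalized age, class)."""
    counts = Counter()
    for job, age, cls in records:
        counts[(JOB_PARENT[job], generalize_age(age), cls)] += 1
    return counts

# Hypothetical raw records (job, age, class).
records = [
    ("Engineer", 34, "Y"), ("Lawyer", 50, "N"), ("Engineer", 38, "N"),
    ("Lawyer", 33, "Y"), ("Dancer", 20, "Y"), ("Writer", 37, "N"),
    ("Writer", 32, "Y"), ("Dancer", 25, "N"),
]
counts = group_counts(records)
```

Note that the empty group (Artist, [40-65)) still appears on the slide with count 0; `Counter` returns 0 for it as well.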
  • Slide 20
  • Candidate Selection: We favor the specialization with the maximum Score value. First utility function: $\Delta u$ = [formula missing from the transcript]. Second utility function: $\Delta u = 1$.
  • Slide 21
  • Split Value: The split value of a categorical attribute is determined according to the taxonomy tree of the attribute. How to determine the split value for a numerical attribute?
  • Slide 22
  • Split Value: The split value of a categorical attribute is determined according to the taxonomy tree of the attribute. How to determine the split value for a numerical attribute? Example records (Age, Class): (60, Y), (30, N), (25, Y), (40, N), (25, Y), (40, N), (45, N), (25, Y). [figure: Age axis from 18 to 65, marked at 25, 30, 40, 45, 60]
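The transcript does not say how the split value is chosen, so the following Python sketch is only one plausible reading. It scores every candidate split of the example Age/Class records by the number of records that fall in the majority class on each side. That scoring function, and the function names, are assumptions made for illustration.

```python
from collections import Counter

# (Age, Class) records from the slide.
records = [(60, "Y"), (30, "N"), (25, "Y"), (40, "N"),
           (25, "Y"), (40, "N"), (45, "N"), (25, "Y")]

def majority_count(rows):
    """Number of rows belonging to the most frequent class (0 if empty)."""
    counts = Counter(cls for _, cls in rows)
    return max(counts.values()) if counts else 0

def split_score(rows, split):
    """Assumed score: majority-class counts of the two sides, added together."""
    left = [r for r in rows if r[0] < split]
    right = [r for r in rows if r[0] >= split]
    return majority_count(left) + majority_count(right)

# Candidate splits: every distinct age except the smallest,
# so that both sides of the split are non-empty.
candidates = sorted({age for age, _ in records})[1:]
scores = {v: split_score(records, v) for v in candidates}
best_split = max(scores, key=scores.get)
```

Taking the best-scoring split deterministically, as the last line does, would not be differentially private. A private variant would instead sample the split at random, for example with the exponential mechanism from Slide 8, so that higher-scoring splits are merely more likely.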
  • Slide 23
  • Anonymization Algorithm [figure: algorithm listing annotated with per-step costs $O(A_{pr} \times |D| \log |D|)$, $O(|candidates|)$, $O(|D|)$, $O(|D| \log |D|)$, $O(1)$]
  • Slide 24
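  • Anonymization Algorithm [figure: algorithm listing annotated with per-step costs $O(A_{pr} \times |D| \log |D|)$, $O(|candidates|)$, $O(|D|)$, $O(|D| \log |D|)$, $O(1)$]. Overall cost: $O((A_{pr} + h) \times |D| \log |D|)$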
  • Slide 25
  • Outline: Overview; Differential privacy; Related Work; Our Algorithm; Experimental results; Conclusion
  • Slide 26
  • Experimental Evaluation: Adult is a census data set (from the UCI repository) with 6 continuous attributes, 8 categorical attributes, and 45,222 census records.
  • Slide 27
  • Data Utility for Max [figure]
  • Slide 28
  • Data Utility for InfoGain [figure]
  • Slide 29
  • Comparison [figure]
  • Slide 30
  • Scalability [figure]
  • Slide 31
  • Outline: Overview; Differential privacy; Related Work; Our Algorithm; Experimental results; Conclusion
  • Slide 32
  • Conclusions: Differentially private data release; generalization-based differentially private algorithm; provides better utility than existing techniques.
  • Slide 33
  • Q&A: Thank You Very Much