L8: Introduction to privacy-preserving computations
Post on 29-Jan-2022
Privacy-preserving Technologies / LTAT.04.007
Dan Bogdanov
dan.bogdanov@cyber.ee
Using privacy technologies to solve it
Source data: 10 million tax records, 600 000 education records.
Each record was uploaded using secret sharing (think: “encryption”).
Records were linked and processed using secure multi-party computation (think: “data not decrypted for processing”).
Data never existed outside the source in an unencrypted state.
Solution based on Sharemind MPC.
[Figure: education records from the Ministry of Education and Research and employment tax records from the Estonian Tax and Customs Board are uploaded to servers hosted by Cybernetica, the Estonian Information System's Authority and the Ministry of Finance IT Center.]
[Figure: analysis pipeline. Data stored with secret sharing and processed with secure multi-party computation.
Tax and Customs Board: extract data (employment tax payments), secret share and upload; aggregate by month (monthly income); aggregate by year (average yearly income); aggregate by person (employment record of a person).
Ministry of Education and Research: extract data (higher study events), secret share and upload; expand by years and aggregate by person (university career of a person).
Merge by person's ID (complete record of a person); compute additional attributes and align tax payments (analysis table); analysis results; the statistical analyst recovers the results from shares.]
Sharemind-powered Analytics
Data scientists used analytics tools based on secure multi-party computation.
The MPC system also prevented queries outside the study plan.
Reports were given to industry, universities and the government.
Result: no clear relation between working during studies and not graduating.
[Figure: the data analyst queries servers hosted by Cybernetica, the Estonian Information System's Authority and the Ministry of Finance IT Center; reports go to universities, companies and policymakers.]
2. Results
The results confirm that the share of students graduating within the nominal study period is low among students in general and among ICT students in particular. Among ICT students, the share of nominal-time graduates in bachelor's studies hovers around 20 percent, which is lower than the corresponding figure for students in other curricula (see Figure 1). The same tendency appears in professional higher education. In master's studies the share of nominal-time graduates is somewhat higher, varying between 30% and 40% depending on the year, but here too it is lower among ICT students than among others.
Figure 1. Share of nominal-time graduates by year of matriculation, ICT and non-ICT curricula, bachelor's studies
Male students are less likely to graduate within the nominal time than female students (see Figure 2, Figure 3). The lower probability of nominal-time graduation among ICT students appears for both sexes.
Non-IT graduation rate is around 40%
IT graduation rate is around 20%
By institution, the share of ICT students graduating in nominal time in bachelor's studies is highest at the University of Tartu, followed by TTÜ and TLÜ. In professional higher education the share of nominal-time graduates is highest at TLÜ, followed by Infotehnoloogia Kolledž (the IT College) and TTÜ. In master's studies the share is highest at TÜ, followed by TTÜ and TLÜ (the ranking varies from year to year).
Looking at employment during the nominal study period, it surprisingly turns out that ICT students do not work during their studies more than students in non-ICT curricula. In bachelor's studies, across all academic years, the employment rate in most years is higher among non-ICT students than among ICT students (see Figure 4). The same conclusion holds for students in professional higher education. In master's studies, where employment rates exceed 80%, the result is the opposite: the employment rate is higher among ICT students than among non-ICT students.
Figure 4. Share of students who worked during the nominal study period among all students, by year, ICT and non-ICT curricula, bachelor's studies
The employment rate of female students is somewhat higher than that of male students, among both ICT and non-ICT students (Figure 5, Figure 6). Gender differences in employment rates vary by year: in some years the difference is considerable, in others there is no significant difference.
Non-IT and IT students have similar employment ratios, but IT students lost more in the financial crisis
Regulatory status of the project
In an official response, after a study of the system, the Estonian DPA suggested that neither the hosts of the servers running the statistics nor the analysts making the queries could feasibly re-identify individuals in the source database (this was pre-GDPR).
The Internal Supervision Department of the Tax and Customs Board agreed to provide unmodified tax records after a code and process review.
A follow-up legal review in the FP7 PRACTICE project by a researcher from the University of Göttingen suggested that the same precedent could hold under GDPR as well.
DATA-DRIVEN SERVICES ON CONFIDENTIAL DATA
Concept of secure computing
[Figure: an encrypted database processed with standard tools versus with secure computing.]
When a standard computer encrypts data, it must be decrypted before analysis.
Secure computing systems can analyze data without removing the encryption.
Extended definition of secure multi-party computation
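[Figure: roles in secure multi-party computation.
Step 1: upload and storage of inputs. Input parties IP1, ..., IPk hold inputs x1, ..., xk; input party IPi sends the piece xij to computing party CPj.
Step 2: secure computing. Computing parties CP1, ..., CPl compute y1, ..., yl from the pieces they hold.
Step 3: publishing of results. Result parties RP1, ..., RPm receive the outputs y1, ..., ym.]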
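The three steps can be illustrated with a minimal Python sketch. It assumes additive secret sharing modulo 2^32 as the protection mechanism and a simple sum as the computed function; both choices, and the names `share`, `reconstruct` and `run_protocol`, are illustrative assumptions rather than anything prescribed by the lecture.

```python
import secrets

MODULUS = 2**32  # assumption: all values and shares live in the integers mod 2^32


def share(value, n_parties):
    """Split value into n additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Combine all shares of one value back into the value."""
    return sum(shares) % MODULUS


def run_protocol(inputs, n_computing=3):
    # Step 1: input party IP_i sends the piece x_ij to computing party CP_j.
    received = [[] for _ in range(n_computing)]
    for x_i in inputs:
        for j, x_ij in enumerate(share(x_i, n_computing)):
            received[j].append(x_ij)
    # Step 2: every computing party works only on the pieces it holds.
    output_shares = [sum(pieces) % MODULUS for pieces in received]
    # Step 3: a result party collects the output shares and reconstructs y.
    return reconstruct(output_shares)


salaries = [1200, 950, 2100]
print(run_protocol(salaries))  # 4250
```

No single computing party sees an input in the clear; only the result party learns the sum.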
Technique: property-preserving cryptography
Analogy: symmetric crypto that preserves a relation on inputs (e.g., order, equality).
Pros:
• Low performance overhead.
• Fits well into existing systems.
Cons:
• Only allows a few operations (e.g., only equality comparison or ordering).
• Multi-user systems are a challenge, but can be done with proxy re-encryption.
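As a rough illustration of the equality-preserving case, the sketch below uses a keyed deterministic tag (HMAC-SHA256) as a stand-in for deterministic encryption. This is an assumption made for brevity: an HMAC tag cannot be decrypted, so it only shows how equality survives the transformation. The function name `equality_token` and the demo key are hypothetical.

```python
import hashlib
import hmac


def equality_token(key: bytes, value: str) -> str:
    """Deterministic keyed tag: equal inputs give equal tokens under the same key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"demo key, not for production use"  # hypothetical key for the example
records = ["alice", "bob", "alice"]
tokens = [equality_token(key, r) for r in records]

# A server holding only the tokens can still group or join on them.
print(tokens[0] == tokens[2])  # True
print(tokens[0] == tokens[1])  # False
```

The same property is also the limitation: anyone who sees the tokens learns which records are equal, and nothing beyond equality can be computed.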
Technique: homomorphic encryption
Analogy: asymmetric crypto that allows addition and multiplication of ciphertexts.
Pros:
• Fits well into existing systems.
Cons:
• High performance overhead.
• Multi-user systems are a challenge, but can be done with proxy re-encryption.
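To make the idea of computing on ciphertexts concrete, here is a toy Paillier cryptosystem in Python (3.8+ for `pow(x, -1, n)`). Note the assumptions: Paillier is only additively homomorphic, so it covers the "addition" half of the analogy, and the primes are far too small to be secure. It is a sketch, not a recommended implementation.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy Paillier key from two small primes (insecure, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, priv, c_sum))  # 42
```

The party performing `add_encrypted` needs only the public modulus `n`, never the private key.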
Technique: garbled circuits
Analogy: cryptographic versions of electrical circuits.
Pros:
• Flexible programming model.
Cons:
• Medium performance overhead.
• Fixed number of parties (can be solved by combining with other techniques).
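A single garbled AND gate can be sketched in a few lines of Python. The construction below is a simplified textbook variant chosen for readability (SHA-256 as the row key, an all-zero tag to recognise the right row), not an optimized or vetted scheme. In a real protocol the evaluator would obtain its input label through oblivious transfer, which is omitted here.

```python
import hashlib
import secrets

LABEL_LEN = 16
TAG = bytes(LABEL_LEN)  # all-zero tag that marks a correctly decrypted row


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def pad(label_a, label_b):
    """Derive a 32-byte one-time pad from a pair of input labels."""
    return hashlib.sha256(label_a + label_b).digest()


def garble_and_gate():
    # Two random labels per wire: index 0 encodes bit 0, index 1 encodes bit 1.
    wires = {
        w: (secrets.token_bytes(LABEL_LEN), secrets.token_bytes(LABEL_LEN))
        for w in ("a", "b", "out")
    }
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = wires["out"][bit_a & bit_b]
            row = xor(pad(wires["a"][bit_a], wires["b"][bit_b]), out_label + TAG)
            table.append(row)
    secrets.SystemRandom().shuffle(table)  # hide which row belongs to which inputs
    return wires, table


def evaluate(table, label_a, label_b):
    """The evaluator holds one label per input wire and learns one output label."""
    key = pad(label_a, label_b)
    for row in table:
        plain = xor(key, row)
        if plain[LABEL_LEN:] == TAG:
            return plain[:LABEL_LEN]
    raise ValueError("no row decrypted correctly")


wires, table = garble_and_gate()
for a in (0, 1):
    for b in (0, 1):
        out = evaluate(table, wires["a"][a], wires["b"][b])
        print(a, b, wires["out"].index(out))
```

The evaluator sees only random-looking labels; the mapping from the output label back to a bit is held by the garbler.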
Technique: secret sharing
Analogy: give a number of people a random piece of each secret value and let them collaborate to compute results.
Pros:
• Low-to-medium performance overhead.
• Flexible programming model.
Cons:
• Distributed deployments do not fit into all existing systems.
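The "collaborate to compute" part is easiest to see by contrasting addition with multiplication. The sketch below assumes additive sharing modulo 2^32 among three parties and uses Beaver triples from a trusted dealer for multiplication. That is one common textbook approach, picked here for brevity; it is not claimed to be the protocol used by Sharemind or any other specific system.

```python
import secrets

MODULUS = 2**32
N_PARTIES = 3


def share(value, n=N_PARTIES):
    parts = [secrets.randbelow(MODULUS) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % MODULUS)
    return parts


def reconstruct(shares):
    return sum(shares) % MODULUS


def add(xs, ys):
    """Addition is local: each party adds the two shares it holds."""
    return [(x + y) % MODULUS for x, y in zip(xs, ys)]


def multiply(xs, ys):
    """Multiplication needs interaction; here it consumes one Beaver triple."""
    # Trusted dealer (an assumption of this sketch) shares random a, b and c = a * b.
    a, b = secrets.randbelow(MODULUS), secrets.randbelow(MODULUS)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % MODULUS)
    # The parties open d = x - a and e = y - b; a and b mask x and y completely.
    d = reconstruct([(x - ai) % MODULUS for x, ai in zip(xs, a_sh)])
    e = reconstruct([(y - bi) % MODULUS for y, bi in zip(ys, b_sh)])
    zs = [(ci + d * bi + e * ai) % MODULUS for ai, bi, ci in zip(a_sh, b_sh, c_sh)]
    zs[0] = (zs[0] + d * e) % MODULUS  # the public term d * e is added by one party only
    return zs


x, y = share(6), share(7)
print(reconstruct(add(x, y)))       # 13
print(reconstruct(multiply(x, y)))  # 42
```

Addition costs no communication, while each multiplication requires the parties to exchange messages, which is where most of the performance overhead comes from.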
Technique: trusted execution environments
Analogy: think of a computer process that hides the data from its owner.
Pros:
• Minimal performance overhead.
• Relatively easy to convert applications to work on trusted execution environments.
Cons:
• Side-channel attack mitigations are complicated to implement.
Lecture task
Think of an application that would support social distancing and limit infection rates.
• Write down very clearly what the expected benefit of the system is.
• Write down the list of input parties and the data they would provide.
• Write down the list of computing parties and describe the kind of processing they would perform.
• Write down the list of result parties and describe the outputs they would receive.
Bonus tasks, time permitting:
• Think of minimizing personal data processing using process redesign.
• See if any of the secure computing paradigms described above could support your application.
Prepare in 12 minutes and then we’ll have 1-2 students present their ideas.
PDK as an abstraction of a secure computing paradigm
A protection domain kind (PDK) is a set of data representations, algorithms and protocols for storing and computing on protected data.
Examples:
• SMC based on secret sharing,
• SMC based on garbled circuits,
• (fully) homomorphic encryption,
• trusted hardware (e.g., Intel SGX).
Protection domain as an instance of a PDK
A protection domain (PD) is a set of data that is protected with the same resources and for which there is a well-defined set of algorithms and protocols for computing on that data while keeping the protection.
Examples:
• data held by a fixed group of servers performing secure multi-party computation,
• data encrypted under a fixed key of a homomorphic encryption scheme.
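One way to picture the kind/instance relationship is as a class and its objects. In the hypothetical Python sketch below, the abstract class plays the role of a PDK interface, `AdditiveSharing` is one illustrative kind, and each object bound to a fixed group of servers is a protection domain. All class, method and server names are assumptions made for this example.

```python
import secrets
from abc import ABC, abstractmethod


class ProtectionDomainKind(ABC):
    """A PDK: a data representation plus protocols for computing on it."""

    @abstractmethod
    def classify(self, value): ...

    @abstractmethod
    def declassify(self, protected): ...

    @abstractmethod
    def add(self, x, y): ...


class AdditiveSharing(ProtectionDomainKind):
    """Hypothetical kind: additive secret sharing modulo 2**32."""

    MODULUS = 2**32

    def __init__(self, servers):
        # The fixed set of resources is what makes an instance a protection domain.
        self.servers = tuple(servers)

    def classify(self, value):
        parts = [secrets.randbelow(self.MODULUS) for _ in self.servers[1:]]
        parts.append((value - sum(parts)) % self.MODULUS)
        return parts

    def declassify(self, protected):
        return sum(protected) % self.MODULUS

    def add(self, x, y):
        return [(a + b) % self.MODULUS for a, b in zip(x, y)]


# Two protection domains of the same kind: different servers, separate data.
pd_a = AdditiveSharing(["s1", "s2", "s3"])
pd_b = AdditiveSharing(["t1", "t2"])

total = pd_a.add(pd_a.classify(10), pd_a.classify(5))
print(pd_a.declassify(total))  # 15
```

Values classified in `pd_a` cannot be combined with values from `pd_b` without an explicit conversion, which mirrors the idea that protection is tied to the domain's resources.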
Application model for privacy-preserving computing
Secure primitive operations
• private outputs from private inputs,
• have privacy proofs,
• remain private under sequential or parallel composition,
• optimized to have a low resource footprint.
Privacy-preserving algorithms
• compositions of secure primitive operations,
• optimize for running time.
Application logic
• publish selected results to make system useful,
• do not leak private inputs or show leakage as acceptable.
Converting an algorithm to a privacy-preserving one
We pick frequent itemset mining as a problem of choice.
Frequent itemset mining is a data mining problem that helps with shopping basket analysis and the simplest kinds of recommender systems.
• What kind of things do people buy from stores together most often?
• If the service provider knows this, they can recommend one to a customer who is planning to buy the other.
The simpler algorithms include Apriori (breadth-first search) and Eclat (depth-first search).
We will now look at the basic primitive of frequent itemset mining and then build a privacy-preserving approach.
Privacy-preserving data representations
Private data representations are the key toward designing privacy-preserving algorithms.

Transactions:
t1  rendang, nasi lemak, chicken satay
t2  nasi lemak, lontong
t3  chicken satay
t4  rendang, nasi lemak
t5  nasi lemak, chicken satay
t6  nasi lemak, chicken satay
t7  lontong

Bit matrix:
    rendang  nasi lemak  lontong  chicken satay
t1  1        1           0        1
t2  0        1           1        0
t3  0        0           0        1
t4  1        1           0        0
t5  0        1           0        1
t6  0        1           0        1
t7  0        0           1        0
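The conversion from transactions to the bit matrix can be sketched in Python as follows. The function name `to_bit_matrix` is an illustrative assumption; the data is the example from the slide. In a deployment each bit would then be protected individually (for example secret-shared), which this sketch leaves out.

```python
transactions = {
    "t1": ["rendang", "nasi lemak", "chicken satay"],
    "t2": ["nasi lemak", "lontong"],
    "t3": ["chicken satay"],
    "t4": ["rendang", "nasi lemak"],
    "t5": ["nasi lemak", "chicken satay"],
    "t6": ["nasi lemak", "chicken satay"],
    "t7": ["lontong"],
}
items = ["rendang", "nasi lemak", "lontong", "chicken satay"]


def to_bit_matrix(transactions, items):
    """One row per transaction, one column per item; 1 means the item was bought."""
    return [
        [1 if item in basket else 0 for item in items]
        for basket in transactions.values()
    ]


matrix = to_bit_matrix(transactions, items)
for name, row in zip(transactions, matrix):
    print(name, *row)
```

Every row has the same length regardless of how many items the basket contains, so the shape of the protected data does not depend on its contents.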
Calculating the support of an item
The data representation allows for very efficient calculation of item supports.
nasi lemak column (t1 to t7): 1 1 0 1 1 1 0
∑ = 5
Calculating support for a set of items
Checking the joint support of a pair of items simply requires a multiplication.
nasi lemak                    1 1 0 1 1 1 0
chicken satay               × 1 0 1 0 1 1 0
nasi lemak & chicken satay  = 1 0 0 0 1 1 0
∑ = 3
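The same calculation generalizes to itemsets of any size: multiply the selected columns element-wise and sum the result. The Python sketch below works on plain bits for clarity (Python 3.8+ for `math.prod`); in a privacy-preserving setting each multiplication and addition would be replaced by the corresponding secure operation. The function name `support` is an illustrative choice.

```python
from math import prod

ITEMS = ["rendang", "nasi lemak", "lontong", "chicken satay"]
MATRIX = [
    [1, 1, 0, 1],  # t1
    [0, 1, 1, 0],  # t2
    [0, 0, 0, 1],  # t3
    [1, 1, 0, 0],  # t4
    [0, 1, 0, 1],  # t5
    [0, 1, 0, 1],  # t6
    [0, 0, 1, 0],  # t7
]


def support(itemset):
    """Number of transactions that contain every item in itemset."""
    cols = [ITEMS.index(item) for item in itemset]
    return sum(prod(row[c] for c in cols) for row in MATRIX)


print(support(["nasi lemak"]))                   # 5
print(support(["nasi lemak", "chicken satay"]))  # 3
```

The sequence of operations is the same for every input, so the computation itself reveals nothing about which transactions matched.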
Evaluating itemsets with a depth-first strategy
Depth-first search would be intuitive for pruning.
[Figure: itemset lattice]
{ rendang } { nasi lemak } { lontong } { chicken satay }
{ rendang, nasi lemak } { rendang, lontong } { rendang, chicken satay } { nasi lemak, lontong } { nasi lemak, chicken satay } { lontong, chicken satay }
{ rendang, nasi lemak, lontong } { rendang, nasi lemak, chicken satay } { nasi lemak, lontong, chicken satay } { rendang, lontong, chicken satay }
{ rendang, nasi lemak, lontong, chicken satay }
Evaluating itemsets with a breadth-first strategy
However, breadth-first search can be done in parallel.
[Figure: itemset lattice, evaluated level by level]
{ rendang } { nasi lemak } { lontong } { chicken satay }
{ rendang, nasi lemak } { rendang, lontong } { rendang, chicken satay } { nasi lemak, lontong } { nasi lemak, chicken satay } { lontong, chicken satay }
{ rendang, nasi lemak, lontong } { rendang, nasi lemak, chicken satay } { nasi lemak, lontong, chicken satay } { rendang, lontong, chicken satay }
{ rendang, nasi lemak, lontong, chicken satay }
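The parallelism comes from the fact that all candidates at one level of the lattice are independent of each other. The Python sketch below evaluates each level as a batch with a thread pool and, to keep the focus on the traversal order, does no pruning. The thread pool is only a stand-in for parallel execution, and `supports_by_level` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import prod

ITEMS = ["rendang", "nasi lemak", "lontong", "chicken satay"]
MATRIX = [
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def support(itemset):
    cols = [ITEMS.index(item) for item in itemset]
    return sum(prod(row[c] for c in cols) for row in MATRIX)


def supports_by_level():
    """Evaluate the lattice one level at a time; each level is one parallel batch."""
    levels = {}
    with ThreadPoolExecutor() as pool:
        for k in range(1, len(ITEMS) + 1):
            candidates = list(combinations(ITEMS, k))
            # No candidate at level k depends on another candidate at level k.
            levels[k] = dict(zip(candidates, pool.map(support, candidates)))
    return levels


for k, level in supports_by_level().items():
    print(k, level)
```

In a secure multi-party setting, batching a whole level this way would let the protocol rounds for many itemsets run together rather than one after another.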
Balancing optimizations with privacy preservation
Challenge: exploring all possible itemsets is slow due to combinatorial explosion.
Pruning the search tree requires us to declassify itemset supports during computation (leak?).
Solution: consider that the algorithm will publish all frequent itemsets, as that is its intended goal.
• We will compare support to the threshold privately, only declassifying the result bit.
• We will prune the search tree based on that bit.
• Not a leak: if the itemset is frequent, we would have learned it from the outputs anyway.
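The pruning idea can be sketched in Python with a mock protected type. The `Secret` class below offers no real protection; it only models the interface of a protected value whose sole exit point is `declassify_ge`, the one-bit comparison result. The class, its methods and `frequent_itemsets` are hypothetical names, and the level-wise loop is a simplified Apriori-style search.

```python
from itertools import combinations


class Secret:
    """Mock protected value; a real system would hold shares or ciphertexts."""

    def __init__(self, value):
        self._value = value

    def __add__(self, other):
        return Secret(self._value + other._value)

    def __mul__(self, other):
        return Secret(self._value * other._value)

    def declassify_ge(self, threshold):
        """Compare privately and reveal only the resulting bit."""
        return self._value >= threshold


def frequent_itemsets(columns, threshold):
    """columns maps each item to its list of Secret bits, one per transaction."""

    def support(itemset):
        total = Secret(0)
        for bits in zip(*(columns[item] for item in itemset)):
            product = bits[0]
            for bit in bits[1:]:
                product = product * bit
            total = total + product
        return total  # stays protected; never declassified directly

    level = [frozenset([i]) for i in sorted(columns)
             if support([i]).declassify_ge(threshold)]
    result, k = list(level), 2
    while level:
        known = set(level)
        alive = sorted(set().union(*level))
        # Prune: keep a candidate only if all of its (k-1)-subsets were frequent.
        candidates = [frozenset(c) for c in combinations(alive, k)
                      if all(frozenset(s) in known for s in combinations(c, k - 1))]
        level = [c for c in candidates if support(sorted(c)).declassify_ge(threshold)]
        result.extend(level)
        k += 1
    return result


ITEMS = ["rendang", "nasi lemak", "lontong", "chicken satay"]
MATRIX = [[1, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0],
          [0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0]]
columns = {item: [Secret(row[j]) for row in MATRIX] for j, item in enumerate(ITEMS)}

for itemset in frequent_itemsets(columns, threshold=3):
    print(sorted(itemset))
```

With a threshold of 3 the search stops after the pair level, and the exact support counts are never revealed, only which itemsets passed the threshold.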