Highly-Functional Highly-Scalable Search
on Encrypted Data
Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki)
Joint works with IBM-UCI teams: David Cash, Sky Faber, Joseph Jaeger, Stanislaw Jarecki, Charanjit Jutla, Hugo Krawczyk, Quan Nguyen, Marcel Rosu, Michael Steiner
(related pubs: Crypto’13, CCS’13, NDSS’14, Esorics’15)
SSE before 2013
Generic tools: FHE, ORAM, PIR
Dedicated SSE solutions:
Single-Keyword Search (SKS) [SWP’00, Goh’03, CGKO’06, ChaKam’10, …]
“privacy optimal” (DB size and query results) but practical limitations
Conjunctions: Very little
naive (n single-keyword searches),
GSW’04: structured-data only, LINEAR, communication-pairings tradeoff
Practicality limitations:
non-scalable design (esp. for adaptive solutions), limited support for dynamic data, lack of massive I/O support for large scale DBs, few prototypes
Our work: extends SSE in 3 dimensions
1. Functionality beyond single-keyword search:
Conjunctions, general Boolean expressions (on keywords), range queries, substring/wildcard queries, phrase queries, relational DBs and free text
2. Scalability:
Terabyte-scale DB, tens of millions of documents/records, billions of indexed document-keyword pairs, dynamic data. (Prototype tested by Lincoln Labs.)
3. New application settings and trust models:
Multiple clients: Data owner can give 3rd parties read access to outsourced encrypted data
Blind authorization: Access can be policy-based (e.g. query by certain DB columns only) without data owner learning the exact query
Example Applications
Example: Hospital outsources DB, provides access to clients (doctors, administrators, insurance companies, etc.)
Policy-based authorization on a per-client/per-query basis
Hospital doesn’t need to learn the query, only (blindly) enforce policy
Good for security, privacy, regulations
Warrant scenario (extended 4-party setting)
Judge provides warrant for a client C (e.g. FBI) to query a DB
DB owner enables access but only to queries allowed by judge
DB owner does not learn warrant content or queries
Client C (e.g., FBI) gets the matching documents for the allowed queries and nothing else
Obama’s 3rd Party Solution? (phone data)
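Example: Medical Application (private outsourcing and delegation)
[Figure: Charlie (client), Debbie (hospital, data owner), Eddie (server storing Enc_D(DB*))]
Preprocessing: Debbie outsources Enc_D(DB*) to Eddie
query := “zip=10598” & “age=(22,50)” & “condition=diabetes”
1: query (Charlie to Debbie)
2: query tokens (Debbie to Charlie)
3: ENC_C(query tokens) (Charlie to Eddie)
4: ENC_C(matching records) (Eddie to Charlie)
5: Charlie decrypts matching records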
Example: Medical Application (+ blind authorization)
[Figure: Charlie (client), Debbie (hospital, data owner), Eddie (server storing Enc_D(DB*))]
query := “zip=10598” & “age=(22,50)” & “condition=diabetes”
1: ENC_C(query) (Charlie to Debbie)
2: ENC_C(query tokens*) (Debbie to Charlie)
3: ENC_C(query tokens) (Charlie to Eddie)
4: ENC_C(matching records) (Eddie to Charlie)
5: Charlie decrypts matching records
Debbie authorizes the query according to policy without even learning what the query is! (e.g. for regulatory compliance/liability requirements)
Scalable & Featureful Implementation
Validated on synthetic census data: 10 terabytes, 100 million records, > 100,000,000,000 = 10^11 indexed record-keyword pairs!
Equivalent to DB with 1 record per US citizen with 400 keywords in each
with support for:
arbitrary Boolean queries
range queries, substring/wildcards, phrase queries
dynamic data (record additions, deletions, modifications)
preprocessing cost is linear in DB size (minutes-days for above DBs)
search proportional to # documents matching w1 (least frequent term)
Single-round: Tokens from client to server, matching results back
Query response time: Competitive w/ plaintext queries on indexed DB
4 seconds: fname='CHARLIE' AND sex='Female' AND NOT (state='NY' OR state='MA' OR state='PA' OR state='NJ') on 100M records/22 billion index entries US-Census DB
Crypto Design-Engineering Synergy
Major effort to build I/O-friendly data structures
Critical decision: Do not design/implement for RAM-resident data structures (they severely limit scalability)
On disk we need to avoid random access (e.g., avoid Bloom filters on disk; Bloom filters in RAM help reduce I/O accesses but have high false-positive rates for large DBs)
Challenge: Need randomized data structures to reduce leakage and need structured ones to improve I/O performance (locality of access)
Use of EC groups implies very fast exponentiation (esp. with same base); typically I/O and network latency dominate cost
On a midsize storage system: ~300 IOPS (I/O Operations Per Second)
~1000 exponentiations per random I/O access (133 w/o same-base optimization)
On the other hand, HMAC cost is noticeable (we use CMAC)
500,000 exponentiations/sec on 8 cores with same-base optimization; 100-1000 per I/O
Carefully designed to scale beyond RAM
Big challenge: TSet Π (oblivious hash) data structure. Security implies maximal randomization yet efficiency calls for maximal “sequentialization” in disk and DB access!
Πpack [Crypto’13]: handle >100x larger datasets than previous work
Π2lev [NDSS’14]: handle >100x larger datasets than Πpack
Prototype
[Figure: prototype architecture showing DB (clear-text data), client pre-processing application, aux. tables, cloud server with EDB (Π), client application, and updates]
Security “The challenge of being imperfect”
Security-Performance trade-offs:
Leakage in the form of data access patterns (e.g. repeated retrieval), query patterns (e.g., repeated queries)
+ additional leakage (more complex functions of DB and query history)
Can lead to statistical inference based on side information on data (application dependent), can be alleviated by masking techniques
Security proofs: Formal model and precise provable leakage profile
Leakage profile: provides upper bounds on what is learned by the attacker -- definitions follow the simulation paradigm [CGKO, CK]
Security Formalism (adversarial server)
Based on the simulation-based definitions given for SKS [CGKO,CK].
There is an attacker E (acting as the server), a simulator Sim and a leakage function L(DB, queries):
Real: Attacker E chooses DB and gets the pre-processed encrypted DB, then interacts with client on adaptively chosen queries
Ideal: Attacker E chooses DB and queries (adaptively), E gets Sim(L(DB)) and Sim(L(DB,queries))
An SSE scheme is semantically secure with leakage L if for all attackers E, there is a simulator Sim such that the views of E in both experiments are indistinguishable
Server learns nothing beyond the specified leakage L even if it knows (and even if it chooses adaptively) the plaintext DB and queries
Some Techniques
#1: Conjunctive SSE
Boolean Query Support: Basic Ideas
Consider a conjunctive query AND(w1, w2, … wn)
(Boolean queries supported as a surprisingly simple extension)
1. s-term: choose the least frequent conjunctive term, say w1:
Find encrypted indexes of all documents containing w1 (w/o revealing w1)
Standard Encrypted Index: Enc(w1) → Enc(ind1), Enc(ind2), …, Enc(indk)
2. Suppose you had an oracle O s.t. O(Enc(ind), Enc(w)) = 1 iff ind ∊ D(w) ⇒ return Enc(ind) in the Enc(w1) list if O(Enc(ind), Enc(wi)) = 1 for i = 2,...,n
I. Use PRF1(ind) instead of ind and PRF2(w) instead of w
II. Precompute H(ind,w) = (PRF2(w))^PRF1(ind) for all w and ind ∊ D(w)
III. O(Enc(ind), Enc(w)) returns 1 if H(ind,w) is in the precomputed set
Implement encryption of PRF1(ind) and PRF2(w) s.t. H(ind,w) can be computed given Enc(PRF1(ind)) and Enc(PRF2(w))
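The two-step strategy (pick the s-term, then test each candidate against the remaining terms) can be illustrated on plaintext data. This is a minimal sketch with no cryptography: the function name `conjunctive_search` and the toy index are made up for illustration, and an ordinary set-membership test stands in for the oracle O.

```python
def conjunctive_search(index, terms):
    """Return ids of documents that contain every keyword in `terms`.

    `index` maps keyword w -> set D(w) of document ids. It is a plaintext
    stand-in for the encrypted index; nothing here is encrypted.
    """
    # Step 1: the s-term is the least frequent term of the conjunction.
    s_term = min(terms, key=lambda w: len(index.get(w, set())))
    others = [w for w in terms if w != s_term]
    # Step 2: for each ind in D(s_term), ask the "oracle" whether ind is in D(w)
    # for every remaining term w.
    return sorted(
        ind
        for ind in index.get(s_term, set())
        if all(ind in index.get(w, set()) for w in others)
    )


index = {
    "zip=10598": {1, 2, 3, 4, 5},
    "condition=diabetes": {2, 5},
    "sex=F": {1, 2, 4},
}
print(conjunctive_search(index, ["zip=10598", "sex=F", "condition=diabetes"]))  # [2]
```

The work is proportional to |D(w1)| for the s-term w1, not to the size of the larger posting lists, which is the search-cost behavior the slides claim.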
Oblivious PRF Computation (OPRF) [NR’04,FIPR’05]
Multiple instantiations ([Yao’82], [FIPR’01], [JL’09], [JL’10], …)
Fastest (2 exp’s/party) is Hashed-DH PRF: fk(x) = [H(x)]^k
Oblivious computation via “Blind DH Computation”:
(C sends a = [H(x)]^r to S, S replies with b = a^k, C computes fk(x) as b^(1/r))
OPRF with enforcing access policy on query x: extensions…
[Figure: OPRF protocol between S (input k) and C (input x); C outputs fk(x), S outputs ⊥]
fk(x) is a Pseudo-Random Function (PRF) if an adversary Adv that queries x cannot tell whether it receives fk(x) or a random value ($)
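The blind DH computation can be traced end to end in a few lines. The sketch below is only an illustration: it assumes a toy group (the squares modulo the safe prime 2039, of prime order 1019), which is far too small to be secure, whereas the slides use EC groups. The function names are hypothetical.

```python
import hashlib
import secrets

# Toy parameters (assumption, insecure): P = 2Q + 1 is a safe prime, so the
# squares modulo P form a subgroup of prime order Q.
P, Q = 2039, 1019


def hash_to_group(x: bytes) -> int:
    """H(x): hash into the order-Q subgroup by squaring a nonzero residue."""
    h = int.from_bytes(hashlib.sha256(x).digest(), "big") % (P - 1) + 1
    return pow(h, 2, P)


def prf(k: int, x: bytes) -> int:
    """Hashed-DH PRF f_k(x) = H(x)^k, computed directly with the key."""
    return pow(hash_to_group(x), k, P)


def client_blind(x: bytes):
    """C picks a random r and sends a = H(x)^r."""
    r = secrets.randbelow(Q - 1) + 1
    return r, pow(hash_to_group(x), r, P)


def server_eval(k: int, a: int) -> int:
    """S replies with b = a^k without seeing x."""
    return pow(a, k, P)


def client_unblind(r: int, b: int) -> int:
    """C computes b^(1/r), where 1/r is the inverse of r modulo Q."""
    return pow(b, pow(r, -1, Q), P)


k = secrets.randbelow(Q - 1) + 1
r, a = client_blind(b"zip=10598")
b = server_eval(k, a)
assert client_unblind(r, b) == prf(k, b"zip=10598")
```

Each party performs two exponentiations in the protocol proper (C: blind and unblind; S: one, plus whatever it needs for policy extensions), which is consistent with the "2 exp's/party" figure for the client side. `pow(r, -1, Q)` requires Python 3.8 or later.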
Interactive Solution
Let xind = PRF1(ind) and xtrap = PRF2(w)
H(ind,w) = xtrap^xind in a group with DDH
C can compute xtrap with D using Oblivious PRF (for Blind Authorization)
C has xtrap from D; if E had xind in the list then:
C to E: a ← xtrap^z for z random in Zp
E to C: b ← a^xind
C to E: xtag ← b^(1/z) (= xtrap^xind)
E searches xtag in XSet
Two problems:
Requires 2 rounds of interaction
If encrypted index Enc(w) includes xind’s then E sees repeated xind’s
Non-Interactive Solution
Interactive: Encrypted index Enc(w) stores xind’s, C has xtrap
C to E: a ← xtrap^z, z random in Zp
E to C: b ← a^xind
C to E: xtag ← b^(1/z) (= (xtrap^z)^(xind/z) = xtrap^xind)
Non-interactive: Enc(w) stores each xind blinded with an OTP factor z
1. At setup: Enc(w) includes y = xind·z^(-1) at c-th position for z = PRF(w,c)
2. At search: C sends a = xtrap^z to E (recall that xtrap = PRF(wi))
3. E retrieves y from EDB and sets xtag ← a^y (= (xtrap^z)^(xind/z) = xtrap^xind)
Avoids interaction (and prevents E from learning xtrap or xind)
Leakage for Conjunctions w1… wn
Index size = upper bound on Σ_i |D(wi)|
Number of terms in each conjunction query
Size of s-term set |Rec(w1)| (unavoidable? reduction from 3SUM)
Repeated s-terms (i.e. s-term access pattern)
Size of Rec(w1 ∧ wj) for j = 2,…,n
For queries w1 ∧ x and w1’ ∧ x’, if x = x’ and D(w1 ∧ w1’) is non-empty then E learns that x = x’ and the size of D(w1 ∧ w1’)
Non-interactivity ⇒ Malicious server E cannot learn anything more
From Conjunctions to Boolean Queries
Boolean expressions of the form w1 ∧ B(w2,...,wn) resolved as above
Find all ind’s that have w1
For each ind check if wi is in D(ind) and evaluate B on the resultant 0/1 values
Can extend to disjunctions of the above form
If query cannot be put in this form we set w1 as “true” (linear search)
Leakage is same as for conjunction w1 ∧ w2 ∧ … ∧ wn
(+ Boolean formula is leaked)
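The Boolean extension reuses the conjunctive machinery: only the final test changes. A plaintext sketch follows, with a hypothetical `boolean_search` helper, a toy index, and B passed in as an ordinary Python function. Membership tests again stand in for the encrypted oracle.

```python
def boolean_search(index, w1, other_terms, formula):
    """Evaluate w1 AND B(w2, ..., wn) over a plaintext keyword index.

    `formula` receives one boolean per entry of `other_terms`, telling
    whether the candidate document contains that keyword.
    """
    hits = []
    for ind in sorted(index.get(w1, set())):  # all ind's that have w1
        bits = [ind in index.get(w, set()) for w in other_terms]
        if formula(*bits):  # evaluate B on the 0/1 values
            hits.append(ind)
    return hits


index = {
    "fname=CHARLIE": {1, 2, 3},
    "sex=Female": {1, 3, 4},
    "state=NY": {3},
    "state=MA": {5},
}
hits = boolean_search(
    index,
    "fname=CHARLIE",
    ["sex=Female", "state=NY", "state=MA"],
    lambda female, ny, ma: female and not (ny or ma),
)
print(hits)  # [1]
```

The server still touches only the w1 list, which is why the leakage matches that of the plain conjunction apart from the formula itself.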
Some Techniques
#2: Multi-Client SSE with Blind Auth.
#3: Dynamic SSE (i.e. Dynamic DB)
Multi-Client SSE with Blind Authorization
Keywords are attribute-value pairs, e.g. (name,Joe), (zipcode,02415)
Attribute-based policies (“Is client C authorized for query Q?”)
Data owner D can apply access Policy based on attributes not values
E.g. C can query name and lastname but only with one of (zipcode, school)
Permissions can depend on client type and Boolean query formula
D enforces policy w/o learning the queried values, only the attributes
Blind (PIR-like) Authorization
Keywords tagged (e.g., xtrap) with Oblivious PRF (H(w))^k
Meshes well with our DH-based OPRF used to instantiate H(ind,w)
Attribute-based authorization enforced via attribute-specific OPRF keys
Mix-and-Match attacks against conjunctive (and Boolean) queries prevented via blinded homomorphic signatures
Authorization Cost: one round, 3n exponentiations (n = # keywords in query)
Secure against fully malicious C and D, assuming D and E do not collude
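One way to picture attribute-specific OPRF keys is the sketch below. It is an assumption-laden illustration, not the protocol from the papers: the group is a toy one (insecure), the `DataOwner` class and `example_policy` function are hypothetical, the policy is a loose reading of the "name and lastname but only one of (zipcode, school)" example, and the blinded homomorphic signatures that stop mix-and-match attacks are omitted entirely.

```python
import hashlib
import secrets

P, Q = 2039, 1019  # toy safe prime P = 2Q + 1 (assumption, insecure)


def hash_to_group(w: bytes) -> int:
    h = int.from_bytes(hashlib.sha256(w).digest(), "big") % (P - 1) + 1
    return pow(h, 2, P)


def blind(w: bytes):
    """Client side: hide the attribute-value pair w behind a random exponent."""
    r = secrets.randbelow(Q - 1) + 1
    return r, pow(hash_to_group(w), r, P)


def unblind(r: int, b: int) -> int:
    return pow(b, pow(r, -1, Q), P)


def example_policy(attrs) -> bool:
    """Allow name/lastname freely, but at most one of zipcode and school."""
    allowed = {"name", "lastname", "zipcode", "school"}
    requested = set(attrs)
    return requested <= allowed and not {"zipcode", "school"} <= requested


class DataOwner:
    """D holds one OPRF key per attribute and sees only attribute names."""

    def __init__(self, attributes):
        self.keys = {a: secrets.randbelow(Q - 1) + 1 for a in attributes}

    def authorize(self, requests, policy):
        """requests: list of (attribute, blinded group element)."""
        if not policy([attr for attr, _ in requests]):
            raise PermissionError("query attributes not allowed by policy")
        return [pow(a, self.keys[attr], P) for attr, a in requests]


owner = DataOwner(["name", "lastname", "zipcode", "school"])
r, a = blind(b"name=Joe")
(b,) = owner.authorize([("name", a)], example_policy)
token = unblind(r, b)  # equals H("name=Joe")^k_name; D never saw "Joe"
```

If a client claims the wrong attribute for a value, the result is exponentiated under the wrong key and so will not match tags that were built with the correct attribute key, which is the intuition behind enforcing policy through per-attribute keys.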
Dynamic DB (updates)
Updates: add, delete, modify
Operational assumption: |DB| >> updates; thus:
EDB super-optimized for disk access but update structures planned for RAM
Periodic re-encryption eliminates leakage trails (e.g. new/old records)
Caching
Client can identify previously retrieved documents in result set before requesting them
Will retrieve a previously retrieved document only if the document changed since last retrieval
Important defense against server learning intersection b/w queries
Some Techniques
#4: Range SSE
#5: Substring/Wildcard/... SSE
Range Queries
E.g., return all records with DOB between 2/3/87 and 3/4/88
Compute a cover of the range by intervals (via a tree cover algorithm)
We add log M attributes (for M = max interval) to the DB
Generate a disjunction of values representing each of the intervals
Thus Boolean SSE protocol implies Range SSE !
More interesting part: Authorization
Debbie authorizes based on total size of range (e.g., query span ≤ one year)
Need to let Debbie learn the total size of the range w/o leaking the endpoints (e.g., # of intervals should be the same for any two ranges of the same size)
Two solutions: Canonical covers (all ranges of a given size have same-length intervals); 3-node over-covers (always 3 intervals with 40% avg overhead)
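The reduction from a range to a disjunction can be sketched with a standard binary-tree (dyadic) cover. This is a generic cover chosen for illustration; it is not the canonical cover or the 3-node over-cover named above, and the function names and integer encoding of values are assumptions.

```python
def dyadic_cover(lo: int, hi: int):
    """Cover the integer range [lo, hi] with aligned power-of-two blocks.

    Returns a list of (start, size) pairs, each a node of a binary tree
    built over the value domain.
    """
    cover = []
    while lo <= hi:
        size = 1
        # Grow the block while it stays aligned at lo and inside [lo, hi].
        while lo % (2 * size) == 0 and lo + 2 * size - 1 <= hi:
            size *= 2
        cover.append((lo, size))
        lo += size
    return cover


def record_labels(value: int, levels: int):
    """The extra attributes stored with a record: one tree node per level."""
    return {(lvl, value >> lvl) for lvl in range(levels)}


def range_to_disjunction(lo: int, hi: int):
    """Turn a range into the OR of (level, node) keywords to search for."""
    terms = []
    for start, size in dyadic_cover(lo, hi):
        lvl = size.bit_length() - 1  # log2(size) for a power of two
        terms.append((lvl, start >> lvl))
    return terms


print(dyadic_cover(3, 9))          # [(3, 1), (4, 4), (8, 2)]
print(range_to_disjunction(3, 9))  # [(0, 3), (2, 1), (1, 4)]
```

A record with value v matches the range exactly when one of its `record_labels` appears in the disjunction, so the Boolean search protocol applies unchanged. Note that in this plain cover the number of intervals depends on the endpoints, which is the leak toward the authorizer that the two solutions on the slide are designed to avoid.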
Substring/Wildcard Queries
Substrings/Wildcards:
Any combination of _ (“single character wildcard”) and substrings of k or more consecutive characters (k is a tunable parameter)
Plus a % (“any string”) at the beginning and/or end of a search expression
Examples (return “encyclopedia”): (i) %cycl_ _ _dia (ii) %lope% (iii) enc_clo_ _dia
Tunable k:
smaller for more general queries
larger for search efficiency
Technique: Distance-Aware Queries
Represent substring query as conjunction of k-grams
(k=3) query %encycl% becomes %enc% AND (%ycl% @ distance 3)
If D(ind) has k-gram w at position pos then add tag H(ind,w,pos) to the tag set
Need to compute H(ind,w,pos+∆) given PRF(w), ∆, and Enc(ind,pos)
Define H(ind,w,p) as (PRF2(w))^((PRF1(ind))^p) [H is a PRF under q-DH]
Generalizes to any search of the form:
Return all records where element e1 is at distance ∆ from e2
Examples:
ei are k-grams: resolves substrings and wildcards
ei are textual words: resolves phrases (e.g., “Ecole Normale Superieure”)
Multi-dimensional distances (e.g., grid), etc.
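The distance-aware decomposition can be seen on plaintext before any tags are involved. The sketch below uses hypothetical helper names and handles only plain substring patterns (%pattern%), not the single-character wildcard. It checks a pattern using nothing but "k-gram w occurs at position pos" facts, which is the information the position-aware tags encode.

```python
def kgram_positions(text: str, k: int = 3):
    """Map each k-gram of `text` to the set of positions where it starts."""
    table = {}
    for pos in range(len(text) - k + 1):
        table.setdefault(text[pos:pos + k], set()).add(pos)
    return table


def contains_substring(table, pattern: str, k: int = 3) -> bool:
    """Decide %pattern% from (k-gram, position) membership tests only."""
    if len(pattern) < k:
        raise ValueError("pattern must have at least k characters")
    # Offsets 0, k, 2k, ... plus a final (possibly overlapping) k-gram so
    # that every character of the pattern is covered.
    offsets = list(range(0, len(pattern) - k + 1, k))
    if offsets[-1] != len(pattern) - k:
        offsets.append(len(pattern) - k)
    # The first k-gram plays the role of the s-term; the others are checked
    # at fixed distances from it.
    return any(
        all(pos + d in table.get(pattern[d:d + k], set()) for d in offsets)
        for pos in table.get(pattern[:k], set())
    )


table = kgram_positions("encyclopedia")
print(contains_substring(table, "encycl"))   # True: "enc" and "ycl" at distance 3
print(contains_substring(table, "lope"))     # True: "lop" and "ope" at distance 1
print(contains_substring(table, "cyclone"))  # False
```

Replacing k-grams by whole words and character positions by word positions gives the phrase-query variant mentioned in the examples.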
(Some) Research Questions
Leveraging other tools: extend functionality and improve privacy via multiparty computation? Identify modules that can benefit from ORAM?
Fundamental tradeoffs: Cost of preventing leakage (e.g., leakage via computation time, size of least frequent conjunctive term, etc.)
“Semantic leakage”: Proving formal leakage is nice but how bad is it for a given particular application, what forms of masking can help?
Can we have a theory to help us reason about it (cf. differential privacy)?
A theory of leakage composition?
Specific attacks welcome (e.g. Islam-Kuzu-Kantarcioglu NDSS’12)
Characterizing privacy-friendly plaintext search algorithms/data structures
Verifying result sets: easy for single-keyword, possible for SSE/Boolean, seems harder with multi-client
Crypto’2013: conjunctive SSE eprint.iacr.org/2013/169
CCS’2013: multi-client + blind auth. SSE eprint.iacr.org/2013/720
NDSS’2014: dynamic, much faster for large DBs eprint.iacr.org/2014/?
Esorics’2015: ranges and substrings (+ wildcards & phrases)
Thanks!