highly-functional highly-scalable search on encrypted data hugo krawczyk, ibm (speaker: stanislaw...

28
Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk , IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash, Sky Faber, Joseph Jaeger, Stanislaw Jarecki, Charanjit Jutla, Hugo Krawczyk, Quan Nguyen, Marcel Rosu, Michael Steiner (related pubs: Crypto’13, CCS’13, NDSS’14, Esorics’15) 1

Upload: tracey-parsons

Post on 18-Jan-2016

225 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Highly-Functional Highly-Scalable Search

on Encrypted Data

Hugo Krawczyk , IBM(Speaker: Stanislaw Jarecki)

Joint works with IBM-UCI teams: David Cash, Sky Faber, Joseph Jaeger, Stanislaw Jarecki, Charanjit Jutla, Hugo Krawczyk, Quan Nguyen, Marcel

Rosu, Michael Steiner

(related pubs: Crypto’13, CCS’13, NDSS’14, Esorics’15)

1

Page 2: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

SSE before 2013

Generic tools: FHE, ORAM, PIR

Dedicated SSE solutions:

Single-Keyword Search (SKS) [SWP’00, Goh’03, CGKO’06, ChaKam’10, …]

“privacy optimal" (DB size and query results) but practical limitations

Conjunctions: Very little

naive (n single-keyword searches),

GSW’04: structured-data only, LINEAR, communication-pairings tradeoff

Practicality limitations:

non-scalable design (esp. for adaptive solutions), limited support for dynamic data, lack of massive I/O support for large scale DBs, few prototypes

2

Page 3: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Our work: extends SSE in 3 dimensions

1. Functionality beyond single-keyword search:

Conjunctions, general Boolean expressions (on keywords), range queries, substring/wildcard queries, phrase queries, relational DBs and free text

2. Scalability:

Terabyte-scale DB, 10’s millions documents/records, billions of indexed document-keyword pairs, dynamic data. (Prototype tested by Lincoln Labs.)

3. New application settings and trust models:

Multiple clients: Data owner can give 3rd parties read access to outsourced encrypted data

Blind authorization: Access can be policy-based (e.g. query by certain DB columns only) without data owner learning the exact query

4

Page 4: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Example Applications

Example: Hospital outsources DB, provides access to clients (doctors, administrators, insurance companies, etc.)

Policy-based authorization on a client/query-basis

Hospital doesn’t need to learn the query, only (blindly) enforce policy

Good for security, privacy, regulations

Warrant scenario (extended 4-party setting)

Judge provides warrant for a client C (e.g. FBI) to query a DB

DB owner enables access but only to queries allowed by judge

DB owner does not learn warrant content or queries

Client C (e.g., FBI) gets the matching documents for the allowed queries and nothing else

5

Obama’s 3rd Party Solution? (phone data)

Page 5: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Example: Medical Application(private outsourcing and delegation)

Charlie

Preprocessing

query := “zip=10598” & “age=(22,50)” & “condition=diabetes”

1: query2: query tokens

4:EN

C C(m

atch

ing

reco

rds)

Debbiehospital

5: decrypts matching records

EncD(DB*)

Eddie

3:EN

C C(q

uery

toke

ns)

6

Page 6: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

------------ 1:

ENCC (query)

query := “zip=10598” & “age=(22,50)” & “condition=diabetes”

1: query2: query tokens

4:EN

C C(m

atch

ing

reco

rds)

Debbiehospital

5: decrypts matching records

EncD(DB*)

Eddie

Debbie authorizes query according to policy without even learning what the query is! (e.g. for regulatory compliance/liability reqt’s)

7

Example: Medical Application (+ blind authorization)

Charlie

3:EN

C C(q

uery

toke

ns)

2: ENCC (query tokens *)

----------------------

IBM_USER
Add rectangle:query privacy from Debbie assumes no collusion with Eddie
Page 7: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Scalable & Featureful Implementation

Validated on synthetic census data: 10Terabytes, 100 million records, > 100,000,000,000=1011 indexed record-keyword pairs !

Equivalent to DB with 1 record per US citizen with 400 keywords in each

with support for:

arbitrary Boolean queries

range queries, substring/wildcards, phrase queries

dynamic data (record additions, deletions, modifications)

preprocessing cost is linear in DB size (minutes-days for above DBs)

search proportional to # documents matching w1 (least frequent term)

Single-round: Tokens from client to server, matching results back

Query response time: Competitive w/ plaintext queries on indexed DB

8

4 seconds: fname='CHARLIE' AND sex='Female' AND NOT (state='NY' OR state='MA' OR state='PA' OR state='NJ) on 100M records/22Billion index entries US-Census DB

IBM_USER
# charlie = 52,530 result set = 2,871.
Page 8: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Crypto Design-Engineering Synergy

Major effort to build I/O-friendly data structures

Critical decision: Do not design/implement for RAM-resident data structures (they severely limit scalability)

On disk we need to avoid random access (e.g., avoid Bloom filters on disk, BF in RAM help reduce I/O access but high false positives for large DBs)

Challenge: Need randomized data structures to reduce leakage and need structured ones to improve I/O performance (locality of access)

Use of EC groups imply very fast exponentiation (esp. w/ same-base), typically: I/O and network latency dominate cost

On a midsize storage system: ~300 IOPS (I/O Operations Per Second)

~1000 expon’s per random I/O access (133 w/o same-base optimization)

On the other hand, HMAC cost is noticeable (we use CMAC)

9

500,000/sec, 8 cores, same-base opt , 100-1000 per IO

Page 9: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Carefully designed to scale beyond RAM

Big challenge TSet Π (oblivous hash) datastructure: Security implies maximal randomization yet efficiency calls for maximal “sequentialization” in disk and DB access!!

Πpack [Crypto’13]: handle >100x larger datasets than previous work

Π2lev [NDSS’14]: handle >100x larger datasets than Πpack

Prototype

10

DB (clear-text data)

ClientPre-processingApplication

Aux.tables

Cloud Server

EDB (Π)

ClientApplication

Updates

ADMINIBM
The same slide with animation is in the backup part
Page 10: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Security “The challenge of being imperfect”

Security-Performance trade-offs:

Leakage in the form of data access patterns (e.g. repeated retrieval), query patterns (e.g., repeated queries)

+ additional leakage (more complex functions of DB and query history)

Can lead to statistical inference based on side information on data (application dependent), can be alleviated by masking techniques

Security proofs: Formal model and precise provable leakage profile

Leakage profile: provides upper bounds on what is learned by the attacker -- definitions follow the simulation paradigm [CGKO, CK]

11

Page 11: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Security Formalism (adversarial server)

Based on the simulation-based definitions given for SKS [CGKO,CK].

There is an attacker E (acting as the server), a simulator Sim and a leakage function L(DB, queries):

Real: Attacker E chooses DB and gets the pre-processed encrypted DB, then interacts with client on adaptively chosen queries

Ideal: Attacker E chooses DB and queries (adaptively), E gets Sim(L(DB)) and Sim(L(DB,queries))

A SSE scheme is semantically secure with leakage L if for all attackers E, there is a simulator Sim such that the views of E in both experiments are indistinguishable

Server learns nothing beyond the specified leakage L even if it knows (and even if it chooses adaptively) the plaintext DB and queries

12

Page 12: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Some Techniques

#1: Conjunctive SSE

13

Page 13: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Boolean Query Support: Basic Ideas

Consider a conjunctive query AND(w1, w2, … wn)

(Boolean queries supported as a surprisingly simple extension)

1. s-term: choose the least frequent conjunctive term, say w1:

Find encrypted indexes of all documents containing w1 (w/o

revealing w1)

Standard Encrypted Index: Enc(w1) Enc(ind1), Enc(ind2), … ,

Enc(indk)

2. Suppose you had oracle O s.t. O(Enc(ind) , Enc(w)) = 1 iff ind ∊ D(w) ⇒ return Enc(ind) in Enc(w1) list if

O(Enc(ind),Enc(wi))=1 for i=2,...,n

I. Use PRF1(ind) instead of ind and PRF2(w) instead of w

II. Precompute H(ind,w) = (PRF1(w)) PRF2(ind) for all w and ind∊D(w)

III. O(Enc(ind),Enc(w)) returns 1 if H(ind,w) is in the precomputed set

Implement encryption of PRF1(ind) and PRF2(w) s.t.

H(ind,w) can be computed given Enc(PRF2(ind)) and Enc(PRF1(w))

14

Page 14: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

15

Oblivious PRF Computation (OPRF) [NR’04,FIPR’05]

Multiple instantiations ([Yao’82], [FIPR’01], [JL’09], [JL’10], …)

Fastest (2 exp’s/party) is Hashed-DH PRF: fk(x)=[H(x)]k

Oblivious computation via “Blind DH Computation”:

(C sends a = [H(x)] r to S, S replies with b = ak, C computes fk(x) as b 1/r)

OPRF with enforcing access policy on query x: extensions…

S(k) C(x)

fk(x)丄

fk(x) is a Pseudo-Random Function (PRF) if

OPRF protocol

x

fk(x) or $fk-or-$ Adv

?

Page 15: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Interactive Solution Let xind = PRF1(ind) and xtrap = PRF2(w)

H(ind,w) = xtrapxind in group with DDH

C can compute xtrap with D using Oblivious PRF (for Blind Authorization)

C has xtrap from D, if E had xind in the list then:

C to E: a ⃪ xtrapz for z random in Zp

E to C: b ⃪ axind

C to E: xtag ⃪ b1/z (= xtrapxind)

E searches xtag in XSet

Two problems: Requires 2 rounds of interaction If encrypted index Enc(w) includes xind’s then E sees repeated

(x)ind’s 16

Page 16: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Non-Interactive Solution

Interactive: Encrypted index Enc(w) stores xind’s, C has xtrap

C to E: a ⃪ xtrapz z random in Zp

E to C: b ⃪ axind

C to E: xtag ⃪ b1/z (= (xtrapz)xind/z = xtrapxind )

Non-interactive: Enc(w) stores each xind blinded with a OTP factor z

1. At setup: Enc(w) includes y=xindz-1 at c-th position for z=PRF(w,c)

2. At search: C sends a=xtrapz to E ( recall that xtrap=PRF(wi) )

3. E retrieves y from EDB and sets xtag ⃪ ay ( = (xtrapz)xind/z = xtrapxind )

17

avoids interaction (and prevents E from learning xtrap or xind)

Page 17: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Leakage for Conjunctions w1… wn

Index size = upper bound on i |D(wi)|

Number of terms in each conjunction query

Size of s-term set |Rec(w1)| (unavoidable? reduction from

3SUM)

Repeated s-terms (i.e. s-term access pattern)

Size of Rec(w1wj) for j=2,…, n

For queries w1 x and w1’ x’ , if x=x’ and D(w1 w1’) is non-

empty then E learns that x=x’ and the size of D(w1 w1’)

Non-interactivity ⇒ Malicious server E cannot learn anything more

18

Page 18: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

From Conjunctions to Boolean Queries

Boolean expressions of the form w1 B(w2,...,wn) resolved as

above

Find all ind’s that have w1

For each ind check if wi is in D(ind) and evaluate B on resultant 0/1

values

Can extend to disjunctions of above form

If query cannot be put in this form we set w1 as “true” (linear

search)

Leakage is same as for conjunction w1w2 … wn

(+ Boolean formula is leaked)

19

Page 19: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Some Techniques

20

#2: Multi-Client SSE with Blind Auth.#3: Dynamic SSE (i.e. Dynamic DB)

Page 20: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

21

Multi-Client SSE with Blind Authorization

Keywords are attribute-value pairs, e.g. (name,Joe),

(zipcode,02415)

Attribute-based policies (“Is client C authorized for query Q?”)

Data owner D can apply access Policy based on attributes not values

E.g. C can query name and lastname but only with one of (zipcode, school)

Permissions can depend on client type and Boolean query formula

D enforces policy w/o learning the queried values, only the attrib’s

Page 21: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Blind (PIR-like) Authorization

Keywords tagged (e.g., xtrap) with Oblivious PRF (H(w))k

Meshes well with our DH-based OPRF used to instantiate H(w,ind)

Attribute-based authoriz’n enforced via attribute-specific OPRF keys

Mix-and-Match attacks against conjunctive (and Boolean) queries prevented via blinded homomorphic signatures

Authorization Cost: one round, 3n exp’s (n = # keywords in query)

Secure against fully malicious C and D, assuming D&E do not collude

22

Page 22: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Dynamic DB (updates)

Updates: add, delete, modify

Operational assumption: |DB| >> updates; thus:

EDB super-optimized for disk access but update structures planned for RAM

Periodic re-encryption eliminates leakage trails (e.g. new/old records)

Caching

Client can identify previously retrieved documents in result set before requesting them

Will retrieve a previously retrieved document only if the document changed since last retrieval

Important defense against server learning intersection b/w queries

Page 23: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Some Techniques

24

#4: Range SSE#5: Substring/Wildcard/... SSE

Page 24: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Range Queries

E.g., return all records with DOB between 2/3/87 and 3/4/88

Compute a cover of the range by intervals (via a tree cover algorithm)

We add log M attributes (for M = max interval) to the DB

Generate a disjunction of values representing each of the intervals

Thus Boolean SSE protocol implies Range SSE !

More interesting part: Authorization

Debbie authorizes based on total size of range (eg. query span ≤ one

year)

Needs to let Debbie learn the total size of the range w/o leaking on the endpoints? (e.g.., # of intervals should be same for any two ranges of

same size)

Two solutions: Canonical covers (all ranges of given size have same-lengths intervals); 3-node over-covers (always 3 intervals with 40% avg overhead)

25

Page 25: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Substring/Wildcard Queries

Substrings/Wildcards:

Any combination of _ (“single character wildcard”) and substrings of k or more consecutive characters (k is a tunable parameter)

Plus a % (“any string”) at the beginning and/or end of a search expression

Examples (return “encyclopedia”): (i) %cycl_ _ _dia (ii) %lope% (iii) enc_clo_ _dia

Tunable k:

smaller for more general queries

larger for search efficiency

26

Page 26: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Technique: Distance-Aware Queries

Represent substring query as conjunction of k-grams

(k=3) query %encycl% becomes %enc% AND (%ycl% @ distance 3)

If D(ind) has k-gram w at position pos then add tag H(ind,w,p) to tag set

Need to compute H(ind,w,pos+∆) given PRF(w), ∆, and Enc(ind,pos)

Define H(w,ind,p) as (PRF1(w)) { PRF2(ind) p

} [H a PRF under q-DH]

Generalizes to any search of the form:

Return all records where element e1 is at distance ∆ from e2

Examples:

ei are k-grams: resolves substrings and wildcards

ei are textual words: resolves phrases (e.g., “Ecole Normale

Superieure”)

Multi-dimensional distances (e.g., grid), etc.

27

Page 27: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

(Some) Research Questions

Leveraging other tools: extend functionality and improve privacy via multiparty comp’n? identify modules that can benefit from ORAM?

Fundamental tradeoffs: Cost of preventing leakage (e.g., leakage via computation time, size of least frequent conjunctive term, etc.)

“Semantic leakage”: Proving formal leakage is nice but how bad is it for a given particular application, what forms of masking can help?

Can we have a theory to help us reason about it (cf. differential privacy)?

A theory of leakage composition?

Specific attacks welcome (e.g. Islam-Kuzu-Kantarcioglu NDSS’12)

Characterizing privacy-friendly plaintext search algorithms/data str.

Verifying results set: easy for single-keyword, possible for

SSE/Boolean, seems harder with multi-client

28

Page 28: Highly-Functional Highly-Scalable Search on Encrypted Data Hugo Krawczyk, IBM (Speaker: Stanislaw Jarecki) Joint works with IBM-UCI teams: David Cash,

Crypto’2013: conjunctive SSE eprint.iacr.org/2013/169

CCS’2013: multi-client + blind auth. SSE eprint.iacr.org/2013/720

NDSS’2014: dynamic, much faster for large DBs

eprint.iacr.org/2014/?

Esorics’2015: ranges and substrings (+ wildcards & phrases)

29

Thanks!