highly-functional highly-scalable search on encrypted data hugo krawczyk, ibm (speaker: stanislaw...

Highly-Functional Highly-Scalable Search

on Encrypted Data

Hugo Krawczyk , IBM(Speaker: Stanislaw Jarecki)

Joint works with IBM-UCI teams: David Cash, Sky Faber, Joseph Jaeger, Stanislaw Jarecki, Charanjit Jutla, Hugo Krawczyk, Quan Nguyen, Marcel

Rosu, Michael Steiner

(related pubs: Crypto’13, CCS’13, NDSS’14, Esorics’15)

1

SSE before 2013

Generic tools: FHE, ORAM, PIR

Dedicated SSE solutions:

Single-Keyword Search (SKS) [SWP’00, Goh’03, CGKO’06, ChaKam’10, …]

“privacy optimal" (DB size and query results) but practical limitations

Conjunctions: Very little

naive (n single-keyword searches),

GSW’04: structured-data only, LINEAR, communication-pairings tradeoff

Practicality limitations:

non-scalable design (esp. for adaptive solutions), limited support for dynamic data, lack of massive I/O support for large scale DBs, few prototypes

2

Our work: extends SSE in 3 dimensions

1. Functionality beyond single-keyword search:

Conjunctions, general Boolean expressions (on keywords), range queries, substring/wildcard queries, phrase queries, relational DBs and free text

2. Scalability:

Terabyte-scale DB, 10’s millions documents/records, billions of indexed document-keyword pairs, dynamic data. (Prototype tested by Lincoln Labs.)

3. New application settings and trust models:

Multiple clients: Data owner can give 3rd parties read access to outsourced encrypted data

Blind authorization: Access can be policy-based (e.g. query by certain DB columns only) without data owner learning the exact query

4

Example Applications

Example: Hospital outsources DB, provides access to clients (doctors, administrators, insurance companies, etc.)

Policy-based authorization on a client/query-basis

Hospital doesn’t need to learn the query, only (blindly) enforce policy

Good for security, privacy, regulations

Warrant scenario (extended 4-party setting)

Judge provides warrant for a client C (e.g. FBI) to query a DB

DB owner enables access but only to queries allowed by judge

DB owner does not learn warrant content or queries

Client C (e.g., FBI) gets the matching documents for the allowed queries and nothing else

5

Obama’s 3rd Party Solution? (phone data)

Example: Medical Application(private outsourcing and delegation)

Charlie

Preprocessing

query := “zip=10598” & “age=(22,50)” & “condition=diabetes”

1: query2: query tokens

4:EN

C C(m

atch

ing

reco

rds)

Debbiehospital

5: decrypts matching records

EncD(DB*)

Eddie

3:EN

C C(q

uery

toke

ns)

6

------------ 1:

ENCC (query)

query := “zip=10598” & “age=(22,50)” & “condition=diabetes”

1: query2: query tokens

4:EN

C C(m

atch

ing

reco

rds)

Debbiehospital

5: decrypts matching records

EncD(DB*)

Eddie

Debbie authorizes query according to policy without even learning what the query is! (e.g. for regulatory compliance/liability reqt’s)

7

Example: Medical Application (+ blind authorization)

Charlie

3:EN

C C(q

uery

toke

ns)

2: ENCC (query tokens *)

----------------------

IBM_USER

Add rectangle:query privacy from Debbie assumes no collusion with Eddie

Scalable & Featureful Implementation

Validated on synthetic census data: 10Terabytes, 100 million records, > 100,000,000,000=1011 indexed record-keyword pairs !

Equivalent to DB with 1 record per US citizen with 400 keywords in each

with support for:

arbitrary Boolean queries

range queries, substring/wildcards, phrase queries

dynamic data (record additions, deletions, modifications)

preprocessing cost is linear in DB size (minutes-days for above DBs)

search proportional to # documents matching w1 (least frequent term)

Single-round: Tokens from client to server, matching results back

Query response time: Competitive w/ plaintext queries on indexed DB

8

4 seconds: fname='CHARLIE' AND sex='Female' AND NOT (state='NY' OR state='MA' OR state='PA' OR state='NJ) on 100M records/22Billion index entries US-Census DB

IBM_USER

# charlie = 52,530 result set = 2,871.

Crypto Design-Engineering Synergy

Major effort to build I/O-friendly data structures

Critical decision: Do not design/implement for RAM-resident data structures (they severely limit scalability)

On disk we need to avoid random access (e.g., avoid Bloom filters on disk, BF in RAM help reduce I/O access but high false positives for large DBs)

Challenge: Need randomized data structures to reduce leakage and need structured ones to improve I/O performance (locality of access)

Use of EC groups imply very fast exponentiation (esp. w/ same-base), typically: I/O and network latency dominate cost

On a midsize storage system: ~300 IOPS (I/O Operations Per Second)

~1000 expon’s per random I/O access (133 w/o same-base optimization)

On the other hand, HMAC cost is noticeable (we use CMAC)

9

500,000/sec, 8 cores, same-base opt , 100-1000 per IO

Carefully designed to scale beyond RAM

Big challenge TSet Π (oblivous hash) datastructure: Security implies maximal randomization yet efficiency calls for maximal “sequentialization” in disk and DB access!!

Πpack [Crypto’13]: handle >100x larger datasets than previous work

Π2lev [NDSS’14]: handle >100x larger datasets than Πpack

Prototype

10

DB (clear-text data)

ClientPre-processingApplication

Aux.tables

Cloud Server

EDB (Π)

ClientApplication

Updates

ADMINIBM

The same slide with animation is in the backup part

Security “The challenge of being imperfect”

Security-Performance trade-offs:

Leakage in the form of data access patterns (e.g. repeated retrieval), query patterns (e.g., repeated queries)

+ additional leakage (more complex functions of DB and query history)

Can lead to statistical inference based on side information on data (application dependent), can be alleviated by masking techniques

Security proofs: Formal model and precise provable leakage profile

Leakage profile: provides upper bounds on what is learned by the attacker -- definitions follow the simulation paradigm [CGKO, CK]

11

Security Formalism (adversarial server)

Based on the simulation-based definitions given for SKS [CGKO,CK].

There is an attacker E (acting as the server), a simulator Sim and a leakage function L(DB, queries):

Real: Attacker E chooses DB and gets the pre-processed encrypted DB, then interacts with client on adaptively chosen queries

Ideal: Attacker E chooses DB and queries (adaptively), E gets Sim(L(DB)) and Sim(L(DB,queries))

A SSE scheme is semantically secure with leakage L if for all attackers E, there is a simulator Sim such that the views of E in both experiments are indistinguishable

Server learns nothing beyond the specified leakage L even if it knows (and even if it chooses adaptively) the plaintext DB and queries

12

Some Techniques

#1: Conjunctive SSE

13

Boolean Query Support: Basic Ideas

Consider a conjunctive query AND(w1, w2, … wn)

(Boolean queries supported as a surprisingly simple extension)

1. s-term: choose the least frequent conjunctive term, say w1:

Find encrypted indexes of all documents containing w1 (w/o

revealing w1)

Standard Encrypted Index: Enc(w1) Enc(ind1), Enc(ind2), … ,

Enc(indk)

2. Suppose you had oracle O s.t. O(Enc(ind) , Enc(w)) = 1 iff ind ∊ D(w) ⇒ return Enc(ind) in Enc(w1) list if

O(Enc(ind),Enc(wi))=1 for i=2,...,n

I. Use PRF1(ind) instead of ind and PRF2(w) instead of w

II. Precompute H(ind,w) = (PRF1(w)) PRF2(ind) for all w and ind∊D(w)

III. O(Enc(ind),Enc(w)) returns 1 if H(ind,w) is in the precomputed set

Implement encryption of PRF1(ind) and PRF2(w) s.t.

H(ind,w) can be computed given Enc(PRF2(ind)) and Enc(PRF1(w))

14

15

Oblivious PRF Computation (OPRF) [NR’04,FIPR’05]

Multiple instantiations ([Yao’82], [FIPR’01], [JL’09], [JL’10], …)

Fastest (2 exp’s/party) is Hashed-DH PRF: fk(x)=[H(x)]k

Oblivious computation via “Blind DH Computation”:

(C sends a = [H(x)] r to S, S replies with b = ak, C computes fk(x) as b 1/r)

OPRF with enforcing access policy on query x: extensions…

S(k) C(x)

fk(x)丄

fk(x) is a Pseudo-Random Function (PRF) if

OPRF protocol

x

fk(x) or $fk-or-$ Adv

?

Interactive Solution Let xind = PRF1(ind) and xtrap = PRF2(w)

H(ind,w) = xtrapxind in group with DDH

C can compute xtrap with D using Oblivious PRF (for Blind Authorization)

C has xtrap from D, if E had xind in the list then:

C to E: a ⃪ xtrapz for z random in Zp

E to C: b ⃪ axind

C to E: xtag ⃪ b1/z (= xtrapxind)

E searches xtag in XSet

Two problems: Requires 2 rounds of interaction If encrypted index Enc(w) includes xind’s then E sees repeated

(x)ind’s 16

Non-Interactive Solution

Interactive: Encrypted index Enc(w) stores xind’s, C has xtrap

C to E: a ⃪ xtrapz z random in Zp

E to C: b ⃪ axind

C to E: xtag ⃪ b1/z (= (xtrapz)xind/z = xtrapxind )

Non-interactive: Enc(w) stores each xind blinded with a OTP factor z

1. At setup: Enc(w) includes y=xindz-1 at c-th position for z=PRF(w,c)

2. At search: C sends a=xtrapz to E ( recall that xtrap=PRF(wi) )

3. E retrieves y from EDB and sets xtag ⃪ ay ( = (xtrapz)xind/z = xtrapxind )

17

avoids interaction (and prevents E from learning xtrap or xind)

Leakage for Conjunctions w1… wn

Index size = upper bound on i |D(wi)|

Number of terms in each conjunction query

Size of s-term set |Rec(w1)| (unavoidable? reduction from

3SUM)

Repeated s-terms (i.e. s-term access pattern)

Size of Rec(w1wj) for j=2,…, n

For queries w1 x and w1’ x’ , if x=x’ and D(w1 w1’) is non-

empty then E learns that x=x’ and the size of D(w1 w1’)

Non-interactivity ⇒ Malicious server E cannot learn anything more

18

From Conjunctions to Boolean Queries

Boolean expressions of the form w1 B(w2,...,wn) resolved as

above

Find all ind’s that have w1

For each ind check if wi is in D(ind) and evaluate B on resultant 0/1

values

Can extend to disjunctions of above form

If query cannot be put in this form we set w1 as “true” (linear

search)

Leakage is same as for conjunction w1w2 … wn

(+ Boolean formula is leaked)

19

Some Techniques

20

#2: Multi-Client SSE with Blind Auth.#3: Dynamic SSE (i.e. Dynamic DB)

21

Multi-Client SSE with Blind Authorization

Keywords are attribute-value pairs, e.g. (name,Joe),

(zipcode,02415)

Attribute-based policies (“Is client C authorized for query Q?”)

Data owner D can apply access Policy based on attributes not values

E.g. C can query name and lastname but only with one of (zipcode, school)

Permissions can depend on client type and Boolean query formula

D enforces policy w/o learning the queried values, only the attrib’s

Blind (PIR-like) Authorization

Keywords tagged (e.g., xtrap) with Oblivious PRF (H(w))k

Meshes well with our DH-based OPRF used to instantiate H(w,ind)

Attribute-based authoriz’n enforced via attribute-specific OPRF keys

Mix-and-Match attacks against conjunctive (and Boolean) queries prevented via blinded homomorphic signatures

Authorization Cost: one round, 3n exp’s (n = # keywords in query)

Secure against fully malicious C and D, assuming D&E do not collude

22

Dynamic DB (updates)

Updates: add, delete, modify

Operational assumption: |DB| >> updates; thus:

EDB super-optimized for disk access but update structures planned for RAM

Periodic re-encryption eliminates leakage trails (e.g. new/old records)

Caching

Client can identify previously retrieved documents in result set before requesting them

Will retrieve a previously retrieved document only if the document changed since last retrieval

Important defense against server learning intersection b/w queries

Some Techniques

24

#4: Range SSE#5: Substring/Wildcard/... SSE

Range Queries

E.g., return all records with DOB between 2/3/87 and 3/4/88

Compute a cover of the range by intervals (via a tree cover algorithm)

We add log M attributes (for M = max interval) to the DB

Generate a disjunction of values representing each of the intervals

Thus Boolean SSE protocol implies Range SSE !

More interesting part: Authorization

Debbie authorizes based on total size of range (eg. query span ≤ one

year)

Needs to let Debbie learn the total size of the range w/o leaking on the endpoints? (e.g.., # of intervals should be same for any two ranges of

same size)

Two solutions: Canonical covers (all ranges of given size have same-lengths intervals); 3-node over-covers (always 3 intervals with 40% avg overhead)

25

Substring/Wildcard Queries

Substrings/Wildcards:

Any combination of _ (“single character wildcard”) and substrings of k or more consecutive characters (k is a tunable parameter)

Plus a % (“any string”) at the beginning and/or end of a search expression

Examples (return “encyclopedia”): (i) %cycl_ _ _dia (ii) %lope% (iii) enc_clo_ _dia

Tunable k:

smaller for more general queries

larger for search efficiency

26

Technique: Distance-Aware Queries

Represent substring query as conjunction of k-grams

(k=3) query %encycl% becomes %enc% AND (%ycl% @ distance 3)

If D(ind) has k-gram w at position pos then add tag H(ind,w,p) to tag set

Need to compute H(ind,w,pos+∆) given PRF(w), ∆, and Enc(ind,pos)

Define H(w,ind,p) as (PRF1(w)) { PRF2(ind) p

} [H a PRF under q-DH]

Generalizes to any search of the form:

Return all records where element e1 is at distance ∆ from e2

Examples:

ei are k-grams: resolves substrings and wildcards

ei are textual words: resolves phrases (e.g., “Ecole Normale

Superieure”)

Multi-dimensional distances (e.g., grid), etc.

27

(Some) Research Questions

Leveraging other tools: extend functionality and improve privacy via multiparty comp’n? identify modules that can benefit from ORAM?

Fundamental tradeoffs: Cost of preventing leakage (e.g., leakage via computation time, size of least frequent conjunctive term, etc.)

“Semantic leakage”: Proving formal leakage is nice but how bad is it for a given particular application, what forms of masking can help?

Can we have a theory to help us reason about it (cf. differential privacy)?

A theory of leakage composition?

Specific attacks welcome (e.g. Islam-Kuzu-Kantarcioglu NDSS’12)

Characterizing privacy-friendly plaintext search algorithms/data str.

Verifying results set: easy for single-keyword, possible for

SSE/Boolean, seems harder with multi-client

28

Crypto’2013: conjunctive SSE eprint.iacr.org/2013/169

CCS’2013: multi-client + blind auth. SSE eprint.iacr.org/2013/720

NDSS’2014: dynamic, much faster for large DBs

eprint.iacr.org/2014/?

Esorics’2015: ranges and substrings (+ wildcards & phrases)

29

Thanks!

highly-functional highly-scalable search on encrypted data hugo krawczyk, ibm (speaker: stanislaw...

Documents

data owner

db db owner

query tokens4

enccquery query

dynamic data

query results

aviation data

allowed queries