nonlinear order preserving index for encrypted database query in service cloud environments

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2013; 25:1967–1984
Published online 25 January 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.2992

SPECIAL ISSUE PAPER

Nonlinear order preserving index for encrypted database query in service cloud environments

Dongxi Liu*,† and Shenlu Wang,‡

CSIRO ICT Centre, Marsfield, NSW 2122, Australia

SUMMARY

Database services in the cloud are emerging as an attractive way of outsourcing databases. When a database is deployed on a cloud database service, data security and privacy become a major concern for users. A straightforward way to address this concern is to encrypt the database. However, after encryption, the database cannot be easily queried. In this paper, we propose a nonlinear order preserving scheme for indexing encrypted data, which facilitates range queries over encrypted databases. The scheme is secure even when there are a large number of duplicates in plaintexts. Moreover, our scheme allows the programmability of basic indexing expressions and thus provides the capability of hiding the distribution of plaintexts from the distribution of indexes. This scheme is suitable for long-standing databases because its use does not need any assumption on the characteristics of database data, such as their distribution, range and number, which may change dramatically over time. Copyright © 2013 John Wiley & Sons, Ltd.

Received 19 July 2012; Accepted 13 December 2012

KEY WORDS: database encryption; cloud database; secure index; query

1. INTRODUCTION

Cloud database services, such as Amazon Relational Database Service (RDS) and Microsoft SQL Azure, are emerging as an attractive way for enterprises to outsource their databases. In cloud database services, the hardware and software underlying databases are shared among users. The database services allow enterprises to deploy their databases quickly without making a large investment in their proprietary hardware and software, hence reducing the total cost of ownership. Moreover, the database services on cloud can be elastic, meaning that an enterprise can dynamically increase or decrease the compute resources allocated to its databases according to its business requirements.

Although attractive as a new paradigm of data management, database services cannot be fully exploited if the problem of data privacy and security cannot be addressed [1, 2]. When a database is deployed into a public database service, the service provider has complete physical control over the database. The data in the database might be improperly accessed by the untrusted cloud database administrators, accidentally or intentionally, or by attackers who compromise the database service platforms. Because the database services are a kind of cloud computing service, the techniques of trusted cloud computing have the potential to be used to build trusted database services. However, there is still a gap in applying the techniques of trusted cloud computing such as [3, 4] to address the security and privacy problem in database services.

*Correspondence to: Dongxi Liu, CSIRO ICT Centre, Marsfield, NSW 2122, Australia.
†E-mail: [email protected]
‡Shenlu was a vacation student in CSIRO, coming from RMIT University.


For cloud database services, a straightforward approach to addressing the security and privacy problem is to encrypt the database. In this way, the untrusted cloud database administrators or attackers can see only meaningless ciphertexts. However, after being encrypted, a database cannot be easily queried. It is not acceptable to decrypt the entire database before performing each query because the decryption might be very slow for a large database, and the decrypted database is again at risk of having its security and privacy breached. Ideally, a query should be executed directly over the encrypted database.

A database query can be an equality query, a range query, an aggregate query, or a combination of these. In this paper, we focus on the problem of performing range queries over encrypted databases. For example, a range query can be 'select staff who joined the company between 2000 and 2012'. Equality queries can be handled when a deterministic encryption scheme (e.g. Advanced Encryption Standard (AES) in Electronic codebook (ECB) mode) is used, because in this scheme, the same plaintexts are always encrypted into the same ciphertexts. For aggregate queries using SUM and AVG operations, homomorphic encryption algorithms [5] are needed to sum and average ciphertexts directly. We also discuss how to apply our method together with secure hash algorithms and homomorphic encryption algorithms to deal with all types of queries over encrypted databases.

To deal with range queries over encrypted databases, an order preserving encryption scheme has been proposed in [6]. In this scheme, the i-th value in the plaintext domain is mapped to the i-th value in the ciphertext domain, such that the order between plaintexts is preserved between ciphertexts. To use this scheme, users need to be able to model the distributions of values in the plaintext and ciphertext domains. However, when using cloud database services, an enterprise may not have database professionals who know the techniques [7] needed for distribution modeling.

In addition, the scheme [6] can only deal with plaintexts in a finite domain. The cryptographic analysis of the order preserving encryption scheme is performed in [8].

The work [1] shows a way of building order preserving polynomials, which are based on the polynomials proposed by Shamir for secret sharing [9]. However, to use this mechanism, the number of plaintexts is needed to determine the range of coefficients in a polynomial. Moreover, the evaluation results of order preserving polynomials may reveal the distribution of plaintexts, because similar plaintexts are transformed with similar polynomials. As discussed in [6], the coupling of the plaintext distribution and the ciphertext distribution might be exploited by attackers to guess the scope of the plaintext for a ciphertext.

In [10], an indexing mechanism for range queries is proposed. This mechanism is not strictly order preserving because two different values may be mapped into the same bucket, which is used when checking query conditions. The mechanism can lead to inaccuracy of query results, and hence, some post-processing is needed to remove unexpected query results.

In the previous work [11], we proposed an order preserving indexing scheme, which indexes plaintexts by using simple linear expressions of the form $a \cdot x + b + \mathit{noise}$. In such indexing expressions, the coefficients $a$ and $b$ are kept secret (not known by untrusted cloud database administrators), and $\mathit{noise}$ is randomly sampled from some particular range, such that the order of plaintexts is preserved. As in [6, 12], the threat model taken in our work assumes that untrusted cloud database administrators can access only ciphertexts. However, even in this threat model, the indexing scheme in [11] might become vulnerable when there are duplicates in plaintexts. This vulnerability is realistic, because duplicates of plaintexts do occur in real databases. For example, in a company, all staff at the same level usually have the same salary (i.e. duplicates of salaries).

In this paper, we propose the nonlinear indexing scheme to address the vulnerability of linear indexing. A nonlinear indexing expression has the form $a \cdot f(x) \cdot x + b + \mathit{noise}$, where $f(x)$ is a function over $x$. To keep the order preserving property, we determine the correctness requirements on the function $f(x)$. Any function satisfying the requirements can be used as $f(x)$ to define nonlinear indexing expressions. We have identified several instances of $f(x)$ and proven their correctness, such as the logarithm function and the cosine function. The nonlinear indexing expressions can keep $a$ and the definition of $f(x)$ secret even when there are duplicates in plaintexts.

In the indexing scheme [11], programmability is a feature giving users the capability to unlink the distributions of plaintexts and indexes. That is, indexing expressions can be programmed to process plaintexts in different ranges with different indexing expressions. In this work, we still allow the programmability of nonlinear indexing expressions. Moreover, the programmability is enhanced in two aspects. The first aspect is that the addition of two indexing expressions is supported as a new way to compose indexing expressions. For example, from two expressions $a_1 \cdot f_1(x) \cdot x + b_1 + \mathit{noise}_1$ and $a_2 \cdot f_2(x) \cdot x + b_2 + \mathit{noise}_2$, we can build the following one by addition.

$$(a_1 \cdot f_1(x) + a_2 \cdot f_2(x)) \cdot x + b_1 + b_2 + \mathit{noise}_1 + \mathit{noise}_2$$

The composite indexing expressions make it harder for the untrusted administrators to guess the secret values $a_1$, $a_2$ and the definitions of $f_1(x)$ and $f_2(x)$, even when there are a large number of plaintext duplicates in a cloud database.

The second aspect of programmability enhancement is that the function $f(x)$ can also be composed. For example, suppose $f_1(x)$ and $f_2(x)$ are two functions satisfying the correctness requirements; then their composition $f_1(f_2(x))$ also satisfies the requirements and hence can be used in indexing expressions. The composition of $f(x)$ increases the robustness of indexing expressions by generating more complex forms of $f(x)$. For example, a function $f(x)$ can be composed from the logarithm function and the cosine function.

Like the indexing scheme in [11], the nonlinear indexing scheme in this paper can be applied more easily than the methods in [1, 6]. The nonlinear indexing scheme does not require users to model data distribution explicitly, nor does it require users to determine the range and number of plaintexts before indexing. These usability features are important for the protection of long-standing databases, because such databases might have their data distribution, range, and number change dramatically over a long period. In addition, the nonlinear indexing scheme is used together with existing encryption algorithms (e.g. AES) to deal with queries over encrypted databases. Thus, it can benefit from advances in encryption algorithm research.

The rest of the paper is organized as follows. Section 2 describes the architecture of querying encrypted databases. Section 3 gives the details of our indexing scheme, with its programmability discussed in Section 4. We introduce query translation in Section 5 and describe a prototype in Section 6. In the last two sections, related work and the conclusion are discussed.

2. THE ARCHITECTURE OF QUERYING ENCRYPTED DATABASES

In this section, we describe the architecture in which our indexing scheme is used to manage encrypted databases. The architecture is shown in Figure 1. In this architecture, there is a database service provided in a public cloud and an enterprise that deploys into the cloud a database, which is encrypted by the enterprise to protect its privacy.

To query or update the encrypted database, the enterprise has a query proxy managing the communication between the database applications and the encrypted database. When a query is received from an application, the proxy translates it into a query that can be executed directly over the encrypted database. When a query result is returned from the database, the query proxy decrypts it before forwarding the result to the application. The query proxy depends on some metadata, such as keys and database schemas, to translate queries and decrypt query results.

Briefly, when a value is put into the database, the proxy uses the indexing mechanism to generate its index and also encrypts the value with some encryption algorithm such as AES. The index and the encrypted value are then stored into corresponding fields in the same record of the encrypted database. That is, the encrypted database has a different schema than the one designed by application developers. When a range query is received from the database applications, the proxy calculates the index of the value in the query condition, which is then sent to the database service to compare with indexes in the encrypted database. The details of query translation are described in Section 5.

The order preserving indexing mechanism reveals the order information of the encrypted values. Hence, a cryptographic system based on order preserving encryption or order preserving indexing is vulnerable to chosen-plaintext attacks [6, 8]. In this architecture, the proxy is deployed within the administrative boundary of the enterprise. Because the untrusted cloud database administrators cannot access the query proxy, they cannot perform chosen-plaintext attacks on the encrypted databases in the cloud. Thus, in our threat model, the untrusted cloud database administrators can access only the ciphertexts stored in the cloud databases and cannot know the keys and schemas of the encrypted databases. That is, the untrusted cloud database administrators are allowed to perform ciphertext-only attacks. This threat model is also taken in [6, 12].

Figure 1. Architecture of querying encrypted databases.

3. NONLINEAR ORDER PRESERVING INDEXING

There are several data types (e.g. integer, double, string) used in a database. In our work, we design the nonlinear indexing scheme natively for numerical values, and other data types are translated into numerical values before indexing, as discussed in [11].

3.1. Overview of linear indexing

The linear indexing scheme proposed in [11] is built over expressions of the form $a \cdot x + b + \mathit{noise}$, where $x$ is a plaintext, $a$ and $b$ are secret coefficients (only known by the query proxy in the architecture of Figure 1), and $\mathit{noise}$ is a randomly selected value. The order preserving property means that for all $v_1$ and $v_2$, if $v_1 > v_2$, then $a \cdot v_1 + b + \mathit{noise}_1 > a \cdot v_2 + b + \mathit{noise}_2$. To guarantee the order preserving property, we require that $a > 0$ in the linear expression and that $\mathit{noise}$ is randomly selected from some particular range, as described later.

To determine the range of noises, the sensitivity of input values is needed. Intuitively, the sensitivity characterizes the minimal difference between two plaintexts. The following is the formal definition of plaintext sensitivity.

Definition
Let $V$ be the set of all input values. The sensitivity of $V$ is the minimum element in the set $\{\,|v_1 - v_2| \;:\; v_1 \in V,\ v_2 \in V,\ v_1 \neq v_2\,\}$.

By its definition, the sensitivity is always bigger than 0. The sensitivity of input values is usually specific to applications. For example, if the salary in a company takes the format $d_1 d_2 d_3 . d_4 d_5$, where $d_i$ is a digit, then the sensitivity of salary is 0.01. That is, the smallest salary difference between two staff members is 0.01 in the company. For another example, if the input values in an application can only be even numbers, then the sensitivity of input values in this application is 2.
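As an illustration of the definition, the following minimal sketch computes the sensitivity of a finite sample of input values. The helper name `sensitivity` is hypothetical, and computing the value from a sample is only an assumption made for illustration: in the scheme, the sensitivity is a property of the application's value format rather than of the data currently stored.

```python
from decimal import Decimal

def sensitivity(values):
    """Smallest absolute difference between two distinct values (hypothetical helper)."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    # After sorting, the smallest gap always lies between neighbouring values.
    return min(hi - lo for lo, hi in zip(distinct, distinct[1:]))

# Even numbers only: the sensitivity is 2.
even_sens = sensitivity([2, 8, 4, 4, 10])

# Salaries with two decimal digits; Decimal avoids binary rounding effects.
salary_sens = sensitivity([Decimal("100.25"), Decimal("100.26"), Decimal("99.00")])
```

Duplicates are removed first because the definition only ranges over pairs with $v_1 \neq v_2$.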



Definition
Given the sensitivity $\mathit{sens}$ of input values $V$, the linear index of value $v \in V$ is $a \cdot v + b + \mathit{noise}$, where $a > 0$ and $\mathit{noise}$ is randomly sampled from the range $[0, a \cdot \mathit{sens})$.

For example, suppose the linear expression is $7.2 \cdot x + 3.75$, and the sensitivity of input values is 0.01. Then, the range for generating noises is $[0, 0.072)$. For two input values 2.04 and 2.05, their linear indexes are calculated by $7.2 \cdot 2.04 + 3.75 + \mathit{noise}_1$ and $7.2 \cdot 2.05 + 3.75 + \mathit{noise}_2$, and hence distributed in the ranges $[18.438, 18.51)$ and $[18.51, 18.582)$, respectively. The correctness of the linear index scheme is proved in [11].
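The definition can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `linear_index` and the use of Python's `random` module as the noise source are not taken from [11], and a real deployment would need a cryptographically suitable random source.

```python
import random

def linear_index(v, a, b, sens, rng=random):
    """Linear index a*v + b + noise, with noise drawn from [0, a*sens)."""
    if a <= 0 or sens <= 0:
        raise ValueError("a and sens must be positive")
    noise = rng.random() * a * sens  # rng.random() lies in [0, 1)
    return a * v + b + noise

# The worked example from the text: 7.2*x + 3.75 with sensitivity 0.01.
rng = random.Random(2013)
low = [linear_index(2.04, 7.2, 3.75, 0.01, rng) for _ in range(5)]
high = [linear_index(2.05, 7.2, 3.75, 0.01, rng) for _ in range(5)]
print(low, high)
```

Every index in `low` falls below every index in `high`, up to floating-point rounding at the shared boundary 18.51, because the two noise ranges are adjacent and half-open.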

3.2. Vulnerability of linear indexing

The linear indexing scheme might be vulnerable when there are too many duplicates in plaintexts. Duplicates of plaintexts do occur in real databases. For example, staff in a company might have the same salary if they are at the same level.

Suppose the sensitivity of plaintexts is $\mathit{sens}$. Because the sensitivity is application specific and not changed frequently, we assume that the untrusted cloud database administrators can know the sensitivity, for example, by guessing the most commonly used sensitivity values. In addition, for the purpose of processing equality queries, the administrators need to know whether two indexes are generated from the same plaintext, as will be shown in Section 5.

The vulnerability can be easily explained in an extreme case. Suppose an input value $v$ has two duplicates. For one duplicate, its index $i_1$ is $a \cdot v + b + \mathit{noise}_1$, where $\mathit{noise}_1$, sampled from the range $[0, a \cdot \mathit{sens})$, happens to be 0. For the other duplicate, its index $i_2$ is $a \cdot v + b + \mathit{noise}_2$, where $\mathit{noise}_2$, also sampled from the range $[0, a \cdot \mathit{sens})$, happens to be a value, say $n$, infinitely close to $a \cdot \mathit{sens}$. Because $i_1$ and $i_2$ are stored in the cloud databases, $i_1$ and $i_2$ are known to the untrusted cloud database administrators. From $i_1$ and $i_2$, $n$ can be calculated by $i_2 - i_1$. Thus, $n$ is also known to the untrusted administrators. Under the assumption that $\mathit{sens}$ is known to the cloud administrators, a value that is infinitely close to the secret $a$ can be calculated by $n / \mathit{sens}$.

When there are multiple duplicates of a plaintext, the cloud administrators can choose the maximum index and the minimum index of the plaintext and calculate their difference to estimate $a$ as in the extreme case. The probability of estimating $a$ correctly increases with the number of plaintext duplicates. The limit of the estimated secret with respect to the number of duplicates is $a$. On the other hand, a bigger $a$ in an indexing expression means a bigger noise range $a \cdot \mathit{sens}$. Thus, for a bigger $a$, more duplicates are needed to estimate $a$ correctly. Note that the number of duplicates needed to estimate $a$ is independent of the input value. These claims are verified by the following experiments.

Suppose there are three indexing expressions: $16 \cdot x + 317$, $172 \cdot x + 813$, and $1327 \cdot x + 1000$. They will be used to index integer values with sensitivity 1, and each input value might be duplicated 1 time, 2 times, and so on. For each number of duplicates, we calculate the difference between the maximum index and the minimum index. For these three indexing expressions, the relations between the difference and the number of duplicates are shown in three figures: Figure 2, Figure 3, and Figure 4, where 19, 119, . . . , 919 are input values. From these figures, we can see that the differences approach the secrets (16, 172, or 1327) when the number of duplicates increases. That is, the secrets (16, 172, or 1327) are vulnerable when there are too many duplicates.
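The estimation described above can be reproduced with a small simulation. The sketch below is an assumption-laden illustration rather than the experiment behind Figures 2 to 4: the helper name `estimate_a`, the random seed, and the chosen duplicate counts are hypothetical.

```python
import random

def estimate_a(indexes, sens):
    """Attacker's estimate of the secret a from indexes of one duplicated plaintext."""
    return (max(indexes) - min(indexes)) / sens

rng = random.Random(7)
a, b, sens, v = 172, 813, 1, 19  # secret expression 172*x + 813, input value 19

for duplicates in (2, 10, 100, 10_000):
    # Each duplicate of v gets its own noise from [0, a*sens).
    indexes = [a * v + b + rng.random() * a * sens for _ in range(duplicates)]
    print(duplicates, round(estimate_a(indexes, sens), 2))
```

The estimate never exceeds `a`, and it approaches `a` as the number of duplicates grows, which is the behaviour the text attributes to the linear scheme.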

3.3. Nonlinear order preserving indexing

To address the vulnerability of linear indexing, we propose the nonlinear indexing scheme, which is defined below.

Definition
Given the sensitivity $\mathit{sens}$ of input values $V$, the nonlinear index of value $v \in V$ is calculated from the expression $a \cdot f(v) \cdot v + b + \mathit{noise}$, where $a$, $f(x)$, and $b$ are kept secret, with the following requirements satisfied.

• $a > 0$,
• $\mathit{noise}$ is sampled from the range $[0,\ a \cdot f(v + \mathit{sens}) \cdot (v + \mathit{sens}) - a \cdot f(v) \cdot v)$,
• $f(x) > 0$ for $x \neq 0$,
• $f(x_1) \ge f(x_2)$ for $x_1 > x_2 \ge 0$ or $x_1 < x_2 \le 0$.

Figure 2. Secret estimation with $16 \cdot x + 317$.



Because the indexing expression is nonlinear, the range of noises may be different for different input values. Note that on the basis of the requirements on $a$ and $f(x)$, the right bound of the noise range, $a \cdot f(v + \mathit{sens}) \cdot (v + \mathit{sens}) - a \cdot f(v) \cdot v$, is always bigger than 0.
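A minimal sketch of the nonlinear index follows, assuming Python's `random` module as a stand-in noise source. The names `nonlinear_index` and `f_log` are hypothetical, and the particular logarithmic $f$ is just one admissible choice with illustrative constants.

```python
import math
import random

def nonlinear_index(v, a, b, f, sens, rng=random):
    """Index a*f(v)*v + b + noise; the width of the noise range depends on v."""
    def g(x):
        return a * f(x) * x              # noise-free part, without b
    width = g(v + sens) - g(v)           # right bound of the noise range
    return g(v) + b + rng.random() * width

def f_log(x):
    """Assumed example of f: log_7(10 + 18*|x|)."""
    return math.log(10 + 18 * abs(x), 7)

rng = random.Random(5)
indexes = [nonlinear_index(v, 16, 317, f_log, 1, rng) for v in range(-5, 6)]
print(indexes)
```

Because the largest possible index of `v` stays below the smallest possible index of `v + sens`, the printed list is increasing for increasing inputs, including across zero.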

We use the notation $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}(v)$ to represent the nonlinear index of value $v$. The following theorem states that the nonlinear indexing scheme is order preserving.

Theorem
Given the sensitivity $\mathit{sens}$ of input values $V$, for all $v_1 \in V$ and $v_2 \in V$, if $v_1 > v_2$, then $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}(v_1) > \mathit{nindex}^{\mathit{sens}}_{[a,b,f]}(v_2)$.

Proof
To prove this theorem, we first determine the minimum index and the maximum index for a value $v$. According to the aforementioned index definition, the minimum index for $v$ is $a \cdot f(v) \cdot v + b$, generated when $\mathit{noise}$ is 0, whereas the maximum index is obtained when $\mathit{noise}$ is infinitely close to the right bound of the noise range, $a \cdot f(v + \mathit{sens}) \cdot (v + \mathit{sens}) - a \cdot f(v) \cdot v$. We represent the maximum index as

$$a \cdot f(v) \cdot v + b + a \cdot f(v + \mathit{sens}) \cdot (v + \mathit{sens}) - a \cdot f(v) \cdot v - \epsilon,$$

where $\epsilon > 0$ and is infinitely close to 0.

Then, we prove $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}(v_1) > \mathit{nindex}^{\mathit{sens}}_{[a,b,f]}(v_2)$ by proving that the minimum index of $v_1$ is still bigger than the maximum index of $v_2$, that is, by proving

$$a \cdot f(v_1) \cdot v_1 + b - \bigl(a \cdot f(v_2) \cdot v_2 + b + a \cdot f(v_2 + \mathit{sens}) \cdot (v_2 + \mathit{sens}) - a \cdot f(v_2) \cdot v_2 - \epsilon\bigr) > 0,$$

which can be reduced to $a \cdot f(v_1) \cdot v_1 - a \cdot f(v_2 + \mathit{sens}) \cdot (v_2 + \mathit{sens}) + \epsilon > 0$. Because $v_1 > v_2$ and the sensitivity of input values is $\mathit{sens}$, the minimum $v_1$ bigger than $v_2$ is $v_2 + \mathit{sens}$. When $v_1$ is $v_2 + \mathit{sens}$ (taking its minimum value), the left-hand side of the inequality has the minimum value $\epsilon$. Because $\epsilon > 0$, the theorem is proved. □

The linear indexing scheme in [11] is a special case of the nonlinear indexing scheme. When $f(x) = 1$, which is a constant function, the nonlinear indexing scheme becomes the linear indexing scheme. In the next section, we introduce several instances of $f(x)$ that are not constant functions.

3.4. Instances of nonlinear indexing expressions

Any function $f(x)$ that satisfies the requirements can be used to define a nonlinear indexing expression. In this section, we introduce several instances of $f(x)$ that satisfy the requirements of the nonlinear indexing scheme. We will also prove that the composition of two such functions still satisfies the requirements.

• $f(x) = |x|$. We have $|x| > 0$ when $x \neq 0$. The condition $|x_1| > |x_2|$ holds when $x_1 > x_2 \ge 0$ or $x_1 < x_2 \le 0$. Similarly, $x^2$ is also a valid instance of $f(x)$.

• $f(x) = \log_c(d + e \cdot |x|)$, where $c > 1$, $d > 1$ and $e > 0$. By using the change of base law, $f(x)$ can be rewritten as $\log_{10}(d + e \cdot |x|) / \log_{10}(c)$. Because $c > 1$ and $d + e \cdot |x| > 1$ when $x > 0$, we have $\log_{10}(d + e \cdot |x|) > 0$ and $\log_{10}(c) > 0$. Hence, when $x > 0$, we have $f(x) > 0$. Because $f(x) = f(-x)$, $f(x) > 0$ holds when $x \neq 0$.

When $x_1 > x_2 \ge 0$, $\log_c(d + e \cdot |x_1|) \ge \log_c(d + e \cdot |x_2|)$ because $d + e \cdot |x_1| \ge d + e \cdot |x_2|$ and $f(x)$ as a logarithm function is strictly increasing over positive inputs. When $x_1 < x_2 \le 0$, we have $-x_1 > -x_2 \ge 0$, implying $\log_c(d + e \cdot |-x_1|) \ge \log_c(d + e \cdot |-x_2|)$. Hence, we have $\log_c(d + e \cdot |x_1|) \ge \log_c(d + e \cdot |x_2|)$ when $x_1 < x_2 \le 0$.

• $f(x) = c \cdot \lfloor |x| / \pi \rfloor + d \cdot \cos(|x| \,\%\, \pi + \pi) + e$, where $d > 0$, $c > 2 \cdot d$, $e > d$, and $\lfloor \cdot \rfloor$ and $\%$ are the floor and modulo operators, respectively. In this $f(x)$, only the term $\cos(|x| \,\%\, \pi + \pi)$ can be negative, and it has the smallest value $-1$ when $x = 0$. Thus, when $x \ge \pi$, we have $c \cdot \lfloor |x| / \pi \rfloor \ge c > 0$ and $d \cdot \cos(|x| \,\%\, \pi + \pi) + e > 0$, hence $f(x) > 0$ holds when $x \ge \pi$. When $0 < x < \pi$, we have $c \cdot \lfloor |x| / \pi \rfloor = 0$ and $d \cdot \cos(|x| \,\%\, \pi + \pi) + e > 0$, because $e > d$ and $\cos(|x| \,\%\, \pi + \pi) \ge -1$. Thus, when $x > 0$, $f(x) > 0$ holds. Because $f(x) = f(-x)$, we have $f(x) > 0$ when $x \neq 0$.

When $x_1 > x_2 \ge 0$, we may have (1) $\lfloor |x_1| / \pi \rfloor > \lfloor |x_2| / \pi \rfloor$ or (2) $\lfloor |x_1| / \pi \rfloor = \lfloor |x_2| / \pi \rfloor$ and $|x_1| \,\%\, \pi > |x_2| \,\%\, \pi$. In the first case, the value of $f(x_1) - f(x_2)$ is

$$c \cdot (\lfloor |x_1| / \pi \rfloor - \lfloor |x_2| / \pi \rfloor) + d \cdot (\cos(|x_1| \,\%\, \pi + \pi) - \cos(|x_2| \,\%\, \pi + \pi)),$$

which is bounded below by $c - 2 \cdot d$, because $\lfloor |x_1| / \pi \rfloor - \lfloor |x_2| / \pi \rfloor \ge 1$ and the smallest value of $\cos(|x_1| \,\%\, \pi + \pi) - \cos(|x_2| \,\%\, \pi + \pi)$ is $-2$. Hence, $f(x_1) \ge f(x_2)$ holds in the first case under the requirement that $c > 2 \cdot d$. When $\lfloor |x_1| / \pi \rfloor = \lfloor |x_2| / \pi \rfloor$ and $|x_1| \,\%\, \pi > |x_2| \,\%\, \pi$, the value of $f(x_1) - f(x_2)$ is $d \cdot (\cos(|x_1| \,\%\, \pi + \pi) - \cos(|x_2| \,\%\, \pi + \pi))$, which must be bigger than 0 because $d > 0$ and the cosine function is strictly increasing in the range $[\pi, 2\pi]$, and hence, $f(x_1) > f(x_2)$ also holds in the second case. When $x_1 < x_2 \le 0$, we have $-x_1 > -x_2 \ge 0$. Because $f(x_1) = f(-x_1)$ and $f(x_2) = f(-x_2)$, $f(-x_1) \ge f(-x_2)$ implies $f(x_1) \ge f(x_2)$.
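The logarithm and cosine instances above can be written down and spot-checked numerically. The sketch below is illustrative only: the factory names `f_log` and `f_cos`, the checker `satisfies_requirements`, and the sample grid are assumptions, and a check on sample points does not replace the proofs.

```python
import math

def f_log(c, d, e):
    """log_c(d + e*|x|), assuming c > 1, d > 1 and e > 0."""
    return lambda x: math.log(d + e * abs(x), c)

def f_cos(c, d, e):
    """c*floor(|x|/pi) + d*cos(|x| % pi + pi) + e, assuming d > 0, c > 2*d, e > d."""
    def f(x):
        # divmod returns the floor quotient and the remainder consistently.
        q, r = divmod(abs(x), math.pi)
        return c * q + d * math.cos(r + math.pi) + e
    return f

def satisfies_requirements(f, xs):
    """Spot-check positivity (x != 0) and monotonicity in |x| on points xs >= 0."""
    xs = sorted(xs)
    positive = all(f(x) > 0 and f(-x) > 0 for x in xs if x != 0)
    monotone = all(f(x1) <= f(x2) and f(-x1) <= f(-x2)
                   for x1, x2 in zip(xs, xs[1:]))
    return positive and monotone

grid = [i * 0.1 for i in range(500)]
print(satisfies_requirements(f_log(7, 10, 18), grid))
print(satisfies_requirements(f_cos(19, 8, 33), grid))
```

A function such as `math.sin` fails the same check, since it is negative for some nonzero inputs.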

These instances of functions $f(x)$ can be composed. For example, by composing the second and third ones, we can obtain the following $f(x)$:

$$\log_c\bigl(d + e \cdot \bigl|\,g \cdot \lfloor |x| / \pi \rfloor + h \cdot \cos(|x| \,\%\, \pi + \pi) + i\,\bigr|\bigr)$$

Moreover, the composite $f(x)$ still satisfies the requirements of the nonlinear indexing scheme, as stated by the following theorem.

Theorem
If $f_1(x)$ and $f_2(x)$ satisfy the requirements of the nonlinear indexing scheme, then the composite function $f_1(f_2(x))$ also satisfies the requirements.

Proof
When $x > 0$, we have $f_2(x) > 0$ and $f_1(x) > 0$, and hence $f_1(f_2(x)) > 0$. When $x_1 > x_2 \ge 0$, $f_2(x_1) \ge f_2(x_2) > 0$ holds, and hence we have $f_1(f_2(x_1)) \ge f_1(f_2(x_2))$. When $x_1 < x_2 \le 0$, we have $f_2(x_1) \ge f_2(x_2) > 0$, and hence $f_1(f_2(x_1)) \ge f_1(f_2(x_2))$ holds. □
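The composition in the theorem is ordinary function composition. As a small illustration, the sketch below composes the logarithm instance with the $x^2$ instance and spot-checks the requirements on sample points; the helper `compose` and the chosen constants are assumptions made for this example.

```python
import math

def compose(f1, f2):
    """Return the function x -> f1(f2(x))."""
    return lambda x: f1(f2(x))

def f_log(x):
    return math.log(10 + 18 * abs(x), 7)   # logarithm instance with c=7, d=10, e=18

def f_sq(x):
    return x * x                           # the x^2 instance

f_comp = compose(f_log, f_sq)              # x -> log_7(10 + 18*x^2)

xs = [0.5 * i for i in range(1, 200)]
positive = all(f_comp(x) > 0 and f_comp(-x) > 0 for x in xs)
monotone = all(f_comp(x1) <= f_comp(x2) for x1, x2 in zip(xs, xs[1:]))
symmetric = all(f_comp(x) == f_comp(-x) for x in xs)
print(positive, monotone, symmetric)
```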

In Section 3.2, we presented experiments showing the vulnerability of the linear indexing scheme. Now, we repeat these experiments, changing only the three linear indexing expressions into the three nonlinear ones: $16 \cdot \log_7(10 + 18 \cdot |x|) \cdot x + 317$, $172 \cdot \log_7(10 + 18 \cdot |x|) \cdot x + 813$ and $1327 \cdot \log_7(10 + 18 \cdot |x|) \cdot x + 1000$. The differences between the maximum index and the minimum index for each value are depicted in Figure 5, Figure 6 and Figure 7. We can see that the limit of the index differences does not expose any particular secret. Moreover, the limits of differences are different for different input values.
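The reason the limits differ per input value is that, for a heavily duplicated value, the difference between the maximum and minimum index approaches the width of that value's noise range. The following sketch computes this width for the first nonlinear expression; the function name `noise_width` is hypothetical and the input values are taken from the experiment description.

```python
import math

def noise_width(v, a, sens=1):
    """Right bound of the noise range for a*log_7(10 + 18*|x|)*x + b at value v."""
    def f(x):
        return math.log(10 + 18 * abs(x), 7)
    return a * f(v + sens) * (v + sens) - a * f(v) * v

for v in (19, 119, 919):
    # The width depends on v, so dividing it by sens does not yield a = 16.
    print(v, round(noise_width(v, 16), 3))
```

In contrast, the linear scheme has the same width $a \cdot \mathit{sens}$ for every input, which is what allowed the estimate of $a$.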

4. PROGRAMMABILITY OF NONLINEAR INDEXING EXPRESSIONS

The nonlinear indexing expressions can be composed into indexing programs. An indexing program can contain more complex indexing expressions, allowing different input values to be indexed by different indexing expressions. Programmability of indexing expressions further increases the robustness of the nonlinear indexing scheme and also gives users the capability to unlink the distributions of plaintexts and indexes.

Before introducing the syntax of nonlinear indexing programs, we first define the sensitivity-keeping nonlinear indexing expressions. As in [11], the sensitivity-keeping feature facilitates the composition of indexing programs because the same sensitivity parameter can be used across a whole program. The sensitivity-keeping nonlinear indexing expressions, denoted by $\mathit{skindex}^{\mathit{sens}}_{[a,b,f]}$, have the form $a \cdot (f(x) + 1) \cdot x + b + \mathit{noise}$, where $a > 1$ and $\mathit{noise}$ is sampled from the following range: $[0,\ a \cdot (f(x + \mathit{sens}) + 1) \cdot (x + \mathit{sens}) - a \cdot (f(x) + 1) \cdot x - \mathit{sens}]$. The right bound of this range can be reduced to $a \cdot f(x + \mathit{sens}) \cdot (x + \mathit{sens}) - a \cdot f(x) \cdot x + a \cdot \mathit{sens} - \mathit{sens}$, which is bigger than 0 (the left bound of the noise range). Thus, for any two input values $v_1$ and $v_2$, when $v_1 = v_2 + \mathit{sens}$, the difference of their indexes $\mathit{skindex}^{\mathit{sens}}_{[a,b,f]}(v_1) - \mathit{skindex}^{\mathit{sens}}_{[a,b,f]}(v_2)$ is always bigger than $\mathit{sens}$, the sensitivity of input values.

Figure 5. Secret estimation with $16 \cdot \log_7(10 + 18 \cdot |x|) \cdot x + 317$.

The syntax of nonlinear indexing programs is shown in Figure 8. An indexing program $I$ can be a basic indexing expression $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}$, the addition of two indexing expressions $I_1 + I_2$, or a sequential composition $S; I$, where $S$ is the composition of sensitivity-keeping indexing expressions. $S$ can be a basic sensitivity-keeping indexing expression $\mathit{skindex}^{\mathit{sens}}_{[a,b,f]}$, an addition $S_1 + S_2$, a sequential composition $S_1; S_2$, or a conditional indexing expression. In the conditional indexing expression, the condition $\mathit{ge}(c)$ (or $\mathit{le}(c)$) means the input value is bigger than (or less than) or equal to the constant $c$. The function $f(x)$, used as a parameter in $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}$ and $\mathit{skindex}^{\mathit{sens}}_{[a,b,f]}$, can be a constant function (i.e. $f(x) = 1$ for being compatible with linear indexing), one of the instances defined before, or obtained by composition $f_1(f_2(x))$. Other instances of $f(x)$ can be added if they satisfy the requirements of the nonlinear indexing scheme.

Figure 8. Abstract syntax of indexing programs.

The semantics of indexing programs is defined as follows. Suppose $v$ is an input value. Then, $I(v)$ means the application of $I$ to $v$, generating $v$'s index. If $I$ is $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}$, then $I(v) = a \cdot f(v) \cdot v + b + \mathit{noise}$. If $I$ is $I_1 + I_2$, then $I(v) = I_1(v) + I_2(v)$. If $I$ is $S; I'$, then $I(v) = I'(v')$, where $v' = S(v)$. Note that because $S$ is sensitivity-keeping, all basic indexing expressions in $I'$ (i.e. $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}$) can take $\mathit{sens}$ (the sensitivity of input values) as their parameter. That is, users do not have the burden of calculating a new sensitivity for basic indexing expressions $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}$ in $I'$.

The semantics of sensitivity-keeping indexing expressions $S$ is defined similarly. If $S$ is $\mathit{skindex}^{\mathit{sens}}_{[a,b,f]}$, then $S(v) = a \cdot (f(v) + 1) \cdot v + b + \mathit{noise}$. If $S$ is $S_1 + S_2$, then $S(v) = S_1(v) + S_2(v)$. If $S$ is $S_1; S_2$, then $S(v) = S_2(v')$, where $v' = S_1(v)$. If $S$ is a conditional indexing expression with the condition $\mathit{ge}(c)$, then $S(v) = S_1(v)$ if $v \ge c$, otherwise $S(v) = S_2(v)$. If $S$ has the condition $\mathit{le}(c)$, then $S(v) = S_1(v)$ if $v \le c$, otherwise $S(v) = S_2(v)$.
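The semantics above can be sketched as a handful of combinators. This is a hedged illustration, not the prototype's code: the names `nindex`, `skindex`, `add`, `seq` and `cond_ge`, the module-level seeded `rng`, and the convention of passing the sensitivity at call time (so that a sensitivity of 0 yields the noise-free index) are all assumptions. The noise is drawn from a half-open range for simplicity.

```python
import math
import random

rng = random.Random(0)  # stand-in noise source, seeded only for repeatability

def nindex(a, b, f):
    """Basic expression: v -> a*f(v)*v + b + noise."""
    def run(v, sens):
        g = lambda x: a * f(x) * x
        return g(v) + b + rng.random() * (g(v + sens) - g(v))
    return run

def skindex(a, b, f):
    """Sensitivity-keeping expression: v -> a*(f(v)+1)*v + b + noise."""
    def run(v, sens):
        g = lambda x: a * (f(x) + 1) * x
        return g(v) + b + rng.random() * (g(v + sens) - g(v) - sens)
    return run

def add(p1, p2):
    """Addition: (P1 + P2)(v) = P1(v) + P2(v)."""
    return lambda v, sens: p1(v, sens) + p2(v, sens)

def seq(s, p):
    """Sequential composition S; P: apply P to v' = S(v)."""
    return lambda v, sens: p(s(v, sens), sens)

def cond_ge(c, s1, s2):
    """Conditional with ge(c): S1 if v >= c, otherwise S2."""
    return lambda v, sens: s1(v, sens) if v >= c else s2(v, sens)

f_log = lambda x: math.log(10 + 18 * abs(x), 7)

# A program S; (I1 + I2) built from one skindex and two nindex expressions.
prog = seq(skindex(2, 5, f_log), add(nindex(16, 317, f_log), nindex(3, 1, abs)))

# Conditional at c = 0; b = 1 makes the noise-free S1(0) equal S2(0) + sens.
branch = cond_ge(0, skindex(3, 1, f_log), skindex(2, 0, f_log))

print([round(prog(v, 1), 2) for v in range(-2, 3)])
print([round(branch(v, 1), 2) for v in range(-2, 3)])
```

The combinators do not verify the boundary requirement of conditional expressions; in this sketch it is met by the choice of coefficients in `branch`.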

The order preserving properties of indexing programs can be proved inductively. As an example, we prove that the sensitivity-keeping indexing expressions $S$ are order preserving. If $S$ is a basic expression $\mathit{skindex}^{\mathit{sens}}_{[a,b,f]}$, then it is order preserving in the same way as $\mathit{nindex}^{\mathit{sens}}_{[a,b,f]}$, which has been analyzed in the previous section. If $S_1$ and $S_2$ are order preserving, then $S_1 + S_2$ and $S_1; S_2$ are both order preserving according to their semantics. If $S$ is a conditional indexing expression with the condition $\mathit{ge}(c)$, and $S_1$ and $S_2$ are order preserving with $S_1(c) = S_2(c) + \mathit{sens}$, then for any two input values $v_1$ and $v_2$ with $v_1 > v_2$, $S$ is order preserving because one of the following three cases must hold: $S_1(v_1) > S_1(v_2)$, $S_2(v_1) > S_2(v_2)$, or $S_1(v_1) > S_2(v_2)$. Similarly, $S$ is order preserving when it is a conditional expression with the condition $\mathit{le}(c)$.

The programmability of the nonlinear indexing scheme gives users the capability to unlink the distributions of plaintexts and indexes by indexing input values in different ranges with different expressions. In addition, the requirements in the conditional indexing expression, $S_1(c) = S_2(c) + \mathit{sens}$ or $S_2(c) = S_1(c) + \mathit{sens}$, ensure that indexing expressions $S_1$ and $S_2$ generate consecutive indexes around the condition value $c$. Thus, every index in the index domain has the possibility of being the index of some input value, making it hard for the cloud administrators to guess exactly how many expressions are used to produce indexes by observing the gaps between indexes. The following experiments demonstrate the programmability used to unlink the distributions of plaintexts and indexes.

Suppose a user has a set of input values in the range $[-100, 100]$ and their sensitivity is 1. An input value may have 10,000 duplicates. Figure 9 and Figure 10 show the input values in the uniform distribution and the Gaussian distribution, respectively. By applying the following indexing program to the uniformly distributed input values, we get the indexes with their distribution depicted in Figure 11, which shows that the output indexes do not take the uniform distribution.



Figure 9. Input values in uniform distribution.

Figure 10. Input values in Gaussian distribution.

Figure 11. Index distribution for uniformly distributed plaintexts.

For the input values having Gaussian distribution, we apply the following indexing program, which generates the indexes having the distribution shown in Figure 12. We can see that the indexes have a distribution other than the Gaussian distribution of plaintexts.

By programming, an infinite number of complex forms of nonlinear indexing expressions can be generated, for example, by repeatedly adding existing nonlinear indexing expressions and repeatedly composing available $f(x)$. The complex and unfixed forms of indexing expressions further increase the robustness of the nonlinear indexing scheme. For example, we can build a complex indexing expression $I_1 + I_2 + I_3 + I_4 + I_5$, where $I_i$ ($1 \le i \le 5$) are defined as follows.

Figure 12. Index distribution for plaintexts in Gaussian distribution.

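$$
\begin{aligned}
I_1 &= 21 \times \log_{7}(168 + 38 \times |x|) \times x - 678 \\
I_2 &= 56 \times \log_{12}\bigl(681 + 78 \times \bigl|13 \times \lfloor |x|/\pi \rfloor + 6 \times \cos(|x| \,\%\, \pi + \pi) + 21\bigr|\bigr) \times x + 321 \\
I_3 &= 7 \times \bigl(20 \times \lfloor |\log_{3}(26 + 5 \times |x|)|/\pi \rfloor + 9 \times \cos(|\log_{3}(26 + 5 \times |x|)| \,\%\, \pi + \pi) + 11\bigr) \times x + 73 \\
I_4 &= 51 \times \bigl(19 \times \lfloor |x|/\pi \rfloor + 8 \times \cos(|x| \,\%\, \pi + \pi) + 33\bigr) \times x + 7 \\
I_5 &= 8 \times |x| \times x + 3 \times \bigl(\log_{21}(3 + 9 \times |x^{2}|)\bigr) \times x + 12
\end{aligned}
$$

The following notations will be used later. Let Index be an indexing program, which is used secretly by the proxy when translating queries. Then, Index(v, s) generates the index of v by using the program Index, with all indexing expressions in the program taking s as their sensitivity. In particular, Index(v, 0) means the index of v without adding any noise, which is the minimum index of v.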
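As a rough illustration of this notation, the following Python sketch implements Index(v, s) for a program that consists of the single expression I1. The noise rule is an assumption made only for this sketch: the noisy index is drawn from the half-open interval [Index(v, 0), Index(v + s, 0)), which keeps Index(v, 0) as the minimum index of v and preserves order, but it is not necessarily the noise range used by the scheme itself. The names `min_index` and `index` are hypothetical.

```python
import math
import random


def min_index(x: float) -> float:
    """Noise-free value of I1 = 21 * log_7(168 + 38*|x|) * x - 678."""
    return 21 * math.log(168 + 38 * abs(x), 7) * x - 678


def index(v: float, sens: float, rng: random.Random) -> float:
    """Sketch of Index(v, sens) for the one-expression program above.

    Assumption (not taken from the paper): the noise is uniform in
    [0, Index(v + sens, 0) - Index(v, 0)), so the indexes of v never
    reach the minimum index of the next value v + sens.
    """
    lo = min_index(v)
    if sens == 0:
        return lo  # Index(v, 0): no noise, the minimum index of v
    hi = min_index(v + sens)
    return lo + rng.random() * (hi - lo)


if __name__ == "__main__":
    rng = random.Random(2013)
    # Duplicates of the same plaintext receive different indexes.
    print([round(index(42, 1, rng), 2) for _ in range(3)])
    print(index(42, 0, rng), min_index(43))
```

Because every index of v stays below Index(v + s, 0), comparisons on indexes agree with comparisons on plaintexts even when a value has many duplicates.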

5. QUERY OF ENCRYPTED DATABASES

We introduce how to perform range queries over encrypted databases under the architecture in Figure 1. The equality and aggregate queries are also discussed.

5.1. Table structures

The table structures designed by application developers are created differently in an encrypted database. Suppose application developers have designed a database that has a Staff table, which includes only one column Salary. When creating such a table in a cloud database service, the query proxy hashes the table name, such that the table name is meaningless to the untrusted cloud administrators.

For the column Salary, the proxy actually creates three corresponding columns, with their names obtained by hashing SalaryEqIdx, SalaryRngIdx, and SalaryEnc, where EqIdx, RngIdx, and Enc are postfixes appended by the query proxy. Figure 13 shows the Staff table structure designed by application developers and the table structure managed by the cloud database service, where the notation Staff′ represents the hash of the name Staff, and similarly for other hashed names SalaryEqIdx′, SalaryRngIdx′, and SalaryEnc′.
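The name hashing described above can be sketched in Python as follows. The sketch assumes a keyed HMAC-SHA1 hash (the algorithm the prototype in Section 6 reports using) and a demonstration key; the helper names `keyed_hash` and `hashed_schema` are hypothetical and are not part of the proxy described here.

```python
import hashlib
import hmac

# Postfixes appended by the query proxy to every application column name.
POSTFIXES = ("EqIdx", "RngIdx", "Enc")


def keyed_hash(key: bytes, name: str) -> str:
    """Keyed hash of a name, returned as a hexadecimal string."""
    return hmac.new(key, name.encode("utf-8"), hashlib.sha1).hexdigest()


def hashed_schema(key: bytes, table: str, column: str):
    """Return the hashed table name and the three hashed column names."""
    columns = [keyed_hash(key, column + postfix) for postfix in POSTFIXES]
    return keyed_hash(key, table), columns


if __name__ == "__main__":
    table_hash, column_hashes = hashed_schema(b"demo-key", "Staff", "Salary")
    print(table_hash)       # plays the role of Staff'
    for h in column_hashes:  # SalaryEqIdx', SalaryRngIdx', SalaryEnc'
        print(h)
```

A hexadecimal digest can begin with a digit, so a real deployment would probably need to quote or prefix these names before using them as SQL identifiers.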

When a salary from the database application is being put into the encrypted table, the proxy produces three values for the corresponding columns SalaryEqIdx′, SalaryRngIdx′, and SalaryEnc′, by using the hash algorithm, our indexing scheme, and some encryption algorithm. The columns SalaryEqIdx′ and SalaryRngIdx′ are used to process query conditions involving equality and range comparisons, and when the query conditions are satisfied, the values in the column SalaryEnc′ will be returned.

To generate the ciphertexts for the column SalaryEnc′, we can choose a widely used encryption algorithm such as AES. However, to support the aggregate queries of SUM and AVG, homomorphic encryption algorithms, such as [13, 14], should be used after they become more practical. Thus, the aggregate operations (e.g. the sum or average of salaries) can be performed directly over the encrypted data in the SalaryEnc′ column.
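To make the idea of aggregating over ciphertexts concrete, the following sketch implements a toy version of the additively homomorphic Paillier cryptosystem [14] with the common simplification g = n + 1. The primes are far too small for real use, and the function names are hypothetical; the sketch only shows that multiplying ciphertexts yields an encryption of the sum of the plaintexts.

```python
import math
import random


def keygen(p: int, q: int):
    """Toy Paillier key generation from two primes (g = n + 1)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int, rng: random.Random) -> int:
    """c = (1 + m*n) * r^n mod n^2, with r random and coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n: int, private_key, c: int) -> int:
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


if __name__ == "__main__":
    n, private_key = keygen(104729, 1299709)  # small demonstration primes
    rng = random.Random(7)
    ciphertexts = [encrypt(n, s, rng) for s in (3000, 4500, 5200)]
    total = ciphertexts[0]
    for c in ciphertexts[1:]:
        total = add_encrypted(n, total, c)
    print(decrypt(n, private_key, total))  # sum of the three salaries
```

In this setting the cloud side would only perform `add_encrypted`, whereas the proxy holds the private key and decrypts the aggregate; an average can then be obtained by dividing the decrypted sum by the record count.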

5.2. The translation of SQL statements

The queries from database applications are translated by the proxy before being executed by the cloud database service. The translation of some representative queries is introduced as follows.

Figure 13. Change of table structures.



Assume the proxy has the key k. We write Enc(k, v) for the encryption of v with k, and Hash(k, v) for the secure hash of v with k. The numerical and string data types are represented by Num and String.

5.2.1. Creation of encrypted databases and tables. To create a database and a table, the database application can issue the following two statements.

In the aforementioned statements, Type is the data type for the column colnm. The statements are translated into the following statements by the proxy. In addition, the proxy records the schema of the created table in its metadata.

That is, three columns are created for the column colnm. The column colnm+“EqIdx” has the type String, because its values are always hexadecimal strings generated by secure hash functions. The values of column colnm+“RngIdx” are generated by our indexing mechanism and have the numerical type. The column colnm+“Enc” for ciphertexts also has the type String.
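The table-creation translation can be sketched as follows. This is only an illustration under stated assumptions: HMAC-SHA1 stands in for Hash(k, ·), hashed names are prefixed with the letter `h` so that they form valid identifiers, the abstract types String and Num are emitted as written in the text, and the function names are hypothetical. The generated statement is not the statement listing from the paper, and the create database statement is not covered.

```python
import hashlib
import hmac


def hashed_name(key: bytes, name: str) -> str:
    """Hash(k, name) as an identifier; the 'h' prefix is an assumption."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha1).hexdigest()
    return "h" + digest


def translate_create_table(key: bytes, table: str, columns: dict):
    """Translate a table definition into its encrypted counterpart.

    columns maps each application column name to its type (Num or String).
    Returns the translated statement and the schema kept in proxy metadata.
    """
    definitions = []
    for colnm in columns:
        definitions.append(hashed_name(key, colnm + "EqIdx") + " String")
        definitions.append(hashed_name(key, colnm + "RngIdx") + " Num")
        definitions.append(hashed_name(key, colnm + "Enc") + " String")
    statement = "create table {} ({})".format(
        hashed_name(key, table), ", ".join(definitions)
    )
    metadata = {table: dict(columns)}  # original schema, kept by the proxy
    return statement, metadata


if __name__ == "__main__":
    stmt, meta = translate_create_table(b"demo-key", "Staff", {"Salary": "Num"})
    print(stmt)
    print(meta)
```

The metadata dictionary keeps the original column names and types on the proxy side, which is what later allows the proxy to expand `*` and to map results back to the application schema.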

5.2.2. Insertion of values into tables. After a table is created, the database application can put a new record into the table by using the following statement.

Assume the sensitivity of values in column colnm is sens, which is configured in the proxy. The proxy translates the aforementioned statement into the following one for execution. In the new statement, the value v is hashed, indexed, and encrypted to be stored in different columns.
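A minimal sketch of this insertion step is given below, under several assumptions that are not taken from the paper: HMAC-SHA1 plays the role of Hash(k, ·), the indexing program is the single expression I1 with noise drawn from [0, Index(v + sens, 0) − Index(v, 0)), the encryption function is supplied by the caller (the Python standard library has no AES), and the statement is returned in parameterized form. All function names are hypothetical.

```python
import hashlib
import hmac
import math
import random


def keyed_hash(key: bytes, text: str) -> str:
    """Stand-in for Hash(k, .), using HMAC-SHA1."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest()


def min_index(x: float) -> float:
    """Index(x, 0) for the one-expression program I1."""
    return 21 * math.log(168 + 38 * abs(x), 7) * x - 678


def noisy_index(v: float, sens: float, rng: random.Random) -> float:
    """Index(v, sens); the noise range is an assumption of this sketch."""
    lo = min_index(v)
    return lo + rng.random() * (min_index(v + sens) - lo)


def translate_insert(key, table, colnm, v, sens, encrypt, rng):
    """Build the statement and the three stored values for one plaintext v."""
    names = ["h" + keyed_hash(key, colnm + p) for p in ("EqIdx", "RngIdx", "Enc")]
    statement = "insert into {} ({}) values (?, ?, ?)".format(
        "h" + keyed_hash(key, table), ", ".join(names)
    )
    values = (keyed_hash(key, str(v)), noisy_index(v, sens, rng), encrypt(key, v))
    return statement, values


if __name__ == "__main__":
    # Placeholder cipher for the demonstration only; it provides no secrecy.
    placeholder = lambda k, v: "<ciphertext of {}>".format(v)
    print(translate_insert(b"demo-key", "Staff", "Salary", 5000, 1,
                           placeholder, random.Random(1)))
```

Passing the cipher in as a parameter keeps the sketch independent of any particular encryption library; in the prototype described in Section 6, that role is played by AES.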

5.2.3. Queries. A query from the database application can take the following basic form.

If * is used in the query (i.e. select * from . . . ), the proxy can replace * with all column names according to the table schema in its metadata. For the basic query statement, the proxy translates it into the following form, where the translation of cond into cond′ is discussed as follows.



The condition cond is defined over the primitive logical forms colnm < c, colnm = c, and colnm > c, where c is a constant from the domain of the colnm column, combined by using the logical connectives (i.e. and, or). When translating the condition cond, we just need to replace each primitive logical expression with the translated one.

The condition colnm < c is translated into Hash(k, colnm+“RngIdx”) < Index(c, 0). Recall that Index(c, 0) is the minimum index of c, because no noise is added. The condition colnm = c is simply translated into Hash(k, colnm+“EqIdx”) = Hash(k, c). Assume the sensitivity of values in the colnm column is sens. Then, c + sens is the next value of c, and colnm > c is equivalent to the new condition colnm ≥ c + sens, which is translated into Hash(k, colnm+“RngIdx”) ≥ Index(c + sens, 0). Note that Index(c + sens, 0) is the minimum index of c + sens.
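The three translation rules can be sketched as a small recursive function. The representation of conditions as nested tuples, the use of HMAC-SHA1 for Hash(k, ·), the `h` prefix on hashed column names, and the use of the single expression I1 as the indexing program are all assumptions made for this illustration; the function names are hypothetical.

```python
import hashlib
import hmac
import math


def keyed_hash(key: bytes, text: str) -> str:
    """Stand-in for Hash(k, .), using HMAC-SHA1."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest()


def min_index(x: float) -> float:
    """Index(x, 0) for the one-expression program I1."""
    return 21 * math.log(168 + 38 * abs(x), 7) * x - 678


def translate_primitive(key, colnm, op, c, sens):
    """Translate colnm < c, colnm = c, or colnm > c."""
    rng_col = "h" + keyed_hash(key, colnm + "RngIdx")
    eq_col = "h" + keyed_hash(key, colnm + "EqIdx")
    if op == "<":
        return "{} < {!r}".format(rng_col, min_index(c))
    if op == "=":
        return "{} = '{}'".format(eq_col, keyed_hash(key, str(c)))
    if op == ">":
        # colnm > c is rewritten as colnm >= c + sens.
        return "{} >= {!r}".format(rng_col, min_index(c + sens))
    raise ValueError("unsupported operator: " + op)


def translate_condition(key, cond, sens_by_column):
    """cond is ('and'|'or', left, right) or a primitive (colnm, op, c)."""
    if cond[0] in ("and", "or"):
        left = translate_condition(key, cond[1], sens_by_column)
        right = translate_condition(key, cond[2], sens_by_column)
        return "({} {} {})".format(left, cond[0], right)
    colnm, op, c = cond
    return translate_primitive(key, colnm, op, c, sens_by_column[colnm])


if __name__ == "__main__":
    cond = ("and", ("income", ">", 1000), ("income", "<", 2000))
    print(translate_condition(b"demo-key", cond, {"income": 1}))
```

Only the constants are indexed at translation time, with no noise; the stored indexes already carry their noise, so the comparison against the minimum index of c or c + sens selects exactly the intended records.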

The keywords order by colnm and group by colnm are frequently used in queries. They are translated into order by Hash(k, colnm+“RngIdx”) and group by Hash(k, colnm+“EqIdx”), respectively.

There are some range queries that cannot be supported by our indexing scheme, nor by other order preserving encryption schemes [6, 8]. Briefly, the queries having conditions operating on several columns, such as the condition colnm1 × colnm2 + colnm3 > c, cannot be processed by any existing order preserving encryption schemes. For such conditions, we actually need homomorphic order preserving encryption or indexing schemes. To the best of our knowledge, the problem of homomorphic order preserving encryption or indexing schemes has not been identified and formulated in existing works. It will be our future work to develop homomorphic order preserving indexing schemes.

6. IMPLEMENTATION AND EVALUATION

We implemented a prototype of our indexing scheme for querying encrypted databases. In the implementation, we use SQL Server as the underlying database management system to provide the service of managing encrypted databases. In this prototype, the database application is a web application deployed on the Apache Tomcat platform. The web application is designed to manage information of persons in a population census. The person table has the following schema.

A fragment of the encrypted person table is shown in Figure 14, where the first row contains the hashes of six column names, idEqIdx, idRngIdx, idEnc, nameEqIdx, nameRngIdx, and nameEnc. Other rows are encrypted records. In the application, the HMACSHA1 algorithm is used for hashing, and the AES algorithm is used for encryption. The hashes and ciphertexts are stored as binary data in the encrypted database. Figure 14 shows that our method does not give any meaningful information about the outsourced databases to the untrusted cloud database administrators because all schemas and data are encrypted.

In the previous sections, we presented experiments showing the robustness of our indexing scheme even when there are a large number of duplicates in plaintexts and the programmability of our scheme that allows users to unlink the distributions of plaintexts and indexes. In the following, we test the performance of our indexing scheme and also the performance of the prototype of querying encrypted databases.

Figure 14. A fragment of encrypted table.



The performance of our indexing scheme is tested together with the performance of AES and HMACSHA1, which are provided by the SunJCE security package. Two indexing expressions used in the test are the following:

$$7.3 \times \log_{120.2}(53.8 + 7.32 \times |x|) \times x + 81.87$$

$$5.19 \times \bigl(3.2 \times \lfloor |x|/\pi \rfloor + 1.2 \times \cos(|x| \,\%\, \pi + \pi) + 2.3\bigr) \times x + 20$$

Each test is run two million times on the double input 393.8 and the string input ‘Tom 99999’, respectively, on a Dell Latitude E4310 laptop. Table I shows the performance of the aforementioned two indexing expressions together with the performance of AES and HMACSHA1. When the input is a double value, the basic indexing expressions are much faster; for the string input, the indexing expressions become slower, but are still comparable with AES and HMACSHA1. In our current implementation of indexing expressions, a string is converted into a big integer before indexing. The big integers cause slow arithmetic calculation. For AES and HMACSHA1 in SunJCE, a double value has to be converted into a byte array before it is processed, so AES and HMACSHA1 are slower for double values than for strings.
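For readers who want to experiment with the two test expressions, the following Python sketch evaluates them without noise and checks that they preserve order on a sample of inputs. It is not the Java code that was benchmarked; the function names are hypothetical, and `divmod` is used so that the floor and the remainder with respect to π are computed consistently.

```python
import math


def log_based(x: float) -> float:
    """7.3 * log_120.2(53.8 + 7.32*|x|) * x + 81.87, without noise."""
    return 7.3 * math.log(53.8 + 7.32 * abs(x), 120.2) * x + 81.87


def cos_factor(x: float) -> float:
    """3.2 * floor(|x|/pi) + 1.2 * cos(|x| % pi + pi) + 2.3."""
    quotient, remainder = divmod(abs(x), math.pi)
    return 3.2 * quotient + 1.2 * math.cos(remainder + math.pi) + 2.3


def cos_based(x: float) -> float:
    """5.19 * cos_factor(x) * x + 20, without noise."""
    return 5.19 * cos_factor(x) * x + 20


def is_order_preserving(expr, values) -> bool:
    """True if expr maps increasing inputs to strictly increasing indexes."""
    indexes = [expr(v) for v in sorted(values)]
    return all(a < b for a, b in zip(indexes, indexes[1:]))


if __name__ == "__main__":
    sample = [i * 0.5 for i in range(-200, 201)]
    print(is_order_preserving(log_based, sample))
    print(is_order_preserving(cos_based, sample))
    print(log_based(393.8), cos_based(393.8))
```

The cosine-based factor stays monotone in |x| because each step of the floor term (3.2) exceeds the drop of the cosine term at a period boundary (2 × 1.2 = 2.4).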

The performance of querying encrypted databases is tested with respect to data insertion and query. To test the performance of data insertion, we generate 10,000 person records and insert them into a plain database, where data is not encrypted, and an encrypted database, respectively. The expression used for indexing encrypted data is the following:

$$7.3 \times \log_{120.2}\bigl(53.8 + 7.32 \times \bigl|\bigl(3.2 \times \lfloor |x|/\pi \rfloor + 1.2 \times \cos(|x| \,\%\, \pi + \pi) + 2.3\bigr)\bigr|\bigr) \times x + 81.87$$

The average time of inserting all 10,000 person records into the encrypted database is 7725 ms, whereas the time for the plain database is 6012 ms. That is, for insertion, the performance overhead of using our method is (7725 − 6012)/6012 = 28.5%.

For queries over encrypted data, the performance overhead of our system is dependent on the number of records returned, because more records need more time to decrypt. Hence, the decryption algorithm (rather than our indexing scheme) is critical to the whole system performance. This is true for all systems of querying encrypted databases, such as [12]. In our experiment, the following query is used to test performance.

select * from person where income > min and income < max

By controlling the values of min and max, we can obtain the query result containing the following number of records: 2000, 4000, 6000, 8000, and 10,000. For each query result, we test the time from constructing the query over the encrypted database until all records are decrypted. Table II shows the performance of queries over the encrypted database (Enc DB) and the plain database (Plain DB). Basically, the performance overhead for querying encrypted database increases linearly with the number of query results.

Table I. Performance of basic indexing expressions.

| | Double input (time: ms) | String input (time: ms) |
|---|---|---|
| Cosine-based indexing | 445 | 4250 |
| Logarithm-based indexing | 227.5 | 6656.75 |
| HMACSHA1 hashing | 11,945.25 | 7222.5 |
| AES encryption | 5722.5 | 1273.25 |

AES, advanced encryption standard.

Table II. Performance of data query (time: ms).

| | 2000 Records | 4000 Records | 6000 Records | 8000 Records | 10,000 Records |
|---|---|---|---|---|---|
| Enc DB | 141 | 156 | 178 | 187 | 203 |
| Plain DB | 94 | 109 | 109 | 116 | 116 |

Enc DB, encrypted database; Plain DB, plain database.
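The query overhead can be read directly off Table II. The short sketch below, with hypothetical variable and function names, computes the absolute and relative overhead of the encrypted database for each result size from the reported timings.

```python
# Timings copied from Table II (milliseconds).
RECORDS = [2000, 4000, 6000, 8000, 10000]
ENC_DB_MS = [141, 156, 178, 187, 203]
PLAIN_DB_MS = [94, 109, 109, 116, 116]


def absolute_overheads(enc, plain):
    """Extra time (ms) spent by the encrypted database per result size."""
    return [e - p for e, p in zip(enc, plain)]


def relative_overheads(enc, plain):
    """Overhead as a fraction of the plain-database time."""
    return [(e - p) / p for e, p in zip(enc, plain)]


if __name__ == "__main__":
    rows = zip(RECORDS,
               absolute_overheads(ENC_DB_MS, PLAIN_DB_MS),
               relative_overheads(ENC_DB_MS, PLAIN_DB_MS))
    for n, extra, ratio in rows:
        print("{:>6} records: +{} ms ({:.0%})".format(n, extra, ratio))
```

The absolute overhead grows from 47 ms at 2000 records to 87 ms at 10,000 records, which is in line with the observation that the overhead increases with the number of records that have to be decrypted.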

7. RELATED WORKS

The most related works include the order preserving encryption scheme [6], the order preserving polynomials [1], and the order preserving indexing scheme [10]. Some differences with these works have been discussed in the first section. It should be noted that the programmability of indexing expressions is a unique feature of our scheme and can improve the robustness of our scheme by indexing different input values with different indexing expressions.

The work [15] uses strictly increasing functions to implement order preserving encryption. Their functions can be higher order and can be sequentially composed. However, all input values are encrypted by the same functions. These functions do not add noise to the encryption result, and hence, the secret coefficients can be recovered when some pairs of plaintexts and ciphertexts are known by attackers.

The order preserving hash functions discussed in [16] map a set of input values into a set of hash values for fast information retrieval, with the hash values preserving the order of input values. These hash functions are not designed for protecting security. For example, the hash functions [16] usually describe some algorithmic procedures used by all users. That is, there is no concept of secret values (such as encryption keys) that prevents the recovery of input values from hash values.

CryptDB [12] is a system supporting SQL queries over encrypted databases, where range queries rely on order preserving encryption [8]. Our method can be incorporated into such systems to process range queries. The sensitivity is an important parameter in our indexing scheme. It is also used to determine the noise added to aggregate query results for achieving differential privacy [17].

In our architecture, the untrusted cloud database administrators cannot understand the encrypted data, but they still can observe some data access patterns. For example, an access pattern can be that the data in a particular table column is accessed more frequently on a particular day of a month. The access patterns may leak some information to the untrusted administrators. In [18], a method is proposed to obfuscate the requests of customers to an untrusted cloud platform by injecting noise requests. Similarly, in our architecture, the query proxy can issue noise queries to obfuscate the real queries from database applications.

8. CONCLUSION

In this paper, we proposed a scheme of generating nonlinear order preserving indexes to facilitate range queries over encrypted databases. This scheme addresses the vulnerability of the existing linear indexing scheme and does not leak the information of secrets in indexing expressions even when there are a large number of duplicates in plaintexts. This scheme is programmable, meaning that the basic indexing expressions can be composed together to improve the robustness of the indexing programs and hide the distribution of input values from the distribution of indexes. This scheme does not need the range of input values and their number, and does not need distribution modeling before indexing. Hence, it is suitable for the protection of long-standing databases. We introduced how to apply the indexing scheme to query encrypted databases by query translation. A prototype is implemented to demonstrate our system, and the performance of the prototype is evaluated. Our future work will include the development of the homomorphic order preserving indexing scheme, which has not been identified and formulated in the literature.

REFERENCES

1. Agrawal D, Abbadi AE, Emekçi F, Metwally A. Database management as a service: challenges and opportunities. Proceedings of the 25th International Conference on Data Engineering, Shanghai, China, 2009; 1709–1716.

2. CircleID Reporter. Survey: cloud computing ‘no hype’, but fear of security and control slowing adoption, Feb 2009. (Available from: http://www.circleid.com/posts/20090226_cloud_computing_hype_security), Access Date: January 15, 2013.

3. Haeberlen A. A case for the accountable cloud. SIGOPS Operating Systems Review April 2010; 44:52–57.

4. Santos N, Gummadi KP, Rodrigues R. Towards trusted cloud computing. Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, San Diego, CA, USA, 2009. (Available from: http://static.usenix.org/event/hotcloud09/tech/full_papers/santos.pdf), Access Date: January 15, 2013.

5. Micciancio D. A first glimpse of cryptography’s holy grail. Communications of the ACM 2010; 53(3):96.

6. Agrawal R, Kiernan J, Srikant R, Xu Y. Order preserving encryption for numeric data. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, Paris, France, 2004; 563–574.

7. König AC, Weikum G. Combining histograms and parametric curve fitting for feedback-driven query result-size estimation. Proceedings of the 25th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1999; 423–434.

8. Boldyreva A, Chenette N, Lee Y, O’Neill A. Order-preserving symmetric encryption. Proceedings of the 28th Annual International Conference on Advances in Cryptology, EUROCRYPT ’09, Cologne, Germany, 2009; 224–241.

9. Shamir A. How to share a secret. Communications of the ACM November 1979; 22:612–613.

10. Hore B, Mehrotra S, Tsudik G. A privacy-preserving index for range queries. Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada, 2004; 720–731.

11. Liu D, Wang S. Programmable order preserving secure index for encrypted database query. Proceedings of the 5th IEEE International Conference on Cloud Computing, Honolulu, Hawaii, USA, 2012; 502–509.

12. Popa RA, Redfield CMS, Zeldovich N, Balakrishnan H. CryptDB: protecting confidentiality with encrypted query processing. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11. ACM: New York, NY, USA, 2011; 85–100, DOI: 10.1145/2043556.2043566.

13. Brakerski Z, Vaikuntanathan V. Fully homomorphic encryption from ring-LWE and security for key dependent messages. Proceedings of the 31st Annual Conference on Advances in Cryptology, CRYPTO’11, Santa Barbara, California, USA, 2011; 505–524.

14. Paillier P. Public-key cryptosystems based on composite degree residuosity classes. Proceedings of the 17th International Conference on Theory and Application of Cryptographic Techniques, EUROCRYPT’99, Prague, Czech Republic, 1999.

15. Ozsoyoglu G, Singer DA, Chung SS. Anti-tamper databases: querying encrypted databases. In Proceedings of the 17th Annual IFIP WG 11.3 Working Conference on Database and Applications Security, Estes Park, Colorado, USA, 2003; 133–146.

16. Fox EA, Chen QF, Daoud AM, Heath LS. Order-preserving minimal perfect hash functions and information retrieval. ACM Transactions on Information Systems July 1991; 9:281–308.

17. McSherry FD. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Proceedings of the 35th SIGMOD International Conference on Management of Data, SIGMOD ’09, Providence, Rhode Island, USA, 2009; 19–30.

18. Zhang G, Yang Y, Chen J. A historical probability based noise generation strategy for privacy protection in cloud computing. Journal of Computer and System Sciences Sep 2012; 78(5):1374–1381.

